Rea
p

.51

$5..”
a“ ,.

 

 

 

- {11
X . pV...
t. .‘I

 

\L...

 

f‘ "
Therein

I999

Illllllllllllllllllllllllllllllllllllllllllllllllllllllllll

301812 5454

This is to certify that the

dissertation entitled

Scoring Performance Assessment Based on Judgements:
Utilizing Meta-Analysis to Estimate Variance Components in Generalizability
Theory for Unbalanced Situations

presented by

Christopher Wing-Tat Chiu

has been accepted towards fulﬁllment
of the requirements for

Ph D degree in mung,—
Educational Psychology & Special Education
(Measurement & Quantitative Methods)

Robert E. Floden % 6 '

Betsy J. Becker m //d w: 76% 4;.”
i. /(ﬂajor professor

 

  
 

Date June 184 1999

MS U it an Afﬁrmative Action/Equal Opportunity Institution 0- 12771

 

 

LIBRARY

Michigan State

University

 

 

PLACE IN RETURN Box to remove this checkout from your record.
TO AVOID FINE return on or before date due.
MAY BE RECALLED with earlier due date if requested.

 

DATE DUE I

DATE DUE

DATE DUE

 

JAN 2 8 200‘

 

T032 7 '01

 

702335.?ngg

 

 

 

 

 

 

 

 

 

 

 

 

use wCIHCIDamOmpGS-p.“

 

SCORING PERFORMANCE ASSESSMENTS BASED ON JUDGEMENTS:
UTILIZING META-ANALYSIS To ESTIMATE VARIANCE COMPONENTS IN
GENERALIZABILITY THEORY FOR UNBALANCED SITUATIONS
By

Christopher Wing-Tat Chiu

A DISSERTATION
Submitted to
Michigan State University
in partial fulﬁllment of the requirements

for the degree of

DOCTOR OF PHILOSOPHY

Department of Counseling, Educational Psychology, and Special Education

June 1999

ABSTRACT
SCORING PERFORMANCE ASSESSMENTS BASED ON JUDGEMENT S:

UTILIZING META-ANALYSIS TO ESTIMATE VARIANCE COMPONENTS IN
GENERALIZABILITY THEORY FOR UNBALANCED SITUATIONS

By

Christopher Wing-Tat Chiu

In generalizability analyses, unstable and potentially invalid variance component
estimates may result from using only a limited portion of available data. However, missing
observations are common in operational performance assessment settings (e.g., Brennan, 1992
and I997; Cronbach, Gleser, Nanda, & Rajaratnam, l972; Shavelson & Webb, l99l) because of
the nature of the assessment design. In this dissertation, I describe a procedure to analyze data
with missing observations by extracting data from a sparsely-ﬁlled data matrix into analyzable
smaller subsets of data. This subdividing method, drawing on the conceptual framework in meta-
analysis (e.g., Hedges & Olkin, I985), is accomplished by creating data sets that exhibit
structural designs (i.e., crossed, nested, and modiﬁed balanced incomplete block designs) then
pooling variance components obtained from these designs. This method is always more
computationally effective than any other methods that require sparsely collected scores to be
analyzed all at once. A Monte Carlo simulation is used to examine the statistical properties of the
variance-component estimates and some commonly used composite indices, namely the
generalizability coefﬁcient (for norm-referenced decisions), the dependability coefﬁcient (for
criterion-referenced decisions), and the misclassiﬁcation rates. The smallest unbalanced data set
used to evaluate the subdividing method is composed of 750 examinees, four raters, and two
tasks while the largest unbalanced data subset is composed of 6000 examinees, 28 raters, and two
tasks. Graphic displays are used to evaluate the accuracy, stability, and consistency of the

variance component estimates and the composite indices. Experimental conditions, modeling

operational performance assessments, are manipulated to examine how well the subdividing
method would perform in practice. These conditions included: (1) volume of examinees, (2) size
of rater pool, (3) variation in item difﬁculty, (4) levels of rater inconsistency, (5) rules used to
decide how to group raters and assign tasks to raters, and (6) the minimum number of examinees
scored by a group of raters.

Results indicate that the subdividing method produce outcomes having properties
(unbaisedness and consistency) that are similar to those of complete data methods. Evidence was
provided to support that the pattern of missing data was frequently, in large-scale performance
assessments, determined by the rules used to assign examinees and tasks to raters during scoring
sessions. A collection of these rules, deﬁned as a rating plan, was examined. Speciﬁcally, this
dissertation compared two prevalent rating plans (i.e., the disconnected crossed and connected
mixture plans). It was found that increasing the number of raters to score examinees boosts the
precision in estimating rater-related measurement errors, namely rater- and rater-by-item errors,
for the disconnected crossed rating plan but lowers the precision for the connected mixture rating
plan.

The subdividing method recovers variance component estimates with high accuracy and
precision in a variety of conditions (i.e., low and high variations in item difﬁculty and rater
inconsistency). Increasing the number of examinees scored by the same group of raters from 12
to 24 has virtually no effect on the accuracy and precision of the variance component estimates.
This dissertation also illustrates that: (l) the amounts and patterns of missing data inﬂuences the
standard error to a larger degree than they inﬂuence the accuracy of the variance component
estimates, assuming unobserved scores are missing completely at random, and (2) the use of only
a few tasks varying much in difﬁculty is a major source of variation, lowering the dependability

of measurement procedures and thus leading to unreliable criterion-reference decisions.

Copyright by
CHRISTOPHER WING-TAT CHIU

1999

ACKNOWLEDGEMENTS

Many have supported this dissertation throughout the years. The American College
Testing Program (ACT), the Educational Testing Service (ETS), the Graduate School at
Michigan State University, and the Society of Multivariate Experimental Psychology (SMEP)
provided generously support through their internships, dissertation fellowships, and grant
program.

I am in debt to my committee members Robert Floden, Betsy Becker, William Mehrens,
and David Pearson; and my colleagues and friends, Isaac Bejar and Edward Wolfe. My early
interest in Generalizability Theory and performance assessment was fostered in a seminar led by
Robert Floden and David Pearson. The invaluable feedback and kind support provided by Betsy
Becker, the SynRG meta-analysis research group, and the faculty of the Measurement and
Quantitative Method program at Michigan State University are deeply appreciated. Betsy
Becker's editing Skills taught me the importance of being thorough to produce high quality work.
Robert Floden's keen research skills and high standards helped me focus and expand my
thinking. I would like to thank William Mehrens for his inspiring teaching in psychometrics and
educational measurement. My mentor, Edward Wolfe stimulated my thoughts for this
dissertation, while I participated in a summer internship at ACT.

Isaac Bejar, my mentor at ETS while I was a pre-doctoral fellow, has given me ample
opportunities to master technical skills indispensable for the completion of my dissertation. The
simulation in this dissertation would not have been possible without the equipment sponsored by
ETS (contract number: Ref. No. 5530) and by the computer center at Michigan State University.
Many thanks to Carol Baker, Robert Brennan, Randy Fotiu, Brad Hanson, Michael Harewell,
Suzanne Lane, Mark MacCullen, Eiji Muraki, Timm Neil, Paul Nicholes, Connie Page, Mark
Reckase, Philip Smith, .Ion Sticklen, Clement Stone, and Ross Traub, who shared their extensive
knowledge and provided me with their insights in research methods employed in this dissertation.

I thank my best friend, Ivy Li, for believing in me and provided me with endless support
and laughter. I also would like to express my deepest gratitude to my family for their
understanding and encouragement; I am grateful to my parents, Siu-Wai and Lai-Ping Chiu, and

Aunt Sue Chiu for everything they have taught me.

V

TABLE OF CONTENTS

LIST OF TABLES ....................................................................................................................... viii
LIST OF FIGURES ........................................................................................................................ ix
LIST OF APPENDICES ................................................................................................................ xi
CHAPTER I: INTRODUCTION ................................................................................................... 1
Signiﬁcance of the Current Study ............................................................................................... 2
CHAPTER 2: LITERATURE REVIEW AND PROBLEM FORMULATION ............................ 5
2.1) Indices commonly used in criterion-referenced and norm-referenced tests ......................... 5
2.2) Analyzing missing data in G theory ..................................................................................... 9
2.3) Historical approaches to analyzing missing data ............................................................... 12
2.4) Imputation as a method to handle missing data ................................................................. 14
2.5) Potential solutions in handling missing data in G theory ................................................... l4
2.6) Summary of literature review ............................................................................................. l7
CHAPTER 3: METHODOLOGY ................................................................................................ l9
3. l) Procedures of the subdividing method ............................................................................... 20
3.2) Research questions ............................................................................................................. 19
3.3) Conditions to vary .............................................................................................................. 31
3.4) Data generation .................................................................................................................. 40
3.5) Outcomes and data analysis ............................................................................................... 43
CHAPTER 4: RESULTS ............................................................................................................. 48
4. I) Comparison of pooled results with weights and without weights ...................................... 50
4.2) The effect of packing essays into batches of l2 versus batches of 24 ............................... 59
4.3) Accuracy of the variance components for two rating plans ............................................... 62
4.4) Precision of the subdividing method and the effects of expanding rater pool sizes .......... 64

4.5) Precision of the subdividing method and the effects of increasing volume of examinees. 70

4.6) Findings on the disconnected crossed and the connected mixture rating plans ................. 72

vi

4.7) Precision of the subdividing method for item effects ......................................................... 78

4.8) Accuracy and precision in making norm- and criterion- referenced decisions .................. 80
CHAPTER 5: CONCLUSIONS, DISCUSSIONS, AND FUTURE DIRECTIONS .................... 87
5.1) Subdividing method and unbalanced situations in performance assessment ..................... 87
5.2) Major ﬁndings and implications ........................................................................................ 88
5.3) New applications of the subdividing method and future directions ................................... 94
5.4) Suggestions to test developers and educational values ...................................................... 97
REFERENCES ............................................................................................................................ 128

vii

LIST OF TABLES

Table I: Research Questions ................................................................................ 30
Table 2: Experimental conditions to evaluate the subdividing method ........................................ 32
Table 3: Summary of variance-component magnitudes in the literature ...................................... 33
Table 4: Principles of rating plans ................................................................................................ 39
Table 5: Population parameters for the variance components and composites ............................ 41
Table 6: Comparsion between normal and rounded scores .......................................................... 42
Table 7 : Table of major ﬁndings .................................................................................................. 49

Table 8: The ratio of standard errors of indices obtained using a batch size of 24 to those
obtained using a batch size of 12 for the disconnected crossed rating plan ................... 59

Table 9: The ratio of accuracy of indices obtained using a batch size of 24 to those obtained
using a batch size of 12 for the disconnected crossed rating plan ................................. 61

Table 10: The ratio of standard errors of indices obtained using a batch size of 24 to those
obtained using a batch size of 12 for the connected mixture rating plan ....................... 62

 

Table 1 I: The ratio of accuracy of indices obtained using a batch size of 24 to those obtained

using a batch size of 12 for the connected mixture rating plan ..................................... 62
Table 12: Accuracy of the disconnected crossed rating plan ........................................................ 63
Table 13: Accuracy of the connected mixture rating plan ............................................................ 63

Table 14: Average SEs and average reduction in empirical standard error for the rater effect.... 66

Table 15: Relationship between size of rater pool and reduction in standard error of the item-by-
rater effect as a function of sample size ........................................................................ 68

Table 16: SE and changes in standard error of the person-by-rater effect as sample size increases
....................................................................................................................................... 69

Table 17: Increases in uncertainty of the person—by-rater effect in the connected mixture rating
plan ................................................................................................................................ 72

Table l8:Wilks' Lambda for predicting accuracy of variance components ................................ l l 1

Table 19: Regression models for the accuracy of the variance components in the disconnected
crossed rating plan ...................................................................................................... l 12

viii

LIST OF FIGURES

Figure 1: Decision rule for weighting ........................................................................................... 27

Figure 2: A hypothetical data set illustrating the disconnected crossed rating plan ..................... 37

Figure 3: A hypothetical data set illustrating the connected mixture rating plan ......................... 39

Figure 4: Weighted estimates 6%, under the connected mixture rating plan ................................ 52

Figure 5: Unweighted estimates &% under the connected mixture rating plan ............................ 52

Fi ure 6: Wei ted estimates 62~ under the connected mixture ratin lan ............................... 53
g p, g p

Fi ure 7: Unwei hted estimates 53. under the connected mixture ratin Ian ........................... 53
g g p, g P

Fi ure 8: Wei hted estimates 62 .- under the connected mixture ratin Ian .......................... 54
g g pr.t.e g P

Figure 9: Unweighted estimates éiriie under the connected mixture rating plan ...................... 54

PI ure 10: Wei hted estimates (32 under the disconnected crossed ratin Ian .......................... 55
g g p g P

Figure I l: Unweighted estimates (3%, under the disconnected crossed rating plan ...................... 55

F i ure 12: Wei hted estimates (32. under the disconnected crossed ratin Ian ........................ 56
g g p, g P

Figure 13: Unweighted estimates (3%)]. under the disconnected crossed rating plan ..................... 56

Figure 14: Weighted estimates 6%, under the disconnected crossed rating plan ........................ 57

Figure 15: Unweighted estimates rig, under the disconnected crossed rating plan .................... 57

Figure 16: Weighted estimates dine under the disconnected crossed rating plan ..................... 58

Figure 17: Unweighted estimates 6%,“, under the disconnected crossed rating plan ................. 58

Figure 18: The reduction of standard error for the rater effect as a function of the size of rater

pool and sample size .................................................................................................... 65
Figure 19: The reduction trends of the standard error of the rater-by-item effect ........................ 67
Figure 20: The standard error of the person-by-rater effect as a function of sample size ........... 69

Figure 21: The standard error of the person effect as a function of sample size and rater pool size

................................................................................................................................................ 70
Figure 22: The standard error of the person-by-item-by-rater effect as a function of sample size

and rater pool size .................................................................................................................. 71
Figure 23: The standard error of the person-by-item effect as a function of sample size ............. 71

Figure 24: The relationship between the improvement of the person-by-rater effect and the
expansion of rater pool size using the connected mixture rating plan. .................................. 73

Figure 25: The effect of employing two different rating plans on the precision of the person-by-
item effect .............................................................................................................................. 75

Figure 26: The relationship between the improvement of the person-by-item-by-rater effect and
the expansion of the rater pool size using the connected mixture rating plan ....................... 76

Figure 27: The decrease in standard error as a function of rater pool size after utilizing all the

available data ......................................................................................................................... 77
Figure 28: The randomness of the standard errors for the item effect .......................................... 78
Figure 29: The randomness of the standard errors for the item effect .......................................... 79
Figure 30: Empirical conﬁdence intervals for generalizability coefﬁcients ................................. 81
Figure 31: Theoretical generalizability coefﬁcients ..................................................................... 81

Figure 32: Distribution of the item variance components for the disconnected crossed rating plan
(averaged across batch size, sample size, and rater pool size) ............................................... 83

Figure 33: Dependability coefﬁcients estimated in the disconnected crossed rating plan (high
item effects) ........................................................................................................................... 83

Figure 34: Dependability coefﬁcients estimated in the disconnected crossed rating plan (low item
effects) .................................................................................................................................... 84

Figure 35: Misclassiﬁction error obtained for the disconnected crossed rating plan ................... 85

Figure 36: Standard errors of the misclassiﬁcation rates for the disconnected crossed rating plan

................................................................................................................................................ 86
Figure 37: A hypothetical connected crossed rating plan ............................................................. 92
Figure 38: Hypothetical data subsets for the modified balanced incomplete block .................... 108

LIST OF APPENDICES

Appendix A: Equations for scores and coefﬁcients in generalizability theory (Adapted from

Brennan, 1992) .............................................................................. 102
Appendix B: Standard errors for variance components in a two facet crossed design ........... 103
Appendix C: Computation of misclassiﬁcation rate for conjunctive decision rules ............... 104

Appendix D: Illustration of an out—of-range sample correlation based on different data sets for

sample covariance and variances ......................................................... 105
Appendix E: Correlations based on missing data ....................................................... 106
Appendix F: The structure of a modiﬁed balance incomplete block design ........................ 107
Appendix G: A mathematical model to determine the size of a rater pool .......................... 109

Appendix H: A multivariate regression model predicting the accuracy of variance components
.................................................................................................. 1 1 1

Appendix 1: Computer program: Codes for data simulation analysis in SPSS ...................... l 13

xi

CHAPTER 1: INTRODUCTION

The importance of Generalizability theory (G theory) lies in its applications to
educational measurement. Two of its major functions are: 1) to evaluate the quality of
measurement procedures; and 2) to make projections about how one can improve the quality of
measurement procedures. Regardless of its wide applications (Brennan, 1997, 1998; Lane,
Ankenmann, & Stone, 1996; and Linn, Burton, DeStefano, & Hanson, 1996), G theory, a
framework relying on the estimation of variance components, has a major limitation in its
incapability of handling missing data —— a common problem in large-scale assessments. Test
developers often cannot use ordinary algorithms for estimating variance components in G theory
because the computational requirements are excessive.

The current dissertation developed and scrutinized a method called the subdividing
method (deﬁned in Chapter 3), which allows investigators to use more of the data, with lower
computational power needs than conventional methods. It also examines the subdividing method
as a way to obtain G theory estimates for large-scale assessments by exploring the robustness of
this method across a variety of experimental conditions reﬂecting realistic operational processes
adopted by performance assessment centers. Speciﬁcally, the current study examines the
accuracy and precision (deﬁned in the chapter 3 entitled "Methodology") of the estimators
recovered by the subdividing method. "Literature Review" in chapter 2 summarizes and critiques
conventional methods used for analyzing missing data with a focus on those applied to G theory.
Chapter 2 also summarizes the background of the research questions, which are stated in detail in
chapter 3. Along with the research questions, chapter 3 describes the three major steps to
implement the subdividing method, namely "Subdividing", "Estimating", and "Synthesizing". In
addition, that chapter also reveals the experimental conditions manipulated to evaluate the
subdividing method. Chapter 3 also discusses technical issues in applying the subdividing

method and practical issues in planning scoring sessions for open-ended questions.

Questions regarding the planning of scoring procedures include: In what way should test
developers set up a scoring procedure? How does the use of different set-ups inﬂuence the
quality of quantifying measurement errors? How many examinees and tasks should a common
group of raters score in order to obtain a reliable scoring procedure? Technical questions
regarding the subdividing method include: What are the situations in which one can ignore
weighting schemes when synthesizing data subsets? What are the consequences of not using
weights? When does one need to use weights? To what extent can the subdividing method
produce accurate and precise G theory estimates for large volumes of examinees and raters? How
sensitive is the subdividing method in estimating reliability when measurement errors are small
and when they are large? Chapter 4 reports the results, which are all in the context of data
missing completely at random (MCAR, Little & Rubin, 1987). Chapter 5 summarizes the results,

which are interpreted in the light of providing suggestions to test developers.

Signiﬁcance of the Current Study

Currently, researchers are frequently forced to discard data when some data are missing,
leading to unstable variance components and reliability coefﬁcient estimates used to evaluate
measurement procedures in large-scale assessments. Interpreting those unstable estimates can
lead to inconsistent decisions (Burdick 1992, pp.16-18). For example, if our goal is to develop a
performance assessment procedure with a generalizability coefﬁcient of 0.90, we may reach
different conclusions using the following three conﬁdence intervals [0.85, 0.89], [0.44, 0.89],
and [0.44, 0.50]. In the ﬁrst case, we may conclude that for practical purposes, the
generalizability coefﬁcient is close enough to 0.90 that it may not be efﬁcient to increase the
reliability. In the second case, we may decide that the conﬁdence interval is too wide to make
conclusive decisions about the reliability of the assessment procedure. In the ﬁnal scenario, the
evidence suggests that the assessment procedure is not well developed. In the context of scoring

performance tasks in which observations andjudgements are involved, the ﬁnal scenario

indicates that raters differ greatly in severity and need further training. The current dissertation
evaluates the subdividing method in terms of point estimates, standard errors, and conﬁdence
intervals of the variance components and composite indices.

The amount of missing data and the mechanism causing data to be missing are nontrivial
factors, because they affect standard errors of the estimates (Little & Rubin, 1987). When
generalizability coefﬁcients have sizable sampling errors, scores are also unreliable and decision
makers may assign a higher rank to one examinee than to other examinees on one occasion but
not on other occasions. Like generalizability-coefﬁcient estimates, variance-component estimates
are not as precise when observed data are discarded in dealing with incomplete data. By reducing
the amount of data to be discarded, the proposed subdividing method produces more stable
decision study results (Cronbach et al., 1972).

Another limitation of G theory is that it demands intensive computational resources to
model unbalanced data (Babb, 1986; Bell, 1985; Brennan, 1992; Brennan, 1997; Searle, I992;
Shavelson, 1981). Estimation methods for G theory often require extensive computational
resources. Methods like Restricted Maximum Likelihood require large amounts of computational
resources to analyze data matrices that can be too large to invert. The subdividing method
reduces the intensive computational demands by partitioning a large unbalanced data set into
smaller data subsets.

Data collection procedures determine the pattern of missing data (Engelhard, 1997) and
thus inﬂuence the precision of parameter estimates (Little & Rubin, 1987). In multiple facet
generalizability studies employed for large-scale assessments, the ways to assign tasks to raters
and the mechanisms to distribute examinees' work to raters constitute the data collection
procedures, also deﬁned as "rating plans" throughout the current dissertation. Although data
collection procedures are critical, rarely did research set out to investigate how these procedures
inﬂuence the statistical properties of G theory estimates (Personal communication from Gordon,

1998; Vickers, 1998; & Welch, I996). The current dissertation summarizes two data collection

procedures frequently appeared in the literature for scoring open-ended questions based on
human judgements. It then deduces principles underlying these procedures. The robustness and
performance of the proposed subdividing method is evaluated, utilizing a Monte Carlo
simulation, in operational settings parallel to the two data collection procedures.

Existing research relevant to G theory for unbalanced data has tended to focus on the
estimation of variance components outside of the measurement framework. In particular, much
research (e.g., Babb, I986; Burdick, 1992; Henderson, 1953; Malley, 1986; Marcoulides, 1988;
Rao, 1997; Satterthwaite, 1946; Searle, I992; Seeger, 1970; Townsend, 1968) has focused on the
statistical properties of variance components. No research has examined the statistical properties
for composite indices commonly used in G theory, particularly in unbalanced situations. The
current dissertation compensates for this demerit. Knowing the statistical properties for variance
components alone does not necessarily help us to interpret composite indices. Nor does it help us
to make decisions regarding the reliability of a measurement procedure. (For the applications of
the composite indices, see the subsequent section entitled Indices Commonly Used in Criterion-
Referenced and Norm-Referenced Tests.) For instance, knowing the conﬁdence intervals for the
variance components, per se, does not allow inferences to the conﬁdence intervals for the
standard errors of measurement and for the dependability coefﬁcient. This is because the
variance components do not have a linear relationship with all the composite indices (e.g., the
absolute standard error of measurement is the square root of the sum of all the error variance
components). The unbalanced data sets caused by missing observations make it difﬁcult to

construct conﬁdence intervals analytically.

CHAPTER 2: LITERATURE REVIEW AND PROBLEM FORMULATION

Despite the efforts made in measurement research to deal with the limitations
encountered when analyzing unbalanced data via G theory, the research in this area suffers from
major restrictions. In this chapter, I introduce the applications of G theory in norm-referenced
and criterion-referenced testing. Next, I review the advantages and disadvantages of a variety of
methods for handling large and unbalanced data sets. Last, I summarize studies that provide a

foundation for the proposed subdividing method.

2.1) Indices commonly used in criterion-referenced and norm-referenced tests

None of the research conducted for missing data in G theory has investigated the
behavior of composite indices that are used for rank ordering examinees and compared the
performance of examinees to a criterion. Brennan (1992) and Satterthwaite (I941 & 1946) have
provided computational formulas for conﬁdence intervals and standard errors for the various
composite indices. Unfortunately, those formulas were derived for balanced data. Given the
importance of the composite indices, additional research is needed to examine the conﬁdence
intervals and standard error of measurement for those indices. The applications and importance

of those composite indices are discussed in the following sections. Appendix A shows the

equations for the composite indices (reproduced from Gao, 1992).

Reliability coefﬁcients — The generalizability coefﬁcient (denoted Epz) and the
dependability coefﬁcient (denoted 4)) are important in many aspects. Much like the classical test
reliability coefficient, the generalizability coefﬁcient has various advantageous for making
educational decisions. The coefﬁcient Ep2 can be deﬁned as the square correlation of test scores
between two randomly parallel test forms (Crocker & Algina, 1986, p. 124) assembled in the

same universe of generalization (Brennan, 1992, p.3; Cronbach et al., 1972, pp. 18-23; Shavelson

& Webb, 1991, pp. 12-13). Put differently, the generalizability coefﬁcient shows how well one
can rank order students in the same manner using two test forms (or two similar measurement
procedures from the same universe of generalization), which were assembled in accordance with
the conditions to which one would want to infer. Understanding how well randomly parallel
forms rank order test scores is useful in both classroom assessments and large-scale assessments.
Frequently, testing agencies or classroom teachers need to prepare several tests containing
different samples of questions drawn from the same domain of knowledge. In order to compare
students or to evaluate instructions based on test scores obtained from the two randomly parallel
forms, one has to estimate the generalizability coefﬁcient. A high generalizability coefﬁcient
warrants the comparisons of student learning and teaching practice, because we know that a large
portion of the test score variation is due to the variation in students' ability rather than the

discrepancies in difﬁculty of two forms.

Another advantage of the generalizability coefﬁcient is that its transformation ( J?) can

be used to indicate the degree to which observed scores correlate with universe scores. The

higher lpz , the more conﬁdent one can be when using students' test scores to infer to how much

they would know if the students were tested on a broader scope (e.g., test students on all the
items in the item bank). Another reason that reliability coefﬁcients are important is that they can
be used for criterion-referenced decisions. The dependability coefﬁcient is an index monitoring
the degree to which a test can be used to make absolute decisions (e.g., Can an examinee master
half of the test items in the domain? How reliably could a measurement procedure determine that
a random examinee passes a criterion?) Also, the dependability coefﬁcient can be used to
approximate other indices. One such index is the criterion-referenced reliability coefﬁcient
denoted (MA) (Brennan & Kane, 1977). This index is derived to summarize the relationship

between cut-scores and the consistency of a measurement procedure. Patterson

(1985, p.35) demonstrated that the (I) is a lower limit of (MA). Like the generalizability coefﬁcient,
the higher the dependability coefﬁcient, the more reliably one can make an absolute decision.

The dependability coefﬁcient has become more useful as state departments and schools
emphasize standards. For instance, Tucker (1998) advocates that education agencies become
active in setting and evaluating standards. Dependability coefﬁcients are well suited for this
purpose because one can use dependability coefﬁcients to forecast the consistency of a
measurement procedure in relation to where one sets the standard.

Regardless of the wide range of possible applications of reliability coefﬁcients, one
cannot take full advantage of those coefﬁcients unless they are estimated accurately. Reliability
coefﬁcients, like many other statistics, are subject to sampling errors. Two statistical properties
are important for the interpretation of estimates of reliability coefﬁcients in G theory (Personal
communication from Gordon, 1998; Vickers, 1998; & Welch, I996): unbiasedness and
efﬁciency. (See Aczel, 1996; and Hays & Winkler,l970). The current dissertation summarizes
two data-collection procedures frequently appearing in the literature, and examines these

properties of the variance components estimated by the subdividing method.

Standard error Of measurement (SEM) — The SEMS based on relative and absolute
decisions are effective for evaluating the improvement of measurement procedures (Brennan,
Gao, and Colton, 1995). As pointed out by many researchers, measurement procedures can be
made more reliable in three ways. Kane (1982) provided a succinct account, noting that one
could improve the quality of a measurement procedure by taking any or all of the following three
actions: 1) restricting the universe of generalization; 2) increasing the measurement conditions in
a measurement procedure such as using more items and more raters; and 3) standardizing

measurement procedures. Brennan, Gao and Colton (1995) advocated the use of SEMs, in place

of the generalizability coefﬁcient, to monitor the improvement of measurement procedures

because SEMs are more sensitive to change in error variances than is the generalizability
coefﬁcient. This occurs because the universe score variance is larger than error variances and so
reduction in error variances is not well reﬂected in the generalizability coefﬁcient. To use the
SEM as an index of quality improvement, one would compare the ratio (the SEM divided by the
total variation) obtained before and after the improvement of a measurement procedure.

Besides the monitoring feature just mentioned, Brennan, Gao, and Colton (1995)
demonstrated a wide variety of applications based on the SEMs. First, one can use the SEMs to
construct conﬁdence intervals for students’ universe scores (true scores based on repeated
testing). For instance, with the use of the absolute SEM, Brennan, Gao, and Colton (1995)
showed that a 95% conﬁdence interval for the mean on a writing test based on a O to 5 scale
would cover a range of 1.5 points. The writing test had six prompts and each prompt was judged
by two raters.

Another application that Brennan, Gao, and Colton (1995) described was to use the SEM
to examine the probability that an examinee’s true score is within a certain range of his or her
observed score. One can ask "What is the probability that a student's true score lies between 3
and 5 given that he or she scored a 4 on the test?" Despite the ease of interpretation of this index,
Cronbach, Linn, Brennan, & Haertel (1997) suggested researchers examine the distributional
properties of this index before applying it to high stakes decisions. However, examining the SEM
in unbalanced situations is not a trivial issue. In my dissertation. I examine the properties of this

index for unbalanced designs.

Misclassiﬁcation rate for conjunctive decision rules — The generalizability
coefﬁcient is particularly important for criterion-referenced examinations such as certiﬁcation
exams (Mehrens, 1987), because it is used to investigate how many additional tasks or raters are
needed to reduce the misclassiﬁcation rate in a D-study. One can ask, "How many added raters or

tasks are needed to reduce a misclassiﬁcation rate to, for example, 0.01 , in a writing test?"

Cronbach et al. (1997) further elaborated this application to include misclassiﬁcation rates when
using compound decision rules (conjunctive rules). For instance, on a writing test with two
writing prompts, one can use the SEM to ﬁnd out the probability of misclassiﬁcation of a random
student who has received two scores of 2.5 (observed scores) on the two prompts, given that the
student deserves two scores of 3.5 (has universe scores of 3.5). In Cronbach et al.'s (1997)
example, they showed that one could obtain a probability of incorrectly classifying an examinee
in a 6-task assessment. Assuming the hypothetical examinee had universe scores 2.5, 2.5, 2.5,
3.5, 3.5, 3.5, Cronbach et al. (1997) demonstrated that with an absolute SEM of 0.7, the

examinee had roughly a 25% chance of having one or more true scores less than 1.5. See

Appendix C for the details of the computation of this classiﬁcation rate.

2.2) Analyzing missing data in G theory

Researchers such as Brennan (1992) have classiﬁed unbalanced situations in G theory
into two categories, namely unbalanced in terms of nesting and unbalanced in terms of missing
data. Methods such as multivariate G theory (Brennan, 1992; Cronbach et al., 1972) have been
used to handle unbalancing in terms of nesting, in which the numbers of test questions vary
across batteries of a test. Multivariate G theory, however, does not account for missing data.
Other methods for handling missing data have limitations that make them inappropriate for large-
scale assessments.

The Henderson Methods I and I I (Analysis of Variance (ANOVA)-like methods,
Henderson, 1953) are incompatible with the conceptual framework in G theory and are
computationally extensive (Brennan, Jarjoura, and Deaton, 1980, pp. 37-38). These methods use
quadratic forms analogous to the sums of squares of balanced data. The expected values of the
quadric forms are then functionally expressed in terms of the variance components. The set of

equations characterizing this functional relationship is solved for the variance components. The

demerit of these methods is that the quadratic forms are computationally extensive for large
amounts of data given that only one unit is observed per cell in the G theory framework (e.g.,
Brennan, Jarjoura, and Deaton, 1980, p. 37).

Generalizability analyses frequently involve both ﬁxed effects and random effects
(Brennan, 1992, pp.76-77). For instance, in performance assessment, one may wish to examine
the reliability of using the same raters over time, but using different sets of essay questions in
each administration. In this case, the rater facet is ﬁxed whereas the essay facet is considered
random. Henderson's Method 1, however, is incapable of estimating variance components for
such a mixed model (Searle, Casella, and McCulloch, p.189) because it does not restrict the sum
of the deviations (from the mean) for the ﬁxed effects to be zero. Without this restriction, the
property of unbiased variance component estimates is of questionable value (Brennan, Jarjoura,
and Deaton, 1980, p.37). Although Henderson' Method II can handle mixed-effect models, it has
limitations, as it cannot be used when there are interactions between ﬁxed and random effects
(Searle, Casella, and McCulloch, 1992). In addition, these ANOVA-like methods produced
biased estimates in unbalanced situations (Marcoulides, I988; Olsen, Seely, & Birkes, 1976;
Searle, 1971).

Even though Henderson's Method 111 overcomes the shortcomings of Methods I and II, it
produces different estimates for variance components depending upon the order in which the
variance components are estimated (Babb, 1986; Brennan, Jarjoura, and Dealton, 1980; and
Searle, Casella, and McCulloch, 1992). The choice can be critical since there is usually little
justiﬁcation about which variance components should be estimated ﬁrst. Without a unique set of
estimates for variance components, Henderson's Method III makes the interpretation of
generalizability analyses inconclusive. This disadvantage is magniﬁed in computing composite
indices (e.g., SEM and reliability coefﬁcients) in generalizability analysis because this method

can yield many different composite indices for the same set of data.

Using a two-faceted crossed model, Marcoulides (1988, 1990) randomly deleted
observations to compare two estimation methods, ANOVA and Restricted Maximum Likelihood
(REML) for data sets with small sample sizes (25 persons, 2 occasions, and 4 raters). He
concluded that the REML method was more stable in estimating variance components than the
ANOVA method. However, he did not examine the performance of the REML method for large-

scale data sets such as those that are common in large-scale performance assessments (e.g., essay
writing). As was noted by Babb (1986, p.3), Bell (1985), Rao (1997), and Searle, Casella, and
McCulloch (1992), the REML method requires extensive calculations that are infeasible for large
data sets. Even for ANOVA methods, the model matrix was frequently very large and thus was
too large to invert (Brennan, 1992, p.107; Matherson, 1998).

Cornﬁeld and Tukey (1956), Kirk (1982), Millman (1967), and Searle, Casella, and
McCulloch (1992) discuss concise algorithms to determine the coefﬁcients used in the Expected
Mean Square (EMS) equations for the estimation of variance components. Those algorithms are
so simple that they can be implemented by hand calculations. One needs to know only the
numbers of levels in each factor of the ANOVA design. However, one cannot apply the
algorithms to unbalanced designs because the numbers of levels in unbalanced designs vary
depending upon how many data points exist in each factor. In order to determine the EMS
equation, extremely large design matrices had to be created for each factor, including the
interaction factors. This is problematic especially for G theory analysis because the object of
measurement (Cronbach et al., 1972), say examinees, always has a large number of levels (each
person is considered as a level). For a data set of 6000 examinees, to model the object of
measurement, or the examinee factor, requires a square matrix of 36,000,000 cells.

Babb (1986) pointed out that in the Maximum Likelihood (ML) and REML methods, one
has to calculate the inverse of the variance-covariance matrix associated with the observations at
each round of iteration. In addition to the extensive resources required for the ML and REML

methods, these inversions of large matrices may not always converge. According to Searle et al.

11

(1992), nonconvergerence in REML indicates that the ANOVA model does not ﬁt the data. Just
knowing that the ANOVA model does not ﬁt the data gives very little useful information for data
analysis because it is not a surprise that the ANOVA model would not ﬁt a large, sparsely ﬁlled
data matrix. More information is needed. The subdividing method proposed here overcomes this
deﬁcit of the REML by analyzing smaller subsets of data which allows one to examine
measurement errors in depth. In case model misﬁt occurs, one may further examine individual
data sets that might have contributed to the misﬁt.

Other estimation methods such as the Minimum Norm Quadratic Unbiased Estimation
(MINQUE ) also have problems. In order to assume non-negative estimates for variance
components, constraints must be imposed on the parameter space associated with the variance
components. Those constraints can cause the quadratic unbiased estimation methods (e.g.,
MINQUE) to be biased (e.g., Babb, 1986). In a relatively recent publication, Longford (1995)
derived two ANOVA models to estimate variance components for essay rating. His models were
designed to estimate variance components different than those typically used in the G theory
framework. Speciﬁcally, the models did not estimate interaction effects. For instance, the
person-by-rater interaction and rater-by-item interaction variance components common in a two-
faceted generalizability analysis were not estimated in Longford's model (p. 79 - 82). In addition,
the universe-score variance component (variation among examinees, also denoted 62p) was not
estimated. With Longford's models, one could not compute the generalizability coefﬁcient for
making norm-referenced decisions, which is based on the variance components of interaction

effects.

2.3) Historical approaches to analyzing missing data

Little and Rubin (1987) reviewed historical approaches for handling missing data and
proposed various likelihood-based approaches to the analysis of missing data. In their summary

(see pp. 40-47 of their book), Little and Rubin ( 1987) pointed out that these methods did not

necessarily produce accurate results and the accuracy depended upon the assumptions to be made
about the missing values and the nature (i.e., categorical vs. continuous) of the observed data.
"Complete-case analysis" (Little & Rubin, 1987) is frequently referred to as "listwise deletion"
by some researchers and commercial statistical packages. In this method, one analyzes only
complete cases, where all variables of interest are present. A critical concern with this method is
whether the selection of complete cases leads to biases in sample estimates. This method yields
seriously biased results if the complete cases were not a random subsample of the original cases
(1987, p.40). Complete-case analysis requires discarding data in a G theory framework to obtain
a balanced design. Chiu and Wolfe (1997, p.6) pointed out that this method is likely to ignore
scores given by large portions of raters and thus, the chosen pair(s) of raters (those with complete
data) may not be representative of the universe of raters.

The "available-case methods" (Little & Rubin, 1987, pp.4l-43) are another quick but
unsatisfactory alternative for handling multiple outcomes with missing values. One such method,
also known as pairwise deletion by some data analysts, estimates covariation for two variables
based on cases for which responses to both variables are present. A criticism of this method is
that it can yield correlation estimates that lie outside the range (-1,1), unlike the possible range of
a population correlation. This can happen if the sample covariance and the sample variances are
based on different cases. (See Appendix D for an example.) Little and Rubin (1987, p.43) also
pointed out that available-case methods can lead to paradoxical conclusions when missing values
are systematic, rather than randomly distributed. In their example, one could ﬁnd that two
variables, say A and B, were each perfectly correlated with a third variable, say C (i.e., rAc = ch
= 1), yet these two variables showed absolutely no correlation with one another in the samples of
observed values (rAB =0). See Appendix E for a hypothetical example of this situation. One other
disadvantage of the available-case methods is that they may produce covariance matrices that are

not positive deﬁnite, a property required by many analyses based on the covariance matrix,

including multiple regression. Kim and Curry (1977) and Little and Rubin (1987, p.43)
concluded that if the data were Missing Completely at Random (MCAR) and correlations were
modest, the available-case methods were more desirable than the complete-case analysis, because

they did not waste as many data points as the complete-case analysis did.

2.4) Imputation as a method to handle missing data

In addition to the aforementioned historical approaches, Little and Rubin
(1987 pp.39-71) summarized major imputation methods (methods to ﬁll in sparsely ﬁlled data)
that are commonly used in sample surveys. These methods, however, were not always applicable
to analyses of assessments, tests, or examinations usually designed to measure fewer constructs
(which are manifested as groups of test items) than sample surveys are designed to measure.
Imputation methods often replace missing values by other observed values in the same survey.
Assessments, particularly performance based assessments, do not always have as many items as
in sample surveys or in national tests. Performance-based tasks such as writing prompts are
frequently scored as separate items. Because of cost and time constraints, very few items are
typically administered in performance assessments. In fact, in all the examples of large-scale
examinations that follow, very few items were administered. Frequently, large-scale tests
administer only two items (The Collegiate Assessment of Academic Proﬁciency, Authors, 1998)
1998). Some tests administer even just one item (The 1998 NAEP Writing Assessment, US.
Department of Education, 1998; The TOEFL Test of Written English test, Authors, 1998b).

Putting aside other controversial reservations against imputation, having too few items makes

imputation impractical for performance assessments.

2.5) Potential solutions in handling missing data in G theory

Smith (1978) examined the variability (stability) of the variance components for a two-

facet crossed model. This model is a major model used in performance assessment (e.g., the
person-by-tasks-by-rater design). With a focus on the variance component of the person effect,
Smith (1978) found that the stability of variance component estimates varied. Variability
depended on several factors, including the number of levels in the facets and the complexity of
the expected mean square equations used for estimation. Smith found that operational data sets
always were very large, and frequently, large data sets contain unbalanced designs, which can
cause unstable variance component estimates. He also found that changing the conﬁguration
(e.g., from a crossed model to a nested model) affected the stability of variance components like
that for the person effect. Therefore, Smith (1981) suggested the use of multiple generalizability
analyses {e.g., P:(lezF) and I:(FxP:S) }}' in place of a large and complex model (e.g, P:SxI:F)
because the expected mean square equations are less complicated and so one would obtain more

stable variance component estimates. Smith (1978; 1981) called for further examinations of the
use of multiple generalizability analyses in the context of unbalanced situations. Unfortunately,
according to the Social Science Citation Index (1978 - present), no follow-up research has been
conducted to examine the generalization of the Smith method.

Unsatisﬁed with the limitations of the MIVQUE, MINQUE, ML, and REML methods,
Babb (1986) developed a model and notation for pooling estimates of variance components
obtained from subsets of unbalanced data. According to Babb (1986), one can partition data into
subsets, each small enough to allow ML and REML estimation to be computationally feasible.
Then, one can pool variance component estimates obtained from subsets of unbalanced data.
Though the Babb models and notations were invented for unbalanced designs, one can adopt his
approach to handle balanced data because balanced data can be construed as a special case of

unbalanced data (Searle, Casella, & McCulloch, 1992). However, due to insufﬁcient time,

 

P:(le):F rs the shorthand notation for a three faceted desrgn In WthI‘l drflcrent test forms (1') contain drﬂerent sets 01
items (I). The testing agent administered any one of the forms to every school (S). Students (P) in a school respond to
only a portion ofthe items administered to the school.

15

computation resources, and ﬁnancial support, Babb (1986) did not demonstrate the extent to
which the pooled method worked for unbalanced data. Babb (1986) and Searle, Casella, and
McCulloch (1992) called for further research. In particular, they suggested one use Monte Carlo
simulation to validate the pooled method (e.g., Babb, 1986, p.26). Regardless of the potential
usefulness of Babb's approach, no other studies have followed up with that research. (No
research has referenced Babb in the Social Science Citation Index from 1981 to present).
Independent of Babb (1986) and Smith (1978, 1981), Chiu and Wolfe (1997) applied a
subdividing method to analyze unbalanced performance assessment data and concluded that the
subdividing method was practical and provided stable results in estimating variance components.
Chiu and Wolfe's (1997) subdividing method differed from those of Babb (1986) and Smith
(1978, 1981) in the following ways. Smith (1981) advocated using multiple generalizability
analyses to reduce complexities in design conﬁgurations (e.g., use the P:(lezF) and I:(FxP:S)
designs to estimate variance components in the P:SxI:F design). Chiu and Wolfe (1997) proposed
to divide a large data set into many smaller data sets with similar conﬁguration (i.e., divide a
large two-facet data set into multiple smaller two-facet data sets). Despite the distinctions, both
Chiu and Wolfe (1997) and Smith (1981) had the same purpose, which was to obtain more stable
estimates for the variance components. Babb (1986) developed notation and models to combine
variance components from balanced and unbalanced data subsets. For balanced data, Babb (1986,
p.22) showed that one pooled estimator of a variance component was the simple arithmetic
average of the estimators obtained for each individual data subset. For unbalanced data, he used
an approximation to estimate the covariances of variance components. Those covariances were
then used in estimating the pooled variance components. Babb (1986), however, did not examine
how to combine variance components for the modiﬁed balanced incomplete block design
frequently used in essay reading. (In this design, examinees respond to two essays and each essay

is graded by two raters, one rater grades both essays and is paired with a different rater on each

essay). Although Chiu and Wolfe (1997) combined variance components for crossed, nested, and
MBIB designs, they applied their method to only a single data set. They did not examine the
performance of that method in other data sets.

Three other studies also employed methods similar to the proposed subdividing method.
Lane, Liu, Ankenmann, and Stone (1996) examined the generalizability and validity of a
mathematics performance assessment using a two facet design, the person x rater x task design
(denoted p x r x t). Due to large and unbalanced data, they divided their data sets into 17 smaller
subsets of crossed design (p.81). Brennan, Gao, and Colton (1995) employed three completely
crossed designs to examine the generalizability of a listening and a writing test. They then
suggested that one might consider pooling the results from the three crossed data subsets to judge
the reliability of those two measurement procedures. Linn, Burton, DeStefano, and Hanson
(1996) conducted a pilot study to examine the generalizability of a mathematics test and used six
two-faceted crossed designs (p x r x t). Unlike the Chiu and Wolfe (1997) study, in which they
developed and examined the subdividing method, all three aforementioned studies focused on the
interpretations of the results based on the unveriﬁed subdividing method. None of the four

studies investigated the performance and generalization of the subdividing method.

2.6) Summary of literature review

This chapter reviewed the advantages and disadvantages of a number of methods (i.e.,
imputation, Iistwise and pairwise deletions, ANOVA methods, MINQUE, ML, and REML) and
concluded that all of these methods were disadvantageous in analyzing large amounts of
unbalanced data. They were either unable to produce unbiased variance component estimates or
required excessive computational power for obtaining G theory estimates. In addition, none of
these methods investigated the relationship between the accuracy and precision of the estimates
and the pattern and amounts of missing data in the context of performance assessments.

Engelhard (1997) surveyed various ways of constructing rater and task banks and showed how

missing data were manifested in these rater and task banks. However, Engelhard (1997) focused
on exemplifying different rater and task banks rather than investigating how rater and task banks
inﬂuenced the estimation of measurement errors in the G theory framework. Searle (1987)
suggested one use "subset analysis" to analyze unbalanced data using ANOVA models. This
method, however, was not examined in the context of G theory. In addition, Searle (1987) did not
relate the accuracy and precision of variance component estimates to the pattern and amounts of
unbalanced data. Babb (1985) was the closest study to the current dissertation as Babb described
how to modify the General Linear Model to analyze subsets of data. Nonetheless, he did not
verify his method because of the lack of computational resources to conduct a simulation study.
The current dissertation investigates a subdividing method, which allows investigators to
utilize unbalanced data to estimate variance components while requiring low computational
power. The current study also sets out to examine the extent to which the decision rules used to
set up a scoring procedure inﬂuence the accuracy and precision of G theory estimates. The
decision rules used to set up a scoring procedure are coined "rating plans" and they are studied
with other factors to determine in what circumstances the subdividing method can perform
optimally. Speciﬁcally, the current study examines these factors: rating plans, variations in item
diﬂiculty and in rater inconsistency, number of examinees, number of raters, and number of tasks

scored by a common group of raters. Chapter 3 reveals the rationales for choosing these factors.

CHAPTER 3: METHODOLOGY

This chapter ﬁrst summarizes the procedures of the subdividing method, then lists
research questions tailored to the rating of student work in performance assessments. Next, it
describes the conditions to vary, the data generation procedures, and then the outcomes used to
evaluate the subdividing method. The design of a Monte Carlo simulation study (Mooney, 1997;
Rubinstein, 1981) is discussed in detail in the context of "rating plans", which are sets of rules
used to assign examinees and tasks to raters during scoring sessions.

Monte Carlo studies (Cronbach et al., 1972; Mooney, 1997) are especially suitable for
the current study because one can evaluate the practicality and statistical properties (e.g., bias
and efﬁciency) of an estimation method by comparing the estimated parameters to the known
population parameters. In the measurement context, it is infeasible to conduct many reliability
experiments in which actual raters and examinees are crossed. Also, it is difﬁcult to imagine how
one could control precisely raters' severity / leniency and inconsistency for such experiments. In
addition, the real problem is that one cannot evaluate estimation methods based on data for which
one does not know the population parameters. These restrictions make Monte Carlo studies
especially appropriate for examining this subdividing method. Furthermore, as has been pointed
out by many researchers (e.g., Harwell, Stone, Hsu, & Kirisci, 1996; Longford, 1995;
Psychometrika Editorial Board, 1979), analytical methods (e.g., deriving expected mean squares
equations) can be inadequate for the examination of statistical properties in generalizability
analysis (especially for composite indices). Unlike analytical methods, simulation studies can be
used even if probabilities of selection (determined by the amount of missing data), sample sizes,
and the magnitude of variance components are treated as independent variables. Also, Monte
Carlo methods are especially suitable when it is difﬁcult to satisfy asymptotic assumptions
because only a small number of levels are sampled in each factors (e.g., only two levels in the

item facets, n, = 2).

3.1) Procedures of the subdividing method
The subdividing method has three stages. They are the 1) Modeling, 2) Estimating, and

3) Synthesizing stages. The sections that follow illustrate each of these stages in detail.

Modeling Stage —— In the ﬁrst step, the sparsely ﬁlled data set is divided into smaller
subsets of balanced data that exhibit structural designs common in Analysis of Variance, namely
the crossed design, the nested design, and the modified balanced incomplete block design
(MBIB). The sparsely-ﬁlled data set is divided into S, data subsets, with t = l, 2, or 3, and with
S, indicating the number of subsets in each of the crossed ( S I ), nested ( S 2 ), and MBIB ( S3)
designs, respectively. Note that the crossed and nested designs are structural designs that are
common in generalizability analysis. A MBIB design is formed every time one rater scores both
items and is paired with a different second rater on each item. See Appendix F for the structure of
the MBIB design. Chiu and Wolfe (1997) provide a detailed description of the algorithm used to
divide an unbalanced data into subsets of data. In the paragraph that follows, I summarize the
notation that can be used to implement the algorithm.

Throughout this dissertation, I use n!” to represent the numbers of levels in the f 'h
factor, tth design, and sth data subset, where f = {p, i, r, pi, pr, ir, pir} = {1,2,3,4,5,6,7}, t =
{crossed, nested, and MBIB} = {1,2,3}, and s = {subseth subset], subsets, } = {1,2, 5, }. The
unbalanced data set has a sample size of np... The number of raters involved to score examinees
in the unbalanced data set was denoted n,” and the number of items administered was denoted
n,. .. The two periods in the subscript indicate that the sample size was added across different
types of design and different data subsets in each design.

With the notation developed above, one can use the General Linear Model (GLM) to
structure the data in each subset and then estimate the variance components from those subsets
(stage 2), followed by pooling the estimates to obtain an overall estimate for the variance

components. The GLM, described in many places, such as Searle et al. (1992), is summarized

20

below. The pooling method is summarized in the subsequent section entitled "Synthesizing

Stage". The general linear model is:
_. 7
yAZXsﬂsl-Eng-ﬂf‘s' (I)

The )7 in the above equation represents a vector of scores in the sth data subset for each design

5

type. The scores are expressed as a sum of the overall mean and the effects of the seven factors.

The grand mean is represented in [3 , a vector of ones. The vector u comprises the means of a
3

single level of each of the seven factors and the design matrices Z contained only zeros and ones
to indicate the level to which a score belonged. Utilizing this GLM model, one can disentangle
the variations of test scores into multiple facets of variations (Brennan, 1992). The second
"Estimating Stage" described on page 23 serves this purpose. The following paragraphs illustrate
what constitute data subsets and how to determine the number of data subsets can be extracted
from a sparsely-ﬁlled data set.

Data subsets. In unbalanced situations, rather than having all raters score all the
examinees, sets of raters may score groups of examinees. Sets of raters can be either mutually
exclusive or inclusive and this is determined by the rules governing the scoring procedures,
termed the "rating plan" in this dissertation (this is discussed in detail in subsequent sections).
The scores that a collection of raters assigns to a group of students form the basis for conducting
a generalizability analysis and so, a collection of raters with a group of students can be construed
as a subset of data. Inclusive rater groups share raters and for this reason the number of rater
groups always exceeds the number of raters. In a hypothetical scenario with four raters, one
could form six rater groups of two raters. Using the letters A, B, C, and D to represent the four

hypothetical raters, one could form up to six collections. These six collections are denoted by six

21

pairs of letters as: {A8, AC, AD, BC, BD, CD}. Since, for example, raterA sat on three rater
groups (i.e., {AB, AC, AD} ), these groups were inclusive or connected by rater A.

In another scenario assuming no connections, the same four raters would form only two
collections of raters, which could be set up in one of the following three ways: {AB, CD}, {AC,
BD}, or {A0, BC}. The number of rater groups expands as the number of raters (denoted size of
rater pool hereafter) increases. If a rater pool was composed of 28 raters, as many as 378
inclusive (connected) rater groups could serve to score examinees. Alternatively, these 28 raters
would form 14 exclusive (disconnected) rater groups. The general equations for determining the

number of connected and disconnected rater groups given that each examinee is responding to

 

one item are:
n,“
number of connected rater groups or connected data subsets = [2 j; and (2)
. . n
number of disconnected rater groups or disconnected data subsets = :2" . (3)

Assuming that the rater groups score the same number of examinees, then the two types

of rater groups would score n,,.,/ [m] and np../ 12: examinees, respectively. In other words,
2

[up] subsets of data each include n,,../ ("r”) examinees for a rating plan utilizing connected
2 2

groups, and " - subsets contain n,,../ "r" examinees for a rating plan utilizing disconnected rater

 

 

groups. The scores that these groups assign to examinees exhibit a crossed structural design
when examinees respond to two items and both raters in the same group score both items. The
aforementioned GLM model estimates the measurement errors (i.e., variance components)
associated with the collections of raters.

The subsequent section entitled "Decision rule for weighting" on page 27 shows a set of

general equations, Equations (9), (10), and (l 1), to predict number of rater groups (data subsets)

22

given the size of rater pool and the connected rating plan. That set of general equations was
developed for a more ﬂexible connected rating plan in which each set of raters was not required

to score all items responded to by the examinees.

Estimating Stage —- In this stage, variance components are estimated for each subset of
data. The ANOVA method (e.g., Brennan, 1992, Searle et al., 1992, p. 173) can be used to
estimate variance components for the crossed and nested designs. For the MBIB design, the
Minimum Norm Quadratic Unbiased Estimate (MINQUE) method can be used to obtain the
variance component estimators (Bell, 1985; Giesbrecht, 1983; Goodnight, 1978; Rao, 1997).

For the ANOVA method, variance components are estimated by solving sets of Expected
Mean Squares (EMS) equations (Brennan, 1992, p.130) relating the variance components and

sums of squares. The EMS of each subset of data is expressed in the following matrix formula,

7{2 . . . .
where 0'. IS a vector of estimated variance components. The estimates are

& =CI'Z1} (4)

where C: is an f x f upper-triangular matrix of coefﬁcients of the variance components
estimated based on the GLM model (Equation (1 )), f = 1,2, ..., 7 represent the effects in the sth
data subset, and a: is a vector of sums of squares for the effects observed in the data. The

following is a representation of Equation (4) expressed in data matrices:

23

 

 

 

 

_ .2 _ Fninr 0 nl‘ Hi 0 0 I P _
:52", O npnr Dr 0 0 Up 1 :2?)
63.: O 0 "pm 0 “1 HP 1 SS...
6:.» = O O 0 nr 0 0 1 SSs.pi '
0 0 0 0 n, 0 1 $st
6:" SSs.ir

L63...“ 0 0 O O 0 Tip 1 bSSspir‘

L 0 O 0 0 0 0 1_

 

 

The C: matrix shows the numbers of levels for each factor. The computational formula for C: ,

derived by Searle et al. (1992, p. 173, Equation 18), is described in Equation (5).
-1 I
C. ={ ”117004sz As,j Zs,j) },_., (5)

where m = mg refers to the 1‘“ row and 1"“ column in Cj' . The Z matrices are the design matrices

for each of the effects in the GLM model shown in Equation (1). The A matrices are symmetric
matrices obtained in the quadratic forms when computing the sums of squares for each factor.
The sizes of the Z matrices depend on the numbers of levels in the factors (also termed as
facets in G theory). In the context of G theory, the person facet always has a large number of
levels and so the Z matrices can be too large to process, for example when there are 6000
examinees (yielding a square matrix with 36,000,000 entries). The subdividing method
overcomes this restriction by dividing the large data set into smaller subsets so that the A
matrices for the subsets are small enough to be processed. For instance, a subset of 100
examinees exhibiting a crossed structure has a square matrix A,” of dimension 100 x 100, with
10,000 entries, which is considerably smaller than the 36,000,000 described above when n =

6000.

24

Assuming a multivariate normal distribution for the score effects, the variance-

covariance matrix associated with the estimated variance components in 8: is (Brennan, 1992, p.

133):

V = CLZD,..(CI,'.)' (6)

where t = {crossed or nested} and D” is an f x f diagonal matrix containing the diagonal
elements 2(MS])2/(dfj +2). Note that the index j = 1, 2, f designates the score effects in a design

including both main effects and interaction effects.

Synthesizing Stage —— Meta-analysis (Hedges and Olkin, 1995), a quantitative method
to summarize research results, is especially suitable for the subdividing method because it is
capable of estimating an overall outcome based on many outcomes obtained from individual
empirical studies, or, here, data subsets. Thus, meta-analytic methods were used to aggregate
variance components from each subset of data. In the Synthesis Stage, data subsets were
weighted by subset sample size and then variance components were pooled. Composite indices
were computed on the basis of those pooled variance component estimates.

Weighted estimates of the variance components can be obtained by weighting the data
subsets by their sample sizes across all subsets from both the disconnected crossed and

connected mixture rating plans. For factor f we obtain

 

0'1” = 0'; = I] .~ .~ (7)

{ p, i, r , pi , pr , ir , and pir for a crossed or MBIB design
where f =

p, i , rzi, pi, and part for a nested design
3 = the s "' data subset
t = the t”' structural design , and
n = number of examinees in the s "' data subset of the t”I structural design.

p1,)

When the data subsets were equal in sample size (i.e., had the same number of

examinees), the weighted average variance component became:

6’}=5/=_S (8)

where S. is the number of data subsets across all the structual designs.

The critical questions about weighting are: (a) What are the consequences if weighting is
not used when it is needed? and (b) Under which rating plans can weighting be ignored? The
simulation study reported below in the Results section addresses question (a). Equation (8)
indicates one answer to question (b), which is that weighting is not needed when sample sizes are
equal in the data subsets. This occurs when raters evenly share the workload. The answer to
question (b) becomes elusive when there is no plan to ensure that raters evenly share the
workload. In that case, one has to decide whether or not data subsets are equal in size. The
following section provides a decision rule and its validity was tested in the simulation study

reported in the Results section.

26

 

start

V
Does the rating plan Yes ,
have data subsets of ' D Weighting "01 needed
equal size?
No or Uncertain
Y

Comparison between the average data subset size and
the minimum batch size.

See Decision Rule for Weighting

V

Does the average
data subset size
exceed the minimum batch
size?

Yes
> Weighting needed

No
V

Weighting not needed

 

 

 

Figure 1: Decision rule for weighting

Decision Rule for Weighting

If a sparsely ﬁlled data set is divided into many subsets containing different numbers of
examinees (i.e., sample size), these subsets will have to be weighted in order to provide precise
variance component estimates. Data subsets differ in sample as a result of the random process
used to assign a large volume of examinees and tasks to a relatively small number of raters.
When batches of the work submitted by examinees outnumber the rater groups, some rater
groups will have to score more batches than the other groups and thus weighting is needed to
account for such a difference. The following are the steps to determine whether or not the

number of batches exceeds the number of rater groups.

27

(I)

(1')

Compute the total number of possible data subsets, which is the count of all possible data
subsets in the structural designs available in the rating plan. For instance, the connected
mixture rating plan has data sets from three structural designs, where counts are shown

here.

The number of crossed data subsets, Smw = (:2), (9)
The number of MBIB data subsets, SW,” = [ ::.J n,.( n, — nr.), (10)
The number of nested data subsets, SW .1 = [:1 n;;2], and (l 1)
Total number of possible data subsets, S. = SU,,,,.,.,, + Sm”, + S,,,.,,,.,,, (12)

where n,. is the number of raters to be subsampled from a pool of n, raters. It is

equivalent to the number of ratings on an item for a given examinee.

. n ..
Detennrne whether or not the number of batches (#) exceeds the number of rater
m at

groups or potential data subsets (S.). If so, then weighting is needed. Alternatively, this
decision rule can be expressed in terms of any other forms including the volume of
examinees (n,,.. ), number of rater groups (S.), and minimum batch size (MinBat). For
example, these three terms can be expressed as follows (if the the following inequality is

true for any given sparsely-ﬁlled data set, then weighting is needed).

n,,,, > s, * MinBat (13)

28

For example, given a pool of four raters to score 1500 examinees using a minimum batch size of
12, should one weight the data subsets by sample size? The computations and decision for steps I

and II are shown here.

4
(I) Number of crossed data subsets, Smw = [ J: 6

4
Number ofMBIB data subsets, SW”, = [212(4 — 2) = 24

4 4 — 2
Number of nested data subsets, SW“, = [2X 2 J: 6

Total number of possible data subsets, S. = 6 + 24 + 6 = 36
(11) Since there were more batches (i.e., 1500/ 12 = 125) than possilbe rater

groups (i.e., total number of possible data subsets = rater groups = 36),
weighting is needed.

In succeeding sections, 1 state research questions and describe how to manipulate
independent variables to address those questions. A review of literature guided the choices of the

population values manipulated in the independent variables.

3.2) Research questions

What is the performance of the subdividing method? The answers to research questions
one through four in Table l are addressed in the light of the accuracy and precision of variance
components. The amount of missing data is manifested by the size of a rater pool and the volume
of performance-based tasks to be scored. The m of missing data is manifested by the rating
plan used to score the examinees. The certaing of a decision can be evaluated by examining the
accuracy and stability of the estimated variance components and composite indices. The
reliability and the dﬂendability of a scoring procedure are represented by the p2 and (I)

coefﬁcients, respectively. To what extent do the amounts and patterns of unbalanced data

29

inﬂuence the certainty of j udging the measurement errors, reliability, and dependability of a

scoring procedure? Research questions four through eight in Table 1 address these questions.

Table 1 : Research Questions

 

Research Questions Rationales and Signiﬁcance

 

How does the use of weighting schemes to combine the variance
components inﬂuence the accuracy and stability of the estimates of the
1) How did weighting inﬂuence the G variance-component and the composite indices? When do data subsets

theory estimates? need to be weighted? What are the situations in which weighting can be
ignored? What are the consequences of not using weights when they
should be used?

 

A batch contains a number of examinees whose performances are scored
by a common group of raters. How well can the subdividing method
2) What was the effect of doubling recover the population value of the variance components and

the batch size? generalizability coefﬁcients when the "batch size" changes? Will
doubling the batch size (i.e., from 12 to 24) increase the accuracy and
precision of the variance components estimates?

 

3) How accurately did the Both the variance components and composite indices are critical in
subdividing method recover determining the quality of measurement procedures. How accurate can
variance components and the subdividing method recover the parameter values of variance
composite indices? components and composite indices?

 

Increasing the sample size of one facet has an effect on just that facet
itself when one uses ordinary algorithms such as ANOVA, MINQUE,
ML, and REML methods. To what degree does the subdividing method
have the properties that ordinary algorithms have? Speciﬁcally, when
one increases the number of raters used and examinees tested, it should
have little to do with the item effect. Does this property hold for the
subdividing method?

4) How well did the subdividing
method estimate the item effect?

 

Increasing the rater pool size, on the one hand, gives more information
about the degree to which raters score examinees differently. It, on the
other hand, causes more unobserved data because no matter how large
the rater pool size, only a random pair of rater is chosen to score an
5) What was the effect of expanding examinee. To what degree does the expansion of the rater pool increase
the size of the rater pool? the precision of the estimation of rater-related measurement errors, in
unbalanced situations? How do the two factors (rater pool size and
amounts of missing data) inﬂuence the person-by-rater effect? Will the
increase in the rater pool size compensate for the increase in the amount
of missing data?

 

Frequently. large-scale testing programs score a tremendous volume of
examinees and it is infeasible to have all raters score all examinees.
Given the large amounts of data and the sparse nature of the data
structure in these testing programs. can the subdividing method handle
the data? If so, how well does it perform?

6) Can the subdividing method
handle a large volume of
examinees? How well did it
perform?

 

 

 

 

30

 

Research Questions

Rationales and Signiﬁcance

 

7) What were the advantages and
disadvantages ofthe two rating
plans?

To what extent does the disconnected crossed rating plan provide better
estimates than the connected mixture rating plan? In the connected
mixture rating plan, only a portion of the data is allocated to estimate the
rater related effects such as the person-by-rater effect; how precisely are
these effects estimated? How do these estimates compare to those in the
disconnected crossed rating plan?

 

8) How did the amounts and
pattern of missing data
inﬂuence the norrn- and
criterion- referenced indices?

 

 

In addition to variance components, generalizability coefﬁcients
and misclassiﬁcation rates are used for making decisions
regarding the overall quality of a measurement procedure. How
do the amounts and patterns of missing data inﬂuence these
composite indices, namely the generalizability coefﬁcient,
dependability coefﬁcient. and the misclassiﬁcation rate?

 

3.3) Conditions to vary

Summary of All Conditions — Table 2 shows the conditions used to evaluate

performance of the subdividing method. The simulation study entailed the following factors,

resulting in 176 conditions.

' Rating plans (2 levels)

' Number of examinees (4 levels)

' Number of raters (3 levels for the 750-, 1500-, and 3000- examinee
conditions and 2 levels for the 6000 examinee condition)

' Variation in item difﬁculty (2 levels)

' Rater inconsistency (2 levels)

' Number of essays in a batch (2 levels)

31

 

Table 2: Experimental conditions to evaluate the subdividing method

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

Hi h Item Effect 02 Low Item Effect 0'2
WW 9 ( ') l ')
Blunt-tar Lemma: Hummer Lowrator
W103.) 1W3 Image-J inconsistency (5")
12my1124auydi12oouy1124usayai12may124 120a“ 124ml
N . amin N rotors per batch porbatch parbatch par batch par batch batch porbatch porbatch
2 .
750 4 I ' """""
8 ........ J ................................
4 I -- _
101mm ‘5°° *3 :IIZIIIEIZIIIICZIIIIIIII-IIIIIL.
Crossed Rating 1‘ l
“W 8 ........ I .................... . _________ -. ___________
3000 14 11
28 l
14 T
5000 28 ‘1 '"
2 I
750 4 I
3 i‘
‘ WV
tuinunor 1500 a j
ConnoctodRating: 1‘ .
Piano 8 T
F°°° ‘4 """"" """"" III: ______ t
28
14 ‘
500° 28

 

 

 

 

 

 

 

 

 

 

 

 

Note The nwnbors of raters reported indicate the number of raters needed to complete the scoring ”salon in 40, 20. and 10 days. respectively it is anon-rod that oxatmnoos

respond to two dams and each item is scored twice by two different raters and it takes 10 minutes to read an essay and all raters work 7 5 hours a day Consequently, it will take 40
days to score when a low volume at raters (I e . 2. 4_ 8,6 14) is recruited for each N examinees value The cconng time wil decrease by 50% when a rhodium rotor pool 1| employed

(1 a , 4, 8, t4, 8. 28) it takes only 10 days when a large pool of raters is iacmned (I o 8 14 8 28) A rater pool size of 4 is considered medium when it is used to score 750 examinees,
but it is considered low when is used to score 1500 examinees The same rule applies to other Size of rater pools

 

 

 

 

Rater Severity and Inconsistencies Rater severity and rater inconsistencies are
reﬂected in the variance components (02,, 02W, 02". and Oz"). Longford (1995, pages 21-2) deﬁned
02, as the rater severity and 02p, as the rater inconsistency in a one facet person-by-rater model. I
have adopted his deﬁnition for 02, and elaborated his deﬁnition of rater inconsistency to

distinguish the two types of rater consistency in a two facet model. Speciﬁcally, I deﬁne 62p, as

the effect for person-by-rater inconsistency, 02., as item-by-rater inconsistency, and 02p" as

idiosyncratic inconsistency. Rater severity (02r = E( y, - ,u )2 ) refers to the expected variation

in a random rater's mean score ,u, (over the population of examinees and items) about the mean
score of all raters ,u (mean over the populations of examinees, items, and raters). So, a large

rater severity effect indicates that the mean scores were different between raters and thus some
raters were more lenient or harsh than the others. Research has repeatedly found that rater

severity is almost negligible across many different types of assessments (e.g., Brennan, 1995, a

32

writing test; Shavelson, 1993, a science test), given that sufﬁcient training and monitoring is
provided to raters (Cronbach et al., 1994; Koretz et al., 1994; Patz, 1996; Wainer, 1993).

I varied the magnitude of the person-by-rater inconsistency effect in the simulation
because it has important implications to fair assessment in scoring (Do the raters score
examinees differently averaged across items?) Since rater severity has been shown to be
universally so small that it can be neglected, it was practical to hold it constant and manipulate
person-by-rater inconsistency in the simulations.

Table 3 on page 33 shows a summary of the variance components reported in four
published studies involving human judgements. The third column indicates the scoring scale
employed in the studies. As the studies employed a different scoring scale (a 6-point scale on the
ﬁrst two studies and 5-point scale on the last two), it was necessary to use a common metric
(relative percent of variation based on the total variation) to compare the variance components.
The mean percent of total variation for person-by-rater inconsistency (mean 02p, = 3%) was both

Iar er and more variable than that for rater severi (mean 02, < l%).
g

Table 3: Summary of variance component magnitudes in the literature

 

 

 

 

Scoring Total Person Item Rater
Authors Year Subjects Scale Variation P l R Pl PR IR PIR, E
Brennan et al. 1995 Writing 0 to 5 1.15967 59% 2% 1% 14% 4% 1% 19%
Chiu et al. 1997 Writing 1 to 6 0.47281 41% 5% 0% 25% 4% 1% 23%
Lane ct al. 1996 Math 0 to 4 1.84215 25% 1 1% 0% 53% 0% 0% 10%
Linn et al. 1996 Math 0 to 4 1.01500 20% 22% 0% 33% 1% 2% 21%
Average 0.96331 35% l 1% 0% 31 % 3% 1% 20%

 

 

Examinecs — Large-scale assessments can have numbers of examinees ranging from a
few hundred to several thousand, or even tens of thousands for state and national tests. Longford
(1995) reported that 3,756 examinees responded to the Studio Art Portfolio Assessment and
Myford, Marr, and Linacre(1995) reported 5,400 examinees took the Test of Written English
(TWE) in one administration. Chiu and Wolfe (1997) stated that 5,905 examinees participated in

an administration of the Collegiate Assessment of Academic Proﬁciency (CAAP). Lane, Liu,

33

Ankenmann, and Stone (1996) conducted generalizability analyses on 2,514 examinees who had

responded to all the tasks in the QUASAR Cognitive Assessment Instrument (QCAI).

Item Sampling — Data were simulated to model a case with two items and two ratings
per item. This choice reﬂects the common practice in examinations where essay writing was
involved (e.g., Collegiate Assessment of Academic Proﬁciency or CAAP, Graduate Management
Admission Test or GMAT, and Medical College Admission Test or MCAT). Each item was

scored two times (occasions) by completely different raters.

Size of Rater Pool —— The size of the rater pool was varied to reﬂect practical situations.
In the study by Lane, Liu, Ankenmann, and Stone (1996), 34 raters were hired to score the QCAI
examinations of 2,5 I 4 examinees. In another study by Chiu and Wolfe (1997), nine raters were
used to score 5,905 examinees. The two examples (Lane et al., 1996 and Chiu & Wolfe, 1997)
show that the number of examinees and the size of the pool of raters need not be in direct
proportion. Many other intervening operational factors inﬂuence this functional relationship.
Such factors include the number of tasks answered by an examinee, number of ratings on each
task, total number of days available for scoring, time (in minutes) it takes to score a task, and
average work hours per rater per day. Appendix G shows an equation to determine the number
of raters needed to complete the scoring of an examination for varying sample sizes, while

holding constant the other operational factors.

Amounts and Patterns of Missing Data — To test the robustness of the subdividing
method, I modeled a practical situation in which the pattern of missing data was contingent on
the measurement procedure. Changing the size of a hypothetical rater pool while holding
constant the number of times a task was rated changed the amount of missing data. If ﬁve

examinees were crossed with four raters, the number of possible ratings per item is 20 (5

34

examinees x 4 raters). However, if one randomly chooses only a pair of raters out of the pool of
four raters, the number of possible ratings reduces to 10 (5 x 2), which is 50% of the total
available ratings if all four raters were used (20). The other half of the ratings would be missing
because they were unobserved by design. The percent of unobserved data increases further as the
rater pool size expands to six raters. Here using a pair of raters resulted in 66.6% of all possible
data being unobserved (10 ratings of 30 possible are observed). The proportion of missing data
increases as the size of the rater pool increases, holding constant the number of ratings each
examinee received. The current study manipulated the rater pool size to investigate the effects of

amounts and patterns of missing data on the accuracy and precision of the subdividing method.

Rating Plans — Frequently, raters work in groups during scoring sessions (Clauser,
Clyman, and Swanson, 1999). The decision to group raters determines the nature of the
unbalanced data patterns in generalizability analyses and these decisions are referred to as "rating
plans" throughout the current dissertation. How many rater groups can a rater sit in? Are the
raters required to score all tasks or just one task submitted by an examinee? How many times is
an examinee scored for each task submitted? How ﬂexible are the rules used to assign examinees
and tasks to raters? Despite the fact that these decisions are indispensable in setting up scoring
procedures in operational settings, they are seldom written or published.

Using the rating plans employed by Brennan, Gao, and Colton (1995), Chiu and Wolfe
(1997), Lane et al. (1996), Linn et al. (1996), Gordon, (1998), Vickers (1998), and Welch (1996),
four principles that characterize rating plans, listed in Table 4 (p. 39), were deduced. With these
principles, two basic rating plans were examined in the current dissertation, namely the
disconnected crossed rating plan and the connected mixture rating plan.

Disconnected crossed rating plan. The disconnected crossed rating plan has often been
used for research purposes, and entails rigorous rules for setting up scoring procedures (e.g.,

Brennan, Gao, Colton, 1995). An example of the disconnected crossed rating plan follows. Prior

35

to staring to the scoring, raters were grouped and each group would be expected to score the
same number of examinees. In this arrangement, raters work within rather than across groups —
all members in the same group score all the items / tasks submitted by all examinees assigned to
the group. This scenario was referred to as "disconnected", as there were no common raters
sitting in two groups (Engelhard, 1996 and Searle, I987). The merit of this setup is the capability
of accumulating a large amount of data for each group for subsequent generalizability analysis.
The disconnected crossed rating plan is frequently adapted when the volume of examinees is
manageable to manipulate the ways of assigning examinees and tasks to raters, assuming the
assignment is implemented manually rather than electronically (storing the tasks on digital
formats and use computers to assign the task to raters).

Regarding this disconnected crossed rating plan, researchers have studied whether raters
were aware of their membership in a group and whether raters were allowed to discuss the
scoring process (e.g., C lauser, Swanson, & Clyman, 1996). For this dissertation, no assumption
was made concerning rater discussions. As much as raters might be aware of their membership in
the disconnected crossed rating plan, they are not necessarily aware of the group they belong to
because the group membership may be decided as a post hoc or a random process. For instance,
portfolios may be grouped in advance and one rater assigned to score those portfolios once.
Another rater may be chosen at random, without noticing or knowing of the ﬁrst rater, to assign a
second rating to the same set of portfolios. Although the two raters did not know with whom they
worked, they are considered to belong to the same group as they scored the same set of portfolios
from the same examinees. This is a "crossed" rating plan as raters in a group are instructed to
score all the tasks submitted by examinees assigned to the group. Figure 2 depicts a sample
disconnected crossed rating plan using a hypothetical data set with a pool of four raters scoring
two items for and 50 examinees. Each "X" represents a test score assigned by a rater to an

examinee on an item. Cells without an "X" indicate missing or unobserved data.

36

 

Essay 1 Essay 2
81111.1 Batu] Eaten Ham Ham mu

I

 

)
I

 

 

 

 

coma—~3mxrn

‘u-yyxuyy

 

 

 

 

 

 

axxxx

 

Figure 2: A hypothetical data set illustrating the
disconnected crossed rating plan

Connected mixture rating plan. The connected mixture rating plan is frequently used
when the volume of examinees is large and it is more cost-effective and convenient to use a
random process than to impose rigorous rules guiding the rater-task-examinee assignments. For
instance, an examinee's tasks might be organized in a portfolio (containing tasks submitted by the
same examinee), which would then be mixed with other examinees' portfolios for raters to select
at random. Once a rater had selected a portfolio, s/he scored one or more tasks in that portfolio.
Whether or not the rater scored all the tasks in a portfolio (denoted as a crossed structural design)
may depend on convenience, guidelines suggested by scoring centers, the nature of the
examination, and expertise of the rater. Although tasks are more likely to be scored according to
raters' expertise in a highly specialized examination than in, for instance, a language arts writing
exam, variations exist regarding the number and nature of tasks in a portfolio scored by a rater.

Raters may be instructed to score an essay in a portfolio and then return it so that another rater

37

can be randomly selected for scoring the other essay in the same portfolio (denoted as a nested
structural design). If one rater works with one other rater on a ﬁrst essay and then works with
another rater on a second essay, this leads to the MBIB structural design mentioned in the
Modeling Stage on page 20.

Unlike the disconnected crossed rating plan in which raters were usually grouped prior
to the starting of scoring, the idea of forming groups in this rating plan is less conspicuous —
raters do not usually know who they will work with. Due to the random process used for the
rater-task-examinee assignment, with this plan raters have the opportunity to work with more
raters than they could in the disconnected crossed rating plan. Raters are "connected" in this
rating plan because their ratings are compared directly or indirectly through other raters. Figure 3
on page 39 shows a hypothetical data set illustrating the connected mixture rating plan with two
essays, four raters, and 50 examinees. By examining which raters were chosen to score an
examinee, one can observe that the hypothetical data set contains three structural designs, namely
the crossed, MBIB, and nested designs (these designs are separated by two horizontal lines in the

ﬁgure).

38

 

 

 

 

 

Essay 1 Essay 2
W WW
I I 8"“ I
I I I I
I I I I
I I I I
I I I I
I I I I
I I I I
I I I I
l I I I
—'I‘” IV H W __#. F— 'i
I I I I
I I I I
I I l I
I I I
I I I I
I I I I
I I I I
I I I I
E I I I I
I I I I
x I I I I
I I I I
a I I I I
I I I I
m I I I I
I I I I
' I I I I
l I I I I
I I I I
n I I I I
I I I I
e I l I I
I I I I
e I I I
I I I
I I l
I I I I
I I I
I l I
I I l I
I I I I
'——_I M T, #‘ I I
I I I I
I I I
I I I I
I I I I
I I I I
I I I I
I I I I
I I I I

 

 

 

Figure 3: A hypothetical data set illustrating the connected
mixture rating plan

In practice, different versions of the two rating plans introduced in the above section
were adopted in various scoring centers (Gordon, 1998; Vickers. 1998) and the principles shown

in Table 4 capture the essentials.

Table 4: Principles of rating plans

 

 

 

 

 

Principles Disconnected Connected
crossed plan mixture plan
I. Number of ratings on each task 2 2
for a given examinee
2. Linking between groups raters belong to only one no restriction
group
3. Guiding structural designs limited to only the no limitation; structural design could be crossed,
within a given group crossed design nested, and other designs
4. Number of examinees scored by planned; unplanned:
groups of raters. every group of raters groups may or may not score the same number of
scores the same number examinees depending on the size of rater pool and
of examinees. sample size.

 

39

 

Batch Size — The procedure used to assign tasks to raters was called task assignment. In
addition to randomly assigning essays to raters, scoring centers (e.g., Georgia State Department
of Education and ACT, Inc.) often impose rules for the scoring procedures to accommodate
operational needs. An important dimension of setting up a rating plan is to arrange essays so that
they can be efﬁciently graded. For example, rather than randomly assigning every essay to each
rater, scoring centers often organize essays in batches (Gordon, 1998; Schafer, I998; Vickers,
I998; Welch, 1996; Wolfe, 1998). Those batches can be randomly assigned to groups of raters.
Packing essays in batches saves operational time because it takes more time for raters to
exchange single essays than a batch of essays. Packing essays in bundles also controls the
number of essays to be scored by a common group of raters. Bundling also structures ratings for
reliability analysis -- without the bundling of essays, the data set may be too sparsely-ﬁlled to

conduct an analysis.

3.4) Data generation

Linear model — A total of 17,600 balanced data sets were generated, 100 sets for each
the 176 conditions under the speciﬁcation of a two-faceted balanced design (Schroeder, 1982,
p.36), namely the person x item x rater design speciﬁed in Appendix A. The score Xp" of any

given observation in this model was expressed as a sum of seven components,

Xpir = xp + x. + xr + xp, + xpr + x,, + xpm, (14)
Each of the seven components were generated under a normal distribution Xa ~(0. of, ),
where o-f,={ 0': =0.3; 0,2 = 0.02 (low), 0.11 (high); of =0.01; 03/3,, = 0.30; 0;, = 0.01 (low), 0.1
(high); 0,2, = 0.01; and of,” =0.20). For example, the score for person p, responding to item i,

judged by rater r, was the sum of seven random numbers each generated independently from the
above seven normal distributions. Table 5 on page 41 shows the population values of the

variance components and the values of the corresponding composite population indices.

40

Table 5: Population parameters for the variance components and composites

 

 

HI?!) 5:" Enrol Lav. ism Elk—c!
Crosses a-id MBIB Deserts Hgn Ram
, ”or
F
I
l'
Variance Components pl 0 30 0 30 0.30 o 30
p. o 10 0 01 o 10 o 01
II 0.01 0 01 0.01 o 01
plr 0 20 O 20 0 20 0.20
W 5" “M‘M‘ 0.50000 0 45277 0.50000 0 45277
w w“"‘""'"" 0.55902 0.51720 0 51720 0 47170
Compost-e WW 0 58333 0.63063 0 58333 0 63063
WWII 0.52830 0 56680 0 56680 0 61135
W Error 0 03682 0.02659 0 02659 0.01700

 

Observed Score Scale — The observed scores Xplr were the sums of the scores of the
seven effects in (14). The sums would have a mean of 0 and standard deviation of approximately
one indicating that roughly 99.9% of the scores should be between -3 and 3. If 3 were added to
all scores, the total scores would approximate scores on a seven-point scale ranging from 0 to 6

with a mean 3 and standard deviation of 1.

Score Scale Truncation -— In practice, test scores for performance assessment are often
assigned as integer scores (e.g., I, 2, ..., 6) with an underlying discrete distribution. Much
research, however. has employed the normal distribution or other continuous distributions for
research purposes (e.g., Brennan, Harris, and Hanson, 1987; Bost, 1995; and Smith, 1992).
Longford (1995) examined the effect of using normal scores as opposed to integer scores when
using simulations to examine the accuracy and precision of estimated variance components.
Using a one-faceted model (person-by-rater), Longford simulated 200 trials with test scores
generated from a normal distribution. He then compared the estimated variance components
obtained from the normal distribution with those obtained by truncating the fractional scores to

integers. Longford concluded that the bias due to the truncated scores is somewhat greater than

41

that for the estimator for the 'normal' scores, but the difference in bias was unimportant. The

following reported the results found in Longford's comparison (Longford, 1995, pp. 43 - 45).

Table 6: Comparsion between normal and rounded scores

 

 

 

11031131 Scores Rounded Scores
02p 03, 02p“ 02,, 02: Ozprc
True value 0730 0.062 0.370 0.730 0.062 0.371
Mean 0.749 0.077 0.350 0.722 0.074 0.421
Std. (0067) (0.038) (0.040) (0.065) (0.038) (0.041)
deviation

 

Missing Data Generation — Following the generation of the balanced data sets,
sparsely ﬁlled data sets were created. This was accomplished by randomly deleting scores from
the balanced data sets. The sparse patterns were modeled to reﬂect the unbalanced patterns
appearing in the two rating plans (see Appendix I for programming code).

Disconnected crossed rating plan. The following three rules were employed to generate

 

data for the disconnected crossed rating plan.

I = crossed (15)
rater,“- ¢ rater',,,,..~ (l6)
rater” ¢ raterm- (17)

The rule listed in Equation (16) ensured that no single rater score an examinee on the same
item (i) twice in a given data subset (s) of the crossed design. The last rule, in Equation ( I 7),
required that every rater participate in the scoring of only one data subset. In Equation (15), only
the crossed design appears implying that the raters who scored one item for an examinee also
scored the other item and that the same raters scored all the examinees in a data subset (s).

Connected mixture rating plan. Regarding this rating plan, only the second rule (16)

applied to the data generation procedure. The third rule was not imposed on this rating plan so

that raters could participate in scoring more than just one data subset. Whether or not they
participated in more than one data subset was a random process. When raters participated in
more than one data subset, they provided a link between the subsets they scored and for this
reason, the current rating plan was referred to as "connected". The ﬁrst rule was amended so that
t = crossed, MBIB, nested. (l 8)
Because raters scored eitherjust one item or both items for a given examinee in a data
subset, this arrangement allowed a data subset to exhibit either a crossed, MBIB, and nested

structure and so this rating plan was referred to as a mixture of structural designs.

Negative Variance Components — Variance component estimates can be negative
because of many reasons discussed in Brennan (1992), Cronbach, Gleser, Nanda, Rajaratnam
(I972), Marcoulides (1987), and Searle, Casella, and McCulloch (1992). Some reasons are: (a)
The population values are indeed zero or close to zero; (b) Insufﬁcient data are used to estimate
the variance components; (c) The model is misspeciﬁed; and (d) The estimation procedure is
incorrect. Brennan (I992, p.48) suggested one examine possible reasons contributing to the
occurrence of negative variance components and asserted that setting negative estimates to zero
resulted in biased estimates. Because unbiased estimates were desirable, negative variance
components were set to zero for reporting, but their negative values were used in all

computational procedures for composite indices.

3.5) Outcomes and data analysis

Outcomes — Two performance measures of the estimators produced by the subdividing
method were examined. Accuracy indicated the degree to which the average of an estimator
departs from its population value. Precision indicated the variability of an estimator. Both
criteria were important for the estimates of the variance-components and the composite indices

because how well the estimators perform on these criteria affects high-stakes decisions made

43

based on G theory. The estimates and the true parameter values for variance components, the
generalizability coefﬁcient, dependability coefﬁcient, and misclassiﬁcation rates were examined

using the Accuracy and Precision measures, which were summarized as follows.

Accuracy, Bias, and Precision — The Mean Square Error (MSE) indicates the squared
loss, or the averaged square difference between an estimator and its known population value.
Harwell et al. ( 1996) and Othman (I995) used this index to evaluate the quality of an estimator.
This index comprised two components, namely the squared bias and the variance (see the

following relations).

 

 

MSE: F‘
e
L' A T 2
= (9 -6 ) +
= Squared Bias + Variance (l9)

The 63} in (19) represents the G theory estimate from the f“ trial; 0 is the known
population parameter, representing true values for the variance components and composites; e is
the number of trials of each simulation (i.e., l00); and 9 is the mean of 9'} over the e trials.

Ideally, a zero MSE would indicate that the subdividing method provided an estimate identical to
its population value. A low MSE is desirable because it indicates very little bias and variability
of an estimate. A large MSE is less desirable and can be contributed to by either or both a large
variance and a large bias. To disentangle these two sources of errors in estimation, researchers
(e.g., Marcoulides, I988; Othman, 1995) have reported variance and bias as two separate indices,
and it is a common practice to modify these two indices so that they become more meaningful

and easy to interpret.

44

mm Errors (inverse of precision). The square root of the variance in (19) equaled
the empirical standard error used to examine the variability of the estimators produced by the
subdividing method. The standard error was computed by obtaining the standard deviation of an
estimator in a simulation. The inverse of the square root of variance was referred to as
"precision", which was used interchangeably with standard error to describe the variability of the
G theory estimates in the current study. A precise estimate has low variance (or standard error)
and an imprecise estimate has high variance.

Accuracy (measure of bias). Accuracy of a simulation can be measured in many ways

 

and one lucid way was to express accuracy as a percentage. Technically, it was measured as the
average ratio between an estimate and the parameter value of that estimate across all replications.

Computationally, accuracy is deﬁned as:
Accuracy = 12—6—1 . (20)
e F. 0

The above index treated all discrepancies between the estimators to their population values
equally serious. An accuracy equals one indicates that the estimates were recovered perfectly;
whereas an accuracy higher than one indicates overestimation and yet an accuracy lower than

one indicates underestimation.

Empirical versus theoretical standard errors (SEs) — Standard errors based on
asymptotic assumptions (Brennan, I992, pp. 133-135; Burdick & Graybill, I992) can be
inaccurate because the degree to which the SE reﬂects the sampling distribution of a variance
component depends upon factors like sample size, normality, and the amounts and patterns of
missing data involved. In addition to relying on assumptions that were difﬁcult to satisfy in the
current data, Brennan (I992) and Burdick and Graybill (I992) did not discuss how to estimate or
compute a SE when multiple samples were available. According to Brennan ( 1992), the

theoretical SEs for the variance components are functions of the mean squares of the facets and

45

their degrees of freedom. The matrix notation for the theoretical standard error was summarized
in Equation (6) and the formulae are given in terms of variance components and sample sizes in
Appendix B. The theoretical standard errors reported in the Results section were based on the
formula in Appendix B, computed using two raters. the value two was chosen because a random
pair of raters was selected to score each examinee, even though more than two raters were
available in the pool. These standard errors were compared to the empirical standard errors to

investigate how precisely variance components were recovered over the 100 replications.

Empirical versus theoretical conﬁdence intervals (Cl) — The skewed distribution
(due to low df in a 12 distribution) of the variance components can cause their CI to be
asymmetrical; that is, the two sides of the CI have unequal lengths (e.g., a hypothetical 95% CI
for a variance-component estimate of 0.45 could be [0.40, 060]). Brennan (1992) and Burdick
and Graybill (I992) developed methods to construct Cls for variance components under balanced
designs but not for unbalanced designs. Nor did they develop methods to construct C Is for
composite indices.

To construct an empirical 95% CI for a variance component, I used the observed
variance components at the 2.5th and 97.5th percentiles of each of the seven simulated
distributions. The theoretical 95% CI was computed by multiplying the standard error by a
correction factor reported in Brennan ( 1992, Table D.l). The 95% empirical CI for the
composites was computed by ﬁrst obtaining the composites based on the synthesized variance
components followed by reporting the composites at the 2.5th and the 97.5th percentiles of the
composites.

Gao (1992) suggested that one use the upper and lower limits of the absolute standard
error of measurement and those of the universe score variance to compute the theoretical CI for
the composites. The upper limit of the CI for composites was computed by dividing the upper

bound of the universe score variance by the sum of itself and lower bound of the absolute

46

standard error of measurement. For instance, the upper limit of the generalizability coefﬁcient is

deﬁned as

 

 

Ep . = A 2 ”ﬂ: 2 ’ (2 l )
“"’" 0' + 0'
Pm... (5 ...,
A 2 6'2 6-2
3 0- . .
where 6. = p1,”, + p r + p zr.e,_ﬂ. (22)
1, arr! n: n: ”in:

Likewise, the lower limit of a 95% theoretical CI for the composites was obtained by
dividing the lower bound of the universe score variance by the sum of this variance and the upper

bound of the absolute standard error of measurement

 

Ep : ﬂ, Irnn'r 7 . (23)
low" A - A ..
O-plmur + 0-6'W'

47

CHAPTER 4: RESULTS

As described in the previous chapter, the volume of examinees, number of raters, and
examinee- and task- to-rater assignment determined the amounts and patterns of missing data and
thus inﬂuence the estimation of measurement errors. How does each of these factors impact the
precision of individual variance components (measurement errors for scoring performance-based
assessments)? How can one use the subdividing method to alleviate the potential inaccuracy and
imprecision caused by unbalanced data? Which variance components are inﬂuenced by the
increase in sample size, and which variance components are inﬂuenced by the expansion of rater
pool size? These questions are central to this chapter.

Section 4.1 examines the effect of using weights when synthesizing variance
components. Section 4.2 compares the measurement errors estimated from data generated by
different missing data mechanisms, namely using a small batch size (12 examinees) versus a
large batch size (24 examinees). Section 4.3 summarizes data on accuracy of the subdividing
method and sections 4.4 through 4.6 summarize data on the precision of the subdividing method.
Section 4.7 compares and contrasts the two rating plans (disconnected crossed and connected
mixture) in the light of the precision of variance components and section 4.8 covers the
performance of the subdividing on the generalizability coefﬁcient, dependability coefficient, and
misclassiﬁcation errors. The composite indices are discussed in the context of the disconnected
crossed rating plan as it utilized a larger amount of data than did the connected mixture rating
plan, for the analysis of a completely crossed person-by-rater-by-item design. Table 7 on page 49
provides an overview of the results reported in the subsequent sections, obtained from the

simulation with 100 replications.

48

Table 7 : Table of major ﬁndings

 

Research Questions

Results

 

l) Ilow did weighting inﬂuence the G theory
estimates?

Weighting had no effect on the variance components
estimated in the disconnected crossed rating plan. It
increased the precision of variance component estimates in
the connected mixture rating plan when the data subsets
differed in size.

 

2) What was the effect of doubling the batch size?

Using a batch size of 24 did not have any noticeable effects
on the accuracy and precision of the variance components
and composite indices in both rating plans.

 

3) How accurately did the subdividing method recover
variance components and composite indices?

The variance components and composites were recovered
with accuracy close to 100% in both rating plans.

 

4) How well did the subdividing method estimate the
item effect?

The subdividing method estimated item effects in the way
they should be estimated using ordinary algorithms in
balanced designs. Increasing the size of rater pool or the
sample size did not change the accuracy and precision ofthe
item effect.

 

5) What was the effect of expanding the size ofthe
rater pool?

Increasing the rater pool reduced the standard error of the
rater and item-by-rater effects in the disconnected crossed
rating plan and the rater-nested-in-item effect in the
connected mixture rating plan.

 

6) Can the subdividing method handle a large volume
of examinees? How well did it perform?

(a) The subdividing method can always handle more
examinees than any other methods that analyze the entire
data set all at once. As long as one can partition a sparsely
data set into subsets with manageable sizes, there is no
restriction on the size of the sparsely ﬁlled data set. (b) The
larger the volume of the examinees. the more precisely the
following effects were estimated: person, person-by-item,
person-by-rater, and person-by-item-by-rater effects.

 

7) What were the advantages and disadvantages ofthe
two rating plans?

The disconnected crossed rating plan requires more effort to
route the tasks to the raters but it provided more precise
estimates for the rater-related effects than did the connected
mixture rating plan.

 

 

8) How did the amounts and pattern of missing data
inﬂuence the norm- and criterion- referenced
indices?

 

All composite indices were estimated as accurately and
precisely as they should be estimated in balanced situations
in both rating plans. Subsampling different pairs of raters to
score different groups of examinees provided conﬁdence
intervals parallel to using the same pair of raters to score all
examinees.

 

49

 

4.1) Comparison of pooled results with weights and without weights

Precision (Inverse of Standard Errors) — Weighting data subsets by sample size
recovered very closely the theoretical standard error when the data subsets differed in sample
size. Data subsets differed by sample size occurred in only two conditions (ﬁrst condition:
n,,..=750 and nr..=4; second condition: n,,..=1500 and n,..=4) where the rater pool size had four
raters in the connected mixture rating plan. Figure 4 through Figure 9 on pages 52 to 54 show the
precision of the three variance components interacting with the object of measurement (person),

A2

namely 5-: , 0'" , and (3,, Weighting increased the precision of the variance components

obtained in the 750-examinees-by-4-raters and 1500-examinees-by-4-rater conditions. In contrast
to the weighted components, unweighted variance components manifested standard errors that
were larger and more different from the theoretical standard errors, which are shown as
horizontal lines in Figure 5, Figure 7, and Figure 9. The improvements in precision due to
weighting, averaged across the eight conditions (2 batch sizes x 2 levels of the item eﬂect x 2

levels of the person-by-rater effect), were 0.015 (4.3% of the population value), 0.010 (3.3%),

and 0.003 (1%) for 6—:, (if, , and (33,”. respectively. Weighting increased the precision of

composite indices 13,02 and (MK) with an average 0.019 (2.1% of the population values) and

0.013 (1.3%), respectively.

When the data subsets were very similar in size (i.e., those with expected subset size equal
the minimum batch size) the differences in precision between the weighted and unweighted
components were negligible. All increments in precision were less than 0.0035. These conditions
included 750-examinees-by-2-raters, 750 x 8, 1500 x 8, 1500 x 14, and all other conditions with
3000 or more examinees. This ﬁnding applied to the person (Figure 4 and Figure 5), person-by-

item (Figure 6 and Figure 7), and person-by-rater-nested-within-item (Figure 8 and Figure 9)

50

effects. The difference in precision was too low to inﬂuence any high—stake decisions based on
the composite indices.

When there was no subsampling of raters (i.e., 750-examinees-by-Z-raters), weighting did
not inﬂuence the precision at all (no increment in precision was present with rounding error at 7
decimal places). The data series labeled 750-examinee-by-2—rater in Figure 4, Figure 6 and
Figure 8 were identical to those in their corresponding ﬁgures showing the unweighted results.

Weighting by subset size did not increase the precision in conditions where the target facets
were unrelated to the object of measurement (increased precision only by 0.5% and 2.5% of the
population values of a"? and &3,, respectively.

As hypothesized, weighting had no effect on the disconnected crossed rating plan. Figures
10, 12, 14, and 16 show the weighted variance components for the disconnected crossed rating
plan. The unweighted variance components (Figures 1 1, 13, 15, and 17) were recovered as
precisely as the weighted estimates (i.e., the standard errors were identical between the weighted
and the unweighted estimates of the seven variance components for all conditions). In addition,
the empirical standard errors for both estimators matched the theoretical standard errors depicted

as horizontal lines in the ﬁgures.

51

m

me moan—5.8 503E333 ”m charm

me 838:8 Bias? 3 «Eu...—

 

 

 

Sa 22.8 a .2... .33. 3 3m

008 83 08'
0N 1 on I. o v— o v

in 22.8 a .25 .23. 3 3m

coon 8a..
on 3 o 3 o v o

 

 

 

 

 

a. .5an

 

. 88
o3 a-z 8 z
o . N m z ............
- -- :1 o
1.1.. so
.30 S
a
u
D.
a
9.1118... m.
3
m
.30
3o
.8...

 

 

 

E. coach.

v

o2 amz
N z 2

 

 

i ..t ll: Ill.l1 (A. O

t 5o

2 E
10113 DJ'PUI’IS

Es

moo

 

 

 

Embaomoo I I
thnomoo.
vnwozo . 4.
N? Po—Po
cm 86 C o ltl-
Np Foo—roll
vn '63le
S. #030
vn 8030
N.— 5036.181:
:23 5 .

 

an... «53.. EaBE 38.93.39

2: ..8 moan—:38 33303:: 2:. cox—$95 a... .3 55:59.80

52

B.

 

 

 

 

 

 

 

 

 

 

 

Se 83:58 8 E3»: u 25w. . b 83:58 8 a; " 95w.
N O O I N O O O
a.» 22.8 a .8; .83. .o .5 88 ﬁmhneum a .2... =5. .msnwa ...2
08¢ 88 8o. 9:. a12 on 2 an I o 3 o v «12
a 3 on I o : o e v N «.12 , , o
11 .. I 11 o H
1 11 “mooo
1 1 1 m8° ... ’43 .l.
. . IWHB ...,..I1 1 so 1.1-1 1 1 . erukﬂwﬂm . - 1 1 5o
inn»... .I. . 1.1.19.1 I .

lib-ﬂ n s 1 i 1 1 1 s
I 11 I. ..I. 'Q.”.....‘.- ’1- - 111 mPOO a m
. 1. )1 I . m w
. .. . 3o 1 1 m.
. - w a
l .. .93... m 111 m

I ll 1 ”CO .I l I I ..l MOO

mnoo I. l 1 11.82.

M 3.0 to o

:5 E3. 3 .389... :3 ES. 3 .323.

 

 

 

Quaataeu» .3..— u:_3.. 2:35 3.8853.»

2: ..8 83:58 82383:: .33 82$?» 2: .3 :82:an0

53

Macrame 83:58 83393.5 "a PEEK

9.3LnMUO mﬁwﬂamumﬁ ‘Qw—mumﬁg a” Obs-“mm

 

 

 

36 can...» a .09.. .83. .o cum

 

 

 

88 89.. 8.... on. auz
on 1. ca 3 o vw c v a v N 1 2
ll] 1!. .i!il .11 o
1 . mood
' Z . -
1m I’D“. I 30.0
/
coco
,. 000.0
1 . 50
N56
1 v8.0
.1122.
080

 

3:5 .52. 55.3 8.82 .33. 3 cocoa

10113 91390338

35 0585 a .3... 83m .0 36

 

 

 

 

 

8.2.3 .82. 55.3 3.82 831 an .3an

 

 

88 88 82 8.. ..-z
8 3 8 z o 1 o v a v N m z
; .1. 1,. o
1 11 1 11 8.;
I I ' .
. J...“ I 1 1 :20
I . I h..'1I“/'/' U I __
<11 111: s:
.LI/.A I 1. a:
1 1 1 8.0
1 laws...
1 11 1 too
11 .22.
£3...

101-13 PJ'PU'IS

 

 

 

 

«8:533. .3..— w.._3.. 258...: 32.8589
2.. ..8 83:58 82383.... E3 68.383 2.. ..e :emtanEeU

54

 

 

 

 

 

 

 

 

 

 

 

 

 

 

a. . a .
Nb 83:58 83395.5 . g a 95w...“ Nb 8:358 8:383 A: 95”:
33 29:3 a .00.“. 88¢ 88 on...” 39:!» a .8; .85. Hanna 8. ..
800 88 com. can alz 8 v. on 1 o 1 n v o w n a
8 3 8 to 1o . o v ~ «12 o
. . 1 .11. - 1. 1» .. i. 1. .. 1. 1. 1... 2311:! 1.. 1 11 . 1-1:..- .11. 1. . . 1..- ..1. 1.1 . O
1‘ 1“ ‘1 11 \“ \‘ mood
.108 o
w .11..1 11 1 1 1
23 m. w 1 11 1 ..1. 1 \ 23 S
U
w ...m
. .310 N.
3 . w.
32. m 1 1 18:
118° 1111‘11111
. mac 0
3.0 t . 1 v0.0
3. .3an a. 5an

 

 

 

 

5.... 95a.— .8898 uauuagummu

of ..8 83:58 83w83:= .23 82383 2.. .3 acmtaQEcD

55

 

.3

 

 

 

 

 

 

 

 

 

 

i . .
.Nb 83:58 8:38.55 .2 «Suﬁ Nb 83:58 8:383 .2 95$..—
36 295m a .8; .83. .5 238 a .8; .23. .o 36
88 88 83 SN. auz 88 8 82 auz
on 3 cm 3 o 3 c v o v N m z on 3 on 3 o 3 o v m 2
‘ll'. 1.1.1.113... 11013“?! ‘111111 .11. , .1. . 2.... 1” o 1 g o
1111. .88 ”111‘11b111 1 .7810
El H lanai
1 “ 8.0 S 5o S
a n
U U
. w . w
\‘ mac m. 92. m.
a m
1 So m no... 1
1 11 .83 1 ”198,0
moo mod

:8 Es. 3 82:.

 

 

 

:3 as. 3 82:.

 

 

«83558» .3..— w:_3._ .8898 883:8»..3
05 ..8 83:58 8:38.55 23 8:38.: 05 .3 :cmtuaﬁou

S6

xmb 83:58 Bangui—D "m— 9...”:— kmb 83:58 6.8.383 “3 «Law:—

 

 

 

 

 

 

  

 

 

 

ou on:— 5 on
3.» o_nEama_oou .83. .m _ 5.23.. 8.50 .m
1. 88 88 89 02 a1:
coon 82. on“ a z 8 I 8 I o I a v o v n «12
83.. 3o. ov~ «12 . 10
N80 . I “be. t I _
I]. w
1.11.1' 300
8c 1 1,
S 1 ,
a 1.1V. 1. 11 11...:
81° N. 1
u
p I . .
8... 3 11 1 11111 8...
m
1 81° 11 111 11 so
I 1!
. l1 . 1h. I. W
1 1 23 1 1 11 1 /\\u .23
:oo 1.1 1 . ,: 1 11 11178.0
ca. .83. an 2080.. :3 33¢ 3 .3801

 

 

 

 

«35538» .3... w:_3.. @388 3893:8958
2.. .8.— 83:.=8 82393:: :.3 83E“; 2: .3 5.8.2.580

57

10113 ”PMS

 

3.5% 83:58 cat—Hog "E FEM:—

 

 

   

Etmb 835:8 cut—E9551— “: e.:.w_ 1.— N
36 29an a Son .83. SB 2..an a .09... .83. .0 05m
88 Son 82 S: anz 88 88 o8 _ S: auz
on 1 cu 1 m 1 n t o v N z z aw I am 3 m I u c o q N m z
1 1 1 1 1 1111 1 1 1, 0 111111111111 0
1 1 1 1 «8 o 1 1 1 1 1 wn8 0
IF
M K 1 1 18,0 5 ‘ “vooc s
a 7 u
1 u u
D. D.
1 183 m 1 Son m
D. D.
m m
1 68 o J 1 1 1 18 o 1
1 6 o 11 1 1 ..o o
.~s o 1 1 123

3:33.031 .3 E3. 3 c029“.

 

 

 

 

353531 .3 Eu: .3 :oEun

 

 

«8:538» :2.— u:_3._ 33.88 38888.8
2: ..8 83:58 83993:: 6.3 83303 a... .3 8:39:80

58

Accuracy —— No matter whether or not variance component estimates were weighted by
sample size, they were recovered with high accuracy for both the disconnected crossed and

connected mixture rating plans in all 88 experimental conditions, which were composed by

crossing the high and low magnitudes of 0,2 and 02,; large and small minimum batch size; high,

medium, and low rater pool sizes; and the small, medium, large, very large sample sizes. The
weighted variance component estimates differed only minimally from the unweighted estimates,
in accuracy. For the connected mixture rating plan, the differences between the weighted and
unweighted estimates of the ﬁve variance components were 0.08%, 0.34%, 0.25%, 0.10%, and
0.04%, respectively. For the disconnected crossed rating plan, the accuracy of the variance
component estimates were close to identical; excluding the item-by-rater effect which had a
mean difference in accuracy 0.01%, the other six variance components were identical between
the weighted and unweighted conditions. Since weighting reduced the standard errors (reported
in the previous section "precision") and did not lower the accuracy (reported in the current
section) of variance component estimates, only the weighted estimates were examined and

reported in the results that follow.

4.2) The effect of packing essays into batches of 12 versus batches of 24

Packing essays into batches of 24 as opposed to l2 did not systematically raise nor lower
the accuracy and precision of variance components and composite indices in both rating plans.
Table 8 on page 59 shows the range, mean, and standard deviation of the ratio between the

standard errors of the variance components when using two levels of batch size (i.e..
SE(&'§)/ SE(&;°;) ), where 0",? and a“: represent the variance components or composites

obtained by using a minimum batch size of 24 and 12 respectively.

59

Table 8: The ratio of standard errors of indices obtained
using a batch size of 24 to those obtained using a batch size of
12 for the disconnected crossed rating plan

 

Std n conditions

person 0.790 0.996 1 .185 0.085 44

item 0.558 1 .036 1 .639 0.272 44

SE rater 0.748 1.030 1 .660 0.176 44
Ratios

for person-by-item 0.824 1 .007 1 .1 87 0.086 44

Variance

Components person-by-rater 0.806 1.018 1.281 0.095 44

item-by-rater 0.724 1.016 1.430 0.156 44

person-by-item-rater, error 0.836 1.037 1 .1 73 0.086 44

SE Generalizability Coefﬁcient 0.831 1.003 1.240 0.096 44
Ratios

for De d b'l' C ff ' t 0764 1020 1442 0146 44

Composite pen a My oe 1c1en . . . .

Indices

Misclassiﬂcation Error 0.631 1.043 1.603 0.247 44

 

 

 

Table 8 shows that the mean ratios of the standard errors across the 44 conditions for
each of the variance components were very close to one (i 0.05). This result indicated that the
average estimates obtained by packing essays into batches of 24 were as precise as those
obtained by packing essays into batches of 12. The range of ratios was from 0.56 to 1.66. Table 9
on page 61 shows the descriptive information for the ratio of the accuracy between the two levels

of batch size.

60

Table 9: The ratio of accuracy of indices obtained using a
batch size of 24 to those obtained using a batch size of 12 for
the disconnected crossed rating plan

 

Max Std n conditions
person 0.981 0.998 1 .032 0.009 44
item 0693 1 .053 1 .569 0.224 44
Accuracy rater 0.692 1.021 1 .432 0.171 44
Ratios
for person-by-item 0.985 1.001 1 .024 0.007 44
Variance
Components person-by-rater 0.866 1.006 1.168 0.055 44
item-by-rater 0.849 1 .007 1 .219 0.091 44
person-by-item-rater. error 0.988 0.999 1 .010 0.005 44
Generalizability Coefﬁcient 0.990 0.999 1.016 0.005 44
Accuracy
Ratios
for ' Dependability Coefﬁcient 0.978 0.998 1.023 0.010 44
Composrte
Indices
Misclassiﬁcation Error 0.886 1.007 1.119 0.053 44

 

 

 

The results for accuracy were similar to those for precision for all the variance

component estimates and the corresponding composite indices, providing no evidence that
ACC URAC Y (6'3) > AC C URAC Y (63,) for the disconnected crossed rating plan. Batch size also

did not make a difference for the connected mixture rating plan. Table 10 and Table l l on page

62 give the ratios of the standard error and accuracy for this rating plan.

6l

Table 10: The ratio of standard errors of indices

obtained using a batch size of 24 to those obtained

using a batch size of 12 for the connected mixture
rating plan

Min Mean Max Std
person 0.733 0.997 1.157 0.090
SE item 0.570 1.009 1.767 0.293
Ratios
for rater 0.757 1.019 1.358 0.122
Variance
Componems person—by-item 0.763 0.995 1.231 0.124
person-by-rater-nested- 0.638 0.947 1 .299 0.138
in-item. error
Generalizability Coefﬁcient 0.727 1.012 1.244 0111
SF.
Ratios
for Dependability Coefﬁcient 0.727 1.012 1.244 0.111
Composite
Indices
Misclassiﬁcation Error 0.575 1.017 1.727 0.272

 

 

 

Table 11: The ratio of accuracy of indices obtained using
a batch size of 24 to those obtained using a batch size of
12 for the connected mixture rating plan

 

person 0 968 1.002 1 024 0010
Accuracy item 0 608 1.045 1 793 0.285
Ratio
for rater-nested-in-item 0 863 1.016 1 230 0 082
Variance
». person-by-item 0 989 1.001 1.027 0 007
Components
person-byeitem-by-rater, error 0 992 1.000 1.015 0.005
Accuracy
- Generalizabuity Coefﬁcient 0 982 1.000 1 011 0 005
Ratio
for
- De ndab'lt Coeff cent 0 971 1.000 1,039 0 014
Composrte pe "" '
lndices
Misclassmcation Error 0 781 1.008 1 266 0 090

 

 

4.3) Accuracy of the variance components for two rating plans

The mean accuracy of the variance component estimates in the two rating plans was
high. For the disconnected crossed rating plan, the mean accuracy across the 88 conditions was

l00% i 0.8%. The mean accuracy for reliably scoring a measurement procedure to make nonn-

62

referenced decisions (using the generalizability coefﬁcient) was 99.8% and the mean accuracy
for making criterion-referenced decisions (using the dependability coefﬁcient) was 100.5% (see
Table 12). For the connected mixture rating plan, the mean accuracy of recovering the five
variance component estimates, shown in Table 13, was 100% i 0.5%. In addition, the mean
accuracy of the generalizability and dependability coefﬁcients were 99.9% and 100.6%,

respectively.

Table 12: Accuracy of the disconnected crossed rating

plan
Mr" Mean Mav Ste: :1 conditions
person 0.982 0.999 1.026 0 008 88
rtem 0 885 0 992 1 251 0 142 88
rater 0682 0993 1291 0126 88
ACCUYBCV Of person—by-Item 0 986 1 002 1 013 0.005 88
the \v'anancc
Components
person-by-rater 0 874 1 002 1.119 0.040 88
nemby4ater 0 811 1008 1431 0 079 88
person—by-rtemrater. error 0 991 1 001 1 011 0 004 88
88
Generalizability Goethe-em 0 989 0 998 1 005 0 003 88
Accuracy of
(‘omposue
Indices Dependability Coeﬂlaent 0 957 1 005 1 049 O 011 88
Misdassrﬁcahon Error 0 901 1 005 1 110 0 037 88

 

 

 

Table 13: Accuracy of the connected mixture rating

plan
Mm Mean Max Std n conditions
person 0984 1.001 1,017 0006 88
Item 0 679 1 005 1 374 0 169 88
Accuracy of
Variance rater item 0 825 O 995 1 098 0 053 88
Components
person-by-item 0 987 1 001 1 014 0 005 88

person-by-rater-nested-
within-item, error 0 993 1 001 1 009 0 003 88

Generalizability Coeffrcrent O 989 O 999 1 008 O 003

8

Accuracy of
Composite Dependability Coeffncrent 0 986 1 006 1 031 0 011
Indices

Misclassrficatnon Error 0 869 1 010 1 204 0 055

 

 

 

63

For the accuracy of variance components reported in Table 12 and Table 13, the
accuracy of the person and person-by-item effects were very similar between the two rating
plans, in terms of minimum, mean, maximum, and standard deviations. Although the accuracy of
the item effects in both rating plans had the largest standard deviations among all other effects,
the means of these two accuracy indices tended to converge to one (i.e., 100% accuracy). Such
convergence suggested that 100 replications was insufﬁcient to assess the accuracy (or
unbiasedness) of the item effect. Collapsing across the 88 conditions, however, increased the
number of replications to 88,000 and thus provided ample replications to evaluate the highly
variable item effect, which appeared to be accurately estimated by the subdividing method. Table
12 and Table 13 show that the mean accuracy for item effects were, respectively, 99.2% and

100.5% for the disconnected crossed and connected mixture rating plans.

4.4) Precision of the subdividing method and the effects of expanding rater pool sizes

In this section, 1 investigated the degree to which the expansion of rater pool sizes
inﬂuenced the precision of the facets that are related to the rater effects, namely the rater, item-
by-rater, and person-by-rater effect. The rater main effect is examined ﬁrst, followed by the
item-by-rater interaction effect. The person-by-rater interaction effect, which was hypothesized

to be inﬂuenced the least by the rater pool size, is discussed last.
Figure 18 on page 65 summarizes the values of SE (63) , corresponding to the

disconnected crossed rating plan for all the levels of sample size, rater pool, minimum batch size,
item eﬂect, and person-by-rater eﬂect. The precision of the rater effect varied only minimally

among the different levels of item effect, person-by-rater effect, and batch size. The major
variation due to the size of rater pool; the SE(c3-f) decreased considerably as the rater pool

expanded, holding the sample size constant. The percent decrease in standard error was lower

than one half of the percent increase in the rater pool size. The two horizontal lines, for high and

64

low person-by-rater effect conditions (03,, = 0.1 and 0.01), representing the theoretical standard

errors for 03 coincide at 0.0131. Even though the theoretical and empirical standard errors were

supposed to be the same or at least very close in the conditions where there was no missing data
(i.e., 2-rater-750-examinee conditions), the empirical standard errors appeared to be larger than
the theoretical standard errors. This ﬁnding was not surprising. Rather, it reﬂected the inaccuracy
of the theoretical standard errors based on a small number of levels in a facet. Chapter 5 on page

90 provides a thorough discussion.

 

Rater (r)

003 . , , , '7 ' 1 pr batch
..., 00200112

i 00200124

E . 0.020112

1 -4“ 0.02 0.1 24
+0.1100112

‘ 0.1100124
.--,“ 0.11 0.1 12

' . 0.110.124
‘_ -011pr=0.1(TSE)
‘. - -0.11pr=0.01(TSE) .

0 025

0.02

0 015 .
N018 '1' and 'p‘ refer 10 the

— _ —' ‘ magnitudesottheltemand
’ person-tw-rater population
parameters used to generate
_ the simulated data and match-

a
s
\ represents ‘nimmum batch suei
\ 'TSE' stands lor

“Theoretical Standard
0005 ,. . _
Error

standard Error

0.01

   

 

 

o . _ _ WW__ _-W--- .W_.____-___

R 2 4 8 4 8 14 8 14 28 14 28
P 750 1500 3000 6000

Size o1 Rater Pool (n_r) 8 Sample Size (n_p)

 

 

 

Figure 18: The reduction of standard error for the rater effect as a function of the size of
rater pool and sample size

Table 14 on page 66 displays the average 8135 and average reductions of standard error in
percentage terms. As the rater pool size increased from 2 to 4 (a 100% increase), the standard
error declined from 0.0224 to 0.0163 with a 27% reduction, averaged across item eﬂect, person-

by-rater eﬂect, and minimum batch size. While holding the rater pool size constant at any value,

65

increases in sample size did not lead to sizeable decreases in standard error. For instance, given a

pool of eight raters, the range of 5&0?) was 0.01 15 i 0.0005, despite sample size increases
from 750, to 1500, and to 3000.

Given that the standard error of the rater effect was 0.0060 when one assigned random
pairs of raters from a pool of 28 to score 6,000 examinees, how large should the rater pool be if
one wanted to maintain the standard error by assigning ALL raters to score the examinees?
Approximately a pool of 13 raters is needed. The parenthesized numbers in Table 14 show this
projection and projections for a variety of rater pool sizes for 750, 1500, 3000, and 6000

examinees.

Table 14: Average SEs and average reduction in empirical standard
error for the rater effect

 

 

 

 

 

 

 

 

A2 Size oﬁiaterPool l
“(0 r) w 2 f4, _ 28
S 27% 27% I14
:1 750 0.0224(2) 00163 (aloof—119 (2)
p 25% 25%
'
e 1500 0.0157 (2) 0.0117 (2; 0.0% (5)
18% 31%
s 513’ 5T1?
1 3000 0.0110(3) 0.0090(5) 0.0062 (12)
Z 270/0
9
6000 0.0083 (6)I 0.00630 (13)

 

 

 

The numbers in parentheses refer to the number of raters needed to maintain an equivalent
level of standard error when all raters in the rater pool score a task. The standard errors
reported are averaged across batch size. item effect. and person-by-rater effect. Equation
(36) in Appendix B was used to obtain the projections shown in parentheses. The standard
error to the leﬁ of the parenthesized numbers were substituted into the leﬂ band size of
Equation (36). The Generalized Reduced Gradient (GRG2) nonlinear optimization method
(developed by Leon Lasdon, University of Texas at Austin, and Allan Waren. Cleveland
State University) in the Microsoft Excel Solver 1995 was used to solve the equation for the
size of rater pool (n_r).

66

 

Figure 19 shows standard errors of the item-by-rater effect. The trends in the SEs
resemble those for the rater effect, indicating that expanding the size of a rater pool reduced the

standard error of the item-by-rater effect.

 

Item by Rater (tr)

1 pr batch
«- 00200112
00200124
_ 0020112
4-0020124
\ +01100112
-—+-.01100124
+0110112
- 0.110124
§‘—Oll(TSE)24 {

0015

Standard Error

0
O

 

0 005

 

 

 

 

N R 2 4 8 4 8 14 8 14 28 14 28
N_P 750 1500 3000 6000

Size of Rater P0018 Sample Size

 

 

 

Figure 19: The reduction trends of the standard error of the rater-by-item effect

Table 15 on page 68 shows that the percent reduction in standard error was less than
one half of the percent increase in the size of rater pool for the item-by-rater effect. The standard
error obtained by subsampling from 28 raters can be obtained by employing ﬁve raters without

sampling. This projection was also reported for sampling from 14, 8, 4, and 2 raters, respectively.

67

Table 15: Relationship between size of rater pool and reduction in standard
error of the item-by-rater effect as a function of sample size

Size of Rater Pool

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

02
“(9 irl 2 4 8 14 28
32% 25%
S 6:11
a 750 0.0145 (2) 0.0099 (2) I 0.0074 (2)
m 24% 25%
p Fin
('5 1500 0.0100(2) 0.0078(2) I 000537 (3)
20% 24%
8 5% é?
1 3000 0.0072 (2) 0.0058 (3) 0.0044 (5)
z 29%
e
8000 0.0055(3) I 0.0039 (5)
Note: The numbers in parathenses refer to the number of raters needed to maintain
an equivalent level of standard error when all raters in the rater pool score a task.

 

Figure 20 on page 69 shows values of the standard error of the person-by-rater effect.
The standard error decreased as the sample size increased, holding constant the size of the rater
pool. Expanding the rater pool did not increase precision for the person-by-rater effect. The two

series of trends clustered around the two reference lines for the two levels of theoretical standard
errors, indicating that the 5516?”) obtained by subsampling was comparable to that obtained by

employing two raters regardless of the size of the rater pool.

Table 16 on page 69 shows the percent reduction in the SEs as a function of the sample
size given a small person-by-rater effect ( 0%» =0.01). The percentages of standard error

reduction for all the conditions were less than the percent of increase in sample size and all

increments to sample size reduce the standard errors. This observation also applied for a large

person-by-rater effect ( 0' 2r =0.10).

68

 

 

0014 ,,

0012

001

Standard Error
o
8
00

E

0004

0002

 

N R
N_P

 

Person by Rater (pr)

1 pr batch I

00200112 I

00200124 I

0020112 I

--—l—-—0020124

+01100112

_ _._- 01100124
——.—0110112

.I _ .—
--7A45"5' <5\ - 0110124

— -0|lpr'—OIO(TSE)

ﬂ —<:- - - -Ollpr=001(TSF.)

 

2 4 8 4 8 14 8 14 28 14 28
750 1500 3000 6000

Size of Rater Pool & Sample Size

 

 

Figure 20: The standard error of the person-by-rater effect as a function of sample size

Table 16: SE and changes in standard error of the person-by-rater effect as sample size

 

 

 

 

 

increases
82 Size of Rater Pool
0( pr) 2 4 8 14 28
f s 750 0.0075 27% g 0.0074 30% @00079
10 i 1500 0.0053 0.0055 25., 0.0055
9 z 32% g .g
1 e 3000 0.0037 0.0041 0.00391
0 348g swag
8000 00027 0.0027

 

69

4.5) Precision of the subdividing method and the effects of increasing volume of examinees

Figure 21 and Figure 22 (p. 70) show the standard error of the object of measurement

2

(if, and the person-by-item-by-rater (plus systematic and non-systematic errors) effect (9),,” as a

function of sample size, rater pool size, and three other simulation parameters. For 6?, , the two

horizontal reference lines represent the theoretical standard errors, given two raters. The top and
bottom reference lines reﬂect standard errors for a large and a small person-by-rater effect. The
empirical standard errors clustered closely around the two theoretical referenced lines and the

standard errors exhibited a pattern similar to that was found for the person-by-rater effect. The

values of SE (6:) reduced by less than 50% as the sample size doubled. Expanding the size of
rater pool had inconsistent, thus ignorable effects on SE (6: ). The standard errors of
63m, resembled those of of, -- increase in the rater pool size had a marginal effect on the

reduction of standard error whereas increase in sample size had a signiﬁcant effect.

 

 

 

 

 

 

PencnIp)
004
0035 i/
,A\.
-.. .1..\ .-
003 ' . .- -
1‘2)’: I 1 pt batch
‘ I 0 0.0200112
0.025 K .. I ‘ 0.0200124
, . I - 0.020.112
is 7' 1‘7}. “" I-» 0020124
B 0.02 i e r .
g ;+0.11oo112
a 1’ 0.1100124
0.015 I .--,-- 0.110.112
, I — 0.110124
17%;“: ._ - .011 r=01 se i
001 D (T )
-..—.011pr=001(TSE)
0005
o. __ - - -W__ W W-..--W .-W--_W----_ - ...,-..-,”
"J1 2 4 8 4 s 14 11 14 211 14 211
up 750 1500 3000 0000
smotRaterPoollsampleSIaa

 

Figure 21: The standard error of the person effect as a function of sample
size and rater pool size

70

 

 

Standard Error

Person by Item by Rater (pine)

 

0012 - .-.- ......
‘\
001
I b
1 pr batch
0008 1- 00200112
00200124
_ 0020112
j +0020124
0006 v+01100112 I
‘ ;I~--+---o.1100124 I
‘ i‘—.—0110112 ‘

0004 R j - 0110124

 

 

N_R 248 4814 81428 1428
N_P 750 1500 3000 8000

Size of Rater Pool 81 Sample Size

 

 

Figure 22: The standard error of the person-by-item-by-rater effect as a
function of sample size and rater pool size

 

 

Standard Err-or

Person by Item (pi)

 

003 7
0025 I
1 pr batch
002 , 400200112
L; 00200124
I . 0020112

. . +0020124

0.015 ; +01100112
Wat .

3 4401100124

—-—0110112

00, W i - 0110.124

.—(TSE) I

0005 1

 

 

4 8 14 8 14 28 14 23
750 1500 3000 6000

Size of Rater Pool 81 Sample Size

2.2
1.13
N
.
a

 

Figure 23: The standard error of the person-by-item effect as a function of
sample size

71

 

Figure 23 on page 71 depicted the standard errors of the person-by-item effect and it

shows results coherent with those for the previous three variance components 6:}, , 63,, , and

[ﬁw . Speciﬁcally, the statistical property of consistency (the larger the sample size the smaller

the variability) holds true for the subdividing method for the disconnected crossed rating plan.

4.6) Findings on the disconnected crossed and the connected mixture rating plans

The complexity of the connected mixture rating plan causes scores to be allocated
unevenly to the three structural designs (crossed, MBIB, and nested). As predicted by the
decision rules discussed on page 28, the percentage of crossed and MBIB data subsets (e.g.,
percentage of the crossed data subsets = Equation (9) / Equation (12) *100%) diminished as the
size of rater pool expanded. Because fewer data points were allocated for these two structural
designs, the person-by—rater effect was less precise for a larger rater pool than it was given a
smaller pool of raters. Table 17 summarizes that the expansion of the rater pool decreased the

certainty of the person-by-rater effect.

Table 17: Increases in uncertainty of the person-by—rater effect in
the connected mixture rating plan

 

 

 

 

 

 

 

 

"2 100% I 75% 100%
6(0 pr)
4 8 14 28
49%
s
a 750 0.0228 I 0.0338
'2 54% 21%
L 1500 0.0182 0.0249 0.0 02
31% 47%
5 ®
1 3000 0.0168 0.0219 0.0323
2 44%
e
8000 0.0180 0.0230

 

 

 

 

72

It can be observed that all the increase in imprecision (or increase in SE3) in Table 17
(indicated by arrows) were less than approximately ﬁve percent in addition to half of the
percentage increase in the rater pool size. Expanding the rater pool size from four to eight raters
yielded a 100% increase in rater pool size and the increases in uncertainty for the person-by-
rater effect were 49% and 54%, respectively, for sample sizes 750 and 1500. The corresponding
increases in uncertainty became lower (21% and 31% for sample sizes 1,500 and 3,000) as the
rater pool expanded by a lower percentage, 75% as opposed to 100% (from eight to 14 raters).
The increases in uncertainty returned to the mid- and high- forty percent (47% and 44%) when
the rater pool expanded by 100%, for sample sizes 3,000 and 6,000. All of the above reductions
were less than 55% (0.5 * 100% + 5%).

Figure 24 depicts the phenomenon reported in Table 17 (p. 72) accompanied by the
theoretical standard errors. These theoretical standard errors predicted the person-by-rater effect
based on the same two raters scoring all the examinees on all items (i.e., completely balanced

situations with a crossed structural design).

 

Person by Rater (pr)

  
  

Standard error decreases as sample
size Increases. holdlng rater pool
size constant

  

0035
1 pr batch

0 00200112
00200124

0
O
U

. 0020112
‘ I 0020124
-4..01100112

. 01100124
‘_--0110112

_ 0110124
_ -011pr=001(TSE)
. . n011pr=010(TSE)

O
O
N
U‘
N
W.

Standard Error
0
o
N
‘ 72‘\

”AK“
“‘x.‘ e

.'
0 015 [It
I.
ll ‘
0 01

0005 —_—— ... -. ..-

 

4 ll 14 8 14 211 I4 214

N_I' 750 151111 11111) 1.11111)

sueotRabrPoolltMStae

 

 

 

Figure 24: The relationship between the improvement of the
person-by-rater effect and the expansion of rater pool size
using the connected mixture rating plan.

73

As observed in the above ﬁgure, the larger the rater pool, the farther was the empirical
standard error from the theoretical standard error. This indicates that one would become less
conﬁdent of the person-by-rater effect as more raters were employed, holding the number of
examinees constant.

The increase in sample size had an opposite effect than that of expanding the size of rater
pool — holding the size of rater pool constant, the larger the sample size, the higher the precision
(and the smaller the SEs). Such observations applied to all four levels of sample size. The trend
shows the increase in precision as the sample size increased from 750, 1500, to 3000, holding

constant the size of rater pool (the three boxes in Figure 24 on page 73 highlight the 8-rater pool

examples; the larger the sample size, the smaller the SEs). The means of SE(&3,,) were 0.0315,

0.0249, and 0.0144 for the three levels of sample sizes (750, 1500, and 3000). The minimum

reduction in SE(&3,,) , from one level of sample size to another, was over 25%.

Figure 25 on page 75 compares the degree of uncertainty of the person-by-rater effect in

the connected mixture rating plan to that in the disconnected crossed rating plan
A Z A 2 A 2 A 2
(SE(O'p,)-SE(O'P,.)), where 0'” and 0'”. represented the person-by-rater effects in the disconnected
crossed and connected mixture rating plans, respectively. The 0‘3 was estimated with a higher
pr

degree of precision in the disconnected crossed rating plan. On average across all the conditions

for the 750-examinee conditions, 0"2 was estimated with a precision 0.0127 higher than it was
I),

estimated in the connected crossed rating plan. For the 1500-, 3000-, and 6000- examinee

conditions, 6-2 was estimated with even higher precision: mean differences were 0.0170,
pr

0.0189, and 0.0161 , respectively.

74

 

 

Difference in Standard Error

0005

0005

—001

-0015

-0.025

-0 035

 

Person by item (pr)
The Difference of Standard Errors
Between the Disconnected Crossed and Connected Mixture Rating

8 14 28 14 28;

+ 1 pr batch
5 100200112
' 00200124
. 0020112
—---1e-- 0.02 0.1 24
+0.1100112
, I-4—01100124
i—.—o.110.112
' .. 0110124

   

SizedRaterPooll-Samplesm

 

Figure 25: The effect of employing two different rating plans on the precision of the

person-by-item effect

As was true for 6;, , ﬁve other variance components (all except the item effect)

manifested increasing imprecision as the size of rater pool expanded in the connected mixture

rating plan. This is likely due to the reduction of data (discussed in the Methodology section)

falling into the crossed and MBIB designs. Figure 26 on page 76 displays this effect for

SE(0‘-2

pine '

75

 

 

Person by Item by Rater (pir,e)

0045
004.

O 035

. 1 pr, batch 3

I - 00200112
I 0020,0124

. .. 0.020.112

. 1+0020124 I

+0.1100112

' .4-...0.1100124

I I—~o—0.110.112 I
1 - 0.110.124

- —(TSE)

003

 

0015

 

001

 

 

N R 2 4 8 4 8 14 8 14 28 14 28
NJ” 750 1500 3000 8000

Size of Rater Pool 81 Sample Size

 

 

 

Figure 26: The relationship between the improvement of the person-by-item-by-rater effect
and the expansion of the rater pool size using the connected mixture rating plan

in order to utilize all the available data in the connected mixture rating plan, four of the

seven variance components obtained in the previous section were summed to parallel the ﬁve
variance components in the nested design, namely 5-3, = 5-f+6-,2, and (3‘3”). = 6%,, -I- 01%,”, .

Figure 27 on page 77 presents the precision of d-f, obtained by the reconﬁguration. The

‘)

a"; became more stable as the rater pool expanded indicating that recruiting more raters while

using only a random pair to score an examinee can add to precision in estimating either one or

both of the following two measurement errors — (I) variability due to rater scoring examinees

differently, averaged across items, namely 01? ; and (2) variability due to rater scoring items

diﬁ'erentially different averaged across examinees, namely (3,2, .

76

 

Rater by Item (r:i)

0 025

O 02
1 pr batch

1- I 00200112
00200124
0020112

8
00" _+-0020124

+01100112
———e—-01100124
_.—_OIIOIIZ

- 0110124
— —00203(TSE)
. - -00202|(TSF.)

Standard Error

0
O

 

0005

 

 

N_R248 4814 81428 1428
N_P 750 1500 3000 8000

Size of Rater Pool 81 Sample Size

 

 

 

 

Figure 27 : The decrease in standard error as a function of rater pool size after utilizing all
the available data

The average standard error for the 750-examinees conditions (averaged across the item
effect, person-by-item effect, and batch size) decreased from 0.0205 to 0.0141, and went further
down to 0.0103 yielding a reduction trend of 3 1% and 27%. The average standard error for the
1500-examinee conditions showed a similar reduction rate with a 30% (0.0135 to 0.0093) decline
from sampling 2 of 4 raters to sampling 2 of 8 raters and with a 20% reduction from the 8-rater
conditions to the 14-rater conditions. For the 3000-examinee conditions, the decreases were 26%
(0.0086 to 0.0068) and 25% (0.0068 to 0.0051), respectively, for the 8-tol4-rater expansion and
the 14-to-28-rater expansion. Increasing from a pool of 14 raters to 28 raters reduced the average

standard error by 23% (0.0065 to 0.005) for a sample of 6000 examinees.

77

4.7) Precision of the subdividing method for item effects

Disconnected crossed rating plan — Figure 28 shows that there is no relationship
between the sample size, rater pool size and the standard error for the item effect, as
hypothesized in research question 4 (p. 30). This ﬁnding was expected because adding more
raters or increasing the volume of examinees should have very little inﬂuence in determining the
certainty of the variations in item difﬁculty. The standard errors ﬂuctuated slightly above the
reference lines (for the two item-effect parameter values), indicating that the simulated variance
components were more variable than were suggested by theory for two raters (Again note that the

theoretical values do not account for sampling from a rater pool).

 

       

ltem(i)
025
02
015
.. . 1 pr batch
. 00200124
15 II . 0020112
005 a g
a VA RN51 . ; Iw—x—Moozo 1 24
5 0 - . . . . . . . . , ,, , . It—e—01100124
m N_R 2 4 8 4 8 14 8 14 28 14 28 '—..—0110112
N_P 750 1500 3000 °°°° E. ..-,.0110124
-005 -I— -i=OII(TSE) I
' I- - -i=0,02(TSE) I
-01
-015

 

-02. ,.

Size of Rater Pool & Sample Size

 

 

 

Figure 28: The randomness of the standard errors for the item effect
(disconnected crossed rating plan)

78

42

Connected mixture rating plan — Figure 29 depicts the precision of a. which was

.,

not inﬂuenced by the size of rater pool and sample size. The series of 6,“ which were generated

with a population value 0.02 were recovered with a smaller degree of variation than those
generated with a population value 0.1 1. As expected, the two series were separated distinctly in
Figure 29 with the larger-variability series associated with larger and a more conservative
standard error than suggested by theory. The average empirical standard error was 0.159
comparing to the theoretical standard error 0.098 based on two raters. This difference was
expected (Brennan, 1992) because the theoretical standard error relied on asymptotic assumptions,

which were not viable when the number of levels in the item facet was small (i.e., 2).

 

 

 

   

 

 

 

Item (i)
'1
l
0.25 I
II 1 pr batch
~+40.0200112
02 1 0.020.0124
r '1 . « 0.020.112
E ," § ~——1u——0.020.124
11.1 ./ ,
0‘15 .7 7 +0.1100112
———+—0.110.0124
.., 1 0.110112
. r~-~0110.124
01 _ * 1
— -0020.21(TSE)
-—0.02 0.3 (TSE)
j — -0110.21(TSE)
005 I - - 0.1103(TSE)
I
0 . :
N_R 2411 41114 1114 211 14211
N P 750 1500 7101111 (111110
Size of Rater Pool 8 Sample Size

 

 

 

Figure 29: The randomness of the standard errors for the item effect
(connected mixture rating plan)

79

4.8) Accuracy and precision in making norm- and criterion- referenced decisions

Generalizability coefficient — Both the generalizability and dependability
coefﬁcients were estimated with high accuracy (ACCURACY( E p2) and ACCURACY(¢(71))

equaled 100% i 1%) averaged across the 88 conditions. Figure 30 on page 81 shows the mean
generalizability coefﬁcients as a function of size of rater pool and sample size for low item and
person-by-rater effects. The results resembled closely those for the high item and person-by-
rater effects and thus only the low item effects were reported. The estimated generalizability
coefﬁcients were accurately estimated (population value = 0.631 1) with sample sizes as small as
750 and they also retained properties as if they were estimated with complete data (i.e., the 95%
conﬁdence intervals became shorter as the sample size increased). Compared to the known
values of the generalizability coefﬁcients and their approximated theoretical conﬁdence
intervals, the empirical coefﬁcients and conﬁdence intervals were recovered within 0.003 of the
theoretical predictions. See Figure 31 on page 81 for the theoretical conﬁdence intervals, which
were obtained by the following steps: (1) compute the standard errors for each variance
component (see Appendix B); (2) apply a multiplying factor (Brennan, 1992) to those variance
components to ﬁnd the upper and lower bound for each variance component; and (3) uses

Equations (21) and (22) to compute the conﬁdence intervals for the coefﬁcients.

80

 

Emplrlcal Confidence Interval: (95%) for Generallzablllty Coeﬂlclenta
leen Low ltem- and Pereon-by-Rater Effects
(I - 0.02. pr - 0.01)

 

O7
068
g l 05317 06839 06000 A
U 066 , 06641 06718 066!)
E 06500 06576 .—--"
o 064 ‘ 08531 06474 06502
i r———0———. W -——.—-——-I .—-—0
g 062 ‘ 05270 06281 06271 00293 06310 06297 06304 06302 03295 06300 06290
3 I——-—-——.
g 05., .m ...,...
" 1 PM 05955
g 058 I, 05w 05885 05938
I
e .
0 056 4 Huh.
«'0 055" 05007 05500
E 054 3 +97 5th Percentile
% 2 , +Mean
W 05 i -*-251h Percentile
05 ;. -. _ , - _ __~__... _.z_.zHW_.___ ,- ., ____., 2,. . ., -..,
N_R 2 4 8 4 8 14 8 14 28 14 28
N P 750 1500 3000 6000

Slze of Rater Pool 8 Sample Slze

 

 

 

Figure 30: Empirical confidence intervals for generalizability coefficients

 

 

 

Theoretical Conﬁdence Interval: (95%) for Generallzablllty Coefﬁcient
Given Low Item. and Penon-byoRater Effecta
(l-0.02. pr 80.01)
0 7
g 0 68 a-—e—a
. 0 6824
'. l—‘—‘
E 0 65%
o 0 64 o 6481
E. r—.——i I-—.——-I -——.—c .-——I
= 0 62 0615 061mm 083K531 061331
a o—~———C
08131
._ 0 6 o—-——-o-—-—c
E o 5:55 ' " o 6007
2 o 53 .-_._.~_-4
6 0 5792
- 0 56
3
f 0 54 +97 5th Percentile
g 0 52 + Mean
+ 2.5111 Perentlle
05 L.._.-.. . ..._-.-. ..-,- V .,-+_ . . -. shad—...,. ...,., .._.. _.*-...-.,........--.. .. .
N_R 2 4 8 4 8 14 8 14 28 14 28
N_P 750 1500 3000 6000
Slze of Rater Pool 8 Sample Slze

 

 

Figure 31: Theoretical generalizability coefficients

Dependability coefficient — The subdividing method was able to detect item
variability for both high and low effects. The empirical dependability coefﬁcients were close to
the known parameter values and the corresponding conﬁdence intervals were all recovered
closely to the theoretical conﬁdence intervals given low item effects (the average difference

between the empirical and theoretical values was less than 0.0035). On average across the 88

81

conditions, the ACCURACY index of the dependability coefﬁcients was 1.0 suggesting that the
subdividing method provided unbiased estimates for the dependability coefficients (for any given
one of the 88 conditions, the ACCURACY was 1.00 i 0.03). For the high item effect, the lower
bounds (2.5‘h percentile) of the conﬁdence intervals deviated farther from the mean than did the
upper bounds (97.5th percentile) indicating that one should be more conﬁdent in the upper
bounds than the lower bounds of the dependability coefﬁcients. Figure 33 on page 83 depicts this
observation. Further investigations, discussed in the next paragraph, determined that the
negatively skewed distribution of the empirical dependability coefﬁcients reflected the unstable
nature of the item effects due to large differences in item difﬁculties.

The theoretical lower bound of the dependability coefﬁcient can be obtained by
replacing the relative standard error of measurement with the absolute standard error of
measurement in Equation (23). Holding sample size and the size of rater pool constant, the lower
bound of the dependability coefﬁcient became lower given any one or both of the following: (1)
the lower limit of the effects for item, rater, person-by-item, person-by—rater, item-by-rater, or
person-by-item-rater, error increases; and (2) the lower limit of the object of measurement
decreases. An ad hoc study concluded that the skewed 2.5‘h percentile of the dependability
coefﬁcients were caused by the highly variable item effect (at the high level =0.] l), which had a
maximum variance component of 1.57 and standard deviation 0.16 even though the mean of the
high item effects was 0.1 l. The 2.5th percentile of the dependability coefﬁcients was raised by an
average 14% when offsetting the estimated item effect variances in the high item effect
conditions to the maximum of the estimated variances in the low item effect conditions (0.35).
Notice that 0.35, which was at the 93th percentile on the high item and person-by-rater effect
condition, was chosen to examine to what extent the lower bound of the dependability coefﬁcient
would rise when the extreme item effects were restricted to a lower value. Figure 32 (p. 83)

shows the skewed distribution of the item effect.

82

 

Distribution of the Item Effect
(High Item and Person by Rater Conditions)

 

1400‘

1200

1000

800

600

b
O
0

Std. Dev = .16
Mean = .11
N = 2200.00

200

 

 

Frequency

 

Magnitude of the Item Variance Component

 

 

 

Figure 32: Distribution of the item variance components for
the disconnected crossed rating plan (averaged across batch
size, sample size, and rater pool size)

The boost of the 2.5th percentile of the dependability coefﬁcients was depicted in Figure
33. This ﬁnding that the dependability coefﬁcients became less skewed as the item effect became
small substantiated that the subdividing method was capable of detecting various degrees of

variability in generalizability studies.

 

 

 

 

 

 

 

 

 

Empirical Conﬁdence Interval: (95%) for Dependability Coefﬁcients
Given ﬂjgn Item and L2H Pemn-by-Rater Effects
(i=0.11.pr= 0.01)
07‘ ~ , ., ., 77777

‘ .
E 065 i 0656, 06303 06603 cm” "Th—‘05»: ow The boost of
g 06 ' ”36‘ “33° 06307 the lower

Haw—- ...M , ...,. bounds of the
U 0 55 05m 05716 05738 osmss 05735 05685 05734 0578? 05754 d d - -
E epen ability
3 0 5 coefﬁcients
5 ~ ./—° .2 _' ~ In
.8. 73,263 0‘0” 04420 0056 97.5 .
1: ° ‘ em“ 0;.» +97 5 m Percentile percentile of
E- 0 35 ”92‘ 03867\\. \. 0 323,?" ...Mean the item
I" f 03582 03600 0 3639 +2501 Perennle effect [0 0°35-

0 3 .M---.___-M_.. . . . . , .-. VMM M... . .M- _ .MM- -.. . -M
N_R 2 4 8 4 8 14 8 14 28 14 28
N_P 750 1500 3000 6000
Size of Rater Pool 8 Sample Slze

 

 

 

Figure 33: Dependability coefﬁcients estimated in the
disconnected crossed rating plan (high item effects)

83

Figure 34 shows that the conﬁdence intervals of the dependability coefﬁcients exhibited
a narrowing trend as sample size increase for the low item effect conditions. The mean estimates
of the dependability coefﬁcients appeared to be rather stable and close to the known value
(0.61 135). Comparing those high item effect conﬁdence intervals and estimates to low-item-
effect conditions (see Figure 34), the 2.5‘h percentile of the dependability coefﬁcients no longer

appears to be so skewed when item-effect variation is low.

 

Empirical Confidence intervals (95%)1’01' Dependability Coefﬁcients
Given Low ltem- and Person-by-Rater El'lecta

 

(1.0.02. pr won
07 .
A
1E 005 06696 05750 06672 W—‘OGW M..—H
. 06540 06551 0645‘ Demo 06476 a—a7
a 06 . . I I . ., 7-‘ l I .. 06%. 0Q1
06123 1 1
E 05032 06091 06100 06093 06117 06095 05122 06109 06 3 06122
O . .
0 05735 ° - 05753
2: 055 . , 05'5“ ' “0.5443 05606 05601 “685
= - . ' 05445
§ 05 0537405105 05341
1:
i
a 0‘5
5
E 04
o.
E +975mpercenﬂle‘
I“ 035 0 Mean
I 2581Perenele .
o3 . . . - .M-MMMMM --..M. M M .,
N_R 2 4 a 4 a 14 0 14 20 14 20
N P 750 1500 3000 8000

Size oi Rater Pool 8 Sample Size

 

 

 

Figure 34: Dependability coefficients estimated in the
disconnected crossed rating plan (low item effects)

Misclassification Error — Shown in Figure 35 were the estimates ofthe
misclassiﬁcation rates, with reference to the parameter values for the 88 conditions. They were
computed based on the absolute standard error of measurement. Accuracy across all conditions
was 100% i l 1% indicating that one would have made as few misclassiﬁcation errors in
unbalanced situations as one would in balanced situations. For instance, the error rate of
misclassifying a random examinee with a true score 3.4 by one or more step was 1.70% given
low item (0.02) and person-by-rater (0.1 1) effects in balanced situations (Table 5 on page 41)
whereas this error rate was 1.72% estimated by using the subdividing method for unbalanced

situations. Given high item and person-by-rater effects, there was a 3.68% average

84

misclassiﬁcation rate in both balanced (Table 5 on page 41) and unbalanced situations. Figure 36
(p. 86) shows the standard errors of the misclassiﬁcation rates and it indicates that none of the

standard errors of the 88 conditions exceeded 2.4%.

 

 

 

 

 

 

 

Mleclaasiﬂcatlon Rate: Mieclaeeitylng one or more steps for true score = 3.4
004 ~ . . »
\ / _ A. T T
T " , _ \ A ‘-
_/ V / \ g
0035 I“ h ‘
~ i i pr batch
: e 0.0200112
_ 1 002 0.0124
8 003 g ‘ ~ 0020.112
:16 g -»~-x—-—0.02 0.1 24
c . 3 +0.1100112
g ._ .. . _ 5 "901100124 ‘
é / ' \ ”\‘l ‘ : —-—0.11 0.1 12 ‘
‘31 0.025 . ' * ﬂ 1 ~--— 0.110124 1
I! ; —Pa'aneter Value=003682
g —Parameier vaue=002659
; g—Puuneter Vdue= 0.017
002
W" -“ I
0.015 M M .
N_R 2 4 8 4 e 14 8 14 28 14 28
N_P 750 1500 3000 6000
Size oi Rater Pool 8 Sample Size

 

 

 

Figure 35: Misclassiﬁction error obtained for the
disconnected crossed rating plan

85

 

 

Standard Error

003

0 025

002

0015

001

0005

 

Misclassification Rate: Misclassifying one or more steps
for true score = 3.4

1 pr batch
,. 00200112
00200124
’ - *. -. 0020112

. f +0020124
- S: l l-o—o1100112
; -+-0110.0124

~ 3 ——0110.112

- 0110124

   

 

$1.. .‘ e
.. ‘0 g \ M! 15‘! .
.. a
N R 2 4 8 4 8 14 8 14 28 14 28
N P 750 1500 3000 6000

Size of Rater Pool 15 Sample Size

 

Figure 36: Standard errors of the misclassiﬁcation rates for
the disconnected crossed rating plan

86

 

CHAPTER 5: CONCLUSIONS, DISCUSSIONS, AND FUTURE DIRECTIONS

5.1) Subdividing method and unbalanced situations in performance assessment

Scoring constructed response items is more time consuming and complex than scoring
multiple-choice items. Many educational and non-educational institutions adapt open-ended
questions in examinations for admission, certiﬁcation, graduation, accountability, and licensing
purposes. These examinations often are administered on a large-scale basis. Large volumes of
examinees are tested and yet only a short time (usually a few weeks) is available for scoring.
Many raters are recruited to score the examinations and it is infeasible to assign all the raters to
score every one of the examinees. Thus each examinee will be scored by a selection of raters,
leading to sparsely-ﬁlled data sets and also unbalanced designs, and also causing potentially
biased and imprecise estimators. The current dissertation developed and validated a method,
called the subdividing method, to resolve this problem. The subdividing method, drawing on the
concept that one could obtain more stable estimators by synthesizing multiple data sources than
using just one source, is set out to improve the accuracy and precision of estimates quantifying
measurement errors in the framework of G theory. The implementation of this method was
discussed in sections 3.1.

The estimates produced by the subdividing method were scrutinized in determining how
well the method worked in realistic scenarios (described in section 3.2 to 3.5) and these scenarios
included differences in: (1) volume of examinees, (2) size of rater pool, (3) variation of item
difﬁculty, (4) levels of rater inconsistency, (5) rules used to decide how to group raters and
assign tasks to raters, and (6) the minimum number of examinees scored by a group of raters.
Results in chapter 4 indicated that the subdividing method produced outcomes having properties
(unbaisedness and consistency) that are similar to those of complete data methods. Different
rules used for forming groups of raters changed the structural design of scores and thus

inﬂuenced the precision of measurement error estimation. Unlike precision, the accuracy of

87

estimating measurement errors was not as sensitive to the rules used for forming groups of raters.
Accuracy of the outcomes was very high (close to perfect). These ﬁnding substantiated that the
subdividing method produced unbiased outcomes with data missing completely at random. The
section that follows summarizes major ﬁndings regarding the precision of estimators.

Suggestions are provided for the set-up of scoring procedures.

5.2) Major findings and implications

Weighting improved the precision of variances involving the person effect when data
subsets varied in size. The standard errors of the weighted outcomes were closer to the
theoretical standard errors when weights were applied (see section 4.1). As was discussed in
section 3.1, data subsets varied in size in the connected mixture rating plan where there were
more batches than possible groups of raters. Weighting, however, had no effect on the precision
under the disconnected crossed rating plan, in which each group was composed of the same
number of raters and each group scored the same number of examinees.

1n large-scale performance assessments, examinees' work was packed in batches for
scoring. Section 4.2 provided evidence that a minimum of 12 tasks scored by the same group of
raters was sufﬁcient to ensure precise estimates for the measurement errors, and increasing the
minimum to 24 tasks did not tend to increase or lower the precision and the accuracy of
measurements.

Results in section 4.3 suggested that the variance components, generalizability
coefﬁcient, dependability coefﬁcient, and misclassiﬁcation error were recovered by the
subdividing method with high accuracy (100% i 1%) in all experimental conditions examined
for the two rating plans. Unlike the precision (inverse of standard errors) of variance components
and composites, accuracy is not inﬂuenced by the patterns and amounts of missing data. This

ﬁnding was consistent with the notion that when data are missing completely at random, one can

88

still obtain an unbiased expected value but the standard error will be larger than that for a data
set with no missing data (Little and Rubin, 1987).

In addition to the descriptive results reported in section 4.3, a multivariate regression
(Rencher, 1995) was used to provide an omnibus test for the accuracy of the variance
components. The results shown in Appendix H indicated that none of the predictors signiﬁcantly
deviated from zero (the ﬁve Wilks' As = 1.0 i 0.001 and the corresponding p-values > 0.530),
assuming normality. Table 18 (Appendix A) reports the value of the ﬁve individual Wilks' As.
The regression coefﬁcients and the coefﬁcients of determination (r-squares) shown in Table 19
indicate that the subdividing method has estimated all the variance components with high
accuracy (without biased) in all the experimental conditions. The intercepts were close to one,
suggesting that the mean accuracy was close to 100%; the coefﬁcients of the predictors were
close to zero, showing that in no conditions did the accuracy of the subdividing method differ
signiﬁcantly. Even though the multivariate regression shed light on the accuracy of the
subdividing method, this analysis provides only supplementary information to those analyses
reported in section 4.3. One should interpret the results of the regression analysis with caution
because the multivariate normality assumptions were not completely met by all the variance
components (e.g., the item component had a positively skewed distribution).

Frequently, the number of raters recruited to score examinees has an inverse relationship
to the time available for completing the scoring, holding the volume of examinees constant. The
section entitled Amounts and Patterns of Missing Data (p. 34) indicated that expanding the rater
pool required more raters to provide more stable estimates for the rater-related effects, but the
expansion of the rater pool induced more unobserved data. To what degree can the gain in
precision by using more raters compensate for the decrease in precision due to the increasing
amounts of unobserved data? Section 4.4 reported that doubling the rater pool increased the

precision of estimating the rater effect and the item-by-rater effect by at most 32%. This ﬁnding

89

was consistent with the expectation that more raters leads to higher precision of estimating the
rater effect, provided that the raters came from the same population. Similarly, the same
conclusions applied to the item-by-rater effect. Thus even though it is infeasible to have raters
completely crossed with examinees, it is still desirable to use more raters rather than fewer raters,

because this will result in characterizing the rater effects with higher conﬁdence.

Results in 4.4 revealed that the theoretical SE (013) was lower than the empirical

SE (CS-3) given a small rater pool (i.e., the two-rater condition). With the following rationales,

one can see that the theoretical method underestimated the precision of the rater and item-by-
rater facets for extremely small sample sizes, while the empirical method provided estimates
with the right degree of precision. First, one can rule out the argument that the empirical standard

errors were incorrectly estimated because the empirical standard errors resembled closely the

theoretical standard errors for the other effects, namely SE (5%,), SE (cs-i”), SE (6%,), and

SE (6'31") . Second, theoretical standard errors for statistics require asymptotic assumptions that

may be inaccurate (Smith, 1982). In the present case, we used two and four levels, respectively,
for the rater- and item-by-rater effects. For this reason, it is likely that the theoretical standard
errors were too small.

Given a desirable level of precision obtained by sampling from a pool of raters, how
many raters are needed to obtain the same level of precision for (33 without sampling? Table 14

in section 4.4 shows the answers to this question. Whether or not the theoretical SE was
underestimated for small numbers of raters, the numbers of raters needed to match the precision
obtained in the sampling situation would be fewer than predicted in Table 14 for the rater effect.
For instance, one may need to use all 13 raters at the most to score examinees in order to match
the level of precision of the rater effect yielded by sampling two raters from a pool of 28.

The size of rater pool had little inﬂuence on the precision of the person-by-rater effect (Figure

90

20, section 4.4) for the disconnected crossed rating plan. No matter how large the rater

pool, SE (6%”) tended to stay at the level equivalent to that obtained by using two raters. The

empirical standard errors clustered around the theoretical values of SE (032,) and the ﬂuctuation

was rather subtle. One explanation was that as the rater pool size expanded the percentage of
unobserved data also increased. On the one hand, expanding rater size pool gave more
information about the degree to which raters scored examinees differently. On the other hand,
expanding the rater pool size while keeping the same number of ratings for each task caused less
observed data to be allocated for estimating the person-by-rater effect. The tension between
these two factors (increasing rater pool size and increasing amounts of unobserved data) tended
to compensate for one another. This ﬁnding suggested that employing more raters did not tend to
lower or boost the precision of characterizing the person-by-rater effect, holding sample size

constant and assuming raters came from the same population.
The ﬁnding that SE (01%,) was inversely related to the volume of examinees suggested

that the subdividing method led to consistent estimators (section 4.5). Practically, this is a
desirable property to have for large-scale testing because it enables one to apply the G theory

framework to partition measurement errors with high conﬁdence as more examinees are

assessed. With 6000 examinees, SE (01%,) was in the range 0.002 to 0.005. With 750 examinees,

SEQ-2,) was in the neighborhood of 0.01 and 0.013.

Typically, scoring centers do not have much control over the volume of examinees —

less control than on the number of raters to be employed. A feasible way to improve the precision

of 0"?” is to employ rating plans that ensure plenty of data to be used for the estimation of
SE (6%,) . Although the disconnected crossed rating plan (section 4.6) examined in the current

dissertation allowed one to utilize all the data to estimate SE (01%,) , this rating plan was designed

91

to estimate the person-by—rater effect using non-overlapping groups of raters (i.e., groups of

raters were disconnected). With another rating plan that allocates data to examine both the
overlapping and non-overlapping rater groups, one would expect to improve SE (Ci-1;”) . Such a

rating plan could guarantee all the data subsets to exhibit the crossed design while allowing

raters to be on more than one scoring committee. This rating plan is constituted of three rules

listed as follows.

rater”... 1‘ rater'us (24)
ratert,.s : ratertﬁﬂ (25)
rater'.., i rater',,.,.1 (26)

The ﬁrst rule, Equation (24), ensures that no rater scores an examinee on the same item (i) twice
in a given data subset (s) of the crossed design. The second rule (25) indicates that all raters
participated in exactly two scoring groups, namely the sth and 3+ lth groups. The third rule (26)
speciﬁes that all raters worked with a different rater in the two groups they sit in. This rating plan
can be called the connected crossed rating plan. The shaded areas in the following ﬁgure show
the observed data and it can be seen that every group of raters has a common linking rater with
one other group. Future research is needed to compare this connected crossed rating plan with
those two examined in the current dissertation.

Level 1: Items; Level 2: Raters

1234567812345678

 

 

 

 

 

 

X 4““ i . . r“ it.‘
a '8\ \ \ R‘.‘
m
i
n . 'r
e "1
e +11 " ; 11111
\‘gn 1. 31 1. va
5 Note: A and B represent two items; "1" to "8" represent eight raters.

Figure 37: A hypothetical connected crossed rating plan

92

Unlike the results found for the disconnected crossed rating plan, the precision of the
person-by-rater effect decreased as the rater pool size expanded in the connected mixture rating
plan (section 4.6). The loss of precision was due to fewer data points being allocated to the
crossed and the MBIB structural designs. This results suggested a general principle for designing
rating plans — rating plans with loose structure and fewer guiding rules (e.g., the connected
mixture rating plan was less structured than the disconnected crossed rating plan) tend to form
data subsets with nested rather than crossed structural designs. For this reason, very few data sets
were available to estimate the rater effect separately from the item effects. The larger the rater
pool and the less structured the rating plan, the less precisely one can estimate the rater-related
effects. A recommendation based on this ﬁnding suggests that one should impose structural
designs at the data subset level and the structural designs chosen should be geared toward to the
effects of interest.

The results in section 4.8 showing that the dependability coefﬁcient had an asymmetric
95% conﬁdence interval with an extended tail towards the low end has implications for

developing and scoring performance based assessments. First, even though the item effect
accounted for only approximately 1 1% (of = 0.1 l) of the total score variation, its variance

component had a wide conﬁdence interval because only two items were used to estimate its large
population value. Empirically, the overall 95% Cl was between 0 and 0.627 with a mean standard
error 0.157, which was larger than the population value. Since the item effect component was
used in the denominator of the dependability coefﬁcient, it then dragged down the lower bound
of the dependability coefﬁcient. The upper bound of the dependability coefﬁcient was not
affected because the item facet has a positive skewed distribution implying that the lower bound
of the item facet was not as inﬂuential as the upper bound. For this reason, the wide conﬁdence
interval of the dependability coefﬁcient was contributed to largely by the item effect. The

subdividing method provides a means to characterize this skewed-dependability phenomenon in

93

unbalanced situations. Such a wide conﬁdence interval is expected in balanced situations as this

can be demonstrated by comparing the dependability coefﬁcients obtained by a large and a

small 0i in Equation (32) (p. 102). As was illustrated in section 4.8, reducing either SE(0,2) or

of itself could provide a more dependable measurement procedure for making criterion-

referenced decisions. In fact, many researchers (e.g., Bejar, 1993) have developed methods to
control for item variation and to increase the generalizability of tasks (e.g., Kane, Crooks, and

Cohen, 1999).

5.3) New applications of the subdividing method and future directions

(1) The current dissertation examined the statistical properties of recovered variance
components under a two faceted design, namely the person x item x rater design. The rater facet
was sampled, but the facet to be sampled could be the item or any one facet in a two faceted
design. The subdividing method can be used to examine the generalizability for a measurement
procedure where examinees respond to a complete set of questions (as opposed to a subsample
from that complete set of questions). Such a comparison can be accomplished by ﬁnding out the
precision of generalizability in unbalanced situations (e.g., via resampling or bootstrapping
methods) and then using that level of precision as a target for predicting how many raters and /or
items are needed to maintain the target level of precision in a balanced situation. Such
applications can be useful to reduce testing time while being able to evaluate the quality of the
testing procedures. To determine the potential results for this application where examinees
responded to only a selected set of items and were scored by all raters, one can simply swap the
subscripts between the item and rater effects in the results section of the current dissertation.
Having examinees to respond to only a set of items from a pool of items can reduce testing time.
Having all raters to score examinees can become feasible in the future as testing companies such

as the Educational Testing Service (ETS) are developing computer technology to use computer

94

programs (called electronic raters) to score with or to replace human raters (Personal
communication from Bejar, 1999 and Hombo, 1999).

Another application of the subdividing method is to evaluate any systematic
measurement procedures involving observations and ratings where data are unbalanced, such as
alternate assessments that might be used for special education students in place of traditional
assessments (Ysseldyke & Olsen, 1999). In addition, the subdividing method can be applied to
any large-scale assessments such as state assessments and the National Assessment of
Educational Progress (NAEP).

(11) The subdividing method developed in the current dissertation was examined using
two common rating plans. In operation, scoring sessions may not document explicitly the rating
plans used. In that case, it is researchers' hope rather than expectation to have the data collected
in the same way as data would be collected by the rating plans. An index of sparseness indicating
the pattern and the extent to which scores were unobserved (e.g., as compared to those in the
speciﬁed plans) may be useful. Such an index requires future development. Graphical displays
and research on matrix analysis (Alan & Liu, 1981) can give insight to this line of development.

(111) A follow-up study investigating the speciﬁc principles (p. 39) used in different
rating plans will be invaluable. The current dissertation examined the disconnected crossed and
connected mixture rating plans. These two plans differed in several ways (see Table 4 for
comparisons) and future studies should be conducted to isolate each of the underlying principles
distinguishing these and other rating plans. One such principle, namely the guiding structural
designs within a given group (the third principle in Table 4) can provide practical insight for test
developers. How does having the same set of raters score all the tasks from an examinee (crossed
design) as opposed to having different sets of raters to score different tasks (mixture designs or
nested designs) inﬂuence the quality of the scoring procedures? Patz (1999) spoke in favor of

using stratiﬁed designs when applying ltem Response Theory (lRT) models to analyze

95

performance-based data. He stated that stratiﬁed designs allow different sets of raters to score the
different tasks submitted by an examinee and thus the chance for an examinee to receive scores
from solely a set of extreme raters (either too lenient or harsh) will be lower than using a crossed
design. Kane, Crooks, and Cohen (1999, p.14) also suggested that "having each task evaluated by
a different set of scorers" can increase the number of raters evaluating each student and thus
helps to control any lack of consistency among raters. Although stratiﬁed or non—crossed
structures, like that suggested by Kane, Crooks, and Cohen (1999) and Patz (1999), have the
advantage to ensure that each student's task is scored by a larger number of raters than in a
crossed structure, they do not necessarily allow one to disentangle rater—related interaction
effects. So, one may not be able to evaluate as precisely the rater-related interactions (e.g., the
rater x item effect) as one can in a crossed structure. Since a line of research studying rating
plans and the statistical properties of reliability and dependability coefﬁcients is emerging, more
research should be conducted (Click and Picou, 1999; Patz, 1999; and Wilson, 1999). Comparing
the connected mixture rating plan studied in the current dissertation and the connected crossed
rating plan proposed on page 92 can shed light on methods to scoring performance-based
questions with high generalizability and dependability. One can ﬁnd more examples of different
rating plans in other areas of the measurement literature such as test equating (Kolen & Brennan,
1995).

(IV) Future research may examine factors in addition to those investigated in the current
study such as the degree to which data are not missing completely at random. In practice, some
raters (e.g. more experienced) may score more responses than the others. In this scenario, data
may not be missing completely at random because some raters have more unobserved data while
others have less. Put differently, missing data were related to the experience of the raters. Little
and Rubin (1987) called this data Missing At Random (MAR). In applying the subdividing

method to the MAR scenario, one can weight the data subset by the experience of the raters

96

(which can be operationalized as the number of responses scored) in the Synthesizing stage.
Future studies are needed in evaluating the subdividing method in such scenario.

(V) The current dissertation applied a multivariate regression to predict the accuracy of
the variance components using the experimental factors controlled in the simulation as
predictors. The inferential results should be treated as tentative, as the skewed distributions of
the variance components did not satisfy completely the multivariate normal assumptions. The
low parameter values used for the item and rater effects caused these variance components to
have a distribution with negative values, which prevented one from using logarithmic
transformations to normalize the variance components (e.g., transformation suggested by Kalaian
& Becker, 1996 and Raudenbush, 1988). Exploring transformations with known statistical
properties for negative variance components deserves much attention; this line of research will
be invaluable for the measurement community because it can provide correct inferential
conclusions about measurement procedures.

(Vl) Though G theory is popular for disentangling multiple sources of variation in
scores, it does so via the expected variation for each facet. It does not, however, identify
individual elements contributing to the variations (e.g., G theory does not indicate which
individual rater is particularly more lenient or severe as compared to the other raters). A
thorough diagnosis utilizing methods such as cluster analysis and meta-analysis goodness of ﬁt
tests can examine individual elements more closely and diagnose problems in the measurement

procedures more carefully.

5.4) Suggestions to test developers and educational values

Suggestions to test developers. The results of the current dissertation inform discussions
of scoring procedures for performance assessment. Is there a particular scoring arrangement that
can yield more accurate and stable estimates for measurement errors than other arrangements?

The current dissertation showed that the disconnected crossed produced more precise estimates

97

for the person-by—rater effect than the connected mixture rating plan. Also, given the same sized
rater pool, how many examinees must be scored by the same group of raters in order to provide
precise evaluations of measurement error? A minimum of 12 examinees scored by the same
group of raters was sufﬁcient to ensure precise estimates for the measurement errors. Increasing
that minimum to 24 examinees did not tend to increase or lower the precision of measurement
errors. This ﬁnding suggests to test developers that bundling tasks is not a real concern as far as
measurement errors go. Given the resources and time, test developers should, instead, consider
seriously what rating plan to use in order to reduce measurement errors and to obtain an accurate
and precise portrait of those errors. Rating plans should be chosen prior to starting a scoring
session to structure and randomize the data collection procedures (scoring procedures).

When conducting generalizability analyses, test developers should apply weights for
combining the measurement errors estimated from each data subset. At the best, weighting will
increase the precision of characterizing the quality of a measurement procedure and in no
situation will it lower the precision. However, one does not necessarily need to apply weights in
using the subdividing method for generalizability analyses provided that the data subset sizes are
equal. Data subsets have equal sample sizes when one employs the disconnected crossed rating
plan. Section 3.1 of the current dissertation provides decision rules and formulae to determine the
need for weighting with the connected mixture rating plan. By and large, it is more likely that one
needs to apply weights when a small pool of raters (e.g, four raters) scores a large volume of
examinees (e.g., 1500) than when a larger pool of raters (e.g., eight raters) scores the same

volume of examinees.

The results of the current dissertation suggested that the use of only a few items varying
much in mean difﬁculty was a major source of variance, lowering the dependability of a
measurement procedure and thus leading to unreliable criterion-referenced decisions. A well

thought-out rating plan can help one conﬁdently determine more rater-related measurement

98

errors but it does not help more conﬁdently determine the difference in mean item difﬁculty (i.e.,
item effect). Increasing the rater pool or sample size did not affect the estimation of the item
effect in unbalanced situations when the subdividing method was employed. Although
administering more items to examinees can reduce the item effect and increase its associated
conﬁdence interval, this may not be a feasible resolution because adding more performance-
based items to a test will increase testing time and costs to the education system and it will also
burden the students. Increasing the homogeneity of test items is an alternative to improve the
dependability of a measurement procedure. This can be achieved by writing items similar in
difﬁculty. A second alternative is to shorten the length of performance-based tasks so that more
tasks could be administered in a limited testing time. A third alternative is to "increase the
correlation among task scores by avoiding tasks that require esoteric information or that involve
some unique format" (Kane, Crooks, and Cohen, 1999, p. 14).

Test developers frequently have to report scores in a short time. For instance, the
Mathematics and Sciences tests of the 1996 National Assessment of Educational Progress
(NAEP) employed 675 raters to score 8,985,583 constructed responses in 12 1/2 weeks (Authors,
1996). The sooner the test developers need to complete scoring an examination, the more raters
they need to recruit. It was shown that the subdividing method can detect the rater and item-by-
rater effects more precisely as the size of rater pool increases, holding everything else constant.
If test developers recruit more raters and obtained a considerably larger rater effect than they
obtained before increasing the rater pool, they can be certain that the mean scores assigned by the
additional raters are more variable than those assigned by the original pool of raters (i.e., the new
raters may be more lenient or harsh than the original raters). Likewise, any large increase in the
item-by—rater effect due to expanding the rater pool size indicates that the additional raters
exhibited a higher degree of inconsistency in scoring items differentially than did the original

group of raters.

99

Testing agencies should be mindful of choosing a rating plan at the same time they
consider increasing the size of a rater pool. The chosen rating plan influences the precision of
quantifying measurement errors and thus the generalizability and the dependability of a
measurement procedure. If test developers use the connected mixture rating plan and decide to
recruit more raters to score the same number of examinees, test developers ought to anticipate
that they will obtain less stable estimates for the rater, item-by-rater, person-by-rater, and
person-by-item-by-rater effects than they would with fewer raters. The reduction of precision
occurs because, as the rater pool size expands, fewer data are allocated to the crossed and MBIB
designs for estimating those effects separately from one another. The data are instead allocated to
the nested design, which does not estimate the all the effects in the crossed and MBIB designs.
Alternatively, if test developers need to use the connected mixture rating plan for logistic
reasons, they may consider converting estimates from the crossed and MBIB designs to match the
estimates from the nested design in order to utilize all the data for obtaining precise estimates.
The rater-nested-in-item effect, in the nested design, becomes an upper bound for either the rater
or the item-by-rater effects in the crossed and MBIB designs. By the same token, the person-by-
rater-nested—in-item effect becomes the upper bound for the person-by-rater or the person-by-
item-by-rater effects in the crossed and MBIB designs. lf test developers provide extensive
training and monitoring to the raters with anticipation that both the rater-nested—in-item and
person-by-rater-nested—in-item effects will be low in magnitude, the connected mixture rating
plan can be used. This is because the expansion in rater pool will increase the precision of the
measurement error estimates, as the connected mixture rating plan allocates all data to estimate
the effects in a nested design.

With computer technology, the implementation of different rating plans becomes easier.
Examinees’ constructed responses (e.g., essays) can be scanned into digital format and raters can

score these responses on-line so that they can focus on scoring rather than paper routing.

100

Computer technology enables test developers to have full control to structure scoring sessions
and this enables them to implement a desirable rating plan prior to using the subdividing method
to analyze unbalanced data. The subdividing method always requires less computational power
than other methods that analyze the entire sparsely-ﬁlled data set. For example, instead of
analyzing a data set of 6000 examinees and 28 raters all at once, one can parse this unbalanced
data into subsets, analyze each subset, and then synthesize the results from the subsets. The
subdividing method enhances the scoring procedures for rating constructed-response items and it
serves as a means to prepare performance assessments to be reliably used in large-scale settings.

Educational values. Given the proliferation of performance assessments, many states and

 

school districts have already implemented this type of assessment on a regular basis. Well-
developed scoring rubrics can be useful, provided that raters implement them consistently and
accurately. Large-scale performance-based assessments can be used for accountability purposes
only if methods are developed to evaluate the quality of the scoring procedures (Mehrens,l992).
Training raters to consistently apply the scoring criteria described in the rubrics designed for
large-scale performance assessments, has many instructional beneﬁts. When raters (mostly
school teachers) return to their classrooms, they will be accustomed to using those criteria.

Improving the quality of performance assessment so that it can be used for high-stakes
decisions also can help align assessments with curriculum and instruction (Pearson, 1998). Many
researchers such as Bracey (1989) reported that schools and teachers were less likely to include
materials in their classrooms if the materials would not be tested in high-stakes examinations.
Developing methods to monitor and improve the quality of performance assessment could reduce
the tensions in using this state-of-the-art assessment for classroom instruction and for high-stakes
decisions. Students will be the beneﬁciaries of this development, which was examined in the

current dissertation.

101

Appendix A:
Equations for scores and coefficients in generalizability theory

(Adapted from Brennan, 1992)

Xp" = 11 (grand mean)
+ (11p - 11) (person effect)
+ (u, - u) (item effect)
+ (it: - it) (rater effect)
+ (1111- ‘ 1111 ' 141+ 11) (person-by-item interaction)
+ (up, - up - it, + u) (person-by-rater interaction)
+ (u.-r - u,- p,+ u) (rater x item interaction)

+ (xp..- 1111“ 1111- 11.1 + 111+ 111+ 11- 11) (residual)

2 _ 2 2 2 2 2 2 2
G (Xpir) _ 6p + 61 + Gr + Gpi + Gpr + Git + Opine

m

2 2 2
Relative Error Variance = 0': = 9%". + 9'1' + 0:9":
n I n f n | n r

Absolute Error Variance = 0': — 4— + —- + —— + -—4 + — +

 

 

Generalizability Coefﬁcient = p2 = 20" 2
Op + 66
. . . __ _ o;
Dependability Coefﬁcnent - (D — 2 2
Op + GA

Standard Error of Measurement = SEM 310:

102

(27)

(23)

(29)

(30)

(31)

(32)

(33)

Appendix B:

Standard error for variance components in a two facet crossed design2

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

.2 .2 .2 2 . .2 2
0(_2)_2 (op,,.+ n,‘a,.,+n,’0',,.+*n,"op) +2 (din..+n,"a,,,)
UP — 1 2 2 2
(np’rIl‘nf’nr ((np-I)*(n,-I)+2)*n, In, (34)
l
-2 .2 2 "
+ 2 (OIIlr,l'+ni‘0/Ir) +2 (U;)I’.l)2 2
((n,1-I)"(n,*Il‘t-7-)"'n,2"'nr2 ((np —I)" (n —I)"' lnr-I)+2)'n. ‘n n,
.7 A ‘7 - 2 . .
C(62) : ,(U;itr,t'+np.olzr+"r.0;n+."p‘"r.o'lzl +7 (0%,,r_c+nr'(7%n)2
I (n,+Il‘n,,2‘n,2 ((np—Il’ln,—Il+3151,1251,2 (35)
I
. .1 2 .1 2 _
,, (031”, + "1).Urr) + 1 (O. 7H,,t') 2
((nl-I)‘(nr—I)+2I.np2.nr2 ((nP—I)‘(nl-I).(”r_II+2I‘n ‘."r,
. . . .7 2 . .
.2 (0%)”,1'+"p.012r+nl.02r+*n "1:07) (Uztrx-I-nI‘IO'ZrIZ
0(O'r): 2 2P 7 P + P P 2 2
(nr+I)‘m 'np‘ ((np—I)‘(n,-I)+2)’n, ’np
' (36)
-2 .2 2 2 —
(O’pﬂ‘e‘tnp‘O'u-I +2 (Uplrt) 2
.,
((n,—I)‘(n,-—I)+Zl’n,2’np2 unp—Irtn,—I)*<n,—I)+21:11.21",-
1
0(62 ): (Uptrlc'I'CIr Up1)2(0'ptr,c)2 2 (37)
pl
((n,,-I)"(n,-I)+2)"n,2 ((np—I)’(n1—I)(nr‘l)+2) nr2
1
(“(31%"): 2 qulii"! éﬂ’f £13101); 2 (38)
((np-IWnr—I)+2)"n,2 ((np-I)(11,-I)(nr‘I)+2)n,2
l
.2 .2 2 .2 2 2
0012): Ml‘ﬂvr1tii"ri_04:) M M_ Mfg/’IAELMM. (39)
If
((np-Ii’lnr—IHZIWPZ ((np—Ifilm-I)(nr-Ili'ZVnP2
I
2 7
(OI/Hr; U) ___ h (40)

.2
0(0 Ire): W
” ‘ ("p—I) (n,—I)(n,—I)+2

 

2 , . . . . . .
Equations for the standard errors were derived by substituting variance components. sample Sizes. and mean squares
into the general formula reported in Brennen (1992, p. 101. 6.2.1).

103

Appendix C:

Computation of misclassification rate for conjunctive decision rules

Misclassiﬁcation Rate (Probability of Downward Misclassiﬁcations as a Function ofthe SEM):

= l - Correct Classiﬁcation Rate

= l - {P(passing item 1 | SEM)XP(passing item 2 | SEM)X XP(passing item # i | SEM)}

= l - {[l - P(failing item 1 [SEM)] X[l - P(failing item 2 | SEM)] X... X[l - P(failing item i |
SEM)]

Example (Cronbach et al., 1997, p. 381):

Question: Assuming a hypothetical examinee had universe scores of (2.5, 2.5, 2.5), (3.5, 3.5,

3.5) and the absolute standard error of measurement (SEM) of a measurement procedure is 0.7,

what is the chance that the examinee had one or more scores less than 1.5?

Answer:
= |-[ {1-P(z < -(2.5-I .5)/.7)}3 x {I-P(Z < -(3.5-|.5)/-7)}3l

= .24

~ 25%

104

Appendix D:
Illustration of an out-of-range sample correlation based on different data sets for sample

covariance and variances

 

 

 

 

 

 

 

 

 

 

 

 

Case Variable A Variable B i
1 { I I '
2 ' " 2 __, 2
"‘3 __ V5 _ _ ‘ if
#3:— i — 5 ‘* _ _; J;
S p . 5
6 10 10
Note: Ga and ob are based on 5 and 4 data points,
respectively. cab is based on only 3 cases that are italicized.
The correlation rab {cab/(ogob): 1.14.

 

 

 

 

105

Appendix E:

Correlations based on missing data

 

Y1 12341234----
Y2 1234---- 1234
Y3 ---- 12344321

 

Note: r12=1, rl3=1, r23=-l. Source:
Little & Rubin (1987, p.43).

106

Appendix F:
The structure of a modified balanced incomplete block design
The M, a variation of the balanced incomplete block design (BIB) (e.g.,
Montgomery, 1997, p.208; Searle et al., 1992, p5), is a balanced design because every examinee
receives two ratings on both essays. The incompleteness comes into play due to the fact that one
reader grades the essay twice and yet the two other different raters grade the essay only once.
The M design is specially designed for situations differing from the BIB design that the
Mﬂlj design has an additional factor to the BIB design. As can be seen in Figure 38 there are
two levels in the item factors (items 1 and 2). As can be observed in Figure 38 a data subset with
the M_BI§ design using raters A, B, and C, with A being the rater assign scores to both items, can

have n M. examinees. The ﬁrst subscript 9 indicates that the sample size 3 refers to the number

of examinees. The second subscript 3 indicates that the data set exhibits a MBIB design and the

third subscript 1 indicates that the data set is the ﬁrst set in the MBIB design. The total number of

cases that can be analyzed by the MBIB model is n M = Em“. The entire sparsely ﬁlled data

3:1

1

has a total number of cases of N = X f:
—l

[:i}

"In/.5 '
l

107

Figure 38: Hypothetical data subsets for the modified balanced incomplete block

 

(MBIB) design
Item 1 Item 2
Rater Rater Rater Rater Rater Rater
A B C A B C

Examinee l XMLLA XMJJB X3,1,1,2,A x3.l.l.2.c
Examinee 2 X3,3,1,._A x3,2,l,l.C X33123 X3,2.I,2.C
Examinee 3 X3,3.1,1,B X3.3,1.1,c X3.3,r.2,A X3.3.l.2.B
Examinee 4 X3_4,1,|,A X3.4.1.1.B X3,4,r,2,3 X3,4|,2,c
Examinee 5 X3.5,|‘|'A X3,5,1,1,c X3,5,1,2,A X3,5.l,2.B
Examinee 6 X3‘6,1_1,B X3,o,l,l,c X3,6,l,2,A x3.6.l.2.C
Examinee m X3.S..np‘2.,,1,A X3‘5an~2..,|‘3 X3.s:,np,2,.,2.a X3,si.np,2,..2,c

 

 

 

 

Note: The Xs indicate scores assigned to an essay. The subscripts indicate the location of
the score, where location is deﬁned as the structural design (ﬁrst subscript), data subset
(second subscript), examinee (third subscript), item (fourth subscript), and rater (ﬁfth
subscript) for the corresponding score.

108

Appendix G:

A mathematical model to determine the size of a rater pool
To ﬁgure the number of raters needed for the entire scoring procedure, we ﬁrst develop a
simple mathematical model. Equation (41) gives a generic model where the number of raters
needed is determined by two quantities, namely the total number of ratings in an administration
of a test (denoted Total workload in minutes) and the total amount of time, in minutes, that an

average rater can put in during the entire scoring procedure (denoted Total work time per rater).
Number_of_raters_needea’ = T otalgworkload_in_minutes / T otal_work_time jerfater (41)

As can be observed, more raters are needed as the total workload increases while holding the
total work time per rater constant. By the same token, fewer raters are needed as the total
workload decreases.

We now deﬁne more speciﬁcally Total workload in minutes and Total work time per
rater in terms of other practical constraints such as the available time for generating and
reporting the scores. By deﬁning those two quantities as a function of other practical constraints,
we will be able to decide how many raters are needed given the available resources (e.g., how
much time do we have until the scores must be analyzed and reported?) The deﬁnitions of those
two quantities can be modiﬁed to accommodate constraints of individual scoring centers. In
addition, the values of those two quantities can vary from one administration to the others within
an individual scoring center, depending upon policy, needs, and available resources. In the
examples that follow, we deﬁne the total workload and total work time in terms of constraints

common to essay scoring, according to anecdotal reports. Thus,

Tota1_workload_in_minutes = n_examinees X n_essays_answered_by_an_examinee X

n_ratings_per_essay X reading_time_per_essay_in_minutes. and (42)

109

Total_w0rk_time_per_rater = n_t0tal_sc0ring_days X n_w0rk_hours_per_rater_per_day X

60_minutes. (43)

By substituting values into Equations (42) and (43) into (41), we reach the following

symbolic equation.

 

"p . n: 0 ”We . te
Number of raters needed = , (44)
n .1 . nh . 60
where n], = number of examinees,
n, = number of items (essays) responded to by an examinee,
nm, 2 number of ratings on each essay,
I, = average time needed to score an essay by a rater (in minute),

n .1 = number of days available to complete scoring, and
nh = average work hours per day by a rater.

For instance, let us assume 6,000 examinees took a test and each examinee responded to
two SOO-word writing prompts. Two different raters score each writing prompt. We further
assume that it takes an average of 10 minutes for a reading of an essay. Substituting this
information into Equation (44), we need 14 raters to complete the scoring in 40 workdays of 7.5

hours ((6000*2*2*10)/(40*7.5*60) = 13.3).

110

Appendix H:

A multivariate regression model predicting the accuracy of variance components

Table l8:Wilks' Lamda for predicting accuracy of variance components

Multivariate Tests (Wilks' Lamda)

 

lEffect Value Hypothesis df Error df p-values

Intercept 0.006 204499.176 7 8788 0.000

N_P 1.000 0.440 7 8788 0.877

N_R 1.000 0.543 7 8788 0.802

VAR_I 0.999 0.869 7 8788 0.530 ‘
VAR_PR 1.000 0.439 7 8788 0.878

MIN_BAT 1 .000 0.602 7 8788 0.755

 

 

Note: N_P refers to the sample size of examinees (levels sampled 750, 1500, 3000. and 6000)
N_R refers to the size of rater pools (level tested sampled 2. 4, 8, 14, and 28)
VAR_I refers to the magnitude of the item effect (levels sampled included 0.02 and 0.11)
VAR_R refers to the magnitude of the person by rater effect (levels sampled included 0.01 and 0.1)
MIN_BAT refers to the minimum batch size imposed to data subsets (levels sampled included 12 and 24)

 

WIT:

 

 

 

111

Table 19: Regression models for the accuracy of the variance components in the
disconnected crossed rating plan

 

 

 

 

 

 

 

 

 

Dependent Variable (Accuracy) Predictors Coefﬁcients Std, Error
Person ' Intercept 1 .00 0.003 364.652 0.000
N_P 0.00 0.000 0.953 0.341
N_R 0.00 0.000 -0.291 0.771
VAR_I -0.01 0.015 -0.747 0.455
VAR_PR 0.01 0.015 0.616 0.538
MIN_BAT 0.00 0.000 -1.591 0.112
Item " Intercept 1 .00 0.064 15.739 0.000
N_P 0.00 0.000 -1.100 0.271
N_R 0.00 0.003 -0.977 0.329
VAR_I 0.09 0.356 0.264 0.792
VAR_PR -0.08 0.356 -0.214 0.831
MIN_BA T 0.00 0.003 1.027 0.305
Rater” Intercept 0.96 0.053 17.921 0.000
N_P 0.00 0.000 0.633 0.527
N_R 0.00 0.002 0.002 0.999
VAR_I 0.24 0.298 0.791 0.429
VAR_PR -0.10 0.298 -0.319 0.750
MIN_BA T 0.00 0.002 0.221 0.825
Person by Item‘ Intercept 1.00 0.002 472.431 0.000
N_P 0.00 0.000 -0.306 0.760
N_R 0.00 0.000 -1.134 0.257
VAR_I -0.01 0.012 -1.023 0.306
VAR_PR -0.02 0.012 -1.273 0.203
MIN_BAT 0.00 0.000 0.990 0.322
Person by Rater“ Intercept 1.00 0.016 61.034 0.000
N_P 0.00 0.000 -0.845 0.398
N_R 0.00 0.001 0.322 0.748
VAR_I 0.00 0.092 -0.030 0.976
VAR_PR -0.07 0.092 -0.718 0.473
MIN_BA T 0.00 0.001 0.563 0.573
Item by Rater” Intercept 0.99 0.034 28.937 0.000
N_P 0.00 0.000 -0.106 0.916
N_R 0.00 0.001 -0.033 0.973
VAR_I 0.20 0.192 1.027 0.304
VAR_PR -0.04 0.192 —0.201 0.840
MIN_BAT 0.00 0.001 0.227 0.820
Person by Item by Rater“ Intercept 1.00 0.002 656.061 0.000
N_P 0.00 0.000 0.555 0.579
N_R 0.00 0.000 0.860 0.390
VAR_I -0.01 0.009 -0.589 0.556
VAR_PR 0.00 0.009 -0.222 0.824
MIN rBAT 0.00 0.000 -0.715 0.475
Note: N_P refers to the sample size of examinees (levels sampled 750. 1500, 3000, and 6000)
N_R refers to the size of rater pools (level tested sampled 2. 4, 8, 14, and 28)
VAR_I refers to the magnitude of the item effect (levels sampled included 0.02 and 0.11)
VAR_R refers to the magnitude of the person by rater effect (levels sampled included 0.01 and 0.1)
MIN_BAT refers to the minimum batch size imposed to data subsets (levels sampled included 12 and 24)
‘: R Squared = .001 (Adjusted R Squared = .000)
”2 R Squared = .000 (Adjusted R Squared = .000)

 

 

 

112

Computer program: Codes for data simulation analysis in SPSS

Appendix I:

*Yiiiiiiiiiiifitii’fiiitiifiif*fi‘kfi‘kfﬁfi*iiiitiiiiiii‘kitffii‘ki’fi*ﬁ**f*ﬁ**t****

* Section A: Generate Full Data Sets with No Missing Data

*ttiiiiiiitiiiii‘k‘ki‘ki‘kﬁ*‘kiifi‘kﬂitiitiii'iifﬁitﬁftii’i9*iti'iiii'ti****************

define LOlFDMOl ( n_p !charend('l')
/n_i !charend('l')
/n_r !charend('|')
/n_pi !charend('l')
/n_pr !charend('l')
/n_ir !charend('l')
/n_ib !charend('|')
/trial !charend('|')
/var_p !charend('|')
/var_i !charend('l')
/var_r !charend('|')
/Var_pi !charend('l')
/var_pr !charend('l')
/var_ir !charend('l')
/var_pir !charend('l')
/ratepan !charend('l')
/dir — !charend('l')
/FN = !charend('|')).

set mxmemory = 124000 workspace = 512.
show workspace mxmemory.
set format=f8.2.

input program.

loop p_id=l to !n_p.
loop i_id=l to !n_i.
loop r_id=l to !n_r.
compute ID=$CASENUM
leave p_id.

leave i_id.

leave r_id.

end case.

end loop.

end loop.

end loop.

end file.

end input program.
execute.

**** Compute Ids.
save outfile = !quote(!concat(
!dir,!fn,'_',!ratepan,'_i',!var_i,' pr',!var pr,‘ ib',
trial,'.sav')) _ _ —
/keep id p_id i_id r_id
/compressed.

**** p (Person facet).

input program.

loop p_id = l to !n_p. /* !n p.
compute p_score=rv.normal(0,sqrt(!var p)). /* !var p.
end case. _ —
end loop.

end file.

end input program.

save outfile = 'C:\temp\junk p.5av'.

execute. —

113

!n_ib,'_',!n_p,'_

,!n_r,'_

I .

**** i (Item facet).

input program.

loop i_id = l to !n_i.

compute i_score=rv.normal(O,sgrt(!var_il).
end case.

end loop.

end file.

end input program.

save outfile = ’c:\temp\junk_i.sav'.
execute.

**** r (Rater facet).

input program.

loop r_id = l to !n_r.

compute r_score=rv.normal(0,5grt(!var_r)).
end case.

end loop.

end file.

end input program.

save outfile = 'c:\temp\junk_r.sav'.
execute.

**** pi (Person by Rater facet).

input program.

loop p_id = l to !n_p.
leave p_id.

loop i_id = 1 to !n_i.

compute pi_score=rv.normal(O,sgrt(!var_pi)).

end case.

end loop.

end loop.

end file.

end input program.

save outfile = 'c:\temp\junk_pi.sav'.
execute.

**** pr (Person by Rater facet).

input program.

loop p_id = l to !n_p.
leave p_id.

loop r_id = l to !n_r.

compute pr_score=rv.normal(0,5qrt(!var_pr)).

end case.

end loop.

end loop.

end file.

end input program.

save outfile = 'c:\temp\junk_pr.sav'.
execute.

**** ir (Item by Rater facet).

input program.

loop i_id = 1 to !n_i.
leave i_id.

loop r_id = l to !n_r.

compute ir_score=rv.normal(O,sqrt(!var_ir)).

end case.

end loop.

end loop.

end file.

end input program.

save outfile = 'c:\temp\junk ir.sav'.
execute. _

/* !n_i.
/* !var_i.

/* !n_r.
/* !var_r.

/* !n_i.
/* !var_pi.

/* !n_p.

/* !n_r.
/* !var_pr.

/* !n_i.

/* !n_r.
/* !var_ir.

**** pir (Person by Item by Rater Plus Residuals facet).

114

input program.

loop p_id = l to !n_p. /* !n_p.
leave p_id.

loop i_id = l to !n‘i. /* !n_i.
leave i_id.

loop r_id = 1 to !n_r. /* !n_r.
compute id = Scasenum.

compute pir_scor=rv.normal(0,sgrt(!var_pir)). /* !var_pir.
end case.

end loop.

end loop.

end loop.

end file.

end input program.

save outfile = 'c:\temp\junk_pir.sav'.

execute.

!enddefine.

ii*******t******+*itiii'i'tit*fittftiiii'fi'tfi'it‘kti*iifﬁt‘kiiiirfiii*‘kﬁiiir'k'tiitiiff‘ki'k‘ki

it

* Section 8: Create Missing Data on the Full Data Sets

tittiiiitiiii‘kiiﬁititit*iiiiiiiitiffitii‘kiiiii‘kiiii*i'iii‘hiiii'kiiiiﬁ‘k‘kfiiiiii'iiiiii‘k

** Section 81: Create missing data for the 'Disconnected Crossed Design', also
named Rating Plan #1.

define LOlMDROl (n_p = !charend('l')
/n_r = !charend('l')
/n_pi = !charend('l’)
/n_px2 = !charend(‘l’)
/n_ib = !charend('l')
/n_bszx4 = !charend('l')
/var_i = !charend('l')
/var_pr = !charend('l')
/ratepan = !charend('l')).

* Create missing data pattern exhibiting a Disconnected Crossed Rating Plan.

* create a file to randomly assign raters to batches without replacement. Call this
file 'Filel'.

* need to change the following as a stand-alone macro.

* Rater file.

input program.

loop rater_id = l to !nﬁr.

compute ranordOl = rv.uniform(0,l).
end case.

end loop.

end file.

end input program.

sort case by ranordOl.

if (Scasenum =1) reading=l.
if (Scasenum =2) reading=2.

if (missing (reading)) reading = laglreading,2).
vector read (2).
compute read(reading) = rater_id.

if ($casenum=1 or $casenum=2) subsetid=1.
if (missing(subsetid)) subsetid=lag(subsetid,2)+1.
AGGREGATE
/OUTFILE= 'c:\temp\junk_r_m.sav'
/presorted
/BREAK=subsetid
/read1 'who read this batch?’ = MEAN(readl) /read2 'who read this batch?‘ =
MEAN(read2).
execute.

* person file.

input program.
loop p_i_id = l to !n_pi.

115

compute rand_ord = rv.uniform(0,l).

end case.

end loop.

end file.

end input program.

do if (Scasenum = 2).

recode rand_ord (else = sysmis).

end if.

do if ($casenum > 2 and modl$casenum,2) = O).
compute rand_ord = lag(rand_ord,2).

end if.

execute.

if (missing(rand_ord)) rand_ord = lag(rand_ord).
execute.

if (Scasenum = l or Scasenum = 2) p_id = l. /* p_id = person id.
if (missing(p_id)) p_id = lag(p_id,2) + 1.
sort case by rand_ord.

if (Scasenum <=!n_bszx4) subsetid = 1.
if (missing(subsetid)) subsetid = lag(subsetid,!n_bszx4) +1.
execute.

Save outfile ='c:\temp\junk_p_m.sav'.
* match person and rater files.

match files file='c:\temp\junk_p_m.sav'

/file = 'c:\temp\junk_r_m.sav’

/by subsetid.

if (missing(read1)) readl = lag(read1).
if (missing(read2)) read2 = lag(read2).
execute.

* convert 'readl' and read2' into casel and case2.

compute casel = (p_i_id —1)* !n_r + readl.
compute case2 = (p_i_id —1)* !n_r + read2.
compute select = l.

execute.

vector x=casel to case2.
loop j=l to 2.
compute id = x(j).

xsave outfile = 'c:\temp\junk.sav'
/keep id select subsetid readl read2.

end loop.

execute.

get file='c:\temp\junk.sav'.
sort case by id.

save outfile='c:\temp\junk.sav'.
execute.

MEANS
TABLES=read1 read2 BY subsetid
/CELLS MEAN COUNT STDDEV.

!enddefine.

* Section 82: Create missing data for the 'Connected Crossed Design' or the
'Mixture Design'.

define LOlMDMOl (n_hf_rt = !charend(‘l')
/n_ib = !charend('l’)
/n_p = !charend('l')
/n_r = !charend('l')
/ratepan = !charend('l')
/var_i = !charend('l')
/var_pr = !charend('l')).

input program.
loop 1: 1 to !n_hf_rt.

116

end case.

end loop.

end file.

end input program.

* Control the batch size.

loop j = l to !n_hf_rt by !n_ib.

* where !n_ib = !n_i * batch_size = e.g., 24 = 2*12.

do if $casenum = j.

compute casel = trunc(rv.uniform(Scasenum*!n_r—(!n_r—1),$casenum*!n_r+l)).
loop.
compute case2 = trunc(rv.uniform($casenum*!n_r—(!n_r-1),Scasenum*!n_r+1)).

end loop if (casel <> case2).
compute select=1.
end if.

do if $casenum = j+1.

do if (!ratepan = 2). /* Rating Plan #2: Connected Crossed
Design.

if missing(casel) casel = lag(case1)+!n_r.

if missing(case2) case2 = lag(case2)+!n_r.

end if.

do if (!ratepan = 3). /* Rating Plan #3: Mixture Design.

compute casel = trunc(rv.uniform($casenum*!n_r-(!n_r-l),$casenum*!n_r+1)).
loop.

compute case2 = trunc(rv.uniform(Scasenum*!n_r—(!n_r—1),$casenum*!n_r+1)).

end loop if (casel <> case2).
compute select=l.

end if.

end if.

end loop.

if (missing(case1)) casel = lag(casel,2) + 2 * !n_r.
if (missing(case2)) case2 = lag(case2,2) + 2 * !n_r.
if (missing(select)) select = lag(select).

execute.

vector x=case1 to case2.

loop j=l to 2.

compute id = x(j).

xsave outfile = 'c:\temp\junk.sav'
/keep id select.

end loop.

execute.

get file='c:\temp\junk.sav'.

sort case by id.

save outfile='c:\temp\junk.sav'.

execute.

!enddefine.

* Section 83: Match the rater selection file with the full data matrix (with only
Ids) generated in section 01.

(define LOlMDJOl (n_p = !charend('l')
/n_r = !charend('l')
/n_ib = !charend('l')
/var_i = !charend('l')
/var_pr = !charend('l')
/ratepan = !charend('l')
/trial = !charend('l')
/dir = !charend('l')
/fnf = !charend('l')
/fnm = !charend('l‘)).

117

 

it

get
file=!QUOTE(!CONCAT(!dir,!fnf,'_',!ratepan,'_i',!var_i,'_pr',!var_pr,'_ib',!n_ib,'_
',!n_p,'_',!n_r,'_',!trial,'.SAV') ).

* a 11/9/98: Need to separate the above section (Section BB) with the following
section

* as two independent macros.

* Merge the full data set with the file containing the ID variable indicating which
case to select.

* 83a) Merge the full data matrix IDs with the file containing selected rater Ids.

match files file=!QUOTE(!CONCAT(
!dir,!fnf,'_',!ratepan,‘_i',!var_i,'_pr',!var_pr,‘_ib‘,!n_ib,'_',!n_p,'_',!n_r,'_',
!trial,'.SAV') )

/file='c:\temp\junk.sav'

/by id.
execute.
save
outfile=lQUOTE<!CONCAT(!dir,!fnm,'_',!ratepan,'_i',lvar_i,'_pr',!var_pr,'_ib',!n_ib
,'_',!n_p,'~',!n_r,'_',!trial,'.SAV') ).
execute.

select if (~missing(select)).
save outfile = 'c:\temp\junk_id.sav'.
exe.

* 83b) Merge the selected IDs with the sample data from each facet.

get file = 'c:\temp\junk_id.sav'.
sort case by p_id.
match files file = *
/file = 'c:\temp\junk_p.sav'
/by p_id.
if missing(p_score) p_score =lag(p_score).

sort case by i_id.
match files file = *
/file = 'c:\temp\junk_i.sav'
/by i_id.
if missing(i_score) i_score =lag(i_score).

sort case by r_id.

match files file = *
/file = 'c:\temp\junk_r.sav'
/by r_id.

select if (~missing(p_id) and ~missing(i_id)).
if missing(r_score) r_score =lag(r_score).

sort case by p_id i_id.
match files file 2 *
/file = 'c:\temp\junk_pi.sav'
/by p_id i_id.
if missing(pi_score) pi_score =lag(pi_score).

sort case by p_id r_id.
match files file = *
/file = 'c:\temp\junk_pr.sav'
/by p_id r_id.
if missing(pr_score) pr_score =lag(pr_score).
select if (~missing(i_id)).

sort case by 1 id r id.
match files file = 7
/file = 'c:\temp\junk_ir.sav'
/by i_id r_id.
if missing(ir_score) ir_score =lag(ir_score).
select if (~missing(p_id)).

sort case by p_id i_id r_id.

118

match files file = *

/file = 'c:\temp\junk_pir.sav'

/by p_id i_id r_id.

if missing(pir_scor) pir_scor =lag(pir_scor).
select if (~missing(p_score1).

compute ttlscore = 3.5 + p_score + i_score + r_score + pi_score + pr_score +
ir_score + pir_scor.

save outfile = 'c:\temp\junk_mer.sav'.

exe.

* 5) Expand the missing data set so that it would have an ID for both missing and
nonmissing data.

match files file =
!QUOTE(gcoNCAT(!dir,!fnfr'_'r!ratepanr'_i'l!var—i"-pr"!var—pr"_ib"!n—ib"_"!n—
Pr' ',!n r,‘ ',!trial,'-SAV') I

/file = 'c:\temp\junk_mer.sav'

/by id.
exe.
save
outfile=lQUOTE(!CONCAT(!dir,!fnm,'_',!ratepan,'_i',!var_i,'_pr',!var_pr,'_ib‘,!n_ib
,'_',!n_p,'_',!n_r,'_',!trial,'.SAV') ).
execute.

!enddefine.

*****************‘§*****************************************************************
*

* Section C: Applying the Parsing Method
***f***********ii'k‘ki'fir*i‘ﬁ‘ki***********i***‘i’i’i‘*i'i’*‘k'k‘k*****************************‘k'i'

*

it _____________________________________________________________________

* Section C
* Stage 1: Modeling = Parsing

* _____________________________________________________________________

* C1: Subsetting data into small subsets.

define LOISDMOI (n_p = !charend('l’)
/n~r = !charend('l')
/n_ib = !charend('l')
/ratepan = !charend('l‘)
/var_i = !charend('l')
/var_pr = !charend('l')
/trial = !charend('l')
/itm_ind = !charend('l')
/r_lbl = !charend('l')
/dir = !charend('l')
/fnm = !charend('l')
/fns = !charend('l‘)).
*get
file=!QUOTE(!CONCAT(!dir,!fnm,'_',!ratepan,'_i',!var_i,'_pr‘,!var_pr,'_ib',!n_ib,'_
',!n_p,'_',!n_r,'w',!trial,'.SAV') ).

*execute.

vector r (!n_r).

compute r(r_id)=ttlscore.

recode r1 to !concat("r",!n_r) (sysmis:O) (else=1).
*missing values r1 to r4 (0).

!do !r=l !to !n_r.

do if (i_id=2 and !concat("r",!r)=1).
recode !concat("r",!r) (1:2) (else=copy).
end if.

!doend.

execute.

AGGREGATE

/OUTFILB='C:\Temp\junk.sav'
/BREAK=p_id

119

/rl to !concat(”r",!n_r) = MEAN(r1 to !concat("r",!n_r)).

get file='C:\Temp\junk.sav'.
sort case by r1 to !concat("r",!n_r) (d).
*lconcat("r",!n_r).

compute samefi1e=O.

do repeat rater=rl to !concat("r",!n_r).
if (lag(rater)=rater) samefi1e=samefile+l.
leave samefile.

end repeat.

do if ($casenum=l).
compute file_id=1.
end if.
execute.

do if ($casenum>l and samefile=!n_r).
compute file_id=lag(file_id).

else if ($casenum>l and samefile<!n_r).
compute file_id=lag(file_id)+l.

end if.

execute.

******* Create a second file identification variable 'file_id2'

data subset belonging to the same type *******

recode rl to !concat(‘r',!n_r) (O = 0) (else = 1) into r_l to !concat('r_',!n_r).
)

sort case by r_l to !concat('r_',!n_r

compute samefilZ = 0.
do repeat rater = r_l to !concat('r_',!n_r).
if (lag(rater)=rater) samefil2 = samefil2 + 1.

leave samefil2.
end repeat.

do if ($casenum=1).
compute file_id2=1.
end if.
execute.

do if ($casenum>1 and samefil2=!n_r).
compute file_id2=lag(file_id2).

else if ($casenum>1 and samef112<!n_r).
compute file_id2 = lag(file_id2)+1.

end if.

execute.

*‘k‘k‘ki‘k'k

*rename

missing
compute

variable labels design

end block ****

\* just added 12/14/98.

variables (file_id2 = subsetid).
values r1 to !concat("r",!n_r) (0).
design=nvalid<r1 to !concat("r",!n_r)).

value labels design 2 ’Crossed' 3 'Mixed'
sort case by p_id.

missing

values r1 to !concat("r",!n_r) ().

'types of design in which the case will be analyzed'.

4 'Nested'.

save outfile='C:\Temp\junk1.sav'.
execute.

* Take out the '*' in the next line if not running production mode.

* frequencies variables = design.

match files file=lQUOTE(!CONCAT(

!dir,!fnm,' ',!ratepan,'_i',!var_i,'_pr',!var_pr,'_ib',!n_ib,'_',!n_p,‘

!trial,'.SA§'))
/file='C:\Temp\junkl.sav‘
/by p_id.

120

_',!n_r,

to indicate MBIB

I

do repeat var=rl to design.
if missing(var) var=lag(var).
end repeat.

execute.

sort case by file_id design (d).

split file by file_id design.

* Take out the '*' in the following line if not running production mode.
* descriptive variables=file_id file_id2.

split file off.

execute.

string r_a_1 to !concat("r_a_",!n_r) (a4).
execute.

!let !RAT_IND=!n_r.

!DO !VAR_ITM= l !to !ITM_IND.

!Do !VAR_RAT = l !to !RAT_IND.

compute #i = !VAR_ITM/!RAT_IND/2.

if ((!concat('r',!VAR_RAT) = #i) and !VAR_ITM= 1) !concat('r_a_',!VAR_RAT) =
!Quote(!concat('S',!VAR_RAT,'a')).

if ((!concat('r',!VAR_RAT) = #i) and !VAR_ITM= 2) !concat('r_a_',!VAR_RAT) =
!Quote(!concat('S',!VAR_RAT,'b')).

if ((!concat('r',!VAR_RAT) = #i) and !VAR_ITM= 3) !concat('r_a_',!VAR_RAT) =
!Quote(!concat('D’,!VAR_RAT)).

!DoEnd.

!DoEnd.

string Com_r_lb !concat('(a',!r_lbl,')').

Variable labels Com_r_lb ’Rater Identification'.

vector aa = r_a_1 to !concat('r_a_',!n_r).

loop #i=l to !n_r.

compute Com_r_lb = concat(rtrim(ltrim(Com_r_lb)),rtrim(ltrim(aa(#i)))).
end loop.

execute.

AUTORECODE
VARIABLES: Com_r_lb /INTO Com_r
/PRINT.

FILTER OFF.

USE ALL.

SELECT IF (select=l).

save

outfile=lQUOTE(!CONCAT(!dir,!fns,'_',!ratepan,'_i',!var_i,'_pr',!var_pr,‘ ib',!n ib
,'_',!n_p,'_',!n_r,'_',!trial,'.SAV')) _ —
/drop = r1 to !concat("r",!n_r) r_l to !concat("r_",!n_r) r_a l to
!concat("r_a_",!n_r). _

execute.

do if $casenum > 1.
select if (file_id2 <> lag(file_id2)).

end if.

Sort case by file_id2.

save outfile = 'c:\temp\junk_raterid_c&m.sav'
/keep file_id2 Com_r_lb com_r.

execute.

get file =

!QUOTE(!CONCAT(!dir,!fns,'_',!ratepan,'_i',!var_i,'_pr',!var_pr,'_ib',!n_ib,'_',!n_
p,'_',!n_r,'_',!trial,'.SAV')).
do if $casenum > 1.

select if ((file_id <> lag(file_id)) and design = 4).

end if.

Sort case by file_id.

save outfile = 'c:\temp\junk_raterid_n.sav'
/keep file_id Com_r_lb com_r.

execute.

!enddefine.

121

‘k _____________________________________________________________________

* Section C

* Stage 2: Estimating (Variance components for subsets of data)
* _____________________________________________________________________
define LOIVCOI (n_p = !charend('l')

/n_r = !charend('l')

/n_ib = !charend('l')

/var_i = !charend('l')

/var_pr = !charend('l')

/ratepan = !charend('l')

/trial = !charend(’l')

/dir = !charend('l')

/fns = !charend('l')

/fnv = !charend('l‘)

/fnd = !charend('l')).
*get file.
get file =

!QUOTE(!CONCAT(!dir,!fns,'_',!ratepan,'
p,'_',!n_r,'_',!trial,'.SAV')).
Execute.

AGGREGATE
/OUTFILE='C:\Temp\junk_f_l.sav'
/BREAK=file_id2
/design = MAX(design)
/N_perset=N.

AGGREGATE
/OUTFILE='C:\Temp\junk_f_2.sav'
/BREAK=file_id
/design = MAX(design)
/N_perset=N.

USE ALL.
COMPUTE filter_$=(design = 2 or design =3).

VARIABLE LABEL filter_$ 'design = 2 or 3 (FILTER)'.
VALUE LABELS filter_$ 0 'Not Selected' 1 'Selected'.
FORMAT filter_$ (fl.O).
FILTER BY filter_$.
EXECUTE
Sort case by file_id2.
Split file by file_id2.
VARCOMP
ttlscore BY p_id i_id r_id
/RANDOM = p_id i_id r_id
/OUTFILE = VAREST ('c:\temp\junkl.sav')
/METHOD = MINQUE (0)
/DESIGN = p_id i_id r_id p_id*i_id p_id*r_id
/INTERCEPT = INCLUDE
*VARCOMP
ttlscore BY p_id i_id r_id
/RANDOM = p_id i_id r_id
/OUTFILE = VAREST ('C:\temp\junkl.sav')
/METHOD = REML
/CRITERIA = ITERATE(50)
/CRITERIA = CONVERGE(1.0E-8)
/DESIGN = p_id i_id r_id p_id*i_id p_id*r_id
/INTERCEPT = INCLUDE
Split file off.
USE ALL.
COMPUTE filter_$=(design = 4).
VARIABLE LABEL filter_$ 'design = 4 (FILTER)'.

122

_i',!var_i,'_pr',!var_pr,'_ib',

i_id*r_id

i_id*r_id

!n_ib,'_',!n_

VALUE LABELS filter‘S 0 'Not Selected' 1 ’Selected'.
FORMAT filter_$ (£1.01.

FILTER BY filter_$.

EXECUTE

Sort case by file_id.
Split file by file_id.

VARCOMP
ttlscore BY p_id i_id r_id
/RANDOM = p_id i_id r_id
/OUTFILE = VAREST ('c:\temp\junk2.sav')
/METHOD = MINQUE (0)
/DESIGN = p id i_id r_id(i_id) p_id*i_id

/INTERCEPT = INCLUDE

*VARCOMP
ttlscore BY p_id i_id r_id
/RANDOM = p_id i_id r_id
/OUTFILE = VAREST ('c:\temp\junk2.sav')
/METHOD = REML
/CRITERIA = ITERATE(50)
/CRITERIA = CONVERGE(1.0E-8)
/DESIGN = p_id i_id r_id(i_id) p_id*i_id
/INTERCEPT = INCLUDE

Split file off.
filter off.

get file ='C:\Temp\junk_f_1.sav'.
sort case by file_id2.

select if (design = 2 or design =3).
save outfile =‘C:\Temp\junk_f_l.sav'.

get file ='C:\Temp\junk_f_2.sav'.
sort case by file_id.

select if (design = 4).

save outfile ='C:\Temp\junk_f_2.sav'.

get file ='C:\Temp\junkl.sav'.
sort case by file_id2.
save outfile ='C:\Temp\junkl.sav'.

get file ='C:\Temp\junk2.sav'.
sort case by file_id.
save outfile ='C:\Temp\junk2.sav'.

match files file ='C:\Temp\junk1.sav'

/file = 'C:\Temp\junk_f_l.sav'

/file = 'C:\temp\junk_raterid_c&m.sav'

/by file_id2.
select if (nvalid(vcl,vc2,vc3,vc4,vc5,vc6,vc7)=7).
save outfile =
!QUOTE(!CONCAT(!dir,!fnv,'_',!ratepan,‘_i',!var_i,'_pr',!var_pr,'_ib',!n_ib,'_',!n_
p,’_',!n_r,'_',!tria1,'_','c&m.sav'))

*sort case by design.

*split file by design.

*descriptive variables = vcl vc2 vc3 vc4 vc5 vc6 vc7.
*Frequencies variables com_r.

11

match files file ='C:\Temp\junk2.sav'
/file = 'C:\Temp\junk_f_2.sav'
/file = 'c:\temp\junk_raterid_n.sav'
/by file_id.
select if (nvalid(vcl,vc2,vc3,vc4,vc5)=5).
save outfile =
!QUOTE(!CONCAT(!dir,!fnv,'_',!ratepan,'_i',!var_i,'_pr',!var_pr,'_ib',!n_ib,'_',!n_
p,'_',!n_r,'_',!trial,'_','n.SAV')).

*sort case by design.

123

*split file by design.
*descriptive variables
*Frequencies variables
*execute.

vcl vc2 vc3 vc4 vc5.
com_r.

Add Files File =
!QUOTE(!CONCAT(!dir,!fnd,'_',!ratepan,'_i',!var_i,'_pr',!var_pr,'_ib',!n_ib,'_',!n

p,'_',!n_r,'c&m.SAV'))

/Fi1e =
!QUOTE(!CONCAT(!dir,!fnv,'_',!ratepan,'_i',!var_i,'_pr',!var_pr,'_ib',!n_ib,'_’,!n_
p,'_',!n_r,'_',!trial,'_','c&m.sav'))

/In = !concat(‘from',!trial).
if (!concat(‘from',!tria1)=l) trial=ltrial.

save outfile =
!QUOTE(!CONCAT(!dir,!fnd,'_',!ratepan,‘_i',!var_i,'_pr',!var_pr,'_ib’,!n_ib,'_',!n
p,'_',!n_r,'c&m.SAV'))

/drop = !concat(‘from',!trial).

erase file =
!QUOTE(!CONCAT(!dir,!fnv,'_',!ratepan,'_i',!var_i,'_pr',!var_pr,’ ib',!n_ib,'_',!n
p,'_',!n_r,'_',!trial,'_','c&m.sav')).

Add Files File =
!QUOTE(!CONCAT(!dir,!fnd,'_',!ratepan,'_i',!var_i,'_pr',!var_pr,'_ib',!n_ib,'_',!n
p,‘ ',!n_r,'n.SAV'))

/File =
!QUOTE(!CONCAT(!dir,!fnv,'_',lratepan,'_i',!var_i,'_pr',!var_pr,'_ib',!n_ib,'_',!n_
p,'_',!n_r,'_',!trial,'_','n.sav'))

/In = !concat(‘from',!trial).
if (!concat(‘from',!trial)=l) trial=ltrial.

save outfile =
!QUOTE(!CONCAT(!dir,!fnd,'_',!ratepan,‘_i',!var_i,'_pr',!var_pr,'_ib',!n_ib,'_',!n
p,'_',!n_r,'n.SAV'))

/drop = !concat(‘from',!trial).

erase file =
!QUOTE(!CONCAT(!dir,!fnv,'_',!ratepan,'_i',!var_i,'_pr',!var_pr,' ib',!n_ib,'_',!n
p,'_',!n_r,'_',!trial,'_','n.sav')).

!enddefine.

tifiii*fﬁfitiiiiifitftttitffittifftfiﬁfiiiit++t§+tiitﬂfiitttfﬁitfﬁifiitittﬁtttitttt
i
* .

Mega Macro Execution
fitiiitifitititiﬁitiﬁtitifiﬁtiittiiittifiiiffftfitiiftttft’tititetfitﬁfﬁttffitfiiit

i

preserve.

set errors = off.
set messages = off.
set printback = off.
set mxloops = 10000.

data list / FILE_ID2 1-5 rowtype_ 6—13 (A) varname_ 14-21 (A) vcl 22-26 vc2 27-31
vc3 32-36 vc4 37—41 vc5 42-46 vc6 47-51 vc7 52-56.
begin data

-999 EST . -999 -999 -999 —999 —999 -999 -999
end data.
Save outfile = 'c:\temp\junk_vcdump.sav'.

* definitions for macro arguments:

* n_bszx4 = n_p / n_r ’ 4 - ( mod (n_p / n_r ' 4, 4,;
define LOZmegOl (macroﬁs = !charend('l')

/seed no = !charend('l')

/sect01 = !charend(‘l')

/save01 = !charend('l')

/sect02 = !charend('l'l

/save02 = !charend('l')

/sect03 = !charend('l')

/save03 = !charend"l')

124

/sect04 = !charend

('l')
/saveO4 = !charend('l')
/n p = !charend('l')
/n—i = !charend('l')
/n_r = !charend('l')
/n—pi = !charend('l')
/n_pr = !charend('l')
/n_ir = !charend('l')
/n_px2 = !charend('l')
/n:ib = !charend('l')
/n_bszx4 = !charend('l')
/var_p = !charend('l')
/var i = !charend('l')
/var:r = !charend('l')
/var_pi = !charend('l')
/var_pr = !charend('l')
/var_ir = !charend('l')
/var_pir = !charend('l')
/b_trial = !charend('l')
/e_trial = !charend('l')
/ratepan = !charend('l')
/itm_ind = !charend('l')
/r_lbl = !charend('l')
/dir = !charend('l')
/fnf = !charend('l')
/fnm = !charend('l')
/fns = !charend('l')
/fnv = !charend('l')
/fnd = !charend('l')).
Get file = 'c:\temp\junk_vcdump.sav'.
save outfile
=!QUOTE(!CONCAT(!dir,!fnd,'_',!ratepan,'_i',!var_i,'_pr',!var_pr,'_ib',!n_ib,'_',!n
p,' ',!n_r,'c&m.SAV')).

save outfile
=!QUOTE(!CONCAT(!dir,!fnd,'_’,!ratepan,'_i',!var_i,'_pr',!var_pr,'_ib',!n_ib,'_',!n
_p,'“',!n_r,'n.SAV')).

Set mprint=lmacro_s.

Set seed = !seed_no.

!do !trial=!b_trial !to !e_trial.

***** Exe Section 1.

!If (!sectOl=l) !then.

LOIFDMOI n_p = !n_p l
n_i = !n_I 1
n_r = !n_r |
n_pi = !n_pi 1
_pr = !n_pr |
n_ir = !n_ir l
n_ib = !n_ib |
ratepan = !ratepan I
trial = !trial I
var_p = !var_p I
var_i = !var_I l
var_r = !var_r I
var_pi = !var_pi l
var_pr = !var_pr I
var_ir = !var_ir |
var_pir = !var_pir l
dir = !dir I
EN = !FNF l.
!ifend.

***** Exe Section 2.
!If (!sect02=l) !Then.

!If (!ratepan=l) !Then.

125

LOIMDROI

!ifend.

!If (!ratepan=2

LOlMDMOl

!Ifend.

ratEpan

nﬂP
n_hf_rt
n_ib
n_r
var_i
var_pr
ratepan

!or !ratepan=3)

ll

!n_p
!n_r

!n px2
!n_pi
!n_ib
!n_bszx4
!var_pr
!var_i
!ratepan

!Then.

!n_p
!n_px2
!n_ib
!n_r
!var_i
!var_pr
!ratepan

* create a macro call to execute the section

* file with the score file.

LOIMDJOI

!Ifend.

n_P

n_r
n_ib
var_i
var_pr
ratepan
trial
dir

fnf

fnm

***** Exe Section 3.

!If (!sect03=l)

LOlSDMOl

!Ifend.

n_P

nwr
n_ib
var_i
var_pr
ratepan
trial
itm_ind
r_lbl
dir

fnm

fns

***** Exe Section 4.

!If (!sect04=l) !Then.

LOlVCOl

!Ifend.

n_P
n_r
n_ib
var_i
var_pr
ratepan
trial
dir
fns
fnv
fnd

!if (!save01=0) !then.

!Then.

!n_p
!n_r
!n_ib
!var_i
!var_pr
!ratepan
!trial
!dir
!fnf
!fnm

!n_p
!n_r
!n_ib
!var_i
!var_pr
!ratepan
!trial
litm_ind
ir_lbl
!dir
!fnm
!fns

!n_p
!n_r
!n_ib
!var_i
!var_pr
!ratepan
!trial
!dir
!fns
!fnv
!fnd

126

(section BB)

to match the selection

 

erase file =

!quote(!concat(!dir,!fnf,'_',!ratepan,'_i',!var_i,'_pr',!var_pr,'_ib',!n_ib,'_',!n_
pr '_'r !n_rr '_', Strial,'.saV')).
!ifend.

!if (!save02=0) !then.

erase file
=!QUOTE(!CONCAT(!dir,!fnm,’_',!ratepan,'_i',!var_i,'_pr',!var_pr,'_ib',!n_ib,'_',!n
_p,'_’,!n_r,'_',!tria1,'.SAV')).

!ifend.

!if (!saveO3=O) !then.
erase file =
!QUOTE(!CONCAT(!dir,!fns,'_',!ratepan,‘_i',!var_i,'_pr',!var_pr,'_ib',!n_ib,'_',!n_

p,' ',!n_r,'_',!trial,'.SAV')).

!ifend.

script "c:\cc21\Delete Navigator Items (All).SBS".
!doend.

restore.
exe.
!enddefine.

l \ledn'vekhris‘dazBig DS'tc2l\now‘wCZlgPCl‘xcellgPCOLC SPS

127

 

REFERENCES
Aczel, A. D. (1996). Complete business statistics. (3rd ed.). Chicago: Irwin.

Alan, G., and Liu, J. (1981). Computer Solution of Large Sparse Positive Deﬁnite Systems,
Prentice-Hall.

Authors. (1996). The NAEP Guide: How Does NAEP Reliably Score and Process Millions of
Student-Composed Responses? (Technical Report Number 97-990). Tempa, Florida:
National Center for Educational Statistics (NCES).

Authors. (1998a). Collegiate Assessment of Academic Proﬁciency (CAAP) (Internet Web Wide
Web Document). Iowa City, Iowa: ACT, Inc.

Authors. (1998b). Test of English as a Foreign Language (TOEFL): Test of written English
online (Internet World Wide Web Document). Princeton, NJ: ETS.

Babb, J. S. (1986). Pooling maximum likelihood estimates of variance components obtained from
subsets of unbalanced data. Master's thesis. Cornell University.

Bejar, I. I. (1993). A generative approach to psychological and educational measurement. In R. .l.
Mislevy, I. I. Bejar, and N. Frederiksen (Ed.), T est theory for a new generation of tests.
(pp. 323-357): Lawrence Erlbaum Associates, Inc, Hillsdale, NJ.

Bejar, I. I. (1999, July). T he future of scoring open-ended assessments. Personal Communication.
Educational Testing Service.

Bell, J. F. (1985). Generalizability theory: The software problem. Journal of Educational
Statistics, 10(1), 19-29.

Brennan, R. L. (1992). Elements of generalizability theory. Iowa City: American College
Testing.

Brennan, R. L. (1997). A perspective on the history of generalizability theory. Educational
Measurement: Issues and Practice, 16(4), 14-20.

Brennan, R. L. (1998). Misconceptions at the intersection of measurement theory and practice.
Educational Measurement: Issues and Practice 17(1): 5—9.

Brennan, R. L., Gao, X., & Colton, D. A. (1995). Generalizability analyses of Work Keys
listening and writing tests. Educational and Psychological Measurement, 55(2), 157-176.

Brennan, R. L., Jarjoura, D., & Deaton, E. L. (1980). Some issues concerning the estimation and
interpretation of variance components in generalizability theory (Technical Bulletin 36).

Iowa City, Iowa: ACT.

Brennan, R. L., & Kane, M. T. (1977). An index of dependability for mastery tests. Journal of
Educational Measurement, 14, 277-289.

128

Bracey, G. W. (1987). Measurement-driven instruction: Catchy phrase, dangerous practice. Phi
Delta Kappan, 68, 683-686.

Burdick, R. K. & F. A. Graybill (1992). Conﬁdence Intervals on Variance Components. New
York, Marcel Dekker.

Chiu, C. W. T., & Wolfe, E. W. (1997, April). Generalizability Theory: A New Approach to
Analyze Non-Crossed Performance Assessment Data. Paper presented at the American
Educational Research Association annual meeting, Chicago, IL.

C Iauser, B. E., Clyman, S. G., & Swanson, D. B. (1999). Components of rater error in a complex
performance assessment. Journal of Educational Measurement, 36(1), 29-45.

Clauser, B. E., Swanson, D. B., & Clyman, S. G. (1996). The generalizability of scores from a
performance of physicians' patient management skills. Academic Medicine (RIME
Supplement), 71, SIO9~I I I.

Crocker, L., & Algina, .l. (1986). Reliability and the Classical True Score Model. In L. Crocker
& .l. Algina (Eds.), Introduction to Classical and Modern Test Theory (pp. 105-130).
New York, NY: Rinehart and Winston.

Cronbach, L. J., Gleser, G. C., Nanda, H., & Rajaratnam, N. (1972). The dependability of
behavioral measurements: Theory of generalizability for scores and proﬁles. New York:
Wiley.

Cronbach, L. .I., Linn, R. L., Brennan, R. L., & Haertel, E. H. (1997). Generalizability analysis
for performance assessments of student achievement or school effectiveness.

Educational and Psychological Measurement, 5 7(3), 373-399.

Engelhard, G., Jr. (1997). Constructing rater and task banks for performance assessments.
Journal of Outcome Measurement, 1(1), 19-93.

Giesbrecht, F. G. (1983). An efﬁcient procedure for computing MINQUE of variance
components and generalized least squares estimates of ﬁxed effects. Communications in

Statistics, Series A: Theory and Method, 12, 2169-2177.

Goodnight, .1. H. (1978). Computing MIVQUEO Estimates of Variance Components (SAS
Techincal Report R-IOS). Cary, NC: SAS Institute.

Gordon, B. (1998, Sep). Scoring performance assessment. Personal Communication. University
of Georgia.

Hamilton, L., C. (1992). Regression with Graphics. Belmont, CA: Duxbury Press.

Harwell, M. ( 1992). Summarizing Monte Carlo results in methodological research. Applied
Psychology Measurement, 17(4), 297-313.

Harwell, M., Stone, C. A., Hsu, T. C., & Kirisci, L. (1996). Monte Carlo studies in item response
theory. Applied Psychology Measurement, 20(2), 101-125.

129

Hays, W. L., & Winkler, R. L. (1970). Statistics: Probability, inference, and decision. (Vol. II).
New York: Holt, Rinehart and Winston, Inc.

Hedges, L. V., & Olkin, 1. (1985). Statistical methods for Meta-Analysis. New York: Academic
Press.

Henderson, C. R. (1953). Estimation of variance and covariance components. Biometrics, 9, 226-
252.

Hombo, C. (1999, July). The potential of electronic raters to score national assessments.
Personal Communication. Educational Testing Service.

Kalaian, H. A., & Becker, B. J. (1996). Modeling diﬂerences in variability. Paper presented at
the Annual Meeting of the American Educational Research Association, San Diego, CA.

Kane, M. T. (1982). A sampling model for validity. Applied Psychological Measurement, 6, 125-
160.

Kane, M. T., Crooks, T., and Cohen, A. (1999). Validating measures of performance.
Educational Measurement: Issues and Practice, 18(2), 5-17.

Kim, J. 0., & Curry, J. (1977). The treatment of missing data in multivariate analysis. Social
Methodological Research, 6, 215-240.

Kolen, M. J ., & Brennan, R. L. (1995). Test Equating: Methods and Practices. New York:
Springer.

Koretz, D., Stecher, B., & McCaffrey, D. (1994). The Vermont Portfolio Assessment Program:
ﬁndings and implications. Educational Measurement: Issues and Practice, 13, 5-6.

Lane, 8., Liu, M., Ankenmann, R. D., & Stone, C. A. (1996). Generalizability and validity of a
Mathematics performance assessment. Journal of Educational Measurement, 33(1), 71-
92.

Linn, R. L., Burton, E., DeStefano, L., & Hanson, M. (1996). Generalizability of New Standards
Project 1993 Pilot Study Tasks in Mathematics. Applied Measurement in Education,
9(3), 201-214.

Little, R. J. A., & Rubin, D. B. (1987). Statistical Analysis with Missing Data. New York: John
Wiley & Sons.

Longford, N. T. (1995). Models for uncertainty in educational testing. New York: Spring-Verlag.

Malley, J. D. (1986). Optimal unbiased estimation of variance components. (Vol. 39). New
York: Springer-Verlay.

Marcoulides, GA. (1988). An alternative method for variance component estimation:

Applications to generalizability theory. Unpublished Dissertation. University of
California, Los Angeles.

130

Marcoulides, G. A. (1990). An alternative method for estimating variance components in
generalizability theory. Psychological Reports, 66, 3 79-3 86.

Mehrens, W. A. (1987). Validity issues in tearcher licensure tests. Journal of Personnel
Evaluation in Education, 1, 195-22.

Mehrens, W. A. (1992). Using performance assessment for accountability purposes.
Educational Measurement: Issues and Practice, 11(1), 3-9.

Millman, J ., & Glass, G. V. (1967). Rules of thumb for writing the ANOVA table. Journal of
Educational Measurement, 4(2), 41 -5 l .

Montgomery, D. C. (1997). Randomized Blocks, Latin Square, and Related Designs, Design and
Analysis of Experiments (pp. 208-210). New York, NY: John Wiley & Son, Inc.

Mooney, C. Z. (1997). Monte Carlo Simulation. (Vol. no. I 16). Thousand Oaks, CA: Sage.

Myford, C. M., Marr, D. B., & Linacre, .1. M. (1995). Reader calibration and its potential role in
equating for the Test of Written English (Report prepared by the Center for Performance
Assessment MS # 95-02). Princeton: Educational Testing Service.

Olsen, A., Seely, J., & Birkes, D. (1976). Invariant quadratic unbiased estimation for two
variance components. Annals of Statistcs, 4, 878-890.

Othman, A. R. (1995). Examining task sampling variability in science performance assessments.
Unpublished Dissertation. University of California, Santa Barbara.

Patterson. P.(1985). An investigation of the dependability of criterion-referenced test scores
using generalizability theory. Unpublished Dissertation. University of Wisconsin -
Madison.

Patz, R. (1996). Markov Chain Monte Carlo Methods for Item Response Theory Models.
Unpublished Dissertation, Carnegie Mellon University.

Pearson, PD. (1998). Aligning standards for teaching: What do we have to gain? Paper
presented at the National Conference on High Standards for Outstanding Achievement in
Education: Examining the Issues, E. Lansing, MI.

Psychometrika Editorial Board. (1979). Publication policy regarding Monte Carlo studies.
Psychometrika, 44(2), 133-4.

Rao, P. S. R. S. (1997). Variance components estimation: Mixed models, methodologies and
applications. New York, NY: Chapman & Hall.

Rao, P. S. R. S. (1997). Combination information from experiments (pp.147-154) ln Variance

components estimation: mixed models, methodologies and applications. New York, NY:
Chapman & Hall.

131

Raudenbush, S. W. (1988). Estimating change in dispersion. Journal of Educational
Statistics, 13(2), 148 - 171.

Rencher, A. C. (1995). Methods of multivariate analysis. New York: Wiley.

Rubinstein, R. Y. (1981). Simulation and the Monte Carlo Method. New York: John Wiley &
Sons.

Satterthwaite, F. E. (1941). Synthesis of variance. Psychometriaka, 6, 309-16.

Satterthwaite, F. E. (1946). An approximate distribution of estimates of variance components.
Biometrics Bulletin, 2, 1 10-1 14.

Schafer, W. (1998, Sept). Scoring performance assessments. Personal Communication. The
Maryland School Performance Assessment Program.

Schroeder, M. L. (1986). Inferential Procedures for Multifaceted C oeﬂicients of
Generalizability. Unpublished Dissertation. The University of British Columbia
(CANADA)

Searle, S. R. (1971). Topics in variance component estimation. Biometrics, 27, 1-76.
Searle, S. R. (1987). Linear Models for Unbalanced Data. New York, NY: John Wiley & Sons.

Searle, S. R., Casella, G., & McCulloch, C. E. (1992). Variance Components. New York, NY:
Wiley.

Seeger, P. (1970). A method of estimating variance components in unbalanced designs.
Technometrics, 12(2), 207-218.

Shavelson, R. J., Baxter, G. P., & Gao, X. (1993). Sampling variability of performance
assessments. Journal of Educational Measurement, 30(3), 215-232.

Shavelson, R. .1 ., & Webb, N. M. (1981). Generalizability theory 1973-1980. British Journal of
Mathematical and Statistical Psychology, 34, 133-166.

Shavelson, R. J ., & Webb, N. M. (1991). Generalizability theory: A primer. Newbury Park, CA:
Sage.

Smith, P. L. (1978). Sampling errors of variance components in small sample multifacet
Generalizability Studies. Journal of Educational Statistics, 3(4), 319-346.

Smith, P. L. (1981). Gaining accuracy in generalizability theory: Using multiple designs. Journal
of Educational Measurement, 18, 147-154.

Smith, P. L. (1982). A conﬁdence interval approach for variance component estimates in the
context of generalizability theory. Educational and Psychological Measurement, 42,
459-466.

132

Townsend, E. C. (1968). Unbiased Estimators of Variance Components in Simple Unbalanced
Designs. Unpublished Ph.D. Dissertation, Cornell University, Ithaca, New York.

Tucker, M. (1998, May 5-7). High Standards for Improved Achievement in Education. Paper
presented at the High Standards for Outstanding Achievement in Education: Examining
the Issues, E. Lansing, MI

US. Department of Education. (1998b). Writing Framework and Speciﬁcations for the 1998
National Assessment of Educational Progress (Report). Washington, DC: Authors.

Vickers, D. (1998, Sept). Scoring performance assessments. Personal Communication. North
Carolina Performance Assessment Program.

Wainer, H. (1993). Measurement problems. Journal of Educational Measurement, 30(1), 1-21.

Welch, C. (1996, July). Scoring performance assessments. Personal Communication.
Performance Assessment Center, ACT, Inc.

Wolfe, E. W. (1998, Sept). Scoring performance assessment. Personal Communication.
University of Florida.

Ysseldyke, J., & Olsen, K. (1999). Putting Alternate Assessments into Practice: What to
Measure and Possible Sources of Data. Synthesis Report No. 28, National Center on
Educational Outcomes. [Online] Available
http://www.coIed.umn.edu/nceo/OnIinePubs/awgﬁnal.html

133