This is to certify that the
dissertation entitled
TEST CHARACTERISTICS AND THE BIAS AND SAMPLING VARIABILITY

OF CRITERION-REFERENCED RELIABILITY COEFFICIENTS

presented by

Loraine Son

has been accepted towards fulfillment
of the requirements for

Ph.D. degree in Psychology

Major professor

Date 2/2483

TEST CHARACTERISTICS AND THE BIAS AND SAMPLING VARIABILITY

OF CRITERION-REFERENCED RELIABILITY COEFFICIENTS

by

Loraine Son

A DISSERTATION

Submitted to
Michigan State University
in partial fulfillment of the requirements
for the degree of

DOCTOR OF PHILOSOPHY

Department of Psychology

1983

ABSTRACT
TEST CHARACTERISTICS AND THE BIAS AND SAMPLING VARIABILITY
OF CRITERION-REFERENCED RELIABILITY COEFFICIENTS
By

Loraine Son

The present study examined the bias and sampling variability of
the major single test administration criterion-referenced reliability
coefficients given various test parameters. These coefficients were
categorized by the type of loss function (squared error or threshold)
upon which they were based and by whether they included a chance
agreement correction. The extent of bias was studied as a function of
different parallelism conditions (classic versus random), distribution
shapes, and cut-off scores. Distributions not belonging to the beta-
binomial family and the random parallelism condition were included in
the study to examine the robustness of several coefficients to viola-
tions of their underlying assumptions. Each coefficient's sampling
fluctuation was investigated for various test lengths and sample
sizes. Data from the Michigan Educational Assessment Program and from
a psychology mid-term exam were used to generate item domains as well
as test scores, and were altered to reflect the various parameters
investigated. For each cell of the design, population coefficients
were computed from either randomly or classically parallel alternate
forms. To determine the magnitude and direction of bias, the
population values were compared to the mean of the corresponding
single test administration sample estimates taken across many
samples. The standard deviation of these sample estimates indicated
the coefficient's sampling variability.

For distributions derived from homogeneous item domains, all the
coefficients, except the kappa estimates, were robust to violation of
the classic parallelism assumption. For the other distributions, the
parallelism condition did affect the coefficients' biases, although
not always in the expected direction. Generally, as the cut-off score
approached a distribution's mean, the squared error coefficients
became more biased for randomly parallel tests consisting of hetero-
geneous items. The cut-off score also significantly affected the bias
of the threshold agreement coefficients, although these results
generally did not follow a pattern. The hypothesis that the $p_o$ and
kappa estimates would be more biased for distributions that were not
beta-binomial was not supported. Sampling variability decreased as the
test length and sample size increased. Based on these results,
recommendations were made about which coefficient to use within each
category given various test conditions.

ACKNOWLEDGEMENTS

I wish to express my most sincere and warmest thanks to my chair-
person, Dr. Neal Schmitt, who introduced me to this area of study and
whose suggestions guided me through the most difficult times. Without
his sage advice as well as his calm, rational, and nurturant manner
during these periods, the project may never have reached its final
stages. His patience and support have been invaluable.

I also wish to express my appreciation to Dr. Raymond Frankmann,
Dr. William Mehrens, and Dr. Frederic Wickert for their timely, needed
advice and their willingness to respond with constructive feedback
under the pressure of a short time frame. Apart from my committee
members, other individuals have contributed notably to this project; I
am indebted to Bryan Coyle for his kindness in offering to apply his
computer expertise when needed, and to Kathy Sigafoose as well as
those who assisted her, Kathy Cooper and Janet Larrimore, for the hard
work and time they devoted to typing and preparing this manuscript.

Finally, I wish to thank my parents for not only providing love,
support, and patience through the more trying times of this project,
but also for their understanding, kindness, devotion, and sense of
humor which have guided me throughout my life. For these priceless
contributions and with much love and respect, I dedicate this work to
them.

TABLE OF CONTENTS

                                                                  Page
LIST OF TABLES.......................................................v
LIST OF FIGURES....................................................vii

TEST CHARACTERISTICS AND THE BIAS AND SAMPLING VARIABILITY
OF CRITERION-REFERENCED RELIABILITY COEFFICIENTS.....................1

   Criterion-Referenced Measurement and Tests........................2
   Reliability.......................................................9
      Reliability Formulations Based Upon Squared
         Error Loss.................................................17
      Reliability Formulations Based Upon Threshold
         Loss.......................................................31
      Synthesis.....................................................50

METHOD..............................................................66

   Data Base........................................................66
   Procedure........................................................67
      Test Characteristics..........................................67
      Data Generation...............................................69
      Determination of Bias.........................................77

RESULTS.............................................................85

   Population Values................................................85
   Bias.............................................................96
   Sampling Variability............................................118

DISCUSSION.........................................................121
SUMMARY AND CONCLUSIONS............................................125
APPENDICES

   A1   Mean Bias and Standard Deviation of Livingston's
        $\hat{K}^2(X, T_X)$ Across Samples of 25 Examinees.........132
   A2   Mean Bias and Standard Deviation of Livingston's
        $\hat{K}^2(X, T_X)$ Across Samples of 35 Examinees.........134
   A3   Mean Bias and Standard Deviation of Livingston's
        $\hat{K}^2(X, T_X)$ Across Samples of 50 Examinees.........136
   A4   Mean Bias and Standard Deviation of Brennan
        and Kane's $\hat{\Phi}(\lambda)$ Across Samples of 25 Examinees...138
   A5   Mean Bias and Standard Deviation of Brennan
        and Kane's $\hat{\Phi}(\lambda)$ Across Samples of 35 Examinees...139
   A6   Mean Bias and Standard Deviation of Brennan
        and Kane's $\hat{\Phi}(\lambda)$ Across Samples of 50 Examinees...140
   A7   Mean Bias and Standard Deviation of Brennan
        and Kane's $\hat{\Phi}$ Across Samples of 25 Examinees......141
   A8   Mean Bias and Standard Deviation of Brennan
        and Kane's $\hat{\Phi}$ Across Samples of 35 Examinees......142
   A9   Mean Bias and Standard Deviation of Brennan
        and Kane's $\hat{\Phi}$ Across Samples of 50 Examinees......143
   A10  Mean Bias and Standard Deviation of Marshall's
        $\hat{p}_o$ Across Samples of 25 Examinees.................144
   A11  Mean Bias and Standard Deviation of Marshall's
        $\hat{p}_o$ Across Samples of 35 Examinees.................146
   A12  Mean Bias and Standard Deviation of Marshall's
        $\hat{p}_o$ Across Samples of 50 Examinees.................148
   A13  Mean Bias and Standard Deviation of Subkoviak's
        $\hat{p}_o$ Across Samples of 25 Examinees.................150
   A14  Mean Bias and Standard Deviation of Subkoviak's
        $\hat{p}_o$ Across Samples of 35 Examinees.................152
   A15  Mean Bias and Standard Deviation of Subkoviak's
        $\hat{p}_o$ Across Samples of 50 Examinees.................154
   A16  Mean Bias and Standard Deviation of Huynh's
        $\hat{p}_o$ Across Samples of 25 Examinees.................156
   A17  Mean Bias and Standard Deviation of Huynh's
        $\hat{p}_o$ Across Samples of 35 Examinees.................158
   A18  Mean Bias and Standard Deviation of Huynh's
        $\hat{p}_o$ Across Samples of 50 Examinees.................160
   A19  Mean Bias and Standard Deviation of Subkoviak's
        $\hat{\kappa}$ Across Samples of 25 Examinees..............162
   A20  Mean Bias and Standard Deviation of Subkoviak's
        $\hat{\kappa}$ Across Samples of 35 Examinees..............164
   A21  Mean Bias and Standard Deviation of Subkoviak's
        $\hat{\kappa}$ Across Samples of 50 Examinees..............166
   A22  Mean Bias and Standard Deviation of Huynh's
        $\hat{\kappa}$ Across Samples of 25 Examinees..............168
   A23  Mean Bias and Standard Deviation of Huynh's
        $\hat{\kappa}$ Across Samples of 35 Examinees..............170
   A24  Mean Bias and Standard Deviation of Huynh's
        $\hat{\kappa}$ Across Samples of 50 Examinees..............172

LIST OF REFERENCES.................................................174

LIST OF TABLES

                                                                  Page

 1.  Characteristics of Each Randomly Parallel
     Alternate Form.................................................86

 2.  Characteristics of Each Classically Parallel
     Alternate Form.................................................88

 3.  Classical Reliability of Randomly and Classically
     Parallel Alternate Forms for Each Distribution/
     Test Length Combination........................................90

 4.  Alternate Form Population Values of Livingston's
     $K^2(X, T_X)$ for Each Cell of the Design......................90

 5.  Population Values of Brennan and Kane's $\Phi(\lambda)$
     for Each Cell of the Design....................................91

 6.  Population Values of Brennan and Kane's $\Phi$
     for Each Cell of the Design....................................92

 7.  Alternate Form Population Values of $p_o$ for Each
     Cell of the Design.............................................93

 8.  Alternate Form Population Values of Kappa for Each
     Cell of the Design.............................................94

 9.  Mean Bias (Across Cells) of Various Coefficients in
     Estimating the Reliability of Classically and
     Randomly Parallel Alternate Forms..............................98

10.  Mean Bias Across Cells of Each Reliability
     Coefficient for Each Distribution..............................99

11.  Mean Bias Across Cells of Each Coefficient for
     Each Cut-Off Score............................................102

12.  Mean Bias Across Cells of Each Coefficient for Every
     Parallelism/Distribution/Cut-off Score Combination............104

13.  Mean Standard Deviation Across Cells of Each
     Coefficient for Each Test Length..............................119

14.  Mean Standard Deviation Across Cells of Each
     Coefficient for Each Sample Size..............................119

15.  Mean Standard Deviation Across Cells for Each
     Sample Size/Test Length Combination...........................120

16.  Direction of Bias of Each Coefficient for Each
     Parallelism/Distribution/Cut-off Score Combination............126

17.  Recommended Corrected/Uncorrected Squared Error
     and Threshold Agreement Coefficients for Each
     Parallelism/Distribution/Cut-off Score Combination............129

LIST OF FIGURES

Figure                                                            Page

 1.  Joint Distribution of True and Obtained
     Classifications................................................15

 2.  Mastery Testing Reliability Formulations.......................51

 3.  Advancement Scores for Each Combination of Test
     Length and Cut-Off Level.......................................69

 4.  Skewed Population Frequency Distribution of
     Domain Scores..................................................72

 5.  J-shaped Population Frequency Distribution of
     Domain Scores..................................................73

 6.  Bimodal Population Frequency Distribution of
     Domain Scores..................................................75

 7.  Normal Population Frequency Distribution of
     Domain Scores..................................................76

 8.  Formulas for Both Criterion-Referenced Reliability
     Population Coefficients Computed from Alternate
     Forms and Single Test Administration Sample
     Estimates......................................................79

TEST CHARACTERISTICS AND THE BIAS AND SAMPLING VARIABILITY
OF CRITERION-REFERENCED RELIABILITY COEFFICIENTS

Reliability denotes the consistency of measurement or the extent
to which scores are reproducible over repeated testings on different
occasions, or over different sets of parallel or randomly parallel
items, and/or under other small variations in conditions (Anastasi,
1976). Although it is frequently stated that a test is reliable,
reliability actually refers to the consistency of the score interpre-
tation obtained from the test, not to the test, in and of itself.

In industrial-organizational psychology as well as in other
applied sciences, norm-referenced interpretations of test scores have
become the sine qua non of measurement. In norm-referenced measure-
ment, an individual's score is given meaning by determining the
individual's relative standing within a normative group (Popham &
Husek, 1969). Quite appropriately, those reliability coefficients
associated with classical test theory and norm-referenced measurement
(e.g., correlation between two test administrations, coefficient
alpha) indicate the extent to which examinees' relative standings
remain consistent. However, norm-referenced interpretations of data
do not satisfy all the measurement needs of psychologists. For
example, in many situations, scores ultimately serve as a basis for
making dichotomous decisions (e.g., accept/reject) or placing
individuals into groups (e.g., successful/unsuccessful). In order to
address these and other measurement needs, many educational psychologists
and measurement experts have turned to criterion-referenced
measurement and, in so doing, have had to create coefficients
indicating the extent to which criterion-referenced interpretations of
data are reliable.

Criterion-Referenced Measurement and Tests

Norm- and criterion-referenced measurements were distinguished by
Glaser (1963) on the basis of the standard used to interpret scores.
As previously noted, the former uses the test scores of members of a
relevant group as the standard for judging an individual's
performance. Consequently, the mean serves as the anchor point of
these scales and raw scores are typically converted into standard
scores, percentiles, stanines, or ranks (Eignor & Hambleton, 1979).

On the other hand, criterion-referenced measurement uses defined
levels of criterion behavior along an achievement, skill, or attain-
ment continuum as the performance standard (Glaser, 1963; Glaser &
Klaus, 1962). More specifically, the behaviors required to
demonstrate competence at each proficiency level are identified and
are compared to the behaviors exhibited by an individual on the
testing instrument. A typical criterion-referenced measure is the
percentage of items answered correctly. This type of scale has two
anchor points, one at each end of the scale, i.e., 0% and 100%
(Hambleton & Eignor, 1979). In most circumstances, some evaluation is
made as to whether or not an individual's score indicates mastery of a
particular skill, objective, etc.; a minimally acceptable level of
performance, cut-off score, is established and an individual is
classified as either a master or non-master according to whether
his/her score is above or below this predetermined level of competence
(Buck, 1975; Hambleton & Novick, 1973). In other applications,
several cut-off scores may be used to divide the examinees into appro-
priate groups. Contrary to norm-referenced measures, criterion-
referenced scores indicate what an individual can and cannot do
"independent of reference to the performance of others" (Glaser, 1963,
p. 520; Glaser & Klaus, 1962). In short, norm-referenced measures
employ a relative standard, while criterion-referenced measures are
based upon an absolute standard (Glaser, 1963).
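
The mastery decision described above can be illustrated with a minimal sketch. The function names, the example responses, and the 70% cut-off are hypothetical, and treating a score exactly at the cut-off as mastery is an assumption, since the text speaks only of scores above or below the standard.

```python
def percent_correct(item_scores):
    """Return the percentage of items answered correctly (0 to 100)."""
    return 100.0 * sum(item_scores) / len(item_scores)


def classify(item_scores, cutoff_percent):
    """Label an examinee against an absolute standard.

    Assumption: a score equal to the cut-off counts as mastery.
    """
    if percent_correct(item_scores) >= cutoff_percent:
        return "master"
    return "non-master"


# Hypothetical item responses (1 = correct, 0 = incorrect).
responses = {
    "examinee_a": [1, 1, 1, 0, 1, 1, 1, 1, 0, 1],  # 8 of 10 correct
    "examinee_b": [1, 0, 0, 1, 1, 0, 1, 0, 0, 1],  # 5 of 10 correct
}

# Each decision depends only on the examinee's own score and the cut-off,
# never on how the other examinees performed.
decisions = {
    name: classify(scores, cutoff_percent=70.0)
    for name, scores in responses.items()
}
```

Changing the scores of one examinee leaves the other examinee's classification untouched, which is the sense in which the standard is absolute rather than relative.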

Situations where criterion-referenced measurement would be more
appropriate than norm-referenced measurement are easily discernible.
The standards used indicate that both are appropriate for making
various decisions about individuals (Popham & Husek, 1969; Wardrop,
Anderson, Hively, Hastings, Anderson, & Muller, 1982). In education,
criterion-referenced measurement gained prominence partly due to the
necessity of diagnosing student needs and assessing performance in
individualized instructional programs (Mehrens & Ebel, 1979). The
objective of measurement in these instances was to determine what
skills a student possessed or to simply assess whether a student was
proficient in a particular area. Similar types of score interpreta-
tions have long existed in various aspects of industrial-
organizational psychology such as performance appraisal, job
placement, training performance, and personnel selection via the
multiple cut-off and the multiple hurdle techniques (Glaser & Klaus,
1962; Goldstein, 1974). In all these areas, a frequent concern has
been the determination of an individual's performance independent of
others' performance, and decisions have typically been made about the
individual's mastery of an objective. Such score interpretations have
been particularly prevalent within a free quota system (Wardrop et
al., 1982).

Criterion-referenced measurement is also appropriate for making
decisions about treatments (e.g., training programs), while norm-
referenced measurement is not as suitable (Mehrens & Ebel, 1979;
Popham & Husek, 1969). The latter technique is designed to increase
the within-group variance, while having a small within-group variance
is desirable when evaluating the effects of different treatments
(Popham & Husek, 1969). A typical criterion-referenced measure in
this case is the proportion of individuals achieving mastery on a
post-test.

As can be seen in the above examples, criterion-referenced
measurement is not concerned with rank-ordering individuals, but with
drawing conclusions about an individual's behavioral repertoire
(Glaser, 1963). The psychometric implications of this score interpre-
tation are far-reaching. Several experts have suggested that tests
built using classical methods do not provide adequate representation
of the content needed to make generalizations about what an individual
can do (Glaser & Nitko, 1971; Hambleton & Novick, 1973). In response,
researchers have focused on the development of criterion-referenced
tests.

Various definitions of criterion-referenced tests have been
offered. Ivens (1970) proposed the following general definition: "A
criterion-referenced test is one composed of items keyed to a set of
behavioral objectives" (p. 2). In comparison, a very specific and
restrictive definition was offered by Harris and Stewart (1971):

     A pure criterion-referenced test is one consisting of a
     sample of production tasks drawn from a well-defined
     population of performances, a sample that may be used
     to estimate the proportion of performances in that
     population at which the student can succeed (p. 1).

Similarly, Glaser and Nitko (1971) advanced the following
definition: "A criterion-referenced test is one that is deliberately
constructed to yield measurements that are directly interpretable in
terms of specified performance standards" (p. 653). Expanding upon
this definition, Glaser and Nitko (1971) stated:

     Performance standards are generally specified by
     defining a class or domain of tasks that should be
     performed by the individual. Measurements are taken on
     representative samples of tasks drawn from this domain,
     and such measurements are referenced directly to this
     domain for each individual measured (p. 653).

These definitions of criterion-referenced tests are sufficiently different
that a particular test could be classified as either norm- or
criterion-referenced, or could contain characteristics of each
depending upon the definition adopted (Hambleton & Novick, 1973).
However, all these definitions imply that criterion-referenced tests
are constructed by and dependent upon the existence of a well-
specified content domain as well as procedures for generating samples
of items from this domain (Hambleton & Novick, 1973).

Some measurement experts questioned the accuracy and relevance of
distinguishing between norm- and criterion-referenced tests (Brennan,
1979; Hambleton & Novick, 1973; Mehrens & Ebel, 1979). On the other
hand, Hambleton and Eignor (1979) contended that a distinction should
be made since a methodology now exists for constructing the latter
tests. Mehrens and Ebel (1979) observed that any test, whether it be
criterion- or norm-referenced, represents a specified content
domain. Moreover, a criterion-referenced test can be used to make
norm-referenced measurements and, conversely, criterion-referenced
measurements can be derived from norm-referenced tests, although
neither of these usages may be very satisfactory (Hambleton & Novick,
1973). Given these facts, the primary distinction appears to be be-
tween norm- and criterion-referenced measurement (i.e.,
interpretation) rather than between different types of tests (Brennan,
1979; Ebel, 1971; Hambleton & Novick, 1973; Mehrens & Ebel, 1979). Of
course, choosing a particular type of measurement prior to test
construction has different implications for the method used to
determine the items to be included on a test. However, Brennan (1979)
proposed that different methods of test construction and item analysis
produce changes in the definition of the item domain relevant for each
measurement type rather than effect changes in the measurements them-
selves. The point is that scores can be interpreted using either
standard for most tests. The legitimacy of such an interpretation is
a different issue and depends upon the manner used to construct the
tests as well as how restrictive a definition one adopts for a criterion-
referenced test (Mehrens & Ebel, 1979; Wardrop et al., 1982).

In preparing test items for a criterion-referenced score inter-
pretation, the overriding interest is how well the item samples the
content domain or criterion behavior (Wardrop et al., 1982). The
reason for such concern is the need for generalizing from specific
test item responses to the whole domain of behaviors in order that
inferences can be made about what skills the examinee possesses
(Hambleton & Eignor, 1979). Although test development for norm-
referenced measurement is frequently concerned with defining the
domain of interest, criterion-referenced testing involves far more
concern with this issue and with obtaining a representative sample of
items from this domain (Hambleton & Novick, 1973; Wardrop et al.,
1982). In short, the basic steps in constructing tests specifically
for criterion-referenced measures are specifying the domain, writing
items reflecting these specifications, and selecting items via a
sampling procedure (random or stratified random sampling, repre-
sentative sampling) which assures representativeness. Similarly, the
primary approach for conducting an item analysis after test
construction is to have content specialists judge whether each item
appropriately measures some part of the content domain as well as
whether the items adequately sample the domain (Buck, 1975; Hambleton
& Eignor, 1979).
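
The item-selection step can be sketched as a stratified random draw from a specified domain. This is only an illustration under stated assumptions: the domain, its three strata, the item labels, and the function name are all hypothetical, and equal allocation across strata is a simplification rather than a procedure taken from the sources cited.

```python
import random


def stratified_item_sample(domain, items_per_stratum, seed=0):
    """Draw the same number of items at random from every stratum.

    domain maps a stratum name (e.g., an objective) to its pool of items.
    """
    rng = random.Random(seed)  # seeded so the draw can be repeated
    selected = []
    for stratum in sorted(domain):
        pool = domain[stratum]
        if items_per_stratum > len(pool):
            raise ValueError(f"stratum {stratum!r} has too few items")
        # Sampling without replacement within the stratum.
        selected.extend(rng.sample(pool, items_per_stratum))
    return selected


# Hypothetical item domain with three objectives of 20 items each.
domain = {
    "addition": [f"add_{i}" for i in range(20)],
    "subtraction": [f"sub_{i}" for i in range(20)],
    "fractions": [f"frac_{i}" for i in range(20)],
}

test_form = stratified_item_sample(domain, items_per_stratum=4, seed=1)
```

Because every stratum contributes items, the resulting form covers each part of the domain, which is the representativeness the sampling step is meant to secure.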

An objective of tests designed for norm-referenced measurement
has been to maximize variability so that individuals can be reliably
rank-ordered. Norm-referenced measures, such as standard scores,
depend upon the existence of variability since they are derived by
comparing an individual’s scores to the scores of a relevant group.
Variability is partly achieved by using classical test development
methods to analyze items. The assumption underlying classical methods
is that a measurement procedure should provide the most discrimination
possible among individuals on a particular characteristic. Conse-
quently, items are largely analyzed and chosen based on their
statistical characteristics, e.g., discrimination index, difficulty
level, and item-total correlation.

On the other hand, the need for a "criterion-referenced test" to
produce variability has been the topic of some debate. Some theorists
have contended that variability is irrelevant to criterion-referenced
measurement since these scores derive their meaning through a direct
comparison with the performance criterion (Millman & Popham, 1974;
Popham & Husek, 1969). In addition, many applications (e.g., a post-training
test) exist in which the goal may be to have every examinee
in the sample achieve mastery and, in so doing, to actually restrict
the test score range. In contrast, Woodson (1974a) has argued that a
"criterion-referenced test" must produce variability or else it is not
informative or useful. The premise for Woodson's argument was that a
test should be analyzed and developed on observations representative
of those within the range of interest and, as a result, should
discriminate between different observations of the characteristic.
Using this approach, one would include pre- and post-training test
scores in the range of possible observations used to calibrate an
instrument (Woodson, 1974a). No variance may exist within the pre-training
test nor within the post-training test, but the test should
discriminate between these two testing observations. In contrast,
Millman and Popham (1974) contended that the population of
observations for a test designed to elicit criterion-referenced
measures is "a domain of items and the responses of a single
individual to them" (p. 137). Furthermore, they stated that if items
were chosen on the basis of their ability to discriminate between
observations, the test would not contain a sample of items truly
representative of the content domain. The major difference between
these two positions clearly lies in defining the appropriate group to
be used for calibrating the scale (Woodson, 1974b). Proponents of
both sides agree, however, that items should not be chosen so as to
maximize test score variance (Woodson, 1974b). Therefore, the
variability can be expected to be lower than for "norm-referenced
tests." Moreover, in typical usage, the test score variance may be
very limited or non-existent if a test is administered to a sample of
examinees who have just completed an instructional program.

Reliability

The possible absence or diminution of score variability and, more
importantly, the type of score interpretation associated with
criterion-referenced measurement make classical reliability estimates
inappropriate for indexing the consistency of these measures. In
classical test theory, the reliability coefficient equals the squared
correlation between true scores and obtained scores, i.e., the ratio
of true to obtained score variance. All of the practical formulations
(e.g., correlation between classically or randomly parallel tests,
coefficient alpha, split-half reliability) for estimating this ratio
require the computation of a correlation coefficient whose size is
largely a function of the amount of variability in the sample. As is
well known, the more heterogeneous the sample, the higher the reliability
coefficient. This fact is easily seen from the equation for
reliability: $r_{tt} = s_T^2 / s_t^2 = 1 - (s_e^2 / s_t^2)$, where $r_{tt}$ equals the
reliability and $s_T^2$, $s_e^2$, and $s_t^2$ denote the true score variance, the
error variance, and the total score variance, respectively. For any
given test, the error variance remains the same from sample to sample,

regardless of the size of the total variance, because the size of the
error only depends upon the test's inability to provide accurate
measures of individual true scores (Magnusson, 1967). However, the
total and true score variances increase when a more heterogeneous
sample is given the test, resulting in a larger reliability
coefficient. Conversely, when no true score variance exists, the
reliability equals zero (unless $s_t^2 = 0$, in which case, reliability is
undefined). Due to this dependence upon score variability, a test
used for criterion-referenced measurement might be highly consistent
in a test-retest sense, and yet the classical reliability estimates
might deem it to be unreliable because almost everyone has received
the same score. A criterion-referenced measure might even have a
negative internal consistency index and still be a reliable measure
(Popham & Husek, 1969). In short, classical reliability estimates
provide an unjustified pessimistic view of the consistency of
criterion-referenced measurement due to the former's dependence upon
variability (Buck, 1975). High classical reliability estimates can be
used to support a claim of consistency, but low estimates do not
indicate a lack of reliability (Popham & Husek, 1969).
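The dependence on sample heterogeneity described above can be illustrated numerically. The following is a minimal sketch, assuming (as the text does) that the error variance stays fixed from sample to sample; the function name and the variance values are hypothetical and chosen only for illustration.

```python
def classical_reliability(true_var: float, error_var: float) -> float:
    """Ratio of true to total score variance, written as 1 - s_e^2 / s_t^2."""
    total_var = true_var + error_var
    if total_var == 0:
        # No score variance at all: the ratio is undefined.
        raise ValueError("reliability is undefined when total variance is zero")
    return 1.0 - error_var / total_var


# Error variance is held constant (illustrative value); only the
# heterogeneity of the sample, i.e., the true score variance, changes.
error_var = 4.0
for true_var in (0.0, 1.0, 4.0, 16.0, 64.0):
    print(true_var, round(classical_reliability(true_var, error_var), 3))
```

With the same instrument and the same error variance, the coefficient rises from zero toward one solely because the sample becomes more heterogeneous, which is the situation a post-instruction mastery sample reverses.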

As noted previously, criterion-referenced measurement most
commonly involves mastery assessment where one cut-off score is used
as the performance standard. Therefore, reliability for this type of
measurement should assess the dependability of the mastery decision.
Clearly, classical reliability estimates are insensitive to this type
of consistency. Since reliability is based on the relationships

between true, observed, and error scores, this viewpoint can be
presented more clearly by determining the impact of mastery score

interpretation upon these variables and their relationships. Marshall

(1978) provided an excellent discussion in this area, and much of the
following material was derived from his presentation.

In classical test theory, the relationship among obtained score
($X$), true score ($T$), and error ($E$) is expressed by the well-known
equation $X = T + E$. The distributions of true and error scores are
continuous, while $X$ has a polytomous or many-valued discrete
distribution (Marshall, 1978). Theoretically, obtained scores could
have a continuous distribution, but measurement instruments do not
provide the necessary discriminations (Marshall, 1978). The effect of
mastery testing upon this basic equation can be easily seen if testing
is viewed within a decision-theoretic framework (Hambleton & Novick,
1973). In mastery testing, one wants to decide whether an examinee's
true performance level is above or below a threshold or cut-off score;
mastery testing can be viewed as a classification problem (Hambleton &
Novick, 1973). Therefore, the comparable equation for mastery testing
is $D = C + M$, where $D$, $C$, and $M$ represent the obtained classification, the
true classification, and the instance as well as the direction of
misclassification, respectively (Marshall, 1978). This model differs
from its classical test theory counterpart in that all the variables
in the equation are discrete as well as dichotomous given the absolute
value of the misclassification error (Marshall, 1978). Viewed in this
way, mastery testing results in a Platonic true score model (Marshall,
1978).

Using a Platonic instead of a classical true score model for
mastery testing has implications for the determination of

reliability. First, according to Marshall (1978), statistics such as

 


a mean or a variance are "theoretically not meaningful" for the former
model since observed and true scores in the Platonic true score model
cannot be attributed with more than ordinal properties (p. 4). (This
point is debatable; one could argue that these scores have interval
properties when only two mastery levels exist since there is one
interval equal to itself.) The absolute value of the misclassi-
fication error can also be assumed to be ordinal (Marshall, 1978).
Second, measurement error is defined differently for the two models.
In classical test theory, one is concerned with the size of the

error. However, in the Platonic true score model, error can only be
defined in terms of the existence of misclassification, not its size,
i.e., the examinee is either correctly or incorrectly classified
(Marshall, 1978). Moreover, these two types of measurement error need
not be highly correlated (Marshall, 1978). Given these facts, the
issue in assessing reliability for mastery testing is whether an
examinee is assigned to the same mastery state on parallel tests or on
a retest (Hambleton & Eignor, 1979; Hambleton & Novick, 1973).
Swaminathan, Hambleton, and Algina (1974) defined mastery testing
reliability as "the measure of agreement between the decisions made in
repeated test administrations" (p. 264). Consequently, given the
Platonic true score model, the appropriate loss function for
reliability estimation is threshold loss, where loss is either zero or
one depending upon whether the two testing procedures assign the
examinee to the same or to different mastery states, respectively
(Hambleton & Novick, 1973; Marshall, 1978). The correlational
reliability estimates use a squared error loss function and are,

therefore, inappropriate (Hambleton & Novick, 1973).
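The threshold loss idea described above can be sketched as a simple proportion of consistent decisions. This is an illustrative sketch only: the function name, the rule that a score at or above the cut-off counts as mastery, and the score lists are assumptions introduced here, not part of the cited formulations.

```python
from typing import Sequence


def decision_agreement(scores_1: Sequence[float],
                       scores_2: Sequence[float],
                       cutoff: float) -> float:
    """Proportion of examinees assigned to the same mastery state twice.

    Threshold loss: loss is 0 when both administrations place the examinee
    on the same side of the cut-off and 1 otherwise. Mastery is assumed to
    mean a score at or above the cut-off.
    """
    if len(scores_1) != len(scores_2) or len(scores_1) == 0:
        raise ValueError("need two non-empty score lists of equal length")
    losses = [
        0 if (a >= cutoff) == (b >= cutoff) else 1
        for a, b in zip(scores_1, scores_2)
    ]
    return 1.0 - sum(losses) / len(losses)


# Hypothetical scores for six examinees on two parallel forms.
form_a = [9, 8, 10, 7, 9, 6]
form_b = [8, 9, 10, 8, 7, 5]
print(decision_agreement(form_a, form_b, cutoff=8))
```

Note that the size of the score differences never enters the computation; only the side of the cut-off matters, which is what distinguishes threshold loss from squared error loss.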


Some correlational statistics, the phi and the tetrachoric coef-
ficients, do use a threshold loss function and, therefore, might seem
appropriate for assessing reliability in the Platonic model. Marshall
(1978) examined the ability of these coefficients to accurately
measure the squared correlation between the obtained and true
classifications (i.e., classical reliability) given two mastery
states. The phi coefficient was found to be deficient on both
theoretical and statistical grounds. As is well known, the phi coef-
ficient is only appropriate for true dichotomies. One can easily
argue that the mastery/non-mastery dichotomy is artificial since the
underlying variable is continuous and the setting of the cut-off score
is somewhat arbitrary (Glass, 1978; Marshall, 1978). The statistical
problem is that phi can be negative when a negative value does not
correctly reflect the relationship between the true and obtained
classifications. More specifically, if either the true positive or
true negative classification is zero and the false positive and false
negative classifications are non-zero, the phi coefficient will be
negative (Marshall, 1978).

This would mean, for instance that even though there
were only a few true non-masters (5%, say), if they are
all misclassified then phi is negative, even though 90%
or more of the examinees are correctly classified as
masters (Marshall, 1978, p. 7).
One other problem with phi occurs when no variability exists with
respect to the true mastery status and/or the obtained classifi-
cation. In this instance, phi is undefined.
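The two statistical problems with phi noted above can be reproduced with a small table. The sketch below assumes the cell layout of Figure 1 (A = true master classified as master, B = true master classified as non-master, C = true non-master classified as master, D = true non-master classified as non-master); the frequencies are hypothetical and merely mimic the kind of case Marshall describes.

```python
import math
from typing import Optional


def phi_coefficient(a: int, b: int, c: int, d: int) -> Optional[float]:
    """Phi for a 2 x 2 table of true (rows) by obtained (columns) classifications.

    Returns None when either margin has no variability, in which case
    phi is undefined.
    """
    denominator = math.sqrt((a + b) * (c + d) * (a + c) * (b + d))
    if denominator == 0:
        return None
    return (a * d - b * c) / denominator


# Hypothetical case: 5% true non-masters, all of them misclassified (D = 0),
# while 90% of examinees are correctly classified as masters.
a, b, c, d = 90, 5, 5, 0
print("proportion correctly classified:", (a + d) / (a + b + c + d))
print("phi:", phi_coefficient(a, b, c, d))

# No variability in either classification: phi is undefined.
print("phi with everyone a true and obtained master:", phi_coefficient(100, 0, 0, 0))
```

In the first case phi is slightly negative although nine of every ten decisions are correct, so the sign of the coefficient says little about decision consistency.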
Contrary to phi, the use of the tetrachoric correlation

coefficient is appropriate when the two variables are artificially

dichotomized. However, this coefficient assumes that the two


variables have a bivariate normal distribution while mastery test
score distributions are often bimodal (Marshall & Berlin, 1979).

Since dichotomous variables have ordinal data properties,
Marshall (1978) also considered the Spearman rank order correlation
coefficient. However, this statistic is inappropriate when a sub-
stantial number of tied ranks exist (Marshall, 1978). This problem
cannot be solved by computing the Pearson $r$ using the tied ranks as
data points since the resultant formula is algebraically equivalent to
the phi coefficient and, consequently, is subject to the problems
previously discussed (Marshall, 1978).

Since none of the correlational approaches proved satisfactory,
Marshall (1978) examined the ratio of the true classification variance
to the obtained classification variance: $\pi(1-\pi) / p(1-p)$, where $\pi$ and
$p$ are the true and obtained proportions of mastery classification,
respectively. This formula can also be expressed in terms of the cell
frequencies shown in Figure 1, i.e., $\pi(1-\pi) / p(1-p) = (A+B)(C+D) /
[(A+C)(B+D)]$ (Marshall, 1978). Marshall also found several statistical
problems with this formula. First, if $A = D$ or $B = C$, the ratio equals
one, regardless of the frequencies contained in the other two cells.
Second, the ratio can be greater than one if $.5 \le \pi < p$ or $p < \pi \le .5$.
Finally, if the obtained score variance is zero, either $p$ or $1-p$ must
equal zero, and the ratio will be undefined. The latter problem is
similar to that of the typical correlational reliability estimates.

In conclusion, neither the correlational indices nor the ratio of true
to obtained mastery score variance adequately reflects the reliability
of the data even though all these indices use a threshold loss

function.
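The three problems with the variance ratio can be checked directly from cell frequencies. This is a sketch under stated assumptions: the cells follow the Figure 1 layout (A and B are true masters, C and D are true non-masters; A and C are obtained masters), and every frequency below is hypothetical.

```python
from typing import Optional


def variance_ratio(a: int, b: int, c: int, d: int) -> Optional[float]:
    """True classification variance over obtained classification variance."""
    n = a + b + c + d
    true_prop = (a + b) / n      # proportion of true masters
    obtained_prop = (a + c) / n  # proportion classified as masters
    obtained_var = obtained_prop * (1 - obtained_prop)
    if obtained_var == 0:
        return None  # undefined when the obtained classification has no variance
    return true_prop * (1 - true_prop) / obtained_var


# B = C: the ratio is one whatever A and D are.
print(variance_ratio(70, 10, 10, 10))
# A = D: again one, even though 60% of examinees are misclassified.
print(variance_ratio(20, 50, 10, 20))
# True proportion nearer .5 than the obtained proportion: ratio exceeds one.
print(variance_ratio(60, 0, 30, 10))
# Everyone classified as a master: undefined.
print(variance_ratio(50, 0, 50, 0))
```

The second call is the most telling: the ratio signals perfect "reliability" for a table in which most decisions are wrong.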

 


                          Obtained Classification

True Classification           +           -
         +                    A           B
         -                    C           D

Figure 1.--Joint Distribution of True and Obtained Classifications

Not everyone in the field agrees that the Platonic true score
model and, consequently, threshold loss are appropriate for mastery
testing (Brennan, 1979; Kane & Brennan, 1977; Livingston, 1972b).
Although more will be said about this viewpoint at a later time, these
individuals contend that the major question in mastery testing is how
far an examinee's score is from the cut-off. This conceptualization
of reliability uses a squared error loss function. Despite this fact,
classical estimates are still inappropriate for evaluating the relia-
bility of this criterion-referenced interpretation because the former
uses a squared error loss function with respect to the mean while the
latter requires squared error loss with respect to the cut-off score.

One other issue should be addressed in judging the applicability
of classical reliability estimates to criterion-referenced measure-
ment. Does criterion-referenced measurement satisfy the assumptions
underlying classical test theory? Briefly, these assumptions are:

(1) $E = X - T$; (2) $\mathcal{E}(E) = 0$ in every non-null subpopulation of individuals;
(3) $\rho(T, E) = 0$; (4) $\rho(E_1, E_2) = 0$; and (5) $\rho(E_1, T_2) = 0$ (Lord & Novick,
1968). (The subscripts 1 and 2 denote parallel tests.) Brennan
(1974) noted that $\mathcal{E}(E)$ cannot be zero for the subpopulation with true

score equal to the highest value nor for the subpopulation with true

 

 

 


score equal to zero. Furthermore, Marshall (1978) found that these
assumptions did not fare very well under the Platonic true score
model.

Klein and Cleary (1967) have shown, among other things,
that with the Platonic true-score model, the correlation
of true and error scores is generally negative and
is zero only under extraordinary circumstances, that
the expected value of Platonic error is not likely to
be zero, and that errors on parallel tests cannot be
expected to have zero correlation (Marshall, 1978, p. 5).

To summarize, classical reliability estimates used in norm-
referenced measurement are inappropriate for criterion-referenced
measurement, in general, because of the former's dependence upon score
variance. In addition, classical estimates are particularly unsatis-
factory for mastery assessment for two other reasons: (1) they use an
inappropriate loss function, squared error loss with respect to the
mean; and (2) the classical assumptions underlying their use may not
be met if the Platonic true score model is accepted as the model for
mastery measurement.

Several reliability coefficients have been developed for
criterion-referenced measurement. Since mastery assessment is
involved in the vast majority of cases, most of the coefficients have
been proposed within this context. Basically, reliability formula-
tions for mastery assessment can be divided into two types based upon
whether they use a threshold loss function or a squared error loss
function with respect to the cut-off score. Of course, the choice of
a loss function depends upon one's definition of the purpose of
mastery testing. To reiterate, advocates of threshold loss contend

that mastery testing is a matter of classifying examinees into two or

 

 

 

possibly more mutually exclusive categories (Hambleton & Novick, 1973;
Kane & Brennan, 1977). Proponents of the opposing view assert that
the issue is the "degree to which the student has attained criterion
performance", implying that the estimation of the distance between the
examinee’s score and the cut-off score is the major concern (Glaser,
1963, p. 519). Both types of coefficients are presented below along
with studies evaluating their performance under a variety of test

characteristics.

Reliability Formulations Based Upon Squared Error Loss

Livingston. Livingston (1972b) developed a general form of the
typical reliability coefficient applicable to both criterion- and
norm-referenced measurement. He adapted the classical test theory
model by replacing the deviation of scores about the mean with the
deviation of scores about the cut-off in computing all relevant
statistics. For example, he defined a criterion-referenced correla-
tion coefficient as a product-moment parameter based on moments about
the cut-off. The classical test theory assumptions, the traditional
definitions of true score and errors of measurement, and the well-
known relationships among true, error, and observed scores remained
intact in his formulation. Corresponding to the classical reliability
definition, Livingston defined criterion-referenced reliability as the
squared criterion-referenced correlation between observed and true
scores. This definition in conjunction with the classical test theory

assumptions resulted in the following formula:

 

 

$$K^2(X, T_X) = \frac{\rho^2(X, T_X)\,\sigma_X^2 + (\mu_X - C_X)^2}{\sigma_X^2 + (\mu_X - C_X)^2} = \frac{\sigma_T^2 + (\mu_X - C_X)^2}{\sigma_X^2 + (\mu_X - C_X)^2}$$

 

 


where $C_X$ is the cut-off score, $\rho^2(X, T_X)$ equals any norm-referenced
reliability coefficient, $\sigma_X^2$ is the variance, $\mu_X$ is the mean, and $\sigma_T^2$
equals the true score variance (Livingston, 1972b).
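Livingston's coefficient is simple to compute once a norm-referenced reliability estimate is available. The sketch below is an illustration only; the function name and the numerical inputs (a reliability of .50, a score variance of 4, and a mean of 8) are hypothetical.

```python
from typing import Optional


def livingston_k2(norm_reliability: float, score_var: float,
                  mean: float, cutoff: float) -> Optional[float]:
    """Criterion-referenced reliability based on moments about the cut-off.

    norm_reliability: any norm-referenced reliability coefficient
    score_var:        observed score variance
    mean, cutoff:     score mean and cut-off score on the same scale
    """
    shift = (mean - cutoff) ** 2
    denominator = score_var + shift
    if denominator == 0:
        return None  # no variance and cut-off at the mean: undefined
    return (norm_reliability * score_var + shift) / denominator


for cutoff in (8.0, 6.0, 10.0, 4.0):
    print(cutoff, livingston_k2(0.5, 4.0, 8.0, cutoff))
```

With the cut-off at the mean the coefficient equals the norm-referenced value, and it rises symmetrically as the cut-off moves away from the mean in either direction.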

Lovett (1977) defined $K^2(X, T_X)$ in analysis of variance terms. He
assumed the data came from the responses of $n$ individuals to $k$
parallel measurements or items, resulting in $n \times k$ observations. The
design of the ANOVA was a randomized complete block design without
interaction. Given that the only major differences in using ANOVA to
estimate the reliability of norm-referenced versus mastery scores are
the degrees of freedom for both the total sum of squares and the sum
of squares for people, and the substitution of $C_X$ for the mean in all
relevant statistics, Lovett (1977) defined the reliability of mastery
measurement as:

$$K^2(X, T_X) = \frac{E(MS_p) - E(MS_e)}{E(MS_p)}$$

where all sums of squares are expressed as deviations from $C_X$.

 

Brennan and Kane. Brennan and Kane's "index of dependability"
was derived from generalizability theory rather than from classical
test theory (Brennan & Kane, 1977a). Two major differences between
these theories are: (1) classical test theory is built upon the
assumption of classically parallel tests, while generalizability
theory assumes random parallelism; and (2) classical test theory does
not differentiate among various types of errors while generalizability
theory does (Brennan, 1978; Cronbach, Gleser, Nanda, & Rajaratnam,
1972). This differentiation is accomplished by constructing a general

linear model of the data and using analysis of variance procedures to

derive a reliability coefficient. Since a complete understanding and
derivation of Brennan and Kane's index requires a very lengthy discus-
sion, only a brief outline of their analysis has been provided below.
Assuming that test data have been derived from a random sample of
items from an infinite domain of items and a random sample of people
from an infinite population, the following linear model represents the
observed score of person $p$ on item $i$:

$$X_{pi} = \mu + \mu_{p\sim} + \mu_{i\sim} + \mu_{pi\sim}$$

where

$\mu$ = grand mean in the population of persons and the domain of items
$\mu_{p\sim}$ = effect for person $p$
$\mu_{i\sim}$ = effect for item $i$
$\mu_{pi\sim}$ = effect for the interaction of $p$ and $i$ plus
experimental error (Brennan, 1979).
Given the assumptions of analysis of variance, this equation repre-
sents a random effects model for the pxi design (Brennan, 1979).
Similarly, the linear model for the proportion of items answered
correctly on a test is: $X_{pI} = \mu + \mu_{p\sim} + \mu_{I\sim} + \mu_{pI\sim}$, where the subscript "I"
denotes the average over a particular sample of items, and all
terms are expressed as averages (Brennan, 1979). From the sample
sizes and the mean squares generated in the analysis of variance of
the pxi design, the variance components associated with each of the
score effects in the latter equation can be derived. The total of
these variances equals the observed score variance. The variance
component associated with the effect of person $p$, $\sigma_p^2$, is called the
universe score variance and is comparable to the true score
variance of classical test theory (Brennan, 1979). Similarly, $\sigma_{pI}^2$
equals the error variance in classical test theory for a test of
length $n_i$ (Brennan, 1979). The variance of mean test scores over all
tests, $\sigma_I^2$, has no comparable statistic in classical test theory
(Brennan, 1979). This fact is not surprising since the classically
parallel test assumption requires equal test means. Therefore,
$\sigma_I^2 = 0$ for classical test theory (Brennan, 1979). Brennan and Kane
(1977a) used these variance components to derive indices of dependa-
bility for both norm- and criterion-referenced measurement. Although
the focus has been on the latter type of measurement, the index of
dependability for norm-referenced measurement is also presented below
for comparison purposes.

Cronbach et al. (1972) defined the index of dependability for
norm-referenced measurement as the ratio of universe score variance to
expected observed score variance. This ratio was found to equal:

$$\mathcal{E}\rho^2 = \frac{\sigma_p^2}{\sigma_p^2 + \sigma_{pI}^2}$$

This index is called the generalizability coefficient and its estimate
equals coefficient alpha (Brennan, 1979).
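The step from mean squares to variance components, and the equality between the estimated generalizability coefficient and coefficient alpha, can be sketched as follows. This is a minimal illustration that assumes the standard expected-mean-square solutions for a crossed persons-by-items random effects design with one observation per cell; the function names and the small data matrix are hypothetical.

```python
def variance_components(scores):
    """Estimate variance components for a persons x items design.

    `scores` holds one row per person and one column per item.
    Returns (var_p, var_i, var_pi): persons, items, and the
    person-by-item interaction confounded with error.
    """
    n_p, n_i = len(scores), len(scores[0])
    grand = sum(sum(row) for row in scores) / (n_p * n_i)
    person_means = [sum(row) / n_i for row in scores]
    item_means = [sum(row[j] for row in scores) / n_p for j in range(n_i)]

    ss_p = n_i * sum((m - grand) ** 2 for m in person_means)
    ss_i = n_p * sum((m - grand) ** 2 for m in item_means)
    ss_total = sum((x - grand) ** 2 for row in scores for x in row)
    ss_pi = ss_total - ss_p - ss_i

    ms_p = ss_p / (n_p - 1)
    ms_i = ss_i / (n_i - 1)
    ms_pi = ss_pi / ((n_p - 1) * (n_i - 1))

    # Solve the expected mean squares; estimates may be negative in small samples.
    return (ms_p - ms_pi) / n_i, (ms_i - ms_pi) / n_p, ms_pi


def generalizability_coefficient(scores):
    """Universe score variance over expected observed score variance."""
    n_i = len(scores[0])
    var_p, _, var_pi = variance_components(scores)
    return var_p / (var_p + var_pi / n_i)  # var_pi / n_i is the test-level error


def coefficient_alpha(scores):
    """Coefficient alpha computed from item and total score variances."""
    k = len(scores[0])

    def sample_var(values):
        m = sum(values) / len(values)
        return sum((v - m) ** 2 for v in values) / (len(values) - 1)

    item_vars = [sample_var([row[j] for row in scores]) for j in range(k)]
    total_var = sample_var([sum(row) for row in scores])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)


# Hypothetical responses of five persons to four dichotomous items.
data = [[1, 1, 1, 1], [1, 1, 1, 0], [1, 1, 0, 1], [1, 0, 0, 0], [0, 0, 1, 0]]
print(generalizability_coefficient(data), coefficient_alpha(data))
```

The two printed values agree up to rounding error, since both reduce to one minus the ratio of the interaction mean square to the persons mean square.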
To obtain a mastery testing coefficient, Brennan and Kane (1977a)
assumed the major interest is in estimating the difference between a

person's universe score and the cut-off, i.e., $\mu_p - \lambda$, where both terms
are expressed as proportions. To estimate this difference, the
cut-off is subtracted from the person's average test score, resulting
in an error of estimation equal to: $\Delta_{pI} = (X_{pI} - \lambda) - (\mu_p - \lambda)$,
where $X_{pI}$ is the mean observed score of person $p$ on test $I$, and the
other terms are defined as previously (Brennan & Kane, 1977a).

Brennan and Kane (1977a) proved that the variance of these
errors, $\sigma_\Delta^2$, equals $\sigma_I^2 + \sigma_{pI}^2$ and, then, defined the index of
dependability for mastery measurement in terms of expected squared
deviations from $\lambda$:

$$\Phi(\lambda) = \frac{\sigma_p^2 + (\mu - \lambda)^2}{\sigma_p^2 + (\mu - \lambda)^2 + \sigma_I^2 + \sigma_{pI}^2}$$

There are two very important distinctions between $\Phi(\lambda)$ and the
generalizability coefficient. First, the true deviation in the former
case equals $\sigma_p^2 + (\mu - \lambda)^2$ while it equals $\sigma_p^2$ for the generalizability
coefficient. The first quantity is the same as the true deviation or
numerator in Livingston's $K^2(X, T_X)$, while the numerator of the
generalizability coefficient is comparable to the true score variance
of classical reliability coefficients. Clearly, these similarities
and differences are a function of the intended score interpretations,
i.e., mastery measurement ($\mu_p - \lambda$) versus norm-referenced
measurement ($\mu_p - \mu$). Second, the error variance in $\mathcal{E}\rho^2$ equals $\sigma_{pI}^2$,
whereas it equals $\sigma_{pI}^2 + \sigma_I^2$ in $\Phi(\lambda)$. Obviously, the error variance
for $\Phi(\lambda)$ is greater than its counterpart in $\mathcal{E}\rho^2$ unless all the test
means are equal. The generalizability coefficient does not
incorporate $\sigma_I^2$ into the error variance because the test effect adds a
constant to every examinee's score resulting in no change in their
relative ordering, i.e., no change in the examinees' norm-referenced
scores (Brennan, 1979). Since mastery measurement concerns the
absolute magnitude of the distance between an examinee's universe
score and the cut-off, any effect which increases or decreases this
distance for a particular examinee results in an error of measurement
(Brennan, 1979). Despite this fact, Livingston's $K^2(X, T_X)$ and


Lovett's ANOVA-based index do not include $\sigma_I^2$ in their error variance
because of their underlying assumption of classical parallelism.
Livingston's $K^2(X, T_X)$ and $\Phi(\lambda)$ are equal when all the possible tests
in a domain have equal means (Brennan & Kane, 1977a).

Both $\mathcal{E}\rho^2$ and $\Phi(\lambda)$ can alternatively be viewed as monotonic
transformations of signal/noise ratios (Brennan & Kane, 1977b). The
signal/noise ratio indexes the relative precision of a measurement
procedure (Brennan & Kane, 1977b). The signal indicates the magnitude
of the desired discrimination needed to achieve the measurement pro-
cedure's purpose, and the noise represents the magnitude of the errors
or the effect of extraneous factors in blurring the desired discrimi-
nation (Brennan & Kane, 1977b). The relative sizes of the signal and
noise determine whether the desired discrimination can be made
(Brennan & Kane, 1977b). Brennan and Kane's derivation (1977b) of
this ratio was based upon the principles of generalizability theory.
For mastery measurement, the signal is defined as $\mu_p - \lambda$ and the power
of the signal ($S(d)$) equals the expected value of the squared signal
over persons or $\mathcal{E}_p(\mu_p - \lambda)^2$. The noise equals $X_{pI} - \mu_p$. Noise power
($N(d)$) is defined as the expected value of the squared noise over the
population of people and samples of items and is expressed
as $\mathcal{E}_I \mathcal{E}_p (X_{pI} - \mu_p)^2$. Combining this information, the signal/noise ratio
for mastery tests, $\psi(d)$, becomes:

$$\psi(d) = \frac{\mathcal{E}_p(\mu_p - \lambda)^2}{\mathcal{E}_I \mathcal{E}_p (X_{pI} - \mu_p)^2}, \quad \text{which equals} \quad \frac{\sigma_p^2 + (\mu - \lambda)^2}{\sigma_I^2 + \sigma_{pI}^2}$$

(Brennan & Kane, 1977b).

The index of dependability can now be expressed as:

$$\Phi(\lambda) = \frac{\psi(d)}{1 + \psi(d)} = \frac{S(d)}{S(d) + N(d)}$$

(Brennan & Kane, 1977b).

Although intuitively appealing, using signal/noise ratios to express
measurement precision has one major drawback from the author's
perspective: its upper limit is not one (Brennan, 1979).
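The link between the index of dependability and the signal/noise ratio follows in one algebraic step. As a brief check, write $S$ for the signal power and $N$ for the noise power of mastery measurement (this shorthand without arguments is used here only for brevity):

```latex
\[
\psi = \frac{S}{N}
\quad\Longrightarrow\quad
\frac{\psi}{1+\psi}
  = \frac{S/N}{(N+S)/N}
  = \frac{S}{S+N},
\qquad
S = \sigma_p^2 + (\mu-\lambda)^2,
\quad
N = \sigma_I^2 + \sigma_{pI}^2 .
\]
```

The transformation is monotonic, so the ratio and the index order measurement procedures identically; however, as $N$ approaches zero the ratio grows without bound while the index approaches one, which is the drawback noted above.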

Brennan and Kane's $\Phi$. Kane and Brennan (1977) have shown that
the quantities "$(\mu_X - C_X)^2$" in $K^2(X, T_X)$ and "$(\mu - \lambda)^2$" in $\Phi(\lambda)$ equal the
expected consistency due to chance factors. The expected chance
agreement depends only upon the marginal distribution of scores, not
the reliability of the examinee's performance (Kane & Brennan,
1977). Consequently, $K^2(X, T_X)$ and $\Phi(\lambda)$ can be large even when scores
from the distribution are randomly assigned to examinees on each test
administration (Kane & Brennan, 1977).

By subtracting the quantity $(\mu - \lambda)^2$ from both the numerator and
denominator of $\Phi(\lambda)$, Kane and Brennan (1977) introduced a coefficient
which does take chance agreement into account. This coefficient

equals:

$$\Phi = \frac{\sigma_p^2}{\sigma_p^2 + \sigma_I^2 + \sigma_{pI}^2}$$

where each term is defined as in $\Phi(\lambda)$. Note that $\Phi$ is the lower limit
of $\Phi(\lambda)$, occurring when $\lambda = \mu$ (Brennan & Kane, 1977b).
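The relationship among the three generalizability-based indices can be sketched numerically. The code below is an illustration under stated assumptions: the variance components are taken as already estimated and expressed on the proportion-correct scale, and the function names and numerical values are hypothetical.

```python
def dependability_at_cutoff(var_p, var_I, var_pI, mean, cutoff):
    """Index of dependability for a given cut-off, with squared deviations from it."""
    shift = (mean - cutoff) ** 2
    return (var_p + shift) / (var_p + shift + var_I + var_pI)


def dependability_general(var_p, var_I, var_pI):
    """Chance-corrected index: the cut-off term is removed from both parts."""
    return var_p / (var_p + var_I + var_pI)


def generalizability(var_p, var_pI):
    """Norm-referenced index: test-mean variance is not counted as error."""
    return var_p / (var_p + var_pI)


# Hypothetical components on the proportion-correct scale.
var_p, var_I, var_pI, mean = 0.020, 0.005, 0.015, 0.70

print("generalizability coefficient:", generalizability(var_p, var_pI))
print("general index:", dependability_general(var_p, var_I, var_pI))
for cutoff in (0.70, 0.80, 0.90):
    print(cutoff, dependability_at_cutoff(var_p, var_I, var_pI, mean, cutoff))
```

With the cut-off at the mean, the cut-off-specific index coincides with the general index; both fall below the generalizability coefficient whenever the variance of test means is positive.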

$\Phi$ is a general or multipurpose index of dependability for
criterion-referenced measurement (Brennan & Kane, 1977b). Specifically,
$\Phi$ can serve as a measure of the reliability of examinees'
absolute universe scores and/or as a general index of the reliability
of a measurement procedure designed for several decisions using
different cut-off scores (Kane & Brennan, 1977).

The signal power ($\sigma_p^2$) in $\Phi$ is the same as that in the generalizability
coefficient and is also comparable to the true score variance
in classical test theory. Brennan and Kane (1977b) stated that $\sigma_p^2$ is
the appropriate measure of signal power for $\Phi$ since it is independent
of the cut-off score and is often used as the signal power in physical
measurement. The difference between $\Phi$ and the generalizability
coefficient lies in the definition of the noise power. Within the context
of generalizability theory and the general linear model, Brennan
(1979) showed that the error variance in using the observed score as a
universe score estimate is $\sigma_I^2 + \sigma_{pI}^2$. As previously noted, the error
term in the generalizability coefficient is $\sigma_{pI}^2$. Therefore, $\Phi$ is
always less than or equal to the generalizability coefficient.
"Intuitively, this is a reasonable characteristic of $\Phi$ since
domain-referenced interpretations of 'absolute' scores are more 'stringent'
than norm-referenced interpretations of 'relative scores'" (Brennan,
1979, p. 23).

Related Coefficients. Two additional coefficients employing
squared error loss have been proposed within the context of reliability.
Harris' index of efficiency ($\mu_c^2$) equals the squared
correlation between mastery state, dummy coded as 0 or 1, and the
total test score (Harris, 1972b). In analysis of variance terms, $\mu_c^2$
equals $SS_b / (SS_b + SS_w)$, where $SS_b$ and $SS_w$ are the between- and
within-group sums of squares, respectively (Harris, 1972b). Harris (1972b)

stated that his coefficient can be interpreted as the ratio of true

score variance to obtained score variance if the group mean is defined
as the true score for every subject within the group. Clearly, the
validity of such a denotation is questionable since mastery and non-
mastery are not typically defined by just one score value. Another
problem is that $\mu_c^2$ is actually a squared point-biserial correlation
coefficient and, therefore, uses an inappropriate loss function
(squared-error with respect to the mean) and requires the presence of
variability.
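The equivalence between the analysis of variance form of Harris' index and a squared point-biserial correlation can be verified on a small data set. This is only a sketch: the function names, the rule that scores at or above the cut-off are coded 1, and the score list are assumptions made for illustration.

```python
from typing import Optional, Sequence


def harris_efficiency(scores: Sequence[float], cutoff: float) -> Optional[float]:
    """Between-group sum of squares over total sum of squares for two mastery groups."""
    masters = [x for x in scores if x >= cutoff]
    nonmasters = [x for x in scores if x < cutoff]
    if not masters or not nonmasters:
        return None  # requires variability in mastery state
    grand = sum(scores) / len(scores)
    ss_between = ss_within = 0.0
    for group in (masters, nonmasters):
        group_mean = sum(group) / len(group)
        ss_between += len(group) * (group_mean - grand) ** 2
        ss_within += sum((x - group_mean) ** 2 for x in group)
    return ss_between / (ss_between + ss_within)


def squared_point_biserial(scores: Sequence[float], cutoff: float) -> Optional[float]:
    """Squared correlation between 0/1 mastery state and total score."""
    states = [1 if x >= cutoff else 0 for x in scores]
    n = len(scores)
    mean_x, mean_s = sum(scores) / n, sum(states) / n
    cov = sum((x - mean_x) * (s - mean_s) for x, s in zip(scores, states))
    var_x = sum((x - mean_x) ** 2 for x in scores)
    var_s = sum((s - mean_s) ** 2 for s in states)
    if var_x == 0 or var_s == 0:
        return None
    return cov ** 2 / (var_x * var_s)


# Hypothetical total scores on a ten-item test with a cut-off of 8.
scores = [2, 3, 5, 6, 7, 8, 9, 9, 10, 10]
print(harris_efficiency(scores, 8), squared_point_biserial(scores, 8))
```

Both functions return the same value, and both return nothing useful when every examinee falls on one side of the cut-off, which illustrates the dependence on variability noted above.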

Similar to Harris' formula, Marshall's index of separation also
assesses the extent to which an instrument achieves separation between
two groups of people (Marshall, 1976). Assuming that the expected
test scores of the knowledgeable group and the not knowledgeable group
are the number of test items ($n$) and 0, respectively, Marshall (1976)
developed the following index:

2 2
C -X -
§ ri- 2 ix -§ - + E .f_ if}:
E. _.X - K.
593 (5‘ .2?ng 3-9;

 

where $f_x$ is the frequency of score $x$, $N$ equals the number of
examinees, and the other terms are as defined previously. Marshall's
definition of error is analogous to Harris', i.e., within-group variation,
and, consequently, using $S_c$ as a reliability measure for

mastery testing is not appropriate.

Characteristics of Squared-Error Loss Indices. Given a desire to
interpret scores relative to a particular cut-off, Livingston's
$K^2(X, T_X)$ and Brennan and Kane's $\Phi(\lambda)$ are currently the best squared
error loss reliability coefficients. If a coefficient accounting for
chance is desired, Brennan and Kane's $\Phi$ is appropriate. The evaluation
and interpretation of these indices require an analysis of the
factors influencing them. First, as can be seen from the formulas,
these three coefficients increase as the norm-referenced reliability
increases (Livingston, 1972b). Intuitively, a more accurate true
score estimate also implies less error in estimating the distance
between the true score and the cut-off. Second, given the previous
statements, it is no surprise that lengthening a test increases the
value of $K^2(X, T_X)$, $\Phi(\lambda)$, and $\Phi$. Livingston (1972b) algebraically
proved that the Spearman-Brown prophecy formula applies to $K^2(X, T_X)$,
and Marshall (1976) empirically supported this derivation. The magnitudes
of $\Phi(\lambda)$ and $\Phi$ increase since the error terms, $\sigma_I^2$ and $\sigma_{pI}^2$,
decrease (Brennan, 1979).
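The applicability of the Spearman-Brown formula can be seen from a short derivation. The following is a sketch under stated assumptions and is not necessarily the proof Livingston gave: scores are on the raw total-score scale, the test is lengthened by a factor $m$ with parallel components, the cut-off is lengthened proportionally, and $\sigma_E^2 = \sigma_X^2 - \sigma_T^2$ denotes the error variance. Writing $S = \sigma_T^2 + (\mu_X - C_X)^2$:

```latex
\[
K^2 = \frac{S}{S + \sigma_E^2},
\qquad
K^2_{(m)} = \frac{m^2 S}{m^2 S + m\,\sigma_E^2}
          = \frac{m S}{m S + \sigma_E^2},
\]
\[
\frac{m K^2}{1 + (m-1) K^2}
  = \frac{m S / (S + \sigma_E^2)}{\bigl(S + \sigma_E^2 + (m-1) S\bigr) / (S + \sigma_E^2)}
  = \frac{m S}{m S + \sigma_E^2}
  = K^2_{(m)} .
\]
```

The true deviation term grows with $m^2$ while the error variance grows only with $m$, which is the same mechanism that drives the classical result.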
Third, although a lack of variability severely affects classical
reliability coefficients, $K^2(X, T_X)$ and $\Phi(\lambda)$ do not suffer from this
limitation (Kane & Brennan, 1977; Livingston, 1972b). When $\sigma_X^2 = 0$,
$K^2(X, T_X)$ reduces to $(\mu_X - C_X)^2 / (\mu_X - C_X)^2$, which equals one when $C_X \ne \mu_X$
and is undefined when $C_X = \mu_X$ (Livingston, 1972b). The value assumed
by $\Phi(\lambda)$ when the variance equals zero is not as clear-cut. However,
the important point is $\Phi(\lambda)$ can still equal one given this
situation. This lack of dependence upon variability is intuitively
reasonable when placed within a signal/noise ratio context. Even
if $\sigma_p^2 = 0$, the signal will still be easy to detect as long as the
distance between the mean and the cut-off is large relative to the
noise (Brennan & Kane, 1977b). These statements do not mean the
variance has no effect on these coefficients. A change in the variance
can induce a change in $K^2(X, T_X)$ and $\Phi(\lambda)$ when the value of the

norm-referenced reliability coefficient is affected (Livingston,
1972b). On the other hand, o does depend upon the existence of score
variance and equals zero when everyone scores the same (Brennan,
1979).

Fourth, when the cut-off score equals the sample mean, $K^2(X, T_X) = r_{11}$
(norm-referenced reliability coefficient) and, given dichotomously
scored items, $\Phi(\lambda)$ = KR-21 (Brennan, 1977; Livingston, 1972b). In this
case, all the coefficients use squared error loss with respect to the
mean as the loss function. As the cut-off score moves farther away
from the mean in either direction, $K^2(X, T_X)$ and $\Phi(\lambda)$ increase (Brennan
& Kane, 1977a; Livingston, 1972b). In other words, the relationship
between these coefficients and the cut-off is characterized by a U
function with the lowest point occurring when $C_X$ and $\lambda$ equal the
mean. Obviously, $K^2(X, T_X) \ge r_{11}$ and $\Phi(\lambda) \ge$ KR-21 (Brennan, 1979).
Correspondingly, Schmitt and Schmitt (1977) found that the average KR-20
over 147 criterion-referenced tests was equal to .53 while the
average $K^2(X, T_X)$ was .67, and the difference between these coefficients
increased as the distance between the mean and the cut-off
increased. Likewise, Downing and Mehrens (1978) found that the mean
value of $K^2(X, T_X)$ taken over 33 achievement tests was greater than the
mean values of KR-20 and KR-21. Livingston (1972b) presented two
reasons why the value of $K^2(X, T_X)$ and $\Phi(\lambda)$ should change as the
distance between the mean and the cut-off changes: (1) if an indivi-
dual's obtained score is farther away from the cut-off, his true and
obtained scores are more likely to lie on the same side of the cut-

off. "Then, if two groups of scores have equal variance and equal

reliability in the norm-referenced sense, the group of scores

whose mean is farther away from the criterion score must have the
greater criterion-referenced reliability" (p. 18); and (2) a change in
the cut-off leads to a different interpretation of scores. Similarly,
Brennan and Kane (1977b) viewed an increase in the distance between
the mean and the cut-off as an increase in the ability to detect the
signal. Others have stated that these coefficients' sensitivity to
the relative position of the cut-off is either inappropriate or
undesirable (Harris, 1972a, 1973; Shavelson, Block, & Ravitch, 1972).

Shavelson et al. (1972) believed that the cut-off score's effect
on the size of $K^2(X, T_x)$ means the latter does not directly reflect the
measurement's repeatability. (This argument could also pertain
to $\Phi(\lambda)$.) However, given the desire to interpret scores in relation to
$C_x$, the difference between the mean and $C_x$ reflects true variance and,
therefore, $K^2(X, T_x)$ does reflect the measurement's consistency
(Livingston, 1972c).

Harris (1972a) proved that $K^2(X, T_x)$ equals the norm-referenced
reliability coefficient computed on pooled data from two populations
having equal $\sigma_T^2$, equal $\sigma_E^2$, and means equidistant above and
below the cut-off. According to Harris (1972a), $K^2(X, T_x)$ is deficient
because ceiling and floor effects do not always allow one to postulate
the existence of two means equidistant from $C_x$. Therefore, the higher
reliabilities obtained with $K^2(X, T_x)$ are simply due to implicitly
increasing the range of talent (Harris, 1972a). In rebuttal,
Livingston (1972a) stated, "Criterion-referenced test score
interpretations do not require that the criterion score be
conceptualized as the mean of some distribution" (p. 9). Simply stated,
one must reject the notion that the first moment of a distribution has
to be the mean (Lovett, 1977).

Livingston's coefficient was also criticized because the standard
error of measurement remains constant even though $K^2(X, T_x)$ increases
as the cut-off score moves further from the mean, i.e., the use of the
higher $K^2(X, T_x)$ as opposed to a classical coefficient does not lead to
a more dependable estimate of where a particular examinee truly falls
relative to $C_x$ (Harris, 1972a; Shavelson et al., 1972). This
criticism also applies to $\Phi(\lambda)$ (Brennan & Kane, 1977a). However,
reliability refers to the dependability of a group of scores, not a
single score (Livingston, 1972a). When a mastery decision must be
made for every group member, the larger value of $K^2(X, T_x)$ implies a
more reliable overall estimate of each member's mastery state
(Livingston, 1972a). The situation is analogous to the effect that an
increase in variance has upon a classical coefficient. Moreover, the
standard error of measurement and the squared error criterion-
referenced reliability coefficients provide different information:
the former measures the variability of an individual's scores
independent of the cut-off score, while the latter indicates the
consistency of scores relative to the cut-off (Berk, 1980).

In another critique, Harris (1973) showed that the squared
standard errors of estimate associated with a linear prediction of
true score and of the observed score on a parallel test increase when
$K^2(X, T_x)$ is substituted for $r_{11}$ in the regression equations.
Livingston (1973) considered this substitution inappropriate because
$r_{11}$, not $K^2(X, T_x)$, is the least squares linear regression
coefficient. Replacing $r_{11}$ by $K^2(X, T_x)$ removes the regressor for the
mean and clearly results in an increased residual variance
(Livingston, 1973).

Finally, the appropriateness of $K^2(X, T_x)$ and $\Phi(\lambda)$ was questioned
because these coefficients increase as the cut-off moves from the mean
toward either mode of a symmetric bimodal distribution (Marshall,
1976; Marshall & Serlin, 1979; Subkoviak, 1976). Intuitively, one
would expect the opposite to be true, i.e., $K^2(X, T_x)$ and $\Phi(\lambda)$ should
be greatest when the mean equals the cut-off since the mean is the
point of lowest score concentration and, therefore, the point at which
more people should be reliably assigned to mastery states (Marshall,
1976; Subkoviak, 1976). Clearly, this counterintuitive relationship
also applies to a unimodal skewed distribution (Marshall & Serlin,
1979). In short, the magnitude of $K^2(X, T_x)$ and $\Phi(\lambda)$ is sensitive to
the distance between the mean and the cut-off, but not to the
cut-off's position relative to the mode or to heavy score density areas
(Marshall & Serlin, 1979). This criticism is unwarranted given that
$K^2(X, T_x)$ and $\Phi(\lambda)$ define reliability in terms of the average squared
deviation from the cut-off. Like any mean, this average is heavily
affected by outliers present in skewed distributions. As the cut-off
approaches the mode of such distributions, the outliers become even
more influential, resulting in an increased average squared deviation.
An analogous process occurs for bimodal distributions. In summary, when
the cut-off approaches heavy score density areas, more individuals are
likely to be misclassified but the reliability in terms of the average
squared deviation increases and is appropriately reflected by the
magnitude of $K^2(X, T_x)$ and $\Phi(\lambda)$.

Reliability Formulations Based Upon Threshold Loss

Carver. Carver (1970) introduced two methods for assessing the
reliability of mastery measurement. One method consisted of
administering the same test to two comparable groups and comparing the
percentages of examinees achieving mastery in each group. In the
other procedure, the percentages of examinees achieving mastery on two
parallel tests are compared. Both procedures are subject to the same
limitation: the two percentages compared can be equal even if the
measure unreliably classifies every individual (Subkoviak, 1978b).
For example, according to the second procedure, perfect reliability
can be obtained when 40% of the examinees are classified as masters
based on the first test administration and a different 40% are
classified as masters on the second administration (Subkoviak, 1978b).
Another problem with these procedures is that they do not allow
consistent non-mastery decisions to contribute to the reliability
measure.
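
The limitation can be made concrete with a small hypothetical data
set. In the sketch below the ten mastery decisions per administration
are invented for illustration, and the helper names are assumptions:
the two percentages of masters agree exactly, yet no examinee is
classified as a master on both occasions.

```python
# 1 = master, 0 = non-master; ten hypothetical examinees.
first = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]   # 40% masters on the first test
second = [0, 0, 0, 0, 1, 1, 1, 1, 0, 0]  # a different 40% on the second


def percent_masters(decisions):
    """Percentage of examinees classified as masters."""
    return 100.0 * sum(decisions) / len(decisions)


def proportion_consistent(a, b):
    """Proportion of examinees given the same decision both times."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


print(percent_masters(first), percent_masters(second))
print(proportion_consistent(first, second))
```

The percentages are identical, while only the two examinees who are
non-masters on both occasions are classified consistently.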

Hambleton and Novick. Hambleton and Novick (1973) suggested that
reliability be expressed as the proportion of times a consistent
mastery decision is made with two parallel measurement procedures.
(Hambleton and Novick (1973) do not use the proportion correct score
as an examinee's true score estimate nor as the whole basis for
mastery classification. First, they recommend using a Bayesian
estimation procedure to determine the probabilities of an examinee
being a master and a non-master. Then, based upon the criterion of
minimizing threshold loss, these probabilities and the estimated
losses caused by making erroneous decisions are used to classify an
examinee.) Given $m$ mastery states, their index can be expressed as:

$$p_o = \sum_{i=1}^{m} p_{ii}$$

where $p_{ii}$ is the proportion of people classified in the $i$th mastery
state on both test administrations (Hambleton & Eignor, 1979). This
coefficient is frequently called the coefficient of agreement.
Although they certainly were not referring to mastery testing at the
time of their writing, Goodman and Kruskal (1954) had suggested using
this index as a reliability measure for two polytomies consisting of
the same classes.

The upper limit of $p_o$ is, of course, one. Its size is partly a
function of the magnitude of the cut-off score relative to the
examinees' ability level. For example, $p_o$ will be high when the
cut-off score is very low and the examinees have just completed a
training program relevant to the tested skill (Millman, 1974). In other
words, $p_o$ does not take into account the proportion of agreement
expected merely by chance (Kane & Brennan, 1977; Swaminathan et al.,
1974). This fact has led to criticism of this index since, as long as
the base rate for one category is high, $p_o$ can be high even if the
measurement procedure does not contribute to correct classification.

Goodman and Kruskal; Koslowsky and Bailit. In 1954, Goodman and
Kruskal advanced an alternative to $p_o$. They recommended using their
index when no relevant continuum underlying the classification scheme
existed and when the classifications did not have ordinal properties
(Goodman & Kruskal, 1954). One can easily argue that mastery
measurement satisfies neither of these conditions. However, using
their measure is possible when only two classifications exist since
the ordinal properties are largely irrelevant and since interest lies
in evaluating mastery, not an examinee's score on the underlying
continuum. The proposed reliability measure is:

 

$$\lambda_r = \frac{\sum_{i} p_{ii} - \frac{1}{2}\,(p_{M\cdot} + p_{\cdot M})}{1 - \frac{1}{2}\,(p_{M\cdot} + p_{\cdot M})}$$

where $p_{ii}$ is defined as previously, and $p_{M\cdot}$ and $p_{\cdot M}$ represent the
marginal proportions corresponding to the modal category for rows and
columns, respectively. The numerator equals the decrease in the
probability of misclassification occurring when an examinee's mastery
status is known on one test as opposed to when no information is
available (Goodman & Kruskal, 1959). In the latter case, the best
guess of the examinee's status is the modal class (Goodman & Kruskal,
1954). The denominator equals the probability of misclassification
given no information, and the coefficient equals the proportionate
decrease in the probability of misclassification as one moves from the
no information situation to a situation where the individual's status
is known on one test administration (Goodman & Kruskal, 1954).

Koslowsky and Bailit (1975) expanded upon this formula to
determine the reliability of a series of items. This extended index can
be used to assess the reliability of a series of mastery decisions.
Their measure simply equals the average of $\lambda_r$ taken over all the
mastery decisions ($p$):

$$\bar{\lambda}_r = \frac{1}{p} \sum_{j=1}^{p} \left[ \frac{\sum_{i} p_{ii} - \frac{1}{2}\,(p_{M\cdot} + p_{\cdot M})}{1 - \frac{1}{2}\,(p_{M\cdot} + p_{\cdot M})} \right]_j$$

A problem with $\lambda_r$ and $\bar{\lambda}_r$ is that they are indeterminate when all
examinees are masters (non-masters) and both test administrations
classify them as such. Clearly, the measurement is perfectly reliable
in this case. Koslowsky and Bailit (1975) suggested automatically
assigning a value of 1 to $\lambda_r$ when this situation occurs. Cohen (1960)
questioned the appropriateness of $\lambda_r$ as a reliability index since
using the modal category as the "best guess" in the no information
situation is more logical within the context of prediction rather than
reliability.
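
As an illustration, the index can be computed from a square table of
joint classification counts. The sketch below is not taken from the
sources cited; the function name and the example table are
hypothetical. It assumes that the modal class is the single category
whose row and column marginals have the largest sum, and it returns
None in the indeterminate case rather than substituting 1.

```python
def lambda_r(table):
    """Goodman and Kruskal's reliability index for a square table of
    joint classification counts (rows: first test, columns: second).
    Returns None when every examinee falls in one category on both
    tests, since the index is then indeterminate (0 / 0)."""
    m = len(table)
    total = float(sum(sum(row) for row in table))
    p = [[cell / total for cell in row] for row in table]
    agreement = sum(p[i][i] for i in range(m))
    row_marg = [sum(p[i]) for i in range(m)]
    col_marg = [sum(p[i][j] for i in range(m)) for j in range(m)]
    # Assumption: one modal class, chosen by the summed marginals.
    modal = 0.5 * max(row_marg[i] + col_marg[i] for i in range(m))
    if 1.0 - modal < 1e-12:
        return None
    return (agreement - modal) / (1.0 - modal)


# Hypothetical counts: masters listed first, non-masters second.
print(lambda_r([[50, 10], [10, 30]]))
print(lambda_r([[20, 0], [0, 0]]))
```

In the first table, 80% of the decisions agree and the modal marginal
is .60, so the index is about .50. The second table reproduces the
indeterminate case discussed above.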

Swaminathan, Hambleton, and Algina. To eliminate the influence
of chance agreement found with $p_o$, Swaminathan et al. (1974) proposed
using Cohen's coefficient kappa, $\kappa$. This coefficient is defined as:

$$\kappa = \frac{p_o - p_c}{1 - p_c}$$

where $p_c$ is the proportion of agreement expected by chance alone,
or $p_c = \sum_{i=1}^{m} p_{i\cdot}\, p_{\cdot i}$ (Cohen, 1960). The symbols $p_{i\cdot}$ and $p_{\cdot i}$
represent the marginal proportions in a joint classification of the
same decision categories on two test administrations, or the
proportion of examinees assigned to a mastery state, $i$, on the first
and second test administrations, respectively (Swaminathan et al.,
1974). Therefore, $p_c$ is actually a function of the group composition
and is the proportion of agreement one would obtain even if the two
administrations were statistically independent (Hambleton & Eignor,
1979).

The numerator of $\kappa$ equals the difference between the obtained and
the chance proportions of agreement while the denominator equals the
maximum value this difference can assume (Millman, 1974). Therefore,
$\kappa$ measures the proportion of agreement obtained over and above that
expected by chance alone and is, in a sense, independent of the
proportion of masters and non-masters in a particular group (Hambleton
& Eignor, 1979; Swaminathan et al., 1979).
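
The contrast between the coefficient of agreement and kappa can be
seen in a short computation. The sketch below uses a hypothetical
table of counts for an easy test on which most examinees are masters;
the function name is an assumption.

```python
def agreement_indices(table):
    """Return (p_o, p_c, kappa) for a square table of joint
    classification counts from two test administrations."""
    m = len(table)
    total = float(sum(sum(row) for row in table))
    p = [[cell / total for cell in row] for row in table]
    p_o = sum(p[i][i] for i in range(m))
    row_marg = [sum(p[i]) for i in range(m)]
    col_marg = [sum(p[i][j] for i in range(m)) for j in range(m)]
    p_c = sum(row_marg[i] * col_marg[i] for i in range(m))
    # Kappa is undefined when chance agreement is already perfect.
    kappa = None if 1.0 - p_c < 1e-12 else (p_o - p_c) / (1.0 - p_c)
    return p_o, p_c, kappa


# Hypothetical counts: 94 of 100 examinees are masters on each test.
print(agreement_indices([[90, 4], [4, 2]]))
```

Here the proportion of agreement is .92, but chance agreement is
already about .89, so kappa is only about .29.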

A limitation of $\kappa$, as well as of $p_o$, is that their computation
requires two test administrations. Since obtaining data on a parallel
test or a retest is not always feasible, an index of classification
consistency estimated from a single test administration is definitely
needed.

Subkoviak. Subkoviak (1976) offered a single test administration
estimate of $p_o$. He first defined the coefficient of agreement for
person $i$ as the probability of $i$ being placed in the same mastery
state on two parallel tests:

$$p_o^{(i)} = P(X_i \ge C_x,\; X_i' \ge C_x) + P(X_i < C_x,\; X_i' < C_x)$$

where $X$ and $X'$ represent the two test administrations. The first term
on the right of the equation denotes the joint probability of person $i$
being consistently classified as a master, and the second term
represents the joint probability of a consistent non-mastery
decision. Subkoviak then defined the coefficient of agreement ($p_o$)
for a group of $N$ examinees as the mean of the individual $p_o^{(i)}$:

$$p_o = \sum_{i=1}^{N} p_o^{(i)} \Big/ N$$

To obtain estimates of $p_o^{(i)}$ from a single test administration,
Subkoviak assumed: (1) scores on the two tests were independent for a
fixed examinee; and (2) given an individual's true score, the
conditional obtained score distributions on both tests were identically
binomial. These assumptions led to the following equation for $p_o^{(i)}$:

$$p_o^{(i)} = \left[P(X_i \ge C_x)\right]^2 + \left[1 - P(X_i \ge C_x)\right]^2$$

where

$$P(X_i \ge C_x) = \sum_{X_i = C_x}^{n} \binom{n}{X_i}\, p_i^{X_i} (1 - p_i)^{n - X_i}$$

In the latter equation, $p_i$ denotes an individual's true probability of
obtaining a correct item response, $n$ equals the number of test items,
and $X_i$ represents individual $i$'s obtained score.

Once $P(X_i \ge C_x)$ has been calculated from the data obtained on one
test administration, both $p_o^{(i)}$ and $p_o$ can be easily computed. The
key to determining $P(X_i \ge C_x)$ is estimating $p_i$. One could choose the
maximum likelihood estimate which equals $X_i/n$ (Subkoviak, 1976).
However, the standard error of this estimate is $\sqrt{p_i(1 - p_i)/n}$, which is
relatively large when $n \le 40$ (Subkoviak, 1976). Due to this
limitation, Subkoviak (1976) recommended using a regression estimate
of $p_i$ when the observed scores approximately follow a negative
hypergeometric unimodal distribution. Specifically, he proposed the
following equation:

$$\hat{p}_i = \left[\alpha_{21}\,(X_i/n)\right] + \left[(1 - \alpha_{21})\,(\mu_x/n)\right]$$

where $\alpha_{21}$ and $\mu_x$ equal KR-21 and the test mean, respectively. (In a
later paper, Subkoviak (1978b) used KR-20 instead of KR-21 in this
equation.) This regression estimate is particularly useful when $n$ is
small because the estimate incorporates collateral information
provided by the group (Subkoviak, 1976). The validity of this
approach depends upon the sample estimate of the mean and the
reliability (Subkoviak, 1976). When the test score distribution is
bimodal, Subkoviak (1976) recommended computing separate regression
equations for each group or population. Bayesian estimation
procedures have also been developed.
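
The computations described above can be sketched in a few lines. This
is a minimal illustration, not Subkoviak's own program: the function
names and the ten scores are hypothetical, KR-21 is computed with the
variance divided by N (an assumption), and the sketch presumes that the
scores vary and that KR-21 falls between 0 and 1.

```python
from math import comb


def kr21(scores, n_items):
    """KR-21 from the score mean and variance (variance divided by N)."""
    big_n = len(scores)
    mean = sum(scores) / big_n
    var = sum((x - mean) ** 2 for x in scores) / big_n
    return (n_items / (n_items - 1)) * (
        1 - mean * (n_items - mean) / (n_items * var))


def prob_master(p, n_items, cutoff):
    """Binomial probability of scoring at or above the cut-off."""
    return sum(comb(n_items, x) * p ** x * (1 - p) ** (n_items - x)
               for x in range(cutoff, n_items + 1))


def subkoviak_po(scores, n_items, cutoff):
    """Single-administration estimate of the coefficient of agreement."""
    big_n = len(scores)
    mean = sum(scores) / big_n
    rel = kr21(scores, n_items)
    individual = []
    for x in scores:
        # Regression estimate of the examinee's true proportion correct.
        p_hat = rel * (x / n_items) + (1 - rel) * (mean / n_items)
        q = prob_master(p_hat, n_items, cutoff)
        individual.append(q ** 2 + (1 - q) ** 2)
    return sum(individual) / big_n


scores = [3, 5, 6, 7, 7, 8, 8, 9, 9, 10]  # hypothetical 10-item test
print(subkoviak_po(scores, n_items=10, cutoff=8))
```

Because each individual coefficient has the form q^2 + (1 - q)^2, the
group estimate obtained this way cannot fall below .50.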

Algina and Noe (1978) compared the bias and standard error of $\hat{p}_o$
based upon the maximum likelihood true score estimate to that of $\hat{p}_o$
using the regression estimate. (They used KR-20 rather than KR-21 as
the regressor.) Bias and standard error were defined as the mean
square deviation of $\hat{p}_o$ from $p_o$ over replications and the standard
deviation of this estimate across replications, respectively. The
data were simulated using various values for the number of examinees,
true score variance, number of items, and cut-off score. Basically,
the bias of $\hat{p}_o$ for both true score estimators was affected by the
cut-off score, the true score variance, and the number of items.
However, changes in these factors affected the extent and/or direction
of the biases associated with these two models differently. The
regression estimator resulted in a substantially biased estimate when
the cut-off scores were close to the mean true score and KR-20 $\ge$ .48.
In all other cases, the bias was reasonably small. On the other hand,
the maximum likelihood estimator tended to result in a substantially
biased $\hat{p}_o$ when the cut-off was close to the mean and KR-20 $\le$ .32. The
standard error of $\hat{p}_o$ for both estimators was small in all conditions
and increased slightly as the number of examinees decreased. Algina and
Noe concluded that, in most cases, using the KR-20 estimator with the
binomial error model produced accurate $p_o$ estimates for tests
conforming to this model. However, they also suggested averaging the
maximum likelihood and regression $p_o$ estimates when KR-20 is large.

Since Subkoviak's procedure depends upon both an independence and
a binomial assumption, the reasonableness of these assumptions should
be discussed. The former assumption means the errors of measurement
on parallel tests are independent for examinee $i$ and can be met if the
tests contain different items or are administered at different times
(Subkoviak, 1976). These are obviously the conditions under which
many of the classical reliability estimates are determined. When the
independence assumption is violated, Subkoviak's index under- or
overestimates the dual administration $p_o$ depending upon whether the two
tests are positively or negatively correlated, respectively
(Subkoviak, 1976).

To satisfy the binomial assumption, items must be independent and
have the same difficulty level. These conditions may not accurately
reflect the real world (Gross & Shulman, 1980; Subkoviak, 1976).
According to Brennan (1979), items should not be expected to have the
same difficulty level. Violating this assumption results in a
conservative estimate of mastery classification consistency
(Subkoviak, 1976). More accurate $p_o$ estimates can be obtained by
replacing the binomial with a compound binomial model which allows
varying item difficulties (Subkoviak, 1976). However, Marshall and
Serlin (1979) found that these two models produced almost the same
results, except in one case.

The binomial error model seems better suited for describing the
conditional test score distribution than does the normal error model
typically applied in norm-referenced measurement (Brennan, 1974). The
problem with the latter model concerns its assumptions that the errors
of measurement are independent of true score and are distributed
normally with a mean of zero and homogeneous variances. In criterion-
referenced measurement, individuals commonly receive a score of one
(expressed as the proportion of correct answers) since they are being
trained to achieve mastery (Brennan, 1974). Adopting the classical
assumption that $\mathcal{E}(E \mid T) = 0$ implies that people with a true score of
one always obtain this score (Lord & Novick, 1968). Likewise, those
with a true score equalling zero must always score zero since the
observed score can never be negative (Lord & Novick, 1968). In either
case, the variance of the errors of measurement is zero.

     This conclusion shows that under any model with bounded
     observed score and unbiased errors (not all zero), the
     conditional distribution of the observed score cannot
     be independent of true score; equally, the conditional
     distribution of the error of measurement cannot be
     independent of true score (Lord & Novick, 1968, p.
     509).

Consequently, the normality, homogeneity of variance, and independence
assumptions of the normal error model are not appropriate for
describing the conditional score distribution for criterion-referenced
measurement (Brennan, 1974). The formula for the binomial
distribution indicates that this model does not make these assumptions.

In summary, Subkoviak's procedure requires the following steps:
(1) compute $\hat{p}_i$ through the appropriate regression equation; (2)
compute $P(X_i \ge C_x)$ assuming a binomial distribution of test scores given
$\hat{p}_i$; (3) determine $p_o^{(i)}$; (4) sum the individual $p_o^{(i)}$ and divide by $N$
to obtain $p_o$. If the univariate and bivariate score distributions are
approximately normal, the procedure outlined above need not be used to
estimate $p_o$ (Subkoviak, 1976). Subkoviak (1976) proposed the
following equation:

$$p_o = 1 - 2\left[P(Z < c_x) - P(Z < c_x,\; Z' < c_x)\right]$$

where $c_x = (C_x - .5 - \mu_x)/\sigma_x$, $\sigma_x$ equals the standard deviation of $X$, and $\mu_x$
equals the mean of $X$ (Subkoviak, 1976). In this equation, $P(Z < c_x)$
represents the probability that a standardized normal variable is less
than $c_x$ and can be found in univariate normal distribution tables.
$P(Z < c_x,\; Z' < c_x)$ is the probability that two standardized normal
variables with a correlation equal to KR-20 are both less than $c_x$.
This probability is obtained from tables of the bivariate normal
distribution.
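
In place of printed tables, the two normal probabilities can be
obtained numerically. The sketch below is an illustration under stated
assumptions rather than Subkoviak's procedure: the function names are
hypothetical, the bivariate probability is computed by integrating the
conditional normal probability with Simpson's rule, the lower
integration limit is truncated at -8, and the correlation must lie
strictly between -1 and 1.

```python
from math import erf, exp, pi, sqrt


def normal_cdf(z):
    """P(Z < z) for a standard normal variable."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))


def bivariate_cdf(c, rho, steps=2000, lower=-8.0):
    """P(Z < c, Z' < c) for standard normal variables with correlation
    rho, by Simpson's rule (steps must be even)."""
    if c <= lower:
        return 0.0
    h = (c - lower) / steps
    s = sqrt(1.0 - rho * rho)

    def g(z):
        # Density of Z at z times P(Z' < c | Z = z).
        return exp(-0.5 * z * z) / sqrt(2.0 * pi) * normal_cdf((c - rho * z) / s)

    total = g(lower) + g(c)
    for k in range(1, steps):
        total += (4 if k % 2 else 2) * g(lower + k * h)
    return total * h / 3.0


def normal_po(cutoff, mean, sd, reliability):
    """Normal-approximation estimate of the coefficient of agreement."""
    c = (cutoff - 0.5 - mean) / sd
    return 1.0 - 2.0 * (normal_cdf(c) - bivariate_cdf(c, reliability))


# Hypothetical test: mean 10, standard deviation 2, KR-20 of .50.
print(normal_po(cutoff=10.5, mean=10.0, sd=2.0, reliability=0.5))
```

With the standardized cut-off at zero and a correlation of .50, the
bivariate probability is 1/3, so the estimate is 2/3.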

At a later time, Subkoviak (1978b) introduced a single test
administration estimate of coefficient kappa by computing the
probability of chance agreement which would occur given his model of
the data. This probability equals:

$$p_c = 1 - 2\left[\frac{\sum_{i=1}^{N} \hat{P}(X_i \ge C_x)}{N} - \left(\frac{\sum_{i=1}^{N} \hat{P}(X_i \ge C_x)}{N}\right)^2\right]$$

This formulation was derived by defining the base rate for mastery
classification as the average probability (taken across examinees) of
being designated a master.
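
A brief sketch shows how the chance term combines with the individual
mastery probabilities to give kappa. The function name and the example
probabilities are hypothetical; the probabilities would come from the
binomial model described earlier.

```python
def subkoviak_kappa(master_probs):
    """Return (p_o, p_c, kappa) from each examinee's estimated
    probability of scoring at or above the cut-off."""
    big_n = len(master_probs)
    p_o = sum(q * q + (1 - q) ** 2 for q in master_probs) / big_n
    base = sum(master_probs) / big_n          # base rate for mastery
    p_c = 1 - 2 * (base - base ** 2)
    return p_o, p_c, (p_o - p_c) / (1 - p_c)


# Hypothetical group: two likely masters and two likely non-masters.
print(subkoviak_kappa([0.9, 0.9, 0.1, 0.1]))
```

When every examinee has the same mastery probability, the agreement
and chance terms coincide and kappa is zero.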

Marshall and Haertel. Marshall and Haertel (1975) also proposed
a single administration estimate of $p_o$, known as coefficient beta
($\beta$). Their coefficient equals "the mean of all possible split-half
coefficients of agreement" and is, consequently, analogous to
coefficient alpha (Marshall & Haertel, 1975, p. 3).

To derive $\beta$, scores on a hypothetical $2n$-item test must first be
simulated from examinees' scores on an $n$-item test. Using the
binomial error model, this simulation is accomplished via the
following equation:

$$f_w = \sum_{x=0}^{n} f_x \binom{2n}{w} \left(\frac{x}{n}\right)^{w} \left(1 - \frac{x}{n}\right)^{2n-w}$$

where $f_x$ denotes the frequency of score $x$ on the $n$-item test, and $f_w$
equals the frequency of score $w$ on a $2n$-item test. Using these
simulated scores, Marshall and Haertel define $\beta$ for an $n$-item test as:

$$\beta = \frac{1}{v} \sum_{q=1}^{v} p_q$$

where $p_q$ is the proportion of agreement between two split-half
tests of $n$ items each, and $v$ is the number of possible splits
which can be obtained from the $2n$-item test. The latter quantity
equals $\binom{2n}{n}$. Marshall and Haertel's computational formula for $\beta$ is:

$$\beta = \frac{1}{N}\left[\sum_{w=0}^{C_x-1} f_w + \sum_{w=C_x}^{2C_x-2} f_w\, Q_w(w - C_x + 1,\; C_x - 1) + \sum_{w=2C_x}^{n+C_x-1} f_w\, Q_w(C_x,\; w - C_x) + \sum_{w=n+C_x}^{2n} f_w\right]$$

where $N$ = number of examinees
      $w$ = examinee's score on a $2n$-item test
      $n$ = number of test items
      $f_w$ = frequency of score $w$
      $C_x$ = cut-off score on an $n$-item test
      $Q_w(a, b) = \sum_{j=a}^{b} \binom{w}{j}\binom{2n-w}{n-j} \Big/ \binom{2n}{n}$, or the proportion of splits
      resulting in a half-test score of from $a$ to $b$ inclusive, given a
      total score of $w$.
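
The computational formula lends itself to a short program. The sketch
below is an illustrative reading of the formula, not Marshall and
Haertel's code: the function names and the score frequencies are
hypothetical, the proportion correct serves as the true score
estimate, and the cut-off is assumed to lie between 1 and n.

```python
from math import comb


def simulate_double_length(freq_x, n):
    """Frequencies of scores 0..2n on a 2n-item test, simulated from
    the frequencies of scores 0..n with the binomial error model."""
    return [sum(f * comb(2 * n, w) * (x / n) ** w * (1 - x / n) ** (2 * n - w)
                for x, f in enumerate(freq_x))
            for w in range(2 * n + 1)]


def q_w(w, a, b, n):
    """Proportion of splits giving a half-test score from a to b."""
    return sum(comb(w, j) * comb(2 * n - w, n - j)
               for j in range(a, b + 1) if 0 <= j <= n) / comb(2 * n, n)


def split_agreement(w, n, cutoff):
    """Proportion of splits that classify a total score w consistently."""
    if w <= cutoff - 1 or w >= n + cutoff:
        return 1.0                     # both halves always agree
    if w <= 2 * cutoff - 2:
        return q_w(w, w - cutoff + 1, cutoff - 1, n)   # both non-masters
    if w == 2 * cutoff - 1:
        return 0.0                     # the halves can never agree
    return q_w(w, cutoff, w - cutoff, n)               # both masters


def beta_coefficient(freq_x, n, cutoff):
    """Mean split-half coefficient of agreement."""
    f_w = simulate_double_length(freq_x, n)
    weighted = sum(f * split_agreement(w, n, cutoff) for w, f in enumerate(f_w))
    return weighted / sum(freq_x)


# Hypothetical frequencies of scores 0..4 on a 4-item test, cut-off 3.
print(beta_coefficient([1, 2, 3, 2, 2], n=4, cutoff=3))
```

Keeping the per-score agreement in its own function makes the four
ranges of the formula visible and allows each to be checked against a
direct enumeration of splits.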

As can be seen, $\beta$ is the mean of its additive parts and,
therefore, each examinee's score makes a specific contribution to
beta's magnitude (Marshall, 1976). The further the score departs from
the value $2C_x - 1$, the more it contributes to the size of $\beta$ (Marshall,
1976). A score equalling $2C_x - 1$ makes a zero contribution; at this
particular value, the examinee must always be classified as a master
on one half of the test and a non-master on the other half (Marshall,
1976).

Similar to Subkoviak's model, the validity of using the binomial
error model in Marshall and Haertel's formula is questionable.
However, results of a study investigating the bias of various
estimates showed that $\beta$ produced quite accurate estimates of $p_o$ when
items were not homogeneous, particularly for longer tests ($n = 30$, $n = 50$)
(Subkoviak, 1978a).

One drawback of this model, as noted in a personal communication
from Marshall (1980), is the use of the proportion correct score as
the true score estimate in computing $f_w$. As previously mentioned, the
standard error of this estimate is reasonably large when $n \le 40$.
Apparently, Marshall no longer recommends this procedure (Marshall &
Serlin, 1979). A regression or Bayesian estimate can easily be
incorporated into the procedure. In one study, Marshall and Serlin (1979)
actually used a predictive Bayesian beta model as well as other models
to obtain the frequency distribution for a $2n$-item test.

Huynh. Huynh (1976) developed a single administration estimate
of $p_o$ and kappa based upon Keats and Lord's beta-binomial test score
model. Like Subkoviak's and Marshall and Haertel's formulations, this
model assumes an examinee's scores given his/her true score follow a
binomial distribution (Huynh, 1976; Keats & Lord, 1962). According to
Huynh (1976), assuming similarity of item difficulty and item content
(i.e., item exchangeability) is reasonable for criterion-referenced
measurement because all items should measure a single trait.
Moreover, his $p_o$ appears robust with respect to violation of the
former assumption (Subkoviak, 1978a). Specifically, violation of this
assumption resulted in slightly conservative estimates of reliability
for a 10-item test and had little effect on longer tests (Subkoviak,
1978a).

The Keats-Lord model also assumes true scores follow a beta
distribution. The beta distribution family includes a wide range of
shapes although multi-humped distributions are not included (except
for a U-shaped function where the modes occur at 0 and $n$). The
parameters of the beta distribution, $\alpha$ and $\beta$, can be computed from the
mean and standard deviation of a large sample score distribution:

$$\alpha = \left(-1 + \frac{1}{\alpha_{21}}\right)\mu_x$$

$$\beta = -\alpha - n + \frac{n}{\alpha_{21}}$$

where $\alpha_{21} = \text{KR-21} = \frac{n}{n-1}\left[1 - \frac{\mu_x\,(n - \mu_x)}{n\,\sigma_x^2}\right]$ (Huynh, 1976).
Under the beta-binomial model, the observed score distribution
has a negative hypergeometric distribution with the following density:

$$f(x) = \binom{n}{x} \frac{B(\alpha + x,\; n + \beta - x)}{B(\alpha, \beta)}$$

where $B$ denotes the beta function (Huynh, 1976). Huynh (1976) has
provided computational formulas for evaluating $f(x)$. Estimating $p_o$
and kappa also requires determining the joint distribution of
equivalent test forms, $f(x, y)$. Assuming local independence with
respect to the true score, $f(x, y)$ can be simulated. This distribution
follows a bivariate negative hypergeometric or beta-binomial
distribution with the following density:

$$f(x, y) = \binom{n}{x}\binom{n}{y} \frac{B(\alpha + x + y,\; 2n + \beta - x - y)}{B(\alpha, \beta)} \quad \text{(Huynh, 1976).}$$

Huynh (1976) also presented computational formulas for $f(x, y)$.

Given a particular cut-off score, these formulas can be used to
calculate the proportion of examinees who would be placed in the
mastery category on both test forms ($p_{11}$), the proportion who would be
consistently classified as non-masters ($p_{00}$), and the proportion who
would be given mastery status by a single form ($p_1$). These
proportions are defined in the following manner:

$$p_{11} = \sum_{x = C_x}^{n} \sum_{y = C_x}^{n} f(x, y)$$

$$p_{00} = \sum_{x = 0}^{C_x - 1} \sum_{y = 0}^{C_x - 1} f(x, y)$$

$$p_1 = \sum_{x = C_x}^{n} f(x) \quad \text{(Huynh, 1976).}$$

Given the assumption that the marginal distribution is the same for
each form, Huynh (1976) defined $p_o$ and kappa as:

$$p_o = p_{11} + p_{00}$$

$$\kappa = \frac{p_{11} - p_1^2}{p_1 - p_1^2}$$

When the cut-off score is small, the following formula for $\kappa$ is far
more convenient:

$$\kappa = \frac{p_{00} - p_0^2}{p_0 - p_0^2}$$

where $p_0$ is the proportion of examinees classified as non-masters by
a single test form (Huynh, 1976).
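
The beta-binomial computations can be sketched directly from the
densities given above. This is an illustration rather than Huynh's
computational formulas: the function names and the summary statistics
are hypothetical, the beta function is evaluated through log-gamma
values, and the sketch presumes that KR-21 lies strictly between 0 and
1 so that both beta parameters are positive.

```python
from math import comb, exp, lgamma


def log_beta(a, b):
    """Natural log of the beta function B(a, b)."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)


def beta_params(mean, sd, n):
    """Beta parameters from the score mean, standard deviation, and
    number of items, via KR-21."""
    kr21 = (n / (n - 1)) * (1 - mean * (n - mean) / (n * sd ** 2))
    alpha = (-1 + 1 / kr21) * mean
    beta = -alpha - n + n / kr21
    return alpha, beta


def f_single(x, n, a, b):
    """Negative hypergeometric density of one form's score."""
    return comb(n, x) * exp(log_beta(a + x, n + b - x) - log_beta(a, b))


def f_joint(x, y, n, a, b):
    """Joint density of the scores on two equivalent forms."""
    return comb(n, x) * comb(n, y) * exp(
        log_beta(a + x + y, 2 * n + b - x - y) - log_beta(a, b))


def huynh_indices(mean, sd, n, cutoff):
    """Return (p_o, kappa) for a given cut-off score."""
    a, b = beta_params(mean, sd, n)
    masters = range(cutoff, n + 1)
    p1 = sum(f_single(x, n, a, b) for x in masters)
    p11 = sum(f_joint(x, y, n, a, b) for x in masters for y in masters)
    p00 = sum(f_joint(x, y, n, a, b)
              for x in range(cutoff) for y in range(cutoff))
    return p11 + p00, (p11 - p1 ** 2) / (p1 - p1 ** 2)


# Hypothetical 10-item test: mean 7, standard deviation 2, cut-off 8.
print(huynh_indices(mean=7.0, sd=2.0, n=10, cutoff=8))
```

Evaluating the function over a range of cut-off scores is one way to
examine how the two coefficients respond to the cut-off's location.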

When the number of test items is moderately large (e.g., $n > 10$),
Huynh (1976) suggested using a normal approximation procedure to
estimate kappa. In this procedure, an arcsine transformation is
applied to the data, resulting in an approximately normal score
distribution. Univariate and bivariate normal distribution tables are
then used to estimate the probabilities needed for computing $\kappa$.

Peng and Subkoviak (1980) found that, in the vast majority of the
simulated distributions, a simple normal approximation procedure using
Yates' correction resulted in less proportionate error in estimating $\kappa$
than did Huynh's normal approximation procedure. Peng varied the beta
distribution parameters, the cut-off score, and the test length. The
upper limit of the latter variable was 30. Using real data, Peng
(1979) corroborated these findings. The superiority of the simple
normal procedure was more pronounced for short tests and/or moderate
cut-off scores (between 65% and 85%). Similar results were obtained
when the two normal approximation procedures were used to estimate $p_o$
(Peng, 1979; Peng & Subkoviak, 1980).

Characteristics of Threshold Loss Indices. As can be seen, the
most appropriate threshold loss coefficients are divided into two
categories: (1) $p_o$ coefficients; and (2) kappa coefficients. Because
the former indices do not take account of chance agreement while the
latter ones do, various population and test characteristics affect $p_o$
and kappa differently. Since research has shown these factors affect
dual and single administration coefficients similarly, the following
discussion applies to both unless otherwise stated.

First, under the assumption of exchangeability, the theoretical
lower limit of $p_o$ is the proportion of agreement expected by chance,
while kappa's limit is zero (Huynh, 1978; Subkoviak, 1978b). In
general, however, the lower limit of kappa, computed from two test
administrations, depends upon the marginal distributions (Cohen,
1960). The upper limit of both coefficients is +1.00.

Second, as the cut-off approaches the extremes, $p_o$ generally
approaches one (Marshall, 1976; Marshall & Haertel, 1975; Subkoviak,
1976, 1977). This trend is particularly evident for symmetric
unimodal distributions (Marshall & Haertel, 1975). On the other hand,
kappa generally approaches its lowest value as the cut-off moves
toward the distribution extremes (Huynh, 1976; Subkoviak, 1977). This
difference can be partly explained by the fact that the probability of
chance consistency generally tends toward one as the cut-off
approaches the extremes (Huynh, 1976). Therefore, $p_o$ also approaches
one, while kappa decreases because not much opportunity exists for
increasing agreement above chance (Huynh, 1976).

Third, the magnitude of $p_o$ has been found to increase as the
distance between the cut-off and areas of heavy score density (e.g.,
the mode) increases (Eignor & Hambleton, 1979; Marshall, 1976;
Subkoviak, 1976, 1977). Given $r_{11} < 1.00$, examinees scoring close to
the cut-off on the first test administration could easily obtain a
score on the opposite side of the cut-off on the second
administration. On the other hand, those further away from the cut-off
would more likely be placed in the same mastery state in both testing
sessions. Therefore, the greater the number of scores further away
from the cut-off, the higher the $p_o$. Exceptions to this relationship
have been found for the single administration coefficients (Marshall &
Serlin, 1979). Marshall and Serlin (1979) examined the behavior of
these coefficients given five different distributions: (1)
bell-shaped; (2) highly negatively skewed unimodal; (3) bimodal with a
stronger mode at the higher end; (4) symmetric bimodal with modes
widely separated; and (5) symmetric bimodal with modes close
together. With the exception of the fifth distribution, the size of
Subkoviak's $p_o$ generally reflected the distance between the cut-off
and the mode for both unimodal and bimodal distributions. Fortunately,
the fifth distribution is atypical in mastery testing (Marshall &
Serlin, 1979). Huynh's $p_o$ reflected the cut-off's position for
unimodal distributions and bimodal distributions with extreme modes,
but not for bimodal distributions not belonging to the beta-binomial
family. For Marshall and Haertel's index, five different test score
models were used to simulate scores on a $2n$-item test from scores on
an $n$-item test. The adequacy of their $\hat{p}_o$ in reflecting the cut-off's
relative position depended upon the model used to generate scores. One
of the best models was a binomial regression model comparable to that
used in Subkoviak's index. This model produced results similar to
those obtained with Subkoviak's $\hat{p}_o$. An averaged double binomial model
introduced by Marshall and Serlin also reflected the location of the
mode(s) for both unimodal and bimodal distributions.

In contrast, given the assumption of exchangeability, Huynh
(1978) mathematically proved that kappa is an inverted U function of
the cut-off when the data are normally distributed. This relationship
was also empirically supported for normally distributed data as well
as for various beta-binomial and some bimodal distributions (Eignor &
Hambleton, 1979; Huynh, 1976, 1978; Marshall & Serlin, 1979;
Subkoviak, 1977). Apparently, the location of the cut-off relative to
the score density affects kappa in a manner opposite to its effect on
$p_o$, i.e., kappa is greater when the cut-off is located near heavy
score density areas. Intuitively, one might expect kappa to behave
similarly to $p_o$. The difference appears to be due once again to the
influence of chance agreement. Specifically, in many distributions,
$p_c$ decreases as the cut-off approaches heavy score density areas,
leading to a decrease in $p_o$. However, kappa increases because more
opportunity exists for agreement above that expected by chance.

Generally, the cut-off score appears to affect the magnitude of
$p_o$ and kappa in two ways, i.e., through its position relative to the
extremes and to the heavy score density areas. Conceivably, these two
influences could interact, producing some unpredictable results. For
example, what would happen to the size of $p_o$ and kappa if the cut-off
and the mode were equal to $n$? Marshall (1976) used this interactional
effect to explain the unpredictable relationships found between the
cut-off and his coefficient. This effect probably also explains some
unforeseen trends Eignor and Hambleton (1979) found with kappa.

Fourth, p_o does not require score variability to attain its upper
limit but kappa does (Kane & Brennan, 1977). However, both coeffi-
cients increase as the variance increases (Huynh, 1976; Marshall,
1976; Swaminathan et al., 1974). A large variance implies extreme
scores and, consequently, better differentiation between masters and
non-masters (Marshall, 1976).

Fifth, although all the aforementioned variables affect p_o and
kappa differently, the test length and the classical reliability
coefficients affect them similarly. Specifically, as the number of
test items increases, p_o and kappa increase (Eignor & Hambleton, 1979;
Huynh, 1976, 1978; Marshall, 1976; Marshall & Haertel, 1975;
Subkoviak, 1978b; Swaminathan et al., 1974). Increasing the test
length probably results in a more accurate true score estimate and,
consequently, a more reliable estimate of an examinee's mastery
state. Correspondingly, as the classical reliability coefficient
increases so should p_o and kappa. Marshall (1976) found the mean of
his coefficient taken over various cut-off scores was highly
correlated with KR-21 across several distributions (rho = .93). Given
parallel tests, dual administration kappa was mathematically and
empirically shown to increase as the classical reliability coefficient
increased for a normal distribution and a beta-binomial model,
respectively (Huynh, 1978). In addition, Downing and Mehrens (1978)
found that Huynh's single administration kappa coefficient correlated
.96 and .98 with KR-20 and KR-21, respectively. On the other hand,
Algina and Noe's results (1978) did not support a relationship between
Subkoviak's p_o and a classical coefficient.

Synthesis

In the foregoing discussion, which coefficient to use in a
particular mastery testing situation was not delineated. The present
section addresses this issue by synthesizing the previous material and
determining the major distinctions among the various coefficients.

Using the concept of agreement functions, Kane and Brennan (1977)
provided a single consistent framework for viewing the reliability
coefficients. As explained by Kane and Brennan (1977), an agreement
function denotes the extent of agreement between the interpretation of
examinees' scores on randomly parallel tests. For mastery measure-
ment, coefficients are based upon either a squared-error (with respect
to the cut-off) or a threshold agreement function corresponding to the
squared-error and threshold loss functions previously discussed. Kane
and Brennan showed that the indices equal either the proportion of
maximum agreement achieved by the measurement procedure or the
proportion of maximum agreement achieved over and above that expected
by chance. Maximum agreement is the expected agreement between a
testing procedure and itself, while the agreement produced by the
measurement procedure is the expected value of the agreement
function. Figure 2 presents the major single test administration
reliability coefficients within their appropriate categories, formed
by crossing type of agreement function with the presence of a chance
agreement correction.
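
One way to read this framework is sketched below for paired scores on two forms. The function names, the use of sample averages in place of expectations, and the example scores are assumptions made for illustration; the product form of the squared-error agreement function is the form recalled from Kane and Brennan and should be checked against the original before being relied upon.

```python
from itertools import product


def agreement_indices(form1, form2, agree):
    """Return (uncorrected, corrected) agreement indices.

    observed : mean agreement between the two forms for the same examinee
    maximum  : mean agreement of each form with itself
    chance   : mean agreement over all pairings of examinees across forms
    """
    n = len(form1)
    pairs = list(zip(form1, form2))
    observed = sum(agree(x, y) for x, y in pairs) / n
    maximum = sum(agree(x, x) + agree(y, y) for x, y in pairs) / (2 * n)
    chance = sum(agree(x, y) for x, y in product(form1, form2)) / (n * n)
    uncorrected = observed / maximum
    # Undefined (division by zero) when chance agreement equals the maximum.
    corrected = (observed - chance) / (maximum - chance)
    return uncorrected, corrected


def threshold_agreement(cutoff):
    """1 when both scores fall on the same side of the cut-off, else 0."""
    return lambda x, y: 1.0 if (x >= cutoff) == (y >= cutoff) else 0.0


def squared_error_agreement(cutoff):
    """Product of the two deviations from the cut-off (assumed form)."""
    return lambda x, y: (x - cutoff) * (y - cutoff)


# Hypothetical scores of six examinees on two 10-item forms
form1 = [9, 8, 7, 3, 10, 6]
form2 = [8, 9, 5, 4, 10, 7]

print(agreement_indices(form1, form2, threshold_agreement(7)))
print(agreement_indices(form1, form2, squared_error_agreement(7)))
```

With the threshold function, the uncorrected index reduces to p_o and the corrected index to kappa, which is why both families of coefficients fit in the same two-by-two layout.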

                              Chance Agreement
Type of Agreement Function    Uncorrected                 Corrected
-------------------------------------------------------------------------
Squared Error                 Livingston's K²(X,Tx)       Brennan & Kane's Φ
                              Brennan & Kane's Φ(λ)
Threshold                     Subkoviak's p_o             Subkoviak's kappa
                              Marshall & Haertel's p_o    Huynh's kappa
                              Huynh's p_o

Figure 2.--Mastery Testing Reliability Formulations

One must first decide whether to use squared error or threshold
agreement coefficients (Kane & Brennan, 1977). Since the former
coefficients are concerned with the extent of deviation from the cut-
off, their size reflects the magnitude of errors (Brennan & Kane,
1977a). In other words, they do not consider all inconsistent classi-
fications or misclassifications to be equally serious, but assume that
misclassifying an examinee whose true ability level is far from the
cut-off is much more serious than misclassifying someone whose true
ability is close to the cut-off (Brennan & Kane, 1977a). This advan-
tage is particularly compelling since cut-off scores are, to some
extent, arbitrarily determined and, therefore, a sharp distinction
between masters and non-masters seldom exists (Brennan & Kane, 1977a;
Glass, 1978). Furthermore, different procedures for setting cut-off
scores result in different cut-offs (Brennan & Lockwood, 1979).
However, a drawback of these coefficients is their sensitivity to all
errors, even those not resulting in inconsistent mastery decisions
(Brennan & Kane, 1977a).

On the other hand, threshold agreement indices do not reflect the
magnitude of errors but are only sensitive to errors resulting in
misclassification (Brennan & Kane, 1977a). The disadvantage of these
coefficients is that they consider all misclassifications to be
equally serious (Brennan & Kane, 1977a).

Clearly, neither the squared error nor the threshold agreement
coefficients are optimal in every situation. Kane and Brennan (1977)
suggested the following course of action:

     The threshold agreement coefficient is appropriate
     whenever the only distinction that can be made usefully
     is a qualitative distinction between masters and non-
     masters. If, however, different degrees of mastery and
     non-mastery exist to an appreciable extent, the
     threshold agreement function is not appropriate because
     it ignores such differences (p. 40).
Since reliability is relative to the score interpretation, the appro-
priate agreement function should be dictated by the way the scores
will be used (Popham & Husek, 1969; Subkoviak, 1978b). If the degree
of mastery or non-mastery is of interest, coefficients incorporating a
squared-error agreement function are more suitable (Subkoviak,
1978b). This situation occurs when different actions or programs are
to be initiated based on how far from the cut-off an examinee scores
and/or when distance from the cut-off leads to unequal misclassifi-
cation losses (Brennan & Kane, 1977a; Popham & Husek, 1969). When
only two courses of action are possible and misclassification losses
are considered equal, threshold agreement coefficients should be
applied (Brennan & Kane, 1977a). Likewise, if there exist more than
two mastery categories and no differential misclassification loss
related to distance, threshold agreement indices can be used.
However, Kane and Brennan (1977) stated that threshold agreement
coefficients are inappropriate when more than two mastery
classifications exist and these categories are ordered. Addressing
the ordered case, Goodman and Kruskal (1954) proposed two other
measures which account for how different an individual's mastery
classification on two test administrations is. No single
administration index of these ordered coefficients has been formally
developed. However, it seems the single administration threshold
agreement indices could easily be adapted to this purpose.

The next decision one must face is whether or not to use a
coefficient accounting for chance agreement. Differentiating between
corrected and uncorrected coefficients is important because they
provide different kinds of information about reliability (Kane &
Brennan, 1977; Subkoviak, 1978b). The uncorrected squared-error and
threshold agreement indices indicate the reliability of the deviation
scores and the mastery classifications, respectively, i.e., the
consistency of the score interpretation (Kane & Brennan, 1977). Both
chance agreement and the consistency contributed by the testing
procedure affect the value of these coefficients (Kane & Brennan,
1977). In comparison, corrected coefficients measure only the latter
source of consistency, i.e., the contribution of the testing procedure
to the reliability of scores over and above that expected by chance
(Kane & Brennan, 1977). Clearly, the choice between corrected and
uncorrected coefficients depends upon whether one wants to determine
the consistency of scores regardless of the causes of this consistency
(i.e., test procedure, group composition, group's mean ability) or the
reliability of the testing procedure irrespective of the group's
characteristic ability or mastery level (Subkoviak, 1977).

In discussing threshold loss indices, Livingston and Wingersky
(1979) and Berk (1980) do not recommend using the corrected
coefficient, kappa, in situations where an absolute cut-off has been
established because the correction for chance takes the marginal
frequencies as given. As stated by Livingston and Wingersky (1979):
     Applying such a correction to a pass/fail contingency
     table is equivalent to assuming that the proportion of
     examinees passing the test could not have been anything
     but what it happened to be (p. 250).
However, the present author fails to see how this fact differentiates
kappa from any other reliability estimate which uses sample statistics
(e.g., the sample mean) as estimates of population values.

The corrected indices, coefficient kappa and Φ, could be
criticized because they approach or equal zero when little or no true
mastery score variability exists (i.e., when everyone is placed in the
same mastery state or receives the same domain score, respectively)
even though the scores may be perfectly reliable (Berk, 1980).
However, this criticism is unwarranted. These coefficients' low
values in the presence of small variability do not indicate that the
mastery scores are unreliable, but simply that the testing procedure
does not add much more reliability to the scores above that achieved
by chance processes (Kane & Brennan, 1977). In other words, a testing
procedure resulting in some sort of criterion-referenced score
interpretation must produce variability in terms of those scores if
the procedure is going to contribute to reliability (Kane & Brennan,
1977). On the other hand, the uncorrected coefficients can be large
even when no true score variability exists because of the score
consistency contributed by chance processes. These observations
provide a new perspective on Popham and Husek's disagreement with
Woodson over the variability issue (Kane & Brennan, 1977). To
reiterate, Popham and Husek contended that variability is not a
necessary characteristic of a good criterion-referenced test, while
Woodson argued that a test with no variability provides no informa-
tion. It appears that Popham and Husek's argument applies to the
score interpretation, while Woodson's argument applies to the test's
contribution to this interpretation (Kane & Brennan, 1977).

As previously discussed, the four types of coefficients depicted
in Figure 2 react differently to the relative position of the cut-
off. Obviously, the cut-off's location does not affect the corrected
squared error coefficient. However, the uncorrected squared error
indices are sensitive to the distance between the mean and the cut-
off; they increase as this distance increases. The p_o and kappa
indices are generally not expected to be sensitive to this difference
unless the mean reflects heavy score density areas.
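
The behavior of the squared error indices can be sketched numerically. The formulas below are the forms usually given for Brennan and Kane's indices, written here from memory in terms of single-item variance components, so they should be treated as an assumption rather than as a transcription of the original. The variance components, mean, and cut-offs are hypothetical values on the proportion-correct scale.

```python
def absolute_error_variance(var_i, var_pi, n_items):
    """Error variance for a mean score over n_items randomly sampled items."""
    return (var_i + var_pi) / n_items


def phi_lambda(var_p, var_i, var_pi, n_items, mean, cutoff):
    """Uncorrected squared error index (depends on mean minus cut-off)."""
    error = absolute_error_variance(var_i, var_pi, n_items)
    signal = var_p + (mean - cutoff) ** 2
    return signal / (signal + error)


def phi(var_p, var_i, var_pi, n_items):
    """Corrected squared error index (no cut-off term)."""
    error = absolute_error_variance(var_i, var_pi, n_items)
    return var_p / (var_p + error)


# Hypothetical variance components for persons, items, and their interaction
components = dict(var_p=0.02, var_i=0.01, var_pi=0.18, n_items=10)

for cutoff in (0.8, 0.7, 0.6):
    uncorrected = phi_lambda(mean=0.8, cutoff=cutoff, **components)
    print(cutoff, round(uncorrected, 3), round(phi(**components), 3))
```

Under these assumptions the uncorrected index equals the corrected index when the cut-off sits at the mean and grows as the cut-off moves away from it, while the corrected index does not change.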

On the other hand, squared error indices are not sensitive to the
distance between the cut-off and the mode or heavy score density
areas, while uncorrected threshold indices are hypothesized to
increase as this distance increases. In contrast, the corrected
threshold indices appear to be greater when the cut-off is located in
heavy score density areas. For example, when scores are normally
distributed, a U function characterizes the relationship between the
cut-off score and p_o, while kappa is an inverted U function of the
cut-off.

Similar to p_o, the uncorrected squared error indices are also a U
function of the cut-off score given a normal distribution since the
mean equals the mode (Marshall, 1976). However, when the distribution
is skewed and/or bimodal, uncorrected squared error coefficients will
increase while uncorrected threshold indices will decrease as the cut-
off moves from the mean toward the mode(s). Correspondingly, for
bimodal distributions, Marshall (1976) found that the magnitude of his
p_o and K²(X,Tx) did not fluctuate similarly as the cut-off score
varied. This observation is particularly relevant in mastery measure-
ment since the score distribution on any given test administration is
often bimodal and, in some cases, is expected to be skewed (e.g.,
after an instructional program) (Marshall, 1976; Marshall & Serlin,
1979).

Although the list of applicable coefficients can be reduced by
choosing an appropriate agreement function and deciding whether or not
to correct for chance processes, one must still select among alterna-
tive formulas in many cases. The choice of an appropriate index in
these instances depends upon the number of feasible test administra-
tions, the satisfaction of the assumptions underlying a particular
index, the coefficient's robustness to violations of these
assumptions, the coefficient's bias in estimating the dual administra-
tion population index, and the degree of sampling fluctuation
exhibited by the coefficient. In most situations, two test
administrations are not possible and, therefore, the applicable
coefficients are typically those requiring only one test administra-
tion.

If one has decided to use an uncorrected squared error coeffi-
cient, one can choose Livingston's K²(X,Tx) and/or Brennan and
Kane's Φ(λ). A major difference between these indices is that K²(X,Tx)
is based upon classical test theory, while Φ(λ) is derived from
generalizability theory (Brennan, 1979). The latter theory has two
distinct advantages over the former. First, generalizability theory
provides the opportunity to examine the reliability of data derived
from different types of experimental designs, e.g., nested design
(Brennan, 1978). This theory also allows one to take account of
whether the various effects are fixed or random (Brennan, 1978).
Second, generalizability theory can differentiate norm- from
criterion-referenced measurement by distinguishing between different
error variances, while classical test theory cannot (Brennan, 1979).
Specifically, Brennan and Kane's approach indicates that σ²(pI) is the
appropriate error term for norm-referenced measurement, while
σ²(pI) + σ²(I) is the proper error variance in criterion-referenced
measurement (Brennan, 1979). Clearly, the classically parallel test
assumption obviates the existence of σ²(I). Generalizability theory
assumes tests are randomly parallel. Brennan (1979) finds the classi-
cally parallel test assumption unreasonable for criterion-referenced
testing since the test construction method does not require content
specialists to include only items with the same difficulty level in
the domain. If, as expected, the items in a domain have various
difficulty levels, it would be very unlikely for all tests constructed
from this domain to be classically parallel (Brennan, 1979). Further-
more, since K²(X,Tx) equals Φ(λ) when test means are equal, K²(X,Tx)
is really a special case of Φ(λ). For these reasons, the more
general Φ(λ) appears preferable to K²(X,Tx) (Brennan, 1979). Unfortu-
nately, no empirical research concerning the bias and sampling
fluctuation of these coefficients exists.

When considering uncorrected threshold loss indices, the appro-
priateness of several alternative p_o formulas must be evaluated. If
feasible, p_o can, of course, be estimated from two test administra-
tions. The dual administration p_o is unbiased and its standard error
equals [p_o(1 - p_o)/N]^(1/2) (Huynh & Saunders, 1979). Generally, formulas
for evaluating the standard error of the single administration p_o
estimates have not been developed and very little empirical evidence
pertaining to their bias and standard error has been produced.
However, assuming a beta-binomial score distribution, Huynh (1978)
showed that his p_o index is asymptotically unbiased and also presented
a formula for the asymptotic standard error of this estimate. In
addition, Huynh and Saunders (1979) found that Huynh's p_o generally
underestimated the dual administration p_o for large data sets not
conforming to the beta-binomial model as well as for small and
moderate sized samples (N = 20, 40, 60). In the former case, the average
amount of bias was -2.3% across various test lengths and cut-off
scores. For the small and moderate sized samples, the average degree
of bias was -2.6% across various test lengths.
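
The standard error of the dual administration p_o quoted above is the usual standard error of a proportion. A minimal sketch follows, with a hypothetical p_o of .80 evaluated at the three sample sizes mentioned; the function name is an assumption made for illustration.

```python
import math


def dual_administration_se(p_o, n_examinees):
    """Standard error of the dual administration p_o: sqrt(p_o(1 - p_o)/N)."""
    if not 0.0 <= p_o <= 1.0:
        raise ValueError("p_o must lie between 0 and 1")
    if n_examinees <= 0:
        raise ValueError("n_examinees must be positive")
    return math.sqrt(p_o * (1.0 - p_o) / n_examinees)


# Hypothetical agreement level at the small and moderate sample sizes
for n in (20, 40, 60):
    print(n, round(dual_administration_se(0.80, n), 4))
```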

Assuming a large sample size, Huynh and Saunders (1979) also
compared the standard error of the dual administration p_o to that of
Huynh's estimate for various beta-binomial distributions, test
lengths, and cut-off scores. The mean and KR-21 of the distributions
were chosen to reflect one of the following shapes: (1) U-shaped with
the higher mode at the upper end of the distribution; (2) symmetric;
(3) unimodal with the mode lying between 0 and n; and (4) J-shaped.
In every instance, the standard error of Huynh's estimate was lower
than its dual administration counterpart. On the average, the
standard error of the former was 59.3% of the latter. The uniformly
smaller standard error of Huynh's p_o was also found for large sample
distributions significantly different from the beta-binomial in
several instances and for small to moderate sized samples (Huynh &
Saunders, 1979). Over all the situations considered, the standard
error of Huynh's p_o was 50.4% and 51.4% of that of the dual
administration estimate, respectively.

Only one study correctly compared the bias and standard error of
all the p_o estimates (including the dual administration p_o)
(Subkoviak, 1978a). In this study, Subkoviak's coefficient was based
upon a compound binomial instead of a binomial model, and the
proportion correct score was used as the true score estimate in
Marshall's p_o. Each estimate was computed for 50 random samples of 30
students each and compared to the dual administration p_o obtained in
the population (N = 1586). Comparisons were made for three test lengths
(10, 30, 50) and four cut-off scores (.5n, .6n, .7n, .8n). The mean
and standard deviation of each estimate across the 50 samples provided
the necessary data for judging the estimate's bias and standard
error. All the estimates became more accurate as the test length and
the distance between the mean and the cut-off increased. Moreover,
the estimates' standard errors decreased. The influence of the
distance between the mean and the cut-off can be partly explained by
the fact that estimates become more accurate and less variable as the
population parameter becomes more extreme (Subkoviak, 1978). Corre-
sponding to Huynh and Saunders' results (1979), the dual
administration estimate was unbiased but had the largest standard
error regardless of the test length and cut-off score. Huynh's p_o
underestimated p_o for short tests and was, generally, less variable
than the other indices for 30- and 50-item tests. Marshall and
Haertel's p_o was biased upward when the cut-off was near the mean and
biased downward when the cut-off was in the tails of the
distribution. This effect was more pronounced for shorter tests.
Conversely, for short tests, Subkoviak's p_o underestimated p_o when the
cut-off was near the mean and overestimated p_o for more extreme cut-
offs. This finding was similar to that found by Algina and Noe
(1978). The opposite reaction of Marshall's and Subkoviak's indices may
have been due to the use of different true score estimates. Specifi-
cally, Marshall and Haertel's use of the proportion correct score
produces an overestimate of the true score variance, while Subkoviak's
regression true score estimate results in an underestimate of this
variance (Algina & Noe, 1978). It should be noted that Subkoviak's p_o
showed no consistent pattern for longer tests. Finally, Marshall and
Haertel's index was the least variable but the most biased for n = 10.
Except in this latter case, none of the four coefficients was substan-
tially biased.
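
The way bias and standard error are judged from replicated samples in a study of this kind can be sketched as follows. The five estimates and the population value are hypothetical, and the use of the sample (n - 1) standard deviation is an assumption, since the source does not say which divisor was used.

```python
import statistics


def bias_and_standard_error(estimates, population_value):
    """Summarize replicated estimates of a population coefficient.

    bias           : mean of the estimates minus the population value
    standard error : standard deviation of the estimates across samples
    """
    bias = statistics.mean(estimates) - population_value
    standard_error = statistics.stdev(estimates)  # n - 1 divisor (assumed)
    return bias, standard_error


# Hypothetical single administration p_o estimates from five samples
estimates = [0.78, 0.82, 0.80, 0.76, 0.84]
bias, se = bias_and_standard_error(estimates, population_value=0.81)
print(f"bias={bias:+.3f}  standard error={se:.4f}")
```

A negative bias here means the estimate tends to fall below the dual administration value in the population, which is the direction reported for Huynh's index on short tests.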

In evaluating which single administration p_o estimate to apply,
the assumptions underlying each of them should be examined. All
assume the distribution of an examinee's test scores given his/her
true score is binomial. Recognizing that the equal item difficulty
assumption might be unrealistic, Subkoviak (1976) proposed using the
compound binomial instead of the binomial model. However, whether or
not this more complicated procedure improves estimation of p_o is
highly questionable. Use of the compound binomial in Subkoviak's p_o
generally produced results similar to those obtained using the
binomial model (Marshall & Serlin, 1979). Furthermore, Huynh and
Saunders (1979) found the standard deviation of item difficulties was
not related to the degree of bias associated with Huynh's p_o and
Huynh's kappa estimate, and Subkoviak (1978a) provided evidence that
all three coefficients are robust with respect to violation of the
equal item difficulty assumption.

Another assumption implicit in all three single administration
coefficients is classic parallelism (Kane & Brennan, 1977). The
validity of this assumption in criterion-referenced testing has
already been questioned. When tests are not classically parallel,
these coefficients will probably overestimate p_o. To the author's
knowledge, no empirical evidence addressing this question exists.
Those few studies examining the bias of one or more of these estimates
included only parallel tests (for example, Subkoviak, 1978a).

Huynh and Saunders (1979) noted that Subkoviak's procedure and
Huynh's procedure assume the score distribution is beta-binomial.
Therefore, they should have similar patterns of bias and standard
error. Huynh and Saunders (1979) concluded that such was the case in
Subkoviak's investigation (1978a). Although not explicitly stated,
Subkoviak's study of bias appears to have been performed on data
following a normal distribution. The normal distribution is not a
member of the beta-binomial family, although this family does include
a "normally" shaped distribution (Gross & Shulman, 1980). The bias
and standard error of these estimates have not been investigated for
distributions more typically found in criterion-referenced
measurement, i.e., skewed and bimodal (Marshall, 1976; Marshall &
Serlin, 1979). Examining these coefficients given the latter distri-
bution would be particularly interesting because the beta-binomial
family does not include bimodal distributions, except for U-shaped and
J-shaped functions (Gross & Shulman, 1980). Both these distributions
are not expected to occur in the real world (Marshall & Serlin,
1979). Subkoviak (1976, 1978a) has stated that using a single
regression equation to estimate the true score in his procedure is
inappropriate given a bimodal distribution and has recommended using
Huynh's procedure. However, Marshall and Serlin (1979) found that the
magnitude of Huynh's p_o did not reflect the location of the modes for
bimodal distributions, while Subkoviak's p_o reflected the mode(s) for
both unimodal and bimodal distributions. Although not explicitly
stated, the researchers appear to have used a single regression equa-
tion to obtain Subkoviak's true score estimate for the bimodal as well
as the unimodal distributions. Gross and Shulman (1980) investigated
the robustness of the beta-binomial model; they compared empirical
values of p_o obtained from two test administrations to the theoretical
values of p_o derived from the beta-binomial model when its underlying
assumptions were violated. They found that the theoretical and
empirical values were in close agreement. However, the authors did
not indicate the shape of the score distribution nor how severely the
assumptions were violated.

One of the most enlightening findings concerning the p_o estimates
evolved from Marshall and Serlin's study (1979). They used five
versions of Marshall and Haertel's p_o varying in terms of the model
used to simulate scores on a 2n-item test. They found that Huynh's p_o
and Subkoviak's p_o were empirically equivalent to Marshall and
Haertel's estimate when the assumptions of the former indices were
applied to the latter coefficient. Specifically, when the Keats and
Lord beta-binomial model was used to simulate scores on a 2n-item test
for Marshall's p_o, this index was equal to Huynh's p_o in each of 300
cases. Similarly, when a binomial regression model was used to
simulate scores, Marshall's and Subkoviak's indices were equal. In
summary, Marshall's p_o appears to be a general index subsuming the
other two coefficients and is equal to them when the data are postu-
lated to meet certain assumptions (Marshall & Serlin, 1979).
Therefore, a choice among the three coefficients seems reduced to
choosing among various test models rather than among three entirely
different coefficients (Marshall & Serlin, 1979). Clearly, much more
empirical research is needed to choose which test model results in the
least bias and standard error given a particular type of distribution.

Finally, if the situation demands a corrected threshold agreement
index, one can use a dual test administration kappa estimate,
Subkoviak's model, and/or Huynh's procedure. The dual administration
kappa estimate is asymptotically unbiased (Huynh & Saunders, 1979).
However, Huynh and Saunders (1979) found a small negative bias for
both small (N = 20, 40) and moderate (N = 60) sized samples. They also
presented a formula for computing this estimate's asymptotic standard
error.

Given a beta-binomial distribution, Huynh (1978) showed that his
single administration kappa formula is also asymptotically unbiased
and presented a formula for its asymptotic standard error. For
several large data sets, some of which were significantly different
from a beta-binomial distribution, Huynh and Saunders (1979) found
that this estimate tended to underestimate the population dual
administration kappa. Across various test lengths and cut-off scores,
the average percent of bias was -7.8. The same trend was found for
small and moderate sized samples; across various test lengths, the
average percent of bias was -11.0.

Huynh and Saunders (1979) also compared the standard error of
Huynh's kappa to that of the dual administration estimate. Over
various beta-binomial distributions, test lengths, and cut-off scores,
the standard error of Huynh's kappa was consistently lower. On the
average, it was 53.2% of the standard error of the dual administration
kappa. The uniformly smaller standard error of Huynh's estimate was
also found for large data sets with distributions significantly
different from the beta-binomial in several instances as well as for
small and moderate sized samples. On the average, the standard error
of Huynh's estimate was 50.2% and 56.9% of the standard error of the
dual administration coefficient, respectively.

The bias of Subkoviak's kappa has not been investigated, and no
studies have compared the bias and standard error of Subkoviak's and
Huynh's kappa estimates. The same issues raised under the discussion
of the bias of the p_o estimates are also relevant for kappa formu-
lations. Specifically, these coefficients' biases and standard errors
need to be evaluated for various score distributions, including a
bimodal, and for situations where the classic parallelism assumption
is violated.

Obviously, the lack of empirical research does not allow
definitive recommendations as to which coefficient to use within each
cell of Figure 2 given a particular situation. In order to address
some of the uninvestigated issues raised in this discussion, the
current study was conducted to assess the influence of various test
characteristics upon the bias and standard error associated with each
major single test administration coefficient when estimating the
appropriate dual test administration population coefficient. Speci-
fically, the effects of the following variables were examined:

(1) violation of the classic parallelism assumption
(2) shape of the test score distribution
(3) test length
(4) cut-off score
(5) number of examinees in the sample

Those coefficients whose derivation is based upon the assumption of
classically parallel tests were expected to be more biased when this
assumption was violated (i.e., when the tests were randomly
parallel). The shape of the test score distribution (particularly a
bimodal distribution) was hypothesized to influence the bias of the
threshold agreement indices because of their implicit or explicit
distributional assumptions. The location of the cut-off was not
expected to affect the extent of bias. Finally, a decrease in
standard error was predicted as test length and sample size increased.

METHOD

Data Base

Several populations reflecting different distributional shapes
were generated from data obtained from one of two sources. The first
data base came from the responses of a sample of Michigan public
school fourth graders to various criterion-referenced tests admin-
istered by the Michigan Educational Assessment Program (MEAP). MEAP
annually collects data on fourth, seventh, and tenth grade students'
attainment of various reading and mathematics objectives which address
several of the minimal skills beginning students in these grades
should have. Using a replicated, systematic sampling procedure, MEAP
annually selects approximately 5000 students in each grade and
computes each test's technical characteristics from their data
(Michigan Department of Education, 1977). (In applying this sampling
plan, the Michigan Department of Education (1977) randomly chooses ten
numbers identifying the first member of each of ten systematic
samples. A spacing factor is computed and added to each of these
numbers to identify the next member of each set. The spacing factor
is repeatedly added to the previous set of numbers until the requisite
sample size has been attained.) The data obtained from a sampling of
5,040 fourth grade students in the fall of 1979 served as the major
population data base in this study. The second data source or popu-
lation was the responses of 589 college students to a mid-term exam
given in their introductory psychology course. This exam was a "norm-
referenced test" and produced a distribution not commonly found with
criterion-referenced tests.
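
The replicated, systematic sampling plan described in this section can be sketched as follows. The function name and the rule for computing the spacing factor (population size divided by the number of members per replicate) are assumptions; the source says only that a spacing factor is computed and repeatedly added to ten random starting numbers.

```python
import random


def replicated_systematic_sample(population_size, sample_size,
                                 n_replicates=10, seed=None):
    """Return n_replicates lists of 0-based population indices.

    Each replicate starts at its own random index and then takes every
    `spacing`-th member until it holds its share of the total sample.
    """
    per_replicate = sample_size // n_replicates
    if per_replicate == 0:
        raise ValueError("sample_size must be at least n_replicates")
    spacing = population_size // per_replicate  # assumed spacing rule
    if spacing < n_replicates:
        raise ValueError("population too small for distinct random starts")

    rng = random.Random(seed)
    # Distinct starts within the first interval keep the replicates disjoint.
    starts = rng.sample(range(spacing), n_replicates)
    return [[start + k * spacing for k in range(per_replicate)]
            for start in starts]


# Hypothetical population of 100,000 students, about 5,000 to be drawn
replicates = replicated_systematic_sample(100_000, 5_000, seed=1)
print(len(replicates), len(replicates[0]))
```

Drawing several independent systematic replicates, rather than one long systematic run, is what allows sampling variability to be estimated from the differences among replicates.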

Procedure

Test Characteristics

Distribution shape. Four distributions were incorporated into
the study: (1) severely negatively skewed; (2) J-shaped; (3) bimodal
with a bigger mode at the upper end and a lower mode not equal to
zero; and (4) normal. The first distribution was believed to typify
that found when a criterion-referenced test is given after an instruc-
tional or a training program (Marshall, 1976). Correspondingly, this
distribution was found in the MEAP data. The J-shaped distribution
was included in the study because it was also represented in the MEAP
data. According to Marshall and Serlin (1979), a bimodal distribution
is also frequently found in mastery testing situations. Setting the
lower mode unequal to zero was intended to reflect the probability
that a non-master would guess the correct answer to one or more
questions. Marshall and Serlin (1979) contended that this distri-
bution is much more likely to occur in mastery testing than a J- or U-
shaped distribution, especially when guessing is a viable factor. In
some cases, the MEAP data (considering each grade) did follow a
bimodal distribution with the lower mode equal to one. However, the
bimodal did not occur more often than the J-shaped distribution.
Finally, a normal distribution was included to explore the appro-
priateness of the reliability formulas for typical norm-referenced
tests. Note that the bimodal and the normal distributions are not

members of the beta distribution family.


Test length. Test lengths of 5, 10, 15, and 20 items were
examined. These test lengths typify those found for criterion-
referenced tests and/or are representative of those needed to produce
a high probability of accurately assigning respondents to a mastery
state (Algina & Noe, 1978; Klein & Kosecoff, 1973; Marshall, 1976;
Novick and Lewis, 1974). Furthermore, Berk (1980) recommended using
between five and ten items per objective for most classroom decisions
and between 10 and 20 items for school, system, and state level

decisions.

Cut-off score. Three cut-off scores, 70%, 80%, and 90%, were
employed because they are representative of those occurring in mastery
measurement and/or those recommended for usage (Block, 1972; Marshall,
1976; Novick & Lewis, 1974). To adequately effect cognitive learning
and, concurrently, maintain interest in learning, Block's research
(1972) has shown that the cut-off should be set between 80 and 85
percent. Marshall (1976) stated that one would typically use between
60 and 90 percent, and Novick and Lewis (1979) noted that the range
seems to be between 70 and 85 percent in Individually Prescribed
Instruction.

Given the previously specified test lengths and the integer value
of test scores, specifying three test scores (advancement scores)
equalling the chosen cut-off levels was not always possible. There-
fore, reliabilities were only computed for those combinations of test
length and cut-off score for which a test score resulting in a

percentage equal to or slightly greater than the given cut-off could


be specified. Figure 3 presents these combinations and each advance-

ment score with its associated cut-off level.

Number of examinees. Sample sizes of 25, 35, and 50 were
randomly selected from the population. The first two values were
chosen because they were believed to typify classroom sizes and to be
illustrative of the number of people participating in various
organizational training programs. A sample size of 50 was used
because it has been recommended that estimation of α and β in Huynh's
formulas be accomplished with N ≥ 40 for very short tests and N ≥ 2n
for longer tests (Subkoviak, 1978). Finally, these three sample sizes
were believed to be divergent enough to study the effects of sample
size.

                            Cut-off Level
Test Length       70%             80%             90%
     5                         4/5 (80%)
    10         7/10 (70%)      8/10 (80%)      9/10 (90%)
    15        11/15 (73%)     12/15 (80%)     14/15 (93%)
    20        14/20 (70%)     16/20 (80%)     18/20 (90%)

Figure 3.--Advancement Scores for Each Combination of
Test Length and Cut-off Level

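The rule behind Figure 3 can be checked with a short sketch. It takes the smallest integer score whose percentage meets the cut-off and keeps the combination only when that percentage is close to the cut-off. The 5-point tolerance is an assumption chosen for illustration (the text says only "equal to or slightly greater than"); the function name is hypothetical.

```python
def advancement_scores(test_lengths=(5, 10, 15, 20),
                       cutoffs_pct=(70, 80, 90),
                       tolerance_pct=5):
    """Map (test length, cut-off %) to the advancement score, where one exists.

    Integer arithmetic is used throughout to avoid floating-point rounding
    when taking the ceiling of cutoff * length.
    """
    table = {}
    for n in test_lengths:
        for c in cutoffs_pct:
            score = -(-c * n // 100)             # ceiling of c*n/100
            overshoot = 100 * score - c * n      # excess, in percent-times-items
            # Assumed rule: keep the cell only if the achieved percentage
            # exceeds the cut-off by no more than tolerance_pct points.
            if overshoot <= tolerance_pct * n:
                table[(n, c)] = score
    return table
```

Under this assumed tolerance the sketch yields the ten combinations shown in Figure 3, with only the 80% cut-off available for the five-item test.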
Data Generation
Item Domain. The study required a domain of items from which
randomly and classically parallel tests of various lengths could be
drawn. Specifically, a content domain consisting of at least 40 items
was needed to construct alternate forms of all possible test lengths
included in this study. Since all the MEAP criterion-referenced tests
consisted of five items, items had to be taken from at least eight
tests, measuring different objectives, to form the domain. MEAP
groups the mathematics and reading objectives into major skill areas.
For example, the program includes 15 mathematics objectives tapping
various aspects of numeration skill. A content analysis indicated
that eight tests from the numeration skill area appeared to measure
similar objectives. These 40 items were intercorrelated and subjected
to a principal components analysis. The mean item intercorrelation
within objectives was .36. The mean intercorrelation between items on
different objectives, computed by systematically sampling correlations
within the 40 x 40 correlation matrix, was .16. The principal
components analysis yielded a general factor accounting for 21.4% of the
variance. Ten factors had eigenvalues greater than or equal to one.
A varimax rotation indicated that, generally, items within a particular
test loaded highest on the same factor and each factor was
defined by the items on one particular test. In summary, the set of
40 items was more heterogeneous than what one might find for a very
narrowly defined objective. However, the KR-20 was .89, indicating a
fairly high internal consistency. Therefore, the researcher decided
to use these items to construct the domain.
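For readers who want to see how the internal consistency figures are obtained, here is a minimal sketch of the standard Kuder-Richardson formulas 20 and 21 for a persons-by-items matrix of 0/1 responses. The use of the population (divide-by-N) variance is an assumption; the study does not say which variance convention its software used.

```python
def _total_score_stats(responses):
    """Return (k, mean, variance) of total scores; variance divides by N (assumed)."""
    n_persons = len(responses)
    k = len(responses[0])
    totals = [sum(row) for row in responses]
    mean = sum(totals) / n_persons
    variance = sum((t - mean) ** 2 for t in totals) / n_persons
    return k, mean, variance

def kr20(responses):
    """KR-20: uses each item's own difficulty (p value)."""
    k, _, variance = _total_score_stats(responses)
    n_persons = len(responses)
    p_values = [sum(row[i] for row in responses) / n_persons for i in range(k)]
    sum_pq = sum(p * (1 - p) for p in p_values)
    return (k / (k - 1)) * (1 - sum_pq / variance)

def kr21(responses):
    """KR-21: assumes all items are equally difficult, so only the mean is needed."""
    k, mean, variance = _total_score_stats(responses)
    return (k / (k - 1)) * (1 - mean * (k - mean) / (k * variance))
```

Both functions divide by the total-score variance, so they are undefined when every examinee has the same total score.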

Forty students did not reach the questions in one or more of the
eight MEAP tests comprising the domain and were, therefore, eliminated
from the data base. Based upon 5,000 students, the p values of the 40
items ranged from .69 to .96. The mean and standard deviation of

domain scores were 35.79 and 5.34, respectively.


The second data base, the psychology mid-term exam, consisted of
46 items. To increase this item domain's internal consistency, six
items with low item-total correlations were eliminated. The resultant
KR-20 was .68. The p values ranged from .19 to .96, and the mean and

standard deviation of domain scores were 27.74 and 4.54, respectively.

Score Distributions. The reason for using two data sources in

 

this study was to provide a population representative of each distri-
bution under investigation. The negatively skewed, J-shaped, and

bimodal distributions were based upon the MEAP data, while the normal
distribution was represented by the psychology mid-term domain scores.

Similar to the majority of MEAP's criterion-referenced tests, the
eight numeration tests produced negatively skewed distributions. Not
surprisingly, the frequency distribution of total scores on the 40-
item domain was also negatively skewed. Figure 4 presents the graph
of this population distribution.

To generate the J distribution, the domain scores were inverted
and merged with the original scores. The resulting distribution
closely resembled a U. Then, a new population was formed by randomly
sampling 3,500 students from the original distribution (upper half of
the "U") and 1,500 students from the inverted distribution (lower half
of the "U"). As can be seen in Figure 5, the graph of this population
closely follows a J-shape.
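The construction of the J-shaped population can be sketched as below. Two points are assumptions: "inverting" is taken to mean subtracting each score from the maximum domain score of 40, and the two draws are made without replacement. The function name is hypothetical.

```python
import random

def make_j_population(scores, max_score=40, n_original=3500, n_inverted=1500, seed=None):
    """Build a J-shaped score population from a negatively skewed one."""
    rng = random.Random(seed)
    # Assumed meaning of "inverted": reflect each score about the score range.
    inverted = [max_score - s for s in scores]
    upper = rng.sample(scores, n_original)     # upper half of the "U"
    lower = rng.sample(inverted, n_inverted)   # lower half of the "U"
    return upper + lower
```

Because 3,500 cases come from the original scores and only 1,500 from the reflected ones, the upper mode is larger than the lower mode, which gives the J rather than the U shape.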

The bimodal distribution was formed by altering the scores of a
random sample of people from the negatively skewed distribution on a
random sample of items. Specifically, the researcher first sampled 6%

of those with scores greater than or equal to 30 and changed their

Figure 4.--Skewed Population Frequency Distribution of Domain Scores
(histogram; horizontal axis: Domain Score, vertical axis: Frequency)

Figure 5.--J-shaped Population Frequency Distribution of Domain Scores
(histogram; horizontal axis: Domain Score, vertical axis: Frequency)


scores from right to wrong on a sample of 30 items. If a student had
already answered a particular question wrong, the item response was
not altered. The reason for changing 30 items was to assure that the
lower mode would equal the number of items expected to be answered
correctly merely by guessing. This same procedure was repeated two
more times with replacement of items and people occurring between each
sampling procedure. If a student was selected in more than one
sampling procedure, he/she was deleted from the second and/or third
sample. These three samples were combined with the unaltered scores
in the original distribution, producing the population frequency
distribution depicted in Figure 6.

Finally, the psychology mid-term scores were duplicated five
times to create enough examinees for the sampling process. The
resultant domain scores of 2,945 examinees produced the approximately
normal distribution shown in Figure 7. The skewness and kurtosis
moments were -.30 and .15, respectively. (In the computer package
used in this study, the kurtosis of a normal distribution was zero
instead of three.) These statistics indicated that the distribution
was slightly negatively skewed and somewhat more peaked than a normal
distribution. However, the departure did not appear to be practically

significant.
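The skewness and kurtosis statistics quoted above can be illustrated with moment-based formulas in which the kurtosis of a normal distribution is zero. This is a sketch under the assumption that simple population moments are used; the exact estimator in the study's computer package is not stated.

```python
def skewness_and_excess_kurtosis(values):
    """Return (skewness, excess kurtosis) from central moments that divide by N."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    skewness = m3 / m2 ** 1.5
    excess_kurtosis = m4 / m2 ** 2 - 3.0   # subtract 3 so a normal curve scores 0
    return skewness, excess_kurtosis
```

With this convention a negative skewness marks a longer lower tail and a positive kurtosis marks a distribution more peaked than the normal, matching the interpretation given in the text.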

Alternate forms. Following the construction of an item domain
and the distribution manipulations, alternate parallel and randomly
parallel forms were constructed for each test length. Randomly
parallel five-item tests were formed by randomly sampling items from

the domain without replacement. Consequently, alternate test forms

Figure 6.--Bimodal Population Frequency Distribution of Domain Scores
(histogram; horizontal axis: Domain Score, vertical axis: Frequency)

Figure 7.--Normal Population Frequency Distribution of Domain Scores
(histogram; horizontal axis: Domain Score, vertical axis: Frequency)

did not have any items in common. The items from both forms were not
replaced in the domain when longer tests were constructed. For each
alternate form, tests of 10, 15, and 20 items were built by using
those items found on the next shorter test and randomly sampling
(without replacement) the necessary number of additional items from
those remaining in the domain. For the MEAP data, the same tests were
used for the skewed and J distributions. However, since these
particular tests did not produce bimodal distributions, the test con-
struction process was repeated for the bimodal score domain. The
sampling procedure was also repeated for the psychology exam data.
Alternate classically parallel forms were constructed by pairing
items based on their p values and item-total correlations. One item
from each pair was placed in each form. The five pairs having the
most equivalent items within each pair were used to construct the
five-item tests. In forming longer tests, the next closest pairs were
chosen and added to those on the next shorter test. Since the p
values and/or the item-total correlations were expected to change when

altering the distribution shape, this process was repeated for each

distribution.

Determination of Bias

For every combination of test length, cut-off score, distribution
shape, and type of parallelism, population values of p_o, kappa, and
Livingston's K²(X,T_X) were computed from two test administrations.
Brennan and Kane's Φ(λ) and Φ were also computed in every condition
except those involving classically parallel alternate forms because
all items in the domain would have to have equal p values to meet this
assumption. Moreover, if all items had equal p values, Φ(λ) would
simply equal Livingston's K²(X,T_X) and Φ would equal the
generalizability coefficient (Brennan, 1978, 1979). (Note also that
the value of Φ does not change as the cut-off score is altered.) In
all, 320 population values were computed. Formulas for each
population coefficient can be found in Figure 8.

Thirty independent random samples of 25, 35, and 50 cases were
drawn with replacement from each of the four population distribu-
tions. Within each cell of the design, an estimate of each population
coefficient was computed for each of the 30 samples using the appro-
priate single test administration coefficients. These estimates were
obtained for only one alternate form. The mean of the estimates
within each cell was compared to the population value to determine the
magnitude and direction of bias. The standard deviation of these
estimates indicated each coefficient's sampling error. The total
design contained 240 cells (four distribution shapes, four test
lengths, three cut-off scores for tests of 10, 15, and 20 items, one
cut-off score for a five-item test, three sample sizes, and either
classically or randomly parallel alternate forms).
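The cell count and the bias computation described above can be verified with a few lines. The cell arithmetic follows the text directly; using the sample (n - 1) standard deviation for the sampling error is an assumption, and the estimates in the usage note are made-up numbers for illustration only.

```python
from statistics import mean, stdev

def count_design_cells():
    """Count cells: 4 shapes x (3 lengths x 3 cut-offs + 1) x 3 sizes x 2 form types."""
    distributions = 4
    length_cutoff_combinations = 3 * 3 + 1   # 10-, 15-, 20-item tests plus the 5-item test
    sample_sizes = 3
    parallelism_types = 2
    return distributions * length_cutoff_combinations * sample_sizes * parallelism_types

def bias_summary(estimates, population_value):
    """Return (mean bias, standard deviation) of a cell's sample estimates."""
    bias = mean(estimates) - population_value   # negative means underestimation
    return bias, stdev(estimates)               # assumed: n - 1 denominator
```

For instance, `bias_summary([0.70, 0.74, 0.78], 0.75)` gives a bias of about -.01 and a standard deviation of about .04, and `count_design_cells()` returns 240.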

One problem was encountered in sampling examinees; KR-20 and KR-21
for some samples were negative or equal to zero. Although negative
reliability coefficients can be equated to zero and many of the
coefficients can be computed when KR-20 equals zero, Huynh's p_o and
kappa estimates cannot. For each case in which KR-20 or KR-21 was
negative or zero, the random sampling process was repeated until other
samples with positive coefficients were found.

Figure 8.--Formulas for Each Criterion-Referenced Reliability Population
Coefficient Computed from Alternate Forms and for the Single Test
Administration Sample Estimates (columns: Coefficient, Alternate Form
Population Formulas, Sample Estimate Formulas)


Within each distribution, the same samples were used to compute
estimates of the coefficients for every combination of test length and
cut-off score. Furthermore, the same samples were used for estimating
the reliability of randomly and classically parallel tests. As
mentioned previously, when a test had a zero or a negative KR-20 or
KR-21 in a particular sample, the sample was eliminated and another
one was chosen. However, only the internal consistency of five-item
tests composed of randomly chosen items was examined in determining
which samples to delete. Since the classically parallel forms
consisted of different items and since the same set of samples was
used in both parallelism conditions, some samples retained in the set
had a negative or zero KR-21 for the classically parallel form. This
problem occurred only for the normal distribution and was probably due
to the relatively low internal consistency of the items in this
domain. Moreover, within this distribution, the KR-21 for longer
tests within both parallelism conditions was negative or zero for some
of the retained samples. In those cells where this difficulty
surfaced, the sample(s) was dropped from the cell. Therefore, within
some cells, the mean and standard deviation were based on fewer than 30
samples. However, every cell contained at least 20 samples.

Estimation Formulas. Figure 8 presents the single test admin-
istration formulas used to estimate each population alternate form
coefficient. A few formulas require some explanation. In estimating

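Φ(λ), Brennan and Kane (1977a) noted that (X̄ - λ)² is not an unbiased
estimate of (μ - λ)². They presented an unbiased estimate of this
term:

$$(\bar{X} - \lambda)^2 - \left( \frac{s_p^2}{n_p} + \frac{s_i^2}{n_i} + \frac{s_{pi}^2}{n_p n_i} \right)$$

where $s_p^2$, $s_i^2$, and $s_{pi}^2$ are the estimated variance
components for persons, items, and the person-by-item interaction,
and $n_p$ and $n_i$ are the numbers of persons and items.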

In addition, previous discussion of Brennan and Kane's indices assumed
the item domain was infinite. However, the domain in this study is a
finite universe. To account for this design factor, Brennan (1978)
provided formulas for Φ(λ) and Φ in which a finite universe correction
factor is applied to the variance components comprising these
coefficients. These latter formulas, which also incorporate an unbiased
estimate of (μ - λ)², were used in this study and appear in Figure 8.

For Huynh's, Subkoviak's, and Marshall's indices, the researcher
assumed that an individual's test scores followed a binomial distri-
bution given his/her true score, rather than a compound binomial
model. Studies cited previously have indicated that using the
binomial model for heterogeneous item difficulty values does not sub-
stantially affect the accuracy of these coefficients. Moreover, the
binomial model has produced results similar to those found using the
compound binomial for Subkoviak's p_o (Marshall & Serlin, 1979).

For Marshall's p_o, scores on a 2n-item test were simulated via a
binomial regression model. Specifically, a linear regression was used
to predict true score from obtained score and the predicted true score
was used in a binomial error model to estimate the frequency
distribution of a 2n-item test. As noted previously, Marshall and Serlin
(1979) used five different models for simulating scores. The binomial
regression model was chosen over the others because the relative size
of Marshall's p_o using this model better reflected the distance
between the cut-off and the mode(s) for distributions similar to those
used herein.

RESULTS

Population Values

Tables 1 and 2 present the population distributional
characteristics associated with each randomly and classically parallel
alternate form, respectively. For the bimodal distribution within the
randomly parallel condition, one five-item form had only one mode and
one ten-item form had three modes. As can be seen from the skewness
and kurtosis moments, the normal distributions departed from their
theoretical shape.

For each condition, Tables 3 to 8 present the alternate form
population values of the classical reliability coefficient (ρ11),
Livingston's K²(X,T_X), Brennan and Kane's Φ(λ), Brennan and Kane's Φ,
p_o, and kappa. To compute the kappa coefficient for classically
parallel tests, the average of the corresponding marginal
probabilities was used to determine the probability of chance agreement.
Similarly, for K²(X,T_X), the averages of the classically parallel
tests' means and variances were used as the values of μ_X (μ_Y) and
σ²_X (σ²_Y), respectively.
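As an illustration of the two-administration coefficients, the sketch below computes p_o and kappa from mastery classifications on two forms, using the average of the two marginal mastery proportions for chance agreement as described above. The function name and the input format are hypothetical.

```python
def agreement_and_kappa(mastery_form1, mastery_form2):
    """Return (p_o, kappa) from paired True/False mastery decisions.

    Chance agreement uses the average of the two forms' marginal mastery
    proportions, as the text describes for classically parallel tests.
    """
    n = len(mastery_form1)
    # p_o: proportion of examinees classified the same way on both forms.
    p_o = sum(a == b for a, b in zip(mastery_form1, mastery_form2)) / n
    p_master = (sum(mastery_form1) + sum(mastery_form2)) / (2 * n)
    p_chance = p_master ** 2 + (1 - p_master) ** 2
    kappa = (p_o - p_chance) / (1 - p_chance)
    return p_o, kappa
```

A classification list can be produced from raw scores with, for example, `[score >= 8 for score in form1_scores]`, where 8 is the advancement score for the cell in question.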

As can be seen in Table 3, ρ11 increased as test length
increased. In general, given a particular distribution and test
length, ρ11 was higher in the classically parallel condition than in
the randomly parallel condition. The exceptions occurred for shorter
tests. Comparing the results for each distribution, it becomes clear
that tests derived from more internally consistent domains had higher
alternate form reliabilities than those from domains in which the item
intercorrelations were not as high.

Table 1.--Characteristics of Each Randomly Parallel Alternate Form.
(Columns: Distribution, Test Length, Mean, Mode, Standard Deviation,
Skewness, Kurtosis.)

Table 2.--Characteristics of Each Classically Parallel Alternate Form.
(Columns: Distribution, Test Length, Mean, Mode, Standard Deviation,
Skewness, Kurtosis.)

Table 3.--Classical Reliability of Randomly and Classically Parallel
          Alternate Forms for Each Distribution/Test Length
          Combination.

               Randomly Parallel                  Classically Parallel
                Alternate Forms                     Alternate Forms
 Test                      Bi-                                  Bi-
Length  Skewed    J      modal   Normal     Skewed     J      modal   Normal
   5     .60     .94      .77   [illeg.]   [illeg.]   .93      .80     .16
  10     .62     .94      .89   [illeg.]   [illeg.]   .97      .90   [illeg.]
  15     .74     .96      .93   [illeg.]   [illeg.]   .98      .94   [illeg.]
  20     .77     .97      .95   [illeg.]   [illeg.]   .98      .95   [illeg.]

Table 4.--Alternate Form Population Values of Livingston's K²(X,T_X) for
          Each Cell of the Design.

                       Randomly Parallel              Classically Parallel
                        Alternate Forms                 Alternate Forms
 Test   Cut-off                    Bi-                             Bi-
Length   Score   Skewed    J     modal  Normal   Skewed    J     modal  Normal
   5       4      .736   .945    .734    .336     .774   .940    .800    .181
  10       7      .847   .939    .899    .190     .904   .975    .917    .413
  10       8      .709   .945    .888    .558     .809   .977    .899    .523
  10       9      .591   .954    .905    .775     .692   .981    .909    .735
  15      11      .876   .962    .927    .309     .921   .977    .946    .466
  15      12      .809   .965    .925    .563     .869   .979    .942    .590
  15      14      .756   .973    .942    .843     .782   .984    .954    .832
  20      14      .911   .966    .941    .416     .942   .980    .956    .518
  20      16      .824   .969    .936    .654     .883   .982    .951    .710
  20      18      .761   .975    .947    .842     .830   .985    .958    .862


Table 5.--Population Values of Brennan and Kane's Φ(λ) for Each Cell
          of the Design.

                       Randomly Parallel
                        Alternate Forms
 Test   Cut-off                    Bi-
Length   Score   Skewed    J     modal  Normal
   5       4      .632   .915    .789    .378
  10       7      .877   .951    .893    .393
  10       8      .774   .956    .882    .548
  10       9      .697   .964    .897    .736
  15      11      .894   .967    .921    .521
  15      12      .837   .970    .918    .645
  15      14      .790   .977    .935    .841
  20      14      .935   .975    .943    .564
  20      16      .873   .977    .937    .708
  20      18      .822   .981    .946    .848


Table 6.--Population Values of Brennan and Kane's Φ for Each Cell
          of the Design.

                       Randomly Parallel
                        Alternate Forms
 Test   Cut-off                    Bi-
Length   Score   Skewed    J     modal  Normal
   5       4      .535   .905    .789    .244
  10       7      .697   .950    .882    .392
  10       8      .697   .950    .882    .392
  10       9      .697   .950    .882    .392
  15      11      .775   .966    .918    .492
  15      12      .775   .966    .918    .492
  15      14      .775   .966    .918    .492
  20      14      .821   .974    .937    .563
  20      16      .821   .974    .937    .563
  20      18      .821   .974    .937    .563


Table 7.--Alternate Form Population Values of p_o for Each Cell of the
          Design.

                       Randomly Parallel              Classically Parallel
                        Alternate Forms                 Alternate Forms
 Test   Cut-off                    Bi-                             Bi-
Length   Score   Skewed    J     modal  Normal   Skewed    J     modal  Normal
   5       4      .927   .943    .852    .499     .935   .928    .925    .613
  10       7      .931   .945    .938    .497     .957   .977    .966    .667
  10       8      .873   .908    .891    .621     .917   .964    .929    .657
  10       9      .783   .847    .828    .815     .826   .904    .829    .733
  15      11      .926   .943    .931    .569     .963   .967    .953    .643
  15      12      .896   .925    .897    .615     .934   .950    .933    .650
  15      14      .760   .835    .782    .893     .766   .851    .791    .879
  20      14      .940   .953    .944    .645     .956   .965    .953    .676
  20      16      .898   .929    .908    .683     .927   .948    .927    .749
  20      18      .798   .862    .833    .888     .830   .885    .859    .900


Table 8.--Alternate Form Population Values of Kappa for Each Cell of the
          Design.

                       Randomly Parallel              Classically Parallel
                        Alternate Forms                 Alternate Forms
 Test   Cut-off                    Bi-                             Bi-
Length   Score   Skewed    J     modal  Normal   Skewed    J     modal  Normal
   5       4      .517   .875    .635    .106     .591   .846    .792    .135
  10       7      .505   .879    .824    .132     .611   .948    .897    .221
  10       8      .435   .807    .735    .162     .572   .922    .809    .310
  10       9      .387   .692    .637    .120     .461   .803    .622    .214
  15      11      .579   .878    .815    .215     .705   .928    .864    .270
  15      12      .579   .844    .752    .154     .636   .893    .824    .279
  15      14      .462   .668    .566    .107     .439   .703    .576    .305
  20      14      .593   .896    .842    .296     .690   .924    .865    .335
  20      16      .589   .851    .777    .253     .696   .892    .819    .372
  20      18      .500   .725    .655    .233     .580   .771    .705    .283


Several characteristics of the criterion-referenced coefficients
deserve attention. First, not surprisingly, those computed using
classically parallel tests were generally greater than their counterparts in
the randomly parallel condition. (This comparison was, of course, only
relevant for K²(X,T_X), p_o, and kappa.) The exceptions appeared to be
related to the size of ρ11 and the location of the cut-off. For example,
within the J distribution, Table 3 indicates that ρ11 of the 5-item
randomly parallel tests was slightly greater than its classically
parallel counterpart. Likewise, K²(X,T_X), p_o, and kappa were also higher
in the randomly parallel condition. In other cases, K²(X,T_X) was higher
in the randomly parallel condition even though ρ11 was lower. In these
instances, the means of the randomly parallel tests were further from the
cut-off than the means of the classically parallel tests. As Shavelson
et al. (1972) noted, the difference between the cut-off and the mean can
influence K²(X,T_X) more than ρ11 does. For p_o and kappa, the
relationship of the cut-off to heavy score density areas and to the size of the
chance agreement probability appeared to account for the other
exceptions.

Second, Φ(λ) and Φ increased as test length increased. Except for a
few instances in the randomly parallel condition, K²(X,T_X) was also an
increasing function of test length. Contrary to previous findings,
p_o and kappa did not follow this trend even though ρ11 increased
(Eignor & Hambleton, 1979; Subkoviak, 1978). This latter result
indicates that the size of the error (expressed as a proportion) found
in classical reliability may not correlate with the proportion of error
found in reliability coefficients based on the Platonic true score model.


Third, given a particular test length, p_o increased as the cut-off
moved away from heavy score density areas. For the skewed, J, and
bimodal distributions, these areas were in the upper extremes of the
distribution. Although p_o has been known to increase as the cut-off
approaches the extremes, the score density appears to have had more
influence on the size of p_o in this study. Except for the normal
distribution, the changes in the value of kappa as a function of the cut-off
generally followed the same pattern as p_o. One might expect kappa to
become higher as the cut-off approaches denser areas because the
probability of chance agreement decreases. However, the author believes that
due to the large size of these dense areas in the skewed, J, and bimodal
distributions, p_o was reduced enough to outweigh this factor. For the
normal distribution, the strength of the heavy score density areas and
the size of the chance agreement probability also appeared to interact,
producing some unusual patterns of kappa coefficients. Finally, as
expected, K²(X,T_X) and Φ(λ) increased as the distance between the cut-off
and the mean increased.
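The dependence of Livingston's coefficient on the distance between the mean and the cut-off can be seen from its standard single-form expression (Livingston, 1972), sketched below. This is the textbook formula, not a transcription of Figure 8, and the numbers in the usage note are illustrative only.

```python
def livingston_k2(reliability, variance, mean, cutoff):
    """Livingston's K² from classical reliability, score variance, mean, and cut-off.

    The mean and cut-off must be on the same scale (for example, number of
    items answered correctly).
    """
    squared_distance = (mean - cutoff) ** 2
    return (reliability * variance + squared_distance) / (variance + squared_distance)
```

When the cut-off equals the mean the coefficient reduces to the classical reliability; with a reliability of .6 and a variance of 4, moving the cut-off two points from the mean raises it to .8.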

Bias

Appendices A1 to A24 present the mean bias and standard deviation of
each single test administration coefficient for each cell of the
design. A negative value indicates underestimation, and a positive value
means that the single test administration coefficient overestimated its
population value. Except in two instances, the results for Subkoviak's
and Marshall's p_o estimates were equal, confirming Marshall and Serlin's
findings (1979). For the two exceptions, one for bias and one for
standard deviation, the results differed by only .001, indicating that
the differences may simply be due to rounding error. Therefore, to avoid
redundancy, only one of these coefficients, Subkoviak's p_o, is mentioned
and discussed below. The reader should assume that this discussion
applies equally to Marshall's p_o. To investigate each hypothesis, the
mean of these statistics across appropriate cells was computed. In doing
so, each cell's mean and standard deviation was weighted by the number of
samples upon which it was based.

Throughout the ensuing discussion, the use of the term
"significance" means practical significance, rather than statistical
significance. For this study, any mean biases and standard deviations
greater than or equal to .025 and differences between mean biases and
standard deviations greater than or equal to this value were considered
practically significant.
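The weighting and the practical significance rule just described amount to the following sketch. The function names are hypothetical and the cell values in the usage note are invented for illustration.

```python
PRACTICAL_THRESHOLD = 0.025   # criterion stated in the text

def weighted_mean(cell_values, samples_per_cell):
    """Average cell statistics, weighting each cell by its number of samples."""
    total_weight = sum(samples_per_cell)
    return sum(v * w for v, w in zip(cell_values, samples_per_cell)) / total_weight

def is_practically_significant(value):
    """Apply the .025 criterion to a mean bias, standard deviation, or difference."""
    return abs(value) >= PRACTICAL_THRESHOLD
```

For example, two cells with mean biases of .02 (30 samples) and .05 (20 samples) combine to a weighted mean of .032, which meets the criterion, whereas a mean bias of .011 does not.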

The relative ability of the coefficients to estimate their respec-
tive population reliability coefficients for randomly versus classically
parallel tests was examined for K²(X,T_X), p_o, and kappa since their
single test administration estimates assume classic parallelism.
Collapsing across number of examinees, distribution type, test length,
and cut-off score, Table 9 contains the mean bias of each estimate for
both types of parallelism. Contrary to expectation, the absolute mean
bias in the randomly parallel condition was less than or equal to that in
the classically parallel condition for every coefficient. However, the
only significant difference between the two conditions was for
Subkoviak's kappa. Taking direction into account, violation of the classic
parallelism assumption significantly altered the mean bias of K²(X,T_X)
and the kappa estimates, while the p_o estimates were fairly robust. In


Table 9.--Mean Bias (Across Cells) of Various Coefficients in Estimating
          the Reliability of Classically and Randomly Parallel Alternate
          Forms.

                                Type of Parallelism
Coefficient                     Random      Classic
Livingston's K²(X,T_X)           .019        -.019
Subkoviak's p_o                 -.010        -.030
Huynh's p_o                      .011        -.012
Subkoviak's kappa               -.023        -.111
Huynh's kappa                    .039        -.050


the classically parallel case, all indices underestimated the population
coefficient with the kappa estimates and Subkoviak's p_o doing so
significantly. Given randomly parallel tests, only Subkoviak's coefficients
were underestimates. The others overestimated their corresponding
parameters with Huynh's kappa being a significant overestimate. In previous
research, Huynh's coefficients have always been underestimates (Huynh &
Saunders, 1979; Subkoviak, 1978). However, these studies used equivalent
tests. The present findings support the past research, but also indicate
that past results do not generalize to the randomly parallel condition.

The second hypothesis was that the p_o and kappa estimates would be
more biased for those distributions not belonging to the beta-binomial
family (i.e., bimodal and normal). Even though no hypotheses were
generated for the influence of the distribution upon K²(X,T_X),
Φ(λ), and Φ, Table 10 presents the mean bias of every coefficient for
each distribution. Based upon the absolute value of the mean bias, the


Table 10.--Mean Bias Across Cells of Each Reliability Coefficient for
Each Distribution.

 

 

Distribution
Coefficient Skewed J-Shaped Bimodal Normal
Livingston's 32(x,rx) .004 -.005 -.025 .028
Brennan & Kane's $(l)a .057 .008 .010 .061
Brennan 8 Kane's $3 .086 .009 .011 .141
Subkoviak's ﬁe -.020 -.o11 -.028 -.019
Huynh's §Q_ -.011 .018 -.011 .000
Subkoviak's ﬁ_ -.076 -.o32 -.069 -.o92
Huynh's i -.018 .034 -.025 -.014

8Means for these coefficients were based only on cells within the

randomly parallel condition.
pattern of results for Subkoviak's coefficients conformed somewhat to
that predicted. Specifically, Subkoviak's 80 and E were least biased
for the J distribution and most biased given the bimodal and the normal
distributions, respectively. In the case of Subkoviak's g, the differ-
ences between the J and the other distributions were significant.
Contrary to expectation, Subkoviak's go was almost equally biased for the
skewed and normal distributions, and Subkoviak's E was slightly more
biased for the skewed than for the bimodal. Generally, the absolute mean
bias of Huynh's coefficients followed a pattern opposite to that
predicted; Huynh's p0 and g were least biased for the normal distri-
bution and most biased for the J distribution. However, as expected,

Huynh's g was less accurate for the bimodal than for the skewed

100

distribution. For Huynh's go’ the biases associated with these two
distributions were equal and in the same direction. In no case did any
distribution significantly change the absolute mean bias of Huynh's
coefficients.
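
The rankings described above follow directly from the absolute values in
Table 10. As a hedged illustration (the names `TABLE_10`,
`DISTRIBUTIONS`, and `least_and_most_biased` are assumptions made for
this sketch, not part of the study), the least and most biased
distribution for each agreement coefficient can be recovered as follows.

```python
# Mean bias values transcribed from Table 10, ordered as
# (skewed, J-shaped, bimodal, normal). Labels are shorthand for this sketch.
DISTRIBUTIONS = ("skewed", "J", "bimodal", "normal")
TABLE_10 = {
    "Subkoviak p_o": (-0.020, -0.011, -0.028, -0.019),
    "Huynh p_o": (-0.011, 0.018, -0.011, 0.000),
    "Subkoviak kappa": (-0.076, -0.032, -0.069, -0.092),
    "Huynh kappa": (-0.018, 0.034, -0.025, -0.014),
}


def least_and_most_biased(row):
    """Return the distributions with the smallest and largest |mean bias|."""
    ranked = sorted(zip(DISTRIBUTIONS, row), key=lambda pair: abs(pair[1]))
    return ranked[0][0], ranked[-1][0]


extremes = {name: least_and_most_biased(row) for name, row in TABLE_10.items()}

if __name__ == "__main__":
    for name, (least, most) in extremes.items():
        print(f"{name:16s} least biased: {least:8s} most biased: {most}")
```

Ranking on absolute value ignores direction, which is why the sign
changes noted for Huynh's coefficients have to be read from the table
itself.
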

Considering both magnitude and direction, Subkoviak's $\hat{p}_o$
consistently underestimated the population value with significant bias
occurring for the bimodal distribution. Note, however, that the mean
bias of this coefficient was not significantly altered by changes in the
distribution's shape, regardless of whether or not the type of
distribution violated the underlying assumptions. On the other hand,
altering the distribution changed the direction of bias for Huynh's
$\hat{p}_o$, leading to significant differences between the results for
the J distribution and those found for the skewed and bimodal
distributions. Specifically, Huynh's $\hat{p}_o$ was unbiased for the
normal distribution, slightly negatively biased for the skewed and
bimodal, and positively biased for the J distribution. In no case were
these degrees of bias significant. For both kappa estimates, the bias
associated with the J distribution was significantly different from that
found in the other conditions. Specifically, Subkoviak's $\hat{\kappa}$
underestimated kappa much more for the other distributions, although the
extent of bias was significant throughout. In the case of Huynh's
$\hat{\kappa}$, the J distribution significantly affected the direction
of bias as it had done for Huynh's $\hat{p}_o$; Huynh's $\hat{\kappa}$ was
positively biased for the J distribution and negatively biased for the
others. In addition, the biases associated with the J and bimodal
distributions were significant.

Table 10 indicates that the bias of $\hat{k}^2(X, T_X)$ was significantly
affected by the type of distribution in terms of both magnitude and
direction. The biases for the bimodal and normal distributions differed
significantly with significant underestimation associated with the former
and an approximately equal, but positive, bias corresponding to the
latter distribution. The mean biases for the skewed and J distributions
were close to zero and were significantly different from those found for
the bimodal and normal distributions, respectively.

$\hat{\Phi}(\lambda)$ and $\hat{\Phi}$ followed the same pattern. They
consistently overestimated their parameters and were significantly less
accurate for the normal and skewed distributions than for the others. As
a matter of fact, the extent of bias associated with the normal and
skewed distributions was quite high and significant, but was very slight
for the other distributions. For $\hat{\Phi}$, the normal distribution's
mean bias was also significantly greater than that found for the skewed
distribution. Finally, $\hat{\Phi}$ was more biased than
$\hat{\Phi}(\lambda)$, although the differences for the J and bimodal
distributions were negligible.

Moving the location of the cut-off score was expected to have no
influence on the coefficients' accuracy. The mean bias associated with
each cut-off score can be found in Table 11. These means were based on
the results for 10, 15, and 20-item tests. Five-item tests were not
included because only one cut-off score was examined for this test
length, i.e., the design of the study was not completely crossed.
Although the cut-off scores associated with the 15-item tests were not
exactly equal to those of the other two test lengths, the researcher felt
the slight deviations would not significantly affect the results.

Table 11.--Mean Bias Across Cells of Each Coefficient for Each Cut-off
Score.

                                                Cut-Off Score
Coefficient                                   70%       80%       90%
Livingston's $\hat{k}^2(X, T_X)$             .016     -.004     -.003
Brennan & Kane's $\hat{\Phi}(\lambda)$^a     .038      .032      .035
Subkoviak's $\hat{p}_o$                     -.012     -.029     -.012
Huynh's $\hat{p}_o$                         -.010     -.014      .027
Subkoviak's $\hat{\kappa}$                  -.071     -.069     -.030
Huynh's $\hat{\kappa}$                      -.052     -.016      .075

^aMeans for these coefficients were based only on cells within the
randomly parallel condition.

As can be seen, the expectation was confirmed for $\hat{k}^2(X, T_X)$,
$\hat{\Phi}(\lambda)$, and Subkoviak's $\hat{p}_o$ since changes in the
cut-off score did not significantly alter these coefficients' accuracy.
However, the biases of Huynh's estimates and Subkoviak's $\hat{\kappa}$
for the 90% cut-off were significantly different from those found for the
other two cut-offs. Specifically, Huynh's $\hat{\kappa}$ significantly
overestimated kappa for the 90% cut-off, but significantly and moderately
underestimated this parameter for cut-offs of 70% and 80%, respectively.
Huynh's $\hat{p}_o$ followed a similar pattern, although the bias
associated with the 70% cut score was not significant. Subkoviak's
$\hat{\kappa}$ significantly underestimated kappa, regardless of cut-off
score, but did so significantly less for the 90% cut score. Finally, for
Huynh's $\hat{\kappa}$, setting the cut-off score at 70% led to
significantly more underestimation than did the 80% cut-off.

Since the main effects hypotheses concerning bias were generally
unsupported, a three-way interaction effect among the relevant variables
(i.e., type of parallelism, distribution, and cut-off score) was
examined. The bias of each 5-item test was again excluded because only
one cut-off score was examined for this test length. The results of this
analysis can be seen in Table 12 and are discussed below for each
coefficient separately.

Livingston's $\hat{k}^2(X, T_X)$. For the J and bimodal distributions,
neither violating the classic parallelism assumption nor moving the
cut-off score significantly altered this coefficient's accuracy. On the
other hand, the absolute mean biases belonging to the skewed and normal
distributions were significantly greater in the randomly parallel
condition than in the classically parallel case for cut-off scores
located nearest to the distributions' population means. Accounting for
both magnitude and direction, altering parallelism conditions
significantly changed the bias of $\hat{k}^2(X, T_X)$ for every cut-off
score within the skewed distribution and for the 70% cut-off within the
normal distribution. In the former case, the differences increased as the
cut-off approached the population mean since the mean bias became more
negative in the classically parallel case and more positive in the
randomly parallel condition. As a matter of fact, varying the cut-off
score significantly altered the bias in the randomly parallel condition.
Significant differences as a function of cut-off score were also evident
in the classically parallel condition
when the results for the 70% and 90% cut-offs were compared. For the
normal distribution, altering the cut-off did not appreciably affect the
mean bias in the classically parallel condition.

Table 12.--Mean Bias Across Cells of Each Coefficient for
Every Parallelism/Distribution/Cut-off Score Combination.

70% Cut-off
                                        Type of
Coefficient                             Parallelism  Skewed       J  Bimodal  Normal
Livingston's $\hat{k}^2(X, T_X)$        Random         .018    .004    -.016    .171
Brennan & Kane's $\hat{\Phi}(\lambda)$  Random         .027    .008     .013    .107
Huynh's $\hat{p}_o$                     Random        -.017   -.006    -.037    .090
Subkoviak's $\hat{\kappa}$              Random         .000   -.028    -.071   -.037
                                        Classic       -.205   -.040    -.084   -.106
Huynh's $\hat{\kappa}$                  Random         .016   -.012    -.078    .037
                                        Classic       -.175   -.03?    -.12?   -.037

80% Cut-off
                                        Type of
Coefficient                             Parallelism  Skewed       J  Bimodal  Normal
Livingston's $\hat{k}^2(X, T_X)$        Random         .050    .002    -.017   -.003
                                        Classic       -.028   -.008    -.031    .000
Brennan & Kane's $\hat{\Phi}(\lambda)$  Random         .054    .007     .015    .053
Subkoviak's $\hat{p}_o$                 Classic       -.035   -.027    -.037   -.062
Huynh's $\hat{p}_o$                     Random        -.009    .016    -.012    .002
                                        Classic       -.038   -.003    -.032   -.036
Subkoviak's $\hat{\kappa}$              Random         .026   -.022    -.048   -.025
Huynh's $\hat{\kappa}$                  Random         .062    .032    -.007    .056
                                        Classic       -.109   -.008    -.065   -.093

90% Cut-off
                                        Type of
Coefficient                             Parallelism  Skewed       J  Bimodal  Normal
Livingston's $\hat{k}^2(X, T_X)$        Classic       -.042   -.007    -.030   -.002
Brennan & Kane's $\hat{\Phi}(\lambda)$  Random         .083    .005     .014    .038
Huynh's $\hat{p}_o$                     Classic        .007    .061     .048   -.021
Subkoviak's $\hat{\kappa}$              Classic       -.057   -.027    -.008   -.190
Huynh's $\hat{\kappa}$                  Random         .150    .151     .108    .024

However, given random parallelism, the mean bias associated with the 70%
cut-off was very large and significantly different from that found for
the other two cut-offs. Specifically, $\hat{k}^2(X, T_X)$ greatly
overestimated its parameter when the cut-off equalled 70%, but fairly
accurately estimated $k^2(X, T_X)$ for the 80% cut-off, and moderately
underestimated $k^2(X, T_X)$ given the 90% cut-off.

Although no hypothesis was made concerning the influence of
distributional shape, this variable did have an impact. In the randomly
parallel condition, the effects varied across cut-off score due to the
changes induced by this variable within the normal and skewed
distributions. The J distribution resulted in the least bias. As a matter
of fact, its bias was close to zero, regardless of cut-off score.
Although not significant, $\hat{k}^2(X, T_X)$ consistently underestimated
its parameter for the bimodal distribution. The relationship between the
J and bimodal distributions' results remained fairly consistent across
cut-off score. The greatest degree of bias was associated with the normal
distribution for the 70% cut-off and with the skewed distribution given
the other two cut-offs. In these instances, the bias differed
significantly from that corresponding to the other distributions with
each one significantly overestimating its population value. The only
other significant difference was found between the bimodal and skewed
distributions for a 70% cut-off score. $\hat{k}^2(X, T_X)$ underestimated
$k^2(X, T_X)$ in the former case and overestimated $k^2(X, T_X)$ in the
latter case.

For the classic parallelism condition, the pattern of mean bias
created by changing the distribution was similar across cut-off score.
The mean biases for the normal and J distributions were almost zero in
every case with the former distribution resulting in no bias for the 80%
cut-off. The mean biases associated with the skewed and bimodal
distributions were quite similar. In both cases, $k^2(X, T_X)$ was
consistently underestimated with significant bias occurring for the 80%
and 90% cut-offs. The bimodal distribution led to significant
underestimation for the 70% cut-off, as well. The biases produced by
these distributions were significantly different from those found for the
normal distribution, regardless of cut-off score. For the 90% cut-off,
the skewed distribution also displayed significantly more bias than the J
distribution did. Once again, the relationship between the mean biases of
the J and bimodal distributions was fairly consistent across cut-off
score.

Brennan and Kane's $\hat{\Phi}(\lambda)$. Since $\hat{\Phi}(\lambda)$ was
not computed for classically parallel tests in this study, Table 12
contains the mean bias of this coefficient for every distribution/cut-off
score combination. The results almost paralleled those found for
$\hat{k}^2(X, T_X)$ in the randomly parallel condition. As a matter of
fact, in terms of absolute value, the patterns of results for
$\hat{k}^2(X, T_X)$ and $\hat{\Phi}(\lambda)$ were, with one exception,
nearly identical. In many cases, the actual degrees of bias were very
similar. Note, however, that $\hat{\Phi}(\lambda)$, on the average,
consistently overestimated its parametric value, while
$\hat{k}^2(X, T_X)$ did not. The following discussion elaborates upon the
similarities between these coefficients.
Altering the cut-off score within the J and bimodal distributions
hardly changed this coefficient's accuracy. However, as the cut-off
approached the population means of the other distributions, the mean
biases increased. For the skewed distribution, each increase was
significant. When the distribution was normal, the mean bias associated
with the 70% cut-off score was significantly greater than that found for
the other two cut-offs.

Changes in the frequency distribution also altered the results. The
mean biases corresponding to the J distribution were consistently close
to zero. The bimodal distribution created slightly more inaccuracy.
Because neither of these distributions was affected by cut-off score, the
relationship between them remained fairly constant across cut-off
score. When the cut-off was 70%, the normal distribution's mean bias was
extremely large and significantly different from that found for the other
distributions. Contrary to the pattern established by
$\hat{k}^2(X, T_X)$, the biases of the skewed and normal distributions
were, on the average, comparable, significant, and significantly
different from that found for the other two distributions when the
cut-off equalled 80%. Finally, given a 90% cut-off, the skewed
distribution produced a large mean bias which was significantly greater
than the biases of the other distributions. In this situation, the mean
bias associated with the normal distribution was also significant as well
as significantly greater than the J distribution's mean bias.

Subkoviak's $\hat{p}_o$. In terms of absolute value, violating the classic
parallelism assumption did not significantly affect the accuracy of
Subkoviak's $\hat{p}_o$ for the J and bimodal distributions. If one
considers direction, however, type of parallelism did significantly alter
the J distribution's results when the cut-off was 90%; on the average,
the random parallelism situation led to overestimation, while its
classically parallel counterpart produced a fairly accurate estimate.

Contrary to the hypothesis, the skewed distribution's absolute mean
bias was greater when the classic parallelism assumption was valid. As
the cut-off approached this distribution's population mode (mean), the
differences in the mean bias of the classically and randomly parallel
conditions increased since the bias became more negative in the former
case and less negative in the latter case. In the latter condition, the
bias was even slightly positive for the 90% cut-off. However, whether or
not direction was taken into account, violating the classic parallelism
assumption significantly altered only the biases corresponding to the 80%
and 90% cut-offs.

Although neither type of parallelism was consistently associated
with less bias, significant differences also occurred for the normal
distribution. For the 80% cut-off, the absolute mean bias was greater in
the classically parallel situation, while the opposite was true for the
90% cut-off. Taking direction into account, violating the classic
parallelism assumption significantly altered the results for all cut-off
scores within this distribution. For the 80% and 90% cut-off scores,
$\hat{p}_o$ consistently underestimated its parametric value. The bias
corresponding to the 70% cut-off was negative in the classically parallel
condition but positive in the randomly parallel situation.

Keeping type of parallelism constant, changing the cut-off score did
not significantly alter the mean biases within the beta-binomial
distributions, except in the case of randomly parallel J distributed
tests. In this instance, the mean bias for the 90% cut-off was positive,
while the mean biases for the 70% and 80% cut-offs were slightly
negative. When the distribution was either bimodal or normal, moving the
cut-off did lead to significant differences. Specifically, given classic
parallelism, the 80% cut-off produced much more underestimation than did
the 90% cut-off. In addition, for the normal distribution, the 70%
cut-off resulted in significantly more negative bias than did the 90%
cut-off. The accuracy of the reliability estimates for randomly parallel
tests was also significantly affected. When the distribution was
bimodal, Subkoviak's $\hat{p}_o$ largely underestimated $p_o$ for a 70%
cut-off but fairly accurately estimated the population value given the
90% cut-off. For the normal distribution, Subkoviak's coefficient largely
overestimated $p_o$ when the cut-off was 70% while largely
underestimating $p_o$ for higher cut-off scores. Also, the 90% cut-off
produced significantly more underestimation than did the 80% cut-off.

For the 70% and 80% cut-off scores, the patterns of results formed
by changing the distributional shape were similar and partially supported
the hypothesis that the normal and bimodal distributions would produce
more bias than the beta-binomial distributions. Given random
parallelism, the biases corresponding to the J and skewed distributions
were low, negative, and approximately equal. The bimodal distribution
produced significant underestimation, but the results were not
significantly different from those found for the J and skewed
distributions. Although differing in direction, the normal distribution
was significantly biased for both cut-off scores and, for the cut-off
closest to its population mode (mean), significantly more biased than the
other distributions. When the cut-off equalled 80%, the bias found for
the normal distribution was also much worse than that found for the
beta-binomial distributions, but the differences did not attain
significance. Given classic parallelism, the J distribution once again
produced the least bias with significant underestimation occurring for
the 80% cut-off. The skewed and bimodal distributions resulted in
slightly more negative bias which, therefore, also reached significance
for the 80% cut-off. For the latter distribution, the extent of
underestimation was also greater than -.025 when the cut-off was 70%.
When the distribution was normal, significant underestimation occurred
for both cut-off scores. In support of the hypothesis, the differences
between the normal distribution's results and those of the beta-binomial
distributions attained significance. The mean biases associated with the
normal and bimodal distributions also differed significantly for the 80%
cut-off.

The pattern of results for the 90% cut-off was somewhat different.
Unexpectedly, the bimodal distribution, on the average, produced fairly
accurate estimates in both parallelism conditions. For the randomly
parallel situation, the negative bias associated with the normal
distribution was significant and, in terms of absolute value,
significantly greater than that found for the other distributions.
However, such was not the case for the classically parallel condition. As
a matter of fact, the skewed distribution claimed the greatest bias which
was significantly negative as well as significantly greater than that
found for the J and bimodal distributions.

Huynh's $\hat{p}_o$. Similar to the results found for Subkoviak's
$\hat{p}_o$, violating the classic parallelism assumption did not
significantly affect the bias within the J and bimodal distributions.
When the distribution was skewed, type of parallelism did significantly
alter the results for the 80% and 90% cut-offs. These differences were
significant whether or not direction was taken into account. The
direction and degrees of bias corresponding to the 80% cut-off were
almost exactly the same as those found for Subkoviak's $\hat{p}_o$ with
the classic parallelism situation resulting in more underestimation.
However, for the 90% cut-off, Huynh's $\hat{p}_o$ was significantly more
biased in the randomly parallel condition, while Subkoviak's estimate
produced more bias in the classically parallel condition. In fact,
Huynh's coefficient significantly overestimated the parameter in the
randomly parallel condition but provided a fairly accurate estimate in
the classically parallel condition. Finally, whether or not one considers
direction, the biases associated with each parallelism condition within
the normal distribution were significantly different from each other,
regardless of cut-off score. Once again, both $p_o$ estimates followed a
similar pattern. For cut-offs of 70% and 90%, the absolute mean bias
associated with random parallelism was greater than that found in the
classically parallel condition, while the opposite held true for the 80%
cut-off. The mean bias was consistently negative in the classic
parallelism condition. However, in the random parallelism situation, the
mean bias was highly positive, virtually zero, and highly negative for
the 70%, 80%, and 90% cut-offs, respectively.

Changes in the cut-off score significantly impacted the mean bias
within every distribution. For those distributions with their population
mode (mean) close to 90%, the mean bias corresponding to this extreme
cut-off differed significantly from that found for the other cut-off
scores. Generally, the mean biases for the 90% cut-off were
significantly positive, while the mean biases associated with the other
cut-off scores ranged from slightly positive to significantly negative.
One exception to this trend occurred when estimating the reliability of
classically parallel tests having skewed distributions. In this case,
the mean bias for the 90% cut-off was close to zero. Among the three
distributions, the bimodal produced the only significant difference
between the 70% and 80% cut-offs; given random parallelism, significantly
more negative bias was found for the former than for the 80% cut-off.

On the other hand, within the normal distribution, altering the
cut-off had no major effect in the classic parallelism condition. In the
case of random parallelism, the results for the various cut-off scores
were all significantly different from each other. As noted previously,
the 70% cut-off led to significant overestimation as it had done for most
of the other coefficients, while the 80% and 90% cut-off scores resulted
in a fairly accurate estimate and a significant negative bias,
respectively.

The hypothesis that the normal and bimodal distributions would
produce more bias than the beta-binomial distributions was generally
unsupported. Although the type of distribution significantly affected
the direction and/or extent of bias, no consistent pattern could be found
either across cut-off score or parallelism condition.

Subkoviak's $\hat{\kappa}$. Contrary to previous results, type of
parallelism significantly affected the accuracy of Subkoviak's
$\hat{\kappa}$ within every distribution. In terms of absolute value, the
J and bimodal distributions were sensitive to this variable when the
cut-off was 80%; $\hat{\kappa}$ was more negatively biased in the
classically parallel condition. When direction was considered,
parallelism produced an additional significant effect for the J
distribution. Specifically, when the cut-off was 90%, the biases
associated with the two types of parallelism were equal but opposite in
direction. As usual, parallelism significantly affected the results
associated with the skewed and normal distributions. In terms of
absolute value, the differences were significant for all cut-off scores
within the normal distribution and for the 70% and 80% cut-off scores
within the skewed distribution. In all these cases, the classic
parallelism condition produced more bias than did its randomly parallel
counterpart. When direction was considered, all respective comparisons
within these two distributions were significant. For classically
parallel tests having skewed distributions, $\hat{\kappa}$ consistently
underestimated its parameter. However, for randomly parallel skewed
tests, $\hat{\kappa}$ was, on the average, unbiased when the cut-off was
70% and positively biased given the other cut-offs. When the tests were
normally distributed, $\hat{\kappa}$ underestimated the population value,
regardless of cut-off score and type of parallelism.

Cut-off score also had a pervasive effect. For the three
distributions having their population modes (means) near 90%, the mean
biases associated with this extreme cut-off were, in general,
significantly less negative than that found for the other two cut-offs.
Two very distinct deviations from this trend occurred for the skewed and
J distributions within the randomly parallel condition. For the former
distribution, the mean bias was either zero or positive and increased
significantly as the cut-off approached 90%. For the J distribution,
moving the cut-off affected the bias' direction but not its magnitude in
that $\hat{\kappa}$ overestimated kappa for the 90% cut-off and almost
equally underestimated this parameter for the other cut-offs. In
addition, given classic parallelism, the 80% cut-off produced
significantly greater underestimation than did the 70% cut-off for the J
distribution, while the opposite occurred for the skewed distribution.

The pattern of results formed by moving the cut-off was quite
different for the normal distribution. In the randomly and classically
parallel situations, the 90% cut-off produced more underestimation than
did the 80% and 70% cut-offs, respectively. In addition, given classic
parallelism, the 80% cut-off resulted in significantly more negative bias
than did the 70% cut-off.

Generally, the hypothesis that $\hat{\kappa}$ would be more accurate for
beta-binomial distributions was not supported. Once again, the pattern of
results formed by altering the distribution varied across cut-off
score. However, in the classically parallel condition, $\hat{\kappa}$
significantly underestimated kappa for every distribution and cut-off
score, except one; for the bimodal distribution, the bias found when the
cut-off was 90% was close to zero. In the randomly parallel condition,
the degree of bias was generally significant but varied in direction.
However, for the skewed and bimodal distributions, the mean bias was zero
or close to zero for the 70% and 90% cut-off scores, respectively.

Huynh's $\hat{\kappa}$. With relatively few exceptions, the bias of Huynh's $\hat{\kappa}$
was significantly affected when any of the variables in the study changed
values. Moreover, the results did not follow any pattern, making
interpretation very difficult. Therefore, the following observations are
not as specific as those presented for the other coefficients.

Across all distributions and cut-off scores, the absolute mean
biases within the parallelism conditions were comparable in only four
cases. For the 90% cut-off within the J distribution, $\hat{\kappa}$
produced significantly more overestimation in the randomly parallel
condition. On the other hand, when the distribution was bimodal, the bias
was significantly more negative in the classically, as opposed to the
randomly parallel, condition when the cut-off was either 70% or 80%. In
terms of absolute value, the skewed distribution also produced
significantly more bias for the classically parallel situation given
cut-off scores of 70% and 80%. However, when the cut-off equalled 90%,
the randomly parallel condition produced significantly more bias.
Finally, the absolute mean bias for classically parallel normally
distributed tests was greater than that found for their randomly parallel
counterparts for cut-off scores of 80% and 90%.

Violating the classic parallelism assumption also affected the
direction as well as the magnitude of the bias, especially for the skewed
and normal distributions. For both these distributions, $\hat{\kappa}$ was positive
in the random parallelism condition and, generally, negative in the
classic parallelism condition.

For the three distributions having their modes (means) near 90%, the
bias associated with this extreme cut-off differed significantly from
that found for the other cut-offs. Specifically, the mean biases for the
90% cut-off were significantly positive, while the mean biases
corresponding to the other cut-offs ranged from significantly positive to
significantly negative. In all three distributions, the results for the
70% and 80% cut-offs also differed significantly, regardless of
parallelism. As the cut-off moved from 70% to 90%, the bias moved toward
overestimation.

Changing the cut-off did not affect the normal distribution as
drastically. In the randomly parallel condition, the bias for the 80%
cut-off was significantly more positive than that found for the 90%
cut-off. When the tests were classically parallel, the 70% cut-off
produced significantly less negative bias than did the 80% and 90%
cut-off scores.

The hypothesis that $\hat{\kappa}$ would be more accurate for the skewed
and J distributions than for the other two distributions was generally
not supported. Once again, the pattern of results was inconsistent across
cut-off score. In general, the degrees of bias were significant. However,
for the 80% cut-off, the biases associated with the J and bimodal
distributions were close to zero for the classically and randomly
parallel conditions, respectively.

Sampling Variability

The variability of each coefficient across samples was predicted to
be inversely related to the test length and the sample size. For each
coefficient, Tables 13 and 14 present the weighted mean standard
deviation across cells associated with each test length and sample size,
respectively. Clearly, the results support the hypothesis, i.e., the
sampling variability decreased as the test length increased as well as
when the sample size increased. As can be seen, Subkoviak's coefficients
were more variable than Huynh's coefficients for every test length and
cut-off score. $\hat{\Phi}$ appeared to be more unstable than
$\hat{\Phi}(\lambda)$.

Table 15 contains the mean standard deviation for every test
length/sample size combination. No significant interaction effects were
evident.

Table 13.--Mean Standard Deviation Across Cells of Each Coefficient for
Each Test Length.

                                                 Test Length
Coefficient                                    10        15        20
Livingston's $\hat{k}^2(X, T_X)$             .049      .039      .032
Brennan & Kane's $\hat{\Phi}(\lambda)$^a     .042      .028      .019
Brennan & Kane's $\hat{\Phi}$^a              .050      .037      .026
Subkoviak's $\hat{p}_o$                      .035      .033      .030
Huynh's $\hat{p}_o$                          .029      .025      .022
Subkoviak's $\hat{\kappa}$                   .089      .085      .078
Huynh's $\hat{\kappa}$                       .073      .064      .058

^aThe mean standard deviations for these coefficients were based only
on cells within the randomly parallel condition.

Table 14.--Mean Standard Deviation Across Cells of Each Coefficient for
Each Sample Size.

                                                 Sample Size
Coefficient                                    25        35        50
Livingston's $\hat{k}^2(X, T_X)$             .050      .043      .042
Brennan & Kane's $\hat{\Phi}(\lambda)$^a     .040      .036      .031
Brennan & Kane's $\hat{\Phi}$^a              .050      .043      .037
Subkoviak's $\hat{p}_o$                      .038      .033      .030
Huynh's $\hat{p}_o$                          .030      .025      .024
Subkoviak's $\hat{\kappa}$                   .099      .085      .077
Huynh's $\hat{\kappa}$                       .078      .067      .063

^aThe mean standard deviations for these coefficients were based only
on cells within the randomly parallel condition.

Table 15.--Mean Standard Deviation Across Cells for Each Sample Size/Test
Length Combination.

                                           Test          Sample Size
Coefficient                                Length      25      35      50
Livingston's $\hat{k}^2(X, T_X)$             10      .055    .048    .046
                                             15      .043    .037    .038
                                             20      .036    .031    .029
Brennan & Kane's $\hat{\Phi}(\lambda)$^a     10      .047    .044    .037
                                             15      .031    .027    .026
                                             20      .022    .019    .016
Brennan & Kane's $\hat{\Phi}$^a              10      .057    .051    .042
                                             15      .043    .035    .033
                                             20      .031    .025    .022
Subkoviak's $\hat{p}_o$                      10      .039    .034    .032
                                             15      .038    .031    .029
                                             20      .034    .030    .027
Huynh's $\hat{p}_o$                          10      .032    .028    .026
                                             15      .028    .023    .022
                                             20      .025    .021    .019
Subkoviak's $\hat{\kappa}$                   10      .100    .088    .080
                                             15      .097    .081    .076
                                             20      .091    .077    .068
Huynh's $\hat{\kappa}$                       10      .081    .073    .067
                                             15      .073    .061    .060
                                             20      .067    .057    .052

^aThe mean standard deviations for these coefficients were based
only on cells within the randomly parallel condition.

DISCUSSION

Although all the coefficients in this study, except for Φ(λ),
were derived under the assumption of classic parallelism, they were in
many cases robust to violation of this assumption, i.e., type of
parallelism did not significantly alter the absolute mean bias.
Specifically, for k²(X,Tx) and the p₀ estimates, this variable had no
significant effect when the distributions were either J-shaped or
bimodal. However, type of parallelism, in general, affected the
absolute mean bias when the distributions were either skewed or
normal. These findings can perhaps be explained by examining the item
characteristics which must be present to form each distribution. If
the domain score distribution is either J-shaped or bimodal, the
domain must consist of items having fairly homogeneous p values and
high item intercorrelations. On the other hand, items within a domain
having either a skewed or a normal distribution are more heterogeneous
and have lower item intercorrelations. When items are randomly chosen
to construct alternate forms, some or all of the statistics computed
from one test are more likely to adequately represent the
characteristics of the other form when the item domain is more
homogeneous. In other words, the relationship between randomly
parallel tests derived from a homogeneous, in contrast to a
heterogeneous, item domain more closely resembles that found between
classically parallel tests. In addition, for k²(X,Tx), coefficient
alpha is probably a better estimate of the alternate form reliability
when the domain is homogeneous, leading to more accurate estimation of
k²(X,Tx) for randomly parallel tests having a J or bimodal
distribution. Given these facts, one would expect type of parallelism
to have a greater effect when alternate forms are derived from a
heterogeneous item domain, i.e., from an item domain having either a
skewed or a normal distribution.

Type of parallelism did affect the absolute mean bias of the
kappa estimates, regardless of distribution. However, the effects
were somewhat less pervasive for the J and bimodal distributions.

When parallelism did significantly alter the absolute mean
biases, the random parallelism condition did not always result in the
most bias. For example, except in one case, Subkoviak's coefficients
displayed greater bias in the classic parallelism situation. This
latter result can perhaps be understood by examining Subkoviak's
formula more closely. Specifically, notice that using a regression
estimate of true score causes regression toward the mean which becomes
more severe as KR-20 decreases. When the distribution was either
normal or skewed in this study, KR-20 was likely to be fairly low.
The resultant regression toward the mean may have caused the alternate
form population value within the classically parallel condition to be
severely underestimated, leading to a greater degree of bias for the
classically parallel situation. As predicted, for k²(X,Tx), the
random parallelism condition did produce the most bias. Finally, the
type of parallelism associated with more bias varied for Huynh's
estimates.
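
The regression effect described above can be illustrated with a short
sketch. It assumes the Kelley-type regression estimate of the
proportion-correct true score, r(x/n) + (1 - r)(M/n), with r taken to
be KR-20. The function name, test length, group mean, and reliability
values are hypothetical and are not taken from this study.

```python
def regressed_proportion(x, n_items, group_mean, reliability):
    """Regression (Kelley-type) estimate of proportion-correct true score.

    reliability stands in for KR-20; all inputs here are illustrative.
    """
    observed = x / n_items
    mean_prop = group_mean / n_items
    return reliability * observed + (1.0 - reliability) * mean_prop


def spread(n_items, group_mean, reliability, low_score, high_score):
    """Distance between the estimates for a low and a high observed score."""
    low = regressed_proportion(low_score, n_items, group_mean, reliability)
    high = regressed_proportion(high_score, n_items, group_mean, reliability)
    return high - low


# Hypothetical 10-item test with a group mean of 6 items correct.
for kr20 in (0.9, 0.6, 0.3):
    # As KR-20 drops, estimates for scores of 3 and 9 are pulled together.
    print(kr20, round(spread(10, 6.0, kr20, 3, 9), 3))
```

With these illustrative inputs the gap between the two estimates
shrinks in proportion to KR-20, which is the sense in which low
reliability makes the regression toward the mean more severe.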

Cut-off score had a significant effect on all the coefficients.
For k²(X,Tx) and Φ(λ), the effects were found predominantly within the
random parallelism condition when the distributions were either
skewed or normal. For the most part, these biases tended to increase
as the cut-off approached the mean; the biases were positive when the
cut-off was located near the population mean. Note that as the
cut-off moves close to the mean, the difference between the mean and
the cut-off has less of an impact on the value of these coefficients.
When the cut-off almost equals the mean, the squared-error agreement
coefficients are approximately equal to KR-20 or KR-21. Therefore,
the present results indicate that as the difference between the mean
and the cut-off becomes less influential and the norm-referenced
reliability coefficient accounts more for the magnitude of k²(X,Tx)
and Φ(λ), the bias becomes significantly more positive for randomly
parallel tests having a skewed or a normal distribution. For the J
and bimodal distributions, the bias of k²(X,Tx) and Φ(λ) did not
significantly change as a function of cut-off score. Because of these
distributions' homogeneous item domains, KR-20 is probably a fairly
accurate estimate of the population coefficient used in these
formulas.
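
The dependence of a squared-error coefficient on the distance between
the mean and the cut-off can be sketched numerically. The example
assumes the usual form of Livingston's coefficient,
k² = [r·s² + (M - C)²] / [s² + (M - C)²], where r is the
norm-referenced reliability, s² the observed score variance, M the
mean, and C the cut-off. The function name and all numeric inputs are
hypothetical.

```python
def livingston_k2(reliability, variance, mean, cutoff):
    """Livingston's squared-error coefficient for a given cut-off.

    reliability plays the role of KR-20 (or another norm-referenced
    reliability estimate); the inputs below are illustrative only.
    """
    squared_distance = (mean - cutoff) ** 2
    return (reliability * variance + squared_distance) / (variance + squared_distance)


# Hypothetical test: reliability .60, variance 4.0, mean 6.0.
for cutoff in (9, 8, 7, 6):
    # The coefficient falls toward the reliability as the cut-off nears the mean.
    print(cutoff, round(livingston_k2(0.60, 4.0, 6.0, cutoff), 3))
```

When the cut-off equals the mean, the squared distance vanishes and
the coefficient reduces to the reliability estimate itself, so any
bias in that estimate carries over directly.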

No general statements can be made about the effect of cut-off
score on the threshold agreement coefficients since the results varied
widely. The most consistent finding occurred for Huynh's
coefficients. Specifically, for distributions having their population
mode (mean) near 90%, the bias associated with this extreme cut-off
score was significantly positive. Significant overestimation may occur
in this case because, according to the binomial error model, the
standard deviation around an extreme true score is smaller than that
around non-extreme scores. However, the data in this study do not
conform exactly to the binomial error model and, therefore, scores may
be more variable than what is predicted by this model. Since scores
within these distributions cluster about the 90% cut-off, this
increased variability will have more of an effect in decreasing the
population values, leading to overestimation by Huynh's coefficients.
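
The claim about the binomial error model can be checked with a small
sketch. Under that model the observed score of an examinee with
proportion-correct true score p on an n-item test has standard
deviation sqrt(n·p·(1 - p)). The function name and the test length
used here are illustrative assumptions.

```python
import math


def binomial_error_sd(n_items, true_proportion):
    """Standard deviation of observed scores under the binomial error model."""
    return math.sqrt(n_items * true_proportion * (1.0 - true_proportion))


# Hypothetical 10-item test: compare an extreme true score with mid-range ones.
for p in (0.5, 0.7, 0.9):
    print(p, round(binomial_error_sd(10, p), 3))
```

The standard deviation is largest at p = .5 and shrinks as p moves
toward either extreme, so scores near a 90% cut-off are predicted to
vary less than the data in a study may actually show.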
The hypothesis that the p₀ and kappa estimates would be less
biased for the beta-binomial distributions than for the bimodal and
normal distributions was, generally, not supported. A possible
explanation for this finding is that the J and skewed distributions in
this study did not conform closely enough to members of the
beta-binomial family. In addition, the normal and skewed distributions
may not have been different enough since the normal distribution was
actually somewhat skewed.

SUMMARY AND CONCLUSIONS

Due to the variable results found in this study, no general rules
can be offered for choosing between coefficients falling within each
category (e.g., uncorrected threshold agreement coefficients) of
Figure 2. For each parallelism/distribution/cut-off score combi-
nation, Table 16 indicates the direction of bias produced by each
coefficient. If a coefficient had a mean bias less than .025 in
either direction, the coefficient was considered to be unbiased.
Recommendations about which coefficient to use in each of these cells
can be made and are presented in Table 17. Two criteria were used in
choosing a coefficient in each case:

(1) when the biases were in the same direction, the
coefficient with the least bias was selected;
(2) when the biases were opposite in direction, the
negatively biased coefficient was chosen, unless
the positively biased coefficient was fairly
accurate (i.e., had a bias near zero) or much more
accurate than its competitor.
The latter situation occurred only once, where Huynh's κ had a
positive, but nonsignificant, bias, while Subkoviak's κ had a
significant negative bias.
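
The classification rule and the two selection criteria can be written
out as a short sketch. The .025 limit comes from the text; the
threshold for "near zero" (taken here to be the same .025 limit), the
ratio standing in for "much more accurate", the function names, and the
example biases are hypothetical choices made only for illustration.

```python
UNBIASED_LIMIT = 0.025  # from the text: |mean bias| < .025 counts as unbiased


def bias_direction(mean_bias):
    """Label a mean bias as no bias, overestimation, or underestimation."""
    if abs(mean_bias) < UNBIASED_LIMIT:
        return "No Bias"
    return "Overest." if mean_bias > 0 else "Underest."


def recommend(bias_a, bias_b, name_a="A", name_b="B", much_better_ratio=0.5):
    """Apply the two criteria to a pair of competing coefficients.

    much_better_ratio is a hypothetical stand-in for "much more accurate".
    A list is returned so that equally biased coefficients are both listed.
    """
    if bias_a * bias_b >= 0:  # criterion 1: same direction (or a zero bias)
        if abs(bias_a) == abs(bias_b):
            return [name_a, name_b]
        return [name_a] if abs(bias_a) < abs(bias_b) else [name_b]
    # criterion 2: opposite directions, prefer the negatively biased one
    if bias_a > 0:
        pos, pos_name, neg, neg_name = bias_a, name_a, bias_b, name_b
    else:
        pos, pos_name, neg, neg_name = bias_b, name_b, bias_a, name_a
    near_zero = abs(pos) < UNBIASED_LIMIT
    much_better = abs(pos) <= much_better_ratio * abs(neg)
    return [pos_name] if (near_zero or much_better) else [neg_name]


# Illustrative biases only; these are not values from the appendices.
print(bias_direction(0.031), bias_direction(-0.010))
print(recommend(0.010, -0.060, "Huynh", "Subkoviak"))
```

The exception in the second criterion is what allows a positively
biased coefficient to be chosen when its bias is negligible next to a
clearly negative competitor.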

Several other points about Table 17 should be mentioned. First,
sampling variability was not taken into account in making these
recommendations. There were two reasons for taking this course of
action: (1) the bias of an estimator is more important than its


Table 16.--Direction of Bias of Each Coefficient for Each
Parallelism/Distribution/Cut-off Score Combination

                                            70%
                        Type of
Coefficient             Parallelism  Skewed     J          Bimodal    Normal

Livingston's k²(X,Tx)   Random       No Bias    No Bias    No Bias    Overest.
                        Classic      No Bias    No Bias    Underest.  No Bias

Brennan & Kane's Φ(λ)   Random       Overest.   No Bias    No Bias    Overest.

Brennan & Kane's Φ      Random       Overest.   No Bias    No Bias    Overest.

Subkoviak's p₀          Random       No Bias    No Bias    Underest.  Overest.
                        Classic      No Bias    No Bias    Underest.  Underest.

Huynh's p₀              Random       No Bias    No Bias    Underest.  Overest.
                        Classic      Underest.  No Bias    Underest.  No Bias

Subkoviak's κ           Random       No Bias    Underest.  Underest.  Underest.
                        Classic      Underest.  Underest.  Underest.  Underest.

Huynh's κ               Random       No Bias    No Bias    Underest.  Overest.
                        Classic      Underest.  Underest.  Underest.  Underest.

 


Table 17.--[Recommended coefficients for each
parallelism/distribution/cut-off score combination. The table was
printed in landscape orientation and its entries are not recoverable
from the scan.]

sampling variability in determining the estimator's adequacy; and (2)
Tables 13, 14, and 15 indicate that the differences in stability of
each coefficient within a particular category are not very large.
Second, there were two instances where Subkoviak's and Huynh's
coefficients were equally biased, and, consequently, both were listed.
However, since Huynh's coefficients appeared to be slightly more
stable, one might want to select his estimates. Third, even though Φ
significantly overestimated its parametric value when the distribution
was either skewed or normal, this coefficient is the only available
corrected squared-error formula and was, therefore, recommended in
every situation. Finally, although either k²(X,Tx) or Φ(λ) was
recommended in each case, one must remember that these two
coefficients are not really comparable because they do not estimate
the same population value. Specifically, k²(X,Tx) measures the
reliability associated with a particular test, while Φ(λ) indicates
the reliability of any set of items randomly selected from a domain.
Because of the latter fact, all tests which can possibly be
constructed from an item domain must be classically parallel for the
classic parallelism assumption to be valid. This situation is
unlikely to occur, unless all items within the domain have equal p
values. When this situation does occur, k²(X,Tx) and Φ(λ) are equal,
and one can directly compare the accuracy of the estimates of k²(X,Tx)
and Φ(λ). However, since this study did not contain a domain of items
having equal p values, Φ(λ) could not be computed in the classic
parallelism condition. Therefore, k²(X,Tx) was consistently
recommended as the appropriate uncorrected squared-error coefficient
within the classic parallelism condition.

Finally, mention should also be made of two other methods of
estimating reliability. Subkoviak and Wilcox (1978) and Livingston and
Wingersky (1979) introduced mastery coefficients which measure the
extent of agreement between the observed score and the estimated true
score. The former coefficient uses a threshold loss function.
Livingston and Wingersky's index reflects the size of the
misclassification error but does not use a squared error loss function.
Since reliability is really concerned with accurately estimating an
examinee's true score or true classification, these indices deserve
considerable attention.

 

 

 

 

 

APPENDICES

 

 

 

 

 

 

 

 

 

 

 

 

APPENDIX A1

Mean Bias and Standard Deviation of Livingston's k²(X,Tx) Across
Samples of 25 Examinees

[Landscape table: mean bias and standard deviation by test length and
cut-off score for the skewed, J-shaped, bimodal, and normal
distributions, under randomly parallel and classically parallel
alternate forms. Cell values are not recoverable from the scan.]

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

APPENDIX A2

Mean Bias and Standard Deviation of Livingston's k²(X,Tx) Across
Samples of 35 Examinees

[Landscape table: mean bias and standard deviation by test length and
cut-off score for the skewed, J-shaped, bimodal, and normal
distributions, under randomly parallel and classically parallel
alternate forms. Cell values are not recoverable from the scan.]

 

 

 

 

 

 

 

 

 

 

 

 

 

APPENDIX A3

Mean Bias and Standard Deviation of Livingston's k²(X,Tx) Across
Samples of 50 Examinees

[Landscape table: mean bias and standard deviation by test length and
cut-off score for the skewed, J-shaped, bimodal, and normal
distributions, under randomly parallel and classically parallel
alternate forms. Cell values are not recoverable from the scan.]

APPENDIX A4

Mean Bias and Standard Deviation of Brennan and Kane's
Φ(λ) Across Samples of 25 Examinees

Randomly Parallel
Alternate Forms

Test Cut-off J- Bi-
Length Score Skewed shaped modal Normal
5 u .101 0019 -0025 0015
,/1762 .037 ,/<6§§ .155
7 .019 .000 .005 .078
.036 z/T616 .02u A .110
.001 .002 .007 .030
10 8
//f666 017 .031 z”f00
9 .063 .002 .009 .029
.10 .015 .029 .005
.035 .01 .022 .072
11
//T022 ”T000 /’/:01 z’f101
.054 .008 .020 .019
15 12
.0uu .01 z/f612 .076
.072 .006 .02 .021
1a
//1666 .008 1/7611 /’7622
1“ .028 .01 .02u .153
.011 .005 2/1665 .085
.06 .009 .028 .071
20 16
.025 ,/1606 z/f601 053
18 .09 .007 .020 .0u9
//7632 .006 .007 ””761;

 

 

 

 

 

 


APPENDIX A5

Mean Bias and Standard Deviation of Brennan and Kane's
Φ(λ) Across Samples of 35 Examinees

Randomly Parallel
Alternate Forms

 

 

 

 

Test Cut-off J- Bi-
Length Score Skewed shaped modal Normal
5 u .065 .019 -.028 .006
ﬁ .028 X92 /09
.017 .006 -.00u .088
7
/7626 z/1613 2/7632 «<ﬁiﬁ
.0u7 .006 -.002 .060
10 8
l”:05 t’7612 ,/7630 .08
.084 .005 .002 .005
9
x/T5;6 .01 .0u7 .039
0035 .011 .016 0096
11
46 65 49 43
15 12 .058 .01 .018 .051
.025 .006 .025 .068
.083 .008 .017 .032
1”
«/f631 ”’7005 /’1010 ’/:6?§
.026 .011 .021 .1u7
1H '
2/7606 z/TEEA ”’760; , .063
.058 .01 .025 .083
20 16 ‘
.015 .00u ,/757A .051
18 .091 .009 .022 .055
026 //<66§ ,xfﬁ73 .017

 

 

 

 

 

 


APPENDIX A6

Mean Bias and Standard Deviation of Brennan and Kane's
Φ(λ) Across Samples of 50 Examinees

Randomly Parallel
Alternate Forms

 

 

 

 

Test Cut-off J- Bi-
Length Score Skewed shaped modal Normal
5 u .077 .01“ -.032 .021
.069 . 32 .067 .126
7 .021 .003 -.002 .073
/.0/23 45 /.02 .098
0052 .001 -0003 003”
10 8
2’7655 r’7616 ,z’763 .085
.091 .000 .000 .031
9
ﬂ .%5 .03 437
.03” .008 .017 .089
11
43 .008 4 2 41
15 12 .056 .006 017 .033
.022 .009 .017 .078
.08" .005 .01” .026
14
.032 -m3 4 5 %
1n .026 .009 .022, .161
.008 .%5 %6 %56
20 16 .056 .008 .02“ .082_
./015 .45 /01 .0113
.091 .006 .021 .053
18
/.62 .005 .01 41

 

 

 

 

 

 


APPENDIX A7

Mean Bias and Standard Deviation of Brennan and Kane's
Φ Across Samples of 25 Examinees

Randomly Parallel
Alternate Forms

 

 

 

 

Test Cut-off J- Bi-

Length Score Skewed shaped modal Normal

5 u .101 .023 -.023 .151
,/1138 ./<6§6 .089 ,xfTK8

7 .069 .009 .01 .121
z’fagé .017 ,//763 [/7636

10 8 .069 .009 .01 .121
,/1699 ./1619 ,//f65 ,/1696

9 .069 .009 .01 .121
.093 «I161; ./’755 .096

.08 .002 .029 .128

11

./072 ./009 4 2 47

15 12 .08 .002 .02u .128
”’76;2 .009 .012 [/1661

. .08 .002 .02u .128

1a

‘/’76;2 [/1669 ”7612 »’7669

10 .092 .01 .028 .162
.ou ,/1666 .007 a’76;6

20 16 .092 .01 .028 .162
///761 .006 .007 1’76?6

18 .092 .01 .028 .162
.01: %6 %7 ./076

 

 

 

 

 

 


APPENDIX A8

Mean Bias and Standard Deviation of Brennan and Kane's
Φ Across Samples of 35 Examinees

Randomly Parallel
Alternate Forms

 

 

 

 

Test Cut-off J- Bi-
Length Score Skewed shaped modal Normal
.065 .023 -.029 .11
5 u
2/1768 .027 «’1669 ‘/<696
7 .086 .007 -.001 .128
z’fagé 2/7612 ,/’f66 2/76?2
10 8 .086 .007 -.001 .128
.072 .012 ,z’765 .072
9 0086 0007 -0001 .128
//<6?2 .012 .05 .072
.09 .011 .018 .199
11
‘/’T61 [/1666 "T625 "76;2
.09 .011 .018 .109
15 12
I’TBA .006 1/7625 —’7612
.09 .011 .018 .109
1a
/’T/A 2/1666 .2155; 1’7512
1“ .093 .011 .025 .159
.02u /’7669 ’57611 ”1639
.093 .011 .025 .159
20 16
.029 .009 .019 211669
18 .093 .011 .025 .159
.029 .009 019 2’7669

 

 

 

 

 

 


APPENDIX A9

Mean Bias and Standard Deviation of Brennan and Kane's
Φ Across Samples of 50 Examinees

Randomly Parallel
Alternate Forms

 

 

 

 

Test Cut-off J- Bi-
Length Score Skewed shaped modal Normal
0069 0019 -003“ .1143
5 U
.092 W .061 .13
7 .093 .003 -.002 .116
.035 47 .03 46
10 8 .093 .003 -.002 .116
.035 2’7511 ,,//765 ,/7666
9 .093 .003 -.002 .116
.035 .017 A3 46
.09 .009 .017 .14
11
./032 409 45 476
.09 .009 .017 .1”
15 12
.032 .009 m 476
009 0009 0017 01""
1”
./032 49 % 476
1” .091 .01 .02“ .169
«/’762 2’7665 z”661 z’TEEA
20 16 .091 .01 .024. .169
42 40/5 .01 ./0511
18 0091 001 00211 0169
AZ %5 41 ./0511

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

APPENDIX A10

Mean Bias and Standard Deviation of Marshall's [symbol illegible]
Across Samples of 25 Examinees

[Landscape table: mean bias and standard deviation by test length and
cut-off score for the skewed, J-shaped, bimodal, and normal
distributions, under randomly parallel and classically parallel
alternate forms. Cell values are not recoverable from the scan.]

 

 

 

 

 

 

 

 

 

 

 

 

 

APPENDIX A11

Mean Bias and Standard Deviation of Marshall's [symbol illegible]
Across Samples of 35 Examinees

[Landscape table: mean bias and standard deviation by test length and
cut-off score for the skewed, J-shaped, bimodal, and normal
distributions, under randomly parallel and classically parallel
alternate forms. Cell values are not recoverable from the scan.]

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

APPENDIX A12

Mean Bias and Standard Deviation of Marshall's [symbol illegible]
Across Samples of 50 Examinees

[Landscape table: mean bias and standard deviation by test length and
cut-off score for the skewed, J-shaped, bimodal, and normal
distributions, under randomly parallel and classically parallel
alternate forms. Cell values are not recoverable from the scan.]

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

APPENDIX A13

Mean Bias and Standard Deviation of Subkoviak's [symbol illegible]
Across Samples of 25 Examinees

[Landscape table: mean bias and standard deviation by test length and
cut-off score for the skewed, J-shaped, bimodal, and normal
distributions, under randomly parallel and classically parallel
alternate forms. Cell values are not recoverable from the scan.]

 

 

 

 

 

 

 

 

 

 

 

 

 

APPENDIX A14

Mean Bias and Standard Deviation of Subkoviak's [symbol illegible]
Across Samples of 35 Examinees

[Landscape table: mean bias and standard deviation by test length and
cut-off score for the skewed, J-shaped, bimodal, and normal
distributions, under randomly parallel and classically parallel
alternate forms. Cell values are not recoverable from the scan.]

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

APPENDIX A15

Mean Bias and Standard Deviation of Subkoviak's [symbol illegible]
Across Samples of 50 Examinees

[Landscape table: mean bias and standard deviation by test length and
cut-off score for the skewed, J-shaped, bimodal, and normal
distributions, under randomly parallel and classically parallel
alternate forms. Cell values are not recoverable from the scan.]

 

 

 

 

 

 

 

 

 

 

 

 

 

APPENDIX A16

Mean Bias and Standard Deviation of Huynh's [symbol illegible]
Across Samples of 25 Examinees

[Landscape table: mean bias and standard deviation by test length and
cut-off score for the skewed, J-shaped, bimodal, and normal
distributions, under randomly parallel and classically parallel
alternate forms. Cell values are not recoverable from the scan.]

 

 

 

 

 

 

 

 

APPENDIX A17

Mean Bias and Standard Deviation of Huynh's [symbol illegible] Across Samples of [number illegible] Examinees

[Table values are not recoverable from the scan. Columns: Normal, Bi-Modal, U-Shaped, and Skewed score distributions under Randomly Parallel and Classically Parallel Alternate Forms. Rows: Test Length and Cut-Off Score.]

APPENDIX A18

Mean Bias and Standard Deviation of Huynh's [symbol illegible] Across Samples of [number illegible] Examinees

[Table values are not recoverable from the scan. Columns: Normal, Bi-Modal, U-Shaped, and Skewed score distributions under Randomly Parallel and Classically Parallel Alternate Forms. Rows: Test Length and Cut-Off Score.]

APPENDIX A19

Mean Bias and Standard Deviation of Subkoviak's [symbol illegible] Across Samples of [number illegible] Examinees

[Table values are not recoverable from the scan. Columns: Normal, Bi-Modal, U-Shaped, and Skewed score distributions under Randomly Parallel and Classically Parallel Alternate Forms. Rows: Test Length and Cut-Off Score.]

APPENDIX A20

Mean Bias and Standard Deviation of Subkoviak's [symbol illegible] Across Samples of [number illegible] Examinees

[Table values are not recoverable from the scan. Columns: Normal, Bi-Modal, U-Shaped, and Skewed score distributions under Randomly Parallel and Classically Parallel Alternate Forms. Rows: Test Length and Cut-Off Score.]

APPENDIX A21

Mean Bias and Standard Deviation of Subkoviak's [symbol illegible] Across Samples of [number illegible] Examinees

[Table values are not recoverable from the scan. Columns: Normal, Bi-Modal, U-Shaped, and Skewed score distributions under Randomly Parallel and Classically Parallel Alternate Forms. Rows: Test Length and Cut-Off Score.]

APPENDIX A22

Mean Bias and Standard Deviation of Huynh's [symbol illegible] Across Samples of [number illegible] Examinees

[Table values are not recoverable from the scan. Columns: Normal, Bi-Modal, U-Shaped, and Skewed score distributions under Randomly Parallel and Classically Parallel Alternate Forms. Rows: Test Length and Cut-Off Score.]

APPENDIX A23

Mean Bias and Standard Deviation of Huynh's [symbol illegible] Across Samples of [number illegible] Examinees

[Table values are not recoverable from the scan. Columns: Normal, Bi-Modal, U-Shaped, and Skewed score distributions under Randomly Parallel and Classically Parallel Alternate Forms. Rows: Test Length and Cut-Off Score.]

APPENDIX A24

Mean Bias and Standard Deviation of Huynh's [symbol illegible] Across Samples of [number illegible] Examinees

[Table values are not recoverable from the scan. Columns: Normal, Bi-Modal, U-Shaped, and Skewed score distributions under Randomly Parallel and Classically Parallel Alternate Forms. Rows: Test Length and Cut-Off Score.]

LIST OF REFERENCES

Algina, J., & Noe, M. J. A study of the accuracy of Subkoviak's
single-administration estimate of the coefficient of
agreement using two true-score estimates. Journal of
Educational Measurement, 1978, 15, 101-110.

Anastasi, A. Psychological testing. New York: Macmillan, 1976.

Berk, R. A. Item analysis. In R. Berk (Ed.), Criterion-referenced
testing: State of the art. Baltimore: The Johns Hopkins
University Press, 1980.

Berk, R. A. A consumers' guide to criterion-referenced test
reliability. Journal of Educational Measurement, 1980, 17,
323-349.

Block, J. H. Student learning and the setting of mastery performance
standards. Educational Horizons, 1972, 50, 183-190.

Brennan, R. L. Psychometric methods for criterion-referenced tests.
Unpublished manuscript, March 1979. (Available from
Department of Education, SUNY at Stony Brook, Stony Brook,
New York 11790.)

Brennan, R. L. KR-21 and lower limits of an index of dependability
for mastery tests (ACT Technical Bulletin No. 27). Iowa
City, Iowa: American College Testing Program, December 1977.

Brennan, R. L. Extensions of generalizability theory to domain-
referenced testing (ACT Technical Bulletin No. 30). Iowa
City, Iowa: American College Testing Program, June 1978.

Brennan, R. L. Some applications of generalizability theory to the
dependability of domain-referenced tests (ACT Technical
Bulletin No. 32). Iowa City, Iowa: American College Testing
Program, April 1979.

Brennan, R. L., & Kane, M. T. An index of dependability for mastery
tests. Journal of Educational Measurement, 1977, 14,
277-289. (a)

Brennan, R. L., & Kane, M. T. Signal/noise ratios for domain-
referenced tests. Psychometrika, 1977, 42, 609-625. (b)


Brennan, R. L., & Lockwood, R. E. A comparison of two cutting score
procedures using generalizability theory (ACT Technical
Bulletin No. 33). Iowa City, Iowa: American College Testing
Program, April 1979.

Buck, L. S. Use of criterion-referenced tests in personnel
selection: A summary status report (Technical Memorandum
75-6). Washington, D.C.: United States Civil Service
Commission, December 1975.

Carver, R. P. Special problems in measuring change with psychometric
devices. In Evaluative Research: Strategies and Methods.
Washington, D.C.: American Institutes for Research, 1970.

Cohen, J. A. A coefficient of agreement for nominal scales.
Educational and Psychological Measurement, 1960, 20, 37-46.

Cronbach, L. J., Gleser, G. C., Nanda, H., & Rajaratnam, N. The
dependability of behavioral measurements: Theory of
generalizability for scores and profiles. New York: Wiley,
1972.

Downing, S. M., & Mehrens, W. A. Six single-administration
reliability coefficients for criterion-referenced tests: A
comparative study. Paper presented at the annual meeting of
the American Educational Research Association, Toronto, 1978.

Ebel, R. L. Criterion-referenced measurements: Limitations. School
Review, 1971, 79, 282-288.

Eignor, D. R., & Hambleton, R. K. Relationship of test length to
criterion-referenced test reliability and validity (Report
No. 86). Amherst: University of Massachusetts (School of
Education), Laboratory of Psychometric and Evaluative
Research, 1979.

Glaser, R. Instructional technology and the measurement of learning
outcomes: Some questions. American Psychologist, 1963, 18,
519-521.

Glaser, R., & Klaus, D. J. Proficiency measurement: Assessing human
performance. In R. Gagne (Ed.), Psychological principles in
system development. New York: Holt, 1962.

Glaser, R., & Nitko, A. J. Measurement in learning and instruction.
In R. L. Thorndike (Ed.), Educational measurement (2nd
ed.). Washington, D.C.: American Council on Education, 1971.

Glass, G. V. Standards and criteria. Journal of Educational
Measurement, 1978, 15, 237-261.

Goldstein, I. L. Training program: Development and evaluation.
Monterey, California: Brooks/Cole, 1974.

 


Goodman, L. A., & Kruskal, W. H. Measures of association for cross
classifications. Journal of the American Statistical
Association, 1954, 49, 732-764.

Gross, A. L., & Shulman, V. The applicability of the beta binomial
model for criterion referenced testing. Journal of
Educational Measurement, 1980, 17, 195-201.

 

Hambleton, R. K., & Eignor, D. R. Criterion-referenced test
development and validation methods. Training program
presented at the annual meeting of the American Educational
Research Association, San Francisco, April 1979.

Hambleton, R. K., & Novick, M. R. Toward an integration of theory and
method for criterion-referenced tests. Journal of
Educational Measurement, 1973, 10, 159-170.

Harris, C. W. An interpretation of Livingston's reliability
coefficient for criterion-referenced tests. Journal of
Educational Measurement, 1972, 9, 27-29.

Harris, C. W. An index of efficiency for fixed-length mastery
tests. Paper presented at the annual meeting of the American
Educational Research Association, Chicago, April 1972.

Harris, C. W. Note on the variances and covariances of three error
types. Journal of Educational Measurement, 1973, 10, 49-50.

Harris, M. L., & Stewart, D. M. Application of classical strategies
to criterion-referenced test construction. A paper presented
at the annual meeting of the American Educational Research
Association, New York, February 1971.

Huynh, H. On consistency of decisions in criterion-referenced
testing. Journal of Educational Measurement, 1976, 13,
253-264.

Huynh, H. Reliability of multiple classifications. Psychometrika,

Huynh, H., & Saunders, J. C., III. Accuracy of two procedures for
estimating reliability of mastery tests. Paper presented at
the annual conference of the Eastern Educational Research
Association, Kiawah Island, South Carolina, February 1979.

Ivens, S. H. An investigation of item analysis, reliability, and
validity in relation to criterion-referenced tests.
Unpublished doctoral dissertation, Florida State University,
August 1970.

Kane, M. T., & Brennan, R. L. Agreement coefficients as indices of
dependability for domain-referenced tests (ACT Technical
Bulletin No. 28). Iowa City, Iowa: American College Testing
Program, December 1977.

 


Keats, J. A., & Lord, F. M. A theoretical distribution for mental
test scores. Psychometrika, 1962, 27, 59-72.

 

Klein, S. P., & Kosecoff, J. Issues and procedures in the development
of criterion-referenced tests (ERIC/TM Report 26).
Princeton: ERIC Clearinghouse on Tests, Measurement, and
Evaluation, September 1973.

Koslowsky, M., & Bailit, H. A measure of reliability using
qualitative data. Educational and Psychological Measurement,
1975, 35, 843-846.

Livingston, S. A. A reply to Harris' "An interpretation of
Livingston's reliability coefficient for criterion-referenced
tests". Journal of Educational Measurement, 1972, 9, 31. (a)

Livingston, S. A. Criterion-referenced applications of classical test
theory. Journal of Educational Measurement, 1972, 9, 13-26.
(b)

Livingston, S. A. Reply to Shavelson, Block, and Ravitch's
"Criterion-referenced testing: Comments on reliability".
Journal of Educational Measurement, 1972, 9, 139-140. (c)

Livingston, S. A. A note on the interpretation of the criterion-
referenced reliability coefficient. Journal of Educational
Measurement, 1973, 10, 311.

Livingston, S. A., & Wingersky, M. S. Assessing the reliability of
tests used to make pass/fail decisions. Journal of
Educational Measurement, 1979, 16, 247-260.

Lord, F. M., & Novick, M. R. Statistical theories of mental test
scores. Reading, Massachusetts: Addison-Wesley, 1968.

Lovett, H. T. Criterion-referenced reliability estimated by ANOVA.
Educational and Psychological Measurement, 1977, 37, 21-29.

Magnusson, D. Test theory. Reading, Massachusetts: Addison-Wesley,
1967.

Marshall, J. L. The mean split-half coefficient of agreement and its
relation to other single administration test indices: A
study based on simulated data (Technical Report No. 350).
Madison: University of Wisconsin, Research and Development
Center for Cognitive Learning, June 1976.

Marshall, J. L. Possible mathematical relationships of true and
obtained scores and their implications for mastery testing.
Paper presented at the annual meeting of the Midwest
Educational Research Association, Bloomingdale, Illinois,
1978.

Marshall, J. L. Personal communication, 1980.

Marshall, J. L., & Haertel, E. H. A single-administration reliability
index for criterion-referenced tests: The mean split-half
coefficient of agreement. Paper presented at the annual
meeting of the American Educational Research Association,
Washington, D.C., March-April 1975.

Marshall, J. L., & Serlin, R. C. Characteristics of four mastery test
reliability indices: Influence of distribution shape and
cutting score. Paper presented at the annual meeting of the
American Educational Research Association, San Francisco,
April 1979.

Mehrens, W. A., & Ebel, R. L. Some comments on criterion-referenced
and norm-referenced achievement tests. Measurement in
Education, Winter 1979, 10, 1-8.

Michigan Department of Education. Technical Report: Michigan
Educational Assessment Program. Lansing: Michigan Department of
Education, Research, Evaluation and Assessment Services, June
1977.

Millman, J. Criterion-referenced measurement. In W. J. Popham (Ed.),
Evaluation in education: Current applications. Berkeley,
California: McCutchan, 1974.

Millman, J., & Popham, W. J. The issue of item and test variance for
criterion-referenced tests: A clarification. Journal of
Educational Measurement, 1974, 11, 137-138.

Novick, M. R., & Lewis, C. Prescribing test length for criterion-
referenced measurement. In C. W. Harris, M. C. Alkin, & W.
J. Popham (Eds.), Problems in criterion-referenced
measurement. CSE monograph series in evaluation, No. 3. Los
Angeles: Center for the Study of Evaluation, University of
California, 1974.

Peng, C.-Y. J. An investigation of Huynh's normal approximation
procedure for estimating criterion-referenced reliability.
Paper presented at the annual meeting of the American
Educational Research Association, San Francisco, April 1979.

Peng, C.-Y. J., & Subkoviak, M. J. A note on Huynh's normal
approximation procedure for estimating criterion-referenced
reliability. Journal of Educational Measurement, 1980, 17,
359-368.

Popham, W. J., & Husek, T. R. Implications of criterion-referenced
measurement. Journal of Educational Measurement, 1969, 6,
1-9.

 


Schmitt, N., & Schmitt, K. Differences in reliability estimates for
objective-referenced tests. Unpublished manuscript, 1977.

Shavelson, R. J., Block, J. H., & Ravitch, M. M. Criterion-referenced
testing: Comments on reliability. Journal of Educational
Measurement, 1972, 9, 133-137.

Subkoviak, M. J. Estimating reliability from a single administration
of a criterion-referenced test. Journal of Educational

Subkoviak, M. J. Further comments on reliability for mastery tests.
Unpublished manuscript, University of Wisconsin, 1977.

Subkoviak, M. J. Empirical investigation of procedures for estimating
reliability for mastery tests. Journal of Educational
Measurement, 1978, 15, 111-116. (a)

Subkoviak, M. J. The reliability of mastery classification
decisions. Paper presented at the first annual Johns Hopkins
University National Symposium on Educational Research,
Washington, D.C., 1978. (b)

Swaminathan, H., Hambleton, R. K., & Algina, J. Reliability of
criterion-referenced tests: A decision theoretic
formulation. Journal of Educational Measurement, 1974, 11,

Wardrop, J. L., Anderson, T. R., Hively, W., Hastings, C. M.,
Anderson, R. I., & Muller, K. E. A framework for analyzing the
inference structure of educational achievement tests.
Journal of Educational Measurement, 1982, 19, 1-18.

Woodson, M. I. C. E. The issue of item and test variance for
criterion-referenced tests. Journal of Educational
Measurement, 1974, 11, 63-64. (a)

Woodson, M. I. C. E. The issue of item and test variance for
criterion-referenced tests: A reply. Journal of Educational
Measurement, 1974, 11, 139-140. (b)