AN EVALUATION OF THE SEQUENTIAL
METHOD OF PSYCHOLOGICAL
TESTING
Thesis for the Degree of Ed. D.
MICHIGAN STATE UNIVERSITY
John James Paterson
1962
This is to certify that the
thesis entitled
AN EVALUATION OF THE SEQUENTIAL METHOD
OF
PSYCHOLOGICAL TESTING
presented by
John James Paterson
has been accepted towards fulfillment
of the requirements for
Major professor
Date June 15, 1962
LIBRARY
Michigan State
University
ABSTRACT
AN EVALUATION OF THE SEQUENTIAL METHOD OF
PSYCHOLOGICAL TESTING
by John J. Paterson
In the sequential method of psychological testing the
examinees are directed to subsequent items on the basis of
their responses to prior items. No examinee responds to all
the items of a sequential test, and any given examinee
might complete the test by responding to any of several com-
binations of items. Scores on the sequential test reflect
the difficulty of items correctly answered, not the number
correct.
The evaluation did not involve an actual population
of individuals, but used probability models and hypothetical
populations. The probability of passing a given item in a
test was calculated from the ability level of the individual,
the difficulty of the item, and the precision of the item.
(Precision may be computed from the item-total biserial
correlation.) The probability of passing a sequence of
items was determined for each of fifteen ability categories
by multiplying together the probabilities of passing or
failing a sequence of six items. Sixty-four different se-
quences were calculated for each ability category.
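The probability arithmetic described above can be sketched in code. This is a minimal modern illustration, not the author's computational procedure: the normal-ogive form of the item characteristic curve and all function and parameter names are assumptions.

```python
from math import erf, sqrt

def p_pass(ability, difficulty, precision):
    """Probability that an examinee at `ability` passes an item of the
    given `difficulty`, using a normal-ogive item characteristic curve
    whose standard deviation is `precision`. All quantities are in
    standard-score units."""
    z = (ability - difficulty) / precision
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

def p_sequence(ability, items, passed):
    """Probability of one particular pass/fail pattern through a sequence
    of items: the product of the pass (or fail) probabilities at each
    stage. `items` is a list of (difficulty, precision) pairs and
    `passed` a matching list of booleans."""
    prob = 1.0
    for (d, s), p in zip(items, passed):
        q = p_pass(ability, d, s)
        prob *= q if p else (1.0 - q)
    return prob
```

With six items and two outcomes per item there are 2^6 = 64 possible pass/fail patterns, which is where the sixty-four sequences per ability category come from.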
The problem involved was the comparison of the sequen-
tial model with the traditional cumulative model (in which
all items were at the 50 per cent level of difficulty) to
determine how well individuals at different ability levels
were classified by the tests. The parameters of the sequen-
tial test (difficulty and precision) and the effects of
errors in estimating these parameters were examined in
relation to the resulting classification of individuals.
One sequential test model was constructed with an
item-total biserial correlation of .75 and item difficulties
such that the sum of the squared deviations of the individ-
ual's ability level from the mean ability level of the group
into which the individual was classified would be a minimum.
Even though individuals in each ability category were kept
separate from individuals in other categories, individuals
in different categories took the same difficulty item if
the calculated difficulties were less than .20 standard
deviation units apart. A rectangular distribution of
ability was assumed in these calculations.
Both normal and U-shaped distributions of ability
were used as input for the above sequential and cumulative
test models to determine how well the results classified
individuals of different ability levels. It was concluded
that regardless of the distribution of ability used as
input, the individuals in the extreme ability categories had
significantly less variance of scores in the sequential test.
At middle ability levels the sequential test did have slightly
lower variance of test scores than the cumulative. For the
top scores the sequential test had less variance of ability
level than the cumulative.
The second and fifth items in the sequential test were
each separately changed in difficulty and precision. The
resulting number of people at each score, the mean ability
level of individuals at each score, the variance of scores
for top and middle ability level individuals and the variance
of ability level scores for the top and middle scoring
individuals were all insignificantly changed. The sequential
test was not sensitive to errors in estimating the precision
and difficulty of the items.
When precision of items in the sequential tests was
varied, tests consisting of higher precision items (with dif-
ficulties appropriate for that precision level) had less
variance of scores for ability level categories and less
variance of ability level categories for top and middle
scoring individuals.
It was concluded that more difficult items are needed
to distinguish among more able students; less difficult
items among the less able. If extreme scores having low
variance of ability level are desired, the item difficulties
should be regressed toward the mean from those difficulties
which give the best discrimination between individuals of
similar ability level.
AN EVALUATION OF THE SEQUENTIAL METHOD
OF
PSYCHOLOGICAL TESTING

by
John James Paterson

A THESIS

Submitted to Michigan State University
in partial fulfillment of the requirements
for the degree of
DOCTOR OF EDUCATION

College of Education

1962
ACKNOWLEDGMENTS
The writer wishes to express his appreciation
for the guidance given by Dr. David R. Krathwohl in
the preparation of this thesis and to the Bureau of
Educational Research for arranging the time necessary
for completion of the research.
TABLE OF CONTENTS
CHAPTER
I. DESCRIPTION OF THE PROBLEM.
Description of the Sequential Test Model
Starting Point.
Stopping Point.
Scoring
Pattern of Items
Directions to Testee.
A Diagram of a Sequential Test Used
in This Study
Need for Test Improvement . .
Maximally Efficient Use of the Items
Selected .
Control of the Score Distribution
Meaning of a Score
Rationale for the Sequential Item Model
Maximally Efficient Use of Items.
Control of the Score Distribution
Meaning of a Score
Selection of the Sequential Procedure
Hypotheses . .
Effect of the Type of Ability
Distribution .
Effect of Item Precision and Difficulty
for the Sequential Test . .
Effect of Errors in Estimating Para-
meters . .
Limitations of the Study
Best Cumulative Test.
Distribution of Scores
Ability Distributions
Test Parameters . . .
Test Construction Procedures
Test Presentation Procedures and
Effects.
Overview of the Remainder of the
Dissertation
II. REVIEW OF LITERATURE.
Maximally Efficient Use of Items
Selected.
Control of the Score Distribution.
Meaning and Use of Score Produced.
Sequential Testing Procedures
III. PROCEDURES
Test Model Construction . .
Effect of Shape of Distribution of
Ability. . .
Effect of Normal Distribution .
Effect of U-Shaped Distribution .
Effect of Ability Distributions for
Additional Sequential Tests .
Item Precision and Difficulty for the
Sequential Test .
Errors in Sequential Test Parameter
Estimates
General Comparisons
Summary of Procedures and Hypotheses.
IV. ANALYSES AND RESULTS
Sequential Test Construction
First Item Decision
Second Item Decision
Third Item Decision
Fourth Item Decision
Fifth Item Decision
Sixth Item Decision .
Input Distribution Effects
Results from the Normal Distribution.
Results from the U-Shaped Distribu-
tion.
Item Precision and Difficulty for the
Sequential Test .
Variance of Scores.
Variance of Ability Levels
Errors in the Sequential Test Parameter
Estimates .
Errors in Estimating Difficulty
Errors in Estimating Precision.
General Comparisons
V. CONCLUSIONS
Sequential Testing and Testing Problems.
Efficiency of Items.
Control of the Score Distribution.
Meaning of a Score.
PAGE
69
81
91
91
95
97
100
102
105
107
112
114
122
122
123
123
125
125
126
128
131
132
138
142
142
144
148
148
153
156
164
164
164
167
160
CHAPTER PAGE
Sequential Testing Hypotheses. . . . . 171
Effect of Ability Distribution . . . 171
Effect of Precision and Difficulty . . 173
Effect of Error in Parameters. . . . 175
VI. SUMMARY AND RECOMMENDATIONS . . . . . . 176
Summary . . . . . . . . . . . . 176
Recommendations . . . . . . . . . 183
BIBLIOGRAPHY . . . . . . . . . . . . . . 186
APPENDIX A . . . . . . . . . . . . . . . 192
APPENDIX B . . . . . . . . . . . . . . . 198
LIST OF TABLES
TABLE                                                        PAGE
1. Analysis of Means and Variances of Normalized
Scores for Category 8 Individuals When Normal
Distribution of Ability is Input into Sequential
and Cumulative Test Models . . . . . . . . 134
2. Analysis of Means and Variances of Normalized
Scores for Category 14 and 15 Individuals When
Normal Distribution of Ability is Input into
Sequential and Cumulative Test Models . . . 134
3. Analysis of Means and Variances of Ability
Level Scores for the Top 8.4 Per Cent of the
Score Distribution When Normal Distribution of
Ability is Input into Sequential and Cumulative
Tests . . . . . . . . . . . . . . . 134
4. Differences Between Normalized "T" Scores for
Adjacent Top Ability Levels for Normal and U-
Shaped Input . . . . . . . . . . . . . 136
5. Differences Between Ability Level Scores for
Adjacent Top Scores for Cumulative Test Model
for Normal and U-Shaped Input . . . . . . 136
6. Differences Between Ability Level Scores for
Adjacent Top Scores for Sequential Test Model
for Normal and U-Shaped Input . . . . . . 137
7. Analysis of Means and Variances of Normalized
Scores for Category 13 Individuals When a U-
Shaped Distribution of Ability is Input into
Sequential and Cumulative Test Models . . . 139
8. Analysis of Means and Variances of Normalized
Scores for Category 15 Individuals When a U-
Shaped Distribution of Ability is Input into
Sequential and Cumulative Test Models . . . 139
9. Analysis of Means and Variances of Ability
Level Scores for the Top 13.5 Per Cent of the
Score Distribution When a U-Shaped Distribution
of Ability is Input into Sequential and Cumula-
tive Tests . . . . . . . . . . . . . . 139
TABLE                                                        PAGE
10. Analysis of the Variance of Scores for Individ-
uals at Specified Ability Levels for Five Tests
of Different Precision . . . . . . . . . 143
11. Analysis of the Variance of Ability Level
Scores for Individuals at Specified Score
Levels for Five Tests of Different Precision . 143
12. The Means and Variances of Rank Scores Assigned
to Each Ability Level by Five Tests of
Different Precision . . . . . . . . . . 146
13. The Discrimination Indices Between Adjacent
Ability Levels for the Input of a Normal Distri-
bution of Ability into Tests of Different
Precision . . . . . . . . . . . . . . 147
14. Distribution of Individuals by Two Tests--One
Test With Second Item Difficulties Farther from
50 Per Cent Level Than the "Error Free" Test . 150
15. Distribution of Individuals by Two Tests--One
Test With Second Item Difficulties Nearer to
50 Per Cent Level Than the "Error Free" Test . 150
16. Analysis of the Variance of Ability Level
Scores for Individuals at Specified Score
Levels for One "Error Free" Test and Two "Error
in Difficulties of Fifth Items" Tests . . . 152
17. Analysis of the Variance of Rank Scores for
Individuals at Specified Ability Levels for
One "Error Free" Test and Two "Error in Diffi-
culties of Fifth Items" Tests . . . . . . 152
18. Distribution and Mean Ability Level Scores for
Top Score Values for Three Tests With Different
Difficulties and Normal Distribution of Ability
Input . . . . . . . . . . . . . . . 157
19. Distribution and Mean Ability Level Scores for
Top Score Values for Three Tests With Different
Difficulties and Rectangular Distribution of
Ability Input . . . . . . . . . . . . 158
20. Distribution and Mean Ability Level Scores for
Top Score Values for Three Tests With Different
Difficulties and U-Shaped Distribution of Ability
Input . . . . . . . . . . . . . . . 159
TABLE PAGE
21. Distribution and Mean Ability Level Scores for
Top Scores of Tests With Different Levels of
Precision and With an Input of Normal Distribu-
tion of Ability. . . . . . . . . . . 161
22. Distribution and Mean Ability Level Scores for
Top Scores of Tests With Different Patterns of
Items Encountered and With an Input of a Normal
Distribution of Ability . . . . . . . . . 163
23. Per Cent Passing Items of the Different
Sequential Tests Constructed . . . . . . . 193
24. Mean Normalized "T" Scores for Each Ability
Level for Cumulative and ”Least Squares”
Sequential Tests . . . . . . . . . . . 194
25. Distribution and Mean Ability Level Scores for
Cumulative Test With the Input of Different
Distributions of Ability. . . . . . . . . 195
26. Distribution and Mean Ability Levels for Top
Scores on "Least Squares” Sequential Test With
the Input of Different Distributions of Ability . 195
27. Distribution and Mean Ability Levels for Top
Scores With Difficulties of Certain Items Changed
in a Sequential Test With an Input of Normal
Distribution of Ability . . . . . . . . . 196
28. Distribution and Mean Ability Levels for Top
Scores With Precision of Items Changed in a
Sequential Test With an Input of Normal Distri-
bution of Ability . . . . . . . . . . . 197
LIST OF FIGURES
FIGURE                                                       PAGE
1. Graphic Representation of the "Least Squares"
Sequential Test Model
2. Three Distributions of Ability . . . . . . . 99
3. Mean Ability Level of Groups Separated Out by
Sequential Test and Difficulties of Items Used . 124
CHAPTER I

DESCRIPTION OF THE PROBLEM

The usual test consists of a series of questions or
items scored on the basis of the correct or incorrect answers
given by an examinee; the score is the number of correct
responses. This type of test is based on a cumulative model.
One alternative to the cumulative model is the sequential
model. In tests based upon the sequential model, examinees
are directed to subsequent items on the basis of their re-
sponses to prior ones. While no examinee responds to all of
the items of a sequential test, any given examinee might
complete the test by responding to any of a number of com-
binations of items. Scores on sequential tests are based
upon the nature of the items to which correct responses are
given and not merely upon the number of correct responses.

The basic problem attacked in this study is the com-
parison of a sequential model with the traditional cumulative
test model. In the sequential model, the individual who
passes an item is directed
to a more difficult item; if he fails an item he is directed
to an easier item. If the item is very precise, the individ-
ual who passes it is given a much more difficult item; and
if the item is not very precise, the individual is given an
item closer to the difficulty level of the item just answered.
The opposite is true for failing an item. The score is
directly related to the difficulty of the item to which the
individual is directed at the final stage of testing.
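As a sketch, the branching rule just described might look like this in code. The text does not give a numerical relation between precision and step size, so the inverse rule and the `scale` constant below are illustrative assumptions:

```python
def next_difficulty(current, passed, precision, scale=1.0):
    """Pick the difficulty of the next item after a pass or a fail.
    A very precise item (small `precision`) justifies a large step away
    from the current difficulty; an imprecise item justifies a next item
    close in difficulty to the one just answered."""
    step = scale / (1.0 + precision)  # step shrinks as precision worsens
    return current + step if passed else current - step
```

For example, passing a precise item moves the examinee further up the difficulty scale than passing an imprecise one, and failing moves him down by the same amount.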
In addition to the comparison of testing methods, the
sequential test is examined for its strengths and weaknesses.
Methods of improving the sequential model are suggested from
the results so that even if the present procedure is not
better than the cumulative test, future sequential procedures
may be improved.
The present evaluation of the sequential method of
psychological testing consists of (1) a description of the
features of the sequential method as compared with the usual
cumulative test; (2) a description of some of the problems
encountered in the use of the cumulative test and how these
problems are handled by the sequential model; (3) a rationale
for the sequential solution; and (4) the formulation of
hypotheses as to the behavior of the cumulative and sequential
test models in regard to specific problems. Following the
hypotheses are (5) the limitations of the study and (6) an
overview of the remainder of the dissertation. To aid the
reader a few of the more frequently used terms in this dis-
sertation are explained in Appendix B.
I. DESCRIPTION OF THE SEQUENTIAL TEST MODEL
In any testing situation certain decisions must be made:
(1) the individual must be told where to start, (2) the
decision must be made when to stop testing, (3) the final
score must be determined, (4) the characteristics of each
succeeding item must be stipulated, and (5) the testee must
be informed as to where and how he should proceed. In the
cumulative test the character of these decisions is obvious.
Because they are unusual in the sequential item test, these
decision points will be described in some detail.
Starting Point
Depending on the purpose of the test and what one
therefore wishes to emphasize, the starting point may be at
any level of difficulty. For instance, one may start with an
easy item that most individuals will be able to pass and with
which the individual would feel comfortable, or one may start
with an item at the middle of the score distribution with no
consideration as to the individuals who may be taking the
test. The sequential test model developed in this paper has
the individual take as his first item one that would be con—
sidered at the fifty per cent level of difficulty for the
group of which he is a part. The reason for this choice is
explained in Chapter III, Section 1. The present discussion
must, of necessity, ignore the psychological effects which
need to be empirically determined.
Stopping Point
Criteria for deciding when to stop are also determined
by the purpose of the test. If doing the best job possible
in the time allowed is paramount, then everyone is given
the same number of items knowing that the extremes will be
better classified than the middle ability levels. (Note
that the criterion measure need not be a measure of ability
but could be an attitude or interest. However, in this dis—
sertation the criterion will be referred to as an "ability.")
If time is flexible and there is a prescribed degree of
accuracy for each score, then fewer items are used for the
extreme and more for the middle
ability levels. If the rapid classification of extreme
ability level individuals is desired, then one may stop
testing when it can be determined that the individual is
probably not at some middle ability level. In the sequential
model in this paper all people will take six items.
Scoring
Reasons for choosing one system of scoring over another
depend upon whether the score is to discriminate one ability
group from another, to discriminate among the individuals in
a group, or to describe the response pattern of the individual.
If one wishes to discriminate one ability group from another,
one would probably assign a score reflecting the difficulty
of the final item. If one wishes to discriminate among in-
dividuals, then the score may represent the number of people
in, for example, one hundred that the individual would rank
above in the population. If the score is to represent a
response pattern, it may be an estimate of the number of
items the testee could have answered correctly if he does
answer an item of a given difficulty, or it may identify the
precise pattern of correctly and incorrectly answered items.
The sequential test model in this paper assigns the individual,
as his score, the difficulty of the item to which he is
directed at the final stage of testing.
Pattern of Items
The problem in the sequential test is to select that
sequence of items which will yield the information needed
to assign the individual a score. At any step in the test
the decision as to the succeeding item to be taken may depend
upon (1) the number of preceding items one has answered
correctly, (2) the pattern of preceding items, or (3) the
difficulty and precision of the immediately preceding item.
This sequential model uses the difficulty and precision of
all preceding items to determine the next item.
Difficulty of the item for this model is measured in
terms of standard score units for a theoretically normal
group. An item that fifty per cent of the theoretical group
would pass is designated as ”0.00.“ The precision of the
item is essentially a measure of the validity of the item.
The measure of precision, σb, may be defined as the standard
deviation of the item characteristic curve. (It is also re-
lated to the measure of precision "h" used in psychophysics:
h = 1/(√2·σb); and, as Lord indicates, σb is identical with
his "bi".1)
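In later item-response terms the relation just described can be written out explicitly. The normal-ogive form below is a reconstruction consistent with Lord's model rather than a formula quoted from the thesis, and the symbol θ for ability level is an assumption:

```latex
P(\text{pass} \mid \theta) = \Phi\!\left(\frac{\theta - b}{\sigma_b}\right),
\qquad h = \frac{1}{\sqrt{2}\,\sigma_b},
```

where b is the item difficulty in standard-score units and Φ is the cumulative normal distribution function; a smaller σb (larger h) means a sharper, more discriminating item.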
Directions to Testee
The testee may be told how well he performed on any
given item, may be told what is right or wrong with his per-
formance, or may be simply directed to another item. Any
combination of the above may be used at different stages in
the test.
Individuals may be directed to items which are taken
by those who perform differently, or they may be directed to
an item unique to their pattern of response. Pattern of
response may be determined from correctness or incorrectness
only, or each alternative to any item may designate a different
sequence. In this sequential test, pattern was determined
from only correctness or incorrectness of items, and more
than one possible sequence of responses could lead to the
same item.
1Frederic M. Lord, A Theory of Test Scores, Psychometric
Monograph No. 7 (Chicago: University of Chicago Press, 1952),
p. 7.
Many methods of giving the necessary information to
the testee are available. In the empirical tests that have
been built by Krathwohl and Paterson, the succeeding item
that the individual should attempt is disclosed to the in-
dividual when he erases the opaque covering under the letter
that has been selected as the answer to the question at hand.
The final erasure disclosed a letter used to indicate a score
rather than the number of the next item.2 The testee must
answer each item as he comes to it as he receives no direc-
tions if he does not answer. Other response techniques which
could be used are tabs, envelopes within envelopes, sliding
masks, and scrambled books.
A Diagram of a Sequential Test Used in This Study
Figure 1 is a diagram of one of the sequential tests
used in this study. It is the one constructed by the "least
squares" method which is described later. The pattern shown
is only one of many possible sequential patterns.
Difficulty of items.--Items are represented by circles,
the ordinate position of which represents the difficulty of
the item. The closer the item is to the top of the page,
the more difficult it is. Difficulty is expressed in standard
score units, i.e., an item that fifty per cent of the normative
2Unpublished material developed in the Bureau of Edu-
cational Research, Michigan State University, East Lansing,
Michigan, 1956-1959.
[Fig. 1.--Graphic Representation of the "Least Squares" Sequential
Test Model. Items are plotted by difficulty in standard scores,
from -1.00 to 1.00, against the six stages of the test.]
group would answer correctly is labelled "0.00". An item
that 84 per cent of the normative group would answer cor-
rectly is labelled ”-1.00”.
Sequence of items.--The sequence is represented by the
abscissa value for the item. The first item of the test is
at the left-hand side; the sixth item at the right of the
diagram. The individual confronts one item at each "stage"
of the test.
Size of step.--The size of the step or the increase or
decrease in difficulty from the item at one stage to the
item at the next stage is represented by the difference in
ordinate positions of the items as can be seen in Figure 1.
There would be a large increase in the difficulty of the
second item if one were to correctly answer the first item.
There would be less difference between the easiest item at
stage four and the easiest item at stage five.
Route taken.--Lines slanting upward designate that
those who are considered to have passed an item at the
preceding stage should proceed to a more difficult item
for the next stage. Lines slanting downward designate that
the individuals are considered to have failed the item at
the previous stage and should proceed to a less difficult
item at the next stage. It may occur that passing a less
difficult item will lead the individual to a more difficult
item for the next stage than he would have encountered by
failing a more difficult item. In this case the lines be-
tween items will cross. (This case is not illustrated in
Figure 1.) The other alternative not yet mentioned is that
individuals passing a less difficult item or failing a more
difficult item may be led to the same difficulty of item
at the succeeding stage.
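The routing just described can be simulated minimally. Fixed step sizes and a deterministic pass rule (an examinee passes whenever his ability is at least the item's difficulty) are simplifying assumptions rather than features of the "least squares" model; the sketch only shows how the difficulty reached at the final stage serves as the score:

```python
def run_sequential_test(ability, n_stages=6, step=0.5):
    """Route an examinee through six stages: a pass moves him to a
    harder item, a fail to an easier one, and the score is the
    difficulty of the item to which he is directed at the final
    stage of testing."""
    difficulty = 0.0  # start at the fifty per cent level for the group
    for _ in range(n_stages):
        if ability >= difficulty:   # pass: harder item next
            difficulty += step
        else:                       # fail: easier item next
            difficulty -= step
    return difficulty
```

In this sketch an examinee above every item climbs steadily to the hardest reachable difficulty, one below every item descends to the easiest, and one in the middle oscillates near his own level.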
II. NEED FOR TEST IMPROVEMENT
In order to lay the background as to why the sequential
test is worth considering, one should examine what problems
have been encountered in the use of the cumulative test.
Present test procedures seem to have encountered three im-
portant problems related to: (1) utilization of items to
operate most efficiently with the group taking the test,
(2) controlling the score distribution to arrive at a useful
scale, and (3) production of a score with a precise meaning.
Maximally Efficient Use of the Items Selected
Once one has decided upon a purpose, then one can
solve the problem of the most efficient selection of items
either completely empirically, or theoretically in terms of
the effect of varying certain item characteristics. The
approach in this paper is the theoretical one. If one uses
this theoretical approach, one of the problems is that of
utilizing the most precise items available in a pool. The
cumulative test cannot always use all of the more precise
items.
In the cumulative test, if the score is the number of
correct responses and if all of the items are of equal dif-
ficulty, then a test with less precise items would give a
better measure of the scale of ability than a test with more
precise items.3
The above phenomenon has been called the ”attenuation
paradox." Violation of any one or a combination of the
following assumptions has been given as an explanation for
the attenuation paradox: (1) scores are normally distributed,
(2) ability is normally distributed, (3) the regression of
scores on ability is linear, (4) measurement produces an
interval scale of ability, and (5) response distribution is
homoscedastic. There is evidence to support the contention
that violation of any one of these could be the reason for
the lack of a monotonic relationship between item reliability
(precision) and the validity of scores in the usual testing
situation with the cumulative test.
One method of using the most precise items and increasing
test validity is to use a spread of item difficulties as sug-
gested by Brogden.4 However, this does not seem to be a
3Ledyard R. Tucker, "Maximum Validity of a Test with
Equivalent Items," Psychometrika, 11:1-13; March, 1946.
4Hubert E. Brogden, "Variation in Test Validity with
Variation in the Distribution of Item Difficulties, Number of
Items, and Degree of their Intercorrelation," Psychometrika,
11:197-214; December, 1946.
completely satisfactory solution because (1) there is no
scheme to determine the appropriate spread and (2) the most
extreme difficulties cannot be efficiently used any time
the majority of the individuals taking the item guess at the
answer.5 There should be some procedure which would allow
use of precise items no matter what their difficulty level.
If items are to be efficiently used in the discrimination
of a group into two parts, the items should be at the 50 per
cent level of difficulty for the hypothetical group the
median ability level of which is at the point where the
discrimination is desired.6 This means that if discrimina-
tions are desired among a few high ability individuals then
difficult items should be used. The usual cumulative test
cannot efficiently use such items.
5Paul E. Meehl and Albert Rosen, ”Antecedent Probability
and the Efficiency of Psychometric Signs, Patterns, or Cutting
Scores," Psychological Bulletin, 52:194-216; May, 1955.
6Brogden, op. cit.; Lee J. Cronbach and Willard G. War-
rington, "Efficiency of Multiple-Choice Tests as a Function
of Spread of Item Difficulties," Psychometrika, 17:127-147,
June, 1952; Frederick B. Davis, "The Selection of Test Items
According to Difficulty Level," American Psychologist, 4:243,
July, 1949; Harold Gulliksen, "The Relation of Item Difficulty
and Inter-item Correlation to Test Variance and Reliability,"
Psychometrika, 10:79-91, June, 1945; Lloyd G. Humphreys,
"The Normal Curve and the Attenuation Paradox in Test Theory,"
Psychological Bulletin, 53:472-476, November, 1956; D. N.
Lawley, "On Problems Connected with Item Selection and Test
Construction," Proceedings of the Royal Society of Edinburgh,
61 (Section A, Part III):273-287, 1942-1943; Jane Loevinger,
"The Attenuation Paradox in Test Theory," Psychological
Bulletin, 51:493-504, September, 1954; Frederic M. Lord,
"Some Perspectives on 'The Attenuation Paradox in Test Theory',"
Psychological Bulletin, 52:505-510, November, 1955; Frederic
Control of the Score Distribution
The problem of score distribution is not only to assign
a certain number of individuals to a given score, but to
assign only like individuals to that score. The particular
type of distribution which is desired depends upon the pur-
pose for which the test is designed. A normal distribution
is assumed in most statistical computations and interpre—
tations. A rectangular distribution would give the best set
of rankings in that people are spread evenly over all the
scores. A bimodal distribution may be desired to classify
individuals into accept or reject categories. Other than
differences in the use of scores, factors which influence the
score distribution are the distribution of ability levels of
those taking the test, the item precision, and the difficulty
of the items. A test able to produce any type of score
distribution desired, irrespective of the distribution of
ability level of those taking the test and irrespective of
the precision or difficulty of items available would have
considerable utility.
M. Lord, A Theory of Test Scores; M. W. Richardson, "The
Relation Between the Difficulty and the Differential Validity
of a Test," Psychometrika, 1:33-49, June, 1936; Thelma G.
Thurstone, "The Difficulty of a Test and Its Diagnostic
Value," Journal of Educational Psychology, 23:335-343, May,
1932; Ledyard R. Tucker, op. cit.; and David A. Walker,
"Answer-Pattern and Score-Scatter in Tests and Examinations,"
British Journal of Psychology, 30:248-260, January, 1940.
Meaning of a Score
The problem in assigning a meaning to a score is that
the conventional cumulative score is typically a conglomer—
ation which may represent the ability level of the individ—
ual, the rank of the individual, the pattern of response,
or any combination of these. It is not possible to clearly
represent the ability level of the individual with the usual
cumulative test. While it is possible to just rank individuals
or to just indicate the pattern of response with the
cumulative test, this is not usually done. (In indicating
the pattern of response the score is assigned to the sequence
of items passed not to the number of items passed.) It may
be useful to examine each of these possible elements in turn.
The ability level of the individual cannot be deter-
mined by knowing that he passed a difficult item in a
cumulative test, because all people must take each item and
difficult items are often passed by chance as the majority
of the group must guess at these items. This clouds any
interpretation of the number of correctly answered items as
a measure of performance. To get a better measure of the
ability level of the individual from the score, White and
Saltz have argued that the items should be scaled as to difficulty
so that one knows which set of items a person has
answered correctly if he knows the total number answered
correctly.7 The usual cumulative test score does not permit
one to infer which items the individual has passed. The
score in the type of test suggested by White and Saltz would
probably be used to represent the level of subject matter
learned rather than how the individual ranked with others.
In addition to the infrequent use of the above solution,
the suggestion does not solve the problem of the majority
of individuals guessing the answer to difficult items.
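The White and Saltz suggestion can be sketched in a few lines of modern code. Under a perfect difficulty scaling, the total score alone identifies the set of items answered correctly: the n easiest items. (The item names below are hypothetical and serve only as an illustration.)

```python
def items_passed(total_correct, items_easy_to_hard):
    """Under a perfect difficulty scaling, a total score of n implies
    that exactly the n easiest items were answered correctly."""
    return set(items_easy_to_hard[:total_correct])

# Hypothetical items ordered from easiest to hardest.
items = ["add", "subtract", "multiply", "divide", "factor"]
print(items_passed(3, items))  # the three easiest items
```

Note that the inference collapses as soon as the scaling is imperfect, which is the point made above about guessing on difficult items.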
To rank individuals in a normal distribution of ability
so they are spread evenly throughout the score range, the
test must make finer discriminations of ability at the middle
ability range than it does for the extremes. Thus the test
designed to rank individuals does not have a score scale
which has the same relationship to the ability scale at
the middle as at the extremes. Rarely is this relationship
of scores to ability level reported. The cumulative test
often compromises between using scores which rank individuals
best and scores which tend to be normally distributed (as
assumed in many statistical computations). The cumulative
test may do either of the above alternatives well, but the
decision made should be explicit and communicated to the
test user. The decision should be to use the test score which
permits one to infer rank (if this is what is desired), not
7Benjamin W. White and Eli Saltz, "Measurement of Repro-
ducibility," Psychological Bulletin, 54:81-99, March, 1957.
to contaminate the meaning of a score by forcing the scores
into a distribution just to create a higher correlation co-
efficient with normally distributed measures.
Another use of a score is to indicate the pattern of
response. Cronbach has concluded that one should be as
concerned with heterogeneity in content as in difficulty.
Since the ”level of difficulty" meaning for a score has
been discussed above, the ”heterogeneity with respect to
content" meaning is considered here. For example, one bit
of information is given when an individual is placed above
the mean in pitch discrimination. With another set of items,
the individual might be placed relative to the mean in visual
acuity. The two items (with heterogeneity with respect to
content) together place him in one of four categories. (If
the second item had been a further measure of pitch, then
he would have been placed in one of three categories with
respect to pitch). The use of items with heterogeneity in
respect to content thus seems useful, but one must remember
that to recover all four categories the test cannot be scored
by the number correct. Too often the items in cumulative
tests are heterogeneous with respect to content and the number
correct is used for the score. This cumulative scoring pro-
cedure permits the precise meaning of a score from a test
with perfectly precise items to be inferred only when the
individual possesses all of the characteristics above the
specified levels or possesses none of the characteristics at
or above the specified level. These cumulative scores are
even more difficult to interpret when the items are not
perfectly precise.
Rarely is any method of scoring other than the number
correct used, and, if the level of ability in any character-
istic is desired in conjunction with the pattern of charac-
teristics, the problems discussed above for reflection of
ability are added to lack of knowledge about which charac-
teristics the individual possesses.
III. RATIONALE FOR THE SEQUENTIAL ITEM MODEL
The sequential item model is now examined to show why
this model is expected to (1) give maximally efficient use
of items, (2) control the score distribution, and (3) yield
a score with a precise meaning. In addition, the rationale
for using one of the several sequential procedures is
presented.
Maximally Efficient Use of Items
The sequential test is expected to make optimal use
of all items, irrespective of difficulty, because this test
model provides that each item be at the fifty per cent level
of difficulty for the group taking the item. At each suc-
ceeding stage in testing the original group is divided into
progressively more homogeneous ability groups and the dif-
ficulties of items are matched to the average abilities of
each group taking the item. Thus the easiest items are taken
by the lowest ability groups and the hardest items by the
highest ability ones.
This procedure accords with the works of Brogden, Cron-
bach, Davis, Gulliksen, Humphreys, Lawley, Loevinger, Lord,
Richardson, Thurstone, Tucker, and Walker which indicate
that if one wishes maximum discrimination of a group into
two groups, then all items should be at the 50 per cent level
of difficulty for a hypothetical group the median of which
is at the point where the discrimination is desired.8 This
means that one needs difficult items to best discriminate
within high ability groups and easy items to discriminate
within low ability groups. The sequential procedure allows
the difficulty of the item to be suited to the ability level
of the group answering the item.
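The matching rule can be sketched as follows: for each subgroup produced by the branching, an item is chosen whose difficulty sits at the subgroup's median ability, so that about half of that subgroup is expected to pass it. (The ability values below are invented for illustration; the thesis's own difficulty computations appear in Chapter III.)

```python
import statistics

def pick_item_difficulty(subgroup_abilities):
    """Return the difficulty at which roughly half of the subgroup
    should pass: the median ability of the subgroup."""
    return statistics.median(subgroup_abilities)

low_group = [-1.8, -1.2, -0.9, -0.6]   # abilities after failing the first item
high_group = [0.6, 0.9, 1.2, 1.8]      # abilities after passing the first item
print(pick_item_difficulty(low_group), pick_item_difficulty(high_group))
```

Each branch thus receives an easier or harder item than the stage before, which is why difficult items are never wasted on low ability examinees.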
The second reason for assuming that the sequential test
will operate better than a cumulative test is that since dif—
ferent ability level individuals do not take the same items,
the number of low ability people passing a difficult item
by chance will not exceed the number of high ability people
passing the item due to their ability. As has been pointed
out by Meehl, in the cumulative test an item with poor dis-
criminating power is better than one with greater discrimin-
ating power if fifty per cent of the people are expected to
8See footnote 6.
pass the first, and only 10 per cent to pass the second.9
Control of the Score Distribution
The problem of control of score distribution is to
assign like people the same score, and to yield a score
distribution which will best serve the purpose of the test.
Since the distribution of scores depends upon the distri-
bution of ability of those taking the test and upon the
difficulty and precision of the items, Lord and Brogden
have each stated that for a normal distribution of ability
and with items of equal difficulty and usual precision, the
cumulative test cannot produce normally distributed scores.10
Humphreys has suggested that the answer is to spread the
item difficulties.11 He gives no method to show how such
a spread of difficulties is determined. Another answer is
the sequential process developed in this paper. It is
assumed that the sequential procedure will more adequately
control the score distribution because the items must operate
well for only a small group of people, not for all of the
individuals taking the examination. After precise items
are used to validly split a given group, the resulting groups
may be further divided into whatever size is desired by
using additional items of appropriate difficulty. Any number
of subgroups may be combined if desired to produce appropriate
9Meehl and Rosen, op. cit.
10Lord, A Theory of Test Scores, op. cit., p. 11; and
Brogden, op. cit., p. 207.
11Humphreys, op. cit.
distributions or to combine like individuals. These methods
of control should allow maximum control of the score distri-
bution.
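As a sketch (the subgroup counts below are invented), combining adjacent subgroups is simply a matter of summing their frequencies into the coarser score categories the test user wants:

```python
def combine_adjacent(counts, groupings):
    """Merge adjacent fine-grained ability subgroups (given as index
    ranges) into coarser score categories of the desired shape."""
    return [sum(counts[lo:hi]) for lo, hi in groupings]

# Eight fine subgroups collapsed into four score categories.
fine_counts = [2, 5, 9, 14, 14, 9, 5, 2]
print(combine_adjacent(fine_counts, [(0, 2), (2, 4), (4, 6), (6, 8)]))
# [7, 23, 23, 7]
```

Because only adjacent subgroups are merged, like individuals remain assigned to like scores while the shape of the distribution is brought under control.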
Meaning of a Score
A sequential test score may represent the ability level
of the individual, the rank of the individual, or the pattern
of response, but it does not represent more than one of
these at the same time. The ability level of the individual
is represented by the score when the score is the difficulty
of the final item. The rank of the individual is represented
by the rank of difficulty of the final item. (The rank scale
is an equal interval scale on ability when equal discriminations
are made at all ability levels--in this case rank of
difficulty and difficulty represent the same factor--the
ability level of the individual. If unequal discriminations
at different ability levels are made the scales represent
different information.) The pattern of response of the
individual would be represented by a score assigned to the
sequence of items taken in the sequential test. Even though
every individual may pass the same number of items, the se-
quence of items taken by an individual may be specified and
assigned a score different from that of an individual who
passed the same number of items but via a different route.
Different routes (sequences) will represent different items
being passed even though the number of items passed is iden-
tical.
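Such pattern scoring can be sketched by encoding each pass/fail route as a distinct number, so that two examinees who pass the same count of items by different routes still receive different scores. (The routes and the binary encoding below are illustrative, not the thesis's scoring rule.)

```python
def route_score(route):
    """Encode a pass/fail route as a distinct integer score; the binary
    code here is one arbitrary way to keep routes distinguishable."""
    return sum(bit << i for i, bit in enumerate(route))

route_a = (1, 0, 1, 1, 0, 1)  # four items passed
route_b = (0, 1, 1, 1, 1, 0)  # also four items passed, different route
print(sum(route_a) == sum(route_b), route_score(route_a), route_score(route_b))
```

The number correct is identical for the two routes, yet the scores differ, which is exactly the information a cumulative total discards.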
Since the sequential test has several scoring procedures
each yielding a different but precise score meaning, the
sequential score is more interpretable than the cumulative
test score which is typically a conglomerate of all of these
scoring procedures. In addition to the precision of meaning,
the different scoring procedures allow great versatility in
the use of the test.
Selection of the Sequential Procedure
The type of sequential procedure used depends upon the
purpose of the test: (1) rapid classification of extreme
ability individuals, (2) reaching a prescribed degree of
accuracy for each score, or (3) doing the best job possible
in the time allowed. In the present case the decision was
made to do the best possible job with six items. The reasons
for accepting this decision and the reasons for rejecting the
other decisions are outlined briefly.
The rapid classification of individuals may be thought
of as either classification into such categories as accept,
reject, and continue testing--or classification into score
categories which would more closely represent the results of
the more traditional scoring procedures. The classification
into the three categories closely resembles the procedure
developed by Wald for industry where the concern was to pre-
dict the number of faulty objects in the population. A
random sample of the population was used at each stage.
In the Wald procedure two sets of values are computed:
the one set is such that after each sample if results are
lower (e.g., in number of correct items) than a specified
number, then one may classify the population (or individual)
as rejected with a stated probability of error; and the other
set of values is such that after each sample, if results are
higher than a specified number, then one may classify the
population (or individual) as accepted with a stated
probability of error.12
Fiske and Jones have advocated that the sequential pro-
cedure as outlined by Wald be used only when the problem in-
volves the choice between two possible parameter values
which can be specified on a priori, but not arbitrary
grounds.13
To classify people into additional categories, Cowden
modified the Wald procedure. He assumed that the fewer items
one needed to meet the criteria for classification into either
the accept or reject categories, the farther the individual
was from the specified level. He thus created five cate—
gories with the extreme categories being classified very
rapidly with few items.
The second sequential procedure suggested above--that
is, classifying until a specific degree of accuracy has been
reached--has not yet been investigated. Exploration of this
12Abraham Wald, Sequential Analysis (New York: John
Wiley and Sons, 1947).
13Donald W. Fiske and Lyle V. Jones, "Sequential Analysis
in Psychological Research,” Psychological Bulletin, 51:264—275,
May, 1954.
procedure was rejected because it was felt that this procedure
might be more fruitfully explored after there was more ade-
quate understanding of the interrelationships of the
variables involved in the sequential procedure developed in
this paper.
Whereas in the industrial system of sequential testing
the model assumes a random sample of ability at each level,
this is not the best procedure for obtaining information
about the ability level of an individual. Except in selec-
tion situations, the purpose is to determine the level of
ability the individual possesses rather than whether the
individual is above or below a given ability level. In the
sequential procedure developed in this paper, a random sample
of the individual's behavior is not used; there is rather an
attempt to classify individuals into as many ability cate-
gories as can be adequately differentiated. There has been
no mathematical model developed for the above procedure and
the apparent alternative of developing one did not seem
fruitful at this time. An empirical study of the problem did
not seem fruitful because neither the ability level of individuals,
the precision of the items, nor the difficulty of the
items can be determined exactly. The best alternative seemed
to be that of creating exact data and then creating a model
which would use this data in a manner resembling the actual
situation.
Preliminary work with the sequential procedure had
used a probability model that had been empirically checked
with actual data and which had been programmed for the
electronic computer.14 It was thus decided to take advantage
of the computer program for this study. The program used
six items and permitted calculation for any sequence
possible where items were used to make dichotomous decisions.
IV. HYPOTHESES
The problems of testing are best described according
to the type of decisions that need to be made; however,
the investigation of these problems is best classified
according to the variables that are changed. Changes in
any variable, such as the type of ability distribution of
those taking the examination, may affect one or more of
these problems.
From the rationale developed in the previous section,
one can deduce the effects these variables should have on
efficiency, control of score distribution, and type of score
produced. The rationale will explain the effect of the
variables when used with the six-item cumulative test with all
items at the fifty per cent level of difficulty as well as
when used with the sequential model. The one exception to
this statement is that Lawley's work would indicate that
14Unpublished material developed in the Bureau of Edu-
cational Research, Michigan State University, East Lansing,
Michigan, 1956-1959.
precise scores (scores which have small variance of ability
level for individuals assigned the score) are created for
only a single group by using items quite removed from the
ability level of those individuals whom one wishes to
precisely classify. For example, if we wished to have the
extreme scores precisely defined then we would use items at
the fifty per cent level of difficulty. The hypotheses on
precision of score are derived from the above conclusion
of Lawley. The score distribution examined in this study
is the one actually produced although it is clear that scores
could be combined to yield shapes of distributions different
from the one initially produced. The score meaning that
is examined here is that of reflection of the criterion
ability scale.
The general hypotheses arising out of the rationale will
be described here. The operational hypotheses that are tested
are stated in Chapter III. There are (1) a set of hypotheses
concerned with the effect of the type of ability distribution
on both the six-item cumulative model and the six-item
sequential test model; (2) a set of hypotheses concerned with
the effect of precision and difficulty on the output distri-
bution of the sequential test model; and (3) a set of
hypotheses concerned with the effect of the errors in estim-
ating the parameter values on the output.
Effect of the Type of Ability Distribution
The effect of type of ability distribution on maximally
efficient use of items may be examined by determining the
variance of scores which are assigned to a given ability
level, or by examining the variance of ability levels assigned
to a score. "Discrimination among ability levels" shall be
used to designate whether different ability levels are
assigned different scores, and “precision of scores" shall
be used to indicate whether all individuals at that score
are of approximately the same ability level. Another method
of determining the effect of type of ability distribution is
to determine discrimination among people. (This procedure
involves decisions as to both control of score distribution
and meaning of the score produced.) Discrimination among
people is a measure of the ability of the test to rank
individuals according to ability. This type of discrimina-
tion is not considered in the following hypotheses.
As the sequential test being considered here is one de-
signed to discriminate among ability levels, it should work
quite efficiently for all distributions with respect to the
separation of the ability levels and the reflection of the
actual ability distribution in the score distribution. As
will be shown in Chapter II in the review of Lawley's work,
the cumulative test should have a greater precision of scores
for extreme scores, but should be equal to the sequential in
its ability to accurately discriminate among the ability
levels of individuals only at the middle ability levels.
These expectations are examined under conditions where two
different distributions are input--normal and U-shaped.
Normal distribution.--(1) The cumulative and sequential
test models should have equal ability to classify individuals
of mean ability level. This hypothesis follows from the
fact that middle ability people will take 50 per cent level
of difficulty items in the cumulative test, and should take
items near the 50 per cent level of difficulty in the sequen-
tial test. If the sequential does not operate efficiently,
the cumulative test will have the more discriminating scores.
(2) The sequential test model should more accurately
classify the individuals at the extremes of the ability scale
than should the cumulative test model. This is based upon the
rationale that the sequential test can use difficult items
because it discriminates among high ability individuals (as
these items are at the 50 per cent level of difficulty for
these high ability individuals). The test item does not have
to discriminate between low and high ability individuals as
only high ability individuals will take the item.
(3) The cumulative test model should have more precise
scores at the extremes of ability than the sequential test
model. This follows from the work of Lawley which showed that
the variance of ability levels for individuals assigned to
high scores would be low if the items were easy for these
individuals.
(4) The scores for the cumulative test model should
represent finer ability units in the middle than at the
extremes while the sequential test model scores should
reflect the ability level scale. The best discriminations
among ability levels should be made by using items at 50 per
cent level of difficulty for the hypothetical group the
median ability of which is at the point where the discrimin—
ation is desired. For the cumulative test the best discrim-
inations should be at the 50 per cent level of ability;
whereas, in the sequential test items should discriminate
quite equally over the entire range of ability.
U-shaped distribution.--(1) The sequential test model
should more accurately classify the individuals of category
13 (see "Ideal T Score" in Table 24) than the cumulative
test model. Category 13 individuals are the focus of consid-
eration because in a U-shaped distribution few people are at
the mean and the question becomes how well one can classify
individuals who exist in larger number and are not at the
extreme. Category 13 represents this mean value for those
individuals in the upper half of the distribution of ability.
The reason that the sequential should more accurately classify
these people is that the items are more appropriate for their
level of ability than 50 per cent level of difficulty items
used in the cumulative.
(2) The sequential test model should more accurately
classify the individuals at the extremes of the ability
distribution than the cumulative test model. The reason for
these expected results is again that items are more appropri-
ate for the individuals, and individuals taking the items
have a smaller variance in ability than those taking the
cumulative items.
(3) The cumulative test model should have more precise
scores at the extremes than the sequential test model.
Again this follows from Lawley's work.
(4) The sequential test model should have equal score
discriminations for all groups including the mean group,
whereas the cumulative test model should have finer score
discriminations for middle ability levels than for the extreme
ability group. This follows from the wide distribution of
item difficulties used in the sequential as compared to the
cumulative tests. Items discriminate best only at one
ability level and should be used only with individuals close
to that ability level.
Effect of Item Precision and Difficulty for
the Sequential Test
The relationship of item precision and difficulty to
output characteristics must be examined together as change
in precision results in change of the appropriate difficulty
levels in the manner described in Chapter III. There are
five levels of precision used: rbis = .79, .75, .71, .60,
and .45. Since the ability distribution also affects score
distribution, a normal distribution of ability is used as
this is the type of distribution most likely to occur in the
practical situation.
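The abstract notes that item precision may be computed from the item-total biserial correlation. One common normal-ogive conversion divides the correlation by the standard deviation of its residual; this formula is an assumption for illustration, not necessarily the thesis's own computation.

```python
import math

def precision_from_biserial(r_bis):
    """A common normal-ogive conversion, a = r / sqrt(1 - r^2);
    assumed here for illustration, not taken from the thesis."""
    return r_bis / math.sqrt(1.0 - r_bis ** 2)

# The five biserial levels used in this study.
for r in (0.79, 0.75, 0.71, 0.60, 0.45):
    print(round(precision_from_biserial(r), 2))
```

Under this conversion the five biserial levels span roughly a two-and-a-half-fold range of precision, which is why the difficulty spreads appropriate to each level differ so markedly.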
(1) The variance of scores for a given ability level
should be less with the test using the most precise items.
The value for the precision of an item indicates how effec-
tively the item differentiates individuals of one ability
from those in the next closest ability level. If the item
is precise then each item can make a different distinction
in ability rather than more accurately making the distinction
that should have been made by a prior item.
(2) The test consisting of the most precise items
should have more equal discrimination between adjacent
ability levels than will the less precise test. If the
ability of an item to discriminate among ability levels is
dependent upon the difficulty level of the item, then the
more precise test which has a wider range of difficulties
should discriminate at all levels while the less precise test
which has a smaller range of difficulties should discriminate
well among middle ability individuals where difficulties are
appropriate. The less precise test should not discriminate
as well among extreme ability individuals where difficulties
are not as appropriate.
Effect of Errors in Estimating Parameters
The usefulness of the model for practical purposes de-
pends upon the sensitivity of the test design to the use of
an item which only approximates the precision and dif-
ficulty level which would be called for by the "ideal" model.
If the values need not be very accurately determined before
use can be made of the sequential test model, one is more
likely to use the model. Preliminary studies have indicated
that the sequential test will probably be more sensitive to
precision estimates than to difficulty estimates. The
effect of errors of parameter estimates is the same effect
as is involved in the use of items which have parameter
values other than those required by the test.
As is noted in Chapter III, Section 1, each succeeding
item in a sequential test is selected in such a way as to
maximize discrimination based on data from the effects of
previous items. The effect of using a more precise item
than called for should be that the next item would not be
difficult enough or easy enough for maximum discrimination.
The effect of using an item too easy should be to increase
the precision of score for the upper group, but to decrease
the discrimination among ability levels.
Since the effect of errors made in early stages is
either corrected or magnified by the effect of later items,
and since the effect of errors made in later stages has no
chance to be corrected or magnified, one would expect dif-
ferences in the effect of errors at early and late stages.
The hypotheses made as to effect of errors at these different
stages are as follows:
(1) Errors in difficulty at an early stage should not
have any serious effects as there would be a wide range of
ability and the item would operate well for some of that
range.
(2) Errors in difficulty at the final stages should
increase the variance of ability levels assigned to one of
the two subgroups into which the total group would be
separated, but should not lower the variance of scores
assigned to the ability levels.
(3) Errors in estimates of the precision of the item
should be more serious in the initial stages where wide
separations in difficulty level of the next item would be
used.
(4) Errors in the estimates of the precision of the
items should make little difference at the final stages as
the next item would be appropriate.
If the sequential testing procedure is robust in that
errors in estimating parameters do not seem to greatly affect
type of output, then it would be possible to design the test
with parameter values determined from one sample of a popula-
tion and use this same test in different situations. (The
value used for the precision of the item is dependent upon
the spread of ability in the sample used to determine the
precision value. If the spread of ability is great in con—
trast to item sensitivity, one has a precise item. If the
spread of ability is narrowed, the same item would be consid-
ered a less precise item.)
V. LIMITATIONS OF THE STUDY
The three major contributions of this study are that
it: (1) discusses the problems of the cumulative test and
shows how the sequential model attempts a solution to each
of these; (2) provides a model that may be used in construc-
tion of any sequential test; and (3) presents a rationale
for the sequential test model which, when tested, should
allow the construction of additional sequential tests. There
are, however, many problems that are not examined. Six of
these are listed and discussed because the background material
gives suggestions as to the probable answers to these problems
also. These are: (1) the best possible cumulative test, (2)
the score distributions desired for the cumulative and
sequential models, (3) the types of ability distributions
that may be present in the usual situation, (4) likely test
parameters for usual test items, (5) commercial test construc-
tion procedures, and (6) test presentation procedures and the
psychological effects of the sequential model.
Best Cumulative Test
The work of Brogden and Humphreys indicates that the
best cumulative test with precise items is one with a spread
of difficulties.15 The exact relationship between spread of
15Brogden, op. cit.; and Humphreys, op. cit.
difficulties and precision to yield maximum validity (measured
by correlation with the ability distribution) is not known,
but Cronbach and Warrington indicate that for a cumulative
test of a given length, σᵢ² + σₑ² will have a preferred
value.16 (The term σᵢ is the standard deviation of the
spread of item difficulties and σₑ is the measure of precision
which is the same as the one used in this paper.)
The sequential test models are not compared to the best
possible cumulative model, but the use of items all at the
50 per cent level of difficulty creates a test that is more
than sufficient for most uses for most levels of precision.17
The purpose of the cumulative test model in this dissertation
is to put the sequential test model material into perspective.
Distribution of Scores
If the purpose of testing is selection, then a test
need only produce two scores, one for the individual who is
selected and the other for the one rejected. In this situation
the sequential model developed here would require modification
both in method of scoring and in number of items taken by
individuals. The previously discussed sequential model devel-
oped by Wald, involving a variable number of items taken by
16Cronbach and Warrington, op. cit.
17Ibid.
individuals, is probably the optimal solution. The problem
of test construction thus is no longer that of determining
the difficulty of the item, but rather the number of items
needed to make the most rapid classification. There is no
score distribution as such, only accept, reject and continue
testing categories of individuals.
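The Wald-type variable-length procedure described above may be sketched as follows. The pass probabilities p_hi and p_lo and the error rates are hypothetical values chosen for illustration, not parameters from the dissertation.

```python
import math

def sprt_classify(responses, p_hi=0.7, p_lo=0.3, alpha=0.05, beta=0.05):
    """Wald-style sequential classification: accumulate the log likelihood
    ratio of 'high ability' (pass probability p_hi) against 'low ability'
    (p_lo) over a variable number of items, stopping as soon as the ratio
    crosses an accept or reject boundary."""
    upper = math.log((1 - beta) / alpha)   # accept 'high ability'
    lower = math.log(beta / (1 - alpha))   # accept 'low ability'
    llr = 0.0
    for n, correct in enumerate(responses, 1):
        if correct:
            llr += math.log(p_hi / p_lo)
        else:
            llr += math.log((1 - p_hi) / (1 - p_lo))
        if llr >= upper:
            return ("accept", n)
        if llr <= lower:
            return ("reject", n)
    return ("continue testing", len(responses))
```

An examinee who answers every item correctly is accepted after only four items here, while an alternating pass-fail record never leaves the "continue testing" category; this is the sense in which the number of items taken, not their difficulty, carries the classification.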
The cumulative test used to differentiate two groups
would be one with all the items at the level of difficulty
appropriate for the ability level at which one wishes to make
the decision. A test of this nature would have a score
distribution which would be platykurtic, rectangular, or
bimodal depending upon the precision of the items in the test.
The test with most precise items would have a bimodal score
distribution.
If one desired to rank individuals by the scores from
the test, one would make fine discriminations in ability for
those ability levels where there were many people. In this
way the individuals would be assigned scores which would be
rectangularly distributed. This can be accomplished by
use of a cumulative test which has either fairly precise items
at the 50 per cent level of difficulty or a spread of item
difficulties for less precise items. For the sequential test,
there would be more items included at the difficulty level
appropriate for the discriminations that are desired.
The construction of either a sequential or a cumulative
test which has the score distribution discussed above is
outside the scope of this dissertation. Further research is
needed to determine the items for a sequential test which
would have a rectangular distribution with the input of a
normal ability distribution.
Ability Distributions
Lord has stated that perhaps test constructors should
not consider ability as normally distributed.19 It is
possible that a bimodal distribution of ability is common
in that there are many individuals who perform adequately
and many individuals who perform inadequately with a large
gap between these two performance groups. If this is true,
the sequential test model should operate well for these
distributions, as it should operate well with any type of
distribution. Aberrations in its operation would show up
most clearly when the test model is tested against a U-
shaped distribution of ability. In Chapter IV the results
are reported for testing the model against the U-shaped and
normal distributions. These results indicate how the sequen-
tial test scores may be interpreted when used with different
ability distributions. However, no rationale is developed
to indicate what the results should be and, therefore, the
interpretation of scores across ability levels depends upon
a rationale developed post facto, not upon the rationale
tested in the study.
19Lord, A Theory of Test Scores, op. cit.
Test Parameters
The effect of the number of items has not been examined.
The six-item test was used because the probability model for
the test had been programmed for the electronic computer and
six items were the maximum for this program. Further research
is needed to determine how rapidly the output characteristic
changes (if at all) when the test consists of more items.
Test Construction Procedures
The computational model described in Chapter III for
the construction of a sequential test has a method of
selecting items with the best possible parameter values. This
method could be used in the construction of a sequential test
with the data in terms of difficulty and precision taken from
actual items. The criterion may be a measure of the number
of individuals desired to pass the item or a measure of the
variance of ability levels of individuals assigned to the
pass and fail categories.
It would seem reasonable that one should use the most
precise items to differentiate the individuals as to ability
level and then the difficulty of a less precise item could
be used to control the number of individuals assigned to any
one score category. The second differentiation would not be
as valid as the one made with the more precise item, but the
shape of the distribution could be well controlled.
In addition to the lack of a complete evaluation of the
score distribution control procedure, there has been no attempt
to follow the standard criteria such as that published by the
Committee on Test Standards of the American Educational
Research Association.20 These criteria include content
validity, concurrent validity, predictive validity, con-
struct validity, error of measurement at different score
levels, equivalence of forms reliability, internal consis-
tency reliability, stability reliability, and information on
norms and scales.
Since this dissertation uses hypothetical data, content
validity is not considered. It is assumed that the test items
are homogeneous and thus measure only one content or ability
which may or may not be a composite of several abilities.
The six-item sequential is compared with the six-item
cumulative but no correlation is computed between the two
sets of scores, as is common in concurrent validity studies.
In this type of model one can probably obtain more informa-
tion from the correlation with a known criterion score than
from correlation between sequential and cumulative test scores.
The predictive validity of the test is not determined
as it made no sense to use hypothetical data to predict
hypothetical performance. Predictive validity needs to be
20American Educational Research Association, Committee
on Test Standards, and National Council on Measurements Used
in Education, Committee on Test Standards, Technical Recom-
mendations for Achievement Tests, 1955.
studied through the construction of a sequential test with
actual items, testing of a group, and then the prediction
of future performance. This would be a logical next step if
the model data studied here show that the sequential item
test is a better test than a six-item cumulative under the
conditions of this study. If the sequential test does not have
results which may be considered better than the results from
the cumulative test, then there is no need to study the
sequential under less favorable conditions.
In construct validity it is assumed that the character-
istics measured and related are not affected by the type of
items used in the test. Results from this study may be used
to indicate that these assumptions are not met in most situ-
ations. A study of the attenuation paradox literature should
make one aware of the problems involved in the measurement of
characteristics and their relationships. There is no attempt
to evaluate the construct validity of the sequential test.
Neither is there any attempt made to correlate test scores
with other abilities that should be related to the particular
hypothetical ability being measured. That which is measured
is any homogeneous ability measured by the items with the
given level of precision; all of the items in the sequential
model have the same precision.
Error of measurement at different score levels is
examined in detail as suggested by the criteria for evaluation
of a test. The discriminating power of the test at a given
level of test score is to be distinguished from the discrim-
inating power at a given level of ability. Both the variance
of the test scores of each ability level, and the variance of
the ability levels at each score are examined.
The equivalence of forms reliability is not determined
as there is only one form. It would be quite simple to build
two tests in a computer and determine how well the scores on
the one test could be predicted from the scores on the other.
It is possible that quite equivalent tests could be built
from quite different items. This possibility is not examined
in this dissertation.
Due to the hypothetical nature of the data the internal
consistency reliability is not examined. Stability reliability
is not determined as it would be necessary to administer a
test twice to a group to determine this, and no test is
actually used in this paper. This is another area that
needs to be examined.
There is a fairly complete discussion of the score dis-
tribution of the sequential item test. It is hoped that the
rationale which predicted the type of score distribution
would be proved correct and thus a tested rationale would
be presented rather than a rationale derived from the results.
Norms (like many of the criteria listed to evaluate a
test rather than a test procedure) are irrelevant to the test
procedure.
Another limitation to the study is that no attempt is
made to examine the effects of errors of estimating the
parameter values when the level of precision is low. However,
one would suppose that the effect of errors will be less at
lower precision levels. If the effects produced at high
levels of item precision are within the error range for
practical significance, then there is little need to examine
the effects at low levels of item precision. If the effects
at high levels of item precision are beyond the error allowed
for practical significance, then one must determine the effects
of lower item precisions or develop methods of obtaining
better estimates. This decision can be made later.
Test Presentation Procedures and Effects
In the area of sequential test presentation to the testee
little is known as to how to proceed in actual practice. For
example, it may be psychologically advantageous to give the
easiest items first, allowing some individuals to subsequently
try more difficult items, rather than to have everyone start
at an item of 50 per cent difficulty. Since the test is not
given to an actual group this procedure cannot be examined in
this dissertation.
the greater the number of scores.)

    σ²E(X) = n[t₀(1 − t₀) − t₁²ρ₁ − t₂²ρ₁² − t₃²ρ₁³ − . . .]

When the mean difficulty of items is at the 50 per cent level
of difficulty for the individual then the error variance
of the score is defined as below:

    σ²E(X) = n(1/4 − ρ₁/2π − . . .)

The terms are defined as follows:
    σ²E(X) = error variance of score
    n = number of items
    X = score value

23D. N. Lawley, "On Problems Connected with Item Selec-
tion and Test Construction," Proceedings of the Royal Society
of Edinburgh, 61 (Section A, Part III):273-287, 1942-1943, p.
273.
    t₀, t₁, t₂, etc. = values from Table 29 of Pearson's
        Tables for Statisticians and
        Biometricians (ordinarily used to
        calculate rₜₑₜ)
    ρ₁ = σᵢ² / (σₑ² + σᵢ²)
    σᵢ² = variance of item difficulties (standard score
        form)
    1/σₑ² = precision of item
From these equations, and the assumptions mentioned
above, one can determine that a large ρ₁ would reduce the
error term whether the ability level is equal to the mean
difficulty of the items or not. The size of ρ₁ can be in-
creased by decreasing σₑ² (using more precise items), by
decreasing σᵢ² in the denominator (or using all items at one
difficulty), or by increasing σᵢ² in the numerator (using
items at more than one difficulty level). This immediately
suggests that the best procedure is to use more precise items
if one wishes to reduce error variance in the score, as σᵢ
appears in both the numerator and denominator. This is in
contrast with the most valid test results reported by Tucker,
who empirically found that the most valid test was the test
with imperfect items.25
Another way of reducing error variance would be to use
small t₀ values. (The value t₀ is necessary to enter
25Tucker, op. cit.
Pearson's tables.) Lawley gives the following formula for
t₀:26

    t₀ = area of the normal curve beyond the deviate (x − d̄)/σ

where
    x = ability level (standard score form)
    d̄ = mean difficulty level of cumulative test
    σ² = σₑ² + σᵢ² (as defined above)
To aid in the understanding of the interpretation of
the formula given above, the following summary data are reported
for a test with the mean difficulty level of items nearly
equal to the mean ability level (d̄ = .045) and with a σ
(a combination of the spread of item difficulty and precision
of items) of 1.30 for a 100-item test.27 The values of σ²E(X)
for given values of (x − d̄)/σ are as follows:

    (x − d̄)/σ    σ²E(X)
    0.0          20.8
    0.1          20.7
    0.2          20.4
    0.3          19.8
    0.4          19.0
    0.5          18.0
    0.6          16.9
    0.7          15.6
    0.8          14.3
    0.9          13.0
    1.0          11.6

26Lawley, op. cit., p. 279.    27Ibid.
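The figures above may be reproduced from Lawley's series, σ²E(X) = n[t₀(1 − t₀) − Σᵣ tᵣ²ρ₁ʳ], generating the tetrachoric functions tᵣ from a Hermite-polynomial recurrence. Since the text reports only the combined σ = 1.30 and not the σᵢ/σₑ split, the value ρ₁ = 0.26 used here is an assumption inferred to match the reported figures, not a value stated in the dissertation.

```python
import math

def error_variance(z, n=100, rho1=0.26, terms=12):
    """Error variance of the number-right score at standardized
    distance z = (x - dbar)/sigma from the mean item difficulty.

    Evaluates n * [t0(1 - t0) - sum_r t_r^2 * rho1^r], where
    t_r(z) = phi(z) * He_{r-1}(z) / sqrt(r!) are the tetrachoric
    functions (the quantities tabled in Pearson's Table 29)."""
    t0 = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))  # normal-curve area
    phi = math.exp(-z * z / 2.0) / math.sqrt(2.0 * math.pi)
    he_prev, he = 0.0, 1.0      # He_{-1} (dummy) and He_0
    factorial = 1.0
    series = 0.0
    for r in range(1, terms + 1):
        factorial *= r
        series += (phi * phi * he * he / factorial) * rho1 ** r
        he_prev, he = he, z * he - (r - 1) * he_prev  # Hermite recurrence
    return n * (t0 * (1.0 - t0) - series)
```

With these assumptions the series returns 20.8 at z = 0.0, 18.0 at z = 0.5, and 11.6 at z = 1.0, matching the tabled values; with ρ₁ = 0 it reduces to the binomial value npq = 25 for a 100-item test.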
As can be seen from the preceding data, for given
d̄ and σ values, the higher the ability level (x), the
lower the error variance for the score (σ²E(X)) for a cumu-
lative test. If the items had a large value (fixed) for the
mean difficulty level (i.e., the value of d̄ increased)
then the value of (x − d̄)/σ would be smaller and thus the
error variance (σ²E(X)) would be larger.
Lawley also pointed out that the effective discrimin-
ating power of a test may be computed as follows:28

    (E(X) − E(X′)) / (σE(X) + σE(X′))

x and x′ are two different ability levels.
X and X′ are two different score values.
Other terms are defined as before. If x = d̄ the above
formula takes a simpler form.
As Lawley pointed out, in order to increase the effec-
tive discriminating power the numerator must be increased,
which means obtaining large values for E(X) − E(X′), or the
denominator may be decreased; and, assuming σₑ² is constant
(as one cannot change precision), then one must change σᵢ,
which is the spread of difficulty.29 The smaller the spread
of difficulties, the lower the value.

28Ibid., p. 280.
29Ibid., p. 281.
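Lawley's argument may be sketched numerically: holding σₑ fixed and shrinking the spread of difficulties σᵢ raises the effective discriminating power D = (E(X) − E(X′))/(σE(X) + σE(X′)). The error variances are evaluated with the first terms of the tetrachoric series given earlier, and all parameter values below are illustrative assumptions.

```python
import math

def _norm_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def _error_variance(z, n, rho1, terms=10):
    # n * [t0(1 - t0) - sum_r t_r^2 rho1^r], via a Hermite recurrence
    t0 = _norm_cdf(z)
    phi = math.exp(-z * z / 2.0) / math.sqrt(2.0 * math.pi)
    he_prev, he, fact, series = 0.0, 1.0, 1.0, 0.0
    for r in range(1, terms + 1):
        fact *= r
        series += (phi * phi * he * he / fact) * rho1 ** r
        he_prev, he = he, z * he - (r - 1) * he_prev
    return n * (t0 * (1.0 - t0) - series)

def discriminating_power(x, x_prime, dbar, sigma_e, sigma_i, n=100):
    """D = (E(X) - E(X')) / (sigma_E(X) + sigma_E(X')) for two ability
    levels, with E(X) = n * Phi((x - dbar)/sigma)."""
    sigma = math.sqrt(sigma_e**2 + sigma_i**2)
    rho1 = sigma_i**2 / sigma**2
    z, zp = (x - dbar) / sigma, (x_prime - dbar) / sigma
    numerator = n * (_norm_cdf(z) - _norm_cdf(zp))
    denominator = (math.sqrt(_error_variance(z, n, rho1))
                   + math.sqrt(_error_variance(zp, n, rho1)))
    return numerator / denominator

# Same precision, narrower spread of difficulties -> larger D.
d_narrow = discriminating_power(1.0, -1.0, 0.0, sigma_e=1.0, sigma_i=0.2)
d_wide = discriminating_power(1.0, -1.0, 0.0, sigma_e=1.0, sigma_i=0.8)
```

Under these assumed values D falls from roughly 9.2 to roughly 7.8 as the spread of difficulties is widened, which is the direction of Lawley's conclusion.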
The effective discriminating power for a test would
thus be greatest when the mean difficulty of items was
equal to the ability level for the extremes in ability, and,
when there was no spread of item difficulties. This type
of test would be used to create scores which would be
assigned only to individuals that are the same. It is not
used to differentiate between the ability levels of individ-
uals. The same logic which states that middle scores will
be more precise (i.e., representing only one type of ability
level individual) when difficulties are extreme would indicate
that extreme scores will be more precise when 0.00 level of
difficulty items are used in the test. (Remember the formula
uses x − d̄, so it would operate for either extreme of difficulty.)
Support for this position is given by Lord who stated
that the standard error of measurement would be practically
zero for extreme positive or negative values of ability.30
He argued that there would exist individuals whose ability
would be so low that the test would not be discriminating for
them, and other individuals whose ability would be too high
to be discriminated. The standard error of measurement is low
for these zero or perfect scores and is necessarily smallest
for those examinees for whom the test is least discriminating.
The above solutions to changing the criteria of test
validity still do not exhaust the solutions to the attenuation
3OLord, A Theory of Test Scores, op. cit., p. 14.
paradox. Brogden offers yet another solution. He found
that the correlation continued upward when a spread of item
difficulties instead of one level of difficulty was used.31
He concluded that the problem was that of determining the
distribution of item difficulties to yield a more valid
score. Brogden showed that by using items with rₜₑₜ = .60
or higher, a distribution of item difficulties will produce
(for an 18-item test) a higher validity than will be obtained
with all items at the .50 difficulty level.32 The spread of
difficulty seemed to be important when items were of this
reliability.
Brogden's solution of determining the spread of items
for a test such that the results would correlate highest
with a criterion seems to be inadequate since there remain
the problems of measuring the relationship and the meaning
of the coefficients that are computed. It is impossible to
solve all of these problems at this time, but assuming that
the difficulty of the item is an adequate score, and assuming
that discrimination among ability levels (with an examination
of the effective discriminating power) is the important ques-
tion, a rationale can be built for the sequential test devel-
oped in this dissertation.
Two areas of literature will now be examined to build
the rationale for effective use of items in the sequential
31Brogden, op. cit., p. 240.    32Ibid.
test. They are (1) literature on Bayes' Theorem, and (2)
literature on the use of items at the 50 per cent level of
difficulty for the hypothetical group with a median ability
level equal to the value at which the discrimination is
desired.
Meehl and Rosen, through the use of Bayes' Theorem,
point out that the practical value of a psychometric sign,
pattern, or cutting score depends jointly upon its intrinsic
validity (in the usual sense of its discriminating power)
and the distribution of the criterion variable (base rates)
in the clinical population.33 They note that if the base
rates of the criterion classification deviate greatly from
a 50-50 split, the use of a test sign having only moderate
validity will result in an increase of erroneous clinical
decisions.
One reason that the sequential test is assumed to have
maximally efficient use of items is that the base rate does
not have to deviate from the 50-50 split. The other reason
is that the sequential test uses items at the 50 per cent
level of difficulty for the group taking the item. These
items have been found to be efficient with various criteria
for efficiency.
Lord concluded from maximizing the ratio of difference
in means to standard error of difference, that if one desires
33Meehl, Paul E. and Rosen, Albert, "Antecedent Proba-
bility and the Efficiency of Psychometric Signs, Patterns, or
Cutting Scores," Psychological Bulletin, 52:194-216, No. 3,
1955.
to construct a test that will have the greatest possible
discriminating power for examinees of a given level of
ability, c = c₀, then all items should be of equal difficulty
(no spread) and of such difficulty that half of those exam-
inees whose ability score is c₀ would answer each item
correctly and half would answer it incorrectly.34 This
measure of discriminating power is completely independent of
the distribution of ability in the group tested.
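Lord's condition can be illustrated with the normal-ogive item model assumed throughout: the probability of passing is the normal-curve area up to (ability − difficulty)/σₑ, and the item's discriminating power, the slope of that curve in ability, peaks exactly where difficulty matches ability. The parameter values below are illustrative.

```python
import math

def pass_probability(ability, difficulty, sigma_e=0.5):
    """Normal-ogive probability of passing an item."""
    z = (ability - difficulty) / sigma_e
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def item_slope(ability, difficulty, sigma_e=0.5):
    """Slope of the item curve with respect to ability:
    phi((a - d)/sigma_e) / sigma_e. It is greatest when the item's
    difficulty equals the ability level c0 at which discrimination
    is wanted, which is Lord's equal-difficulty condition."""
    z = (ability - difficulty) / sigma_e
    return math.exp(-z * z / 2.0) / (sigma_e * math.sqrt(2.0 * math.pi))

c0 = 0.0  # the ability level of interest
matched = item_slope(c0, difficulty=0.0)   # item at 50% for c0
too_hard = item_slope(c0, difficulty=1.0)  # item too difficult for c0
```

A matched item gives the steepest curve at c₀; at that point exactly half of the examinees at c₀ pass, which restates the 50 per cent rule in the model's own terms.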
However, when item precision is such that item-total
biserial correlations are .447, Lord empirically showed that
a test composed solely of items at the 50 per cent difficulty
is more discriminating (as measured above) than any other
test for examinees at any level of ability between -2.5 and
+2.5.35 Lord does not show results of more highly correlated
items which will be investigated in the present study.
Lord's empirical study above is supported by Cronbach and
Warrington's theoretical study. They stated that for
items of the type ordinarily used in psychological tests,
the test with uniform item difficulty gives greater over-all
validity and superior validity for most cutting scores, as
compared with a test with a range of item difficulties.36
It is the cutting score validity which is new here and of
some relevance to the sequential test constructor. For
34Lord, A Theory of Test Scores, op. cit., p. 26.
35Ibid., p. 29.
36Cronbach and Warrington, op. cit., p. 127.
example, Cronbach and Warrington found that if σₑ = .2
(i.e., rₜₑₜ = .94 or φ = .80), if no guessing is possible
(or rₜₑₜ = .55 or φ = .37 if the probability of chance suc-
cess by guessing is one-third), and if all items are at the
50 per cent level of difficulty, better results are obtained
for separating out from 40 to 62 per cent below the cutting
score than if there were a normal distribution of item
difficulties.37
The empirical determination of the best difficulties
for discrimination has not always been as nonsupportive of
the present rationale as the work of Lord. Lord used discrim-
inating power (as defined by him) as his criterion. Richard-
son's empirical study had more supportive results. He created
five subtests of different difficulty levels: 78-95, 60-77,
41-54, 23-40, and 5-22.38 He then calculated the biserial
correlations for 23 different divisions of the criterion
starting at 4.17 per cent of the people in the lower category,
and, by percentage units of 4.1667, continuing to 95.83 per
cent in the lower category. He graphed these results and
noted that the test consisting of items from 78-95 per cent
passing produced the highest biserial correlation for those
divisions where 4.17 to 25.00 per cent of the people were in
the lower category. Likewise the 60-77 per cent pass test
37Ibid., p. 135.
38M. W. Richardson, "The Relation Between the Difficulty
and the Differential Validity of a Test," Psychometrika, 1:33-
49, No. 2, June, 1936.
was best for the 25.00 to 35.00 divisions; the 41-54 per cent
pass test for the 35.00 to 61.50 divisions; the 23-40 per
cent pass test for the 61.50 to 82.00 divisions; and the
5-22 per cent pass test produced the highest biserials
where 82.00 to 95.83 per cent of the people were in the lower
category. Although these results are from 50—item tests, the
results indicate that different difficulty tests for differ-
ent discriminations should be useful.
Other results from studies which would support the
position that items at the 50 per cent level of difficulty
for the group are the best items are those which indicate
differentiation of a group by items of different difficulty.
In these studies the ability levels of the individuals are not
known and differentiation for each ability level is not re-
ported separately. The reader must assume that the individ-
uals were normally distributed around an ability level equal
to the difficulty of the items. If this assumption is made
then low differentiation by difficult items supports the con-
clusion that items appropriate for the ability level are the
best items.
Such a study as described above is reported by Cleeton.
Cleeton used four well-selected ability groups: one superior
group and three inferior groups.39 He then constructed two
measures of the differential or predictive value of the test.
39Glen U. Cleeton, "Optimum Difficulty of Group Test
Items," Journal of Applied Psychology, 10:327-340, No. 3,
September, 1926.
One of these was (R₁ - R₄) in which R stands for the number
of items answered correctly by group 1, 2, 3, or 4. The
other measure was (R₁ - R₂) + (R₁ - R₃) + (R₁ - R₄) +
(R₂ - R₃) + (R₂ - R₄) + (R₃ - R₄), terms having the same
meaning as above. These are criterion II and criterion I
in the following results, respectively. Cleeton examined
difficulty by grouping 1/10 of the items in each interval
and by grouping 1/10 of the range of difficulty in each
interval. For present purposes it is most informative to
look at the actual difficulty divided into 10 parts even
though the number of items in each interval is different.
The following data show the results of 240, 240, and 480
individuals each taking three tests of 400, 236, and 109
items. (For the computation of criterion indices, Cleeton
assumed that he had only 720 individuals.)
    Interval for     Rank          Rank          Value         Value
    % Passing Item   Criterion I   Criterion II  Criterion I   Criterion II
    91 - 100          8             8             44.4          14.7
    81 - 90           6             6            104.9          28.9
    71 - 80           5             5            125.9          40.8
    61 - 70           4             4            152.6          46.8
    51 - 60           3             3            158.9          47.6
    41 - 50           1             1            175.2          51.3
    31 - 40           2             2            163.9          51.1
    21 - 30           7             7             85.8          26.1
    11 - 20          10            10             35.9          11.1
    0 - 10            9             9             37.3          11.9
From the above data one may determine that the slightly more
difficult items seem to have the greatest predictive value
as measured by both these estimates of predictive value.
This would support the decision to use items at the 50 per
cent level of difficulty for the group which is to be dis-
criminated among.
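Cleeton's two indices can be written out directly. The R values below are hypothetical counts, used only to show that criterion I is the sum of all six pairwise differences while criterion II is the extreme-group difference.

```python
from itertools import combinations

def criterion_2(r):
    """Cleeton's extreme-group index: R1 - R4."""
    return r[0] - r[3]

def criterion_1(r):
    """Sum of all pairwise differences (R1 - R2) + (R1 - R3) + ... +
    (R3 - R4), for four ability groups ordered superior to inferior."""
    return sum(a - b for a, b in combinations(r, 2))

# Hypothetical numbers of correct answers by groups 1 (superior) to 4.
R = [10, 8, 5, 2]
```

For these counts criterion II is 8 and criterion I is 27; an item equally easy for all four groups would score zero on both, which is why low values mark items of little differential value.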
Logical analysis also supports the above decision.
Flanagan pointed out the extremes of this difficulty and
item validity argument. He stated that if one wanted the
maximum amount of discrimination between the individuals in
a particular group, a test should be composed of items all
of which are at 50 per cent difficulty for that group,
provided the intercorrelations of all the items are zero.40
If intercorrelations were other than zero, the decision
would not be this clear.
Lord studied theoretical test models which had either
high or low item reliabilities with easy, difficult, or easy
and difficult test items. After examining the relationship
of the true score distribution to the distribution of ability,
he reached the following conclusion:41

    A test composed of items of equal discriminating
    power but of varying difficulty will not be as
    discriminating in the neighborhood of any single
    ability level as would a test composed of similar
    items all of appropriate difficulty for that level.

40John C. Flanagan, "General Considerations in the
Selection of Test Items and a Short Method of Estimating the
Product-Moment Coefficient from Data at the Tails of the
Distribution," Journal of Educational Psychology, 30:674-680,
No. 9, December, 1939.

41Lord, "The Relation of Test Score to the Trait Under-
lying the Test," op. cit., p. 543.
Thus, most of the literature supports (1) the use of items
at the appropriate difficulty for each level and (2) the
separation of individuals into groups that would have a base
rate of 50 per cent.
Because the base rate is near a 50 per cent split each
time, the sequential model should permit the use of only
moderately discriminating items. In the cumulative test,
there will be only 5 or 10 per cent of the individuals who
should pass a difficult item, as all people take the item.
In the sequential method 50 per cent should pass this dif-
ficult item, as only those with high ability will take the
item. According to Bayes' Theorem the probability of high
ability people passing the item must be much higher than the
probability of low ability people passing the item if 90 per
cent of those taking the item have low ability. Once the
group taking the item has a base rate of 50 per cent (as is
the case in the sequential method), then the item should work
better; i.e., increase the number of correct clinical decisions.
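The Bayes' Theorem argument can be made concrete. With hypothetical pass rates for high- and low-ability examinees, the posterior probability that a passer has high ability is weak under a 10-90 base rate but strong under the 50-50 split that sequential routing produces; all of the numbers below are illustrative.

```python
def posterior_high_given_pass(base_rate_high, p_pass_high, p_pass_low):
    """Bayes' Theorem: P(high ability | item passed), given the base
    rate of high ability in the group taking the item."""
    joint_high = base_rate_high * p_pass_high
    joint_low = (1.0 - base_rate_high) * p_pass_low
    return joint_high / (joint_high + joint_low)

# A moderately valid difficult item (hypothetical pass rates).
skewed = posterior_high_given_pass(0.10, p_pass_high=0.7, p_pass_low=0.2)
balanced = posterior_high_given_pass(0.50, p_pass_high=0.7, p_pass_low=0.2)
```

Here the posterior rises from .28 under the 10-90 split to about .78 under the 50-50 split, which is the sense in which routing makes a moderately valid item worth administering.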
In the sequential test, those groups which are different
in ability would use items at the 50 per cent level of dif-
ficulty for that group. This would allow the use of diffi-
cult items which are precise. Such items could not be
efficiently used in a cumulative test.
II. CONTROL OF THE SCORE DISTRIBUTION
The problem of score distribution is not only to
assign a specified number of individuals to each score value,
but also to assign like individuals to each score value. The
score distribution is not only related to the item parameters,
but should also be related to the use. The score distribution
problem may be studied through the use of a theoretical model
or empirically.
Lord attempted to study the problem of control of score
distribution through the use of a theoretical model. He made
the following assumptions: (1) the item characteristic curves
have the general shape typical of cognitive items that are not
answered correctly by guessing; (2) the items are homogeneous
in a certain specified sense; (3) the items are scored 0 or 1;
and (4) the raw test score is the number of items answered
correctly.42 (A homogeneous test is, for Lord's purpose,
defined as a test composed of items such that, within any
group of examinees all of whom are at the same ability level,
the response given to any item is statistically independent
of the response given to the remaining items.)
The generalizations reached by Lord were as follows:43
1. Since the test characteristic curve is in general
nonlinear, the test score distribution will not in
general have the same shape as the distribution of
42Ibid., p. 546.
43Ibid., pp. 541-542.
ability; in particular, if the ability distribution is
normal, the score distribution in general will not be
strictly normal.
2. U—shaped and roughly rectangular score distributions
can be produced provided sufficiently discriminating
test items can be found. (All appropriate individuals
pass or all appropriate individuals fail an item if they
are perfect items at the 50 per cent level of difficulty.)
3. Typically, if a test is at the appropriate difficulty
level for the group tested, the more discriminating the
test, the more platykurtic the score distribution.
4. The skewness of the test score distribution
typically tends in a positive direction as the test dif-
ficulty is increased above the level appropriate for
the group tested; in a negative direction as the test
difficulty is decreased below that level.
These generalizations aid in interpreting the empirical
results of a study made by Mollenkopf.44 He selected 1000
answer sheets chosen on the bases that: (a) every person
must have attempted every item, and (b) a wide range of scores
should exist in the sample chosen. Items were then chosen to
make up nine synthetic tests. These nine tests contained
score distributions with three types of kurtosis and three
types of skewness. A study of the literature revealed that
the total test score distributions were believed to be con-
trolled for skewness by item difficulty. However, since easy
items tended to have higher correlations with the total score
than did difficult items, control on mean difficulty alone
was found not to be sufficient. When building a test with a
symmetrical score distribution, Mollenkopf found that a set

44William G. Mollenkopf, "Variation of the Standard
Error of Measurement," Psychometrika, 14:189-229, No. 3,
September, 1949.
of items of the same type (all of difficulty close to .50)
yielded scores with a definitely flat distribution. (From
Lord's work, it looks as though the item precision must have
been very good.) To secure a leptokurtic score distribution
Mollenkopf tried sets of items with .40 and .60 difficulties,
but found that homogeneous sets of items of .20 and .80 dif-
ficulties were needed.
If one uses Lord's work to translate back from score
distribution (by assumed highly precise items) to ability
level, one can determine that the distribution of ability
must have been near normal. Also of interest in the Mollen-
kopf article is the fact that the standard error of measure-
ment for a nonskewed platykurtic distribution of scores is
greatest in the middle sections and lowest at the extremes.
This may be accounted for by what Mollenkopf has labelled
the "end effect."45 This effect means that at the ends
large differences in parallel forms cannot occur. A perfect
score is perfect in each half. Small empirically observed
errors of measurement are inevitable in the tail where the
pile-up occurs on skewed distributions but not for normal
distributions.
This explanation would suggest that the variance of
ability levels for a given test score may be small, but it
does not indicate, as Mollenkopf also pointed out, that there
is a small variance of scores for a given ability level.
Both points are of interest if reflection of the ability
distribution is desired in the score distribution.
The cumulative test can be used to yield the type of
score distribution that one wishes. The important parameters
are item difficulty and item precision, but only general
statements are available as to the relationship between
these parameters and the score distribution. Empirical
studies are used to determine exact parameter values for
given score distributions.
Humphreys stated that the variance of item difficulties
forces scores toward the center of the distribution and thus
counters the effect of high item intercorrelations.46 It is
thus necessary to have a spread of difficulties only if
the items are very precise. Whereas very highly intercor-
related items of one difficulty level would produce two
scores, if one were to use a spread, one could force people
into a distribution that would be expected to have some
validity. Humphreys advocated that the shape of the score
distribution be controlled by the difficulty level of the
test items.47 The type of distribution favored by Humphreys
was a rectangular distribution--a distribution that would
allow individuals to be ranked.
46Humphreys, op. cit., p. 474.
47Ibid., p. 475.
If the items were perfect, the procedure to produce
the rectangular distribution desired by Humphreys would be as
reported by Davis. Davis reported that if the tetrachoric
item intercorrelations are all unity, a rectangular distri-
bution of raw scores is most likely to be obtained by
selecting items with difficulty levels of 1/(n + l),
2/(n + l), 3/(n + l), . . . n/(n + 1). However, if the
tetrachoric intercorrelations are all .50, a rectangular dis-
tribution of raw scores is most likely to be obtained by
selecting all items at the 50 per cent level of difficulty.48
He argued that for any level of tetrachoric item intercorre—
lations from zero to .50, the maximum number of discrimin-
ations that could be made by the total score would be insured
by selecting all items at the 50 per cent level of difficulty.
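Davis's two rules can be stated compactly in code. This is a sketch of the rules as reported above; the function name is ours, and only the two cases Davis treats are covered.

```python
def davis_difficulties(n_items, r_tet):
    # Difficulty levels (proportion of examinees passing) that Davis
    # reports as most likely to yield a rectangular raw-score
    # distribution, for the two cases he treats.
    if r_tet >= 1.0:
        # Tetrachoric intercorrelations all unity: spread the
        # difficulties evenly at i/(n + 1), i = 1 .. n.
        return [i / (n_items + 1) for i in range(1, n_items + 1)]
    # Intercorrelations from zero through .50: all items at the
    # 50 per cent level maximize the discriminations made.
    return [0.5] * n_items
```

For a three-item test with perfectly intercorrelated items this gives difficulties of .25, .50, and .75; with intercorrelations of .50 or less, every item sits at .50.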
Davis went on to say that this simple mathematical
procedure employed to specify the exact difficulty levels
of items for two- and three-item tests cannot be applied to
specifying the exact difficulty levels of items for tests
containing larger numbers of items except in the limiting
case when the item intercorrelations are all unity. The
reason one cannot generalize is that when intercorrelations
are not unity, errors in classification will be made, and the
spread of ability represented by those who pass or fail will
be greater but undetermined. Thus, the appropriate difficulty
48Davis, op. cit., p. 103.
for the resulting group cannot be easily determined. The
effect of errors is difficult to determine, but as pointed
out by Davis, there is need for a general solution.
Whereas the general rules about control of score dis-
tribution are known, there is no general solution in the
sense that the actual score distributions are known. The
actual score distributions must be empirically determined
for each test. The literature indicates that if the sequen-
tial method of testing could more easily and predictably
control the score distribution, a real contribution would be
made to the solution of a difficult measurement problem.
III. MEANING AND USE OF SCORE PRODUCED
Both the score distribution and the meaning of a score
are related to the use of the test. Ferguson has pointed
out that for discrimination between two groups one would
need a bimodal distribution of scores; the discrimination
between two groups and among the members of one group would
require an asymmetrical distribution of scores; and, if one
were establishing the order of ability of individuals, one
would use a rectangular distribution. Ferguson concluded
that the construction of tests to yield distributions ap-
proximating the normal form results in a loss of discrimina-
tory capacity.49
49George A. Ferguson, ”On the Theory of Test Discrimin—
ation," Psychometrika, 14:61-68, No. 1, March,l949, p. 68.
Not all scores have the same meaning. A score resulting
from the discrimination between two groups is more a probabil-
ity statement that the individual should be classified into
a given category than it is a statement that the individual's
ability is at a certain level. The score from a test designed
to rank individuals compares any individual in relation to
others.
In addition to the meanings necessary for the above uses,
Gulliksen (as stated in the first section) would have the
score be the best estimate of the difficulty level reached.50
This type of score represents the ”true ability" level of
the individual. This type of score is also advocated by those
who argue for reproducibility as a measure of the best test.
However, it should be noted that it has been the practice to
determine how well a pattern of responses from an instrument
will reproduce original results, not hypothesized "true"
results. As reported by White and Saltz, these indices will
reflect without equivocation the amount of information thrown
away by representing the subject's performance on the test by
a total score based on the number of items passed. "They
indicate, in other words, how adequately a unidimensional
model fits the obtained data."51
50Gulliksen, op. cit.
51Benjamin W. White and Eli Saltz, "Measurement of
Reproducibility," Psychological Bulletin, 54:81-99, No. 2,
March, 1957, p. 95.
However, a reproducibility score from a unidimensional
test does not insure either an interval scale or a known be-
havior domain being sampled. Individuals may be ranked by
the test scores (compared to other individuals) or be assigned
an ability level (compared to a standard). The behavior
domain may be related to the test label or it may not--the
only assurance one has is that the domain is unidimensional.
The question as to domain samples (which seems like a
validity question) has actually been studied as a part of
reliability. Tryon in theory related reliability to the
behavior domain sampled.52 He reviewed the two theories of
test reliability: (1) the Spearman-Yule theory that tests
are unreliable because of an error factor and reliable because
of a true factor which may be a composite of more than one
common factor; and (2) the Brown-Kelley theory that reliabi-
lity may be explained by equivalent test-samples in which all
items in the total score have equal standard deviations and
equal intercorrelations. (To obtain equivalent test-samples
the content and difficulty of items must be considered, but
all items do not have to be equally difficult.)
Tryon defined reliability as the value of "correlation,
rtt, between the observed Xt scores and a second set of com-
posite scores, Xt', earned on a 'comparable form' of the Xt
52Robert C. Tryon, "Reliability and Behavior Domain
Validity: Reformulation and Historical Critique," Psycho-
logical Bulletin, 54:229-249, No. 3, May, 1957.
composite."53 (A comparable Xt composite is one in which the
n test-samples vary on the average as much in standard devia-
tions and intercorrelations as do the n test-samples in the
observed Xt composite.)
If this definition of reliability is used, a reliable
test is one that indicates how well the individual knew the
domain or how he ranked with others in his knowledge of the
domain. At least the domain sampled by the score is known
and can be made part of the meaning of the score.
The literature reviewed to this point would indicate
that the score (1) may be a function of difficulty which prob—
ably reflects the ability level of the individual, (2) may
represent a pattern as to content, or (3) may indicate how
well the individual did on the samples of the domain that the
test is hypothesized to sample. Reliability measures may be
a factor in determining what meaning can be assigned to the
score, but there are still contributions coming from content
and from difficulty.
Swineford examined the importance of the difficulty of
the item as a factor in the score assigned to the individual.
Swineford has shown that only if the items are quite precise
and intercorrelated is the difficulty of the item an important
factor in the score of an individual. Swineford used present
53Ibid., p. 230.
day tests and attempted to measure the impact of variability
of item difficulty and item-item correlation.54 The varia-
bility of item difficulty was designated σΔ, Δ being the
normal-curve deviate (for a distribution with mean of 13 and
standard deviation of 4) above which lies the area under the
curve equal to the proportion of successful examinees. For
a measure of inter-item correlation Swineford used the recip-
rocal of the square of the mean of the item-total correlation.
The results of Swineford's study showed that when the
score was the number correct, the best formula for pre-
dicting this score was as follows:
Z1 = .1530 Z3 + .8649 Z4

where
Z1 is the predicted standard score on the test,
Z3 is the measure of the spread of item difficulties
in standard score form, and
Z4 is the inter-item correlation measure in standard
score form.
R1.34 = .9648 for this formula.
When the score was the number right minus k times the
number wrong the results were as follows:
Z1 = .2117 Z3 + .9222 Z4

and R1.34 was .9642. The symbols are the same as above. As
can be seen from these formulas, the contribution of spread
54Frances Swineford, “Some Relations Between Test
Scores and Item Statistics," Journal of Educational Psycho-
logy, 50:26—30, No. 1, February, 1959.
of item difficulties in the usual cumulative test is not great.
Another way of looking at the contribution of item dif-
ficulty spread is to specify the spread and inter-item corre-
lation, and then examine the standard deviation of test
scores. Swineford used (n - chance)/σt on the one axis of
her chart, where values of (n - chance)/σt range from 5.8 to
3.0 for the highest (.50) rbis, and from 14.8 to 11.9 for the
lowest (.20) rbis. The mean rbis is .36, the highest rbis (.50)
is .70 sigma units away from the mean, and the lowest rbis (.20)
is 3.15 sigma units away from the mean. Thus, while the values
of σΔ may be considered to be close to normally distributed
and likely to be encountered in the usual cumulative test,
the values for rbis are not normally distributed. We might
conclude that if rbis were normally distributed, then higher
values of rbis might appropriately be investigated. A standard
deviation unit on σΔ would indicate that today most tests do
use items centered around the mean difficulty level, but that
the reliability of items has a larger range. If one examines
±.70 sigma units of rbis, one has about a three point change
in (n - chance)/σt values, which is about the same change en-
countered from ±3.0 sigma units of σΔ. This supports the
conclusion that conventional cumulative tests do not use
difficulty as a major factor in the score; the score is a con—
glomerate of difficulties and other factors.
The literature indicates that the cumulative test may
be constructed to measure a single factor but that the
attention of the test constructors has not been directed
toward reporting the decisions made as to the meaning of the
score. If one remains concerned with traditional operational
definitions of reliability and validity, one may forget the
construct operationalized and not change the construct when
it needs to be changed.
The sequential test procedure developed in this disser-
tation will use reflection of true ability as the meaning of
a score. The literature indicates that this is only one of
the many meanings that could be assigned to a score.
IV. SEQUENTIAL TESTING PROCEDURES
The literature indicates that there are many choices as
to the use of the sequential testing procedure. The sequen-
tial process may be used (1) to quickly determine the score
to be assigned to good and poor students; (2) to determine
to which of two categories the individual should probably be
assigned, if assigned at all; or (3) to classify each individ-
ual as well as possible in time allowed. The sequential
analysis developed by Wald would be most applicable to the
second purpose, but this method has been modified by Cowden
to serve the first purpose.
Cowden has indicated that when an examination is given
to a student it sometimes happens that not enough questions
are asked to permit a fair evaluation of his knowledge and
ability.56 On the other hand the examination is sometimes
drawn out longer than is necessary. If a student is very
good or very poor, only a few questions may be needed to
establish this fact beyond reasonable doubt; but borderline
students need to be examined at considerable length before
deciding whether they should be passed or failed. If sequen-
tial testing is used, the fate of good students and of poor
students tends to be quickly determined, but mediocre students
must continue with the examination until the results give
adequate grounds for a decision. By use of the sequential
method the number of questions answered by a student is re-
duced to a minimum, and at the same time the probability of
passing a poor student or failing a good student is controlled.
Cowden graded his students in a small class in elemen-
tary statistics at the University of North Carolina. Using
D1 (decision number I) to indicate the number of questions
that could be missed and still permit a student to pass, D2
(decision number 2) to indicate the number of questions that
must be answered incorrectly before a student is failed, and
N to indicate the cumulative number of questions answered;
the two linear equations used to make the decision follow:57
56Dudley J. Cowden, "An Application of Sequential Sampling
to Testing Students," Journal of the American Statistical
Association, 41:547-556, No. 236, December, 1946, p. 548.
57Ibid., pp. 548-549.
D1 = a1 + bN D2 = a2 + bN
As can be seen, the straight lines representing these two
equations are parallel and differ only as to the constants
a1 and a2. These constants a1 and a2 are shown to depend on
the values of p1, p2, α, and β when: "p1" is defined
as the maximum proportion of errors in all possible ques-
tions of a given type made by a student who is definitely
good; "p2" is defined as the minimum proportion of errors in
all possible questions of a given type made by a student who
is definitely poor; "α" is defined as the probability of
failing a good student; and "β" is defined as the probabil-
ity of passing a poor student. The more widely p1 and p2
differ the closer together the lines will be, and, therefore,
the more quickly will a decision be reached. The larger the
values of α and β the smaller will be the value of a2 and
the larger (algebraically) will be the value of a1. There-
fore to bring the two lines closer together one must increase
α and/or β. The value of a1 is always negative, since
answering all questions correctly does not strongly indicate
knowledge of the subject until a reasonable number of questions
is answered (what is a reasonable number depends on the value
adopted for β, becoming larger as β is made smaller). On
the other hand, a2 is always positive, but a decision to fail
cannot be reached until D2 = N, since a student cannot miss
more questions than he answers. When α = β, a2 = -a1. The
slope b is independent of α and β, but depends exclusively
on p1 and p2. Cowden gives the following formulas:58

g1 = log (p2 / p1)          g2 = log [(1 - p1) / (1 - p2)]

-a1 = h1 = log [(1 - α) / β] / (g1 + g2)

a2 = h2 = log [(1 - β) / α] / (g1 + g2)

b = g2 / (g1 + g2)
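Under this reconstruction the constants follow directly from p1, p2, α, and β (the base of the logarithms cancels in every constant). A sketch with illustrative parameter values:

```python
import math

def cowden_constants(p1, p2, alpha, beta):
    # Decision lines D1 = a1 + b*N (pass) and D2 = a2 + b*N (fail),
    # where N is the cumulative number of questions answered.
    g1 = math.log(p2 / p1)
    g2 = math.log((1 - p1) / (1 - p2))
    h1 = math.log((1 - alpha) / beta) / (g1 + g2)
    h2 = math.log((1 - beta) / alpha) / (g1 + g2)
    b = g2 / (g1 + g2)
    return -h1, h2, b  # a1 (negative), a2 (positive), slope b

a1, a2, b = cowden_constants(p1=0.10, p2=0.30, alpha=0.05, beta=0.05)
```

As the text states, a1 is negative, a2 is positive, the slope b falls between p1 and p2, and with α = β the constants satisfy a2 = -a1.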
Cowden thus develops two lines separating the pass, fail, and
indeterminate regions, but has grades for six categories based
on the following decisions:59
After 20 questions if a student made errors in less
than 10 percent of the questions, the grade of "A"
was assigned; if 55 per cent or more of the questions
were answered incorrectly, the grade of ”F” was
assigned; if the percent of incorrect questions was
between these percentage values then testing was con-
tinued. After 40 questions if a student (not classi-
fied before) made errors in less than 22.5 percent of
the questions the grade of "B” was assigned; or if
more than 45 percent of the 40 questions were incorrect,
the grade of ”F" was assigned. Similar decisions were
made after 60, 80, 100, 200, and 1,000 questions. After
1,000 questions those students not already classified
were assigned ”D" or "E” grades. Those individuals
having errors in less than 34.89 percent of the ques-
tions were assigned ”D” and those students having
errors in more than 35.3 percent of the questions
were assigned a grade of "E."
Sequential testing is thus changed to allow using more
than three categories by changing the number of items that
58Ibid., p. 551.
591bid., p. 552.
are used to make the decision. Estimates of the size of the
number of items can be obtained by the following formulas:60
N = h2 / (1 - b)

Np1 = [(1 - α) h1 - α h2] / (b - p1)

Np2 = [(1 - β) h2 - β h1] / (p2 - b)
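These average-sample-number estimates, as reconstructed, can be computed as follows; the parameter values are illustrative, not Cowden's:

```python
import math

def expected_questions(p1, p2, alpha, beta):
    # Wald's approximate expected number of questions for a definitely
    # good student (error rate p1) and a definitely poor one (p2).
    g1 = math.log(p2 / p1)
    g2 = math.log((1 - p1) / (1 - p2))
    h1 = math.log((1 - alpha) / beta) / (g1 + g2)
    h2 = math.log((1 - beta) / alpha) / (g1 + g2)
    b = g2 / (g1 + g2)
    n_good = ((1 - alpha) * h1 - alpha * h2) / (b - p1)
    n_poor = ((1 - beta) * h2 - beta * h1) / (p2 - b)
    return n_good, n_poor

n_good, n_poor = expected_questions(p1=0.10, p2=0.30, alpha=0.05, beta=0.05)
```

Both estimates are necessarily positive; with these illustrative values each decision takes roughly twenty questions on the average.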
Cowden found that it took 13.5 items before it was
possible to decide that the student should pass. This is
due to a random sample of items assumed in the sequential
process. It therefore seems worthwhile to investigate a pur-
poseful sample of items instead of random sample even though
the mathematics has not been worked out for this type of test.
To use the model developed by Wald, one must first state
the probability of type I and type II errors that one will
accept (as to a given alternative) and then continue until
one satisfies the conditions of the mathematical model with
probabilities.61 The procedure may be used to decide upon
pass or fail categories as was done by Moonan; or modified
by making assumptions about the number of items needed to
make the decision as done by Cowden; or an individual may
wait for the mathematics of the multiple decision (or other
modification) to be completed and reported, as Wald indicates
might be done in his book on sequential analysis.62
60Ibid., p. 553.
61Wald, Sequential Analysis, op. cit.
62Ibid., pp. 138-150.
The sequential procedure developed by Wald for a ”most
powerful" test is built upon the assumption that one may con-
tinue to sample the same universe. The procedure determines
what decision is best after every sample and states whether
one has attained the desired degree of probability (of being
correct). It is not necessary to follow the lead of Cowden
and Moonan and, therefore, use a random sample of items. It
is known that certain items of different difficulties will
give more information about an individual than other items,
and this information should be used; this means that one
does not wish to sample from the same universe of items each
time. While the aptitude or ability being tested must remain
unidimensional, there may be great advantage in allowing the
difficulty of items to change. The sequential model herein
described thus departs from the Wald sequential model in
that it uses different difficulty levels so that fewer items
are needed for the decision.
Fiske and Jones in an article intended to introduce
sequential analysis to psychologists, stated that the un-
critical use of sequential analysis obviously is not recom—
mended.63 It is a design which can have advantages when one
or more of the following conditions actually holds: (a) The
problem involves the choice between two possible parameter
values which can be specified on a priori but not arbitrary
63Donald W. Fiske and Lyle V. Jones, "Sequential Analy-
sis in Psychological Research," Psychological Bulletin, 51:
264-275, No. 3, May, 1954, pp. 273-274.
grounds--the null hypothesis will usually be one of the two;
(b) the data are such that the cost per datum is high and
economy is desired; and (c) the total amount of data is not
fixed.
Such criteria would lead one to believe that the sequen-
tial model developed by Wald may not be the appropriate model
for the test situation, as the total amount of data is fixed
and one cannot afford to have 1,000 items as indicated by
Cowden. It may be no more expensive to acquire the data
from all candidates than from a few, unless one wishes to
select only rather than classify. The decision to accept or
not accept--the selection question--seems to be the most ap-
propriate decision which can be answered by the sequential
method as described by Wald.
The literature also indicates methods of presenting the
material to the testee. Some of these are noted here. Glaser,
Damrin, and Gardner constructed a tab item test to aid in
training of electronics specialists.64 In this test, the
performance on one test yields information which supplies a
cue for the selection of the next test and subsequent proce-
dures. ‘0ne "tab item” test, for example, had the trainee
read a description of the malfunction of a television set and
then, rather than actually performing various checking
64Robert Glaser, Dora E. Damrin, and Floyd M. Gardner,
"The Tab Item: A Technique for the Measurement of Proficiency
in Diagnostic Problem Solving Tasks," Educational and Psycho-
logical Measurement, 14:283-93, No. 2, Summer, 1954.
procedures, the trainee pulled the tabs of those checks he
would make if he were actually trouble shooting a real tele-
vision set. Whenever he pulled a tab he uncovered the
information he would have obtained if he actually had per-
formed that check on a real set.
Another method of presentation was used by Krathwohl
and Paterson in preliminary studies of the sequential test
model. They had directions printed on the page, covered
these with a transparent hard finish ink so that directions
could not be erased, then covered this in turn with strips
of opaque ink. The testee erased the strip of opaque ink
under the letter he considered to be related to the correct
answers. (This is similar to an IBM answer sheet, but in-
stead of marking a spot, the testee erases a spot.) The appro-
priate directions were thus made available to the student.
Teaching machine presentations are also obvious methods
to present material to the testee. The material is similar
to that presented by teaching machines, but in the sequential
model being developed in this paper, the individual does not
obtain information about the correctness or the reason for
the correctness or incorrectness of the response. However,
the individual is told to take a more difficult item if he
correctly answered the preceding item, or a less difficult
item if he incorrectly answered the preceding item.
The literature suggests that if the decision is to
best classify the individual by a sequential procedure, the
present sequential model may be better than past models which
have been developed from different assumptions and for differ-
ent problems. The literature also suggests that traditional
scores represent more than one meaning.
The present sequential model has used reflection of
input in the output as the proper meaning for a score; the
cumulative test should not perform this function as well as
the sequential test. The decision as to how to measure the ef-
ficiency of these tests (and indirectly the items) was then
related to the reflection of input in the output. The two
factors considered in the output were (1) the means and
variances of ability levels assigned to a score (precision
of score) and (2) the means and variances of scores assigned
to an ability level category (discrimination of test).
It should be noted that the decisions as to the type
of score distribution desired and the meaning that should
be assigned to a score had to be made before one could deter-
mine the efficiency of the test (or items). The decisions
made in the present study were those decisions which it was
hoped would favor the sequential test procedure.
There should be maximally efficient use of items in the
sequential method as (1) there is a separation of individuals
into groups which have a base rate of 50 per cent for the
items used, and (2) the use of items at the 50 per cent level
of difficulty for the subgroups permits the use of more
difficult items and makes better separation of these individ-
uals (as the item is at the 50 per cent level of difficulty
for the subgroup).
CHAPTER III
PROCEDURES
There are six sections to this chapter. First, the
actual construction of the six-item cumulative and the
six-item sequential test model is considered. The second
section outlines the method of evaluating the hypotheses
stated in Chapter I which relate to the effect of input
distributions. The third and fourth sections show the
methods for testing the hypotheses about item precision and
difficulty, and effect of errors of estimating a parameter,
respectively--both for the sequential model. Fifth, some
general comparisons between test score distribution and
ability level distribution are examined. And finally, a
summary of procedures and hypotheses is presented.
I. TEST MODEL CONSTRUCTION
This section deals with the construction of six-item
sequential and cumulative test models. Later these test
models are used with different inputs of ability and the type
of score output is examined.
The test model for the sequential and cumulative tests
assumed that the probability of passing an item was dependent
upon three factors: (1) the ability level of the individual,
(2) the precision of the item, and (3) the difficulty level
of the item. The assumption was made that no one passed by
randomly guessing the correct answer to the item.
The ability level of the individual was specified in
terms of standard score units for a normalized distribution
of ability. The precision of the item was specified in
terms of either rbis or σd. These two terms are related by
the following formulas:1

σd = √(1 - rbis²) / rbis          (1)

or by algebraic manipulation,

rbis = 1 / √(1 + σd²)          (2)

As can be seen from the second formula, rbis is equal to one
if σd is equal to zero. The smaller the σd value the more
precise the item, and if σd were equal to zero, the individ-
uals who had ability levels above the difficulty level of the
item would pass the item, and vice versa.
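Formulas (1) and (2) are easily checked numerically; note that rbis = .75 yields the σd = .882 used for the model in this chapter. A minimal sketch:

```python
import math

def sigma_d_from_rbis(r_bis):
    # Formula (1): item imprecision from the item-total biserial.
    return math.sqrt(1.0 - r_bis ** 2) / r_bis

def rbis_from_sigma_d(sd):
    # Formula (2): the algebraic inverse.
    return 1.0 / math.sqrt(1.0 + sd ** 2)
```

The two functions invert one another, and sigma_d_from_rbis(0.75) reproduces the .882 figure to three decimals.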
The difficulty of the item was expressed in terms of
standard score units for a normal population. It should be
remembered that 80 or 90 per cent of a select group could
pass (or fail) an item of 50 per cent difficulty.
lFrederic M. Lord, "Some Perspectives on 'The Attenua—
tion Paradox in Test Theoryl,” Psychological Bulletin, 52:
505-10, No. 6, November, 1955, p. 506.
The probability of passing a single item for a given
small segment of ability was computed by determining the
area under the normal curve from -∞ to the value (a - d)/σd,
where "a" is equal to the ability level of the individual in
standard score or sigma units, "d" is equal to the difficulty
level in standard score or sigma units, and "σd" is the
measure of precision described above.
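Since the area below (a - d)/σd is the standard normal distribution function, the single-item pass probability can be sketched as:

```python
import math

def p_pass(a, d, sigma_d=0.882):
    # Normal-curve area from -infinity to (a - d)/sigma_d: the
    # probability that a person of ability a passes an item of
    # difficulty d (all in standard score units).
    z = (a - d) / sigma_d
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
```

A person exactly at the item's difficulty level passes half the time; the probability rises with ability and falls with difficulty.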
The probability of passing a sequence of items for
both the sequential and the cumulative was determined by
multiplying the probabilities of passing each item in that
sequence. This assumed that for that small segment of
ability (for which the probability of passing an item was
determined), performance on any one item was experimentally
independent of performance on any other item. Since the con-
cern was with classifying people by ability, it was assumed
that each of these items measured only one factor other than
the error factor, i.e., the test was unidimensional. The
error factor on any one item was assumed to be independent
of error on any other item.
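Multiplying along a pass-fail path gives that path's probability; with six items there are 2^6 = 64 paths, and their probabilities sum to one at every ability level. The sketch below uses one fixed list of difficulties; in the sequential model each path would carry its own difficulty sequence.

```python
import itertools
import math

def p_pass(a, d, sigma_d=0.882):
    z = (a - d) / sigma_d
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def path_probabilities(a, difficulties, sigma_d=0.882):
    # Probability of each pass/fail sequence for a fixed ability a,
    # assuming performance on any one item is independent of
    # performance on any other item (local independence).
    probs = {}
    for path in itertools.product((0, 1), repeat=len(difficulties)):
        p = 1.0
        for d, passed in zip(difficulties, path):
            q = p_pass(a, d, sigma_d)
            p *= q if passed else 1.0 - q
        probs[path] = p
    return probs

probs = path_probabilities(0.5, [0.0] * 6)
```

The 64 path probabilities summing to one is a useful check that the multiplication has been carried out correctly.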
Using the above scheme, one six-item sequential test
model was constructed for a hypothetical population of 1500
individuals with 100 people at each of 15 ability levels as
shown in Table 24. The item precision for all items in this
model was arbitrarily set at σd = .882. The appropriate dif-
ficulties were determined by the following procedure. First,
the number of people at each of the 15 ability levels who
would pass or fail an item was computed. The value of the
sum of the squared deviations from the mean for each of the
ability scores was computed for the pass and fail groups.
This value was computed and graphed for different trial
values of difficulty until the difficulty level was found for
which the sum of all sets of deviations of ability level
about the mean ability level of each group was a
minimum. Since ΣX² was a constant, the value for difficulty
level was calculated by maximizing Σ(ΣX)²/N. The difficulty
level of the item taken by each group was not always the same.
For example, in Figure 1, both the group who passed the first
item and failed the second item and the group who failed the
first item and passed the second item take the same item at
stage 3--a 0.00 item. If this had not been done, the six-
item sequential test would require 63 different items.
It was decided to use the same item for those groups
for which Σ(ΣX)²/N maximized at a difficulty level no more
than .20 standard score units away in difficulty from each
other. This allowed the test to be built with fewer items
and thus any test built to correspond to the model could use
only the most precise items in a pool of items. Also, this
. . . have answered correctly. Both of these raw score distri-
butions were converted to normalized "T" scores so that the
two score distributions might be compared on an equivalent
interval scale basis.
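The difficulty search described above (maximizing Σ(ΣX)²/N over the pass and fail groups, which is equivalent to minimizing the within-group squared deviations) can be sketched as a grid search. The ability levels and counts below are illustrative, not those of Table 24.

```python
import math

def p_pass(a, d, sigma_d=0.882):
    z = (a - d) / sigma_d
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def criterion(d, levels, counts, sigma_d=0.882):
    # Sum over the pass and fail groups of (sum of abilities)^2 / N.
    # Since the total sum of squares is fixed, maximizing this is the
    # same as minimizing the within-group squared deviations.
    total = 0.0
    for want_pass in (True, False):
        n = 0.0
        sx = 0.0
        for a, c in zip(levels, counts):
            p = p_pass(a, d, sigma_d)
            w = c * (p if want_pass else 1.0 - p)
            n += w
            sx += w * a
        total += sx * sx / n
    return total

levels = [(i - 8) * 0.25 for i in range(1, 16)]  # 15 symmetric ability levels
counts = [100] * 15                              # 100 people per level
best_d = max((k / 100.0 for k in range(-200, 201)),
             key=lambda d: criterion(d, levels, counts))
```

With a symmetric input the criterion peaks at the mean ability, so the search returns a first-item difficulty of approximately 0.00.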
II. EFFECT OF SHAPE OF DISTRIBUTION OF ABILITY
It was hypothesized that the sequential test model con-
structed as described above should work well for any type
of input distribution and thus be better than the six-item
cumulative test model. The six-item cumulative test con-
structed with all items at the 50 per cent level of difficulty
was not expected to be effective for those distributions
which had many high ability individuals. It was hypothesized
from the literature that these individuals would need more
difficult items to discriminate among them. To test this
hypothesis, different ability distributions were used as
input. The difficulty levels of the items used in the sequen-
tial test model were determined according to the method
described in the last section, and were for a precision level
of an rbis item-total correlation of .75 (or σd = .882). A
precision of .75 was used because differences between the
six-item cumulative and sequential models should be greatest
at high levels of precision--.75 would be considered very
high by the standards in use. Few tests have an average
item-total correlation of .75. A rectangular input distri-
bution . . . models--the sequential and the cumulative--were
each used with a normal and a U-shaped distribution of ability
to make the total of four tests. These four tests were con-
structed in an electronic computer. (For both the normal and
the U-shaped distributions the individuals were assumed to be
distributed over 15 input categories. Since the values used
in the computer program were proportions at each category,
any number of individuals may be assumed. The most common
assumption made in interpreting these data is that there were
1000 individuals distributed over these 15 input categories.)
The item difficulties for the sequential models were the
ones computed above. The item difficulties for the two cumu-
lative models were all at the 50 per cent level. It was thus
possible to compare not only the sequential with the cumulative
models, but also the effect of an input of normal and U-shaped
distributions.
Effect of Normal Distribution
The effect of an input of a normal distribution of
ability on the output distribution was examined in several
ways, but before the examination of hypotheses it was
necessary to place the few individuals of extreme ability
in the end categories. This was done because spreading
these individuals over the middle of the distribution would
have underrepresented the number of people likely to be at
extreme values in ability. Using the above procedure the end
categories extended from ±1.612 to ±1.736 sigma units. Since
there were so few to consider at the levels beyond ±1.736
sigma units, these individuals were all considered to be at
the mean ability level for all people beyond ±1.612; that
is, at 1.942 sigma units (see Figure 2).
To test the hypothesis that the cumulative and sequen-
tial test models have equal ability to classify individuals
of mean ability level, the means and variances of comparable
normalized scores from the six-item cumulative and "least
squares" sequential test models for those 100 individuals
assumed to be in category eight of ability (the middle cate-
gory) were tested for significance of difference. The means
were tested by use of a "t" test and the variances by use of
an F ratio.
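In modern terms, the two comparisons just described can be sketched as follows. This is an illustrative sketch, not the thesis's own computation; the function name and the sample scores are hypothetical.

```python
from statistics import mean, variance

def t_and_f(scores_a, scores_b):
    """Pooled-variance two-sample t statistic for the difference in
    means, and the F ratio (larger variance over smaller) for the
    difference in variances."""
    na, nb = len(scores_a), len(scores_b)
    va, vb = variance(scores_a), variance(scores_b)
    pooled = ((na - 1) * va + (nb - 1) * vb) / (na + nb - 2)
    t = (mean(scores_a) - mean(scores_b)) / (pooled * (1 / na + 1 / nb)) ** 0.5
    f = max(va, vb) / min(va, vb)
    return t, f
```

The t statistic is referred to Student's distribution with na + nb − 2 degrees of freedom, and the F ratio to the F distribution for the corresponding variance degrees of freedom.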
To test the hypothesis that the "least squares" sequen-
tial test model should more accurately classify the few
individuals at the extremes of the ability scale than the
six-item cumulative model, the means and variances of com-
parable normalized test scores for the 84 individuals in
ability categories 14 and 15 were tested.
To test the hypothesis that the cumulative test model
would less accurately classify the individuals at
the extremes than the sequential test model, the means and
variances of ability level scores for the individuals ranked
in the top 8.4 per cent of the score distribution for each
test model were tested. When it was necessary to take only
a proportion of a score group to complete the top 8.4 per
cent of scores, then the ability levels were proportionately
sampled. The value of 8.4 per cent was selected because
there were 84 individuals in the top two input ability
levels of the hypothetical population of 1000 individuals.
It was hypothesized that the six-item cumulative test
model would produce scores representing finer ability units
in the middle than at the extreme score values, while the
sequential test would more nearly reflect the ability scale.
The differences in mean ability values for adjacent scores
were hypothesized to be smaller in the middle and greater as
the extremes were approached. These differences in mean ability
values for the adjacent scores in one-half of the symmetrical
score distribution are shown in Tables 5 and 6. In addition
to this, the differences between mean normalized "T" scores
for each adjacent ability level for both the sequential and
cumulative tests are shown in Table 4.
Effect of U-shaped Distribution
The effect of the U-shaped distribution of ability was
studied by the same procedures used with the normal distribution
of ability. The distribution used in these tests is the
one shown in Figure 2. To determine if the "least squares"
sequential test would more accurately classify individuals
at the mean of the absolute ability levels than would the
six-item cumulative test, the means and variances of normal-
ized scores assigned to category thirteen were tested for
significance of difference between scores assigned by the
six-item sequential and the six-item cumulative models.
Category 13 was selected as it included the mean value of
ability for those individuals in the top half of the ability
distribution. To determine if the sequential test would more
accurately classify individuals at the extreme values of the
ability distribution than would the six-item cumulative test,
the means and variances of normalized scores assigned to
category 15 individuals were compared for the sequential and
cumulative test models.
To test the hypothesis that the cumulative test model
would less accurately classify the top-scoring individuals
than would the sequential test model, the individuals ranked
in the top 13.5 per cent of the score distribution were examined
for differences in means and variances of ability level. These
top-scoring individuals were proportionately selected as stated
for the normal distribution. The top 13.5 per cent of the
score distribution was used as there were 13.5 per cent of
the individuals in the top ability category.
To determine if the middle ability levels were more
finely classified by the six-item cumulative test,
the mean normalized "T" score for each ability level was
determined and shown in Table 24. The same was done for the
sequential test model. The hypothesis was that the sequential
model should have approximately equal distances between test
score means for each of the ability categories, while the
six-item cumulative model would have larger differences in
mean test scores for the middle ability levels than for
extreme values.
The differences in mean score values for adjacent
ability levels are shown in Table 4. The mean ability levels
for each score are likewise shown in Tables 25 and 26. It
was hypothesized from Lawley's work that the extreme scores
of the cumulative test should have lower variance of ability
level than the extreme scores for the sequential test.2
Since less variance of ability level means fewer lower ability
individuals, it was assumed the extreme cumulative test scores
would have higher mean values.
Effect of Ability Distributions for
Additional Sequential Tests
In addition to the four tests described above, three
other sequential tests were built with an electronic computer.
2D. N. Lawley, "On Problems Connected with Item Selec-
tion and Test Construction," Proceedings of the Royal Society
of Edinburgh, 61 (Section A, Part III): 273-287, 1942-1943.
However, in these tests the difficulties of the items were
not determined by a "least squares” procedure, but used
difficulties determined by an adaptation of Lord's work.3
The item difficulties used in these three tests were so
selected that, it was hypothesized, depending on the particu-
lar selection, a normal, rectangular, and a U-shaped distri—
bution of scores would be obtained. The number of individuals
assigned to each score and mean ability level of these individ-
uals are reported in Tables 18, 19, and 20.
It was assumed that a score from a test designed to
output a rectangular score distribution should correlate
highest with a rectangular input of ability. Scores with
normal distribution should likewise correlate highest with
the normal input of ability, and scores with U—shaped distri-
bution should correlate highest with U—shaped input of ability.
However, information was obtained as to the effect on both
output distribution and the correlation values of changing
the input distribution.
The rule stated by Lord was that if one wished to
divide the group at a given point, then the item difficulty
(expressed in standard score units) is represented by the
item-total rbis times the standard score unit which represents
the proportion below the point where the split is desired.
The procedure followed in constructing these three tests was
3Lord, "Some Perspectives on 'The Attenuation Paradox
in Test Theory'," op. cit.
that if there were four different difficulties used at a
given stage, then the abscissa should be divided into five
equal ability segments. The difficulties necessary to pro-
duce these proportions were then computed from Lord's formula.
One time the distribution of scores to be produced was con-
sidered normal; one time, rectangular; and one time, U—shaped.
Since different proportions were to be selected for each
distribution shape, different difficulties were needed for
each. The rule used to determine the number of different
difficulties at each stage was to add one more difficulty
at each stage. It turned out that this rule gave results
approximating the results from the determination of difficul-
ties by the rules developed in the past section on "Test
Model Construction.”
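Lord's rule, combined with the equal-segment division described above, can be sketched as follows. This is an illustrative sketch; the function name is hypothetical, while the rbis = .75 value and the five-segment split for four difficulties follow the text.

```python
from statistics import NormalDist

def lord_difficulty(p_below, r_bis):
    """Lord's rule: item difficulty in standard-score units equals
    r_bis times the normal deviate cutting off p_below of the group."""
    return r_bis * NormalDist().inv_cdf(p_below)

# Four difficulties at one stage: divide the ability scale into five
# equal segments, i.e. splits at p = 1/5, 2/5, 3/5, 4/5.
difficulties = [lord_difficulty(k / 5, 0.75) for k in (1, 2, 3, 4)]
```

The resulting difficulties are symmetric about the 50 per cent level (zero in standard-score units), and a lower rbis regresses them all toward the mean, as the text notes.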
Lord has shown how to select item difficulties to yield
a desired split of individuals by a cumulative test. These
Lord difficulties assume an input of a normal distribution
of ability; therefore, in the sequential test one should com-
pute difficulties with a normal distribution of ability for
each item of the test. This was not possible in the present
sequential model. The differences in the difficulty levels
of the items selected by Lord‘s technique and the above tech“
nique when an rbis = .75 is used are noted, but no study of
the effect at other values of rbis was made.
III. ITEM PRECISION AND DIFFICULTY FOR
THE SEQUENTIAL TEST
To determine the interrelationships among item precision,
difficulty level, and output characteristics, five tests con-
taining items of varying precision and difficulty were compared.
The five tests were built in the electronic computer and
varied in precision and difficulty of items used. The tests
were built using Lord's rule in the selection of difficulties
so that a normal distribution of scores should be obtained
when the distributions of ability were normal. The five
precision levels were for rbis equal to .79, .75, .71, .60,
and .45. (The .75 precision test was the same as the one
constructed above.) For an assumed N of 1000, the .79 and
.71 values are one standard error of an rbis above and below
.75. The .60 value was selected as it is a value common in
the literature; the .45 to show the effect of meeting low
precision standards. The .79 precision level is not consid-
ered unrealistic if the spread of ability level is great.
Precision was hypothesized to be one of the most important
parameters in the behavior of the sequential test model.
To examine the hypothesis that the more precise items
would produce a better separation of people, the variances
of scores for category eight ability level (the middle ability
level) individuals were compared for each of the five tests
by use of Bartlett's test for homogeneity of variance. This
test was repeated for the combination of categories 14 and
15 (the most extreme categories) for the five test models.
It was hypothesized that there would be a difference in the
variance of scores, with the more precise items producing the
scores with the smaller variances. Since a lower precision
of items means that the effective difficulty level regresses
toward the mean and, therefore, is closer to the 50 per cent
level, the middle difficulty items should increase the preci-
sion of scores at the extremes--although not the ability to
classify individuals. Thus, the extreme scores would have
small variance of ability levels for both precise and less
precise items and it was hypothesized that the variances of
ability level scores would be most different at the middle
score values.
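Bartlett's test statistic, used throughout these variance comparisons, can be computed as follows. This is an illustrative sketch; the statistic is referred to a chi-square distribution with k − 1 degrees of freedom under the null hypothesis of equal variances.

```python
from math import log
from statistics import variance

def bartlett_statistic(*samples):
    """Bartlett's statistic for homogeneity of variance across k
    samples: a corrected ratio comparing the pooled variance with
    the individual sample variances."""
    k = len(samples)
    n = [len(s) for s in samples]
    v = [variance(s) for s in samples]          # unbiased sample variances
    N = sum(n)
    pooled = sum((ni - 1) * vi for ni, vi in zip(n, v)) / (N - k)
    num = (N - k) * log(pooled) - sum((ni - 1) * log(vi)
                                      for ni, vi in zip(n, v))
    c = 1 + (sum(1 / (ni - 1) for ni in n) - 1 / (N - k)) / (3 * (k - 1))
    return num / c
```

When all sample variances are equal the statistic is zero, and it grows as the variances diverge.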
The second hypothesis stated that a test consisting
of more precise items would have the ability to discriminate
evenly over the entire range of ability rather than making
finer discriminations at the middle of the ability range.
This hypothesis was tested by examining differences in the
means of test scores for each category of ability. A table
was made of the means and variances of test scores for each
of the fifteen ability levels and for each of the five levels
of precision. The discrimination index for adjacent ability
levels was computed as suggested by Lord.4 The higher the
4Lord, A Theory of Test Scores, op. cit., p. 24.
index the better the discrimination; values may range from
zero to infinity. Lord's discrimination index was computed
as follows:
D = (Ms.c1 - Ms.c0) / σ*

Ms.c0 = mean of score values for ability level c0
Ms.c1 = mean of score values for ability level c1
σ* = some appropriate average of the standard
deviations of the two score distributions
Lord stated that this discrimination index is completely
independent of the distribution of ability in the group tested:5
This is an advantage when a general description of the
test is desired without reference to any particular
group of examinees; it is a disadvantage if the effective
discrimination of the test for a specified group of
examinees is desired.
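Given these definitions, the index can be computed as follows. This is a sketch in which the root-mean-square of the two standard deviations stands in for the "appropriate average" (one of several defensible choices); the score groups are hypothetical.

```python
from statistics import mean, stdev

def lord_discrimination(scores_c0, scores_c1):
    """Lord's discrimination index between two adjacent ability
    levels: difference of mean scores divided by an average of the
    two score standard deviations (root-mean-square used here)."""
    m0, m1 = mean(scores_c0), mean(scores_c1)
    s0, s1 = stdev(scores_c0), stdev(scores_c1)
    sigma_star = ((s0 ** 2 + s1 ** 2) / 2) ** 0.5
    return (m1 - m0) / sigma_star
```

A larger index means the two ability levels receive more clearly separated score distributions; identical distributions give zero.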
IV. ERRORS IN SEQUENTIAL TEST
PARAMETER ESTIMATES
The procedures used to determine the effects of errors
in estimating the parameters of precision and difficulty for
the sequential test items are related to the nature of the
error involved. The difficulty of an item is usually
specified in terms of the proportion of the group passing
the item. This test model, however, uses difficulty specified
in standard scores, so the standard error of a proportion
must be translated into standard score terms. The standard
5Ibid.
error of a proportion (√(PQ/N)) is greatest when P = Q =
.50. Thus the greatest error in estimating difficulty in
terms of proportion passing an item would occur at the 50
per cent level of difficulty. The value of √(PQ/N) is
smallest at the extreme values of P or Q. The error in
terms of proportion passing an item was thus investigated
at .50 and .90. These errors were then translated into
standard score units. The values of √(PQ/N) (when N = 1000)
were .016 and .010 for .50 and .90, respectively. When the
values necessary to encompass two standard errors of the
proportion were translated to standard score form the values
were quite similar and equal to about ±.10. The error for
estimating difficulty was thus assumed to be less than or
equal to ±.10 no matter what the difficulty level of the
item.
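The translation of proportion errors into standard-score units can be sketched as follows. The function names are illustrative; the ±.10 figure in the text corresponds roughly to the half-widths computed here.

```python
from statistics import NormalDist

def se_proportion(p, n):
    """Standard error of a proportion: sqrt(PQ/N)."""
    return (p * (1 - p) / n) ** 0.5

def standard_score_half_width(p, n, k=2):
    """Half-width, in standard-score units, of the interval spanned
    by p plus-or-minus k standard errors of the proportion."""
    nd = NormalDist()
    se = se_proportion(p, n)
    return (nd.inv_cdf(p + k * se) - nd.inv_cdf(p - k * se)) / 2
```

For N = 1000, the standard errors at P = .50 and P = .90 come out near .016 and .010, and the two-standard-error bands translate to roughly a tenth of a sigma unit in each case.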
The error made in precision depends upon the estimate
of rbis, which has a sampling error as follows:6

σrbis = (√(PQ)/z - r²bis) / √N

Terms as defined before.
Thus for rbis equal to .75 (which was the only precision
level for which the error was studied), and assuming P =
Q = .50, and N = 1000; then σrbis = .02. Since the error
in rbis is not likely to be greater than ± .04, then rbis
6Quinn McNemar, Psychological Statistics (second edi-
tion; New York: John Wiley and Sons, 1955), p. 194.
of .75 is not likely to be outside of the interval of .71 to
.79. The σd value for .71 is .99 and the σd value for .79
is .78. Thus the error in terms of σd is not likely to be
greater than ± .10. These estimates were the values used
to determine the effect of parameter estimation on output.
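The sampling-error computation can be sketched as follows, taking z in McNemar's formula as the normal ordinate at the point of division. This is an illustrative sketch; the function name is hypothetical.

```python
from math import exp, pi, sqrt
from statistics import NormalDist

def sigma_rbis(r_bis, p, n):
    """Approximate sampling error of the biserial correlation:
    (sqrt(PQ)/z - r_bis**2) / sqrt(N), where z is the standard
    normal ordinate at the point dividing the group into P and Q."""
    x = NormalDist().inv_cdf(p)          # split point in standard scores
    z = exp(-x * x / 2) / sqrt(2 * pi)   # normal ordinate at the split
    return (sqrt(p * (1 - p)) / z - r_bis ** 2) / sqrt(n)
```

With P = Q = .50 and N = 1000, rbis = .75 gives a sampling error of about .02, matching the value used in the text.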
The testing of the first hypothesis as to the effect
of errors of item difficulty was done with a normal distri-
bution of ability; test items designed for rbis equal to
.75; and by the least squares of deviations method described
in "Test Model Construction." It was hypothesized that if
one were to use at the second stage an item which was .40
more sigma units away from the mean than the items selected
as above, then more people should be directed toward mean
scores than if the ideal difficulty were used. This would
imply fewer people at the extreme values than usual if the
rest of the test did not correct this trend. It was hypothe—
sized that the opposite should happen if the item were .40
sigma units toward the mean at the second stage. These
changes were tested by use of the chi-square technique. If
a difference of .40 did not make any difference it would
seem obvious that errors of estimate (about .10) would not
make any difference.
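The chi-square technique referred to here can be sketched as follows; the score-category counts are hypothetical.

```python
def chi_square(observed, expected):
    """Chi-square statistic comparing observed score-category counts
    against the counts expected under the "error free" test."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# hypothetical counts over five score categories: the "error" test
# directing more people toward the middle scores
observed = [60, 180, 520, 180, 60]
expected = [80, 170, 500, 170, 80]
stat = chi_square(observed, expected)
```

A statistic near zero indicates the shifted difficulty made no detectable difference in how people were distributed over the score categories.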
The effects of errors of estimate at the fifth stage
were determined when the item difficulties were shifted .40
sigma units away from the mean in one problem and .40 sigma
units toward the mean in another problem. As the hypotheses on the effects
of error at the second and fifth stages derive from the same
rationale, and as the effects of the fifth stage were expected
to be in the same direction as the second stage effects only
larger, the hypotheses on the second stage errors require
only analysis of direction of change (i.e. chi-square) while
the hypotheses on the fifth stage errors require more exten—
sive variance analysis.
It was hypothesized that the variance of ability level
for the top 84 individuals would be greater for tests with
the shifted difficulties than for the test where the items
were at the ideal difficulty level. The significance of dif~
ferences in variances was tested by use of Bartlett's test
for homogeneity of variance.
The discrimination of the tests for an ability level
was determined by examining, for these same tests, the
variance of test scores for the category fifteen ability
level individuals. It was hypothesized that the variance of
scores for category 15 individuals would be highest when dif-
ficulties were closest to the mean value. Variances for the
three tests were compared by use of Bartlett's test for homo-
geneity of variance.
For the test with difficulties at the fifth stage dis-
placed away from the mean by .40 sigma units, it was hypothe-
sized from Lawley's work that the variance of ability level
for the 100 middle-scoring individuals would be lower than
the variance of these individuals on the other tests. Again
Bartlett‘s test for homogeneity of variance was used as the
test.
Ability level discrimination was similarly determined
by examination of the variance of test scores for category
eight of ability. It was hypothesized that the original
test (with ideal difficulties) would have better discrimin~
ation than the modified tests. Again Bartlett's test was
used to compare variances.
The third hypothesis--that errors in estimating the
precision of the items would be more serious in the initial
stages than at later stages--was tested by placing items
of rbis = .71 (instead of rbis = .75) at the second stage.
Since subsequent items were designed with the assumption
that the second item had rbis = .75, the spread of ability
should be greater than ideal for discriminating among individ—
uals arriving at subsequent items. These subsequent items
are more difficult than ideal and this increased difficulty
should thus force the individuals toward the center of the
distribution. The greatest increase in variance of test
scores should thus be noticed for high and low ability groups;
middle ability groups should not change in variance of test
scores produced. The variances of scores for extreme and
middle ability levels were compared by use of the F ratio.
Also, the variances of ability level scores for individuals
ranked in top 8.4 per cent of the score distribution were
tested by the F ratio.
The fourth hypothesis, that errors in the estimate of
precision should make little difference at the fifth stage,
was examined by placing items of rbis equal to .71 at the
fifth stage. The difficulty of the items remained the same.
The effect of this should be that again the item would be
more difficult than the Lord formula would suggest as ideal,
because difficulty should be regressed toward the mean de-
pending upon the rbis value. The lower the rbis the more
the ideal difficulties should be regressed toward the mean.
The results should be that more individuals than ideal would
take an easier sixth item which, according to Lawley, should
increase the precision of high ability scores. It was also
hypothesized that this change in fifth item precision would
increase the variance of score levels for high ability individ~
uals. These results were hypothesized to be in the same
direction as results from changes at the second stage, and
the F ratio was likewise used to test these hypotheses.
V. GENERAL COMPARISONS
A general comparison of the relationship between input
distribution and output distribution of scores was felt to
be of value even though no specific hypotheses were advanced
due to the number of variables involved. The difficulty of
the items, the precision of the items and the pattern of items
taken by individuals of different ability levels all interact
to affect the score distribution.
The effect of difficulty of items was noted for the
nine tests described in ”Effect of Ability Distributions for
Additional Sequential Tests." As the difficulties of items in
each test do not regress toward the mean at the same rate,
no clear conclusion can be made as to the effect of dif-
ficulty on output characteristics.
The effect of difficulty can thus be determined only
for certain ability levels. (The data for the distributions
of only one-half of the scores were presented as the other
half was symmetrical.)
In addition to the distribution of scores, the cor-
relation ratios were reported as these give information as
to the general relationship between the input distribution
of ability and the output distribution of scores. In former
unpublished trials of the sequential test the value of the
Pearson Product—Moment r was made to closely approach that
for eta, by assigning the scores to the 64 different sequences
of items from the rank of the mean ability level of the
individuals at the score. (Another alternative would have
been to assign scores according to rank of the sequence if
ideal items had been used in the test model.)
The best general comparison of output to input in regard
to precision of item came from the five sequential tests
described in ”Item Precision and Difficulty for the Sequential
Test," where item difficulties and type of distribution
remained constant over all five sequential tests. The general
comparisons were made in terms of correlation ratios; the
data were reported for one-half of the output distribution
of scores for the five tests.
A comparison of output to input in regard to the
pattern of items taken by an individual came from using the
Lord difficulties which yielded a rectangular output of
scores when a rectangular distribution of ability was input.
The rectangular distribution was used because this best ap—
proximated the "least squares” solution. Two new test models
were constructed: each had exactly the same items with same
difficulties and same precision (rbis = .75); one test had
items distributed as in Figure 1, and the other test had
items distributed by one item at the first stage, two items
at the second, three at the third, and continued until it
had six items at stage six. Only the pattern of items taken
by the individuals was different in the two tests. Again eta
and the distribution of one-half of the output distribution
of scores for each of the two test models were reported.
VI. SUMMARY OF PROCEDURES AND HYPOTHESES
One sequential test model was constructed by the "least
squares" (of the deviations from the mean ability level) rule
for a rectangular distribution of ability over 15 ability
categories and rbis equal to .75 for item precision. (Ability
level one represented lowest ability level and ability level
15 represented highest ability level.)
The above test was then used with an input of normal
and U-shaped distributions of ability. A six-item cumula-
tive test with all items at the 50 per cent level of dif-
ficulty and a precision level of the item-total rbis equal
to .75 was likewise used with normal and U-shaped distri-
butions of ability. The output distributions for comparable
tests were then examined.
The null statistical hypotheses concerning the effect
of the normal ability distribution on output of scores
stated that the cumulative and sequential test models should
have the following: (The alternative hypothesis expected
from the rationale is given in parentheses.)
(1) equal means for the comparable normalized scores for
category eight individuals (no alternate, hope to accept
null);
(2) equal variances for the comparable normalized scores
for category eight individuals (hope to accept null;
cumulative may be smaller);
(3) equal means for the comparable normalized scores for
combined category 14 and 15 individuals (cumulative
lower);
(4) equal variances for the comparable normalized scores
for combined category 14 and 15 individuals (sequential
smaller);
(5) equal means for the ability level scores for the
individuals ranked in the top 8.4 per cent of the
score distribution (cumulative lower); and
(6) equal variancesfor the ability level scores for the
individuals ranked in the top 8.4 per cent of the score
distribution (sequential smaller).
The null statistical hypotheses concerning the effect
of the U—shaped ability distribution on output stated that
the cumulative and sequential test models should have the
following:
(1) equal means for the comparable normalized scores for
category 13 individuals (cumulative lower);
(2) equal variances for the comparable normalized scores
for category 13 individuals (sequential smaller);
(3) equal means for the comparable normalized scores for
category 15 individuals (cumulative lower);
(4) equal variances for the comparable normalized scores
for category 15 individuals (sequential smaller);
(5) equal means for the ability level scores for the individ-
uals ranked in the top 13.5 per cent of the score dis~
tribution (cumulative lower); and
(6) equal variances for the ability level scores for the
individuals ranked in the top 13.5 per cent of the
score distribution (sequential smaller).
In addition to the hypotheses listed above, mean score
values for each ability level, and mean ability level for
each score value were plotted for both the normal and U-
shaped distributions of ability. Additional information as
to effect of distribution of input on output is presented as
part of the general comparisons.
Three tests were constructed by Lord's rules and each
of these was used with normal, rectangular, and U-shaped
distributions of ability, although each test was designed
to reflect only one of the input distributions. Eta was
used to compare the input distribution with output distri-
bution for these nine tests. In addition, the actual output
distribution of each of the nine tests was tabled. These
tests were built for information, and no hypotheses were
made as to results.
To determine the effect of item precision on the output
of the sequential test, four test models were constructed
with an input of a normal distribution of ability and item
precision taking the values of rbis equal to .79, .71, .60,
and .45. Item difficulties were those determined by Lord's
procedure to be most appropriate for a given precision level
when assuming a normal distribution of scores desired. The
variances of ability levels for extreme and middle scores,
and the variances of scores for extreme and middle ability
levels were examined by use of Bartlett‘s test.
The null statistical hypotheses (and expected alterna—
tives) concerning the effect of item precision and dif-
ficulty stated that tests which use a normal distribution
of ability for input and a nearly normal output of scores
should yield the following: (The alternative hypothesis
is given in parentheses:)
(1) equal variances of scores for category eight ability
level individuals for all five tests of different
precision levels (most precise test smallest);
(2) equal variances of scores for category 14 and 15
ability level individuals for all five tests of dif-
ferent precision levels (most precise test smallest);
(3) equal variances of ability level scores for the individ—
uals ranked in the top 8.4 per cent by each of the five
tests of different precision levels (most precise test
smallest); and
(4) equal variances of ability level scores for the
individuals ranked in the middle 10 per cent by each
of the five tests of different precision levels (most
precise test smallest).
In addition to these hypotheses, the means and variances
of the test scores, and the discrimination indices between
each of the adjacent ability levels were computed for each
of the five different precision tests.
To determine the effect of errors of using other than
the difficulty level computed by ”least squares" method for
certain items, four sequential tests were constructed.
One had the second item shifted away from the sample mean
in difficulty; another had the second item toward the mean
value. The fifth item encountered by the individual was
likewise displaced toward or away from the mean difficulty
value in the third and fourth test models, respectively.
Again the characteristics of the "error" and "error free"
output distributions were examined.
The null statistical hypotheses that were tested con-
cerning the effect of errors in estimating the difficulty
of the item at the second stage are as follows: (These
hypotheses were used to determine if differences were in
direction hypothesized.)
(1) the number of people in each of a set of score categories
would be independent of whether distributed by an
error free difficulty test or one in which difficulties
at the second stage were away from the mean (50 per cent)
difficulty (more people at middle for "error" test);
(2) the number of people in each of a set of score categories
would be independent of whether the people were distri-
buted by an "error free" difficulty test or one in which
difficulties at the second stage were toward the mean
(50 per cent) level of difficulty (more people at
extreme for ”error" test).
The null statistical hypotheses that were tested con-
cerning the effect of errors in estimating the difficulty of
the item at the fifth stage predicted the following: (These
hypotheses were deduced from same rationale as ones above,
and data were examined more closely as it was hypothesized
that those differences would be in same direction as dif-
ferences above and of a larger magnitude.)
(1) equal variances for the ability level scores for the
individuals ranked in the top 8.4 per cent of the
score distribution (”error free” test smallest);
(2) equal variances of test scores for individuals in
ability category 15 (test with items near 50 per cent
largest);
(3) equal variances of ability level scores for the
individuals ranked in the middle 10 per cent of the
score distribution (test with items away from mean
smallest); and
(4) equal variances of test scores for individuals in
ability category 8 ("error free" test smallest).
The effect of error in estimating the precision of
items was examined by constructing two additional ”least
squares" test models. One test had less precise items for
the second item encountered; the other had less precise
items substituted for the fifth item encountered. Again the
distributions of scores for the "error" and "error free"
tests were examined.
The null statistical hypotheses concerning the effect
of error in estimating the precision of items at the second
stage predicted the following:
(1) equal variances of test scores for individuals in
ability category 15 ("error free” test smaller);
(2) equal variances of test scores for individuals in
ability category 8 (”error free” test smaller); and
(3) equal variances of ability level scores for individ—
uals ranked in top 8.4 per cent of the score distri-
bution ("error free" test smaller).
The null statistical hypotheses concerning the effect
of error in estimating the precision of items at the fifth
stage predicted the following:
(1) equal variances of test scores for individuals in
ability category 15 (”error free" test smaller); and
(2) equal variances of ability level scores for individuals
ranked in top 8.4 per cent of the score distribution
("error free" test smaller).
The general comparison examined the effect of difficulty
on score output, the effect of precision of items, and the
effect of the pattern of items. Difficulty effects were
examined for normal, rectangular, and U—shaped inputs on
tests with item precision of rbis equal to .75 and item dif-
ficulties as listed in Table 20 of the Appendix. (The rule
for selection of difficulties of items is that one should
use an item not at the difficulty level equal to ability
level where split between groups is desired, but difficulty
level should be regressed toward the mean value of 50 per
cent. The lower the rbis the greater should be the regres-
sion.) The distributions and mean ability level scores for
each score were tabled.
Distributions and mean scores were also tabled for five
tests with different item precision and for two tests with
different patterns of items. In addition to these tables eta
between input and output scores was reported for each of these
tests.
CHAPTER IV
ANALYSES AND RESULTS
There are six sections to this chapter. Section one
gives the results of building the six-item sequential test
model. Section two reports the effects of the input distribu-
tion on the score distribution of both the sequential and
the six-item cumulative test models. Section three presents
the effects of item precision and difficulty on the score
distribution of the six-item sequential test model. Sec-
tion four gives the effects of errors of estimating preci-
sion and difficulty parameters on the score distribution of
the sequential test. Section five gives some general results
of changes in difficulty of items, precision of items, and
pattern of items. Section six is a summary of the analyses
and results. In all sections results are simply reported;
interpretation is reserved for Chapter V.
I. SEQUENTIAL TEST CONSTRUCTION
As stated in Chapter III, the sequential test model
was constructed so that Σ(ΣX)²/N was maximized; graphic
methods were used to aid in determination of maximum values.
(ΣX refers to the sum of ability level scores for any one
group. Σ(ΣX)²/N refers to squaring the sum of scores for the
group, dividing by the number in the group, and then summing
over the two or more groups that used the particular item.)
The only restriction was that any item difficulty had to be
more than .20 standard score units away from other difficul-
ties to be considered different from them, and thus to be
used. (The reader will be aided in following the item deci-
sions given below by referring to Figure 3.)
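The construction procedure amounts to a discrete optimization, and its logic can be sketched in code. This is a simplified illustration, not the thesis's actual computation: the split below is a deterministic threshold on ability, whereas the model splits each group probabilistically through the item's precision, and the ability values are hypothetical.

```python
# Sketch of the construction criterion: for a candidate item difficulty,
# each group is split into pass/fail subgroups, and sum((sum X)^2 / N)
# is evaluated over all resulting subgroups.

def criterion(groups):
    """Sum of (sum of ability scores)^2 / N over the groups."""
    return sum(sum(g) ** 2 / len(g) for g in groups if g)

def split_by_difficulty(group, difficulty):
    """Deterministic toy split: pass if ability exceeds the difficulty.
    (The thesis model splits probabilistically via item precision.)"""
    passed = [x for x in group if x > difficulty]
    failed = [x for x in group if x <= difficulty]
    return passed, failed

def best_difficulty(groups, candidates, min_sep=0.20, used=()):
    """Scan candidate difficulties; keep the one maximizing the criterion,
    skipping any within .20 standard-score units of a difficulty in use."""
    best = None
    for d in candidates:
        if any(abs(d - u) < min_sep for u in used):
            continue
        new_groups = []
        for g in groups:
            p, f = split_by_difficulty(g, d)
            new_groups.extend([p, f])
        value = criterion(new_groups)
        if best is None or value > best[1]:
            best = (d, value)
    return best

# Hypothetical five-person group with standard-score abilities:
abilities = [-1.0, -0.5, 0.0, 0.5, 1.5]
print(best_difficulty([abilities], [-0.25, 0.0, 0.25]))  # → (0.0, 2.75)
```

The `min_sep` argument mirrors the restriction that item difficulties must differ by more than .20 standard-score units to be treated as distinct.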
First Item Decision
The values of Σ(ΣX)²/N for +.01, .00, and -.01
difficulty items were as follows: 109073.85, 109073.86, and
109073.85. The maximum value was thus obtained from a .00
difficulty level item, and this item fulfilled the criterion
of selection. Thus, of the 1500 people taking the hypo-
thetical test, 750 would pass and 750 would fail this item.
The mean ability level of these groups was ±.73.
Second Item Decision
The second item produced four groups over which Σ(ΣX)²/N
was maximized. The three strategic values for this item
were ±.23, ±.24, and ±.25, which had values for Σ(ΣX)²/N
of 113796.15, 113796.21, and 113796.00. (Strategic values
were determined by estimating values and plotting these
values of Σ(ΣX)²/N until the maximum value was straddled by
three points that could be read from the graph.) The ±.24
items were selected for the second stage. The resulting four
Fig. 3.--Mean Ability Level of Groups Separated by
Sequential Test and Difficulties of Items Used
groups had mean ability levels of +1.04, +.10, -.10, and
-1.04. At this point 504 individuals had passed both the
first and second items; 246 had passed the first and failed
the second; and like numbers had failed both, and the first
only.
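The branching bookkeeping behind these group counts can be sketched in code. The pass rates below are taken from the counts just reported (750 of 1500, then 504 of 750), not recomputed from the probability model, so this illustrates only the mechanics of tracking groups by response pattern.

```python
# Bookkeeping sketch for the branching group sizes; rates come from the
# counts reported in the text, not from the underlying probability model.

def branch(groups, pass_rates):
    """Split each (path, count) group by its pass rate into P and Q paths."""
    out = []
    for (path, n) in groups:
        p = round(n * pass_rates[path])
        out.append((path + "P", p))
        out.append((path + "Q", n - p))
    return out

stage0 = [("", 1500)]
stage1 = branch(stage0, {"": 0.50})             # item of .00 difficulty
stage2 = branch(stage1, {"P": 504 / 750,        # rates implied by the
                         "Q": 246 / 750})       # reported counts
print(stage2)  # → [('PP', 504), ('PQ', 246), ('QP', 246), ('QQ', 504)]
```

The four counts match the text: 504 passed both items, 246 passed the first and failed the second, and like numbers failed both or failed the first only.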
Third Item Decision
The third stage items were reduced to three in number
as the two middle groups were both given the same difficulty.
Both of these middle groups took the same difficulty because
each has the sum of (LX)2/N of 3287855 for i.09 items and
32878.02 for i,lO items. As :; (2QX)%/N maximized at less
than .10, the ideal difficulty levels would be less than .20
sigma units apart. As this would vio‘ate a condition of the
test construction, the two middle groups were given the same
item which yielded a 2 (2.x)2/N of 32885.60.
The two extremes ability groups produced 2.(£X)2/N
equal to 83416.29, 83417.00, and 83416.84 for .48, .49, and
.50 difficulty items, respectively. Thus the three difficulty
levels used at the third stage are +.49, .00, and -.49. The
mean ability levels of the eight resulting groups were from
highest to lowest 1.21, .62, .48, .33, -.33, -.48, -.62, and
-1.21.
Fourth Item Decision
At this stage there were eight groups taking four dif-
ferent difficulty items (±.73 and ±.40) and resulting in
sixteen groups. Those individuals who had passed (or failed)
the first three items had Σ(ΣX)²/N equal to 62860.59,
62860.69, and 62860.50 for ±.72, ±.73, and ±.74 items respec-
tively. The ±.73 item difficulty was selected. The second
group (PPQ or QQP) had maximum values between ±.45 and ±.50,
which were more than .20 standard score units away from ±.73.
However, the third group (PQP or QPQ) had a (ΣX)²/N that
maximized above .32. The similarity of the groups is shown
in that while a .32 maximum is 18242.13, the .41 maximum is
18243.05. Since such values would give items less than .20
standard deviation units away, the second and third groups
were each given the same difficulty. The remaining group
(QPP or PQQ) maximized between .29 and .35 for Σ(ΣX)²/N
of 15387.12 and 15386.92, respectively. Since the best dif-
ficulty level for the previous two groups would be less than
.20 standard deviation units away, all three groups were
given an item of the same difficulty level. The strategic
values for difficulty of item assigned to the three groups
were ±.39, ±.40, and ±.41, which had Σ(ΣX)²/N values of
54983.63, 54983.81, and 54983.57. Thus the ±.40 item dif-
ficulties were used. Of the eight groups at this stage, one
group took +.73, three took +.40, three took -.40, and one took
an item of -.73 difficulty level.
Fifth Item Decision
The fifth stage decisions resulted in sixteen groups
taking six items of different difficulty, thus producing thirty-
two new groups. The groups that took the different difficulty
levels were as follows: The PPPP and QQQQ groups took items
of +.87 and -.87 difficulty. The PPPQ, PPQP, PQPP, and QPPP
groups took an item of +.66 level of difficulty. (The QQQP,
QQPQ, etc. opposites of the above took an item at -.66.) The
PPQQ, PQPQ, and QPPQ groups each took an item of +.15 dif-
ficulty level. (Opposite groups took the -.15 difficulty item.)
In other words, for the eight groups above the mean, one group
took an item of .87 difficulty, four groups took an item of
.66, and three groups took an item of .15 level of difficulty.
The PPPP group (and its opposite) had Σ(ΣX)²/N values of
45791.18, 45791.20, and 45791.18 for .86, .87, and .88 levels
of difficulty. The PPPQ group maximized the Σ(ΣX)²/N
just above the .71 difficulty level, thus the decision had
to be made to give this group either the same difficulty
item as the PPPP group or the difficulty of the PPQP group.
The PPQP group maximized between .67 and .71, with Σ(ΣX)²/N
values of 13411.14 and 13411.11, respectively. These two
groups were thus given the same difficulty level, as their
curves remained fairly near maximum for the difficulty
level common to both. The PQPP group maximized Σ(ΣX)²/N
between .60 and .65 with values of 10395.92 and 10395.98.
The QPPP group maximized at about .60 with Σ(ΣX)²/N
of 7826.26. Since none of these was .20 standard score
units apart in difficulty, the one difficulty value that
would maximize Σ(ΣX)²/N for all four groups was determined.
The difficulties of .65, .66, and .67 had Σ(ΣX)²/N values
for the eight groups of 49000.44, 49000.65, and 49000.60.
The item of ±.66 difficulty level was thus used for these
eight groups.
The PPQQ group maximized Σ(ΣX)²/N values between
.20 and .28--8208.84 and 8298.88, respectively. This was
more than .20 standard deviation units from .66, so this
group was not given the item of .66 difficulty level. The
PQPQ group maximized Σ(ΣX)²/N at .15 with 8096.18. (Dif-
ficulty levels .10 and .14 had Σ(ΣX)²/N values of 8096.15
and 8096.17, respectively.) The QPPQ group maximized
between .00 and .10 difficulty levels. This was not .20 stan-
dard score units of difficulty away, so the one difficulty
level that would maximize the sum of (ΣX)²/N for these six
groups was determined. The strategic difficulty levels of
.14, .15, and .16 had Σ(ΣX)²/N values for the six groups of
24092.29, 24092.36, and 24092.30. Thus the ±.15 item dif-
ficulties were used.
Sixth Item Decision
The sixth stage had 32 groups taking items at five dif-
ferent difficulty levels (±.87, ±.49, and .00). The PPPPP
group had maximized Σ(ΣX)²/N between .90 and 1.00 dif-
ficulty; the respective Σ(ΣX)²/N values are 32852.58 and
32852.64. The group PPPPQ had Σ(ΣX)²/N values of 13051.00,
13051.01, and 13051.00 at .86, .87, and .88, respectively.
Thus it was clear that these two would not use different
difficulty of item, and neither would any group that maximized
above .75. The other groups which maximized above .75 were
as follows: the PPPQP group, which for .85, .86, .87, and
.88 had Σ(ΣX)²/N values of 11334.16, 11334.17, 11334.17,
and 11334.16, respectively; the PPQPP group, which for .85,
.86, .87, and .88 had 8373.86, 8373.86, 8373.86, and 8373.84,
respectively; the PQPPP group, which maximized Σ(ΣX)²/N
between .80 and .85 with values of 6059.83 and 6059.79,
respectively; and the QPPPP group, which maximized between
.74 and .80, both with a Σ(ΣX)²/N value of 4227.53.
The Σ(ΣX)²/N values for the 12 groups using the same dif-
ficulty level of item were 75898.65, 75898.66, and 75898.60
for the .86, .87, and .88 levels of difficulty, respectively.
The decision was thus to use a .87 difficulty item for
these groups.
The PPPQQ group (the next highest ability level group)
maximized between .55 and .65 with Σ(ΣX)²/N values of
6150.61 and 6150.64, respectively. (The approximate value
for the maximum was determined by plotting the curve from
six points.) Since the group maximized more than .20 standard
deviation units away from the .87 groups and also maximized
within five points of the next lowest group, the decision
was made to use a new difficulty for all remaining groups
that maximized above .40. The remaining groups which maximized
at difficulty levels greater than .40 (but below .60) were as
follows: The PPQPQ group, which for difficulty levels of .50,
.55, and .65, had Σ(ΣX)²/N values of 5135.92, 5136.00, and
5135.88, respectively; the PQPPQ group, which for difficulty
levels of .43, .48, .49, and .50 had Σ(ΣX)²/N values of 4422.34,
4422.41, 4422.41, and 4422.40, respectively; the PPQQP group,
which for difficulty levels of .43, .48, .49, and .50 had
Σ(ΣX)²/N values of 4680.27, 4680.33, 4680.33, and 4680.32,
respectively; the PQPQP group, which for difficulty levels of
.32, .43, and .48 had Σ(ΣX)²/N values of 4198.70, 4198.83,
and 4198.77, respectively; and the QPPPQ group, which for
.32, .43, and .48 had Σ(ΣX)²/N values of 3670.06, 3670.19,
and 3670.14, respectively.
The QPPQP group maximized Σ(ΣX)²/N between .32 and
.43. Difficulty levels of .29, .32, and .43 had values of
3628.46, 3628.48, and 3628.35, respectively. A decision
thus was whether to include this group with the higher or
lower groups. The PQQPP group (next in line) for difficulty
levels of .00 and .09 had Σ(ΣX)²/N values of 4244.60 and
4243.82 and maximized below .09. For this reason the QPPQP
group was included with the higher group instead of .20 units
lower in difficulty, which would have yielded a lower Σ(ΣX)²/N
value.
The sum of Σ(ΣX)²/N for the 14 groups for difficulty
levels of .48, .49, and .50 were 31886.02, 31886.04, and
31885.95. Thus a difficulty of ±.49 was used with each of
these groups.
The remaining six groups all maximized between ±.09;
thus the .00 item was used here. The QPPQQ group for difficulty
levels of .00 and .09 had Σ(ΣX)²/N values of 4244.60 and
4243.82, respectively. The PQPQQ group had 3983.96 and
3983.54 for these same values, and the PPQQQ group for dif-
ficulty levels of .00 and .09 had Σ(ΣX)²/N values of
3615.52 and 3615.32, respectively.
Thus, of the 16 groups above the mean, six groups took
the .87 difficulty item, seven groups took the .49 difficulty
item, and three groups took the .00 difficulty item at the
final stage.
The above sequential test was compared with the cumu-
lative test to determine how well the score differentiated
individuals of different ability levels and to determine the
range of ability levels assigned to any one score.
The above sequential test was also used in the deter-
mination of the effects of errors in estimating the parameter
values for the items in this test. Parameter values consid-
ered were difficulty and precision.
The pattern of items determined above was also used
with different, easily computed difficulties, to determine
how a test with an arbitrary pattern and easily computed
difficulties compared with the test determined above.
II. INPUT DISTRIBUTION EFFECTS
Normal and U-shaped distributions were each used
with the cumulative and the "least squares" sequential test.
The results from the two distributions are presented separately.
Results from the Normal Distribution
The first null hypothesis was that there should be
equal means of the comparable normalized scores for the
middle category, category eight, individuals taking the se-
quential and cumulative test both with a normal distribution
of ability input. Results are shown in Table 1. As can be
seen from the table, the null hypothesis tested by a "t"
test must be accepted. This was expected as both have a
symmetrical distribution of scores. This hypothesis was
included as a parallel hypothesis to hypothesis one on U-
shaped distribution (and as a check on the accuracy of
computer computations). In this, and all other hypotheses,
the reader should be aware of the fact that the number of
individuals is dependent only upon the accuracy of the cal-
culations. Since the figures were carried to between eight
and twelve places, a larger N could well be assumed. This
would make the error terms smaller and the differences signifi-
cant. The theoretical 1000 individuals were used to give
the reader a point of reference. If the differences exist
in the proper direction, the rationale may be said to be
supported.
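The dependence of significance on the assumed N can be illustrated with the Table 2 figures. The group sizes n below are hypothetical (the thesis fixes only the theoretical total of 1000); the point is that, for fixed means and variances, the t statistic grows as the square root of n.

```python
from math import sqrt

# Illustration with hypothetical group sizes: the same mean difference
# becomes significant if a larger N is assumed, because the standard
# error of the difference shrinks as 1/sqrt(n).

def t_stat(m1, v1, m2, v2, n):
    """Two-sample t with equal group sizes n and the given variances."""
    return (m1 - m2) / sqrt(v1 / n + v2 / n)

# Means and variances from Table 2 (sequential vs. cumulative):
for n in (50, 100, 400):
    print(n, round(t_stat(63.40, 3.87, 62.96, 6.77, n), 2))
```

Quadrupling the assumed n doubles the t value, which is the sense in which "a larger N could well be assumed" would make the differences significant.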
The second null hypothesis was that there would be
equal variances of the comparable normalized scores for
middle category (number 8) individuals taking the sequential
test and the cumulative test both with a normal distribution
of ability input. Results are shown in Table 1. Again
the null hypothesis tested by an F ratio test must be accepted.
This was expected from the rationale.
The third null hypothesis was that there should be
equal means of the comparable normalized scores for combined
category 14 and 15 individuals taking the sequential and
cumulative tests both with a normal distribution of ability
input. Results are shown in Table 2. The null hypothesis
was based upon 1000 individuals and accepted. The scores
were in the expected direction with the sequential test
assigning the more extreme value; therefore, the rationale
tends to be supported.
The fourth null hypothesis was that there should be
equal variances of the comparable normalized scores for com-
bined category 14 and 15 individuals taking the sequential
and cumulative tests both with a normal distribution of
ability input. Results are shown in Table 2. The null
hypothesis was rejected at the .01 level of significance.
The sequential test had lower variance for high ability in-
dividuals as was predicted from the rationale.
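For the variance comparison itself, only the F ratio is fixed by the tabled values; its significance again depends on the degrees of freedom implied by the assumed N. A minimal check against Table 2:

```python
# Variance-ratio (F) check for the fourth hypothesis, using the Table 2
# variances; the degrees of freedom depend on the assumed N, so only the
# ratio itself is reproduced here.
F = 6.77 / 3.87  # larger variance (cumulative) over smaller (sequential)
print(round(F, 2))  # → 1.75
```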
The fifth null hypothesis was that there should be equal
means of ability level scores for the individuals in the top
8.4 per cent of the score distributions taking the sequential
and cumulative tests both with a normal distribution of
ability input. Results are shown in Table 3. The null
hypothesis was rejected at the .01 level of significance.
The sequential test had a higher mean ability level for the
TABLE 1
ANALYSIS OF MEANS AND VARIANCES OF NORMALIZED SCORES
FOR CATEGORY 8 INDIVIDUALS WHEN NORMAL DISTRIBUTION
OF ABILITY IS INPUT INTO SEQUENTIAL
AND CUMULATIVE TEST MODELS
Significance
Parameter Sequential Test Cumulative Test Between Tests
Mean 50.00 50.00 n.s.
Variance 16.37 21.22 n.s.
TABLE 2
ANALYSIS OF MEANS AND VARIANCES OF NORMALIZED SCORES
FOR CATEGORY 14 AND 15 INDIVIDUALS WHEN NORMAL
DISTRIBUTION OF ABILITY IS INPUT INTO
SEQUENTIAL AND CUMULATIVE TEST MODELS
Significance
Parameter Sequential Test Cumulative Test Between Tests
Mean 63.40 62.96 n.s.
Variance 3.87 6.77 p < .01
TABLE 3
ANALYSIS OF MEANS AND VARIANCES OF ABILITY LEVEL SCORES
FOR THE TOP 8.4 PER CENT OF THE SCORE DISTRIBUTION
WHEN NORMAL DISTRIBUTION OF ABILITY IS INPUT INTO
SEQUENTIAL AND CUMULATIVE TESTS
Significance
Parameter Sequential Test Cumulative Test Between Tests
Mean 13.66 12.92 p < .01
Variance 2.35 3.47 p < .05
top 8.4 per cent of the score distribution as had been pre-
dicted.
Null hypothesis six was that there should be equal
variances of ability level scores for the individuals in the
top 8.4 per cent of score distributions taking sequential and
cumulative tests, both with a normal distribution of ability
input. The results are shown in Table 3. The null hypothe—
sis was rejected at the .05 level of significance. The
sequential test had smaller variance of ability level scores
for the top 8.4 per cent of the score distribution as had
been predicted.
To examine the hypothesis that the six-item cumulative
test model would have smaller differences in mean ability
levels between the middle and adjacent scores than between
the extreme and adjacent scores, the differences in mean
ability level for adjacent scores were computed. These dif-
ferences are reported in Table 5, column 3. As was hypothe-
sized, the smaller differences in ability level were between
the middle score 4, and the adjacent score 5. However, it
should be noted that the differences between ability level
scores for adjacent scores for the sequential test model
(shown in Table 6) were not equal interval and there is no
pattern to the differences shown, although in both cases the
differences were greatest for the extreme scores.
If one wishes to examine the mean ability level and
number of individuals at each score, these values are shown
TABLE 4
DIFFERENCES BETWEEN NORMALIZED “T“ SCORES
FOR ADJACENT TOP ABILITY LEVELS FOR
NORMAL AND U-SHAPED INPUT
Between
Ability Ideal Normal Input U-Shaped Input
Levels Difference Cumulative Sequential Cumulative Sequential
15-14 4.5 2.3 2.5 1.2 2.6
14—13 2.5 1.1 2.1 1.1 1.7
13-12 2.5 1.9 2.1 1.5 1.8
12—11 2.5 2.0 2.3 1.7 1.7
11-10 2.4 2.4 2.3 1.7 1.4
10— 9 2.5 2.3 2.3 1.6 1.4
9- 8 2.5 2.5 2.3 1.6 1.2
TABLE 5
DIFFERENCES BETWEEN ABILITY LEVEL SCORES FOR
ADJACENT TOP SCORES FOR CUMULATIVE TEST
MODEL FOR NORMAL AND U-SHAPED INPUT
Input
Between Scores* Ideal Difference Normal U—Shaped
7-6 2.33 2.1 1.9
6-5 2.33 1.5 2.0
5-4 2.33 1.3 2.0
*Scores range from 1-7.
TABLE 6
DIFFERENCES BETWEEN ABILITY LEVEL SCORES FOR
ADJACENT TOP SCORES FOR SEQUENTIAL TEST
MODEL FOR NORMAL AND U-SHAPED INPUT
Between Input Between Input
Scores* Normal U-Shaped Scores* Normal U-Shaped
64-63 1.4 .8 48-47 .3 .1
63-62 .0 .0 47-46 .3 .3
62-61 .2 .1 46-45 .2 .5
61-60 .1 .2 45-44 .3 .2
60-59 .3 .2 44-43 .2 .0
59-58 .1 .2 43-42 .0 .4
58-57 .6 .6 42-41 .0 .1
57-56 .3 .1 41-40 .0 -.1
56-55 .0 .2 40-39 .5 .5
55-54 .0 .1 39-38 -.4 -.1
54-53 .5 .2 38-37 .4 .1
53-52 .0 .0 37-36 .0 .0
52-51 .0 .3 36-35 .0 .2
51-50 .0 .1 35-34 .0 .2
50-49 .2 -.1 34-33 .3 .3
49-48 .0 .4
*Ideal difference if all had been equal intervals.
in Tables 25 and 26 of the Appendix. The mean normalized
"T" score for each ability level is reported in Table 24
of the Appendix.
Results from the U-Shaped Distribution
The first null hypothesis was that there should be
equal means of the comparable normalized scores for category
13 individuals taking the sequential and cumulative tests
both with an input of a U-shaped distribution of ability.
Results are shown in Table 7. As can be seen, the null
hypothesis must be accepted. The sequential test did have
the higher mean value as expected, but not significantly so
if 1000 individuals are assumed to have taken the test.
The rationale would tend to be supported, though the effect
is small. (See comments on size of N under "Results from
the Normal Distribution.")
The second null hypothesis was that there should be
equal variances of the comparable normalized scores for
category 13 individuals taking the sequential and cumulative
tests each with an input of a U—shaped distribution of
ability. From Table 7 one can determine that the null hypothe-
sis must be accepted if only 1000 individuals are assumed to
have taken the test. The variance of the sequential test
was less, however, than the cumulative test as anticipated
though the effect was small.
The third null hypothesis was that there should be
equal means of the comparable normalized scores for category
TABLE 7
ANALYSIS OF MEANS AND VARIANCES OF NORMALIZED SCORES
FOR CATEGORY 13 INDIVIDUALS WHEN A U-SHAPED
DISTRIBUTION OF ABILITY IS INPUT INTO
SEQUENTIAL AND CUMULATIVE TEST MODELS
Significance
Parameter Sequential Test Cumulative Test Between Tests
Mean 58.44 58.03 n.s.
Variance 13.99 14.00 n.s.
TABLE 8
ANALYSIS OF MEANS AND VARIANCES OF NORMALIZED SCORES
FOR CATEGORY 15 INDIVIDUALS WHEN A U-SHAPED
DISTRIBUTION OF ABILITY IS INPUT INTO
SEQUENTIAL AND CUMULATIVE TEST MODELS
Significance
Parameter Sequential Test Cumulative Test Between Tests
Mean 60.73 60.44 n.s.
Variance 1.96 3.62 p < .01
TABLE 9
ANALYSIS OF MEANS AND VARIANCES OF ABILITY LEVEL SCORES
FOR THE TOP 13.5 PER CENT OF THE SCORE DISTRIBUTION
WHEN A U-SHAPED DISTRIBUTION OF ABILITY IS INPUT
INTO SEQUENTIAL AND CUMULATIVE TESTS
Significance
Parameter Sequential Test Cumulative Test Between Tests
Mean 14.43 13.87 p < .01
Variance .77 1.86 p < .01
15 individuals taking the sequential and cumulative tests
both with an input of a U-shaped distribution of ability.
The results are shown in Table 8. The null hypothesis must
be accepted, although the results were in the direction
indicated by the research hypothesis. The cumulative had a
lower value for the mean. Again significance depends upon
number of individuals assumed to have taken the test.
The fourth null hypothesis was that there should be
equal variances of the comparable normalized scores for
category 15 individuals taking the sequential and cumula-
tive tests both with an input of a U-shaped distribution of
ability. As shown in Table 8, the null hypothesis was re—
jected at the .01 level of significance. The sequential test
had less variance of scores for the highest ability level
individuals than did the cumulative test. 5
The fifth null hypothesis was that there should be equal
means of ability level scores for the individuals in the top
13.5 per cent of the score distribution taking the sequential
and cumulative tests both with an input of a U-Shaped distri-
bution of ability. The results are shown in Table 9. The
sequential test had a significantly higher mean ability level
for the top 13.5 per cent of the score distribution than did
the cumulative. This was in the direction hypothesized.
The sixth hypothesis was that there should be equal
variances of ability level scores for the individuals in the
top 13.5 per cent of the score distribution taking the sequen-
tial and cumulative tests both with an input of a U—shaped
distribution of ability. The results in Table 9 indicate
that the sequential test had, at the .01 level of signifi-
cance, a smaller variance of ability level scores for the
top 13.5 per cent of the score distribution than did the
cumulative test. This was in the direction hypothesized.
The differences in mean ability level between adjacent
top scores for the cumulative and sequential test models are
shown in Tables 5 and 6, respectively. The scores on the
sequential test did not yield equal intervals on the ability
level scale as had been hypothesized. The cumulative scores
are a good approximation of equal intervals on the ability
level scale.
To examine the hypothesis that the sequential test
model should have approximately equal distance between test
score means for each of the ability categories, while the
six-item cumulative would have larger differences in mean
test scores for the middle ability levels than for extreme
ability levels, the differences between adjacent scores were
computed. These differences are reported in Table 4. The
cumulative test did have smaller score differences between
the extreme ability levels than any other point in ability
distribution. However, the sequential test did not have an
equal interval scale, but in general decreased in size of
difference between mean scores of adjacent ability levels
from extreme ability category to middle ability category.
It should be noted that neither test represented the ability
levels with any real accuracy. The top ability level shown
had an ideal ”T” score of 69 instead of the 61.8 assigned
by the sequential or the 60.4 assigned by the cumulative
test. (See Table 24.)
III. ITEM PRECISION AND DIFFICULTY FOR
THE SEQUENTIAL TEST
Five levels of precision and the appropriate levels of
difficulty for each were used in the construction of five
sequential test models. For these tests the variances of
scores for the extreme and middle ability levels and the
variances of ability level for extreme and middle scores
were examined.
Variance of Scores
The first null hypothesis was that there would be equal
variances of scores for category 8 ability level individuals
for all five tests of different precision level. Data and
results are shown in Table 10. The null hypothesis was re-
jected at the .001 level of significance. As was hypothe—
sized, the more precise tests had smaller variances.
The second null hypothesis was that there would be
equal variances of scores for a combination of ability level
categories 14 and 15 for all five tests of different preci-
sion level. From data in Table 10 it can be seen that the
null hypothesis was rejected at the .001 level of signifi-
cance; the more precise tests had smaller variances of scores.
TABLE 10
ANALYSIS OF THE VARIANCE OF SCORES FOR
INDIVIDUALS AT SPECIFIED ABILITY
LEVELS FOR FIVE TESTS OF
DIFFERENT PRECISION
Precision of Test
Ability Significance
Category .45 .60 .71 .75 .79 of Difference
8 260.03 198.70 147.52 127.33 111.65 p < .001
14 and 15 94.74 40.90 19.69 15.50 11.86 p < .001
TABLE 11
ANALYSIS OF THE VARIANCE OF ABILITY LEVEL SCORES FOR
INDIVIDUALS AT SPECIFIED SCORE LEVELS FOR
FIVE TESTS OF DIFFERENT PRECISION
Score Level Precision of Test Significance
(Per Cent) .45 .60 .71 .75 .79 of Difference
Top 8.4 5.88 3.71 2.45 2.10 1.77 p < .001
Middle 10 8.22 5.37 3.62 3.06 2.56 p < .001
As was hypothesized, the precision of the item was an impor-
tant variable in precision of scores.
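The thesis does not state which test produced the significance levels in Table 10. One simple check in the same spirit (an assumption of this sketch, not the author's stated method) is Hartley's F-max, the ratio of the largest to the smallest variance among the five tests:

```python
# Hartley's F-max sketch applied to the Table 10 variances. Which test
# the thesis actually used for the reported p-values is not stated.
variances_cat8 = [260.03, 198.70, 147.52, 127.33, 111.65]
variances_high = [94.74, 40.90, 19.69, 15.50, 11.86]   # categories 14 and 15

def f_max(vs):
    """Ratio of largest to smallest variance in a set of groups."""
    return max(vs) / min(vs)

print(round(f_max(variances_cat8), 2))   # → 2.33
print(round(f_max(variances_high), 2))   # → 7.99
```

The much larger spread for the high-ability categories reflects how strongly item precision narrows the score variance at the extremes.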
Variance of Ability Levels
The third null hypothesis was that there would be equal
variances of ability level scores for the individuals ranked
in the top 8.4 per cent of the score distribution by each of
the five tests of different precision level. Data and results
are shown in Table 11. The null hypothesis was rejected at
the .001 level of significance. As can be seen, the preci-
sion of item was important in determining the precision of
the scores as hypothesized. The individuals assigned to the
top 8.4 per cent of the score distribution were not as
variable in ability level when assigned by a test with items
having an rbis of .79 as when assigned by a test with items
having an rbis of .45.
The fourth null hypothesis was that there would be
equal variances of ability level scores for the individuals
ranked in the middle 10 per cent of the score distribution
by each of the five tests of different precision level. As
can be seen in Table 11, the null hypothesis was rejected
at the .001 level of significance. The results were in the
direction hypothesized--the more precise tests had smaller
variance of ability levels. However, it should be noted
that for the middle 10 per cent of the score distribution,
the variances were 8.22 and 2.56 for the .45 and .79 tests,
respectively. The one variance is 3.21 times larger than the
other. For the top 8.4 per cent of the score distribution
the variances were 5.88 and 1.77 for the .45 and .79 tests,
respectively. The larger variance is 3.32 times the other.
Greater differences at the top than at the middle of the
score distribution were contrary to what had been expected.
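The two ratios quoted above can be checked directly from the Table 11 variances:

```python
# The variance ratios quoted in the text, recomputed from Table 11:
middle = 8.22 / 2.56   # .45 test vs .79 test, middle 10 per cent
top = 5.88 / 1.77      # .45 test vs .79 test, top 8.4 per cent
print(round(middle, 2), round(top, 2))  # → 3.21 3.32
```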
Table 12 gives the means and variances of rank scores
assigned to each ability level by the five tests of different
precision. The means for Category 8 individuals were always
the same. However, the mean rank scores assigned to category
1 individuals were lower as the precision of the item in-
creased. This was especially noticeable at the lower preci-
sion levels. The variances of the test scores for each
ability level decreased with the precision of item as was
hypothesized. Also, it should be noted that the variances
of extreme scores were much lower than the variances of the
middle value scores.
The discrimination indices are reported in Table 13.
(Only one-half of the score distribution is tabled because
the two halves are symmetrical about the mean.) The higher
the value, the better the discrimination. The test con—
sisting of the most precise items had the highest discrimin-
ation index. The test was more discriminating for the extremes
in ability than it was for the other ability values. However,
the other values of the discrimination index were remarkably
close to each other for all ability levels other than the
extremes. This was what had been hoped for with the sequen-
tial test.
TABLE 12
THE MEANS AND VARIANCES OF RANK SCORES ASSIGNED TO EACH
ABILITY LEVEL BY FIVE TESTS OF DIFFERENT PRECISION
(Means and variances of rank scores for each ability
level at precision levels .45, .60, .71, .75, and .79.)
TABLE 13
THE DISCRIMINATION INDICES BETWEEN ADJACENT ABILITY
LEVELS FOR THE INPUT OF A NORMAL DISTRIBUTION OF
ABILITY INTO TESTS OF DIFFERENT PRECISION
Precision Level of Test
Between Ability
Levels .45 .60 .71 .75 .79
1 and 2 .42 .58 .89 .70 .81
2 and 3 .22 .30 .42 .43 .50
3 and 4 .22 .33 .43 .47 .51
4 and 5 .24 .34 .43 .50 .54
5 and 6 .23 .34 .46 .52 .57
6 and 7 .24 .38 .46 .52 .59
7 and 8 .24 .35 .48 .52 .58
IV. ERRORS IN THE SEQUENTIAL TEST
PARAMETER ESTIMATES
Errors in estimating the difficulty level of items and
errors in estimating the precision of items were investigated.
Four different tests with errors in estimates of difficulty
were constructed, and two tests with errors in precision
were built. All tests used the "least squares" difficulties
as the base for comparison. The results of investigating
these two types of errors will be discussed separately.
Errors in Estimating Difficulty
Of the four tests with errors in estimates of item dif-
ficulty, two had the error at the second item encountered and
two had the error at the fifth item encountered.
Second item error.--The first null hypothesis was that
the number of people in each set of score categories would
be independent of whether the people were classified by an
”error free” test or one which had items too far from the
mean at the second stage. The distributions are reported
in Table 27 of the Appendix. The number of individuals at
12 selected categories, the expected values from an indepen-
dence assumption, and the chi-square value are reported in
Table 14. The null hypothesis had to be accepted. There
were more people at the middle values as hypothesized, but
the differences were not significant if 1000 people were
assumed to have taken the test. It can be concluded that
the effects of second-item errors are small.
The second null hypothesis was that the number of
people in each set of score categories would be independent
of whether the people were classified by an "error free"
test, or by a test which had the second item encountered
too near the mean value. The distribution is reported in
Table 27 in the Appendix. The number of individuals at
12 selected categories, the expected values from an indepen-
dence assumption, and the chi—square value are reported in
Table 15. The null hypothesis had to be accepted. However,
there were more people at the extreme categories in the
modified test than in the "error free” test, as was hypothe-
sized. The differences were not significant due to the
assumption of 1000 individuals.
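The dependence of significance on the assumed N can be made concrete: in a chi-square statistic, scaling every observed and expected count by the same factor scales the statistic by that factor, so the same proportions under a larger hypothetical population could pass the critical value. The following sketch is not part of the original study; the six counts are the reported cells for the "2nd item extreme" test in Table 14.

```python
# Scaling demo: chi-square grows linearly with the assumed sample size
# when the observed proportions are held fixed.
def chi_square(observed, expected):
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

observed = [59, 97, 50, 106, 86, 100]
expected = [62.44, 100.20, 57.91, 96.68, 86.61, 94.16]

base = chi_square(observed, expected)

# Assume ten times as many examinees with identical proportions.
scaled = chi_square([10 * o for o in observed], [10 * e for e in expected])

# The statistic is exactly ten times larger, while the critical value
# for the same degrees of freedom is unchanged.
assert abs(scaled - 10 * base) < 1e-9
```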
Fifth item error.-—The first null hypothesis was that
of equal variances of the ability level scores for the
individuals ranked in the top 8.4 per cent of the score dis-
tribution by the ”error free" difficulty test and the tests
which had the fifth item too far and too near the mean value.
The variances of the ability level scores for the top 8.4
per cent in each of the tests are reported in Table 16. The
differences in variances were not significantly different
from each other. However, the "error free" test did not
have the smallest variance, as was hypothesized. The test with
TABLE 14

DISTRIBUTION OF INDIVIDUALS BY TWO TESTS--ONE TEST
WITH SECOND ITEM DIFFICULTIES FARTHER FROM 50
PER CENT LEVEL THAN THE "ERROR FREE" TEST*

                             Rank Scores
Test        64       58-63     54-57     46-53     40-45     33-39

2nd Item  (62.44)  (100.20)   (57.91)   (96.68)   (86.61)   (94.16)
extreme     59        97         50       106        86       100

"Error    (61.56)  ( 98.80)   (57.09)   (95.32)   (85.39)   (92.84)
Free"       65       102         65        86        86        87

x2 = 10.624                                             d.f. = 11
TABLE 15

DISTRIBUTION OF INDIVIDUALS BY TWO TESTS--ONE TEST
WITH SECOND ITEM DIFFICULTIES NEARER TO 50 PER
CENT LEVEL THAN THE "ERROR FREE" TEST*

                             Rank Scores
Test        64       58-63     54-57     46-53     40-45     33-39

2nd Item  (67.24)  (104.14)   (73.81)   (82.40)   (88.47)   (85.94)
near 50     68       104         81        77        89        83

"Error    (65.76)  (101.86)   (72.19)   (80.60)   (86.53)   (84.06)
Free"       65       102         65        86        86        87

x2 = 10.624                                             d.f. = 11
*NOTE: The rank scores are broken to make approximately
equal intervals on the ability scale. The scores
1-32 are not reported in the table but are symmetrical
about 32.5. All values were used in the calculations
of chi-square. Expected cell frequencies are given in
parentheses.
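The parenthesized expected frequencies in Tables 14 and 15 are the usual independence expectations (row total times column total over the grand total). A sketch, using only the six reported columns of Table 14; since the thesis computed chi-square over all 24 cells, the statistic below covers just the reported portion and is smaller than the tabled 10.624.

```python
# Independence expectations and chi-square contributions for the six
# reported rank-score categories of Table 14.
rows = [
    [59, 97, 50, 106, 86, 100],  # 2nd item extreme
    [65, 102, 65, 86, 86, 87],   # "error free"
]
col_totals = [sum(col) for col in zip(*rows)]
row_totals = [sum(row) for row in rows]
grand = sum(row_totals)

chi_sq = 0.0
for r, row in enumerate(rows):
    for c, obs in enumerate(row):
        exp = row_totals[r] * col_totals[c] / grand  # independence expectation
        chi_sq += (obs - exp) ** 2 / exp

# The first expected cell reproduces the tabled value (62.44).
first_exp = row_totals[0] * col_totals[0] / grand
```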
items nearer to the 50 per cent level of difficulty had the
least variance of ability represented in the top 8.4 per cent of
the score distribution. Rationale was not supported here.
The second null hypothesis was that there would be
equal variances of test scores for individuals in ability
category 15 on the ”error free" test and the tests which had
the fifth item too far and too near the mean value. The
results in Table 17 show that the null hypothesis must be
accepted. The largest variance was for the test with items
nearer the 50 per cent level of difficulty, as was hypothe-
sized, even though the results were not significant due to
the assumption of only 1000 individuals.
The third null hypothesis was that there would be equal
variances of ability level scores for the individuals ranked
in the middle 10 per cent of the score distribution by the
three tests. The results of these tests are shown in
Table 16. The test with the items at the fifth stage near
the 50 per cent level of difficulty had lower variance than
other tests, but not significantly so. It was hypothesized
from Lawley's work on the cumulative that the test with the
difficulties away from the mean would have had the smallest
variance. Rationale was not supported.
The fourth null hypothesis was that there would be
equal variances of test scores for individuals in ability
category 8 on all three tests. Again the null hypothesis had
to be accepted. The lowest variance was for the test with
TABLE 16

ANALYSIS OF THE VARIANCE OF ABILITY LEVEL SCORES FOR
INDIVIDUALS AT SPECIFIED SCORE LEVELS FOR ONE
"ERROR FREE" TEST AND TWO "ERROR IN
DIFFICULTIES OF FIFTH ITEMS" TESTS

                              5th Items   5th Items   Significance
              "Error Free"    Nearer      Away from   of
Score Level   Test            50%         50%         Differences

Top 8.4 %     2.30            2.25        2.33        n.s.
Middle 10%    3.08            3.06        3.30        n.s.
TABLE 17

ANALYSIS OF THE VARIANCE OF RANK SCORES FOR INDIVIDUALS
AT SPECIFIED ABILITY LEVELS FOR ONE "ERROR FREE" TEST
AND TWO "ERROR IN DIFFICULTIES OF FIFTH ITEMS" TESTS

                                5th Items   5th Items   Significance
                "Error Free"    Nearer      Away from   of
Ability Level   Test            50%         50%         Differences

15              148.08          173.34      129.01      n.s.
8               5.57            4.75        6.67        n.s.
the fifth item nearer the 50 per cent level of difficulty.
It had been hypothesized that the "least squares" test would
have the smallest variance.
Errors in Estimating Precision
Two tests were built to examine the error of estimating
precision: one with rbis equal to .71 items at the second
stage of the "least squares" (rbis = .75) test, and the other
with rbis equal to .71 items at the fifth stage. These are
discussed separately.
Errors at the second stage.--The first null hypothesis
was that there would be equal variances of test scores for
individuals in ability category 15 for the ”error free” test
and the test where the precision was lowered at the second
stage. The variances of the "error free” and ”error” tests
were 5.57 and 5.94, respectively, for ability category 15.
The F ratio was 1.06 and thus the null hypothesis had to be
accepted. The variance increased with error as was expected,
but not to a significant degree if only 1000 individuals were
assumed to have taken the test.
The second null hypothesis was that there would be equal
variances of test scores for individuals in ability category 8
for the "error free" test and the test where precision was
lowered at the second stage. The variance of the "error
free" test was 148.08 and for the "error” test was 149.09.
The F ratio was 1.01 and again the variance increased as was
hypothesized, but not significantly so if an N of 1000 was
assumed. It should be noted that the F ratio for the
variances at ability category 15 was greater than the F ratio
for variances at level 8--i.e., errors at the second stage
seemed to have a greater effect on extreme scores as was
anticipated.
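The F ratios reported in this section are simply the ratio of the larger to the smaller variance; the sketch below reproduces the two comparisons above from the variances quoted in the text (the thesis reports the ratios truncated to two figures).

```python
# F ratio of two variances, larger over smaller, as used in the
# comparisons of "error" and "error free" tests above.
def f_ratio(var_a, var_b):
    hi, lo = max(var_a, var_b), min(var_a, var_b)
    return hi / lo

f_cat15 = f_ratio(5.94, 5.57)     # precision error at second stage, category 15
f_cat8 = f_ratio(149.09, 148.08)  # same comparison, category 8

# The category-15 ratio is the larger of the two, i.e., the extreme
# scores are affected more by the second-stage precision error.
assert f_cat15 > f_cat8
```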
The third null hypothesis was that there would be
equal variances of ability level scores for individuals
ranked in the top 8.4 per cent of the score distribution
of each of these two tests. The variance of ability level
scores for top 8.4 per cent on the "error free" test was
2.30 and the variance of ability level scores for the
"error" test was 2.37. The null hypothesis had to be ac-
cepted, but the variance did increase with error in item
precision. Again significance depended upon the value
assumed for N.
Errors at the fifth stage.--The first null hypothesis
was that there would be equal variances of test scores for
individuals in ability category 15 for the "error free preci-
sion" test and the test with ”error” in precision at the
fifth stage. The "error free” test had a variance of test
scores of 5.57 and the ”error” test had a variance of 5.71
for ability category 15. The null hypothesis had to be
accepted, but the variance was larger for the test with
errors as had been hypothesized. (Changes in the assumed
value of N would change the significance test.)
The second null hypothesis was that there would be
equal variances of ability level scores for individuals
ranked in the top 8.4 per cent of the score distribution by
the two tests. The “error free" test had a variance of 2.30
and the "error" test had a variance of 2.40. Again the null
hypothesis was accepted, but the variance of the test with
the error was larger as hypothesized.
It had been assumed that at the middle ability level
the effects of errors in precision would be slight. The
variance for the "error free" test was 148.08 and for the
"error” test was 150.69. The difference between the variances
of the two tests is slight and the F ratio for the middle
ability variances is the same as the F ratio of variances for
category 15 individuals. This was as expected.
It had also been assumed that errors in precision at
the second stage would be more serious than those at the
fifth stage. In the variances of scores for high ability
individuals, the error in precision at the second stage
increased variance more than error in precision at fifth
stage. (The variances were 5.57, 5.94, and 5.71 for "error
free," error at second, and error at fifth stage precision
tests, respectively.) However, the variance of scores for
middle ability level individuals was higher for the test
with error in the fifth stage than the test with error in
the second stage. (The variances were 148.08, 149.09, and
150.69 for ”error free," error at second, and error at fifth
stage precision tests, respectively.) The test with error at
the fifth stage also had the highest variance of ability
level scores for individuals in top 8.4 per cent of the
score distribution. (Variances were 2.30, 2.37, and 2.40
for "error free," error at second, and error at fifth stage
precision tests, respectively.) The hypothesis that errors
at the second stage would be more serious than errors at the
fifth stage was not confirmed.
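The three-way comparisons in the two preceding paragraphs can be collected in one place; all variances below are the values quoted above, and the grouping is only an illustrative restatement.

```python
# Variances for the "error free", error-at-second-stage, and
# error-at-fifth-stage precision tests, as quoted in the text.
variances = {
    "category 15 scores": {"free": 5.57, "second": 5.94, "fifth": 5.71},
    "category 8 scores": {"free": 148.08, "second": 149.09, "fifth": 150.69},
    "top 8.4% ability": {"free": 2.30, "second": 2.37, "fifth": 2.40},
}

# The second-stage error dominates only for the category-15 scores;
# the fifth-stage error yields the larger variance in the other two
# rows, which is why the "second stage is more serious" hypothesis
# was not confirmed.
second_dominates = [
    name for name, v in variances.items() if v["second"] > v["fifth"]
]
```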
General Comparisons
The three areas of general comparisons were the effects
of difficulty, the effects of precision, and the effects of
the pattern of items. There were no hypotheses made about these general
comparisons. The information is presented to suggest new
hypotheses and to aid in forming tentative conclusions.
Effects of difficulty.--In addition to the hypothesis
testing material already reported, examination of Tables 18,
19, and 20 yields information on difficulty. Only the dif-
ficulty of the items has changed from column to column within
any one of the three tables. It should be noted that the
distribution of difficulties to form a normal output of
scores yielded the highest mean ability level for the top
score, no matter what type of distribution was input. Also,
the distribution of difficulties to produce a U-shaped output
of scores yielded the greatest number of individuals in the
extreme score irrespective of the type of distribution that
was input.
TABLES 18, 19, AND 20

MEAN ABILITY LEVEL SCORES AND NUMBERS OF INDIVIDUALS AT EACH
RANK SCORE FOR NORMAL, RECTANGULAR, AND U-SHAPED EXPECTED
PATTERNS OF OUTPUT, FOR NORMAL, RECTANGULAR, AND U-SHAPED
DISTRIBUTIONS OF ABILITY INPUT

*Rank of the mean ability level of individuals by a score.