AN EVALUATION OF THE SEQUENTIAL
METHOD OF PSYCHOLOGICAL
TESTING
Thesis for the Degree of Ed. D.
MICHIGAN STATE UNIVERSITY
John James Paterson
1962
This is to certify that the
thesis entitled
AN EVALUATION OF THE SEQUENTIAL METHOD
OF
PSYCHOLOGICAL TESTING
presented by
John James Paterson
has been accepted towards fulfillment
of the requirements for
Major professor
Date June 15, 1962
LIBRARY
Michigan State
University
ABSTRACT
AN EVALUATION OF THE SEQUENTIAL METHOD OF
PSYCHOLOGICAL TESTING
by John J. Paterson
In the sequential method of psychological testing the
examinees are directed to subsequent items on the basis of
their responses to prior items. No examinee responds to all
the items of a sequential test, and any given examinee
might complete the test by responding to any of several com-
binations of items. Scores on the sequential test reflect
the difficulty of items correctly answered, not the number
correct.
The evaluation did not involve an actual population
of individuals, but used probability models and hypothetical
populations. The probability of passing a given item in a
test was calculated from the ability level of the individual,
the difficulty of the item, and the precision of the item.
(Precision may be computed from the item-total biserial
correlation.) The probability of passing a sequence of
items was determined for each of fifteen ability categories
by multiplying together the probabilities of passing or
failing a sequence of six items. Sixty-four different se-
quences were calculated for each ability category.
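The probability arithmetic described above can be sketched in code. This is a minimal modern illustration, not the author's computational procedure: the normal-ogive form of the item characteristic curve and all function and parameter names are assumptions.

```python
from math import erf, sqrt

def p_pass(ability, difficulty, precision):
    """Probability that an examinee at `ability` passes an item of the
    given `difficulty`, using a normal-ogive item characteristic curve
    whose standard deviation is `precision`. All quantities are in
    standard-score units."""
    z = (ability - difficulty) / precision
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

def p_sequence(ability, items, passed):
    """Probability of one particular pass/fail pattern through a sequence
    of items: the product of the pass (or fail) probabilities at each
    stage. `items` is a list of (difficulty, precision) pairs and
    `passed` a matching list of booleans."""
    prob = 1.0
    for (d, s), p in zip(items, passed):
        q = p_pass(ability, d, s)
        prob *= q if p else (1.0 - q)
    return prob
```

With six items and two outcomes per item there are 2^6 = 64 possible pass/fail patterns, which is where the sixty-four sequences per ability category come from.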
The problem involved was the comparison of the sequen-
tial model with the traditional cumulative model (in which
all items were at the 50 per cent level of difficulty) to
determine how well individuals at different ability levels
were classified by the tests. The parameters of the sequen-
tial test (difficulty and precision) and the effects of
errors in estimating these parameters were examined in
relation to the resulting classification of individuals.
One sequential test model was constructed with an
item-total biserial correlation of .75 and item difficulties
such that the sum of the squared deviations of the individ-
ual's ability level from the mean ability level of the group
into which the individual was classified would be a minimum.
Even though individuals in each ability category were kept
separate from individuals in other categories, individuals
in different categories took the same difficulty item if
the calculated difficulties were less than .20 standard
deviation units apart. A rectangular distribution of
ability was assumed in these calculations.
Both normal and U-shaped distributions of ability
were used as input for the above sequential and cumulative
test models to determine how well the results classified
individuals of different ability levels. It was concluded
that regardless of the distribution of ability used as
input, the individuals in the extreme ability categories had
significantly less variance of scores in the sequential test.
At middle ability levels the sequential test did have slightly
lower variance of test scores than the cumulative. For the
top scores the sequential test had less variance of ability
level than the cumulative.
The second and fifth items in the sequential test were
each separately changed in difficulty and precision. The
resulting number of people at each score, the mean ability
level of individuals at each score, the variance of scores
for top and middle ability level individuals and the variance
of ability level scores for the top and middle scoring
individuals were all insignificantly changed. The sequential
test was not sensitive to errors in estimating the precision
and difficulty of the items.
When precision of items in the sequential tests was
varied, tests consisting of higher precision items (with dif-
ficulties appropriate for that precision level) had less
variance of scores for ability level categories and less
variance of ability level categories for top and middle
scoring individuals.
It was concluded that more difficult items are needed
to distinguish among more able students; less difficult
items among the less able. If extreme scores having low
variance of ability level are desired, the item difficulties
should be regressed toward the mean from those difficulties
which give the best discrimination between individuals of
similar ability level.
AN EVALUATION OF THE SEQUENTIAL METHOD
OF
PSYCHOLOGICAL TESTING

by
John James Paterson

A THESIS

Submitted to Michigan State University
in partial fulfillment of the requirements
for the degree of
DOCTOR OF EDUCATION

College of Education

1962
ACKNOWLEDGMENTS
The writer wishes to express his appreciation
for the guidance given by Dr. David R. Krathwohl in
the preparation of this thesis and to the Bureau of
Educational Research for arranging the time necessary
for completion of the research.
TABLE OF CONTENTS
CHAPTER
I. DESCRIPTION OF THE PROBLEM.
Description of the Sequential Test Model
Starting Point.
Stopping Point.
Scoring
Pattern of Items
Directions to Testee.
A Diagram of a Sequential Test Used
in This Study
Need for Test Improvement . .
Maximally Efficient Use of the Items
Selected .
Control of the Score Distribution
Meaning of a Score
Rationale for the Sequential Item Model
Maximally Efficient Use of Items.
Control of the Score Distribution
Meaning of a Score
Selection of the Sequential Procedure
Hypotheses . .
Effect of the Type of Ability
Distribution .
Effect of Item Precision and Difficulty
for the Sequential Test . .
Effect of Errors in Estimating Para-
meters . .
Limitations of the Study
Best Cumulative Test.
Distribution of Scores
Ability Distributions
Test Parameters . . .
Test Construction Procedures
Test Presentation Procedures and
Effects.
Overview of the Remainder of the
Dissertation
II. REVIEW OF LITERATURE.
Maximally Efficient Use of Items
Selected.
Control of the Score Distribution.
Meaning and Use of Score Produced.
Sequential Testing Procedures
III. PROCEDURES
Test Model Construction . .
Effect of Shape of Distribution of
Ability. . .
Effect of Normal Distribution .
Effect of U-Shaped Distribution .
Effect of Ability Distributions for
Additional Sequential Tests .
Item Precision and Difficulty for the
Sequential Test .
Errors in Sequential Test Parameter
Estimates
General Comparisons
Summary of Procedures and Hypotheses.
IV. ANALYSES AND RESULTS
Sequential Test Construction
First Item Decision
Second Item Decision
Third Item Decision
Fourth Item Decision
Fifth Item Decision
Sixth Item Decision .
Input Distribution Effects
Results from the Normal Distribution.
Results from the U-Shaped Distribu-
tion.
Item Precision and Difficulty for the
Sequential Test .
Variance of Scores.
Variance of Ability Levels
Errors in the Sequential Test Parameter
Estimates .
Errors in Estimating Difficulty
Errors in Estimating Precision.
General Comparisons
V. CONCLUSIONS
Sequential Testing and Testing Problems.
Efficiency of Items.
Control of the Score Distribution.
Meaning of a Score.
PAGE
69
81
91
91
95
97
100
102
105
107
112
114
122
122
123
123
125
125
126
128
131
132
138
142
142
144
148
148
153
156
164
164
164
167
160
CHAPTER PAGE
Sequential Testing Hypotheses. . . . . 171
Effect of Ability Distribution . . . 171
Effect of Precision and Difficulty . . 173
Effect of Error in Parameters. . . . 175
VI. SUMMARY AND RECOMMENDATIONS . . . . . . 176
Summary . . . . . . . . . . . . 176
Recommendations . . . . . . . . . 183
BIBLIOGRAPHY . . . . . . . . . . . . . . 186
APPENDIX A . . . . . . . . . . . . . . . 192
APPENDIX B . . . . . . . . . . . . . . . 198
LIST OF TABLES
TABLE                                                        PAGE
1. Analysis of Means and Variances of Normalized
Scores for Category 8 Individuals When Normal
Distribution of Ability is Input into Sequential
and Cumulative Test Models . . . . . . . . 134
2. Analysis of Means and Variances of Normalized
Scores for Category 14 and 15 Individuals When
Normal Distribution of Ability is Input into
Sequential and Cumulative Test Models . . . 134
3. Analysis of Means and Variances of Ability
Level Scores for the Top 8.4 Per Cent of the
Score Distribution When Normal Distribution of
Ability is Input into Sequential and Cumulative
Tests . . . . . . . . . . . . . . . 134
4. Differences Between Normalized "T" Scores for
Adjacent Top Ability Levels for Normal and U-
Shaped Input . . . . . . . . . . . . . 136
5. Differences Between Ability Level Scores for
Adjacent Top Scores for Cumulative Test Model
for Normal and U-Shaped Input . . . . . . 136
6. Differences Between Ability Level Scores for
Adjacent Top Scores for Sequential Test Model
for Normal and U-Shaped Input . . . . . . 137
7. Analysis of Means and Variances of Normalized
Scores for Category 13 Individuals When a U-
Shaped Distribution of Ability is Input into
Sequential and Cumulative Test Models . . . 139
8. Analysis of Means and Variances of Normalized
Scores for Category 15 Individuals When a U-
Shaped Distribution of Ability is Input into
Sequential and Cumulative Test Models . . . 139
9. Analysis of Means and Variances of Ability
Level Scores for the Top 13.5 Per Cent of the
Score Distribution When a U-Shaped Distribution
of Ability is Input into Sequential and Cumula-
tive Tests . . . . . . . . . . . . . . 139
TABLE                                                        PAGE
10. Analysis of the Variance of Scores for Individ-
uals at Specified Ability Levels for Five Tests
of Different Precision . . . . . . . . . 143
11. Analysis of the Variance of Ability Level
Scores for Individuals at Specified Score
Levels for Five Tests of Different Precision . 143
12. The Means and Variances of Rank Scores Assigned
to Each Ability Level by Five Tests of
Different Precision . . . . . . . . . . 146
13. The Discrimination Indices Between Adjacent
Ability Levels for the Input of a Normal Distri-
bution of Ability into Tests of Different
Precision . . . . . . . . . . . . . . 147
14. Distribution of Individuals by Two Tests--One
Test With Second Item Difficulties Farther from
50 Per Cent Level Than the "Error Free" Test . 150
15. Distribution of Individuals by Two Tests--One
Test With Second Item Difficulties Nearer to
50 Per Cent Level Than the "Error Free" Test . 150
16. Analysis of the Variance of Ability Level
Scores for Individuals at Specified Score
Levels for One "Error Free" Test and Two "Error
in Difficulties of Fifth Items" Tests . . . 152
17. Analysis of the Variance of Rank Scores for
Individuals at Specified Ability Levels for
One "Error Free" Test and Two "Error in Diffi-
culties of Fifth Items" Tests . . . . . . 152
18. Distribution and Mean Ability Level Scores for
Top Score Values for Three Tests With Different
Difficulties and Normal Distribution of Ability
Input . . . . . . . . . . . . . . . 157
19. Distribution and Mean Ability Level Scores for
Top Score Values for Three Tests With Different
Difficulties and Rectangular Distribution of
Ability Input . . . . . . . . . . . . 158
20. Distribution and Mean Ability Level Scores for
Top Score Values for Three Tests With Different
Difficulties and U-Shaped Distribution of Ability
Input . . . . . . . . . . . . . . . 159
TABLE PAGE
21. Distribution and Mean Ability Level Scores for
Top Scores of Tests With Different Levels of
Precision and With an Input of Normal Distribu-
tion of Ability. . . . . . . . . . . 161
22. Distribution and Mean Ability Level Scores for
Top Scores of Tests With Different Patterns of
Items Encountered and With an Input of a Normal
Distribution of Ability . . . . . . . . . 163
23. Per Cent Passing Items of the Different
Sequential Tests Constructed . . . . . . . 193
24. Mean Normalized "T" Scores for Each Ability
Level for Cumulative and ”Least Squares”
Sequential Tests . . . . . . . . . . . 194
25. Distribution and Mean Ability Level Scores for
Cumulative Test With the Input of Different
Distributions of Ability. . . . . . . . . 195
26. Distribution and Mean Ability Levels for Top
Scores on "Least Squares” Sequential Test With
the Input of Different Distributions of Ability . 195
27. Distribution and Mean Ability Levels for Top
Scores With Difficulties of Certain Items Changed
in a Sequential Test With an Input of Normal
Distribution of Ability . . . . . . . . . 196
28. Distribution and Mean Ability Levels for Top
Scores With Precision of Items Changed in a
Sequential Test With an Input of Normal Distri-
bution of Ability . . . . . . . . . . . 197
LIST OF FIGURES
FIGURE                                                       PAGE
1. Graphic Representation of the "Least Squares"
Sequential Test Model
2. Three Distributions of Ability . . . . . . . 99
3. Mean Ability Level of Groups Separated Out by
Sequential Test and Difficulties of Items Used . 124
CHAPTER I

DESCRIPTION OF THE PROBLEM

The usual test consists of a series of questions or
items scored on the basis of the correct or incorrect answers
given by an examinee; the score is the number of correct
responses. This type of test is based on a cumulative model.
One alternative to the cumulative model is the sequential
model. In tests based upon the sequential model, examinees
are directed to subsequent items on the basis of their re-
sponses to prior ones. While no examinee responds to all of
the items of a sequential test, any given examinee might
complete the test by responding to any of a number of com-
binations of items. Scores on sequential tests are based
upon the nature of the items to which correct responses are
given and not merely upon the number of correct responses.

The basic problem attacked in this study is the com-
parison of a sequential model with the traditional cumulative
test model. In the sequential model, the individual who
passes an item is directed
to a more difficult item; if he fails an item he is directed
to an easier item. If the item is very precise, the individ-
ual who passes it is given a much more difficult item; and
if the item is not very precise, the individual is given an
item closer to the difficulty level of the item just answered.
The opposite is true for failing an item. The score is
directly related to the difficulty of the item to which the
individual is directed at the final stage of testing.
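As a sketch, the branching rule just described might look like this in code. The text does not give a numerical relation between precision and step size, so the inverse rule and the `scale` constant below are illustrative assumptions:

```python
def next_difficulty(current, passed, precision, scale=1.0):
    """Pick the difficulty of the next item after a pass or a fail.
    A very precise item (small `precision`) justifies a large step away
    from the current difficulty; an imprecise item justifies a next item
    close in difficulty to the one just answered."""
    step = scale / (1.0 + precision)  # step shrinks as precision worsens
    return current + step if passed else current - step
```

For example, passing a precise item moves the examinee further up the difficulty scale than passing an imprecise one, and failing moves him down by the same amount.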
In addition to the comparison of testing methods, the
sequential test is examined for its strengths and weaknesses.
Methods of improving the sequential model are suggested from
the results so that even if the present procedure is not
better than the cumulative test, future sequential procedures
may be improved.
The present evaluation of the sequential method of
psychological testing consists of (1) a description of the
features of the sequential method as compared with the usual
cumulative test; (2) a description of some of the problems
encountered in the use of the cumulative test and how these
problems are handled by the sequential model; (3) a rationale
for the sequential solution; and (4) the formulation of
hypotheses as to the behavior of the cumulative and sequential
test models in regard to specific problems. Following the
hypotheses are (5) the limitations of the study and (6) an
overview of the remainder of the dissertation. To aid the
reader a few of the more frequently used terms in this dis-
sertation are explained in Appendix B.
I. DESCRIPTION OF THE SEQUENTIAL TEST MODEL
In any testing situation certain decisions must be made:
(1) the individual must be told where to start, (2) the
decision must be made when to stop testing, (3) the final
score must be determined, (4) the characteristics of each
succeeding item must be stipulated, and (5) the testee must
be informed as to where and how he should proceed. In the
cumulative test the character of these decisions is obvious.
Because they are unusual in the sequential item test, these
decision points will be described in some detail.
Starting Point
Depending on the purpose of the test and what one
therefore wishes to emphasize, the starting point may be at
any level of difficulty. For instance, one may start with an
easy item that most individuals will be able to pass and with
which the individual would feel comfortable, or one may start
with an item at the middle of the score distribution with no
consideration as to the individuals who may be taking the
test. The sequential test model developed in this paper has
the individual take as his first item one that would be con—
sidered at the fifty per cent level of difficulty for the
group of which he is a part. The reason for this choice is
explained in Chapter III, Section 1. The present discussion
must, of necessity, ignore the psychological effects which
need to be empirically determined.
Stopping Point
Criteria for deciding when to stop are also determined
by the purpose of the test. If doing the best job possible
in the time allowed is paramount, then everyone is given
the same number of items knowing that the extremes will be
better classified than the middle ability levels. (Note
that the criterion measure need not be a measure of ability
but could be an attitude or interest. However, in this dis—
sertation the criterion will be referred to as an "ability.")
If time is flexible and there is a prescribed degree of
accuracy for each score, then fewer items are used for the
extreme and more for the middle
ability levels. If the rapid classification of extreme
ability level individuals is desired, then one may stop
testing when it can be determined that the individual is
probably not at some middle ability level. In the sequential
model in this paper all people will take six items.
Scoring
Reasons for choosing one system of scoring over another
depend upon whether the score is to discriminate one ability
group from another, to discriminate among the individuals in
a group, or to describe the response pattern of the individual.
If one wishes to discriminate one ability group from another,
one would probably assign a score reflecting the difficulty
of the final item. If one wishes to discriminate among in-
dividuals, then the score may represent the number of people
in, for example, one hundred that the individual would rank
above in the population. If the score is to represent a
response pattern, it may be an estimate of the number of
items the testee could have answered correctly if he does
answer an item of a given difficulty, or it may identify the
precise pattern of correctly and incorrectly answered items.
The sequential test model in this paper assigns the individual,
as his score, the difficulty of the item to which he is
directed at the final stage of testing.
Pattern of Items
The problem in the sequential test is to select that
sequence of items which will yield the information needed
to assign the individual a score. At any step in the test
the decision as to the succeeding item to be taken may depend
upon (1) the number of preceding items one has answered
correctly, (2) the pattern of preceding items, or (3) the
difficulty and precision of the immediately preceding item.
This sequential model uses the difficulty and precision of
all preceding items to determine the next item.
Difficulty of the item for this model is measured in
terms of standard score units for a theoretically normal
group. An item that fifty per cent of the theoretical group
would pass is designated as ”0.00.“ The precision of the
item is essentially a measure of the validity of the item.
The measure of precision, σb, may be defined as the standard
deviation of the item characteristic curve. (It is also re-
lated to the measure of precision "h" used in psychophysics:
h = 1/(√2·σb); and, as Lord indicates, σb is identical with
his "bi".1)
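In later item-response terms the relation just described can be written out explicitly. The normal-ogive form below is a reconstruction consistent with Lord's model rather than a formula quoted from the thesis, and the symbol θ for ability level is an assumption:

```latex
P(\text{pass} \mid \theta) = \Phi\!\left(\frac{\theta - b}{\sigma_b}\right),
\qquad h = \frac{1}{\sqrt{2}\,\sigma_b},
```

where b is the item difficulty in standard-score units and Φ is the cumulative normal distribution function; a smaller σb (larger h) means a sharper, more discriminating item.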
Directions to Testee
The testee may be told how well he performed on any
given item, may be told what is right or wrong with his per-
formance, or may be simply directed to another item. Any
combination of the above may be used at different stages in
the test.
Individuals may be directed to items which are taken
by those who perform differently, or they may be directed to
an item unique to their pattern of response. Pattern of
response may be determined from correctness or incorrectness
only, or each alternative to any item may designate a different
sequence. In this sequential test, pattern was determined
from only correctness or incorrectness of items, and more
than one possible sequence of responses could lead to the
same item.
1Frederic M. Lord, A Theory of Test Scores, Psychometric
Monograph No. 7 (Chicago: University of Chicago Press, 1952),
p. 7.
Many methods of giving the necessary information to
the testee are available. In the empirical tests that have
been built by Krathwohl and Paterson, the succeeding item
that the individual should attempt is disclosed to the in-
dividual when he erases the opaque covering under the letter
that has been selected as the answer to the question at hand.
The final erasure disclosed a letter used to indicate a score
rather than the number of the next item.2 The testee must
answer each item as he comes to it as he receives no direc-
tions if he does not answer. Other response techniques which
could be used are tabs, envelopes within envelopes, sliding
masks, and scrambled books.
A Diagram of a Sequential Test Used in This Study
Figure 1 is a diagram of one of the sequential tests
used in this study. It is the one constructed by the "least
squares" method which is described later. The pattern shown
is only one of many possible sequential patterns.
Difficulty of items.--Items are represented by circles,
the ordinate position of which represents the difficulty of
the item. The closer the item is to the top of the page,
the more difficult it is. Difficulty is expressed in standard
score units, i.e., an item that fifty per cent of the normative
2Unpublished material developed in the Bureau of Edu-
cational Research, Michigan State University, East Lansing,
Michigan, 1956-1959.
[Fig. 1.--Graphic Representation of the "Least Squares" Sequential
Test Model. Items are plotted by difficulty in standard scores,
from -1.00 to 1.00, against the six stages of the test.]
group would answer correctly is labelled "0.00". An item
that 84 per cent of the normative group would answer cor-
rectly is labelled ”-1.00”.
Sequence of items.--The sequence is represented by the
abscissa value for the item. The first item of the test is
at the left-hand side; the sixth item at the right of the
diagram. The individual confronts one item at each "stage"
of the test.
Size of step.--The size of the step or the increase or
decrease in difficulty from the item at one stage to the
item at the next stage is represented by the difference in
ordinate positions of the items as can be seen in Figure 1.
There would be a large increase in the difficulty of the
second item if one were to correctly answer the first item.
There would be less difference between the easiest item at
stage four and the easiest item at stage five.
Route taken.--Lines slanting upward designate that
those who are considered to have passed an item at the
preceding stage should proceed to a more difficult item
for the next stage. Lines slanting downward designate that
the individuals are considered to have failed the item at
the previous stage and should proceed to a less difficult
item at the next stage. It may occur that passing a less
difficult item will lead the individual to a more difficult
item for the next stage than he would have encountered by
failing a more difficult item. In this case the lines be-
tween items will cross. (This case is not illustrated in
Figure 1.) The other alternative not yet mentioned is that
individuals passing a less difficult item or failing a more
difficult item may be led to the same difficulty of item
at the succeeding stage.
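The routing just described can be simulated minimally. Fixed step sizes and a deterministic pass rule (an examinee passes whenever his ability is at least the item's difficulty) are simplifying assumptions rather than features of the "least squares" model; the sketch only shows how the difficulty reached at the final stage serves as the score:

```python
def run_sequential_test(ability, n_stages=6, step=0.5):
    """Route an examinee through six stages: a pass moves him to a
    harder item, a fail to an easier one, and the score is the
    difficulty of the item to which he is directed at the final
    stage of testing."""
    difficulty = 0.0  # start at the fifty per cent level for the group
    for _ in range(n_stages):
        if ability >= difficulty:   # pass: harder item next
            difficulty += step
        else:                       # fail: easier item next
            difficulty -= step
    return difficulty
```

In this sketch an examinee above every item climbs steadily to the hardest reachable difficulty, one below every item descends to the easiest, and one in the middle oscillates near his own level.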
II. NEED FOR TEST IMPROVEMENT
In order to lay the background as to why the sequential
test is worth considering, one should examine what problems
have been encountered in the use of the cumulative test.
Present test procedures seem to have encountered three im-
portant problems related to: (1) utilization of items to
operate most efficiently with the group taking the test,
(2) controlling the score distribution to arrive at a useful
scale, and (3) production of a score with a precise meaning.
Maximally Efficient Use of the Items Selected
Once one has decided upon a purpose, then one can
solve the problem of the most efficient selection of items
either completely empirically, or theoretically in terms of
the effect of varying certain item characteristics. The
approach in this paper is the theoretical one. If one uses
this theoretical approach, one of the problems is that of
utilizing the most precise items available in a pool. The
cumulative test cannot always use all of the more precise
items.
In the cumulative test, if the score is the number of
correct responses and if all of the items are of equal dif-
ficulty, then a test with less precise items would give a
better measure of the scale of ability than a test with more
precise items.3
The above phenomenon has been called the ”attenuation
paradox." Violation of any one or a combination of the
following assumptions has been given as an explanation for
the attenuation paradox: (1) scores are normally distributed,
(2) ability is normally distributed, (3) the regression of
scores on ability is linear, (4) measurement produces an
interval scale of ability, and (5) response distribution is
homoscedastic. There is evidence to support the contention
that violation of any one of these could be the reason for
the lack of a monotonic relationship between item reliability
(precision) and the validity of scores in the usual testing
situation with the cumulative test.
One method of using the most precise items and increasing
test validity is to use a spread of item difficulties as sug-
gested by Brogden.4 However, this does not seem to be a
3Ledyard R. Tucker, "Maximum Validity of a Test with
Equivalent Items," Psychometrika, 11:1-13; March, 1946.
4Hubert E. Brogden, "Variation in Test Validity with
Variation in the Distribution of Item Difficulties, Number of
Items, and Degree of their Intercorrelation," Psychometrika,
11:197-214; December, 1946.
completely satisfactory solution because (1) there is no
scheme to determine the appropriate spread and (2) the most
extreme difficulties cannot be efficiently used any time
the majority of the individuals taking the item guess at the
answer.5 There should be some procedure which would allow
use of precise items no matter what their difficulty level.
If items are to be efficiently used in the discrimination
of a group into two parts, the items should be at the 50 per
cent level of difficulty for the hypothetical group the
median ability level of which is at the point where the
discrimination is desired.6 This means that if discrimina-
tions are desired among a few high ability individuals then
difficult items should be used. The usual cumulative test
cannot efficiently use such items.
5Paul E. Meehl and Albert Rosen, ”Antecedent Probability
and the Efficiency of Psychometric Signs, Patterns, or Cutting
Scores," Psychological Bulletin, 52:194-216; May, 1955.
6Brogden, op. cit.; Lee J. Cronbach and Willard G. War-
rington, "Efficiency of Multiple-Choice Tests as a Function
of Spread of Item Difficulties," Psychometrika, 17:127-147,
June, 1952; Frederick B. Davis, "The Selection of Test Items
According to Difficulty Level," American Psychologist, 4:243,
July, 1949; Harold Gulliksen, "The Relation of Item Difficulty
and Inter-item Correlation to Test Variance and Reliability,"
Psychometrika, 10:79-91, June, 1945; Lloyd G. Humphreys,
"The Normal Curve and the Attenuation Paradox in Test Theory,"
Psychological Bulletin, 53:472-476, November, 1956; D. N.
Lawley, "On Problems Connected with Item Selection and Test
Construction," Proceedings of the Royal Society of Edinburgh,
61 (Section A, Part III):273-287, 1942-1943; Jane Loevinger,
"The Attenuation Paradox in Test Theory," Psychological
Bulletin, 51:493-504, September, 1954; Frederic M. Lord,
"Some Perspectives on 'The Attenuation Paradox in Test Theory',"
Psychological Bulletin, 52:505-510, November, 1955; Frederic
Control of the Score Distribution
The problem of score distribution is not only to assign
a certain number of individuals to a given score, but to
assign only like individuals to that score. The particular
type of distribution which is desired depends upon the pur-
pose for which the test is designed. A normal distribution
is assumed in most statistical computations and interpre—
tations. A rectangular distribution would give the best set
of rankings in that people are spread evenly over all the
scores. A bimodal distribution may be desired to classify
individuals into accept or reject categories. Other than
differences in the use of scores, factors which influence the
score distribution are the distribution of ability levels of
those taking the test, the item precision, and the difficulty
of the items. A test able to produce any type of score
distribution desired, irrespective of the distribution of
ability level of those taking the test and irrespective of
the precision or difficulty of items available would have
considerable utility.
M. Lord, A Theory of Test Scores; M. W. Richardson, "The
Relation Between the Difficulty and the Differential Validity
of a Test," Psychometrika, 1:33-49, June, 1936; Thelma G.
Thurstone, "The Difficulty of a Test and Its Diagnostic
Value," Journal of Educational Psychology, 23:335-343, May,
1932; Ledyard R. Tucker, op. cit.; and David A. Walker,
"Answer-Pattern and Score-Scatter in Tests and Examinations,"
British Journal of Psychology, 30:248-260, January, 1940.
Meaning of a Score
The problem in assigning a meaning to a score is that
the conventional cumulative score is typically a conglomer—
ation which may represent the ability level of the individ—
ual, the rank of the individual, the pattern of response,
or any combination of these. It is not possible to clearly
represent the ability level of the individual with the usual
cumulative test. While it is possible to just rank individuals
or to just indicate the pattern of response with the
cumulative test, this is not usually done. (In indicating
the pattern of response the score is assigned to the sequence
of items passed not to the number of items passed.) It may
be useful to examine each of these possible elements in turn.
The ability level of the individual cannot be deter-
mined by knowing that he passed a difficult item in a
cumulative test, because all people must take each item and
difficult items are often passed by chance as the majority
of the group must guess at these items. This clouds any
interpretation of the number of correctly answered items as
a measure of performance. To get a better measure of the
ability level of the individual from the score, White and
Saltz have argued that the items should be scaled as to difficulty
so that one knows which set of items a person has
answered correctly if he knows the total number answered
correctly.7 The usual cumulative test score does not permit
one to infer which items the individual has passed. The
score in the type of test suggested by White and Saltz would
probably be used to represent the level of subject matter
learned rather than how the individual ranked with others.
In addition to the infrequent use of the above solution,
the suggestion does not solve the problem of the majority
of individuals guessing the answer to difficult items.
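The White and Saltz suggestion can be sketched in a few lines of modern code. Under a perfect difficulty scaling, the total score alone identifies the set of items answered correctly: the n easiest items. (The item names below are hypothetical and serve only as an illustration.)

```python
def items_passed(total_correct, items_easy_to_hard):
    """Under a perfect difficulty scaling, a total score of n implies
    that exactly the n easiest items were answered correctly."""
    return set(items_easy_to_hard[:total_correct])

# Hypothetical items ordered from easiest to hardest.
items = ["add", "subtract", "multiply", "divide", "factor"]
print(items_passed(3, items))  # the three easiest items
```

Note that the inference collapses as soon as the scaling is imperfect, which is the point made above about guessing on difficult items.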
To rank individuals in a normal distribution of ability
so they are spread evenly throughout the score range, the
test must make finer discriminations of ability at the middle
ability range than it does for the extremes. Thus the test
designed to rank individuals does not have a score scale
which has the same relationship to the ability scale at
the middle as at the extremes. Rarely is this relationship
of scores to ability level reported. The cumulative test
often compromises between using scores which rank individuals
best and scores which tend to be normally distributed (as
assumed in many statistical computations). The cumulative
test may do either of the above alternatives well, but the
decision made should be explicit and communicated to the
test user. The decision should be to use the test score which
permits one to infer rank (if this is what is desired), not
7Benjamin W. White and Eli Saltz, "Measurement of Repro-
ducibility," Psychological Bulletin, 54:81-99, March, 1957.
to contaminate the meaning of a score by forcing the scores
into a distribution just to create a higher correlation co-
efficient with normally distributed measures.
Another use of a score is to indicate the pattern of
response. Cronbach has concluded that one should be as
concerned with heterogeneity in content as in difficulty.
Since the ”level of difficulty" meaning for a score has
been discussed above, the ”heterogeneity with respect to
content" meaning is considered here. For example, one bit
of information is given when an individual is placed above
the mean in pitch discrimination. With another set of items,
the individual might be placed relative to the mean in visual
acuity. The two items (with heterogeneity with respect to
content) together place him in one of four categories. (If
the second item had been a further measure of pitch, then
he would have been placed in one of three categories with
respect to pitch). The use of items with heterogeneity in
respect to content thus seems useful, but one must remember
that to recover all four categories the test cannot be scored
by the number correct. Too often the items in cumulative
tests are heterogeneous with respect to content and the number
correct is used for the score. This cumulative scoring pro-
cedure permits the precise meaning of a score from a test
with perfectly precise items to be inferred only when the
individual possesses all of the characteristics above the
specified levels or possesses none of the characteristics at
or above the specified level. These cumulative scores are
even more difficult to interpret when the items are not
perfectly precise.
Rarely is any method of scoring other than the number
correct used, and, if the level of ability in any character-
istic is desired in conjunction with the pattern of charac-
teristics, the problems discussed above for reflection of
ability are added to lack of knowledge about which charac-
teristics the individual possesses.
III. RATIONALE FOR THE SEQUENTIAL ITEM MODEL
The sequential item model is now examined to show why
this model is expected to (1) give maximally efficient use
of items, (2) control the score distribution, and (3) yield
a score with a precise meaning. In addition, the rationale
for using one of the several sequential procedures is
presented.
Maximally Efficient Use of Items
The sequential test is expected to make optimal use
of all items, irrespective of difficulty, because this test
model provides that each item be at the fifty per cent level
of difficulty for the group taking the item. At each suc-
ceeding stage in testing the original group is divided into
progressively more homogeneous ability groups and the dif-
ficulties of items are matched to the average abilities of
each group taking the item. Thus the easiest items are taken
by the lowest ability groups and the hardest items by the
highest ability ones.
This procedure accords with the works of Brogden, Cron-
bach, Davis, Gulliksen, Humphreys, Lawley, Loevinger, Lord,
Richardson, Thurstone, Tucker, and Walker which indicate
that if one wishes maximum discrimination of a group into
two groups, then all items should be at the 50 per cent level
of difficulty for a hypothetical group the median of which
is at the point where the discrimination is desired.8 This
means that one needs difficult items to best discriminate
within high ability groups and easy items to discriminate
within low ability groups. The sequential procedure allows
the difficulty of the item to be suited to the ability level
of the group answering the item.
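The matching rule can be sketched as follows: for each subgroup produced by the branching, an item is chosen whose difficulty sits at the subgroup's median ability, so that about half of that subgroup is expected to pass it. (The ability values below are invented for illustration; the thesis's own difficulty computations appear in Chapter III.)

```python
import statistics

def pick_item_difficulty(subgroup_abilities):
    """Return the difficulty at which roughly half of the subgroup
    should pass: the median ability of the subgroup."""
    return statistics.median(subgroup_abilities)

low_group = [-1.8, -1.2, -0.9, -0.6]   # abilities after failing the first item
high_group = [0.6, 0.9, 1.2, 1.8]      # abilities after passing the first item
print(pick_item_difficulty(low_group), pick_item_difficulty(high_group))
```

Each branch thus receives an easier or harder item than the stage before, which is why difficult items are never wasted on low ability examinees.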
The second reason for assuming that the sequential test
will operate better than a cumulative test is that since dif—
ferent ability level individuals do not take the same items,
the number of low ability people passing a difficult item
by chance will not exceed the number of high ability people
passing the item due to their ability. As has been pointed
out by Meehl, in the cumulative test an item with poor dis-
criminating power is better than one with greater discrimin-
ating power if fifty per cent of the people are expected to
8See footnote 6.
pass the first, and only 10 per cent to pass the second.9
Control of the Score Distribution
The problem of control of score distribution is to
assign like people the same score, and to yield a score
distribution which will best serve the purpose of the test.
Since the distribution of scores depends upon the distri-
bution of ability of those taking the test and upon the
difficulty and precision of the items, Lord and Brogden
have each stated that for a normal distribution of ability
and with items of equal difficulty and usual precision, the
cumulative test cannot produce normally distributed scores.10
Humphreys has suggested that the answer is to spread the
item difficulties.11 He gives no method to show how such
a spread of difficulties is determined. Another answer is
the sequential process developed in this paper. It is
assumed that the sequential procedure will more adequately
control the score distribution because the items must operate
well for only a small group of people, not for all of the
individuals taking the examination. After precise items
are used to validly split a given group, the resulting groups
may be further divided into whatever size is desired by
using additional items of appropriate difficulty. Any number
of subgroups may be combined if desired to produce appropriate
9Meehl and Rosen, op. cit.
10Lord, A Theory of Test Scores, op. cit., p. 11; and
Brogden, op. cit., p. 207.
11Humphreys, op. cit.
distributions or to combine like individuals. These methods
of control should allow maximum control of the score distri-
bution.
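As a sketch (the subgroup counts below are invented), combining adjacent subgroups is simply a matter of summing their frequencies into the coarser score categories the test user wants:

```python
def combine_adjacent(counts, groupings):
    """Merge adjacent fine-grained ability subgroups (given as index
    ranges) into coarser score categories of the desired shape."""
    return [sum(counts[lo:hi]) for lo, hi in groupings]

# Eight fine subgroups collapsed into four score categories.
fine_counts = [2, 5, 9, 14, 14, 9, 5, 2]
print(combine_adjacent(fine_counts, [(0, 2), (2, 4), (4, 6), (6, 8)]))
# [7, 23, 23, 7]
```

Because only adjacent subgroups are merged, like individuals remain assigned to like scores while the shape of the distribution is brought under control.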
Meaning of a Score
A sequential test score may represent the ability level
of the individual, the rank of the individual, or the pattern
of response, but it does not represent more than one of
these at the same time. The ability level of the individual
is represented by the score when the score is the difficulty
of the final item. The rank of the individual is represented
by the rank of difficulty of the final item. (The rank scale
is an equal interval scale on ability when equal discriminations
are made at all ability levels--in this case rank of
difficulty and difficulty represent the same factor--the
ability level of the individual. If unequal discriminations
at different ability levels are made the scales represent
different information.) The pattern of response of the
individual would be represented by a score assigned to the
sequence of items taken in the sequential test. Even though
every individual may pass the same number of items, the se-
quence of items taken by an individual may be specified and
assigned a score different from that of an individual who
passed the same number of items but via a different route.
Different routes (sequences) will represent different items
being passed even though the number of items passed is iden-
tical.
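Such pattern scoring can be sketched by encoding each pass/fail route as a distinct number, so that two examinees who pass the same count of items by different routes still receive different scores. (The routes and the binary encoding below are illustrative, not the thesis's scoring rule.)

```python
def route_score(route):
    """Encode a pass/fail route as a distinct integer score; the binary
    code here is one arbitrary way to keep routes distinguishable."""
    return sum(bit << i for i, bit in enumerate(route))

route_a = (1, 0, 1, 1, 0, 1)  # four items passed
route_b = (0, 1, 1, 1, 1, 0)  # also four items passed, different route
print(sum(route_a) == sum(route_b), route_score(route_a), route_score(route_b))
```

The number correct is identical for the two routes, yet the scores differ, which is exactly the information a cumulative total discards.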
Since the sequential test has several scoring procedures
each yielding a different but precise score meaning, the
sequential score is more interpretable than the cumulative
test score which is typically a conglomerate of all of these
scoring procedures. In addition to the precision of meaning,
the different scoring procedures allow great versatility in
the use of the test.
Selection of the Sequential Procedure
The type of sequential procedure used depends upon the
purpose of the test: (1) rapid classification of extreme
ability individuals, (2) reaching a prescribed degree of
accuracy for each score, or (3) doing the best job possible
in the time allowed. In the present case the decision was
made to do the best possible job with six items. The reasons
for accepting this decision and the reasons for rejecting the
other decisions are outlined briefly.
The rapid classification of individuals may be thought
of as either classification into such categories as accept,
reject, and continue testing--or classification into score
categories which would more closely represent the results of
the more traditional scoring procedures. The classification
into the three categories closely resembles the procedure
developed by Wald for industry where the concern was to pre-
dict the number of faulty objects in the population. A
random sample of the population was used at each stage.
In the Wald procedure two sets of values are computed:
the one set is such that after each sample if results are
lower (e.g., in number of correct items) than a specified
number, then one may classify the population (or individual)
as rejected with a stated probability of error; and the other
set of values is such that after each sample, if results are
higher than a specified number, then one may classify the
population (or individual) as accepted with a stated
probability of error.12
Fiske and Jones have advocated that the sequential pro-
cedure as outlined by Wald be used only when the problem in-
volves the choice between two possible parameter values
which can be specified on a priori, but not arbitrary
grounds.13
To classify people into additional categories, Cowden
modified the Wald procedure. He assumed that the fewer items
one needed to meet the criteria for classification into either
the accept or reject categories, the farther the individual
was from the specified level. He thus created five cate—
gories with the extreme categories being classified very
rapidly with few items.
The second sequential procedure suggested above--that
is, classifying until a specific degree of accuracy has been
reached--has not yet been investigated. Exploration of this
12Abraham Wald, Sequential Analysis (New York: John
Wiley and Sons, 1947).
13Donald W. Fiske and Lyle V. Jones, "Sequential Analysis
in Psychological Research,” Psychological Bulletin, 51:264—275,
May, 1954.
procedure was rejected because it was felt that this procedure
might be more fruitfully explored after there was more ade-
quate understanding of the interrelationships of the
variables involved in the sequential procedure developed in
this paper.
Whereas in the industrial system of sequential testing
the model assumes a random sample of ability at each level,
this is not the best procedure for obtaining information
about the ability level of an individual. Except in selec-
tion situations, the purpose is to determine the level of
ability the individual possesses rather than whether the
individual is above or below a given ability level. In the
sequential procedure developed in this paper, a random sample
of the individual's behavior is not used; there is rather an
attempt to classify individuals into as many ability cate-
gories as can be adequately differentiated. There has been
no mathematical model developed for the above procedure and
the apparent alternative of developing one did not seem
fruitful at this time. An empirical study of the problem did
not seem fruitful because neither the ability level of individuals,
the precision of the items, nor the difficulty of the
items can be determined exactly. The best alternative seemed
to be that of creating exact data and then creating a model
which would use this data in a manner resembling the actual
situation.
Preliminary work with the sequential procedure had
used a probability model that had been empirically checked
with actual data and which had been programmed for the
electronic computer.14 It was thus decided to take advantage
of the computer program for this study. The program used
six items and permitted calculation for any sequence
possible where items were used to make dichotomous decisions.
IV. HYPOTHESES
The problems of testing are best described according
to the type of decisions that need to be made; however,
the investigation of these problems is best classified
according to the variables that are changed. Changes in
any variable, such as the type of ability distribution of
those taking the examination, may affect one or more of
these problems.
From the rationale developed in the previous section,
one can deduce the effects these variables should have on
efficiency, control of score distribution, and type of score
produced. The rationale will explain the effect of the
variables when used with the six-item cumulative test with all
items at the fifty per cent level of difficulty as well as
when used with the sequential model. The one exception to
this statement is that Lawley's work would indicate that
14Unpublished material developed in the Bureau of Edu-
cational Research, Michigan State University, East Lansing,
Michigan, 1956-1959.
precise scores (scores which have small variance of ability
level for individuals assigned the score) are created for
only a single group by using items quite removed from the
ability level of those individuals whom one wishes to
precisely classify. For example, if we wished to have the
extreme scores precisely defined then we would use items at
the fifty per cent level of difficulty. The hypotheses on
precision of score are derived from the above conclusion
of Lawley. The score distribution examined in this study
is the one actually produced although it is clear that scores
could be combined to yield shapes of distributions different
from the one initially produced. The score meaning that
is examined here is that of reflection of the criterion
ability scale.
The general hypotheses arising out of the rationale will
be described here. The operational hypotheses that are tested
are stated in Chapter III. There are (1) a set of hypotheses
concerned with the effect of the type of ability distribution
on both the six-item cumulative model and the six-item
sequential test model; (2) a set of hypotheses concerned with
the effect of precision and difficulty on the output distri-
bution of the sequential test model; and (3) a set of
hypotheses concerned with the effect of the errors in estim-
ating the parameter values on the output.
Effect of the Type of Ability Distribution
The effect of type of ability distribution on maximally
efficient use of items may be examined by determining the
variance of scores which are assigned to a given ability
level, or by examining the variance of ability levels assigned
to a score. "Discrimination among ability levels" shall be
used to designate whether different ability levels are
assigned different scores, and “precision of scores" shall
be used to indicate whether all individuals at that score
are of approximately the same ability level. Another method
of determining the effect of type of ability distribution is
to determine discrimination among people. (This procedure
involves decisions as to both control of score distribution
and meaning of the score produced.) Discrimination among
people is a measure of the ability of the test to rank
individuals according to ability. This type of discrimina-
tion is not considered in the following hypotheses.
As the sequential test being considered here is one de-
signed to discriminate among ability levels, it should work
quite efficiently for all distributions with respect to the
separation of the ability levels and the reflection of the
actual ability distribution in the score distribution. As
will be shown in Chapter II in the review of Lawley's work,
the cumulative test should have a greater precision of scores
for extreme scores, but should be equal to the sequential in
its ability to accurately discriminate among the ability
levels of individuals only at the middle ability levels.
These expectations are examined under conditions where two
different distributions are input--normal and U-shaped.
Normal distribution.--(1) The cumulative and sequential
test models should have equal ability to classify individuals
of mean ability level. This hypothesis follows from the
fact that middle ability people will take 50 per cent level
of difficulty items in the cumulative test, and should take
items near the 50 per cent level of difficulty in the sequen-
tial test. If the sequential does not operate efficiently,
the cumulative test will have the more discriminating scores.
(2) The sequential test model should more accurately
classify the individuals at the extremes of the ability scale
than should the cumulative test model. This is based upon the
rationale that the sequential test can use difficult items
because it discriminates among high ability individuals (as
these items are at the 50 per cent level of difficulty for
these high ability individuals). The test item does not have
to discriminate between low and high ability individuals as
only high ability individuals will take the item.
(3) The cumulative test model should have more precise
scores at the extremes of ability than the sequential test
model. This follows from the work of Lawley which showed that
the variance of ability levels for individuals assigned to
high scores would be low if the items were easy for these
individuals.
(4) The scores for the cumulative test model should
represent finer ability units in the middle than at the
extremes while the sequential test model scores should
reflect the ability level scale. The best discriminations
among ability levels should be made by using items at 50 per
cent level of difficulty for the hypothetical group the
median ability of which is at the point where the discrimin—
ation is desired. For the cumulative test the best discrim-
inations should be at the 50 per cent level of ability;
whereas, in the sequential test items should discriminate
quite equally over the entire range of ability.
U-shaped distribution.--(1) The sequential test model
should more accurately classify the individuals of category
13 (see "Ideal T Score" in Table 24) than the cumulative
test model. Category 13 individuals are the focus of consid-
eration because in a U-shaped distribution few people are at
the mean and the question becomes how well one can classify
individuals who exist in larger number and are not at the
extreme. Category 13 represents this mean value for those
individuals in the upper half of the distribution of ability.
The reason that the sequential should more accurately classify
these people is that the items are more appropriate for their
level of ability than 50 per cent level of difficulty items
used in the cumulative.
(2) The sequential test model should more accurately
classify the individuals at the extremes of the ability
distribution than the cumulative test model. The reason for
these expected results is again that items are more appropri-
ate for the individuals, and individuals taking the items
have a smaller variance in ability than those taking the
cumulative items.
(3) The cumulative test model should have more precise
scores at the extremes than the sequential test model.
Again this follows from Lawley's work.
(4) The sequential test model should have equal score
discriminations for all groups including the mean group,
whereas the cumulative test model should have finer score
discriminations for middle ability levels than for the extreme
ability group. This follows from the wide distribution of
item difficulties used in the sequential as compared to the
cumulative tests. Items discriminate best only at one
ability level and should be used only with individuals close
to that ability level.
Effect of Item Precision and Difficulty for
the Sequential Test
The relationship of item precision and difficulty to
output characteristics must be examined together as change
in precision results in change of the appropriate difficulty
levels in the manner described in Chapter III. There are
five levels of precision used: rbis = .79, .75, .71, .60,
and .45. Since the ability distribution also affects score
distribution, a normal distribution of ability is used as
this is the type of distribution most likely to occur in the
practical situation.
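The abstract notes that item precision may be computed from the item-total biserial correlation. One common normal-ogive conversion divides the correlation by the standard deviation of its residual; this formula is an assumption for illustration, not necessarily the thesis's own computation.

```python
import math

def precision_from_biserial(r_bis):
    """A common normal-ogive conversion, a = r / sqrt(1 - r^2);
    assumed here for illustration, not taken from the thesis."""
    return r_bis / math.sqrt(1.0 - r_bis ** 2)

# The five biserial levels used in this study.
for r in (0.79, 0.75, 0.71, 0.60, 0.45):
    print(round(precision_from_biserial(r), 2))
```

Under this conversion the five biserial levels span roughly a two-and-a-half-fold range of precision, which is why the difficulty spreads appropriate to each level differ so markedly.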
(1) The variance of scores for a given ability level
should be less with the test using the most precise items.
The value for the precision of an item indicates how effec-
tively the item differentiates individuals of one ability
from those in the next closest ability level. If the item
is precise then each item can make a different distinction
in ability rather than more accurately making the distinction
that should have been made by a prior item.
(2) The test consisting of the most precise items
should have more equal discrimination between adjacent
ability levels than will the less precise test. If the
ability of an item to discriminate among ability levels is
dependent upon the difficulty level of the item, then the
more precise test which has a wider range of difficulties
should discriminate at all levels while the less precise test
which has a smaller range of difficulties should discriminate
well among middle ability individuals where difficulties are
appropriate. The less precise test should not discriminate
as well among extreme ability individuals where difficulties
are not as appropriate.
Effect of Errors in Estimating Parameters
The usefulness of the model for practical purposes de-
pends upon the sensitivity of the test design to the use of
an item which only approximates the precision and dif-
ficulty level which would be called for by the "ideal" model.
If the values need not be very accurately determined before
use can be made of the sequential test model, one is more
likely to use the model. Preliminary studies have indicated
that the sequential test will probably be more sensitive to
precision estimates than to difficulty estimates. The
effect of errors of parameter estimates is the same effect
as is involved in the use of items which have parameter
values other than those required by the test.
As is noted in Chapter III, Section 1, each succeeding
item in a sequential test is selected in such a way as to
maximize discrimination based on data from the effects of
previous items. The effect of using a more precise item
than called for should be that the next item would not be
difficult enough or easy enough for maximum discrimination.
The effect of using an item too easy should be to increase
the precision of score for the upper group, but to decrease
the discrimination among ability levels.
Since the effect of errors made in early stages is
either corrected or magnified by the effect of later items,
and since the effect of errors made in later stages has no
chance to be corrected or magnified, one would expect dif-
ferences in the effect of errors at early and late stages.
The hypotheses made as to effect of errors at these different
stages are as follows:
(1) Errors in difficulty at an early stage should not
have any serious effects as there would be a wide range of
ability and the item would operate well for some of that
range.
(2) Errors in difficulty at the final stages should
increase the variance of ability levels assigned to one of
the two subgroups into which the total group would be
separated, but should not lower the variance of scores
assigned to the ability levels.
(3) Errors in estimates of the precision of the item
should be more serious in the initial stages where wide
separations in difficulty level of the next item would be
used.
(4) Errors in the estimates of the precision of the
items should make little difference at the final stages as
the next item would be appropriate.
If the sequential testing procedure is robust in that
errors in estimating parameters do not seem to greatly affect
type of output, then it would be possible to design the test
with parameter values determined from one sample of a popula-
tion and use this same test in different situations. (The
value used for the precision of the item is dependent upon
the spread of ability in the sample used to determine the
precision value. If the spread of ability is great in con—
trast to item sensitivity, one has a precise item. If the
spread of ability is narrowed, the same item would be consid-
ered a less precise item.)
V. LIMITATIONS OF THE STUDY
The three major contributions of this study are that
it: (1) discusses the problems of the cumulative test and
shows how the sequential model attempts a solution to each
of these; (2) provides a model that may be used in construc-
tion of any sequential test; and (3) presents a rationale
for the sequential test model which, when tested, should
allow the construction of additional sequential tests. There
are, however, many problems that are not examined. Six of
these are listed and discussed because the background material
gives suggestions as to the probable answers to these problems
also. These are: (1) the best possible cumulative test, (2)
the score distributions desired for the cumulative and
sequential models, (3) the types of ability distributions
that may be present in the usual situation, (4) likely test
parameters for usual test items, (5) commercial test construc-
tion procedures, and (6) test presentation procedures and the
psychological effects of the sequential model.
Best Cumulative Test
The work of Brogden and Humphreys indicates that the
best cumulative test with precise items is one with a spread
of difficulties.15 The exact relationship between spread of
15Brogden, op. cit.; and Humphreys, op. cit.
difficulties and precision to yield maximum validity (measured
by correlation with the ability distribution) is not known,
but Cronbach and Warrington indicate that for a cumulative
test of a given length, σᵢ² + σₑ² will have a preferred
value.16 (The term σᵢ is the standard deviation of the
spread of item difficulties and σₑ is the measure of precision
which is the same as the one used in this paper.)
The sequential test models are not compared to the best
possible cumulative model, but the use of items all at the
50 per cent level of difficulty creates a test that is more
than sufficient for most uses for most levels of precision.17
The purpose of the cumulative test model in this dissertation
is to put the sequential test model material into perspective.
Distribution of Scores
If the purpose of testing is selection, then a test
need only produce two scores, one for the individual who is
selected and the other for the one rejected. In this situation
the sequential model developed here would require modification
both in method of scoring and in number of items taken by
individuals. The previously discussed sequential model devel-
oped by Wald, involving a variable number of items taken by
16Cronbach and Warrington, op. cit.
17Ibid.
individuals, is probably the optimal solution. The problem
of test construction thus is no longer that of determining
the difficulty of the item, but rather the number of items
needed to make the most rapid classification. There is no
score distribution as such, only accept, reject and continue
testing categories of individuals.
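The Wald-type variable-length procedure described above may be sketched as follows. The pass probabilities p_hi and p_lo and the error rates are hypothetical values chosen for illustration, not parameters from the dissertation.

```python
import math

def sprt_classify(responses, p_hi=0.7, p_lo=0.3, alpha=0.05, beta=0.05):
    """Wald-style sequential classification: accumulate the log likelihood
    ratio of 'high ability' (pass probability p_hi) against 'low ability'
    (p_lo) over a variable number of items, stopping as soon as the ratio
    crosses an accept or reject boundary."""
    upper = math.log((1 - beta) / alpha)   # accept 'high ability'
    lower = math.log(beta / (1 - alpha))   # accept 'low ability'
    llr = 0.0
    for n, correct in enumerate(responses, 1):
        if correct:
            llr += math.log(p_hi / p_lo)
        else:
            llr += math.log((1 - p_hi) / (1 - p_lo))
        if llr >= upper:
            return ("accept", n)
        if llr <= lower:
            return ("reject", n)
    return ("continue testing", len(responses))
```

An examinee who answers every item correctly is accepted after only four items here, while an alternating pass-fail record never leaves the "continue testing" category; this is the sense in which the number of items taken, not their difficulty, carries the classification.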
The cumulative test used to differentiate two groups
would be one with all the items at the level of difficulty
appropriate for the ability level at which one wishes to make
the decision. A test of this nature would have a score
distribution which would be platykurtic, rectangular, or
bimodal depending upon the precision of the items in the test.
The test with most precise items would have a bimodal score
distribution.
If one desired to rank individuals by the scores from
the test, one would make fine discriminations in ability for
those ability levels where there were many people. In this
way the individuals would be assigned scores which would be
rectangularly distributed. This can be accomplished by
use of a cumulative test which has either fairly precise items
at the 50 per cent level of difficulty or a spread of item
difficulties for less precise items. For the sequential test,
there would be more items included at the difficulty level
appropriate for the discriminations that are desired.
The construction of either a sequential or a cumulative
test which has the score distribution discussed above is
outside the scope of this dissertation. Further research is
needed to determine the items for a sequential test which
would have a rectangular distribution with the input of a
normal ability distribution.
Ability Distributions
Lord has stated that perhaps test constructors should
not consider ability as normally distributed.19 It is
possible that a bimodal distribution of ability is common
in that there are many individuals who perform adequately
and many individuals who perform inadequately with a large
gap between these two performance groups. If this is true,
the sequential test model should operate well for these
distributions, as it should operate well with any type of
distribution. Aberrations in its operation would show up
most clearly when the test model is tested against a U-
shaped distribution of ability. In Chapter IV the results
are reported for testing the model against the U-shaped and
normal distributions. These results indicate how the sequen-
tial test scores may be interpreted when used with different
ability distributions. However, no rationale is developed
to indicate what the results should be and, therefore, the
interpretation of scores across ability levels depends upon
a rationale developed post facto, not upon the rationale
tested in the study.
19Lord, A Theory of Test Scores, op. cit.
Test Parameters
The effect of the number of items has not been examined.
The six-item test was used because the probability model for
the test had been programmed for the electronic computer and
six items were the maximum for this program. Further research
is needed to determine how rapidly the output characteristic
changes (if at all) when the test consists of more items.
Test Construction Procedures
The computational model described in Chapter III for
the construction of a sequential test has a method of
selecting items with the best possible parameter values. This
method could be used in the construction of a sequential test
with the data in terms of difficulty and precision taken from
actual items. The criterion may be a measure of the number
of individuals desired to pass the item or a measure of the
variance of ability levels of individuals assigned to the
pass and fail categories.
It would seem reasonable that one should use the most
precise items to differentiate the individuals as to ability
level and then the difficulty of a less precise item could
be used to control the number of individuals assigned to any
one score category. The second differentiation would not be
as valid as the one made with the more precise item, but the
shape of the distribution could be well controlled.
In addition to the lack of a complete evaluation of the
score distribution control procedure, there has been no attempt
to follow the standard criteria such as that published by the
Committee on Test Standards of the American Educational
Research Association.20 These criteria include content
validity, concurrent validity, predictive validity, con-
struct validity, error of measurement at different score
levels, equivalence of forms reliability, internal consis-
tency reliability, stability reliability, and information on
norms and scales.
Since this dissertation uses hypothetical data, content
validity is not considered. It is assumed that the test items
are homogeneous and thus measure only one content or ability
which may or may not be a composite of several abilities.
The six-item sequential is compared with the six-item
cumulative but no correlation is computed between the two
sets of scores, as is common in concurrent validity studies.
In this type of model one can probably obtain more informa-
tion from the correlation with a known criterion score than
from correlation between sequential and cumulative test scores.
The predictive validity of the test is not determined
as it made no sense to use hypothetical data to predict
hypothetical performance. Predictive validity needs to be
20American Educational Research Association, Committee
on Test Standards, and National Council on Measurements Used
in Education, Committee on Test Standards, Technical Recom-
mendations for Achievement Tests, 1955.
studied through the construction of a sequential test with
actual items, testing of a group, and then the prediction
of future performance. This would be a logical next step if
the model data studied here show that the sequential item
test is a better test than a six-item cumulative under the
conditions of this study. If the sequential test does not have
results which may be considered better than the results from
the cumulative test, then there is no need to study the
sequential under less favorable conditions.
In construct validity it is assumed that the character-
istics measured and related are not affected by the type of
items used in the test. Results from this study may be used
to indicate that these assumptions are not met in most situ-
ations. A study of the attenuation paradox literature should
make one aware of the problems involved in the measurement of
characteristics and their relationships. There is no attempt
to evaluate the construct validity of the sequential test.
Neither is there any attempt made to correlate test scores
with other abilities that should be related to the particular
hypothetical ability being measured. That which is measured
is any homogeneous ability measured by the items with the
given level of precision; all of the items in the sequential
model have the same precision.
Error of measurement at different score levels is
examined in detail as suggested by the criteria for evaluation
of a test. The discriminating power of the test at a given
level of test score is to be distinguished from the discrim-
inating power at a given level of ability. Both the variance
of the test scores of each ability level, and the variance of
the ability levels at each score are examined.
The equivalence of forms reliability is not determined
as there is only one form. It would be quite simple to build
two tests in a computer and determine how well the scores on
the one test could be predicted from the scores on the other.
It is possible that quite equivalent tests could be built
from quite different items. This possibility is not examined
in this dissertation.
Due to the hypothetical nature of the data the internal
consistency reliability is not examined. Stability reliability
is not determined as it would be necessary to administer a
test twice to a group to determine this, and no test is
actually used in this paper. This is another area that
needs to be examined.
There is a fairly complete discussion of the score dis-
tribution of the sequential item test. It is hoped that the
rationale which predicted the type of score distribution
would be proved correct and thus a tested rationale would
be presented rather than a rationale derived from the results.
Norms (like many of the criteria listed to evaluate a
test rather than a test procedure) are irrelevant to the test
procedure.
Another limitation to the study is that no attempt is
made to examine the effects of errors of estimating the
parameter values when the level of precision is low. However,
one would suppose that the effect of errors will be less at
lower precision levels. If the effects produced at high
levels of item precision are within the error range for
practical significance, then there is little need to examine
the effects at low levels of item precision. If the effects
at high levels of item precision are beyond the error allowed
for practical significance, then one must determine the effects
of lower item precisions or develop methods of obtaining
better estimates. This decision can be made later.
Test Presentation Procedures and Effects
In the area of sequential test presentation to the testee
little is known as to how to proceed in actual practice. For
example, it may be psychologically advantageous to give the
easiest items first, allowing some individuals to subsequently
try more difficult items, rather than to have everyone start
at an item of 50 per cent difficulty. Since the test is not
given to an actual group this procedure cannot be examined in
this dissertation.
the greater the number of scores.)

    σ²E(X) = n[t₀(1 − t₀) − t₁²ρ₁ − t₂²ρ₁² − t₃²ρ₁³ − . . .]

When the mean difficulty of items is at the 50 per cent level
of difficulty for the individual then the error variance
of the score is defined as below:

    σ²E(X) = n(1/4 − ρ₁/2π − . . .)

The terms are defined as follows:
    σ²E(X) = error variance of score
    n = number of items
    X = score value

23D. N. Lawley, "On Problems Connected with Item Selec-
tion and Test Construction," Proceedings of the Royal Society
of Edinburgh, 61 (Section A, Part III):273-287, 1942-1943, p.
273.
    t₀, t₁, t₂, etc. = values from Table 29 of Pearson's
        Tables for Statisticians and
        Biometricians (ordinarily used to
        calculate rₜₑₜ)
    ρ₁ = σᵢ² / (σₑ² + σᵢ²)
    σᵢ² = variance of item difficulties (standard score
        form)
    1/σₑ² = precision of item
From these equations, and the assumptions mentioned
above, one can determine that a large ρ₁ would reduce the
error term whether the ability level is equal to the mean
difficulty of the items or not. The size of ρ₁ can be in-
creased by decreasing σₑ² (using more precise items), by
decreasing σᵢ² in the denominator (or using all items at one
difficulty), or by increasing σᵢ² in the numerator (using
items at more than one difficulty level). This immediately
suggests that the best procedure is to use more precise items
if one wishes to reduce error variance in the score, as σᵢ
appears in both the numerator and denominator. This is in
contrast with the most valid test results reported by Tucker,
who empirically found that the most valid test was the test
with imperfect items.25
Another way of reducing error variance would be to use
small t₀ values. (The value t₀ is necessary to enter
25Tucker, op. cit.
Pearson's tables.) Lawley gives the following formula for
t₀:26

    t₀ = area of the normal curve beyond the deviate (x − d̄)/σ

where
    x = ability level (standard score form)
    d̄ = mean difficulty level of cumulative test
    σ² = σₑ² + σᵢ² (as defined above)
To aid in the understanding of the interpretation of
the formula given above, the following summary data are reported
for a test with the mean difficulty level of items nearly
equal to the mean ability level (d̄ = .045) and with a σ
(a combination of the spread of item difficulty and precision
of items) of 1.30 for a 100-item test.27 The values of σ²E(X)
for given values of (x − d̄)/σ are as follows:

    (x − d̄)/σ    σ²E(X)
    0.0          20.8
    0.1          20.7
    0.2          20.4
    0.3          19.8
    0.4          19.0
    0.5          18.0
    0.6          16.9
    0.7          15.6
    0.8          14.3
    0.9          13.0
    1.0          11.6

26Lawley, op. cit., p. 279.    27Ibid.
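The figures above may be reproduced from Lawley's series, σ²E(X) = n[t₀(1 − t₀) − Σᵣ tᵣ²ρ₁ʳ], generating the tetrachoric functions tᵣ from a Hermite-polynomial recurrence. Since the text reports only the combined σ = 1.30 and not the σᵢ/σₑ split, the value ρ₁ = 0.26 used here is an assumption inferred to match the reported figures, not a value stated in the dissertation.

```python
import math

def error_variance(z, n=100, rho1=0.26, terms=12):
    """Error variance of the number-right score at standardized
    distance z = (x - dbar)/sigma from the mean item difficulty.

    Evaluates n * [t0(1 - t0) - sum_r t_r^2 * rho1^r], where
    t_r(z) = phi(z) * He_{r-1}(z) / sqrt(r!) are the tetrachoric
    functions (the quantities tabled in Pearson's Table 29)."""
    t0 = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))  # normal-curve area
    phi = math.exp(-z * z / 2.0) / math.sqrt(2.0 * math.pi)
    he_prev, he = 0.0, 1.0      # He_{-1} (dummy) and He_0
    factorial = 1.0
    series = 0.0
    for r in range(1, terms + 1):
        factorial *= r
        series += (phi * phi * he * he / factorial) * rho1 ** r
        he_prev, he = he, z * he - (r - 1) * he_prev  # Hermite recurrence
    return n * (t0 * (1.0 - t0) - series)
```

With these assumptions the series returns 20.8 at z = 0.0, 18.0 at z = 0.5, and 11.6 at z = 1.0, matching the tabled values; with ρ₁ = 0 it reduces to the binomial value npq = 25 for a 100-item test.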
As can be seen from the preceding data, for given
d̄ and σ values, the higher the ability level (x), the
lower the error variance for the score (σ²E(X)) for a cumu-
lative test. If the items had a large value (fixed) for the
mean difficulty level (i.e., the value of d̄ increased)
then the value of (x − d̄)/σ would be smaller and thus the
error variance (σ²E(X)) would be larger.
Lawley also pointed out that the effective discrimin-
ating power of a test may be computed as follows:28

    (E(X) − E(X′)) / (σE(X) + σE(X′))

x and x′ are two different ability levels.
X and X′ are two different score values.
Other terms are defined as before. If x = d̄ the above
formula takes a simpler form.
As Lawley pointed out, in order to increase the effec-
tive discriminating power the numerator must be increased,
which means obtaining large values for E(X) − E(X′), or the
denominator may be decreased; and, assuming σₑ² is constant
(as one cannot change precision), then one must change σᵢ,
which is the spread of difficulty.29 The smaller the spread
of difficulties, the lower the value.

28Ibid., p. 280.
29Ibid., p. 281.
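Lawley's argument may be sketched numerically: holding σₑ fixed and shrinking the spread of difficulties σᵢ raises the effective discriminating power D = (E(X) − E(X′))/(σE(X) + σE(X′)). The error variances are evaluated with the first terms of the tetrachoric series given earlier, and all parameter values below are illustrative assumptions.

```python
import math

def _norm_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def _error_variance(z, n, rho1, terms=10):
    # n * [t0(1 - t0) - sum_r t_r^2 rho1^r], via a Hermite recurrence
    t0 = _norm_cdf(z)
    phi = math.exp(-z * z / 2.0) / math.sqrt(2.0 * math.pi)
    he_prev, he, fact, series = 0.0, 1.0, 1.0, 0.0
    for r in range(1, terms + 1):
        fact *= r
        series += (phi * phi * he * he / fact) * rho1 ** r
        he_prev, he = he, z * he - (r - 1) * he_prev
    return n * (t0 * (1.0 - t0) - series)

def discriminating_power(x, x_prime, dbar, sigma_e, sigma_i, n=100):
    """D = (E(X) - E(X')) / (sigma_E(X) + sigma_E(X')) for two ability
    levels, with E(X) = n * Phi((x - dbar)/sigma)."""
    sigma = math.sqrt(sigma_e**2 + sigma_i**2)
    rho1 = sigma_i**2 / sigma**2
    z, zp = (x - dbar) / sigma, (x_prime - dbar) / sigma
    numerator = n * (_norm_cdf(z) - _norm_cdf(zp))
    denominator = (math.sqrt(_error_variance(z, n, rho1))
                   + math.sqrt(_error_variance(zp, n, rho1)))
    return numerator / denominator

# Same precision, narrower spread of difficulties -> larger D.
d_narrow = discriminating_power(1.0, -1.0, 0.0, sigma_e=1.0, sigma_i=0.2)
d_wide = discriminating_power(1.0, -1.0, 0.0, sigma_e=1.0, sigma_i=0.8)
```

Under these assumed values D falls from roughly 9.2 to roughly 7.8 as the spread of difficulties is widened, which is the direction of Lawley's conclusion.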
The effective discriminating power for a test would
thus be greatest when the mean difficulty of items was
equal to the ability level for the extremes in ability, and,
when there was no spread of item difficulties. This type
of test would be used to create scores which would be
assigned only to individuals that are the same. It is not
used to differentiate between the ability levels of individ-
uals. The same logic which states that middle scores will
be more precise (i.e., representing only one type of ability
level individual) when difficulties are extreme would indicate
that extreme scores will be more precise when 0.00 level of
difficulty items are used in the test. (Remember the formula
uses x − d̄, so it would operate for either extreme of difficulty.)
Support for this position is given by Lord who stated
that the standard error of measurement would be practically
zero for extreme positive or negative values of ability.30
He argued that there would exist individuals whose ability
would be so low that the test would not be discriminating for
them, and other individuals whose ability would be too high
to be discriminated. The standard error of measurement is low
for these zero or perfect scores and is necessarily smallest
for those examinees for whom the test is least discriminating.
The above solutions to changing the criteria of test
validity still do not exhaust the solutions to the attenuation
3OLord, A Theory of Test Scores, op. cit., p. 14.
paradox. Brogden offers yet another solution. He found
that the correlation continued upward when a spread of item
difficulties instead of one level of difficulty was used.31
He concluded that the problem was that of determining the
distribution of item difficulties to yield a more valid
score. Brogden showed that by using items with rₜₑₜ = .60
or higher, a distribution of item difficulties will produce
(for an 18-item test) a higher validity than will be obtained
with all items at the .50 difficulty level.32 The spread of
difficulty seemed to be important when items were of this
reliability.
Brogden's solution of determining the spread of items
for a test such that the results would correlate highest
with a criterion seems to be inadequate since there remain
the problems of measuring the relationship and the meaning
of the coefficients that are computed. It is impossible to
solve all of these problems at this time, but assuming that
the difficulty of the item is an adequate score, and assuming
that discrimination among ability levels (with an examination
of the effective discriminating power) is the important ques-
tion, a rationale can be built for the sequential test devel-
oped in this dissertation.
Two areas of literature will now be examined to build
the rationale for effective use of items in the sequential
31Brogden, op. cit., p. 240.    32Ibid.
test. They are (1) literature on Bayes' Theorem, and (2)
literature on the use of items at the 50 per cent level of
difficulty for the hypothetical group with a median ability
level equal to the value at which the discrimination is
desired.
Meehl and Rosen, through the use of Bayes' Theorem,
point out that the practical value of a psychometric sign,
pattern, or cutting score depends jointly upon its intrinsic
validity (in the usual sense of its discriminating power)
and the distribution of the criterion variable (base rates)
in the clinical population.33 They note that if the base
rates of the criterion classification deviate greatly from
a 50-50 split, the use of a test sign having only moderate
validity will result in an increase of erroneous clinical
decisions.
One reason that the sequential test is assumed to have
maximally efficient use of items is that the base rate does
not have to deviate from the 50-50 split. The other reason
is that the sequential test uses items at the 50 per cent
level of difficulty for the group taking the item. These
items have been found to be efficient with various criteria
for efficiency.
Lord concluded from maximizing the ratio of difference
in means to standard error of difference, that if one desires
33Meehl, Paul E. and Rosen, Albert, "Antecedent Proba-
bility and the Efficiency of Psychometric Signs, Patterns, or
Cutting Scores," Psychological Bulletin, 52:194-216, No. 3,
1955.
to construct a test that will have the greatest possible
discriminating power for examinees of a given level of
ability, c = c₀, then all items should be of equal difficulty
(no spread) and of such difficulty that half of those exam-
inees whose ability score is c₀ would answer each item
correctly and half would answer it incorrectly.34 This
measure of discriminating power is completely independent of
the distribution of ability in the group tested.
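Lord's condition can be illustrated with the normal-ogive item model assumed throughout: the probability of passing is the normal-curve area up to (ability − difficulty)/σₑ, and the item's discriminating power, the slope of that curve in ability, peaks exactly where difficulty matches ability. The parameter values below are illustrative.

```python
import math

def pass_probability(ability, difficulty, sigma_e=0.5):
    """Normal-ogive probability of passing an item."""
    z = (ability - difficulty) / sigma_e
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def item_slope(ability, difficulty, sigma_e=0.5):
    """Slope of the item curve with respect to ability:
    phi((a - d)/sigma_e) / sigma_e. It is greatest when the item's
    difficulty equals the ability level c0 at which discrimination
    is wanted, which is Lord's equal-difficulty condition."""
    z = (ability - difficulty) / sigma_e
    return math.exp(-z * z / 2.0) / (sigma_e * math.sqrt(2.0 * math.pi))

c0 = 0.0  # the ability level of interest
matched = item_slope(c0, difficulty=0.0)   # item at 50% for c0
too_hard = item_slope(c0, difficulty=1.0)  # item too difficult for c0
```

A matched item gives the steepest curve at c₀; at that point exactly half of the examinees at c₀ pass, which restates the 50 per cent rule in the model's own terms.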
However, when item precision is such that item-total
biserial correlations are .447, Lord empirically showed that
a test composed solely of items at the 50 per cent difficulty
is more discriminating (as measured above) than any other
test for examinees at any level of ability between -2.5 and
+2.5.35 Lord does not show results of more highly correlated
items which will be investigated in the present study.
Lord's empirical study above is supported by Cronbach and
Warrington's theoretical study. They stated that for
items of the type ordinarily used in psychological tests,
the test with uniform item difficulty gives greater over-all
validity and superior validity for most cutting scores, as
compared with a test with a range of item difficulties.36
It is the cutting score validity which is new here and of
some relevance to the sequential test constructor. For
34Lord, A Theory of Test Scores, op. cit., p. 26.
35Ibid., p. 29.
36Cronbach and Warrington, op. cit., p. 127.
example, Cronbach and Warrington found that if σₑ = .2
(i.e., rₜₑₜ = .94 or φ = .80), if no guessing is possible
(or rₜₑₜ = .55 or φ = .37 if the probability of chance suc-
cess by guessing is one-third), and if all items are at the
50 per cent level of difficulty, better results are obtained
for separating out from 40 to 62 per cent below the cutting
score than if there were a normal distribution of item
difficulties.37
The empirical determination of the best difficulties
for discrimination has not always been as nonsupportive of
the present rationale as the work of Lord. Lord used discrim-
inating power (as defined by him) as his criterion. Richard-
son's empirical study had more supportive results. He created
five subtests of different difficulty levels: 78-95, 60-77,
41-54, 23-40, and 5-22.38 He then calculated the biserial
correlations for 23 different divisions of the criterion
starting at 4.17 per cent of the people in the lower category,
and, by percentage units of 4.1667, continuing to 95.83 per
cent in the lower category. He graphed these results and
noted that the test consisting of items from 78-95 per cent
passing produced the highest biserial correlation for those
divisions where 4.17 to 25.00 per cent of the people were in
the lower category. Likewise the 60-77 per cent pass test
37Ibid., p. 135.
38M. W. Richardson, "The Relation Between the Difficulty
and the Differential Validity of a Test," Psychometrika, 1:33-
49, No. 2, June, 1936.
was best for the 25.00 to 35.00 divisions; the 41-54 per cent
pass test for the 35.00 to 61.50 divisions; the 23-40 per
cent pass test for the 61.50 to 82.00 divisions; and the
5-22 per cent pass test produced the highest biserials
where 82.00 to 95.83 per cent of the people were in the lower
category. Although these results are from 50—item tests, the
results indicate that different difficulty tests for differ-
ent discriminations should be useful.
Other results from studies which would support the
position that items at the 50 per cent level of difficulty
for the group are the best items are those which indicate
differentiation of a group by items of different difficulty.
In these studies the ability levels of the individuals are not
known and differentiation for each ability level is not re-
ported separately. The reader must assume that the individ-
uals were normally distributed around an ability level equal
to the difficulty of the items. If this assumption is made
then low differentiation by difficult items supports the con-
clusion that items appropriate for the ability level are the
best items.
Such a study as described above is reported by Cleeton.
Cleeton used four well-selected ability groups: one superior
group and three inferior groups.39 He then constructed two
measures of the differential or predictive value of the test.
39Glen U. Cleeton, "Optimum Difficulty of Group Test
Items," Journal of Applied Psychology, 10:327-340, No. 3,
September, 1926.
One of these was (R₁ - R₄) in which R stands for the number
of items answered correctly by group 1, 2, 3, or 4. The
other measure was (R₁ - R₂) + (R₁ - R₃) + (R₁ - R₄) +
(R₂ - R₃) + (R₂ - R₄) + (R₃ - R₄), terms having the same
meaning as above. These are criterion II and criterion I
in the following results, respectively. Cleeton examined
difficulty by grouping 1/10 of the items in each interval
and by grouping 1/10 of the range of difficulty in each
interval. For present purposes it is most informative to
look at the actual difficulty divided into 10 parts even
though the number of items in each interval is different.
The following data show the results of 240, 240, and 480
individuals each taking three tests of 400, 236, and 109
items. (For the computation of criterion indices, Cleeton
assumed that he had only 720 individuals.)
    Interval for     Rank          Rank          Value         Value
    % Passing Item   Criterion I   Criterion II  Criterion I   Criterion II
    91 - 100          8             8             44.4          14.7
    81 - 90           6             6            104.9          28.9
    71 - 80           5             5            125.9          40.8
    61 - 70           4             4            152.6          46.8
    51 - 60           3             3            158.9          47.6
    41 - 50           1             1            175.2          51.3
    31 - 40           2             2            163.9          51.1
    21 - 30           7             7             85.8          26.1
    11 - 20          10            10             35.9          11.1
    0 - 10            9             9             37.3          11.9
From the above data one may determine that the slightly more
difficult items seem to have the greatest predictive value
as measured by both these estimates of predictive value.
This would support the decision to use items at the 50 per
cent level of difficulty for the group which is to be dis-
criminated among.
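Cleeton's two indices can be written out directly. The R values below are hypothetical counts, used only to show that criterion I is the sum of all six pairwise differences while criterion II is the extreme-group difference.

```python
from itertools import combinations

def criterion_2(r):
    """Cleeton's extreme-group index: R1 - R4."""
    return r[0] - r[3]

def criterion_1(r):
    """Sum of all pairwise differences (R1 - R2) + (R1 - R3) + ... +
    (R3 - R4), for four ability groups ordered superior to inferior."""
    return sum(a - b for a, b in combinations(r, 2))

# Hypothetical numbers of correct answers by groups 1 (superior) to 4.
R = [10, 8, 5, 2]
```

For these counts criterion II is 8 and criterion I is 27; an item equally easy for all four groups would score zero on both, which is why low values mark items of little differential value.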
Logical analysis also supports the above decision.
Flanagan pointed out the extremes of this difficulty and
item validity argument. He stated that if one wanted the
maximum amount of discrimination between the individuals in
a particular group, a test should be composed of items all
of which are at 50 per cent difficulty for that group,
provided the intercorrelations of all the items are zero.40
If intercorrelations were other than zero, the decision
would not be this clear.
Lord studied theoretical test models which had either
high or low item reliabilities with easy, difficult, or easy
and difficult test items. After examining the relationship
of the true score distribution to the distribution of ability,
he reached the following conclusion:41

    A test composed of items of equal discriminating
    power but of varying difficulty will not be as
    discriminating in the neighborhood of any single
    ability level as would a test composed of similar
    items all of appropriate difficulty for that level.

40John C. Flanagan, "General Considerations in the
Selection of Test Items and a Short Method of Estimating the
Product-Moment Coefficient from Data at the Tails of the
Distribution," Journal of Educational Psychology, 30:674-680,
No. 9, December, 1939.

41Lord, "The Relation of Test Score to the Trait Under-
lying the Test," op. cit., p. 543.
Thus, most of the literature supports (1) the use of items
at the appropriate difficulty for each level and (2) the
separation of individuals into groups that would have a base
rate of 50 per cent.
Because the base rate is near a 50 per cent split each
time, the sequential model should permit the use of only
moderately discriminating items. In the cumulative test,
there will be only 5 or 10 per cent of the individuals who
should pass a difficult item, as all people take the item.
In the sequential method 50 per cent should pass this dif-
ficult item, as only those with high ability will take the
item. According to Bayes' Theorem the probability of high
ability people passing the item must be much higher than the
probability of low ability people passing the item if 90 per
cent of those taking the item have low ability. Once the
group taking the item has a base rate of 50 per cent (as is
the case in the sequential method), then the item should work
better; i.e., increase the number of correct clinical decisions.
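The Bayes' Theorem argument can be made concrete. With hypothetical pass rates for high- and low-ability examinees, the posterior probability that a passer has high ability is weak under a 10-90 base rate but strong under the 50-50 split that sequential routing produces; all of the numbers below are illustrative.

```python
def posterior_high_given_pass(base_rate_high, p_pass_high, p_pass_low):
    """Bayes' Theorem: P(high ability | item passed), given the base
    rate of high ability in the group taking the item."""
    joint_high = base_rate_high * p_pass_high
    joint_low = (1.0 - base_rate_high) * p_pass_low
    return joint_high / (joint_high + joint_low)

# A moderately valid difficult item (hypothetical pass rates).
skewed = posterior_high_given_pass(0.10, p_pass_high=0.7, p_pass_low=0.2)
balanced = posterior_high_given_pass(0.50, p_pass_high=0.7, p_pass_low=0.2)
```

Here the posterior rises from .28 under the 10-90 split to about .78 under the 50-50 split, which is the sense in which routing makes a moderately valid item worth administering.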
In the sequential test, those groups which are different
in ability would use items at the 50 per cent level of dif-
ficulty for that group. This would allow the use of diffi-
cult items which are precise. Such items could not be
efficiently used in a cumulative test.
II. CONTROL OF THE SCORE DISTRIBUTION
The problem of score distribution is not only to
assign a specified number of individuals to each score value,
but also to assign like individuals to each score value. The
score distribution is not only related to the item parameters,
but should also be related to the use. The score distribution
problem may be studied through the use of a theoretical model
or empirically.
Lord attempted to study the problem of control of score
distribution through the use of a theoretical model. He made
the following assumptions: (1) the item characteristic curves
have the general shape typical of cognitive items that are not
answered correctly by guessing; (2) the items are homogeneous
in a certain specified sense; (3) the items are scored 0 or 1;
and (4) the raw test score is the number of items answered
correctly.42 (A homogeneous test is, for Lord's purpose,
defined as a test composed of items such that, within any
group of examinees all of whom are at the same ability level,
the response given to any item is statistically independent
of the response given to the remaining items.)
The generalizations reached by Lord were as follows:43
1. Since the test characteristic curve is in general
nonlinear, the test score distribution will not in
general have the same shape as the distribution of
42Ibid., p. 546.
43Ibid., pp. 541-542.
ability; in particular, if the ability distribution is
normal, the score distribution in general will not be
strictly normal.
2. U—shaped and roughly rectangular score distributions
can be produced provided sufficiently discriminating
test items can be found. (All appropriate individuals
pass or all appropriate individuals fail an item if they
are perfect items at the 50 per cent level of difficulty.)
3. Typically, if a test is at the appropriate difficulty
level for the group tested, the more discriminating the
test, the more platykurtic the score distribution.
4. The skewness of the test score distribution
typically tends in a positive direction as the test dif-
ficulty is increased above the level appropriate for
the group tested; in a negative direction as the test
difficulty is decreased below that level.
These generalizations aid in interpreting the empirical
results of a study made by Mollenkopf.44 He selected 1000
answer sheets chosen on the bases that: (a) every person
must have attempted every item, and (b) a wide range of scores
should exist in the sample chosen. Items were then chosen to
make up nine synthetic tests. These nine tests contained
score distributions with three types of kurtosis and three
types of skewness. A study of the literature revealed that
the total test score distributions were believed to be con-
trolled for skewness by item difficulty. However, since easy
items tended to have higher correlations with the total score
than did difficult items, control on mean difficulty alone
was found not to be sufficient. When building a test with a
symmetrical score distribution, Mollenkopf found that a set

44William G. Mollenkopf, "Variation of the Standard
Error of Measurement," Psychometrika, 14:189-229, No. 3,
September, 1949.
of items of the same type (all of difficulty close to .50)
yielded scores with a definitely flat distribution. (From
Lord's work, it looks as though the item precision must have
been very good.) To secure a leptokurtic score distribution
Mollenkopf tried sets of items with .40 and .60 difficulties,
but found that homogeneous sets of items of .20 and .80 dif-
ficulties were needed.
If one uses Lord's work to translate back from score
distribution (by assumed highly precise items) to ability
level, one can determine that the distribution of ability
must have been near normal. Also of interest in the Mollen-
kopf article is the fact that the standard error of measure-
ment for a nonskewed platykurtic distribution of scores is
greatest in the middle sections and lowest at the extremes.
This may be accounted for by what Mollenkopf has labelled
the "end effect."45 This effect means that at the ends
large differences in parallel forms cannot occur. A perfect
score is perfect in each half. Small empirically observed
errors of measurement are inevitable in the tail where the
pile-up occurs on skewed distributions but not for normal
distributions.
This explanation would suggest that the variance of
ability levels for a given test score may be small, but it
does not indicate, as Mollenkopf also pointed out, that there
is a small variance of scores for a given ability level.
Both points are of interest if reflection of the ability
distribution is desired in the score distribution.
The cumulative test can be used to yield the type of
score distribution that one wishes. The important parameters
are item difficulty and item precision, but only general
statements are available as to the relationship between
these parameters and the score distribution. Empirical
studies are used to determine exact parameter values for
given score distributions.
Humphreys stated that the variance of item difficulties
forces scores toward the center of the distribution and thus
counters the effect of high item intercorrelations.46 It is
thus necessary to have a spread of difficulties only if
the items are very precise. Whereas very highly intercor-
related items of one difficulty level would produce two
scores, if one were to use a spread, one could force people
into a distribution that would be expected to have some
validity. Humphreys advocated that the shape of the score
distribution be controlled by the difficulty level of the
test items.47 The type of distribution favored by Humphreys
was a rectangular distribution--a distribution that would
allow individuals to be ranked.
46Humphreys, op. cit., p. 474.
47Ibid., p. 475.
If the items were perfect, the procedure to produce
the rectangular distribution desired by Humphreys would be as
reported by Davis. Davis reported that if the tetrachoric
item intercorrelations are all unity, a rectangular distri-
bution of raw scores is most likely to be obtained by
selecting items with difficulty levels of 1/(n + l),
2/(n + l), 3/(n + l), . . . n/(n + 1). However, if the
tetrachoric intercorrelations are all .50, a rectangular dis-
tribution of raw scores is most likely to be obtained by
selecting all items at the 50 per cent level of difficulty.48
He argued that for any level of tetrachoric item intercorre—
lations from zero to .50, the maximum number of discrimin-
ations that could be made by the total score would be insured
by selecting all items at the 50 per cent level of difficulty.
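Davis's two rules can be stated compactly in code. This is a sketch of the rules as reported above; the function name is ours, and only the two cases Davis treats are covered.

```python
def davis_difficulties(n_items, r_tet):
    # Difficulty levels (proportion of examinees passing) that Davis
    # reports as most likely to yield a rectangular raw-score
    # distribution, for the two cases he treats.
    if r_tet >= 1.0:
        # Tetrachoric intercorrelations all unity: spread the
        # difficulties evenly at i/(n + 1), i = 1 .. n.
        return [i / (n_items + 1) for i in range(1, n_items + 1)]
    # Intercorrelations from zero through .50: all items at the
    # 50 per cent level maximize the discriminations made.
    return [0.5] * n_items
```

For a three-item test with perfectly intercorrelated items this gives difficulties of .25, .50, and .75; with intercorrelations of .50 or less, every item sits at .50.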
Davis went on to say that this simple mathematical
procedure employed to specify the exact difficulty levels
of items for two- and three-item tests cannot be applied to
specifying the exact difficulty levels of items for tests
containing larger numbers of items except in the limiting
case when the item intercorrelations are all unity. The
reason one cannot generalize is that when intercorrelations
are not unity, errors in classification will be made, and the
spread of ability represented by those who pass or fail will
be greater but undetermined. Thus, the appropriate difficulty
48Davis, op. cit., p. 103.
for the resulting group cannot be easily determined. The
effect of errors is difficult to determine, but as pointed
out by Davis, there is need for a general solution.
Whereas the general rules about control of score dis-
tribution are known, there is no general solution in the
sense that the actual score distributions are known. The
actual score distributions must be empirically determined
for each test. The literature indicates that if the sequen-
tial method of testing could more easily and predictably
control the score distribution, a real contribution would be
made to the solution of a difficult measurement problem.
III. MEANING AND USE OF SCORE PRODUCED
Both the score distribution and the meaning of a score
are related to the use of the test. Ferguson has pointed
out that for discrimination between two groups one would
need a bimodal distribution of scores; the discrimination
between two groups and among the members of one group would
require an asymmetrical distribution of scores; and, if one
were establishing the order of ability of individuals, one
would use a rectangular distribution. Ferguson concluded
that the construction of tests to yield distributions ap-
proximating the normal form results in a loss of discrimina-
tory capacity.49
49George A. Ferguson, ”On the Theory of Test Discrimin—
ation," Psychometrika, 14:61-68, No. 1, March,l949, p. 68.
Not all scores have the same meaning. A score resulting
from the discrimination between two groups is more a probabil-
ity statement that the individual should be classified into
a given category than it is a statement that the individual's
ability is at a certain level. The score from a test designed
to rank individuals compares any individual in relation to
others.
In addition to the meanings necessary for the above uses,
Gulliksen (as stated in the first section) would have the
score be the best estimate of the difficulty level reached.50
This type of score represents the ”true ability" level of
the individual. This type of score is also advocated by those
who argue for reproducibility as a measure of the best test.
However, it should be noted that it has been the practice to
determine how well a pattern of responses from an instrument
will reproduce original results, not hypothesized "true"
results. As reported by White and Saltz, these indices will
reflect without equivocation the amount of information thrown
away by representing the subject's performance on the test by
a total score based on the number of items passed. "They
indicate, in other words, how adequately a unidimensional
model fits the obtained data."51
50Gulliksen, op. cit.
51Benjamin W. White and Eli Saltz, "Measurement of
Reproducibility," Psychological Bulletin, 54:81-99, No. 2,
March, 1957, p. 95.
However, a reproducibility score from a unidimensional
test does not insure either an interval scale or a known be-
havior domain being sampled. Individuals may be ranked by
the test scores (compared to other individuals) or be assigned
an ability level (compared to a standard). The behavior
domain may be related to the test label or it may not--the
only assurance one has is that the domain is unidimensional.
The question as to domain samples (which seems like a
validity question) has actually been studied as a part of
reliability. Tryon in theory related reliability to the
behavior domain sampled.52 He reviewed the two theories of
test reliability: (1) the Spearman-Yule theory that tests
are unreliable because of an error factor and reliable because
of a true factor which may be a composite of more than one
common factor; and (2) the Brown-Kelley theory that reliabi-
lity may be explained by equivalent test-samples in which all
items in the total score have equal standard deviations and
equal intercorrelations. (To obtain equivalent test-samples
the content and difficulty of items must be considered, but
all items do not have to be equally difficult.)
Tryon defined reliability as the value of "correlation,
rtt, between the observed Xt scores and a second set of com-
posite scores, Xt', earned on a 'comparable form' of the Xt
52Robert C. Tryon, "Reliability and Behavior Domain
Validity: Reformulation and Historical Critique," Psycho-
logical Bulletin, 54:229-249, No. 3, May, 1957.
composite."53 (A comparable Xt composite is one in which the
n test-samples vary on the average as much in standard devia-
tions and intercorrelations as do the n test-samples in the
observed Xt composite.)
If this definition of reliability is used, a reliable
test is one that indicates how well the individual knew the
domain or how he ranked with others in his knowledge of the
domain. At least the domain sampled by the score is known
and can be made part of the meaning of the score.
The literature reviewed to this point would indicate
that the score (1) may be a function of difficulty which prob—
ably reflects the ability level of the individual, (2) may
represent a pattern as to content, or (3) may indicate how
well the individual did on the samples of the domain that the
test is hypothesized to sample. Reliability measures may be
a factor in determining what meaning can be assigned to the
score, but there are still contributions coming from content
and from difficulty.
Swineford examined the importance of the difficulty of
the item as a factor in the score assigned to the individual.
Swineford has shown that only if the items are quite precise
and intercorrelated is the difficulty of the item an important
factor in the score of an individual. Swineford used present
53Ibid., p. 230.
day tests and attempted to measure the impact of variability
of item difficulty and item-item correlation.54 The varia-
bility of item difficulty was designated σΔ, Δ being the
normal-curve deviate (for a distribution with mean of 13 and
standard deviation of 4) above which lies the area under the
curve equal to the proportion of successful examinees. For
a measure of inter-item correlation Swineford used the recip-
rocal of the square of the mean of the item-total correlation.
The results of Swineford's study showed that when the
score was the number correct, the best formula for pre-
dicting this score was as follows:
Z1 = .1530 Z3 + .8649 Z4

where
Z1 is the predicted standard score on the test,
Z3 is the measure of the spread of item difficulties
in standard score form, and
Z4 is the inter-item correlation measure in standard
score form.
R1.34 = .9648 for this formula.
When the score was the number right minus k times the
number wrong the results were as follows:
Z1 = .2117 Z3 + .9222 Z4

and R1.34 was .9642. The symbols are the same as above. As
can be seen from these formulas, the contribution of spread
54Frances Swineford, “Some Relations Between Test
Scores and Item Statistics," Journal of Educational Psycho-
logy, 50:26—30, No. 1, February, 1959.
of item difficulties in the usual cumulative test is not great.
Another way of looking at the contribution of item dif-
ficulty spread is to specify the spread and inter-item corre-
lation, and then examine the standard deviation of test
scores. Swineford used (n - chance)/σt on the one axis of
her chart, where values of (n - chance)/σt range from 5.8 to
3.0 for the highest (.50) rbis, and from 14.8 to 11.9 for the
lowest (.20) rbis. The mean rbis is .36, the highest rbis (.50)
is .70 sigma units away from the mean, and the lowest rbis (.20)
is 3.15 sigma units away from the mean. Thus, while the values
of σΔ may be considered to be close to normally distributed
and likely to be encountered in the usual cumulative test,
the values for rbis are not normally distributed. We might
conclude that if rbis were normally distributed, then higher
values of rbis might appropriately be investigated. A standard
deviation unit on σΔ would indicate that today most tests do
use items centered around the mean difficulty level, but that
the reliability of items has a larger range. If one examines
±.70 sigma units of rbis, one has about a three point change
in (n - chance)/σt values, which is about the same change en-
countered from ±3.0 sigma units of σΔ. This supports the
conclusion that conventional cumulative tests do not use
difficulty as a major factor in the score; the score is a con—
glomerate of difficulties and other factors.
The literature indicates that the cumulative test may
be constructed to measure a single factor but that the
attention of the test constructors has not been directed
toward reporting the decisions made as to the meaning of the
score. If one remains concerned with traditional operational
definitions of reliability and validity, one may forget the
construct operationalized and not change the construct when
it needs to be changed.
The sequential test procedure developed in this disser-
tation will use reflection of true ability as the meaning of
a score. The literature indicates that this is only one of
the many meanings that could be assigned to a score.
IV. SEQUENTIAL TESTING PROCEDURES
The literature indicates that there are many choices as
to the use of the sequential testing procedure. The sequen-
tial process may be used (1) to quickly determine the score
to be assigned to good and poor students; (2) to determine
to which of two categories the individual should probably be
assigned, if assigned at all; or (3) to classify each individ-
ual as well as possible in time allowed. The sequential
analysis developed by Wald would be most applicable to the
second purpose, but this method has been modified by Cowden
to serve the first purpose.
Cowden has indicated that when an examination is given
to a student it sometimes happens that not enough questions
are asked to permit a fair evaluation of his knowledge and
ability.56 On the other hand the examination is sometimes
drawn out longer than is necessary. If a student is very
good or very poor, only a few questions may be needed to
establish this fact beyond reasonable doubt; but borderline
students need to be examined at considerable length before
deciding whether they should be passed or failed. If sequen-
tial testing is used, the fate of good students and of poor
students tends to be quickly determined, but mediocre students
must continue with the examination until the results give
adequate grounds for a decision. By use of the sequential
method the number of questions answered by a student is re-
duced to a minimum, and at the same time the probability of
passing a poor student or failing a good student is controlled.
Cowden graded his students in a small class in elemen-
tary statistics at the University of North Carolina. Using
D1 (decision number I) to indicate the number of questions
that could be missed and still permit a student to pass, D2
(decision number 2) to indicate the number of questions that
must be answered incorrectly before a student is failed, and
N to indicate the cumulative number of questions answered;
the two linear equations used to make the decision follow:57
56Dudley J. Cowden, "An Application of Sequential Sampling
to Testing Students," Journal of the American Statistical
Association, 41:547-556, No. 236, December, 1946, p. 548.
57Ibid., pp. 548-549.
D1 = a1 + bN D2 = a2 + bN
As can be seen, the straight lines representing these two
equations are parallel and differ only as to the constants
a1 and a2. These constants a1 and a2 are shown to depend on
the values of p1, p2, α, and β when: "p1" is defined
as the maximum proportion of errors in all possible ques-
tions of a given type made by a student who is definitely
good; "p2" is defined as the minimum proportion of errors in
all possible questions of a given type made by a student who
is definitely poor; "α" is defined as the probability of
failing a good student; and "β" is defined as the probabil-
ity of passing a poor student. The more widely p1 and p2
differ the closer together the lines will be, and, therefore,
the more quickly will a decision be reached. The larger the
values of α and β the smaller will be the value of a2 and
the larger (algebraically) will be the value of a1. There-
fore to bring the two lines closer together one must increase
α and/or β. The value of a1 is always negative, since
answering all questions correctly does not strongly indicate
knowledge of the subject until a reasonable number of questions
is answered (what is a reasonable number depends on the value
adopted for β, becoming larger as β is made smaller). On
the other hand, a2 is always positive, but a decision to fail
cannot be reached until D2 = N, since a student cannot miss
more questions than he answers. When α = β, a2 = -a1. The
slope b is independent of α and β, but depends exclusively
on p1 and p2. Cowden gives the following formulas:58

g1 = log (p2 / p1)          g2 = log [(1 - p1) / (1 - p2)]

-a1 = h1 = log [(1 - α) / β] / (g1 + g2)

a2 = h2 = log [(1 - β) / α] / (g1 + g2)

b = g2 / (g1 + g2)
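Under this reconstruction the constants follow directly from p1, p2, α, and β (the base of the logarithms cancels in every constant). A sketch with illustrative parameter values:

```python
import math

def cowden_constants(p1, p2, alpha, beta):
    # Decision lines D1 = a1 + b*N (pass) and D2 = a2 + b*N (fail),
    # where N is the cumulative number of questions answered.
    g1 = math.log(p2 / p1)
    g2 = math.log((1 - p1) / (1 - p2))
    h1 = math.log((1 - alpha) / beta) / (g1 + g2)
    h2 = math.log((1 - beta) / alpha) / (g1 + g2)
    b = g2 / (g1 + g2)
    return -h1, h2, b  # a1 (negative), a2 (positive), slope b

a1, a2, b = cowden_constants(p1=0.10, p2=0.30, alpha=0.05, beta=0.05)
```

As the text states, a1 is negative, a2 is positive, the slope b falls between p1 and p2, and with α = β the constants satisfy a2 = -a1.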
Cowden thus develops two lines separating the pass, fail, and
indeterminate regions, but has grades for six categories based
on the following decisions:59
After 20 questions if a student made errors in less
than 10 percent of the questions, the grade of "A"
was assigned; if 55 per cent or more of the questions
were answered incorrectly, the grade of ”F” was
assigned; if the percent of incorrect questions was
between these percentage values then testing was con-
tinued. After 40 questions if a student (not classi-
fied before) made errors in less than 22.5 percent of
the questions the grade of "B” was assigned; or if
more than 45 percent of the 40 questions were incorrect,
the grade of ”F" was assigned. Similar decisions were
made after 60, 80, 100, 200, and 1,000 questions. After
1,000 questions those students not already classified
were assigned ”D" or "E” grades. Those individuals
having errors in less than 34.89 percent of the ques-
tions were assigned ”D” and those students having
errors in more than 35.3 percent of the questions
were assigned a grade of "E."
Sequential testing is thus changed to allow using more
than three categories by changing the number of items that
58Ibid., p. 551.
591bid., p. 552.
are used to make the decision. Estimates of the size of the
number of items can be obtained by the following formulas:60
N = h2 / (1 - b)

Np1 = [(1 - α) h1 - α h2] / (b - p1)

Np2 = [(1 - β) h2 - β h1] / (p2 - b)
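These average-sample-number estimates, as reconstructed, can be computed as follows; the parameter values are illustrative, not Cowden's:

```python
import math

def expected_questions(p1, p2, alpha, beta):
    # Wald's approximate expected number of questions for a definitely
    # good student (error rate p1) and a definitely poor one (p2).
    g1 = math.log(p2 / p1)
    g2 = math.log((1 - p1) / (1 - p2))
    h1 = math.log((1 - alpha) / beta) / (g1 + g2)
    h2 = math.log((1 - beta) / alpha) / (g1 + g2)
    b = g2 / (g1 + g2)
    n_good = ((1 - alpha) * h1 - alpha * h2) / (b - p1)
    n_poor = ((1 - beta) * h2 - beta * h1) / (p2 - b)
    return n_good, n_poor

n_good, n_poor = expected_questions(p1=0.10, p2=0.30, alpha=0.05, beta=0.05)
```

Both estimates are necessarily positive; with these illustrative values each decision takes roughly twenty questions on the average.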
Cowden found that it took 13.5 items before it was
possible to decide that the student should pass. This is
due to a random sample of items assumed in the sequential
process. It therefore seems worthwhile to investigate a pur-
poseful sample of items instead of random sample even though
the mathematics has not been worked out for this type of test.
To use the model developed by Wald, one must first state
the probability of type I and type II errors that one will
accept (as to a given alternative) and then continue until
one satisfies the conditions of the mathematical model with
probabilities.61 The procedure may be used to decide upon
pass or fail categories as was done by Moonan; or modified
by making assumptions about the number of items needed to
make the decision as done by Cowden; or an individual may
wait for the mathematics of the multiple decision (or other
modification) to be completed and reported, as Wald indicates
might be done in his book on sequential analysis.62
60Ibid., p. 553.
61Wald, Sequential Analysis, op. cit.
62Ibid., pp. 138-150.
The sequential procedure developed by Wald for a ”most
powerful" test is built upon the assumption that one may con-
tinue to sample the same universe. The procedure determines
what decision is best after every sample and states whether
one has attained the desired degree of probability (of being
correct). It is not necessary to follow the lead of Cowden
and Moonan and, therefore, use a random sample of items. It
is known that certain items of different difficulties will
give more information about an individual than other items,
and this information should be used; this means that one
does not wish to sample from the same universe of items each
time. While the aptitude or ability being tested must remain
unidimensional, there may be great advantage in allowing the
difficulty of items to change. The sequential model herein
described thus departs from the Wald sequential model in
that it uses different difficulty levels so that fewer items
are needed for the decision.
Fiske and Jones in an article intended to introduce
sequential analysis to psychologists, stated that the un-
critical use of sequential analysis obviously is not recom—
mended.63 It is a design which can have advantages when one
or more of the following conditions actually holds: (a) The
problem involves the choice between two possible parameter
values which can be specified on a priori but not arbitrary
63Donald W. Fiske and Lyle V. Jones, "Sequential Analy-
sis in Psychological Research," Psychological Bulletin, 51:
264-275, No. 3, May, 1954, pp. 273-274.
grounds--the null hypothesis will usually be one of the two;
(b) the data are such that the cost per datum is high and
economy is desired; and (c) the total amount of data is not
fixed.
Such criteria would lead one to believe that the sequen-
tial model developed by Wald may not be the appropriate model
for the test situation, as the total amount of data is fixed
and one cannot afford to have 1,000 items as indicated by
Cowden. It may be no more expensive to acquire the data
from all candidates than from a few, unless one wishes to
select only rather than classify. The decision to accept or
not accept--the selection question--seems to be the most ap-
propriate decision which can be answered by the sequential
method as described by Wald.
The literature also indicates methods of presenting the
material to the testee. Some of these are noted here. Glaser,
Damrin, and Gardner constructed a tab item test to aid in
training of electronics specialists.64 In this test, the
performance on one test yields information which supplies a
cue for the selection of the next test and subsequent proce-
dures. ‘0ne "tab item” test, for example, had the trainee
read a description of the malfunction of a television set and
then, rather than actually performing various checking
64Robert Glaser, Dora E. Damrin, and Floyd M. Gardner,
"The Tab Item: A Technique for the Measurement of Proficiency
in Diagnostic Problem Solving Tasks," Educational and Psycho-
logical Measurement, 14:283-93, No. 2, Summer, 1954.
procedures, the trainee pulled the tabs of those checks he
would make if he were actually trouble shooting a real tele-
vision set. Whenever he pulled a tab he uncovered the
information he would have obtained if he actually had per-
formed that check on a real set.
Another method of presentation was used by Krathwohl
and Paterson in preliminary studies of the sequential test
model. They had directions printed on the page, covered
these with a transparent hard finish ink so that directions
could not be erased, then covered this in turn with strips
of opaque ink. The testee erased the strip of opaque ink
under the letter he considered to be related to the correct
answers. (This is similar to an IBM answer sheet, but in-
stead of marking a spot, the testee erases a spot.) The appro-
priate directions were thus made available to the student.
Teaching machine presentations are also obvious methods
to present material to the testee. The material is similar
to that presented by teaching machines, but in the sequential
model being developed in this paper, the individual does not
obtain information about the correctness or the reason for
the correctness or incorrectness of the response. However,
the individual is told to take a more difficult item if he
correctly answered the preceding item, or a less difficult
item if he incorrectly answered the preceding item.
The literature suggests that if the decision is to
best classify the individual by a sequential procedure, the
present sequential model may be better than past models which
have been developed from different assumptions and for differ-
ent problems. The literature also suggests that traditional
scores represent more than one meaning.
The present sequential model has used reflection of
input in the output as the proper meaning for a score; the
cumulative test should not perform this function as well as
the sequential test. The decision as to how to measure the ef-
ficiency of these tests (and indirectly the items) was then
related to the reflection of input in the output. The two
factors considered in the output were (1) the means and
variances of ability levels assigned to a score (precision
of score) and (2) the means and variances of scores assigned
to an ability level category (discrimination of test).
It should be noted that the decisions as to the type
of score distribution desired and the meaning that should
be assigned to a score had to be made before one could deter-
mine the efficiency of the test (or items). The decisions
made in the present study were those decisions which it was
hoped would favor the sequential test procedure.
There should be maximally efficient use of items in the
sequential method as (1) there is a separation of individuals
into groups which have a base rate of 50 per cent for the
items used, and (2) the use of items at the 50 per cent level
of difficulty for the subgroups permits the use of more
difficult items and makes better separation of these individ-
uals (as the item is at the 50 per cent level of difficulty
for the subgroup).
CHAPTER III
PROCEDURES
There are six sections to this chapter. First, the
actual construction of the six-item cumulative and the
six-item sequential test model is considered. The second
section outlines the method of evaluating the hypotheses
stated in Chapter I which relate to the effect of input
distributions. The third and fourth sections show the
methods for testing the hypotheses about item precision and
difficulty, and effect of errors of estimating a parameter,
respectively--both for the sequential model. Fifth, some
general comparisons between test score distribution and
ability level distribution are examined. And finally, a
summary of procedures and hypotheses is presented.
I. TEST MODEL CONSTRUCTION
This section deals with the construction of six-item
sequential and cumulative test models. Later these test
models are used with different inputs of ability and the type
of score output is examined.
The test model for the sequential and cumulative tests
assumed that the probability of passing an item was dependent
upon three factors: (1) the ability level of the individual,
(2) the precision of the item, and (3) the difficulty level
of the item. The assumption was made that no one passed by
randomly guessing the correct answer to the item.
The ability level of the individual was specified in
terms of standard score units for a normalized distribution
of ability. The precision of the item was specified in
terms of either rbis or σd. These two terms are related by
the following formulas:1

σd = √(1 - rbis²) / rbis          (1)

or by algebraic manipulation,

rbis = 1 / √(1 + σd²)          (2)

As can be seen from the second formula, rbis is equal to one
if σd is equal to zero. The smaller the σd value the more
precise the item, and if σd were equal to zero, the individ-
uals who had ability levels above the difficulty level of the
item would pass the item, and vice versa.
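Formulas (1) and (2) are easily checked numerically; note that rbis = .75 yields the σd = .882 used for the model in this chapter. A minimal sketch:

```python
import math

def sigma_d_from_rbis(r_bis):
    # Formula (1): item imprecision from the item-total biserial.
    return math.sqrt(1.0 - r_bis ** 2) / r_bis

def rbis_from_sigma_d(sd):
    # Formula (2): the algebraic inverse.
    return 1.0 / math.sqrt(1.0 + sd ** 2)
```

The two functions invert one another, and sigma_d_from_rbis(0.75) reproduces the .882 figure to three decimals.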
The difficulty of the item was expressed in terms of
standard score units for a normal population. It should be
remembered that 80 or 90 per cent of a select group could
pass (or fail) an item of 50 per cent difficulty.
lFrederic M. Lord, "Some Perspectives on 'The Attenua—
tion Paradox in Test Theoryl,” Psychological Bulletin, 52:
505-10, No. 6, November, 1955, p. 506.
The probability of passing a single item for a given
small segment of ability was computed by determining the
area under the normal curve from -∞ to the value (a - d)/σd,
where "a" is equal to the ability level of the individual in
standard score or sigma units, "d" is equal to the difficulty
level in standard score or sigma units, and "σd" is the
measure of precision described above.
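Since the area below (a - d)/σd is the standard normal distribution function, the single-item pass probability can be sketched as:

```python
import math

def p_pass(a, d, sigma_d=0.882):
    # Normal-curve area from -infinity to (a - d)/sigma_d: the
    # probability that a person of ability a passes an item of
    # difficulty d (all in standard score units).
    z = (a - d) / sigma_d
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
```

A person exactly at the item's difficulty level passes half the time; the probability rises with ability and falls with difficulty.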
The probability of passing a sequence of items for
both the sequential and the cumulative was determined by
multiplying the probabilities of passing each item in that
sequence. This assumed that for that small segment of
ability (for which the probability of passing an item was
determined), performance on any one item was experimentally
independent of performance on any other item. Since the con-
cern was with classifying people by ability, it was assumed
that each of these items measured only one factor other than
the error factor, i.e., the test was unidimensional. The
error factor on any one item was assumed to be independent
of error on any other item.
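Multiplying along a pass-fail path gives that path's probability; with six items there are 2^6 = 64 paths, and their probabilities sum to one at every ability level. The sketch below uses one fixed list of difficulties; in the sequential model each path would carry its own difficulty sequence.

```python
import itertools
import math

def p_pass(a, d, sigma_d=0.882):
    z = (a - d) / sigma_d
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def path_probabilities(a, difficulties, sigma_d=0.882):
    # Probability of each pass/fail sequence for a fixed ability a,
    # assuming performance on any one item is independent of
    # performance on any other item (local independence).
    probs = {}
    for path in itertools.product((0, 1), repeat=len(difficulties)):
        p = 1.0
        for d, passed in zip(difficulties, path):
            q = p_pass(a, d, sigma_d)
            p *= q if passed else 1.0 - q
        probs[path] = p
    return probs

probs = path_probabilities(0.5, [0.0] * 6)
```

The 64 path probabilities summing to one is a useful check that the multiplication has been carried out correctly.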
Using the above scheme, one six-item sequential test
model was constructed for a hypothetical population of 1500
individuals with 100 people at each of 15 ability levels as
shown in Table 24. The item precision for all items in this
model was arbitrarily set at σd = .882. The appropriate dif-
ficulties were determined by the following procedure. First,
the number of people at each of the 15 ability levels who
would pass or fail an item was computed. The value of the
sum of the squared deviations from the mean for each of the
ability scores was computed for the pass and fail groups.
This value was computed and graphed for different trial
values of difficulty until the difficulty level was found for
which the sum of all sets of deviations of ability level
about the mean ability level of each group was a
minimum. Since ΣX² was a constant, the value for difficulty
level was calculated by maximizing Σ(ΣX)²/N. The difficulty
level of the item taken by each group was not always the same.
For example, in Figure 1, both the group who passed the first
item and failed the second item and the group who failed the
first item and passed the second item take the same item at
stage 3--a 0.00 item. If this had not been done, the six-
item sequential test would require 63 different items.
It was decided to use the same item for those groups
for which Σ(ΣX)²/N maximized at a difficulty level no more
than .20 standard score units away in difficulty from each
other. This allowed the test to be built with fewer items
and thus any test built to correspond to the model could use
only the most precise items in a pool of items. Also, this
. . . have answered correctly. Both of these raw score distri-
butions were converted to normalized "T" scores so that the
two score distributions might be compared on an equivalent
interval scale basis.
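The difficulty search described above (maximizing Σ(ΣX)²/N over the pass and fail groups, which is equivalent to minimizing the within-group squared deviations) can be sketched as a grid search. The ability levels and counts below are illustrative, not those of Table 24.

```python
import math

def p_pass(a, d, sigma_d=0.882):
    z = (a - d) / sigma_d
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def criterion(d, levels, counts, sigma_d=0.882):
    # Sum over the pass and fail groups of (sum of abilities)^2 / N.
    # Since the total sum of squares is fixed, maximizing this is the
    # same as minimizing the within-group squared deviations.
    total = 0.0
    for want_pass in (True, False):
        n = 0.0
        sx = 0.0
        for a, c in zip(levels, counts):
            p = p_pass(a, d, sigma_d)
            w = c * (p if want_pass else 1.0 - p)
            n += w
            sx += w * a
        total += sx * sx / n
    return total

levels = [(i - 8) * 0.25 for i in range(1, 16)]  # 15 symmetric ability levels
counts = [100] * 15                              # 100 people per level
best_d = max((k / 100.0 for k in range(-200, 201)),
             key=lambda d: criterion(d, levels, counts))
```

With a symmetric input the criterion peaks at the mean ability, so the search returns a first-item difficulty of approximately 0.00.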
II. EFFECT OF SHAPE OF DISTRIBUTION OF ABILITY
It was hypothesized that the sequential test model con-
structed as described above should work well for any type
of input distribution and thus be better than the six-item
cumulative test model. The six-item cumulative test con-
structed with all items at the 50 per cent level of difficulty
was not expected to be effective for those distributions
which had many high ability individuals. It was hypothesized
from the literature that these individuals would need more
difficult items to discriminate among them. To test this
hypothesis, different ability distributions were used as
input. The difficulty levels of the items used in the sequen-
tial test model were determined according to the method
described in the last section, and were for a precision level
of an rbis item-total correlation of .75 (or σd = .882). A
precision of .75 was used because differences between the
six-item cumulative and sequential models should be greatest
at high levels of precision--.75 would be considered very
high by the standards in use. Few tests have an average
item-total correlation of .75. A rectangular input distri-
bution . . . models--the sequential and the cumulative--were
each used with a normal and a U-shaped distribution of ability
to make the total of four tests. These four tests were con-
structed in an electronic computer. (For both the normal and
the U-shaped distributions the individuals were assumed to be
distributed over 15 input categories. Since the values used
in the computer program were proportions at each category,
any number of individuals may be assumed. The most common
assumption made in interpreting these data is that there were
1000 individuals distributed over these 15 input categories.)
The item difficulties for the sequential models were the
ones computed above. The item difficulties for the two cumu-
lative models were all at the 50 per cent level. It was thus
possible to compare not only the sequential with the cumulative
models, but also the effect of an input of normal and U-shaped
distributions.
Effect of Normal Distribution
The effect of an input of a normal distribution of
ability on the output distribution was examined in several
ways, but before the examination of hypotheses it was
necessary to place the few individuals of extreme ability
in the end categories. This was done because spreading
these individuals over the middle of the distribution would
have underrepresented the number of people likely to be at
extreme values in ability. Using the above procedure the end
categories extended from ±1.612 to ±1.736 sigma units. Since
there were so few to consider at the levels beyond ±1.736
sigma units, these individuals were all considered to be at
the mean ability level for all people beyond ±1.612; that
is, at 1.942 sigma units (see Figure 2).
To test the hypothesis that the cumulative and sequen-
tial test models have equal ability to classify individuals
of mean ability level, the means and variances of comparable
normalized scores from the six-item cumulative and "least
squares" sequential test models for those 100 individuals
assumed to be in category eight of ability (the middle cate-
gory) were tested for significance of difference. The means
were tested by use of a "t" test and the variances by use of
an F ratio.
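In modern terms, the two comparisons just described can be sketched as follows. This is an illustrative sketch, not the thesis's own computation; the function name and the sample scores are hypothetical.

```python
from statistics import mean, variance

def t_and_f(scores_a, scores_b):
    """Pooled-variance two-sample t statistic for the difference in
    means, and the F ratio (larger variance over smaller) for the
    difference in variances."""
    na, nb = len(scores_a), len(scores_b)
    va, vb = variance(scores_a), variance(scores_b)
    pooled = ((na - 1) * va + (nb - 1) * vb) / (na + nb - 2)
    t = (mean(scores_a) - mean(scores_b)) / (pooled * (1 / na + 1 / nb)) ** 0.5
    f = max(va, vb) / min(va, vb)
    return t, f
```

The t statistic is referred to Student's distribution with na + nb − 2 degrees of freedom, and the F ratio to the F distribution for the corresponding variance degrees of freedom.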
To test the hypothesis that the "least squares" sequen-
tial test model should more accurately classify the few
individuals at the extremes of the ability scale than the
six-item cumulative model, the means and variances of com-
parable normalized test scores for the 84 individuals in
ability categories 14 and 15 were tested.
To test the hypothesis that the cumulative test model
would less accurately classify the individuals at
the extremes than the sequential test model, the means and
variances of ability level scores for the individuals ranked
in the top 8.4 per cent of the score distribution for each
test model were tested. When it was necessary to take only
a proportion of a score group to complete the top 8.4 per
cent of scores, then the ability levels were proportionately
sampled. The value of 8.4 per cent was selected because
there were 84 individuals in the top two input ability
levels of the hypothetical population of 1000 individuals.
It was hypothesized that the six-item cumulative test
model would produce scores representing finer ability units
in the middle than at the extreme score values, while the
sequential test would more nearly reflect the ability scale.
The differences in mean ability values for adjacent scores
were hypothesized to be smaller in the middle and greater as
the extremes were approached. These differences in mean ability
values for the adjacent scores in one-half of the symmetrical
score distribution are shown in Tables 5 and 6. In addition
to this, the differences between mean normalized "T" scores
for each adjacent ability level for both the sequential and
cumulative tests are shown in Table 4.
Effect of U-shaped Distribution
The effect of the U-shaped distribution of ability was
studied by the same procedures used with the normal distribution
of ability. The distribution used in these tests is the
one shown in Figure 2. To determine if the "least squares"
sequential test would more accurately classify individuals
at the mean of the absolute ability levels than would the
six-item cumulative test, the means and variances of normal-
ized scores assigned to category thirteen were tested for
significance of difference between scores assigned by the
six-item sequential and the six-item cumulative models.
Category 13 was selected as it included the mean value of
ability for those individuals in the top half of the ability
distribution. To determine if the sequential test would more
accurately classify individuals at the extreme values of the
ability distribution than would the six-item cumulative test,
the means and variances of normalized scores assigned to
category 15 individuals were compared for the sequential and
cumulative test models.
To test the hypothesis that the cumulative test model
would less accurately classify the top-scoring individuals
than would the sequential test model, the individuals ranked
in the top 13.5 per cent of the score distribution were examined
for differences in means and variances of ability level. These
top-scoring individuals were proportionately selected as stated
for the normal distribution. The top 13.5 per cent of the
score distribution was used as there were 13.5 per cent of
the individuals in the top ability category.
To determine if the middle ability levels were more
finely classified by the six-item cumulative test,
the mean normalized "T" score for each ability level was
determined and shown in Table 24. The same was done for the
sequential test model. The hypothesis was that the sequential
model should have approximately equal distances between test
score means for each of the ability categories, while the
six-item cumulative model would have larger differences in
mean test scores for the middle ability levels than for
extreme values.
The differences in mean score values for adjacent
ability levels are shown in Table 4. The mean ability levels
for each score are likewise shown in Tables 25 and 26. It
was hypothesized from Lawley's work that the extreme scores
of the cumulative test should have lower variance of ability
level than the extreme scores for the sequential test.2
Since less variance of ability level means fewer lower ability
individuals, it was assumed the extreme cumulative test scores
would have higher mean values.
Effect of Ability Distributions for
Additional Sequential Tests
In addition to the four tests described above, three
other sequential tests were built with an electronic computer.
2D. N. Lawley, "On Problems Connected with Item Selec-
tion and Test Construction," Proceedings of the Royal Society
of Edinburgh, 61 (Section A, Part III): 273-287, 1942-1943.
However, in these tests the difficulties of the items were
not determined by a "least squares” procedure, but used
difficulties determined by an adaptation of Lord's work.3
The item difficulties used in these three tests were so
selected that, it was hypothesized, depending on the particu-
lar selection, a normal, rectangular, and a U-shaped distri—
bution of scores would be obtained. The number of individuals
assigned to each score and mean ability level of these individ-
uals are reported in Tables 18, 19, and 20.
It was assumed that a score from a test designed to
output a rectangular score distribution should correlate
highest with a rectangular input of ability. Scores with
normal distribution should likewise correlate highest with
the normal input of ability, and scores with U—shaped distri-
bution should correlate highest with U—shaped input of ability.
However, information was obtained as to the effect on both
output distribution and the correlation values of changing
the input distribution.
The rule stated by Lord was that if one wished to
divide the group at a given point, then the item difficulty
(expressed in standard score units) is represented by the
item-total rbis times the standard score unit which represents
the proportion below the point where the split is desired.
The procedure followed in constructing these three tests was
3Lord, "Some Perspectives on 'The Attenuation Paradox
in Test Theory'," op. cit.
that if there were four different difficulties used at a
given stage, then the abscissa should be divided into five
equal ability segments. The difficulties necessary to pro-
duce these proportions were then computed from Lord's formula.
One time the distribution of scores to be produced was con-
sidered normal; one time, rectangular; and one time, U—shaped.
Since different proportions were to be selected for each
distribution shape, different difficulties were needed for
each. The rule used to determine the number of different
difficulties at each stage was to add one more difficulty
at each stage. It turned out that this rule gave results
approximating the results from the determination of difficul-
ties by the rules developed in the past section on "Test
Model Construction.”
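Lord's rule, combined with the equal-segment division described above, can be sketched as follows. This is an illustrative sketch; the function name is hypothetical, while the rbis = .75 value and the five-segment split for four difficulties follow the text.

```python
from statistics import NormalDist

def lord_difficulty(p_below, r_bis):
    """Lord's rule: item difficulty in standard-score units equals
    r_bis times the normal deviate cutting off p_below of the group."""
    return r_bis * NormalDist().inv_cdf(p_below)

# Four difficulties at one stage: divide the ability scale into five
# equal segments, i.e. splits at p = 1/5, 2/5, 3/5, 4/5.
difficulties = [lord_difficulty(k / 5, 0.75) for k in (1, 2, 3, 4)]
```

The resulting difficulties are symmetric about the 50 per cent level (zero in standard-score units), and a lower rbis regresses them all toward the mean, as the text notes.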
Lord has shown how to select item difficulties to yield
a desired split of individuals by a cumulative test. These
Lord difficulties assume an input of a normal distribution
of ability; therefore, in the sequential test one should com-
pute difficulties with a normal distribution of ability for
each item of the test. This was not possible in the present
sequential model. The differences in the difficulty levels
of the items selected by Lord‘s technique and the above tech“
nique when an rbis = .75 is used are noted, but no study of
the effect at other values of rbis was made.
III. ITEM PRECISION AND DIFFICULTY FOR
THE SEQUENTIAL TEST
To determine the interrelationships among item precision,
difficulty level, and output characteristics, five tests con-
taining items of varying precision and difficulty were compared.
The five tests were built in the electronic computer and
varied in precision and difficulty of items used. The tests
were built using Lord's rule in the selection of difficulties
so that a normal distribution of scores should be obtained
when the distributions of ability were normal. The five
precision levels were for rbis equal to .79, .75, .71, .60,
and .45. (The .75 precision test was the same as the one
constructed above.) For an assumed N of 1000, the .79 and
.71 values are one standard error of an rbis above and below
.75. The .60 value was selected as it is a value common in
the literature; the .45 to show the effect of meeting low
precision standards. The .79 precision level is not consid-
ered unrealistic if the spread of ability level is great.
Precision was hypothesized to be one of the most important
parameters in the behavior of the sequential test model.
To examine the hypothesis that the more precise items
would produce a better separation of people, the variances
of scores for category eight ability level (the middle ability
level) individuals were compared for each of the five tests
by use of Bartlett's test for homogeneity of variance. This
test was repeated for the combination of categories 14 and
15 (the most extreme categories) for the five test models.
It was hypothesized that there would be a difference in the
variance of scores, with the more precise items producing the
scores with the smaller variances. Since a lower precision
of items means that the effective difficulty level regresses
toward the mean and, therefore, is closer to the 50 per cent
level, the middle difficulty items should increase the preci-
sion of scores at the extremes--although not the ability to
classify individuals. Thus, the extreme scores would have
small variance of ability levels for both precise and less
precise items and it was hypothesized that the variances of
ability level scores would be most different at the middle
score values.
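Bartlett's test statistic, used throughout these variance comparisons, can be computed as follows. This is an illustrative sketch; the statistic is referred to a chi-square distribution with k − 1 degrees of freedom under the null hypothesis of equal variances.

```python
from math import log
from statistics import variance

def bartlett_statistic(*samples):
    """Bartlett's statistic for homogeneity of variance across k
    samples: a corrected ratio comparing the pooled variance with
    the individual sample variances."""
    k = len(samples)
    n = [len(s) for s in samples]
    v = [variance(s) for s in samples]          # unbiased sample variances
    N = sum(n)
    pooled = sum((ni - 1) * vi for ni, vi in zip(n, v)) / (N - k)
    num = (N - k) * log(pooled) - sum((ni - 1) * log(vi)
                                      for ni, vi in zip(n, v))
    c = 1 + (sum(1 / (ni - 1) for ni in n) - 1 / (N - k)) / (3 * (k - 1))
    return num / c
```

When all sample variances are equal the statistic is zero, and it grows as the variances diverge.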
The second hypothesis stated that a test consisting
of more precise items would have the ability to discriminate
evenly over the entire range of ability rather than making
finer discriminations at the middle of the ability range.
This hypothesis was tested by examining differences in the
means of test scores for each category of ability. A table
was made of the means and variances of test scores for each
of the fifteen ability levels and for each of the five levels
of precision. The discrimination index for adjacent ability
levels was computed as suggested by Lord.4 The higher the
4Lord, A Theory of Test Scores, op. cit., p. 24.
index the better the discrimination; values may range from
zero to infinity. Lord's discrimination index was computed
as follows:
D = (Ms.c1 - Ms.c0) / σ*

Ms.c0 = mean of score values for ability level c0
Ms.c1 = mean of score values for ability level c1
σ* = some appropriate average of the standard
deviations of the two score distributions
Lord stated that this discrimination index is completely
independent of the distribution of ability in the group tested:5
This is an advantage when a general description of the
test is desired without reference to any particular
group of examinees; it is a disadvantage if the effective
discrimination of the test for a specified group of
examinees is desired.
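Given these definitions, the index can be computed as follows. This is a sketch in which the root-mean-square of the two standard deviations stands in for the "appropriate average" (one of several defensible choices); the score groups are hypothetical.

```python
from statistics import mean, stdev

def lord_discrimination(scores_c0, scores_c1):
    """Lord's discrimination index between two adjacent ability
    levels: difference of mean scores divided by an average of the
    two score standard deviations (root-mean-square used here)."""
    m0, m1 = mean(scores_c0), mean(scores_c1)
    s0, s1 = stdev(scores_c0), stdev(scores_c1)
    sigma_star = ((s0 ** 2 + s1 ** 2) / 2) ** 0.5
    return (m1 - m0) / sigma_star
```

A larger index means the two ability levels receive more clearly separated score distributions; identical distributions give zero.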
IV. ERRORS IN SEQUENTIAL TEST
PARAMETER ESTIMATES
The procedures used to determine the effects of errors
in estimating the parameters of precision and difficulty for
the sequential test items are related to the nature of the
error involved. The difficulty of an item is usually
specified in terms of the proportion of the group passing
the item. This test model, however, uses difficulty specified
in standard scores, so the standard error of a proportion
must be translated into standard score terms. The standard
5Ibid.
error of a proportion (√(PQ/N)) is greatest when P = Q =
.50. Thus the greatest error in estimating difficulty in
terms of proportion passing an item would occur at the 50
per cent level of difficulty. The value of √(PQ/N) is
smallest at the extreme values of P or Q. The error in
terms of proportion passing an item was thus investigated
at .50 and .90. These errors were then translated into
standard score units. The values of √(PQ/N) (when N = 1000)
were .016 and .010 for .50 and .90, respectively. When the
values necessary to encompass two standard errors of the
proportion were translated to standard score form the values
were quite similar and equal to about ±.10. The error for
estimating difficulty was thus assumed to be less than or
equal to ±.10 no matter what the difficulty level of the
item.
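The translation of proportion errors into standard-score units can be sketched as follows. The function names are illustrative; the ±.10 figure in the text corresponds roughly to the half-widths computed here.

```python
from statistics import NormalDist

def se_proportion(p, n):
    """Standard error of a proportion: sqrt(PQ/N)."""
    return (p * (1 - p) / n) ** 0.5

def standard_score_half_width(p, n, k=2):
    """Half-width, in standard-score units, of the interval spanned
    by p plus-or-minus k standard errors of the proportion."""
    nd = NormalDist()
    se = se_proportion(p, n)
    return (nd.inv_cdf(p + k * se) - nd.inv_cdf(p - k * se)) / 2
```

For N = 1000, the standard errors at P = .50 and P = .90 come out near .016 and .010, and the two-standard-error bands translate to roughly a tenth of a sigma unit in each case.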
The error made in precision depends upon the estimate
of rbis, which has a sampling error as follows:6

σrbis = (√(PQ)/z - r²bis) / √N

Terms as defined before.
Thus for rbis equal to .75 (which was the only precision
level for which the error was studied), and assuming P =
Q = .50, and N = 1000; then σrbis = .02. Since the error
in rbis is not likely to be greater than ± .04, then rbis
6Quinn McNemar, Psychological Statistics (second edi-
tion; New York: John Wiley and Sons, 1955), p. 194.
of .75 is not likely to be outside of the interval of .71 to
.79. The σd value for .71 is .99 and the σd value for .79
is .78. Thus the error in terms of σd is not likely to be
greater than ± .10. These estimates were the values used
to determine the effect of parameter estimation on output.
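The sampling-error computation can be sketched as follows, taking z in McNemar's formula as the normal ordinate at the point of division. This is an illustrative sketch; the function name is hypothetical.

```python
from math import exp, pi, sqrt
from statistics import NormalDist

def sigma_rbis(r_bis, p, n):
    """Approximate sampling error of the biserial correlation:
    (sqrt(PQ)/z - r_bis**2) / sqrt(N), where z is the standard
    normal ordinate at the point dividing the group into P and Q."""
    x = NormalDist().inv_cdf(p)          # split point in standard scores
    z = exp(-x * x / 2) / sqrt(2 * pi)   # normal ordinate at the split
    return (sqrt(p * (1 - p)) / z - r_bis ** 2) / sqrt(n)
```

With P = Q = .50 and N = 1000, rbis = .75 gives a sampling error of about .02, matching the value used in the text.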
The testing of the first hypothesis as to the effect
of errors of item difficulty was done with a normal distri-
bution of ability; test items designed for rbis equal to
.75; and by the least squares of deviations method described
in "Test Model Construction." It was hypothesized that if
one were to use at the second stage an item which was .40
more sigma units away from the mean than the items selected
as above, then more people should be directed toward mean
scores than if the ideal difficulty were used. This would
imply fewer people at the extreme values than usual if the
rest of the test did not correct this trend. It was hypothe—
sized that the opposite should happen if the item were .40
sigma units toward the mean at the second stage. These
changes were tested by use of the chi-square technique. If
a difference of .40 did not make any difference it would
seem obvious that errors of estimate (about .10) would not
make any difference.
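The chi-square technique referred to here can be sketched as follows; the score-category counts are hypothetical.

```python
def chi_square(observed, expected):
    """Chi-square statistic comparing observed score-category counts
    against the counts expected under the "error free" test."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# hypothetical counts over five score categories: the "error" test
# directing more people toward the middle scores
observed = [60, 180, 520, 180, 60]
expected = [80, 170, 500, 170, 80]
stat = chi_square(observed, expected)
```

A statistic near zero indicates the shifted difficulty made no detectable difference in how people were distributed over the score categories.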
The effects of errors of estimate at the fifth stage
were determined when the item difficulties were shifted .40
sigma units away from the mean in one problem and .40 sigma
units toward the mean in another problem. As the hypotheses on the effects
of error at the second and fifth stages derive from the same
rationale, and as the effects of the fifth stage were expected
to be in the same direction as the second stage effects only
larger, the hypotheses on the second stage errors require
only analysis of direction of change (i.e. chi-square) while
the hypotheses on the fifth stage errors require more exten—
sive variance analysis.
It was hypothesized that the variance of ability level
for the top 84 individuals would be greater for tests with
the shifted difficulties than for the test where the items
were at the ideal difficulty level. The significance of dif~
ferences in variances was tested by use of Bartlett's test
for homogeneity of variance.
The discrimination of the tests for an ability level
was determined by examining, for these same tests, the
variance of test scores for the category fifteen ability
level individuals. It was hypothesized that the variance of
scores for category 15 individuals would be highest when dif-
ficulties were closest to the mean value. Variances for the
three tests were compared by use of Bartlett's test for homo-
geneity of variance.
For the test with difficulties at the fifth stage dis-
placed away from the mean by .40 sigma units, it was hypothe-
sized from Lawley's work that the variance of ability level
for the 100 middle-scoring individuals would be lower than
the variance of these individuals on the other tests. Again
Bartlett‘s test for homogeneity of variance was used as the
test.
Ability level discrimination was similarly determined
by examination of the variance of test scores for category
eight of ability. It was hypothesized that the original
test (with ideal difficulties) would have better discrimin~
ation than the modified tests. Again Bartlett's test was
used to compare variances.
The third hypothesis--that errors in estimating the
precision of the items would be more serious in the initial
stages than at later stages--was tested by placing items
of rbis = .71 (instead of rbis = .75) at the second stage.
Since subsequent items were designed with the assumption
that the second item had rbis = .75, the spread of ability
should be greater than ideal for discriminating among individ—
uals arriving at subsequent items. These subsequent items
are more difficult than ideal and this increased difficulty
should thus force the individuals toward the center of the
distribution. The greatest increase in variance of test
scores should thus be noticed for high and low ability groups;
middle ability groups should not change in variance of test
scores produced. The variances of scores for extreme and
middle ability levels were compared by use of the F ratio.
Also, the variances of ability level scores for individuals
ranked in top 8.4 per cent of the score distribution were
tested by the F ratio.
The fourth hypothesis, that errors in the estimate of
precision should make little difference at the fifth stage,
was examined by placing items of rbis equal to .71 at the
fifth stage. The difficulty of the items remained the same.
The effect of this should be that again the item would be
more difficult than the Lord formula would suggest as ideal,
because difficulty should be regressed toward the mean de-
pending upon the rbis value. The lower the rbis the more
the ideal difficulties should be regressed toward the mean.
The results should be that more individuals than ideal would
take an easier sixth item which, according to Lawley, should
increase the precision of high ability scores. It was also
hypothesized that this change in fifth item precision would
increase the variance of score levels for high ability individ~
uals. These results were hypothesized to be in the same
direction as results from changes at the second stage, and
the F ratio was likewise used to test these hypotheses.
V. GENERAL COMPARISONS
A general comparison of the relationship between input
distribution and output distribution of scores was felt to
be of value even though no specific hypotheses were advanced
due to the number of variables involved. The difficulty of
the items, the precision of the items and the pattern of items
taken by individuals of different ability levels all interact
to affect the score distribution.
The effect of difficulty of items was noted for the
nine tests described in ”Effect of Ability Distributions for
Additional Sequential Tests." As the difficulties of items in
each test do not regress toward the mean at the same rate,
no clear conclusion can be made as to the effect of dif-
ficulty on output characteristics.
The effect of difficulty can thus be determined only
for certain ability levels. (The data for the distributions
of only one-half of the scores were presented as the other
half was symmetrical.)
In addition to the distribution of scores, the cor-
relation ratios were reported as these give information as
to the general relationship between the input distribution
of ability and the output distribution of scores. In former
unpublished trials of the sequential test the value of the
Pearson Product—Moment r was made to closely approach that
for eta, by assigning the scores to the 64 different sequences
of items from the rank of the mean ability level of the
individuals at the score. (Another alternative would have
been to assign scores according to rank of the sequence if
ideal items had been used in the test model.)
The best general comparison of output to input in regard
to precision of item came from the five sequential tests
described in ”Item Precision and Difficulty for the Sequential
Test," where item difficulties and type of distribution
remained constant over all five sequential tests. The general
comparisons were made in terms of correlation ratios; the
data were reported for one-half of the output distribution
of scores for the five tests.
A comparison of output to input in regard to the
pattern of items taken by an individual came from using the
Lord difficulties which yielded a rectangular output of
scores when a rectangular distribution of ability was input.
The rectangular distribution was used because this best ap—
proximated the "least squares” solution. Two new test models
were constructed: each had exactly the same items with same
difficulties and same precision (rbis = .75); one test had
items distributed as in Figure 1, and the other test had
items distributed by one item at the first stage, two items
at the second, three at the third, and continued until it
had six items at stage six. Only the pattern of items taken
by the individuals was different in the two tests. Again eta
and the distribution of one-half of the output distribution
of scores for each of the two test models were reported.
VI. SUMMARY OF PROCEDURES AND HYPOTHESES
One sequential test model was constructed by the "least
squares" (of the deviations from the mean ability level) rule
for a rectangular distribution of ability over 15 ability
categories and rbis equal to .75 for item precision. (Ability
level one represented lowest ability level and ability level
15 represented highest ability level.)
The above test was then used with an input of normal
and U-shaped distributions of ability. A six-item cumula-
tive test with all items at the 50 per cent level of dif-
ficulty and a precision level of the item-total rbis equal
to .75 was likewise used with normal and U-shaped distri-
butions of ability. The output distributions for comparable
tests were then examined.
The null statistical hypotheses concerning the effect
of the normal ability distribution on output of scores
stated that the cumulative and sequential test models should
have the following: (The alternative hypothesis expected
from the rationale is given in parentheses.)
(1) equal means for the comparable normalized scores for
category eight individuals (no alternate, hope to accept
null);
(2) equal variances for the comparable normalized scores
for category eight individuals (hope to accept null;
cumulative may be smaller);
(3) equal means for the comparable normalized scores for
combined category 14 and 15 individuals (cumulative
lower);
(4) equal variances for the comparable normalized scores
for combined category 14 and 15 individuals (sequential
smaller);
(5) equal means for the ability level scores for the
individuals ranked in the top 8.4 per cent of the
score distribution (cumulative lower); and
(6) equal variancesfor the ability level scores for the
individuals ranked in the top 8.4 per cent of the score
distribution (sequential smaller).
The null statistical hypotheses concerning the effect
of the U—shaped ability distribution on output stated that
the cumulative and sequential test models should have the
following:
(1) equal means for the comparable normalized scores for
category 13 individuals (cumulative lower);
(2) equal variances for the comparable normalized scores
for category 13 individuals (sequential smaller);
(3) equal means for the comparable normalized scores for
category 15 individuals (cumulative lower);
(4) equal variances for the comparable normalized scores
for category 15 individuals (sequential smaller);
(5) equal means for the ability level scores for the individ-
uals ranked in the top 13.5 per cent of the score dis~
tribution (cumulative lower); and
(6) equal variances for the ability level scores for the
individuals ranked in the top 13.5 per cent of the
score distribution (sequential smaller).
In addition to the hypotheses listed above, mean score
values for each ability level, and mean ability level for
each score value were plotted for both the normal and U-
shaped distributions of ability. Additional information as
to effect of distribution of input on output is presented as
part of the general comparisons.
Three tests were constructed by Lord's rules and each
of these was used with normal, rectangular, and U-shaped
distributions of ability, although each test was designed
to reflect only one of the input distributions. Eta was
used to compare the input distribution with output distri-
bution for these nine tests. In addition, the actual output
distribution of each of the nine tests was tabled. These
tests were built for information, and no hypotheses were
made as to results.
To determine the effect of item precision on the output
of the sequential test, four test models were constructed
with an input of a normal distribution of ability and item
precision taking the values of rbis equal to .79, .71, .60,
and .45. Item difficulties were those determined by Lord's
procedure to be most appropriate for a given precision level
when assuming a normal distribution of scores desired. The
variances of ability levels for extreme and middle scores,
and the variances of scores for extreme and middle ability
levels were examined by use of Bartlett‘s test.
The null statistical hypotheses (and expected alterna—
tives) concerning the effect of item precision and dif-
ficulty stated that tests which use a normal distribution
of ability for input and a nearly normal output of scores
should yield the following: (The alternative hypothesis
is given in parentheses:)
(1) equal variances of scores for category eight ability
level individuals for all five tests of different
precision levels (most precise test smallest);
(2) equal variances of scores for category 14 and 15
ability level individuals for all five tests of dif-
ferent precision levels (most precise test smallest);
(3) equal variances of ability level scores for the individ—
uals ranked in the top 8.4 per cent by each of the five
tests of different precision levels (most precise test
smallest); and
(4) equal variances of ability level scores for the
individuals ranked in the middle 10 per cent by each
of the five tests of different precision levels (most
precise test smallest).
In addition to these hypotheses, the means and variances
of the test scores, and the discrimination indices between
each of the adjacent ability levels were computed for each
of the five different precision tests.
To determine the effect of errors of using other than
the difficulty level computed by ”least squares" method for
certain items, four sequential tests were constructed.
One had the second item shifted away from the sample mean
in difficulty; another had the second item toward the mean
value. The fifth item encountered by the individual was
likewise displaced toward or away from the mean difficulty
value in the third and fourth test models, respectively.
Again the characteristics of the "error" and "error free"
output distributions were examined.
The null statistical hypotheses that were tested con-
cerning the effect of errors in estimating the difficulty
of the item at the second stage are as follows: (These
hypotheses were used to determine if differences were in
direction hypothesized.)
(1) the number of people in each of a set of score categories
would be independent of whether distributed by an
error free difficulty test or one in which difficulties
at the second stage were away from the mean (50 per cent)
difficulty (more people at middle for "error" test);
(2) the number of people in each of a set of score categories
would be independent of whether the people were distri-
buted by an "error free" difficulty test or one in which
difficulties at the second stage were toward the mean
(50 per cent) level of difficulty (more people at
extreme for ”error" test).
The null statistical hypotheses that were tested con-
cerning the effect of errors in estimating the difficulty of
the item at the fifth stage predicted the following: (These
hypotheses were deduced from same rationale as ones above,
and data were examined more closely as it was hypothesized
that those differences would be in same direction as dif-
ferences above and of a larger magnitude.)
(1) equal variances for the ability level scores for the
individuals ranked in the top 8.4 per cent of the
score distribution (”error free” test smallest);
(2) equal variances of test scores for individuals in
ability category 15 (test with items near 50 per cent
largest);
(3) equal variances of ability level scores for the
individuals ranked in the middle 10 per cent of the
score distribution (test with items away from mean
smallest); and
(4) equal variances of test scores for individuals in
ability category 8 ("error free" test smallest).
The effect of error in estimating the precision of
items was examined by constructing two additional ”least
squares" test models. One test had less precise items for
the second item encountered; the other had less precise
items substituted for the fifth item encountered. Again the
distributions of scores for the "error" and "error free"
tests were examined.
The null statistical hypotheses concerning the effect
of error in estimating the precision of items at the second
stage predicted the following:
(1) equal variances of test scores for individuals in
ability category 15 ("error free” test smaller);
(2) equal variances of test scores for individuals in
ability category 8 (”error free” test smaller); and
(3) equal variances of ability level scores for individ—
uals ranked in top 8.4 per cent of the score distri-
bution ("error free" test smaller).
The null statistical hypotheses concerning the effect
of error in estimating the precision of items at the fifth
stage predicted the following:
(1) equal variances of test scores for individuals in
ability category 15 (”error free" test smaller); and
(2) equal variances of ability level scores for individuals
ranked in top 8.4 per cent of the score distribution
("error free" test smaller).
The general comparison examined the effect of difficulty
on score output, the effect of precision of items, and the
effect of the pattern of items. Difficulty effects were
examined for normal, rectangular, and U—shaped inputs on
tests with item precision of rbis equal to .75 and item dif-
ficulties as listed in Table 20 of the Appendix. (The rule
for selection of difficulties of items is that one should
use an item not at the difficulty level equal to ability
level where split between groups is desired, but difficulty
level should be regressed toward the mean value of 50 per
cent. The lower the rbis the greater should be the regres-
sion.) The distributions and mean ability level scores for
each score were tabled.
Distributions and mean scores were also tabled for five
tests with different item precision and for two tests with
different patterns of items. In addition to these tables eta
between input and output scores was reported for each of these
tests.
CHAPTER IV
ANALYSES AND RESULTS
There are six sections to this chapter. Section one
gives the results of building the six-item sequential test
model. Section two reports the effects of the input distribu-
tion on the score distribution of both the sequential and
the six-item cumulative test models. Section three presents
the effects of item precision and difficulty on the score
distribution of the six-item sequential test model. Sec-
tion four gives the effects of errors of estimating preci-
sion and difficulty parameters on the score distribution of
the sequential test. Section five gives some general results
of changes in difficulty of items, precision of items, and
pattern of items. Section six is a summary of the analyses
and results. In all sections results are simply reported;
interpretation is reserved for Chapter V.
I. SEQUENTIAL TEST CONSTRUCTION
As stated in Chapter III, the sequential test model
was constructed so that Σ(ΣX)²/N was maximized; graphic
methods were used to aid in determination of maximum values.
(ΣX refers to the sum of ability level scores for any one
group. Σ(ΣX)²/N refers to squaring the sum of scores for the
group, dividing by the number in the group, and then summing
over the two or more groups that used the particular item.)
The only restriction was that any item difficulty had to be
more than .20 standard score units away from other difficul-
ties to be considered different from them, and thus to be
used. (The reader will be aided in following the item deci-
sions given below by referring to Figure 3.)
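The construction procedure amounts to a discrete optimization, and its logic can be sketched in code. This is a simplified illustration, not the thesis's actual computation: the split below is a deterministic threshold on ability, whereas the model splits each group probabilistically through the item's precision, and the ability values are hypothetical.

```python
# Sketch of the construction criterion: for a candidate item difficulty,
# each group is split into pass/fail subgroups, and sum((sum X)^2 / N)
# is evaluated over all resulting subgroups.

def criterion(groups):
    """Sum of (sum of ability scores)^2 / N over the groups."""
    return sum(sum(g) ** 2 / len(g) for g in groups if g)

def split_by_difficulty(group, difficulty):
    """Deterministic toy split: pass if ability exceeds the difficulty.
    (The thesis model splits probabilistically via item precision.)"""
    passed = [x for x in group if x > difficulty]
    failed = [x for x in group if x <= difficulty]
    return passed, failed

def best_difficulty(groups, candidates, min_sep=0.20, used=()):
    """Scan candidate difficulties; keep the one maximizing the criterion,
    skipping any within .20 standard-score units of a difficulty in use."""
    best = None
    for d in candidates:
        if any(abs(d - u) < min_sep for u in used):
            continue
        new_groups = []
        for g in groups:
            p, f = split_by_difficulty(g, d)
            new_groups.extend([p, f])
        value = criterion(new_groups)
        if best is None or value > best[1]:
            best = (d, value)
    return best

# Hypothetical five-person group with standard-score abilities:
abilities = [-1.0, -0.5, 0.0, 0.5, 1.5]
print(best_difficulty([abilities], [-0.25, 0.0, 0.25]))  # → (0.0, 2.75)
```

The `min_sep` argument mirrors the restriction that item difficulties must differ by more than .20 standard-score units to be treated as distinct.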
First Item Decision
The values of Σ(ΣX)²/N for +.01, .00, and -.01
difficulty items were as follows: 109073.85, 109073.86, and
109073.85. The maximum value was thus obtained from a .00
difficulty level item, and this item fulfilled the criterion
of selection. Thus, of the 1500 people taking the hypo-
thetical test, 750 would pass and 750 would fail this item.
The mean ability level of these groups was ±.73.
Second Item Decision
The second item produced four groups over which Σ(ΣX)²/N
was maximized. The three strategic values for this item
were ±.23, ±.24, and ±.25, which had values for Σ(ΣX)²/N
of 113796.15, 113796.21, and 113796.00. (Strategic values
were determined by estimating values and plotting these
values of Σ(ΣX)²/N until the maximum value was straddled by
three points that could be read from the graph.) The ±.24
items were selected for the second stage. The resulting four
Fig. 3.--Mean Ability Level of Groups Separated by
Sequential Test and Difficulties of Items Used
groups had mean ability levels of +1.04, +.10, -.10, and
-1.04. At this point 504 individuals had passed both the
first and second items; 246 had passed the first and failed
the second; and like numbers had failed both, and the first
only.
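The branching bookkeeping behind these group counts can be sketched in code. The pass rates below are taken from the counts just reported (750 of 1500, then 504 of 750), not recomputed from the probability model, so this illustrates only the mechanics of tracking groups by response pattern.

```python
# Bookkeeping sketch for the branching group sizes; rates come from the
# counts reported in the text, not from the underlying probability model.

def branch(groups, pass_rates):
    """Split each (path, count) group by its pass rate into P and Q paths."""
    out = []
    for (path, n) in groups:
        p = round(n * pass_rates[path])
        out.append((path + "P", p))
        out.append((path + "Q", n - p))
    return out

stage0 = [("", 1500)]
stage1 = branch(stage0, {"": 0.50})             # item of .00 difficulty
stage2 = branch(stage1, {"P": 504 / 750,        # rates implied by the
                         "Q": 246 / 750})       # reported counts
print(stage2)  # → [('PP', 504), ('PQ', 246), ('QP', 246), ('QQ', 504)]
```

The four counts match the text: 504 passed both items, 246 passed the first and failed the second, and like numbers failed both or failed the first only.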
Third Item Decision
The third stage items were reduced to three in number
as the two middle groups were both given the same difficulty.
Both of these middle groups took the same difficulty because
each has the sum of (LX)2/N of 3287855 for i.09 items and
32878.02 for i,lO items. As :; (2QX)%/N maximized at less
than .10, the ideal difficulty levels would be less than .20
sigma units apart. As this would vio‘ate a condition of the
test construction, the two middle groups were given the same
item which yielded a 2 (2.x)2/N of 32885.60.
The two extremes ability groups produced 2.(£X)2/N
equal to 83416.29, 83417.00, and 83416.84 for .48, .49, and
.50 difficulty items, respectively. Thus the three difficulty
levels used at the third stage are +.49, .00, and -.49. The
mean ability levels of the eight resulting groups were from
highest to lowest 1.21, .62, .48, .33, -.33, -.48, -.62, and
-1.21.
Fourth Item Decision
At this stage there were eight groups taking four dif-
ferent difficulty items (±.73 and ±.40) and resulting in
sixteen groups. Those individuals who had passed (or failed)
the first three items had Σ(ΣX)²/N equal to 62860.59,
62860.69, and 62860.50 for ±.72, ±.73, and ±.74 items respec-
tively. The ±.73 item difficulty was selected. The second
group (PPQ or QQP) had maximum values between ±.45 and ±.50,
which were more than .20 standard score units away from ±.73.
However, the third group (PQP or QPQ) had a (ΣX)²/N that
maximized above .32. The similarity of the groups is shown
in that while a .32 maximum is 18242.13, the .41 maximum is
18243.05. Since such values would give items less than .20
standard deviation units away, the second and third groups
were each given the same difficulty. The remaining group
(QPP or PQQ) maximized between .29 and .35 for Σ(ΣX)²/N
of 15387.12 and 15386.92, respectively. Since the best dif-
ficulty level for the previous two groups would be less than
.20 standard deviation units away, all three groups were
given an item of the same difficulty level. The strategic
values for difficulty of item assigned to the three groups
were ±.39, ±.40, and ±.41, which had Σ(ΣX)²/N values of
54983.63, 54983.81, and 54983.57. Thus the ±.40 item dif-
ficulties were used. Of the eight groups at this stage, one
group took +.73, three took +.40, three took -.40, and one took
an item of -.73 difficulty level.
Fifth Item Decision
The fifth stage decisions resulted in sixteen groups
taking six items of different difficulty, thus producing thirty-
two new groups. The groups that took the different difficulty
levels were as follows: The PPPP and QQQQ groups took items
of +.87 and -.87 difficulty. The PPPQ, PPQP, PQPP, and QPPP
groups took an item of +.66 level of difficulty. (The QQQP,
QQPQ, etc. opposites of the above took an item at -.66.) The
PPQQ, PQPQ, and QPPQ groups each took an item of +.15 dif-
ficulty level. (Opposite groups took the -.15 difficulty item.)
In other words, for the eight groups above the mean, one group
took an item of .87 difficulty, four groups took an item of
.66, and three groups took an item of .15 level of difficulty.
The PPPP group (and its opposite) had Σ(ΣX)²/N values of
45791.18, 45791.20, and 45791.18 for .86, .87, and .88 levels
of difficulty. The PPPQ group maximized the Σ(ΣX)²/N
just above the .71 difficulty level, thus the decision had
to be made to give this group either the same difficulty
item as the PPPP group or the difficulty of the PPQP group.
The PPQP group maximized between .67 and .71, with Σ(ΣX)²/N
values of 13411.14 and 13411.11, respectively. These two
groups were thus given the same difficulty level, as their
curves remained fairly near maximum for the difficulty
level common to both. The PQPP group maximized Σ(ΣX)²/N
between .60 and .65 with values of 10395.92 and 10395.98.
The QPPP group maximized at about .60 with Σ(ΣX)²/N
of 7826.26. Since none of these was .20 standard score
units apart in difficulty, the one difficulty value that
would maximize Σ(ΣX)²/N for all four groups was determined.
The difficulties of .65, .66, and .67 had Σ(ΣX)²/N values
for the eight groups of 49000.44, 49000.65, and 49000.60.
The item of ±.66 difficulty level was thus used for these
eight groups.
The PPQQ group maximized Σ(ΣX)²/N values between
.20 and .28--8208.84 and 8298.88, respectively. This was
more than .20 standard deviation units from .66, so this
group was not given the item of .66 difficulty level. The
PQPQ group maximized Σ(ΣX)²/N at .15 with 8096.18. (Dif-
ficulty levels .10 and .14 had Σ(ΣX)²/N values of 8096.15
and 8096.17, respectively.) The QPPQ group maximized
between .00 and .10 difficulty levels. This was not .20 stan-
dard score units of difficulty away, so the one difficulty
level that would maximize the sum of (ΣX)²/N for these six
groups was determined. The strategic difficulty levels of
.14, .15, and .16 had Σ(ΣX)²/N values for the six groups of
24092.29, 24092.36, and 24092.30. Thus the ±.15 item dif-
ficulties were used.
Sixth Item Decision
The sixth stage had 32 groups taking items at five dif-
ferent difficulty levels (±.87, ±.49, and .00). The PPPPP
group had maximized Σ(ΣX)²/N between .90 and 1.00 dif-
ficulty; the respective Σ(ΣX)²/N values are 32852.58 and
32852.64. The group PPPPQ had Σ(ΣX)²/N values of 13051.00,
13051.01, and 13051.00 at .86, .87, and .88, respectively.
Thus it was clear that these two would not use different
difficulty of item, and neither would any group that maximized
above .75. The other groups which maximized above .75 were
as follows: the PPPQP group, which for .85, .86, .87, and
.88 had Σ(ΣX)²/N values of 11334.16, 11334.17, 11334.17,
and 11334.16, respectively; the PPQPP group, which for .85,
.86, .87, and .88 had 8373.86, 8373.86, 8373.86, and 8373.84,
respectively; the PQPPP group, which maximized Σ(ΣX)²/N
between .80 and .85 with values of 6059.83 and 6059.79,
respectively; and the QPPPP group, which maximized between
.74 and .80, both with a Σ(ΣX)²/N value of 4227.53.
The Σ(ΣX)²/N values for the 12 groups using the same dif-
ficulty level of item were 75898.65, 75898.66, and 75898.60
for the .86, .87, and .88 levels of difficulty, respectively.
The decision was thus to use a .87 difficulty item for
these groups.
The PPPQQ group (the next highest ability level group)
maximized between .55 and .65 with Σ(ΣX)²/N values of
6150.61 and 6150.64, respectively. (The approximate value
for the maximum was determined by plotting the curve from
six points.) Since the group maximized more than .20 standard
deviation units away from the .87 groups and also maximized
within five points of the next lowest group, the decision
was made to use a new difficulty for all remaining groups
that maximized above .40. The remaining groups which maximized
at difficulty levels greater than .40 (but below .60) were as
follows: The PPQPQ group, which for difficulty levels of .50,
.55, and .65, had Σ(ΣX)²/N values of 5135.92, 5136.00, and
5135.88, respectively; the PQPPQ group, which for difficulty
levels of .43, .48, .49, and .50 had Σ(ΣX)²/N values of 4422.34,
4422.41, 4422.41, and 4422.40, respectively; the PPQQP group,
which for difficulty levels of .43, .48, .49, and .50 had
Σ(ΣX)²/N values of 4680.27, 4680.33, 4680.33, and 4680.32,
respectively; the PQPQP group, which for difficulty levels of
.32, .43, and .48 had Σ(ΣX)²/N values of 4198.70, 4198.83,
and 4198.77, respectively; and the QPPPQ group, which for
.32, .43, and .48 had Σ(ΣX)²/N values of 3670.06, 3670.19,
and 3670.14, respectively.
The QPPQP group maximized Σ(ΣX)²/N between .32 and
.43. Difficulty levels of .29, .32, and .43 had values of
3628.46, 3628.48, and 3628.35, respectively. A decision
thus was whether to include this group with the higher or
lower groups. The PQQPP group (next in line) for difficulty
levels of .00 and .09 had Σ(ΣX)²/N values of 4244.60 and
4243.82 and maximized below .09. For this reason the QPPQP
group was included with the higher group instead of .20 units
lower in difficulty, which would have yielded a lower Σ(ΣX)²/N
value.
The sum of Σ(ΣX)²/N for the 14 groups for difficulty
levels of .48, .49, and .50 were 31886.02, 31886.04, and
31885.95. Thus a difficulty of ±.49 was used with each of
these groups.
The remaining six groups all maximized between ±.09;
thus the .00 item was used here. The QPPQQ group for difficulty
levels of .00 and .09 had Σ(ΣX)²/N values of 4244.60 and
4243.82, respectively. The PQPQQ group had 3983.96 and
3983.54 for these same values, and the PPQQQ group for dif-
ficulty levels of .00 and .09 had Σ(ΣX)²/N values of
3615.52 and 3615.32, respectively.
Thus, of the 16 groups above the mean, six groups took
the .87 difficulty item, seven groups took the .49 difficulty
item, and three groups took the .00 difficulty item at the
final stage.
The above sequential test was compared with the cumu-
lative test to determine how well the score differentiated
individuals of different ability levels and to determine the
range of ability levels assigned to any one score.
The above sequential test was also used in the deter-
mination of the effects of errors in estimating the parameter
values for the items in this test. Parameter values consid-
ered were difficulty and precision.
The pattern of items determined above was also used
with different, easily computed difficulties, to determine
how a test with an arbitrary pattern and easily computed
difficulties compared with the test determined above.
II. INPUT DISTRIBUTION EFFECTS
Normal and U-shaped distributions were each used
with the cumulative and the "least squares" sequential test.
The results from the two distributions are presented separately.
Results from the Normal Distribution
The first null hypothesis was that there should be
equal means of the comparable normalized scores for the
middle category, category eight, individuals taking the se-
quential and cumulative test both with a normal distribution
of ability input. Results are shown in Table 1. As can be
seen from the table, the null hypothesis tested by a "t"
test must be accepted. This was expected as both have a
symmetrical distribution of scores. This hypothesis was
included as a parallel hypothesis to hypothesis one on U-
shaped distribution (and as a check on the accuracy of
computer computations). In this, and all other hypotheses,
the reader should be aware of the fact that the number of
individuals is dependent only upon the accuracy of the cal-
culations. Since the figures were carried to between eight
and twelve places, a larger N could well be assumed. This
would make the error terms smaller and the differences signifi-
cant. The theoretical 1000 individuals were used to give
the reader a point of reference. If the differences exist
in the proper direction, the rationale may be said to be
supported.
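The dependence of significance on the assumed N can be illustrated with the Table 2 figures. The group sizes n below are hypothetical (the thesis fixes only the theoretical total of 1000); the point is that, for fixed means and variances, the t statistic grows as the square root of n.

```python
from math import sqrt

# Illustration with hypothetical group sizes: the same mean difference
# becomes significant if a larger N is assumed, because the standard
# error of the difference shrinks as 1/sqrt(n).

def t_stat(m1, v1, m2, v2, n):
    """Two-sample t with equal group sizes n and the given variances."""
    return (m1 - m2) / sqrt(v1 / n + v2 / n)

# Means and variances from Table 2 (sequential vs. cumulative):
for n in (50, 100, 400):
    print(n, round(t_stat(63.40, 3.87, 62.96, 6.77, n), 2))
```

Quadrupling the assumed n doubles the t value, which is the sense in which "a larger N could well be assumed" would make the differences significant.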
The second null hypothesis was that there would be
equal variances of the comparable normalized scores for
middle category (number 8) individuals taking the sequential
test and the cumulative test both with a normal distribution
of ability input. Results are shown in Table 1. Again
the null hypothesis tested by an F ratio test must be accepted.
This was expected from the rationale.
The third null hypothesis was that there should be
equal means of the comparable normalized scores for combined
category 14 and 15 individuals taking the sequential and
cumulative tests both with a normal distribution of ability
input. Results are shown in Table 2. The null hypothesis
was based upon 1000 individuals and accepted. The scores
were in the expected direction with the sequential test
assigning the more extreme value; therefore, the rationale
tends to be supported.
The fourth null hypothesis was that there should be
equal variances of the comparable normalized scores for com-
bined category 14 and 15 individuals taking the sequential
and cumulative tests both with a normal distribution of
ability input. Results are shown in Table 2. The null
hypothesis was rejected at the .01 level of significance.
The sequential test had lower variance for high ability in-
dividuals as was predicted from the rationale.
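For the variance comparison itself, only the F ratio is fixed by the tabled values; its significance again depends on the degrees of freedom implied by the assumed N. A minimal check against Table 2:

```python
# Variance-ratio (F) check for the fourth hypothesis, using the Table 2
# variances; the degrees of freedom depend on the assumed N, so only the
# ratio itself is reproduced here.
F = 6.77 / 3.87  # larger variance (cumulative) over smaller (sequential)
print(round(F, 2))  # → 1.75
```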
The fifth null hypothesis was that there should be equal
means of ability level scores for the individuals in the top
8.4 per cent of the score distributions taking the sequential
and cumulative tests both with a normal distribution of
ability input. Results are shown in Table 3. The null
hypothesis was rejected at the .01 level of significance.
The sequential test had a higher mean ability level for the
TABLE 1
ANALYSIS OF MEANS AND VARIANCES OF NORMALIZED SCORES
FOR CATEGORY 8 INDIVIDUALS WHEN NORMAL DISTRIBUTION
OF ABILITY IS INPUT INTO SEQUENTIAL
AND CUMULATIVE TEST MODELS
Significance
Parameter Sequential Test Cumulative Test Between Tests
Mean 50.00 50.00 n.s.
Variance 16.37 21.22 n.s.
TABLE 2
ANALYSIS OF MEANS AND VARIANCES OF NORMALIZED SCORES
FOR CATEGORY 14 AND 15 INDIVIDUALS WHEN NORMAL
DISTRIBUTION OF ABILITY IS INPUT INTO
SEQUENTIAL AND CUMULATIVE TEST MODELS
Significance
Parameter Sequential Test Cumulative Test Between Tests
Mean 63.40 62.96 n.s.
Variance 3.87 6.77 p < .01
TABLE 3
ANALYSIS OF MEANS AND VARIANCES OF ABILITY LEVEL SCORES
FOR THE TOP 8.4 PER CENT OF THE SCORE DISTRIBUTION
WHEN NORMAL DISTRIBUTION OF ABILITY IS INPUT INTO
SEQUENTIAL AND CUMULATIVE TESTS
Significance
Parameter Sequential Test Cumulative Test Between Tests
Mean 13.66 12.92 p < .01
Variance 2.35 3.47 p < .05
top 8.4 per cent of the score distribution as had been pre-
dicted.
Null hypothesis six was that there should be equal
variances of ability level scores for the individuals in the
top 8.4 per cent of score distributions taking sequential and
cumulative tests, both with a normal distribution of ability
input. The results are shown in Table 3. The null hypothe—
sis was rejected at the .05 level of significance. The
sequential test had smaller variance of ability level scores
for the top 8.4 per cent of the score distribution as had
been predicted.
To examine the hypothesis that the six-item cumulative
test model would have smaller differences in mean ability
levels between the middle and adjacent scores than between
the extreme and adjacent scores, the differences in mean
ability level for adjacent scores were computed. These dif-
ferences are reported in Table 5, column 3. As was hypothe-
sized, the smaller differences in ability level were between
the middle score 4, and the adjacent score 5. However, it
should be noted that the differences between ability level
scores for adjacent scores for the sequential test model
(shown in Table 6) were not equal interval and there is no
pattern to the differences shown, although in both cases the
differences were greatest for the extreme scores.
If one wishes to examine the mean ability level and
number of individuals at each score, these values are shown
TABLE 4
DIFFERENCES BETWEEN NORMALIZED “T“ SCORES
FOR ADJACENT TOP ABILITY LEVELS FOR
NORMAL AND U-SHAPED INPUT
Between
Ability Ideal Normal Input U-Shaped Input
Levels Difference Cumulative Sequential Cumulative Sequential
15-14 4.5 2.3 2.5 1.2 2.6
14—13 2.5 1.1 2.1 1.1 1.7
13-12 2.5 1.9 2.1 1.5 1.8
12—11 2.5 2.0 2.3 1.7 1.7
11-10 2.4 2.4 2.3 1.7 1.4
10— 9 2.5 2.3 2.3 1.6 1.4
9- 8 2.5 2.5 2.3 1.6 1.2
TABLE 5
DIFFERENCES BETWEEN ABILITY LEVEL SCORES FOR
ADJACENT TOP SCORES FOR CUMULATIVE TEST
MODEL FOR NORMAL AND U-SHAPED INPUT
Input
Between Scores* Ideal Difference Normal U—Shaped
7-6 2.33 2.1 1.9
6-5 2.33 1.5 2.0
5-4 2.33 1.3 2.0
*Scores range from 1-7.
TABLE 6
DIFFERENCES BETWEEN ABILITY LEVEL SCORES FOR
ADJACENT TOP SCORES FOR SEQUENTIAL TEST
MODEL FOR NORMAL AND U-SHAPED INPUT
Between Input Between Input
Scores* Normal U-Shaped Scores* Normal U-Shaped
64-63 1.4 .8 48-47 .3 .1
63-62 .0 .0 47-46 .3 .3
62-61 .2 .1 46-45 .2 .5
61-60 .1 .2 45-44 .3 .2
60-59 .3 .2 44-43 .2 .0
59-58 .1 .2 43-42 .0 .4
58-57 .6 .6 42-41 .0 .1
57-56 .3 .1 41-40 .0 -.1
56-55 .0 .2 40-39 .5 .5
55-54 .0 .1 39-38 -.4 -.1
54-53 .5 .2 38-37 .4 .1
53-52 .0 .0 37-36 .0 .0
52-51 .0 .3 36-35 .0 .2
51-50 .0 .1 35-34 .0 .2
50-49 .2 -.1 34-33 .3 .3
49-48 .0 .4
*Ideal difference if all had been equal intervals.
in Tables 25 and 26 of the Appendix. The mean normalized
"T" score for each ability level is reported in Table 24
of the Appendix.
Results from the U-Shaped Distribution
The first null hypothesis was that there should be
equal means of the comparable normalized scores for category
13 individuals taking the sequential and cumulative tests
both with an input of a U-shaped distribution of ability.
Results are shown in Table 7. As can be seen, the null
hypothesis must be accepted. The sequential test did have
the higher mean value as expected, but not significantly so
if 1000 individuals are assumed to have taken the test.
The rationale would tend to be supported, though the effect
is small. (See comments on size of N under "Results from
the Normal Distribution.")
The second null hypothesis was that there should be
equal variances of the comparable normalized scores for
category 13 individuals taking the sequential and cumulative
tests each with an input of a U—shaped distribution of
ability. From Table 7 one can determine that the null hypothe-
sis must be accepted if only 1000 individuals are assumed to
have taken the test. The variance of the sequential test
was less, however, than the cumulative test as anticipated
though the effect was small.
The third null hypothesis was that there should be
equal means of the comparable normalized scores for category
TABLE 7
ANALYSIS OF MEANS AND VARIANCES OF NORMALIZED SCORES
FOR CATEGORY 13 INDIVIDUALS WHEN A U-SHAPED
DISTRIBUTION OF ABILITY IS INPUT INTO
SEQUENTIAL AND CUMULATIVE TEST MODELS
Significance
Parameter Sequential Test Cumulative Test Between Tests
Mean 58.44 58.03 n.s.
Variance 13.99 14.00 n.s.
TABLE 8
ANALYSIS OF MEANS AND VARIANCES OF NORMALIZED SCORES
FOR CATEGORY 15 INDIVIDUALS WHEN A U-SHAPED
DISTRIBUTION OF ABILITY IS INPUT INTO
SEQUENTIAL AND CUMULATIVE TEST MODELS
Significance
Parameter Sequential Test Cumulative Test Between Tests
Mean 60.73 60.44 n.s.
Variance 1.96 3.62 p < .01
TABLE 9
ANALYSIS OF MEANS AND VARIANCES OF ABILITY LEVEL SCORES
FOR THE TOP 13.5 PER CENT OF THE SCORE DISTRIBUTION
WHEN A U-SHAPED DISTRIBUTION OF ABILITY IS INPUT
INTO SEQUENTIAL AND CUMULATIVE TESTS
Significance
Parameter Sequential Test Cumulative Test Between Tests
Mean 14.43 13.87 p < .01
Variance .77 1.86 p < .01
15 individuals taking the sequential and cumulative tests
both with an input of a U-shaped distribution of ability.
The results are shown in Table 8. The null hypothesis must
be accepted, although the results were in the direction
indicated by the research hypothesis. The cumulative had a
lower value for the mean. Again significance depends upon
number of individuals assumed to have taken the test.
The fourth null hypothesis was that there should be
equal variances of the comparable normalized scores for
category 15 individuals taking the sequential and cumula-
tive tests both with an input of a U-shaped distribution of
ability. As shown in Table 8, the null hypothesis was re—
jected at the .01 level of significance. The sequential test
had less variance of scores for the highest ability level
individuals than did the cumulative test. 5
The fifth null hypothesis was that there should be equal
means of ability level scores for the individuals in the top
13.5 per cent of the score distribution taking the sequential
and cumulative tests both with an input of a U-Shaped distri-
bution of ability. The results are shown in Table 9. The
sequential test had a significantly higher mean ability level
for the top 13.5 per cent of the score distribution than did
the cumulative. This was in the direction hypothesized.
The sixth hypothesis was that there should be equal
variances of ability level scores for the individuals in the
top 13.5 per cent of the score distribution taking the sequen-
tial and cumulative tests both with an input of a U—shaped
distribution of ability. The results in Table 9 indicate
that the sequential test had, at the .01 level of signifi-
cance, a smaller variance of ability level scores for the
top 13.5 per cent of the score distribution than did the
cumulative test. This was in the direction hypothesized.
The differences in mean ability level between adjacent
top scores for the cumulative and sequential test models are
shown in Tables 5 and 6, respectively. The scores on the
sequential test did not yield equal intervals on the ability
level scale as had been hypothesized. The cumulative scores
are a good approximation of equal intervals on the ability
level scale.
To examine the hypothesis that the sequential test
model should have approximately equal distance between test
score means for each of the ability categories, while the
six-item cumulative would have larger differences in mean
test scores for the middle ability levels than for extreme
ability levels, the differences between adjacent scores were
computed. These differences are reported in Table 4. The
cumulative test did have smaller score differences between
the extreme ability levels than any other point in ability
distribution. However, the sequential test did not have an
equal interval scale, but in general decreased in size of
difference between mean scores of adjacent ability levels
from extreme ability category to middle ability category.
It should be noted that neither test represented the ability
levels with any real accuracy. The top ability level shown
had an ideal ”T” score of 69 instead of the 61.8 assigned
by the sequential or the 60.4 assigned by the cumulative
test. (See Table 24.)
III. ITEM PRECISION AND DIFFICULTY FOR
THE SEQUENTIAL TEST
Five levels of precision and the appropriate levels of
difficulty for each were used in the construction of five
sequential test models. For these tests the variances of
scores for the extreme and middle ability levels and the
variances of ability level for extreme and middle scores
were examined.
Variance of Scores
The first null hypothesis was that there would be equal
variances of scores for category 8 ability level individuals
for all five tests of different precision level. Data and
results are shown in Table 10. The null hypothesis was re-
jected at the .001 level of significance. As was hypothe—
sized, the more precise tests had smaller variances.
The second null hypothesis was that there would be
equal variances of scores for a combination of ability level
categories 14 and 15 for all five tests of different preci-
sion level. From data in Table 10 it can be seen that the
null hypothesis was rejected at the .001 level of signifi-
cance; the more precise tests had smaller variances of scores.
TABLE 10
ANALYSIS OF THE VARIANCE OF SCORES FOR
INDIVIDUALS AT SPECIFIED ABILITY
LEVELS FOR FIVE TESTS OF
DIFFERENT PRECISION
Precision of Test
Ability Significance
Category .45 .60 .71 .75 .79 of Difference
8 260.03 198.70 147.52 127.33 111.65 p < .001
14 and 15 94.74 40.90 19.69 15.50 11.86 p < .001
TABLE 11
ANALYSIS OF THE VARIANCE OF ABILITY LEVEL SCORES FOR
INDIVIDUALS AT SPECIFIED SCORE LEVELS FOR
FIVE TESTS OF DIFFERENT PRECISION
Score Level Precision of Test Significance
(Per Cent) .45 .60 .71 .75 .79 of Difference
Top 8.4 5.88 3.71 2.45 2.10 1.77 p < .001
Middle 10 8.22 5.37 3.62 3.06 2.56 p < .001
As was hypothesized, the precision of the item was an impor-
tant variable in precision of scores.
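The thesis does not state which test produced the significance levels in Table 10. One simple check in the same spirit (an assumption of this sketch, not the author's stated method) is Hartley's F-max, the ratio of the largest to the smallest variance among the five tests:

```python
# Hartley's F-max sketch applied to the Table 10 variances. Which test
# the thesis actually used for the reported p-values is not stated.
variances_cat8 = [260.03, 198.70, 147.52, 127.33, 111.65]
variances_high = [94.74, 40.90, 19.69, 15.50, 11.86]   # categories 14 and 15

def f_max(vs):
    """Ratio of largest to smallest variance in a set of groups."""
    return max(vs) / min(vs)

print(round(f_max(variances_cat8), 2))   # → 2.33
print(round(f_max(variances_high), 2))   # → 7.99
```

The much larger spread for the high-ability categories reflects how strongly item precision narrows the score variance at the extremes.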
Variance of Ability Levels
The third null hypothesis was that there would be equal
variances of ability level scores for the individuals ranked
in the top 8.4 per cent of the score distribution by each of
the five tests of different precision level. Data and results
are shown in Table 11. The null hypothesis was rejected at
the .001 level of significance. As can be seen, the preci-
sion of item was important in determining the precision of
the scores as hypothesized. The individuals assigned to the
top 8.4 per cent of the score distribution were not as
variable in ability level when assigned by a test with items
having an rbis of .79 as when assigned by a test with items
having an rbis of .45.
The fourth null hypothesis was that there would be
equal variances of ability level scores for the individuals
ranked in the middle 10 per cent of the score distribution
by each of the five tests of different precision level. As
can be seen in Table 11, the null hypothesis was rejected
at the .001 level of significance. The results were in the
direction hypothesized--the more precise tests had smaller
variance of ability levels. However, it should be noted
that for the middle 10 per cent of the score distribution,
the variances were 8.22 and 2.56 for the .45 and .79 tests,
respectively. The one variance is 3.21 times larger than the
other. For the top 8.4 per cent of the score distribution
the variances were 5.88 and 1.77 for the .45 and .79 tests,
respectively. The larger variance is 3.32 times the other.
Greater differences at the top than at the middle of the
score distribution were contrary to what had been expected.
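The two ratios quoted above can be checked directly from the Table 11 variances:

```python
# The variance ratios quoted in the text, recomputed from Table 11:
middle = 8.22 / 2.56   # .45 test vs .79 test, middle 10 per cent
top = 5.88 / 1.77      # .45 test vs .79 test, top 8.4 per cent
print(round(middle, 2), round(top, 2))  # → 3.21 3.32
```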
Table 12 gives the means and variances of rank scores
assigned to each ability level by the five tests of different
precision. The means for Category 8 individuals were always
the same. However, the mean rank scores assigned to category
1 individuals were lower as the precision of the item in-
creased. This was especially noticeable at the lower preci-
sion levels. The variances of the test scores for each
ability level decreased with the precision of item as was
hypothesized. Also, it should be noted that the variances
of extreme scores were much lower than the variances of the
middle value scores.
The discrimination indices are reported in Table 13.
(Only one-half of the score distribution is tabled because
the two halves are symmetrical about the mean.) The higher
the value, the better the discrimination. The test con—
sisting of the most precise items had the highest discrimin-
ation index. The test was more discriminating for the extremes
in ability than it was for the other ability values. However,
the other values of the discrimination index were remarkably
close to each other for all ability levels other than the
extremes. This was what had been hoped for with the sequen-
tial test.
TABLE 12
THE MEANS AND VARIANCES OF RANK SCORES ASSIGNED TO EACH
ABILITY LEVEL BY FIVE TESTS OF DIFFERENT PRECISION
(Means and variances of rank scores for each ability
level at precision levels .45, .60, .71, .75, and .79.)
TABLE 13
THE DISCRIMINATION INDICES BETWEEN ADJACENT ABILITY
LEVELS FOR THE INPUT OF A NORMAL DISTRIBUTION OF
ABILITY INTO TESTS OF DIFFERENT PRECISION
Precision Level of Test
Between Ability
Levels .45 .60 .71 .75 .79
1 and 2 .42 .58 .89 .70 .81
2 and 3 .22 .30 .42 .43 .50
3 and 4 .22 .33 .43 .47 .51
4 and 5 .24 .34 .43 .50 .54
5 and 6 .23 .34 .46 .52 .57
6 and 7 .24 .38 .46 .52 .59
7 and 8 .24 .35 .48 .52 .58
IV. ERRORS IN THE SEQUENTIAL TEST
PARAMETER ESTIMATES
Errors in estimating the difficulty level of items and
errors in estimating the precision of items were investigated.
Four different tests with errors in estimates of difficulty
were constructed, and two tests with errors in precision
were built. All tests used the "least squares" difficulties
as the base for comparison. The results of investigating
these two types of errors will be discussed separately.
Errors in Estimating Difficulty
Of the four tests with errors in estimates of item dif-
ficulty, two had the error at the second item encountered and
two had the error at the fifth item encountered.
Second item error.--The first null hypothesis was that
the number of people in each set of score categories would
be independent of whether the people were classified by an
”error free” test or one which had items too far from the
mean at the second stage. The distributions are reported
in Table 27 of the Appendix. The number of individuals at
12 selected categories, the expected values from an indepen-
dence assumption, and the chi-square value are reported in
Table 14. The null hypothesis had to be accepted. There
were more people at the middle values as hypothesized, but
the differences were not significant if 1000 people were
assumed to have taken the test. It can be concluded that
the effects of second-item errors are small.
The second null hypothesis was that the number of
people in each set of score categories would be independent
of whether the people were classified by an "error free"
test, or by a test which had the second item encountered
too near the mean value. The distribution is reported in
Table 27 in the Appendix. The number of individuals at
12 selected categories, the expected values from an indepen-
dence assumption, and the chi—square value are reported in
Table 15. The null hypothesis had to be accepted. However,
there were more people at the extreme categories in the
modified test than in the "error free” test, as was hypothe-
sized. The differences were not significant due to the
assumption of 1000 individuals.
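The dependence of significance on the assumed N can be made concrete: in a chi-square statistic, scaling every observed and expected count by the same factor scales the statistic by that factor, so the same proportions under a larger hypothetical population could pass the critical value. The following sketch is not part of the original study; the six counts are the reported cells for the "2nd item extreme" test in Table 14.

```python
# Scaling demo: chi-square grows linearly with the assumed sample size
# when the observed proportions are held fixed.
def chi_square(observed, expected):
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

observed = [59, 97, 50, 106, 86, 100]
expected = [62.44, 100.20, 57.91, 96.68, 86.61, 94.16]

base = chi_square(observed, expected)

# Assume ten times as many examinees with identical proportions.
scaled = chi_square([10 * o for o in observed], [10 * e for e in expected])

# The statistic is exactly ten times larger, while the critical value
# for the same degrees of freedom is unchanged.
assert abs(scaled - 10 * base) < 1e-9
```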
Fifth item error.-—The first null hypothesis was that
of equal variances of the ability level scores for the
individuals ranked in the top 8.4 per cent of the score dis-
tribution by the ”error free" difficulty test and the tests
which had the fifth item too far and too near the mean value.
The variances of the ability level scores for the top 8.4
per cent in each of the tests are reported in Table 16. The
differences in variances were not significantly different
from each other. However, the "error free" test did not
have the smallest variance, as was hypothesized. The test with
TABLE 14

DISTRIBUTION OF INDIVIDUALS BY TWO TESTS--ONE TEST
WITH SECOND ITEM DIFFICULTIES FARTHER FROM 50
PER CENT LEVEL THAN THE "ERROR FREE" TEST*

                             Rank Scores
Test        64       58-63     54-57     46-53     40-45     33-39

2nd Item  (62.44)  (100.20)   (57.91)   (96.68)   (86.61)   (94.16)
extreme     59        97         50       106        86       100

"Error    (61.56)  ( 98.80)   (57.09)   (95.32)   (85.39)   (92.84)
Free"       65       102         65        86        86        87

x2 = 10.624                                             d.f. = 11
TABLE 15

DISTRIBUTION OF INDIVIDUALS BY TWO TESTS--ONE TEST
WITH SECOND ITEM DIFFICULTIES NEARER TO 50 PER
CENT LEVEL THAN THE "ERROR FREE" TEST*

                             Rank Scores
Test        64       58-63     54-57     46-53     40-45     33-39

2nd Item  (67.24)  (104.14)   (73.81)   (82.40)   (88.47)   (85.94)
near 50     68       104         81        77        89        83

"Error    (65.76)  (101.86)   (72.19)   (80.60)   (86.53)   (84.06)
Free"       65       102         65        86        86        87

x2 = 10.624                                             d.f. = 11
*NOTE: The rank scores are broken to make approximately
equal intervals on the ability scale. The scores
1-32 are not reported in the table but are symmetrical
about 32.5. All values were used in the calculations
of chi-square. Expected cell frequencies are given in
parentheses.
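The parenthesized expected frequencies in Tables 14 and 15 are the usual independence expectations (row total times column total over the grand total). A sketch, using only the six reported columns of Table 14; since the thesis computed chi-square over all 24 cells, the statistic below covers just the reported portion and is smaller than the tabled 10.624.

```python
# Independence expectations and chi-square contributions for the six
# reported rank-score categories of Table 14.
rows = [
    [59, 97, 50, 106, 86, 100],  # 2nd item extreme
    [65, 102, 65, 86, 86, 87],   # "error free"
]
col_totals = [sum(col) for col in zip(*rows)]
row_totals = [sum(row) for row in rows]
grand = sum(row_totals)

chi_sq = 0.0
for r, row in enumerate(rows):
    for c, obs in enumerate(row):
        exp = row_totals[r] * col_totals[c] / grand  # independence expectation
        chi_sq += (obs - exp) ** 2 / exp

# The first expected cell reproduces the tabled value (62.44).
first_exp = row_totals[0] * col_totals[0] / grand
```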
items nearer to the 50 per cent level of difficulty had the
least variance of ability represented in the top 8.4 per cent of
the score distribution. Rationale was not supported here.
The second null hypothesis was that there would be
equal variances of test scores for individuals in ability
category 15 on the ”error free" test and the tests which had
the fifth item too far and too near the mean value. The
results in Table 17 show that the null hypothesis must be
accepted. The largest variance was for the test with items
nearer the 50 per cent level of difficulty, as was hypothe-
sized, even though the results were not significant due to
the assumption of only 1000 individuals.
The third null hypothesis was that there would be equal
variances of ability level scores for the individuals ranked
in the middle 10 per cent of the score distribution by the
three tests. The results of these tests are shown in
Table 16. The test with the items at the fifth stage near
the 50 per cent level of difficulty had lower variance than
other tests, but not significantly so. It was hypothesized
from Lawley's work on the cumulative that the test with the
difficulties away from the mean would have had the smallest
variance. Rationale was not supported.
The fourth null hypothesis was that there would be
equal variances of test scores for individuals in ability
category 8 on all three tests. Again the null hypothesis had
to be accepted. The lowest variance was for the test with
TABLE 16

ANALYSIS OF THE VARIANCE OF ABILITY LEVEL SCORES FOR
INDIVIDUALS AT SPECIFIED SCORE LEVELS FOR ONE
"ERROR FREE" TEST AND TWO "ERROR IN
DIFFICULTIES OF FIFTH ITEMS" TESTS

                              5th Items   5th Items   Significance
              "Error Free"    Nearer      Away from   of
Score Level   Test            50%         50%         Differences

Top 8.4 %     2.30            2.25        2.33        n.s.
Middle 10%    3.08            3.06        3.30        n.s.
TABLE 17

ANALYSIS OF THE VARIANCE OF RANK SCORES FOR INDIVIDUALS
AT SPECIFIED ABILITY LEVELS FOR ONE "ERROR FREE" TEST
AND TWO "ERROR IN DIFFICULTIES OF FIFTH ITEMS" TESTS

                                5th Items   5th Items   Significance
                "Error Free"    Nearer      Away from   of
Ability Level   Test            50%         50%         Differences

15              148.08          173.34      129.01      n.s.
8               5.57            4.75        6.67        n.s.
the fifth item nearer the 50 per cent level of difficulty.
It had been hypothesized that the "least squares" test would
have the smallest variance.
Errors in Estimating Precision
Two tests were built to examine the error of estimating
precision: one with rbis equal to .71 items at the second
stage of the "least squares" (rbis = .75) test, and the other
with rbis equal to .71 items at the fifth stage. These are
discussed separately.
Errors at the second stage.--The first null hypothesis
was that there would be equal variances of test scores for
individuals in ability category 15 for the ”error free” test
and the test where the precision was lowered at the second
stage. The variances of the "error free” and ”error” tests
were 5.57 and 5.94, respectively, for ability category 15.
The F ratio was 1.06 and thus the null hypothesis had to be
accepted. The variance increased with error as was expected,
but not to a significant degree if only 1000 individuals were
assumed to have taken the test.
The second null hypothesis was that there would be equal
variances of test scores for individuals in ability category 8
for the "error free" test and the test where precision was
lowered at the second stage. The variance of the "error
free" test was 148.08 and for the "error” test was 149.09.
The F ratio was 1.01 and again the variance increased as was
hypothesized, but not significantly so if an N of 1000 was
assumed. It should be noted that the F ratio for the
variances at ability category 15 was greater than the F ratio
for variances at level 8--i.e., errors at the second stage
seemed to have a greater effect on extreme scores as was
anticipated.
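The F ratios reported in this section are simply the ratio of the larger to the smaller variance; the sketch below reproduces the two comparisons above from the variances quoted in the text (the thesis reports the ratios truncated to two figures).

```python
# F ratio of two variances, larger over smaller, as used in the
# comparisons of "error" and "error free" tests above.
def f_ratio(var_a, var_b):
    hi, lo = max(var_a, var_b), min(var_a, var_b)
    return hi / lo

f_cat15 = f_ratio(5.94, 5.57)     # precision error at second stage, category 15
f_cat8 = f_ratio(149.09, 148.08)  # same comparison, category 8

# The category-15 ratio is the larger of the two, i.e., the extreme
# scores are affected more by the second-stage precision error.
assert f_cat15 > f_cat8
```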
The third null hypothesis was that there would be
equal variances of ability level scores for individuals
ranked in the top 8.4 per cent of the score distribution
of each of these two tests. The variance of ability level
scores for top 8.4 per cent on the "error free" test was
2.30 and the variance of ability level scores for the
"error" test was 2.37. The null hypothesis had to be ac-
cepted, but the variance did increase with error in item
precision. Again significance depended upon the value
assumed for N.
Errors at the fifth stage.--The first null hypothesis
was that there would be equal variances of test scores for
individuals in ability category 15 for the "error free preci-
sion" test and the test with ”error” in precision at the
fifth stage. The "error free” test had a variance of test
scores of 5.57 and the ”error” test had a variance of 5.71
for ability category 15. The null hypothesis had to be
accepted, but the variance was larger for the test with
errors as had been hypothesized. (Changes in the assumed
value of N would change the significance test.)
The second null hypothesis was that there would be
equal variances of ability level scores for individuals
ranked in the top 8.4 per cent of the score distribution by
the two tests. The “error free" test had a variance of 2.30
and the "error" test had a variance of 2.40. Again the null
hypothesis was accepted, but the variance of the test with
the error was larger as hypothesized.
It had been assumed that at the middle ability level
the effects of errors in precision would be slight. The
variance for the "error free" test was 148.08 and for the
"error” test was 150.69. The difference between the variances
of the two tests is slight and the F ratio for the middle
ability variances is the same as the F ratio of variances for
category 15 individuals. This was as expected.
It had also been assumed that errors in precision at
the second stage would be more serious than those at the
fifth stage. In the variances of scores for high ability
individuals, the error in precision at the second stage
increased variance more than error in precision at fifth
stage. (The variances were 5.57, 5.94, and 5.71 for "error
free," error at second, and error at fifth stage precision
tests, respectively.) However, the variance of scores for
middle ability level individuals was higher for the test
with error in the fifth stage than the test with error in
the second stage. (The variances were 148.08, 149.09, and
150.69 for ”error free," error at second, and error at fifth
stage precision tests, respectively.) The test with error at
the fifth stage also had the highest variance of ability
level scores for individuals in top 8.4 per cent of the
score distribution. (Variances were 2.30, 2.37, and 2.40
for "error free," error at second, and error at fifth stage
precision tests, respectively.) The hypothesis that errors
at the second stage would be more serious than errors at the
fifth stage was not confirmed.
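The three-way comparisons in the two preceding paragraphs can be collected in one place; all variances below are the values quoted above, and the grouping is only an illustrative restatement.

```python
# Variances for the "error free", error-at-second-stage, and
# error-at-fifth-stage precision tests, as quoted in the text.
variances = {
    "category 15 scores": {"free": 5.57, "second": 5.94, "fifth": 5.71},
    "category 8 scores": {"free": 148.08, "second": 149.09, "fifth": 150.69},
    "top 8.4% ability": {"free": 2.30, "second": 2.37, "fifth": 2.40},
}

# The second-stage error dominates only for the category-15 scores;
# the fifth-stage error yields the larger variance in the other two
# rows, which is why the "second stage is more serious" hypothesis
# was not confirmed.
second_dominates = [
    name for name, v in variances.items() if v["second"] > v["fifth"]
]
```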
General Comparisons
The three areas of general comparisons were the effects
of difficulty, the effects of precision, and the effects of
the pattern of items. There were no hypotheses made about these general
comparisons. The information is presented to suggest new
hypotheses and to aid in forming tentative conclusions.
Effects of difficulty.--In addition to the hypothesis
testing material already reported, examination of Tables 18,
19, and 20 yields information on difficulty. Only the dif-
ficulty of the items has changed from column to column within
any one of the three tables. It should be noted that the
distribution of difficulties to form a normal output of
scores yielded the highest mean ability level for the top
score, no matter what type of distribution was input. Also,
the distribution of difficulties to produce a U-shaped output
of scores yielded the greatest number of individuals in the
extreme score irrespective of the type of distribution that
was input.
TABLES 18, 19, AND 20

MEAN ABILITY LEVEL SCORES AND NUMBERS OF INDIVIDUALS AT EACH
RANK SCORE FOR NORMAL, RECTANGULAR, AND U-SHAPED EXPECTED
PATTERNS OF OUTPUT, FOR NORMAL, RECTANGULAR, AND U-SHAPED
DISTRIBUTIONS OF ABILITY INPUT

*Rank of the mean ability level of individuals by a score.