r
I?“
‘1

_, 55'

?: I55

    
  

   
   

   

I
i. _- —— w) 1;. 3.: .. n_- ‘-
- ‘. .5551: ‘- ' '
- -.- _- ”5155.3

I
“

. .1, rﬂ'k.‘.- _ O _
'II 5 5‘ I "-3“ I;

    
   

       
     

d- '- “3., "
;:'

III . .II a" " I I: ..g .2425." J ‘ :27;
I I 'I '0‘ H I 5IIHI4I t . . 2.5:; II‘II‘;1_I..'._1.;T;:E‘:£; ._,II' X’L a}
5MiIIIII IIII‘W’ 0 I- ' I ' ‘ 3' ' ' ' 5;, ~."I .A I -',,I we. :'-“1'.I I', 355;" :5? _-
'III LI ' ' I ' “I" I: -‘ AI. 'I' , .3!“ .- ’ 1:1. -I- i. -
I I Md 3 _ - - r... ' . 5.435 5 5.. “‘25: r .j‘- 3-51-31-
”. - ‘ o. I III . QI Iv : l ' h - - I 'zI'E...
. .‘ ' I ' 'I‘ .i‘:%-:11L;F‘ It,“ ;.::‘:;:;‘O.- :._" 1"! it ._-t{l'_
II 1?:5 :5qu ' 5 4‘! 9* 8‘}: . {III “II: 'n"; "'1‘? 9'1“:
05555 I?‘ II IIIIIJ
I
115-:

I II .
“-I‘ C” L A- 2‘; c 5’:
55

 
   
 

. .I ' . . I ' .' I
. g ' .‘u I. "To ‘— . ' ' 5",
. . l' . I. I ' ‘0' I I . I ”in; .
II a... O. I... p' :0 II- ! . l I u i 1:;123
.t A . .. .'- ' ' o. . ‘5; ‘3 ‘- J
l4:- 255‘, III ‘ l - I '-' ' IT“ I.“ '-' I ‘ A -“

'Q L.t 1" I .. "' . I .I ‘ . .".l I Ir“ 5’::; ;:.-:¢-- :‘ - -

WI}?! ' I'l. . I ‘ ‘9'. .' :. 0; _ . - ‘ .._I‘:-rl Q. _I:I":’ ‘ . Y'-' ' -j-‘;::.. J‘.' - :jip.‘ ..;;;. .l. 't: "___. d 4

. . Q . . 9 ' . . 5 ' ‘ . 5 . _ \ g A‘ '— l~ r '
' ' ‘ ‘ ' ' ‘ - 'I ' ' -‘ I - '5! u‘. 1;er .‘-.\ -.-‘L '
‘D k - ' l - -I 5.2:?” ‘1 ‘Lv‘ ' '--h "'l" '
I ' I . - ' ' ' ' .I x -' - 55-: ’ T: rju‘. ,— ,: .
I ‘I; A _ . . o v'.‘ '.' ’ o _ I " ‘t..r‘L.“°.v' '2'
.' ‘ I ‘ I ". '-' .- " ‘.'
I - i I h. ‘ I I Elf-v L 3-7.“: 's':‘
.‘If 2“. II II ’ I I I. ‘~ ' .. ":I Efﬁv "II r,_ " - .~
in. . Q . - g ‘ 5 u _ . Q - o o I '_,I L I t .
i 1:. I I | . I. g . . i J . I," n y " ﬁ‘“ I. . ‘ - 'I.- 3. p...°. .b‘ I 3- . o.-
i a o I O 'I | ‘-| I. 0' I ' - I' I - ' ' a c - o a c
I. . i o n.5l ‘ '—.~ n I . '- ‘. s r- ‘- f ‘q ' ‘| u q, ' -‘§. .
.5 . . " I | p l - I r \. 'I' I.... I u n." '-
. I . I . ‘ . . . I 5 '5- ‘ "- L I 1‘ o- 1. ‘\
y . I . O ‘ l C ’ _ I ' D' o a . . g . . a
. '. .0 . - n ‘ . . v ‘1 . . g .. ..I .0
' I ll ‘ l I '- '
O
U

—-—-’.~

I IIk5III

H1555!
. 5‘; III

   

”JP-I “55 , . 5
| I 5I ' .II. -'-rl':';n‘-"
5I'IJI'IN5II 5, '5i *IuIIJII; III-LII II"? 555 11.. : 5

  
  
   
  
    
   
     

II III 5‘ II 5555II5'
IIIuI'.’ I IIIIII I I. 5IIIIII I I “II I III III “VIII It IIIII
.II III II .IIII'IIIIII iIIIIlII55I‘5I5'5IIiIIIIIIIIﬁI
_ :4 5 I?” rip lupin 5555555554
: III-"I III I; I'LII'II"'IIIHIII”,
lI I'5i .5I, ”555455351555, iI555 g5” . 5
. 5 553:5?" ' 5.. x. ','_
I 5: I IIIIII I5II5IIIIII 5IIIIJIIII W555? I I. I '3 5'
;. I I I5 I 545,555
- "I w I“; I ,,5x M ',=' I.»
IIIIII II"I5II:r5-EII.'_..;- . ‘
I I " "I“ 'II'I I'II‘I II"‘III; 5 zi‘ I

I” i m5.» III
'u; i .3
5 M5 55 ,5JI5JHI 'II5 IIIIJIII 55IIII:I

 

1
WI

 

 
 

I. !
a

57,
Is; '4

O

O . '
. o B
0 - . a
- h
. - . I .o .. ...
‘0 '
" a.“ o _
g.“ 9 ‘
. .0
o ’ . .
I ‘
- --
_ I

  

       
    
  
 

     

.0
._
O I
+"~
c C d .
-‘ I’
O

_ . 5'5 _ c. o? 1:. .
. .IHIIzIII III"; 5:05 ”55 ‘5:,' 9 'J. 51:: .: 2.1
12.5, I5 In III? I I (“III“I’I *

II ' I:.;'I'.o':°'
IIIII IIII. I5:,'5.5{,'I.5I55 'II :III .I ‘IIIIII5I: '5' 'II HI; II II ,5 51:15:91.:6 .
I""' I"I' III 555 5I:I' ’lIL . . I,
i - I.’ :5 I I:I|.1":-|I,I15.'I16I'H.._!, 5:} I1 5 .

I If" “III? :5I 'I i .

II ".II I
" ’ I" .5.I:II_.5'I.' I515 EI‘ I:-

     
   
  

I. III-III
' '15" 'I.'.'5 «III: II I I I -. '
«II;1II.I' fI-I'I'II 5IIIII 'IIIII'II IIIII . , :2": j I,» -"-
III III I 5'5" :III , :IJI '5l5' 555541555 ‘55 5II55 '53 5; 5IILI III; "
I'.‘ I
M I.II'III,"5451,555-1.”:Ia ,
II: "III I 55I5I {45,55551;‘ I,
550555.55“,   5,555 II5'I5 | ‘II5I'5 II5I5'5 I55‘5 5.555 5 ' 5 .:
LIA. I~ ". Im' IIII'IM IIIM‘I II II ”I ' I
I .II II.:.,-I;I'I=,IIII w‘
""'i"""r" :II: “II ‘IIIHAIIe'Mw
5.555 H. Um. 5L5,» I. .
I
|I

 

 

I

H' LI'
'1 (5 l.’ 5155 $55

 

I

.5 55 '5'5'I‘ILIIIII

. , , 0'7; I:
II ' ‘ i] I5I:.' I II:1I::. ' ‘I ”‘7'? ”12'5”. III] IfaIlr I {I5 I
;,_ 1, I, , IIIIMI I

1-—
__

MI

I.‘lI'III -
'11sz

III. II I
II ‘ III“. 55

I I! II 'Ii 5,5 I-: III
d“ .I. IIIIIIiI "' I"III lI'*I.III:

. II'II IIII5IIII

..I
II

 

 

  

LIBRAit Y E"
Michigan State
University

       

This is to certify that the

thesis entitled

A. MODEL FOR CRITERION-REFERENCED MEASUREMENT
AND A COMPARISON OF ITEM ANALYSIS PROCEDURES

presented by

Susan K. Thrash

has been accepted towards fulfillment
of the requirements for

 

 

Ph. D. degree in Educat1on
- a
V/ f ﬂ’KIA
Major professor
(7 g, I.”
Date 1 ’/ “"//

 

0-7639

A MODEL FOR CRITERION-REFERENCED MEASUREMENT
AND A COMPARISON OF ITEM ANALYSIS PROCEDURES

By

Susan Kaye Thrash

A DISSERTATION

Submitted to
Michigan State University
in partial fulfillment of the requirements
for the degree of

DOCTOR OF PHILOSOPHY

Department of Counseling and Personnel
Services and Educational Psychology

1977

ABSTRACT

A MODEL FOR CRITERION-REFERENCED MEASUREMENT
AND A COMPARISON OF ITEM ANALYSIS PROCEDURES

By

Susan Kaye Thrash

The first purpose of this study was to propose a theoretical
conception of criterion-referenced testing and to explain two basic
item analysis techniques (Cox and Vargas, C-V and Roudabush, R)
theoretically with respect to this general model. The second purpose
was to determine the adequacy of the C-V and R procedures using the
theoretical model. The final purpose was to compare three item
analysis techniques, the C-V, R and the Brennan and Stolurow (B-S),
using real data.

A theoretical model for criterion-referenced testing was
proposed. The model includes l2 parameters that completely described
the pretest-posttest situation. The R and C-V indices can be
explained in terms of this general model by making certain assumptions.

There were two parts to this study. The first part attempted
to determine if the C—V and R indices adequately estimated the true
values, if one technique estimated the true values better than the
other and if the C-V and R indices were better estimators of the
true values for some parameter sets. These questions were considered

by simulating data for 21 different sets of parameter values using

Susan Kaye Thrash

the model as the theoretical framework. It was found that for R,

when the assumptions were met, the technique provided a more stable
and accurate estimate than when the assumptions were not met. It

was also found that when the sample size was increased from 50 to 200,
the stability and accuracy increased greatly. The C-V technique
seemed to provide a reasonably accurate and stable estimate regardless
of whether the assumptions were met. The estimates were more stable
with larger sample sizes. Also, the C-V technique estimated the C-V
true value better than the R technique estimated the R true value.

The second part of the study was designed to determine the
comparability of the three item analysis procedures, R, C-V and 8-5.
C-V and R values were computed for 128 items and 8-5 values were com-
puted for 64 items. These items were testing l6 objectives from two
subject areas, Mathematics and Reading, two grade levels, Middle
and Upper, and two treatments, assigned objectives (treatment A) and
selected objectives (treatment B).

The major question to be answered was do the C-V, R and 8-5
item analysis procedures provide comparable results? Three additional
questions were also considered:

1. Are the three procedures more comparable for items in

Mathematics than for items in Reading?;

2. Does and the comparability of the three procedures depend

on the grade level?; and,

3. Are the three procedures more comparable for items given

in treatment A than for items given in treatment B?

Susan Kaye Thrash

The Pearson product moment correlation coefficient between the R and
C-V indices was significantly different than zero (r = .80, p‘<.0l).
The point-biserial correlation coefficients between the 3-5 procedure
and the C-V index and the 8-5 procedure and the R index were also
significantly different than zero (r = .70, p<<.0l and r = .36, p<=.0l,
respectively).

The separate analyses of the indices for each subject area,
grade level and treatment indicated that the indices were more com-
parable for Mathematics than for Reading. The indices were also more
comparable for treatment B than for treatment A. The correlations
between the indices for the grade levels, Middle and Upper, were
almost identical.

An analysis of the agreement among the three item analysis
procedures showed than when a cut-off of .50 for the R and C-V
indices was used for selection of items, there was complete agreement
for 39 of the 64 items (6l percent) given on the pretest and retention
test.

From the results of the several analyses, it appears that the
best item analysis procedure to use for criterion-referenced testing,
or pretest-posttest situations, is the C-V technique. This tech-
nique provides a reasonably accurate and stable estimate of its true
value and gives very similar results when compared to the R index

and the 3-5 procedure.

TO THE N's IN MY LIFE

ii

ACKNOWLEDGMENTS

There are many individuals who have contributed to this work
as well as to my professional and personal growth. Dr. William
Mehrens, my chairman, advisor and friend helped shape my ideas into
a finished product, provided encouragement throughout my studies and
gave me advice whenever I needed it. Dr. William Schmidt, who
deserves a special thanks, spent a number of hours with me building
the framework of this dissertation. Dr. Walter Hapkiewicz has pro-
vided me with constant attention throughout my graduate studies.

His advice and concern for my educational progress has always been
appreciated. Dr. Robert Spira, also a member of my committee, has
had a significant impact on my educational achievement. Dr. Spira
has had the faith and confidence in me to achieve what at times I
was not sure I would be able to do. I will always be indebted to
Dr. Spira for his meaningful comments and suggestions for my disser-
tation, professional goals and personal well-being.

I also wish to thank Joseph Nisenbaker, who assisted me with
the computer simulation, four supervisors, Harley Jensen, Dr. Charles
A. Pounian, Robert Joyce and Dan Wallock, who provided support and
understanding during the trauma of the writing and rewriting of the
dissertation, and my mother-in-law, Mrs. Marguerite Thrash, who also
has provided me with support and encouragement during the completion
of my graduate studies and dissertation.

iii

I have saved for last the one individual who has inspired me
the most, helped to build the self-confidence I lacked, listened to
my ideas and helped to develop these ideas into a dissertation. I
have reserved a very special thanks for this very special person--

thank you--William Thrash.

iv

TABLE OF CONTENTS

Page
LIST OF TABLES .......................... vii
LIST OF DIAGRAMS ......................... x
Chapter
I. INTRODUCTION ........................ l
Need. . , ........................ 1
Purpose ....................... .. .. 2
Research Questions .................... 3
Overview ......................... 4
II. REVIEW OF LITERATURE .................... 6
Proposed Item Analysis Techniques . ........... 7
New Techniques ..................... 7
Traditional Techniques ................. 22
Summary ........................ 25
Comparing Techniques ................... 26
Summary ......................... 33
III. THEORETICAL DISCUSSION ................... 36
Summary ......................... 45
IV. DESIGN ........................... 47
Part A: Design of the Simulation ............ 47
Part B: Design of the Comparison Study with
Actual Data ...................... 55
Summary ......................... 6l
V. RESULTS OF THE SIMULATION ................. 63
The C-V Index: Adequacy and Stability .......... 63
Assumptions Met .................... 63
Comparison--Assumptions Met Versus Assumptions
Not Met ....................... 66
The R Index: Adequacy and Stability ........... 68

Chapter Page

Assumptions Met .......... , .......... 68
Comparison-~Assumptions Met Versus Assumptions
Not Met ....................... 68
The C-V Technique Versus the R Technique ........ 73
Consideration of C-V and R Techniques by Parameter Set
Set .......................... 76
Summary ......................... 81
VI. RESULTS OF THE COMPARISON OF THE THREE INDICES WITH
ACTUAL DATA ........................ 84
Comparability ...................... 86
C-V and R ....................... 86
8-8 and C-V ...................... 88
8-5 and R ....................... 90
B-S and C-V and R ................... 92
Summary ......................... 95
VII. SUMMARY AND CONCLUSIONS .................. 103
Summary ......................... 103
Conclusions ....................... 111
Discussion ....................... 113
Implications for Future Research ............ 114
APPENDICES .......................... ‘ . . 117
I. Roudabush's Technique ................... 118
II. Brennan's and Stolurow's Procedure ............ 125
III. Further Analyses of C-V and R ............... 129
IV. B-S Statistics; Application of the B-5 Decision Rules. . . 133
V. Reliability Estimates of Tests .............. 136
VI. Sample Tests and Objectives ................ 139
VII. Computer Program for the Simulation ............ 172
BIBLIOGRAPHY ........................... 176

vi

Table
2.1
2.2

hh-PD-D-wawwNNN
d-th-HU‘I-hw

0301-th

UTU'IU'IU‘I
01-500“)

LIST OF TABLES

Categories for a Given Item ................

Categories for Individuals Answering Item 1 Correctly
at the Posttest (ISI) ..................

Categories for a Given Item ................
Categories of Performance (Reliability-Crehan) .......
Categories of Performance (Validity-Crehan) ........
Categories for a Given Item ................
Categories for a Given Item ................
True Proportions for a Given Item .............
Observed Proportions for a Given Item ...........
Categories for a Given Item--Observed Proportions .....
Categories for a Given Item--True Proportions .......
Pretest--Actua1 ......................
Posttest--Actua1 ......................
Selected Parameter Values for the Simulation ........
B--Index ..........................
Descriptive Statistics for Each Parameter Set .......
Parameter Sets Where Assumptions for C-V Are Met ......
Average Ranges for the C-V Estimates ............
Parameter Sets Where Assumptions for R Are Met .......

Average Ranges for the R Estimates .............

vii

Page
12

14
15
27
27
36
37
37
39
48
48

49
52
58
64
65
67
69
73

U1

.ONOSOSOSONOONOICSO‘OS

\JNNNOO‘ON

DOOM

#00“)

.5B
.6A
.6B
.7A
.7B

.9
.9A
.98

Page

Summary Statistics Comparing R to C-V ........... 74
Correlations ........................ 74
Summary Statistics for R and C-V With Consideration of

Sample Size and Assumptions ............... 77
Comparison of R and C-V by Parameter Set .......... 80
Corre1ations of C-V and R ................. 87
Corre1ations Between B-S and C-V .............. 9O
Corre1ations Between 8-5 and R ............... 91
Corre1ations for All Items ............ I ..... 92
Corre1ations--Mathematics ................. 93
Corre1ations--Reading ................... 93
Correlations--Midd1e .................... 93
Corre1ations--Upper .................... 93
Corre1ations--Treatment A ................. 94
Corre1ations--Treatment B ................. 94
8-5, R and C-V Values for Items Given on the Pretest and

Retention Test ...................... 96
Agreement of the Three Item Indices 100% Agreement ..... 97
Agreement of the Three Item Indices 67% Agreement ..... 98
Agreement of the Three Item Indices 67% Agreement ..... 99
Pretest--Actua1 ...................... 104
Posttest--Actua1 ...................... 105
Categories for a Given Item--True Proportions ....... 105
Categories for a Given Item--Observed Proportions ..... 105

viii

Table Page

1.1 Categories for a Given Item ................ 119
11.1 Rules for Decision-Making ................. 127
IV.1 B-S Statistics ....................... 134
IV.2 Application of the 8-3 Decision Rules ........... 135
V.1 Reliability Estimates of Tests ............... 137

ix

LIST OF DIAGRAMS

Diagram A Page
4.1 Design of Administration of Items ............ 56
6.1 Design of Administration of Items ............ 85

CHAPTER I
INTRODUCTION

Nggg_

Criterion-referenced testing has been an area much discussed
and researched in recent years. Much of the research and discussions
have focused on the appropriateness of applying classicial measure-
ment theory to criterion-referenced tests and suggestions of new
procedures and statistics for the evaluation of criterion-referenced
tests. Livingston (1971), for example, developed a new statistic
for the estimation of reliability for criterion-referenced tests.
Alternative approaches to classical item statistics were proposed by
several other individuals (Brennan and Stolurow, 1971; Cox and Vargas,
1966; Roudabush, 1973 to mention a few). In addition, a few studies
compared these new item statistics to old statistics (e.g. Cox and
Vargas, 1966; Hambleton and Gorth, 1971; and Hsu, 1971).

Many of these new item statistics, however, were not based
on a theoretical model. If such a model could be found, it would be
easier to explain the item statistics and perhaps possible to develop
more powerful statistical techniques. Moreover, little is known
about the comparability of the new item statistics to each other.
Most of the research has been concerned with the comparison of new
with old; few studies have compared the new item statistics to each
other. It would seem desirable to compare the new statistics both

1

empirically and theoretically, with the aid of a general model, to
determine what the differences among them actually are and to develop

general recommendations for their use.

Purpose

The first purpose of this study is to propose a theoretical
conception of criterion-referenced testing and to explain two basic
item analysis techniques (Cox and Vargas, and Roudabush) theoretically
with respect to this general model.

The second purpose is to determine the adequacy of the Cox
and Vargas and Roudabush techniques. If the two techniques can be
explained by the general model, then the estimate of each index will
be compared to the corresponding true value. In this manner, it may
be possible to determine if one technique estimates the item parameters
better than the other.

A third approach (Brennan and Stolurow) cannot be explained
in terms of the general model due to the nature of the approach.

The Brennan and Stolurow technique combines a number of statistics
with a set of decision rules. The ultimate outcome is a verdict of
revision or no revision for the item and/or the instruction. While
the statistics used in the Brennan and Stolurow method do have the
traditional theoretical framework, the decision rules have only
intuitive appeal. It is not possible to fit the suggested decision
rules of the Brennan and Stolurow technique into a theoretical

framework.

However, the adequacy of the Brennan and Stolurow technique
may be determined by comparison of the three approaches on real data.
This, then, is the final purpose of the study--to determine the com-
parability of the three item analysis procedures (Cox and Vargas,
Roudabush, and Brennan and Stolurow).1 If all procedures provide
identical or nearly identical results then it seems reasonable to use
the simplest method (in terms of computation and data collection) in

the future.

Research Questions

 

In particular, this investigation will consider the following
questions:
1. Can a theoretical conception or a general model of
criterion-referenced testing be defined?
a. Does the C-V technique fit the general model? What
assumptions are needed?
b. Does the R technique fit the general model? What
assumptions are needed?
2. Do the C-V and R techniques adequately estimate the true
values of the item parameters?
a. Does one technique estimate the true values better
than the other?
b. 00 the C-V and R techniques estimate some true values

of the item parameters better than the others?

 

1From this point on, Cox and Vargas, Roudabush, and Brennan
and Stolurow techniques will be abbreviated C-V, R and 8-5, respec-
tively.

3. Do the C-V, R and 8-3 item analysis procedures provide
comparable results?
1a. Are the three procedures more comparable for items in
Mathematics than for items in Reading?
b. Does the comparability of the three procedures depend

on the grade level?

Overview

The previous section provided a brief introduction to the
ideas and questions pursued in this study. Chapter II will provide a
review of the literature relevant to item analysis methods for
criterion-referenced tests. Two types of studies are considered--
studies which proposed item analysis techniques (new and modifications
of traditional approaches) and those which compared new techniques
to old.

The third chapter presents a theoretical conception of
criterion-referenced testing. The C~V index and the R sensitivity
index are described in the context of this theoretical model.

A method for evaluation of the C-V index and the R sensitivity
index with respect to the theoretical model is presented in Chapter
IV. Procedures for determining the comparability of the C-V index,
the R sensitivity index and the B-S method are also discussed in
this chapter.

Chapter V presents the results of the evaluation of the C-V
and R techniques with respect to the model. The results of the
investigation of the comparability of the C-V, R and 8-5 indices in

a practical application are presented in Chapter VI.

Finally, in Chapter VII some implications of the results of
Chapters V and VI for test development are discussed, and some
recommendations for further research on the proposed theoretical

model are given.

CHAPTER II

REVIEW OF LITERATURE

The concept of criterion-referenced measurement in education
has initiated many discussions and much research with respect to
measurement issues. The main points of interest have been cut-off
scores, reliability and item analysis. This review will summarize
the literature on item analysis.

The literature can be divided into two categories. One group
of studies can be collected under the heading of "proposed item
analysis techniques." New techniques have been proposed by some
(Brennan, 1972; Brennan and Stolurow, 1971; Cox and Vargas, 1966;
Crehan, 1974; Hsu, 1971; Ivens, 1970, 1972; Kifer and Bramble, 1974;
Kosecoff and Klein, 1974; Roudabush, 1973; Saupe, 1966) and the use
of old (traditional) techniques have been advocated by others (Davis
and Diamond, 1974; Ebel, 1973; Hambleton and Gorth, 1971; Harris,
1974; Nitko, 1971; Popham and Husek, 1969). The second category
includes research which makes comparisons among the proposed tech-
niques (Cox and Vargas, 1966; Crehan, 1974; Haladyna, 1974; Hambleton
and Gorth, 1971; Helmstadter, 1974; Hsu, 1971; Ivens, 1970, 1972;
Kosecoff and Klein, 1974; Ozenne, 1971).

Proposed Item Analysis Techniques

New Techniques

One of the earliest item analysis techniques proposed for
criterion-referenced tests was suggested by Cox and Vargas in 1966
(Cox and Vargas, 1966). This procedure requires two administrations
of the item--before and after instruction. The item statistic is
then defined as the difference between the proportion of individuals
answering the item correctly as posttest and the proportion of indi-
viduals answering the item correctly at pretest; C-V. (The original
notation was Opp.) This is the simplest technique to use; however,
it has been criticized by Oakland (1972) and Davis and Diamond (1974).

Oakland claims that the C—V technique is limited because it
is "more appropriately used to determine the extent to which students
may profit from instruction rather than to determine the reliability
estimates which apply to a particular CRM" (Oakland, 1972, p. 5).
This is a strange criticism, for indeed the intent of the C-V pro-
cedure is to select items and not to provide reliability estimates.
Oakland also criticizes the use of a statistical technique for item
selection without regard to item content. This is a criticism which
could be applied to the use of any statistical technique in the
selection of items without regard for content.

Davis and Diamond suggest that use of difference scores make
the C-V index unreliable. It should be remembered here that the
statistic is not based on individual difference scores, but the dif-

ference of proportions. They also felt that the use of this statistic

without regard to the content of the items would impair the content
validity of the final form of the test.

According to Davis and Diamond, test developers should use
the same four basic principles that have been in use for 25-30 years.
They do caution, however, using the Second principle without regard
to the content of the item. These principles are:

1. The items in an achievement test should constitute as
nearly as possible a representative sample of the popula-
tion of items that define the domain to be measures . . . .

2. The items in a predictor test, . . . , should constitute
the set (drawn from the population of items that define
the domain to be tested) which best predicts scores on the
designated criterion variable in samples of examinees like
those to whom the test will be administered. . .

3. The items in an achievement test should, within the con-
straint imposed by principle 1, make up as efficient a
measuring instrument as it is possible to produce.

4. Choice-by-choice item-analysis data should be used as a
basis for editing and revising items for achievement,
aptitude, and selection tests. (Davis and Diamond, 1974,
pp. 128-131.)

Of course all these principles are ones that should be con-
sidered regardless of the referencing nature of the test. However,
it does not necessarily follow that the items will be doing the
proper job if these principles are followed.

Ebel (1973) supports the use of the C-V technique when the
purpose of the evaluation is to determine the effectiveness of an
instructional program. However, he indicates traditional item dis-
crimination indices are appropriate when the purpose is to determine
how well an individual has succeeded in a particular course of study.

Ozenne (1971) also recommends the C-V index. In his inves-
tigation of a method of measuring test sensitivity, Ozenne suggested

that a test composed of items selected on the basis of the C-V index

would have the greatest sensitivity to instruction. Haladyna also
recommends the use of the C-V index (Haladyna, 1976 and Haladyna and
Roid, 1976). In fact, he feels that the C-V " . . . index comes con-
ceptually closest to measuring CR item discrimination" (Haladyna,
1976, p. 12).

Other individuals have considered the C-V technique as a
starting point for further modifications. Brennan (1972) proposed
the 8 index, a variation of the C-V technique and the traditional 0.
The 0 statistic is defined as the difference in the proportion of
individuals in the upper group answering the item correctly and the
proportion of individuals in the lower group answering the item
correctly. The upper and lower groups are generally defined as the
top and bottom 27 percent of the individuals ranked on the total test.
The B index is defined as the proportion of individuals in the mastery
group (upper) who answer the item correctly minus the proportion of
individuals in the nonmastery (lower) group who answer the item cor-
rectly (B = U/n1 - L/nz). This index differs from D in that differ-
ent sample sizes in the upper and lower groups are allowed. The
evaluator is then able to use one administration, define the upper
and lower groups according to mastery or nonmastery or by some similar
criterion, and select items on the basis of this index. Brennan also
determined the exact distribution of the 8 index under the null
hypothesis, 8 = 0. This allows the evaluator to compute confidence
intervals for the item statistic.

Hsu had already suggested an identical procedure in 1971

(Hsu, 1971). He suggested that a predetermined cut-off score be

10

established which would classify individuals according to mastery or
nonmastery. According to Hsu, the difference in proportions of those
responding correctly in each group to a given item would be a mean-
ingful discrimination index for items from criterion-referenced
tests. This index is identical to the 8 index.

One of the major problems with this technique is the decision
of what defines mastery and nonmastery. Once this problem is solved,
then it is possible that there will be too few mastery students in
a pilot administration of the item if the group is uninstructed. If
the group has been instructed then there may be too many mastery
students. In either case, U/n] or L/nz would provide somewhat less
than stable proportions and the value of B may not provide an adequate
indication of the item's usefulness.

A modification of the 8 index (and Hsu index) was introduced
by Crehan in his 1974 study (Crehan, 1974). Crehan redefined the
upper and lower groups as independent groups of instructed and unin-
structed, respectively. This modification basically solves the
problem of defining the mastery and nonmastery groups.

The B index as originally proposed by Brennan and Hsu or
modified as suggested by Crehan is very similar to the C-V technique
and traditional techniques. One advantage for using B is the ability
to use a different number of individuals in the upper and lower
groups. A second advantage is the ability to test the null hypothe-
sis, B = 0. It must be remembered, however, that teachers are the
most likely users of criterion-referenced tests. It seems unrealistic

to expect teachers to use sophisticated statistical techniques to

11

select items. A further problem is the availability of probability
levels for B. The table of probability levels is available through
a computer program which Brennan developed. The other criticisms
that were mentioned previously must also be considered in the final
analysis of the 8 index.

A second index that Crehan proposed is defined as the propor-
tion of consistent performances on logically parallel items. In
other words, this index equals the number of individuals who fail
both items plus the number of individuals who pass both items divided
by the total number of individuals. This of course requires the
development of logically parallel items which is not necessarily an
easy task. In addition, it requires the administration of both sets
of items at the same time. For a short test, the time factor would
not be a particular problem.

Crehan also employed a third unique technique in his study.
The items were ranked by having teachers respond to the question,
“Which item would you choose if you were to give a one item test?"
(Crehan, 1974, p. 257). This was done until the item pool was
exhausted. Compared to all the other item analysis procedures pro-
posed, this approach is the most subjective one.1

Another refinement of the C-V method was suggested by
Edmonston, Randall and Oakland (1972). For their method consider the

two by two table below for a given item:

 

1Crehan also used a random ranking of items as an item
selection device. See the section on comparison of techniques for
the results of Crehan's study.

12

 

 

Table 2.1
Categories for a Given Item
Posttest
Pass Fail
Pass pH p12
Pretest
Fail p21 p22

 

The important pieces of information, they claim, are p12 and
p2]. A high value for p21 would indicate a good item. Items that
were less diScriminating would have high p12 values. The refinement
seems unnecessary since the C-V index would be p2] - p12 and provides
information of one value relative to the other.

Schooley, et al. (1976) also recommend consideration of the
proportion of individuals answering the item correctly (p) on pre-
test and posttest. They suggest that the proportion should increase
from pretest to posttest. In addition, items that supposedly measure
the same objective should have similar p values. Those that have
inconsistent p values should be looked at and revised if necessary.
Their approach is very similar to the C-V method since a comparison
of the p values from pretest to posttest would give the same value
as the C-V method.

Ivens also considered the C-V technique in addition to two
indices of his own (Ivens, 1970, 1972; Ozenne, 1971). Iven's indices
require three administrations of the same item to the same subjects.

One of the indices is based on the expectation that there would be a

13

large change in performance from pretest to posttest and a small
change from posttest to retest. Ivens calls this Index 2 and it is
defined as (p post - p pre) (1 - lp retest - p postl ) where p is the
proportion of subjects passing the given item on the particular
administration. The other index (Index 1) is defined as (l - pre-
post agreement) (post-retest agreement) where the agreement is the
proportion of subjects whose item scores (pass or fail) were in agree-
ment across the appropriate administrations.

His final recommendation, however, is that the C-V technique
be used for item selection and the information obtained from Index 2
be used for item revision (Ivens, 1970). The two indices defined by
Ivens need three administrations of the item. In most situations this
would be a definite disadvantage. In addition, if there is a minimum
amount of change from posttest to retest Ip retest - p postl would be
small and l - Ip retest - p postl would be close to one. In this
case, Ivens' Index 2 would be approximately equal to the C-V index.

Ivens' Index 1 is also intuitively appealing. However, Index
1 can have a high value--indicating a good item--and yet be a bad
item. For example, if many students pass the pretest, fail the post-
test and fail the retest, Index 1 would have a high value. Yet,
revision of the item (and probably instruction) should be considered.

Kosecoff and Klein (1974) suggest two indices--an Internal
Sensitivity Index (ISI) and an External Sensitivity Index (ESI). For
the first index (ISI) consider the following table which categorizes

only those individuals who answered Item 1 correctly at the posttest:

14

Table 2.2
Categories for Individuals Answering
Item 1 Correctly at the Posttest

 

 

(ISI)
Posttest
Fail Pass
Fail n1 n2
Pretest
Pass n3 n4

 

where r11 = observed frequency of students who answered Item 1 cor-
rectly on the posttest but failed the pre and posttest; 112 = observed
frequency of students who answered Item 1 correctly on the posttest
but failed the pretest and passed the posttest; n3 = observed fre-
quency of students who answered Item 1 correctly on the posttest but
passed the pretest and failed the posttest; and r14 = observed fre-
quency of students who answered Item 1 correctly on the posttest and
passed the pretest and the posttest.

n - n1
The index ISI is defined as n] 3 n2 + n3 + n4, which accord-

 

ing to Kosecoff and Klein, provides a measure of an item's ability to
discriminate between those who have and have not profited from instruc-
tion. Their interpretation of the index does not, however, follow
from the definition. It is conceivable that the index could have a
high value but all who passed the item at posttest also passed the

item at pretest. How does the item then have the ability to dis-
criminate those who have profited from instruction from those who

haven't? If all the individuals who passed the item at posttest

15

also passed the item at pretest, the item could not be said to be
sensitive to instruction.

Their second index (ESI) is the Cox and Vargas index. The
two indices are identical. Kosecoff and Klein do, however, suggest
a "correction for guessing" for the index. They use the Marks and
N011 procedure, which is also used by Roudabush in the development of
his index, to derive the correction for guessing (Marks and N011,
1967; Roudabush, 1973). They claim to compute the expected cell fre-
quencies and use these values in the computation of the E51. However,
their expected cell frequencies are true frequencies which are
heuristically computed from sample frequencies. This aspect will be
discussed in more detail when Roudabush's sensitivity index is pre-
sented. (See Chapter III and Appendix I).

A method based on the four possible outcome patterns for an
item administered on two occasions was proposed by Popham in 1970
(Kosecoff and Klein, 1974; Ozenne, 1971). The familiar two by two
table (see Table 2.3) was used in conjunction with computation of

Chi-square values.

 

 

Table 2.3
Categories for a Given Item
Posttest
Fail Pass
Fail f1 (n1) f2 (n2)

Pretest
Pass f3 (n3) f4 (04)

 

16

First it is necessary to count the number in each category (f1, f2,
f3, f4--following the notation presented in Table 2.3). Secondly,

a "prototypic item" is defined by taking the median frequency of each
outcome category over all items. Finally, a comparison is made between
this prototypic item and the actual frequencies in the four categories
for each item. Large Chi-square values would suggest that the item

is considerably different than the typical item. One problem with the
technique is that the items in the test must be fairly homogeneous

to give meaningful results. A second problem is not knowing how

large the Chi-square values need to be for one to infer that the

item is atypical or bad.

Three other studies have proposed methods totally different
from the basic two-way table--Cox and Vargas approach. Kifer and
Bramble calibrated a criterion-referenced test using the Rasch model,
which is a latent trait model (Kifer and Bramble, 1974). They felt
that the_Rasch model could determine which items fit the model and
which items need revision. However, as in the Popham method, all
items need to be sampling one trait; if not, some items may not fit
but yet be good items. Item analysis was a subobjective of their
study. Their main emphasis was the desire to generalize about the
scores and obtain more precision concerning the extent to which a
score represents passing a criterion.

Bayesian techniques were applied to item analysis by
Helmstadter (1974). Three separate indices of item effectiveness are
defined in terms of probabilities. The first is the probability that

a subject knows the content given that the correct response was

17

selected. The probability that a subject does not know the content
given that the incorrect response was selected defines the second
index and the probability that a correct decision will be made about
the examinee's knowledge of the content given the results of perform-
ance of that item is the third.

For these indices, P indicates a correct response, P'an
incorrect response, K knowledge and K no knowledge. The first index
is denoted by P(KIP), the second by P(RTP) and the third by

P(correct decision) equal to P(KP or KP). Bayes' theorem then implies

that

_ P(PIK)P(K)
P(Kip) ' P(PIK)P(K) + P(PIR)P(T<’)

and
P(FIR)P(R) ,
P(KIP) = p(p'|K)P(R) + P(PTK)P(K)

 

Each of the subcomponents, such as P(PIK) were established on the
basis of the administration of an item. The probabilities P(KIP),
P(KIP) and P(correct decision) were then computed using these pieces
of information. There is still the same problem with these indices
of determining a cut-off value for the establishment of a knowledge
group and a no knowledge group. These indices can use pretest -
posttest data or a single administration.

Saupe was concerned with maximizing the reliability of
difference scores (Saupe, 1966). He suggested that items possessing

certain characteristics would make the maximum contribution to the

18

reliable measurement of change. According to his analysis, items
with the following characteristics should be considered as good items:

1. Items with high item-total score discrimination indices

for both initial and final administrations of the test.

2. Items with low item-total score discrimination indices

when the total score criterion is from the final adminis-
tration for items in the initial administration and from
the initial administration for items in the final admin-
istration.

3. Items with high correlations between initial administra-

tion item score and final administration item score
(Saupe, 1966, p. 224).

Saupe derived an index that could be used in the selection of
items to measure change. Items with high values of this index would
be selected and items with low values rejected. The index is based
on the correlation of the change in the item score with the change

in the total test score;

xX +g:yY ' er ' YyX

2J1-rny1-rXY

r

 

'“do

where x and y represent item scores and X and Y represent total test
scores.

Although Saupe was not directly concerned with criterion-
referenced tests, his work has some applicability to it. Obviously

items in a pretest-posttest situation are meant to measure change

 

19

and the index might have some usefulness in predicting those items
which are sensitive to change.

The third criterion, however, seems inconsistent with results
of criterion-referenced testing. This criterion specifies that an
item with a high correlation between initial item score and final
item score is a good item. This high correlation would be achieved
only if there is some variance on the pretest (not all individuals
fail) and some variance on the posttest (not all individuals pass).
In addition a high positive correlation is not obtained if an item
is failed by most on the pretest and passed by most on the posttest.
This is the situation desired in criterion-referenced testing. A
high correlation would not designate items sensitive to instruction.
Criterion two suggests that discrimination indices should be low
between item and total score using the opposite administration for
the criterion. Again these low discrimination values could be
obtained and yet the item might be a bad item. For example, a low
discrimination value could be obtained with almost all passing the
item at pretest and getting low scores on the posttest. A similar
situation would result with almost all failing the item on the post-
test and obtaining somewhat high scores on the pretest. These
results are not desirable in criterion-referenced testing. Items
exhibiting these characteristics might not be good items.

As with almost all of these techniques, care must be made to
include items that cover all the objectives. Relying on only statis-
tics to select items may result in the exclusion of some important

aspects that need to be tested. Nitko, when considering this

20

problem, suggested "that tests constructed from carefully defined
domains of items possess reasonably good psychometric properties
without prior statistical selection" (Nitko, 1971, p. 8). On the
other hand, Skager felt that "relying solely upon judgments as an
index of item quality ought to leave us just as uneasy in the case
of criterion-referenced tests as it should be for norm-referenced
instruments" (Skager, 1974, p. 53). One of his suggestions was the
use of item generation rules; although, he indicated that item
selection for criterion-referenced tests is still open to debate.

Hambleton, et al. (1975) also do not advocate the use of
empirical techniques exclusively. They feel that items selected
should be representative of the domain of items and the empirical
methods should be used to detect bad items.

Consideration should also be given to the impact of selecting
items that are sensitive to instruction according to some statistic.
If items are selected which are sensitive to instruction one might
argue that the items, over a number of administrations and revisions,
could become very easy or perhaps require only recall of simple facts.
Care must be taken to include items that measure all aspects of the
domain and to ensure that these items are not only sensitive to
instruction but sensitive to the domain.

Another approach similar to the C-V index was presented to
Roudabush at the 1973 American Educational Research Association
Annual Meeting. It is based in part on a procedure suggested by

Marks and N011 (1967). As it was pointed out earlier, Kosecoff and

21

Klein used a similar technique to develop the "correction for guessing"
for their External Sensitivity Index.

Roudabush's technique is based on the familiar two by two
table presented earlier as Table 2.3. Roudabush also makes two
assumptions. First, he assumes that there is some fixed non-zero
probability, p, that a student who does not know the answer to the
item will guess the correct answer. This p value is determined by
the item only and does not vary from student to student nor from
occasion to occasion for the same student. This fixed p value sug-
gests that there is no partial knowledge on the part of the student,
and that the student's responses are independent at pretest and
posttest when he does not know the correct answer and fails to learn
it.

Further, Roudabush assumes that the only possible result of
exposure to instruction between pretest and posttest is that the
student learns the correct response to an item. This then implies
that the non-zero frequency of f3 is solely due to guessing, further
implying that there is no forgetting. This suggests that the "true"
value of f3 is zero.

With these assumptions Roudabush derives a number which
serves as an index of the degree to which examinees select the cor-
rect response to the item as a function of the instruction received
between pretest and posttest. This number is called a sensitivity
index by Roudabush. It can be expressed as

- f

22

(The original notation was 5.) Further clarifications and derivations

are presented in Chapter III and in Appendix I.

Traditional Techniques

 

Traditional item analysis procedures also have been recom-
mended for use with criterion-referenced tests. Most individuals
have, however, suggested some modifications in the interpretation
of these traditional indices. One of the more detailed procedures
is outlined by Brennan and Stolurow (1971).

Their procedure combines traditional item analysis techniques
with a set of decision rules. Brennan and Stolurow compute four
error rates and two discrimination indices from pretest, posttest
and retention test data. The decision rules are then applied to
determine the adequacy of the item and of the instruction. The
decision rules are similar in context to the first criterion of a
good item suggested by Saupe. Further clarifications of this tech-
nique are presented in Chapter IV and Appendix II. Their procedure
is very complicated and laborious and for this reason, perhaps, has
not been investigated further.

Other individuals have also recommended the use of traditional
indices. Hsu recommends the use of the phi-coefficient with Right
versus Wrong for a given item being one dimension and Mastery versus
Nonmastery the other (Hsu, 1971). For this procedure, a cut-off
score for each behavior must be established in order to declare a
mastery and a nonmastery group. There are other limitations besides

the problem of establishing a cut-off score. The phi-coefficient

23

cannot be used when the item is answered correctly or incorrectly by
all or when all subjects are declared masters or nonmasters. Hsu
then recommends the use of his upper-lower difference statistic,
defined as the difference in proportions of those responding cor-
rectly in the mastery and nonmastery groups, or the point-biserial
correlation coefficient. Hsu's upper-lower difference statistic was
discussed in the previous section.

Hambleton and Gorth (1971) also suggest using traditional
item analysis procedures. Items associated with the same objective
should have approximately the same value for item difficulty. Items
that are different should be modified and tested again. In addition
item discrimination indices can be used. Negative indices would
indicate a need for revision in the item, instructional materials,
and/or teaching. Positive discrimination indices, according to
Hambleton and Gorth, more than likely indicate a shortcoming in the
instructional program. Items with zero discrimination may be
acceptable. Popham and Husek recommended the same interpretations
of discrimination indices in 1969 (Popham and Husek, 1969).

If the traditional methods and the interpretations suggested
by Hambleton and Gorth and Popham and Husek are used, then the
information that is obtained seems to be ambiguous and no definite
decision can be made about the item. However, Brennan and Stolurow
took these bits of information with other information and a set of
rules and have developed a useful guide for item selection for

criterion-referenced tests.

24

Item characteristic curves, another traditional item analysis
technique, can also be used for criterion-referenced tests (Hambleton
and Gorth, 1971). The parameters (difficulty and discrimination)
of the curves supposedly do not change from group to group. This
implies that the parameters could be predicted from the pretest
administration. An obvious disadvantage in using item characteristic
curves would be in the construction and the interpretation of them.
This procedure would not be one of the easiest to use or understand.

Harris also suggests traditional item analysis techniques
for criterion-referenced tests. However, the test should be used
with a sample from a population of instructed students and a sample
from a population of uninstructed students. Item difficulties for
items for a given objective should be equal within each of the two
groups; however, item difficulties should differ between the two
groups (Harris, 1974). Woodson's position is very similar to Harris'
position. Woodson argues that the item needs to be tested in the
proper population. He feels that "items and tests must be evaluated
for the range of the characteristic for which they will be used“ and
if the items and tests give no variability in this population of
observation, then the items and/or tests give no information and are
not useful (Woodson, 1974, p. 64).

Both of these suggestions are considered when pretest and
posttest data are used. The pretest group is generally considered
the uninstructed group and the posttest group the instructed group.
The B-S decision process includes a comparison of the pretest and

posttest item difficulties and the C-V index and R index are

25

comparisons of the pretest and posttest difficulties. Since most'
of the other proposed item analysis techniques also consider pretest
and posttest data, the Harris and Woodson suggestion of testing the

item in a proper population are taken into account.

Summary

The various techniques that have been proposed fall into
essentially two categories. One category of techniques contain the
C-V technique and its variations (Brennan, 1972; Crehan, 1974;
Edmonston, Randall and Oakland, 1972; Hsu, 1971; Ivens, 1970, 1972;
Kosecoff and Klein, 1974). The other category contains item analysis
procedures generally used for norm-referenced tests with possible
alternative interpretations. As is discussed above, these new
meanings for old statistics sometimes result in a technique or pro-
cedure which is similar to the C-V procedure. Every new technique
seems to have as its main purpose, selecting items that are sensitive
to instruction. However, there is a need to be alert to the negative
implications of selecting items sensitive to instruction. Most indi-
viduals recommend using item statistics in conjunction with a review
of the domain or objectives and close scrutiny of the instruction.
This aspect will be discussed more thoroughly in the final chapter.

Review of the proposed techniques has shown that the C-V
index or modifications of the C-V index have been recommended more
frequently than any other procedure as an appropriate item analysis
technique for criterion-referenced tests. The R technique is a

refinement of the C-V technique and, as it will be shown in the

26

following chapter, makes fewer assumptions than the C-V index.
Therefore, the R index may provide a better estimate of an item's
sensitivity to instruction than the C-V index.

The B-S procedure combines the best of traditional methods
in an attempt to select good items for criterion-referenced tests.
All three of these procedures may be considered useful in selecting
items that are sensitive to instruction. Most of the remaining pro-
cedures are latent trait models. While these are useful they fail
to meet the criterion of computational ease which is important in

most of the situations where criterion-referenced tests are used.

Comparing Techniques

Several studies have been done to compare new item statistics
to old item statistics. Crehan (1974) compared six item analysis
techniques using a pool of items constructed by teachers. The pro-
cedures he compared were the C-V, a modified Brennan, a teacher
rating, a point-biseral correlation between item score and total test
score in the posttest situation, a random ranking, and an index
which was defined as the proportion of consistent responses on
logically parallel items.

Crehan used the concepts of reliability and validity to
compare tests composed of items selected by each of the six tech-
niques. Reliability was estimated by (a + c)/N where N = a + b + c
+ d and a, b, c, d are defined in Table 2.4 below.

Validity was estimated by (a + c)/N where N = a + b + c + d

and a, b, c, d are defined differently in Table 2.5 below.

27

 

 

 

 

 

Table 2.4
Categories of Performance (Reliability-
Crehan)
Form 8
Pass Fail
Pass b a
Form A
Fail c d
Table 2.5
Categories of Performance (Validity-
Crehan)
Uninstructed Group Instructed Group
Pass b a
Fail c d

 

In addition validity was estimated by the point-biserial cor-
relation between test score and a dummy variable representing group
membership (instructed group and uninstructed group). The instructed
group was a posttest only group and the uninstructed group was a
pretest group.

The results of his study suggested that the modified Brennan
and C-V methods produced tests with higher test validity. However,
the different item selection methods seemed to have no effect on test

reliability.

28

In order to generalize from the results of this study, the
definitions of reliability and validity employed by Crehan must be
accepted as reasonable. Both definitions are rationally appealing
if not theoretically appealing. Reliability could also have been
estimated with a phi-coefficient. But with either method the deter-
mination of cut-offs is arbitrary and the estimates can increase or
decrease with shifts of the cut-offs. Validity could also have been
estimated with a phi-coefficient. The same problem exists, however,
with determination of cut-offs and assignment to pass or fail groups.
The point-biserial, which was also used to estimate validity, does
not have the problem of determination of cut-offs.

Two groups of individuals were included in the sample. One
group was used to compute item statistics, develop tests and set
passing points. The other group was used to determine reliability
and validity. The process was reversed and reliability and validity
estimates obtained from both groups were averaged. This is unfortunate
since it seems reasonable to think of one group as the cross-
validation sample. The obtained reliability and validity estimates
from both groups could then have been compared and inconsistencies
located.

Item statistics were not compared across samples of individuals,
even though those data were available. Questions such as how did
the item values fluctuate across samples and across subject areas
were not considered in this study.

The only conclusion that we can draw from this study is that

if the C-V or modified Brennan techniques for selection of items for

29

criterion-referenced tests are used, the validity, as defined by
Crehan, might be better than if some other technique for selection
were used.

Several other individuals have also compared the C-V index
to alternative methods (Cox and Vargas, 1966; Haladyna, 1974; Haladyna
and Roid, 1976; Hambleton and Gorth, 1971; Hsu, 1971; Ivens, 1970,
1972; Kosecoff and Klein, 1974). It is interesting to note that of
the 11 studies that are reported here which compare criterion-
referenced item analysis techniques, eight include the C-V method.
This index has to be appealing because of the ease of computation.

In addition it seems to fare extremely well in the comparisons with
other techniques.

Cox and Vargas (1966) and Hambleton and Gorth (1971) con-
cluded that the C-V index produces results different enough from
traditional methods to warrant the consideration of this alternative
technique for criterion-referenced test construction. Cox and Vargas
compared 0 to C-V and Hambleton and Gorth compared C-V to the biserial
correlation and a modified C-V. The modified C-V was defined as the
difference between the proportion of individuals who correctly
answered an item on the delayed posttest and the proportion of indi-
viduals who correctly answered the same item on the pretest, C-V'.
While Hambleton and Gorth found no relationship between C-V and C-V'
with the biserial, Cox and Vargas did find significant Spearman rank
order correlations between the rank on C—V and the rank on D.

Haladyna, on the other hand, concluded from his study that a

point-biserial discrimination index computed on the combined test

30

results of pre and post-instruction examinees is better than C-V.

His conclusion is based on the result of his analysis which indicated
that the two statistics give identical information and the point-
biserial requires a one-step analysis and the C-V requires a two-step
analysis. His argument that the point-biserial is a one-step process
is based on the availability of computer programs to compute the cor-
relations. For a classroom teacher C-V has the advantage of being
easy to compute as well as "conceptually satisfying" (Haladyna, 1974,
p. 98).

Hsu investigated the relationship of a modified C-V (C-V")
with rpbi and the phi-coefficient using various samples of individuals
(Hsu, 1971). The index C-V" is defined as the difference in propor-
tions of those responding correctly in a mastery and nonmastery group.
The mastery and nonmastery groups are established by a predetermined
cut-off score. The samples varied with respect to the ability dimen-
sion and test score distribution. The results indicated that the
., and the phi-coefficient depends on the

pb1
ability dimension and the test score distribution. When the sample

relationship of C-V”, r

consists of individuals with a wide variety of abilities and the test
scores are distributed symmetrically the indices are highly correlated.
Hsu found that a highly discriminating item in one sample
may not be a highly discriminating item in another; therefore, he
recommended that test items not be tried out in a group with a wide
variety of abilities. Items selected on the basis of performance of
this group may not be measuring the same kind of performance in a

second more homogeneous group.

31

Ivens also investigated the C-V index (Ivens, 1970, 1972).
He found that by choosing items with larger values for C-V for one
test and lower values of C-V for a second test, there were marked
differences in the quality of the tests. To measure the quality of
the test, Ivens considered reliability and validity. He used tradi-
tional reliability estimates as well as unique reliability and
validity estimates. All statistics computed supported the fact that
tests composed of items with higher C-V values were better tests. It
should be pointed out that the unique reliability and validity
estimates were somewhat related to C-V. For this reason, higher
reliability and validity estimates for tests constructed from items
with high C-V values would be expected.

The C—V index was again compared to other indices by Kosecoff
and Klein (1974). They redefined C-V as ESI and compared this to
their 151, the phi-coefficient and the point-biserial. (ESI and 151
are defined in an earlier section of this chapter.) The results of
this study showed that ESI was generally lower than 151. The values
of ISI tended to parallel the values of the point—biserial and phi-
coefficient. Of course, the corrected version of ESI resulted in
lower values.

After consideration of the data, Kosecoff and Klein deter-
mined that there had been too many masters at the pretest. To
compensate for this, ESI and 151 were redefined. ESI was defined

as n2 ' n1 (Table 2.3 and Table 2.2 notation, respectively). They
nl+"2
concluded from the results of the analysis with the redefined

32

statistics that 151 is sensitive to instruction. The high proportion
of prior masters caused the index in the first analysis to be arti-
ficially deflated. ESI was found to be an unsatisfactory statistic
because the values tended to vary greatly. The values for ESI did
correlate significantly with the phi-coefficient and point-biserial
values but the correlation coefficients were rather small implying,
perhaps, that ESI would not give the same judgment as traditional
statistics. Almost all the research that has considered the C-V
index (or the ESI) has produced this same result.

Interest in the C-V index remains high as indicated in a
recent comparative study conducted by Haladyna and Roid (1976).
They compared various Rasch statistics, traditional statistics, the
Bayesian indices proposed by Helmstadter (1974), and the C-V index
for a total of 17 indices. The results of the study demonstrated a
high degree of relationship among four item discrimination measures.
These were the z-difference--a Rasch statistic which is an index of
the difference of difficulties of pretest and posttest samples, a
combined samples point-biserial, the C-V index and a Bayesian index
--the probability of having knowledge given that the student gets
the item correct. This study provides further evidence that the C-V
index may be the most appropriate item index for pretest-posttest
situations.

Three comparative studies that did not include the C—V tech-
nique are Roudabush (1973), Helmstadter (1974), and Bernkopf (1976).
Roudabush and Helmstadter compared their own unique indices to tra-

ditional statistics. Unfortunately neither study mentioned exactly

33

which traditional statistics were being used. Roudabush concluded
that his sensitivity index provided different information than the
traditional statistics. Helmstadter, on the other hand, found that
the "classical discrimination index [he defined it no further than
this] comes closest to providing the same item assessment as would
the Bayesian probability of making a correct decision . . . "
(Helmstadter, 1974, p. 3). Haladyna and Roid (1976) confirmed
Helmstadter's result in their study. 0n the basis of the analysis,
Helmstadter also concluded that "items which are effective indicators
that the examinee does know the material are not necessarily the same
items which are effective indicators that the examinee does not know
the material" (Helmstadter, 1974, p. 3).

Bernkopf compared the point-biserial coefficient using total
test score as a criterion (rt), the phi-coefficient (0e), and a
second point-biserial coefficient using the total score on an essay
test as a criterion (re). The dimensions of the fourfold table
for the phi-coefficient were correct/incorrect for the item and
above/below mastery on an independent criterion (the essay test).
All three indices were significantly related. As could be expected

the correlations between the Be and re were higher than the correla-

tions between 9e and rt and re and rt.

Summary

The literature reviewed in this chapter has been divided into
two categories. The first group of studies reviewed, recommends

possible approaches for criterion-referenced item analysis (e.g.

34

Brennan, 1972; Brennan and Stolurow, 1971; Cox and Vargas, 1966;
Crehan, 1974; Hambleton and Gorth, 1971; Hsu, 1971; Ivens, 1970;
Kifer and Bramble, 1974; Kosecoff and Klein, 1974; Roudabush, 1973).
The second group of studies compares a number of proposed techniques
(e.g. Cox and Vargas, 1966; Crehan, 1974; Haladyna, 1974; Hambleton
and Gorth, 1971; Hsu, 1971; Ivens, 1970; Kosecoff and Klein, 1974).

Review of the proposed techniques reveals that the C-V index
or modifications of this index have been recommended more frequently
than any other procedure as an appropriate item analysis technique
for criterion-referenced tests. In addition, the majority of the
comparative studies included the C-V index along with more traditional
indices. The general conclusion is that tests constructed on the
basis of the C-V index result in tests sensitive to instruction
(Ivens, 1970, 1972; Ozenne, 1971). Another conclusion is that the
C-V index results in a different judgment for a given item than
traditional statistics (Cox and Vargas, 1966; Kosecoff and Klein,
1974).

Only two studies included more than one new index in their
comparisons (Crehan, 1974; Haladyna and Roid, 1976). The C-V index
is significantly related to other new approaches--a Rasch statistic
and an index recommended by Helmstadter (Haladyna and Roid, 1976) and
when used produces tests with higher validity (Crehan, 1974).

Two new approaches to criterion-referenced item analysis have
not been researched--one, the R index and two, the B-S procedure.

The R index is a refinement of the C-V technique. It makes fewer

assumptions and may be a better estimate of an item's sensitivity

35

to instruction. The B-S procedure combines traditional methods with
a set of rules to provide a guide for selecting items which are
sensitive to instruction. For these reasons, the C-V index, the
Roudabush sensitivity index (R) and the Brennan and Stolurow pro-
cedure (B-S) were selected for further investigation.

In the following chapter a theoretical basis for criterion-
referenced testing or pretest-posttest situations is provided. It
will be shown that the C-V index and R index can be explained in
terms of a general model; and, as indicated above, it will be shown
that the R index is a refinement of the C-V index which requires

fewer assumptions.

CHAPTER III

THEORETICAL DISCUSSION

In this chapter, a theoretical model for the pretest-posttest
situation is presented. Two item analysis techniques, R and C-V,
which were described earlier, are explained in terms of the general

model.

The results of a given item in any test can be represented

by the following diagram:

 

 

Table 3.1
Categories for a Given Item
ACTUAL
Does Not Know Knows
OBSERVED
Pass q21 q22

 

where q11, q2], q12 and q22 are conditional probabilities with qH
+ q21 ‘ ' and q12 l q22 = "

The probability that an individual who does not know the
answer to a given item will answer the item incorrectly is denoted

by ql]. The probability that an individual who does not know the

36

37

answer to the given item will answer the item correctly is denoted
by q2]. Similarly, q12 and q22 represent the probabilities that an
individual who knows the answer will fail or pass the item, respec-
tively.

Now consider a pretest-posttest situation. This can be
represented with three diagrams. Table 3.1 can be used to define the
pretest results and a similar table with different probabilities
(Table 3.2 below) can represent the posttest. These probabilities

are defined in the same manner as above.

Table 3.2
Categories for a Given Item

 

POSTTEST-ACTUAL

 

 

Does Not Know Knows

Fa" Q11. q12'

OBSERVED . '
Pass q21 q22

 

An additional 2 x 2 table (Table 3.3 below) defines the true propor-

tions of the pretest-posttest situation.

 

 

Table 3.3
True Proportions for a Given Item
POSTTEST
Does Not Know Knows
Does Not Know n1 n2

PRETEST
Knows n3 n4

 

38

In Table 3.3, ﬂ, is the proportion of individuals who do not
know the answer to a given item at both pretest and posttest. Sim-
ilarly, n2 is the proportion of individuals who do not know the answer
to a given item at pretest but learn it by the posttest. n3 is the
proportion of individuals who know the answer at pretest but not at
posttest; and n4 is the proportion who know the answer at both times.
These proportions, a], n2, n3, n4, sum to one. These are true pro-
portions. They are not the observed results of the pretest and
posttest.

The general model is then represented in matrix notation as

P = Q I Q' 1 where I symbolizes the Kronecker product,

and
"1
Q=(‘111 q12) Q.=<q11 q12) 1r= :2
q21 q22 q21 q:22 ‘ "2
and
p1
p
P = 2 or,
p4
p1 n]
p . .
p2 = (Q11 q12)::(‘111 q12> "2
3 , , n
p4 q21 q22 q21 q22 H:

The pk's, described in Table 3.4, are the observed propor-

tions given the probabilities qij and qij. and the true proportions

'll'k.

39

 

 

Table 3.4
Observed Proportions for a Given Item
POSTTEST
Fail Pass
Fail p1 p2
PRETEST
Pass p3 p4

 

Expanding the model,

p1 811811' I l q11q12' " + q12°11

1 ' 2 . "3 + q1zq12. 1'4
92 = q11q21' "1 + q11q22 "2 + q12q21 1T3 1 “12822. "4
p3 qz1q11' “1 T qz1q12' "2 + Q22q11' "3 T 922812. "4
p4 qz1q21' "1 + qz1q22' "2 + qzzq21' “3 + q22q22 "4

This model completely describes the results of a pretest-posttest
situation.

For example, consider p], the observed proportion of individ-
uals who fail both the pretest and the posttest. Each of the actual
proportions, n1, n2, n3, and n4 can contribute to the observed propor-
t'°“° 1" the m0de' p1 = q11q11"'1 l q11q12"'2 T q12q11"'3 + 812812'"4°
If we consider n], the proportion of individuals who do not know
the answer at pretest or posttest, we can observe that some of the
individuals in this category could have guessed correctly at either
the pretest (q21) or the posttest (q21') or at both the prestest and
the posttest. These individuals would not contribute to the observed
proportion p], since they would have passed the item at one or both

times. However, we can include q11 x q]]' x n] which is the

4o

proportion of individuals who really don't know and didn't learn and
failed to guess at either administration. Individuals who did learn
the correct response from pretest to posttest can also contribute to
p]. Those contributing would have failed to guess the correct response
at pretest (qll) and would have answered incorrectly at the posttest
(q12') even though they knew the correct response. Therefore,

qn x qu' x n2 adds to the observed proportion p]. In addition,
individuals who do know the answer at the pretest but don't know the
answer at posttest (n3) contribute to p]. Ordinarily, we would not
expect n3 to be a very large proportion. Individuals who can be
classified in this manner could have failed to respond correctly at
the pretest (q12) even though they knew the answer and could have
failed to guess the correct answer at the posttest (q11'). Finally,
individuals knowing the answer at both pretest and posttest could
have answered incorrectly at both administrations (q12 x q12' x n4).
Therefore, we can see, intuitively, that p1 is the sum of parts of
each of the proportions n1, n2, n3, and an. The observed proportions
p2, p3, and p4 can be explained in a similar manner.

It should be noted that n1, n2, n3, n4 are separated among
each of the observed proportions. If, for example, we add all the
parts of n], which are distributed over p], p2, p3 and p4, then
q11q11'"1 l q11q21'"1 I q21q11"'1 I q21q21"'1 5“°“'d equa' 1'1' Th'5

can easily be shown by factoring this expression:

q11(q11 I q21 )"1 + q21(q11 i q21 )"1 =
(91] + q2])(q11' + (121%1 = n] since

q11 1 q21 = ' a"d q11 + q21 = '°

41

It can also be shown that all the parts of n2, n3, and n4, which are
distributed over the observed proportions, p], p2, p3 and p4, do sum
to n2, n3, and n4, respectively.

There are 12 parameters in this model. If these parameters
could be estimated, useful information would be available for both
the item and the instruction. For example, if we, the proportion of
examinees who learn the answer, could be estimated, then an evalua-
tion of the quality of the instruction could be made. The estimate
of this proportion would also indicate the item's "sensitivity to
instruction."

Estimates of the other parameters would also provide useful
information. For an objective item, estimates of q]], qz], q]]' and
q2]' can be made after consideration of the number of response choices.
For example, a four-choice objective item would ordinarily lead to
an estimate of .25 for q2] or q21', because an individual who does
not know the answer has one chance out of four of choosing the cor-
rect response. It is also generally assumed that q22 and q22' equal
1.0, because it is very unlikely that an individual who knows the
answer will respond incorrectly. However, this may not be the case
for a poorly-written item. For example, a distractor for an item
may be also a correct response; or, the correct alternative could be
worded so ambiguously that even the individual who knows the answer
will not choose it. There is also the possibility that an individual
will make a clerical error.

13
about the quality of an item. A bad item would be one where qz]

Estimates of the qij's and q.."s do provide information

42

or q21' is high; that is, where the probability of guessing is high.
A good item would be one where q22 and q22. approach equal 1.0.
Suppose the parameters are considered in a slightly different
manner. One could perhaps use the concepts of reliability and validity
to describe these parameters. The nk's represent true values.
Estimates of indices defined by the nk's are estimates of the validity
of the item. For example, an estimate of NZ indicates how many or
what proportion of the individuals not knowing at the pretest know
at the posttest. The higher this value, or closer this number is to
1.0, the better the item is measuring what it is supposed to measure.
In other words, indices based on the wk's are indicators of validity.
In addition some of the qij's and qij"s can be considered
to be estimates of reliability. For example, if ql], q22’ qlll’ and
qzz' are close to 1.0 then the item is a perfect indicator of know-
ledge or no knowledge. As these probabilities decrease the item is
a less reliable indicator of knowledge or no knowledge.
Assumptions can be made to simplify this conceptualization.
In the general model Q does not necessarily equal Q'; different
probabilities are defined for the pretest and posttest. It is pos-
sible, however, that for any given item these probabilities would
be identical; that is, that neither time nor instruction would change
these item parameters. One could then assume that Q = Q'.
Roudabush simplifies the situation even further. First, he

assumes that n3 = 0. This implies that there is no forgetting; an

individual who knows an item at pretest will know it at posttest.

43

Second, Roudabush assumes q22 = q22| = 1.0, ignoring the possibility
that someone who knows the answer to an item could fail it.

Under these assumptions the model reduces to:

p1 "1
92 (Q11 0) » (Q11 0) '9
p = 5» o

3 q21 ' '“ q21 ' ﬂ
D4 4

But qn + q2] = l and n] + n2 + n4 = 1, so

2
P1 q11 "1

p2 _ q11 (' ' Q11) T'1 T q11"2

p3 - (' ' Q11) q11"1

p4 (' ' q11)Z"1 T (T ' q11)"2 T (' '"1 ' T'2)

These four equations correspond to equations (1) through (4) pre-

sented in Appendix I.
"2
111+Tl'

 

The sensitivity index is defined as R =
2

This is a reasonable sensitivity index, it is the proportion of indi-
viduals not knowing the answer at pretest who learn it by the post-
test.

Roudabush solves the four equations above using the assump-
tion that the expected observed proportions, p], p2, p3, p4 equal
the sample proportions, fl/N’ f2/N, f3/N, f4/N respectively and
obtains solutions for n] and Hz in terms of f1, f2, f3, and f4.
The f1, f2, f3 and f4 equal the observed numbers of individuals in

each category and N is the total number of individuals. These

44

solutions are then substituted in the definition of R and an estimate
1.2;: .
Unfortunately, the general model cannot be heuristically

of R is

solved since there are seven parameters (unknown) and only three
pieces of information. Therefore, we cannot estimate R without
Roudabush's assumptions. We can, however, compare the true R and
the estimated R for simulated data.

A second index, suggested by Cox and Vargas, can be considered
in the same theoretical framework. Cox and Vargas call their index
the Pretest-Posttest Difference Index (C-V). This is defined as the
percentage of students who pass the item at posttest minus the per-

centage of students who pass the item at pretest. In terms of

observed results, this is f2 + f4 ' f3 + f4 or I2_:_I3.,
N N N
The C-V method can be represented as a special case of the

general model by assuming that Q = Q', q22 = qzz' = 1.0, and q21 =

q21' = 0. Then,
p 11 p 1r
1 1 0 1 0 1T1 1 Tr1
ID2 = I 2 or p2 = 2
p3 O 1 0 1 1T3 p3 n3
D4 4 p4 "3

The C-V index can then be defined, using the notation of Table 3.3,
as (n2 + n4) - (n3 + n4) or C-V (true) = n2 - n3. This is identical
to the definition of C-V given by Cox and Vargas except they use the
observed proportions as estimates of the actual proportions. This

index indicates the sensitivity of the item to instruction. The

45

closer C-V (true) is to 1.0 the greater the sensitivity and the closer
it is to 0.0 the less the sensitivity.

If the equations above are solved heuristically for the true
proportions they are found to be equal to the observed proportions.
In other words, under these assumptions, the observed proportions
are equal to the true proportions. These assumptions, however, are
extremely restrictive; they do not even allow for guessing. In fact
the C-V approach assumes no misclassification, i.e., no error. C-V
is an estimate of C-V (true). Under certain restrictive assumptions
C-V would equal C-V (true). We can compare C-V (true) with C-V for
simulated data in order to observe the impact of less restrictive

assumptions on C-V.

Marx

In this chapter, a theoretical framework is proposed for
criterion-referenced testing in pretest-posttest situations. This
framework suggests that 12 parameters completely describe the pretest-
posttest situation. In addition the Roudabush (R) model and the Cox
and Vargas (C-V) technique are explained in terms of the general
model.

The design of the research is discussed in the following
chapter. The research is considered in two parts. In the first part
of the chapter, the design of the simulation study is presented. The
simulation study uses the theoretical framework proposed in this
chapter to consider the impact of various assumptions on the C-V and

the R indices. The design of the comparison of the C-V, R and B-S

46

techniques with actual data is presented in the second part of the

next chapter.

CHAPTER IV
DESIGN

Part A: Design of the Simulation
The purpose of this part of the study is to answer three of
the research questions posed in Chapter I. The questions that this
part of the study will be directed to are as follows:
1. Do the C-V and R techniques adequately estimate the true
values of the item parameters?
2. Does one technique estimate the true values better than
the other?
3. Do the C-V and R techniques estimate some true values
of the item parameters better than others?
One approach to answering these questions would be to gener-
ate hypothetical data with various item values. In other words, one
approach would be to design and implement a simulation.

Recall from the previous chapter that the theoretical model

is represented by P.= Q2! 0' 3, where

E:

are the observed proportions of individuals corresponding to the

true proportions,
47

=1=1=1
£3de

:1

(see Tables 4.1 and 4.2 below), gjsymbolizes the Kronecker product,

q11 q12 q11' qiz'
= d '= | | g 00' o... -
Q (q21 q22 an Q q21 q22 The Q13 5 and qIJ s repre
sent probabilities and are defined according to Tables 4.3 and 4.4.

 

 

 

 

 

Table 4.1
Categories for a Given Item--Observed
Proportions
Posttest
Fail Pass
Fail p1 p2
Pretest
Pass p3 p4
Table 4.2
Categories for a Given Item--True
Proportions
Posttest
Does Not Know Knows
Does not know n1 n2
Pretest

Knows n3 n4

 

49

 

 

 

 

 

Table 4.3
Pretest--Actua1
Does Not Know Knows
Fail q11 q12
OBSERVED
Pass qZ] q22
Table 4.4
Posttest--Actual
Does Not Know Knows
Fail q11' q12'
OBSERVED
Pass q21' q22'

 

When the model is expanded, P_can be represented by the following:

P = p' = qllqll "1 T qllqlz 1T2 T q12q11 "3 T q12q12"'4
._ p . ' . '
p2 q11q21 T'1 T 911922 "2 + 412421 43 + qnq22 n4
p3 qz'q“ “1 T ququ 1T2 T q22q11 "3 T q22q12 T'4
4 q21q21 "1 T q21822 T'2 T 922921 "3 + 422422 44
TI

The R procedure defines the sensitivity index to be “I "2 .

but for computation uses the sample proportions. Therefore, R is
P2-P3
P1+P2
tions. In addition the C-V index is defined as n2 - n3, but is again

 

computed by calculating where Pk are the sample propor-
computed using sample proportions and is p2 - p3.
If numerical values of "k’ qij and qij' are chosen, then the

expected observed pk can be computed. Random numbers can be generated

50

and then based on the values of pk the number of cases in categories
1, 2, 3, and 4 can be determined. (Categories 1, 2, 3, and 4 follow
the same pattern as the notation for the "k and pk.)

For example, suppose p1 = .1125, p2 = .5075, p3 = .0375 and
p4 = .3425. Suppose also that a random number is generated. This
random number is from a uniform distribution and is between 0.0 and
1.0. If it is less than .1125, then the number of cases in category
1 would increase by 1. If the random number is less than .6200 (.1125
+ .5075) but greater than or equal to .1125, then the number of cases
in category 2 would increase by 1. If the number is less than .6575
(.6200 + .0375) but greater than or equal to .6200, then the number
of cases in category 3 would increase by 1. And finally, if the
number is less than 1.00 but greater than or equal to .6575, then the
number of cases in category 4 would increase by 1. Any random number
generated would be counted in one and only one category. In this
manner, simulated frequencies for the fail-fail group (category 1),
fail—pass group (category 2), pass-fail group (category 3), and
pass-pass group (category 4) are obtained.

For this simulation sample sizes of 50 and 200 will be con-
sidered. The sample size of 50 was selected because in most actual
situations, 50 is the maximum number of individuals available. Some
parameter values will be repeated in the simulation with a sample
size of 200 in order to consider the stability of the indices.

For each set of parameter values 1000 samples will be gener-
ated. For each sample, the R and C-V indices will be computed. Of

course, the true values remain the same for all 1000 cases. A number

51

of descriptive statistics will be computed based on the 1000 samples.
These will include the means and the variances for the R and C-V
indices and the largest and smallest values for each. In addition,
skewness and kurtosis will be computed for each. The simulation is
designed to consider a range of parameter values in order to see how
close the estimate of the R and C-V indices are to the actual values.'

Consider Tables 4.2, 4.3 and 4.4. The probability that an
individual knows the answer yet fails to answer the item correctly,
q12 or q12', is probably quite small. Since q12 + q22 = l and
q12' + qzz' = 1, this assumption would imply that q22 or q22' is
large. In addition, the probability that an individual can guess
the right answer (q21 or q21') can be estimated by the number of
options offered in the item. For example, a good estimate of q2]
for a true-false item would be .50. For a multiple-choice item with
four options a good estimate would be .25. The probability (q21')
that the correct answer could be guessed given some instruction may
stay the same as qZ] or it may decrease or increase. A11 possibil-
ities were considered in the selection of the values of q21'.

Table 4.5 lists the 21 different sets of parameter values
that were selected for the simulation. Sixteen sets designate the
probability of guessing (q21) to be .25 (four--option multiple-choice
item). Eight of these retain this estimate for the posttest
(q2]' = .25). Seven of these sets increase the probability of guessing
for the posttest to .50 (q2]' = .50). This makes the logical assump-
tion that instruction may improve the individual's chances of guessing

the correct answer by eliminating two of the possible options. For

52

 

 

N. F. O. N. O.O O.O ON. ON. O. O. ON. ON. OON _N
N. F. O. N. OO. O_. ON. ON. OO. O_. ON. ON. OON ON
N. O. O. N. OO. O_. OO. OO. OO. ON. ON. ON. OON O_
N. P. O. N. O.O O.O O.O O.O O._ O.O O.O O._ OON O_
N. N. O. N. O._ O.O OO. OO. OO. OF. ON. ON. OON N.
N. O. O. O. O._ O.O O.O O._ O._ O.O O.O O.O OON O.
N. O. O. O. O.O O.O ON. ON. O._ O.O ON. ON. OON O_
N. _. O. N. O.N O.O ON. ON. O.. O.O ON. ON. OO O_
N. _. O. N. OO. OP. ON. ON. OO. O.. ON. ON. OO OF
N. F. O. N. O.O O.O O.O O.O OO. O_. ON. ON. OO N_
N. _. O. N. OO. OP. OO. OO. OO. O_. ON. ON. OO P_
N. O. O. N. O.O O.O O.O O.. O._ O.O O.O O._ OO OF
P. O. O. _. O._ O.O ON. ON. O.O O.O ON. ON. OO O

OO. OO. O. _. O.O O.O OO. OO. OO. OP. ON. ON. OO O
_. O. O. O. O._ O.O ON. ON. O._ O.O ON. ON. OO N

OO. OO. _. O. O.O O.O OO. OO. OO. O_. ON. ON. OO O
N. _. O. N. O.O O.O OO. OO. OO. O_. ON. ON. OO O
N. O. O. O. O.O O.O OO. OO. O.O O.O OO. OO. OO O
N. O. O. O. O._ O.O O.O O.O O._ O.O O.O O.O OO O
N. O. O. O. O.O O.O ON. ON. O.O O.O ON. ON. OO N
N. O. O. O. O._ O.O OO. OO. OO. O_. ON. ON. OO O
OO OF NO _O .NNO .N_O __NO ._FO NNO N_O _NO __O z Ommmmz

empwEOeOa

 

:oOaOOOEOm asp com Om:_m> emumEOeOO umpumpmm
m.¢ mpnmh

53

one set, the value of q21' is set equal to zero, implying that after
instruction the individual has no chance of guessing the correct
response.

One set designates the probability of guessing (q21) to be
.50 (true-false item). This estimate is retained for the posttest
(q2]' = .50). The remaining four sets satisfy the assumptions of
the C-V index. These assumptions include assuming the probability
of guessing is 0.0 for pretest and posttest (q2] = q2]' = 0.0) and
assuming the probability of getting the item right when knowing the
answer is 1.0 for pretest and posttest (q22 = q22. = 1.0).

Based on the assumption that the probability that an individ-
ual who knows the answer yet fails to answer the item correctly
(q12 or qlz') is quite small, q]2 was designated to be 0.0 11 times
and .10 the remaining ten times. The value for q12. (posttest
probability) was set at 0.0 for all but four parameter sets. For
these, the value of q12' remained equal to q12 which had been set
at .10.

The values of n], "2’ n3, n4 were selected to represent
reasonable situations. Two basic sets of values were chosen with n1,
n2, n3, n4 equal to .3, .5, 0.0, .2, and .2, .5, .l, .2 respectively.
Four sets of values were selected to consider the impact of extreme
values on the indices. These sets (6, 7, 8, 9 of Table 4.5) con-
sidered the possibility that the majority of individuals would fail
the pretest and pass the posttest (8 and 9).

As previously stated, for each set of parameter values 1000

samples will be generated. For these 1000 samples, the C—V and R

54

indices will be computed. Means, variances, highest and lowest
values, skewness and kurtosis will also be determined for the C-V
and R values.

In an attempt to answer the research questions, the data
will be considered in a number of ways. All descriptive statistics--
menas, variances, skewness and kurtosis--will be considered for each
of the 21 parameter sets for both indices. Means of each of the C-V
and R values will be compared to their true values and variances of
these indices will also be considered in an attempt to answer the
question of adequacy. For any given parameter set a mean value close
to the true value in conjunction with low variance would imply some
degree of adequacy.

The second question, “Does one technique estimate the true
values better than the other?", will also be answered by considera-
tion of the data. One approach will be to consider how close the
values are to the true value for each set of parameters for each
technique. The variance, skewness and kurtosis values will also be
considered. A comparison of the correlation coefficients between
the true values and the means for each technique might show whether
or not one technique estimates the true values better than the other
technique. However, some caution will be used in the interpretation
of the correlation coefficients and the comparison.

The final question will also be handled descriptively. The
actual data will be considered and an attempt will be made to locate

values that are not estimated as well as others.

55

All questions will be considered descriptively. Each
parameter set with results will be discussed with respect to the
three basic research questions. Summary statistics will be presented
in order to facilitate the understanding of the techniques and the
conclusions reached about them.

Part 8: Design of the Comparison Study
With Actual Data

 

The purpose of this part of the study is to determine the
comparability of three item analysis procedures (C-V, R and B-S).
Data were obtained from the Michigan Middle Cities Project. One
hundred twenty-eight items were chosen from two subject areas, Reading
and Mathematics. Two levels were considered--Middle and Upper.
(These levels generally refer to grades three and four, and five and
six respectively.)

Each item was written for a particular objective. Each
objective was tested by four items on a pretest, four different items
on a posttest and all eight items on a retention test. The retention
test was given approximately 40 days after the posttest.

There were also two treatment groups where item data were
collected. In one treatment, teachers were assigned objectives
(treatment A). In the other treatment, teachers were allowed to
choose objectives (treatment B). Sixteen objectives were chosen to
complete the design which is represented in Diagram 4.1.

The major question to be considered is "Do the C-V, R and B-5
item analysis procedures provide comparable results?" The analysis

of this question will primarily be descriptive. To determine the

56

 

Objective

 

 

 

 

 

 

 

 

 

Subject Level Treatment Number N Items
142 31 1-8
A
116 59 9-16
Middle
112 21 17-24
B
120 20 25-32
Reading
145 66 33-40
A
199 57 41-48
Upper
B 182 30 49-56
166 18 57—64
108 52 65-72
A
111 43 73-80
Middle
107 42 81-88
B
109 37 89-96
Mathematics
198 22 97-104
A
176 46 105-112
Upper
B 187 16 113-120
167 17 121-128
Diagram 4.1

Design of Administration of Items

57

comparability, of the indices C-V and R, a Pearson product moment
correlation will be computed between the C-V and R values.

The B-S procedure (see Appendix 11) does not allow for a
single index. The B-S procedure involves the computation of four
error rates. TER (theoretical error rate) is defined as (J-l)/J
where J is the number of possible answers to an item--or it is simply
the expected proportion of students answering a pretest item incor-
rectly. The Base Error Rate (BER) is the observed proportion of
students answering a pretest item incorrectly. The Posttest Error
Rate (PER) is the observed proportion of students answering a post-
test item incorrectly. In this situation the data used as the post-
test data will be from the retention test. The Instructional Error
Rate (IER) is the proportion of students answering incorrectly on a
terminal test item which is administered to students who have been
exposed to instruction. This last error rate is not included in any
of the decision rules related to item revision.

In addition two discrimination indices are computed, the Base
Discrimination Index (801) and the Posttest Discrimination Index
(POI). These are computed using the total score on the appropriate
test as the criterion. For 801, the criterion will be the pretest
and for POI, the criterion will be the posttest. Again in this
situation the data used will be from the retention test. Two separ-
ate statistics will be used to compute the discrimination indices,
the phi-coefficient and the 8 index. The 8 index equals B/(B + D) -
A/(A + C) where A, B, C and D are defined in Table 4.6.

58

 

 

 

 

Table 4.6
B--Index
Total Test Score
Nonmastery Mastery
1 A 8
Item Score
0 C D

 

Mastery for the items on the pretest and retention test data was set
at three out of four items.

These five pieces of information, TER, BER, PER, 801 and PDI
are then used in conjunction with some rules to determine the adequacy
of the item. Appendix II provides a description of these rules.

Since the B-S procedure does include several statistics, the
comparison of the three indices will be done in the following manner.
First the individual statistics, TER, BER, PER, 801, and PDI, which
are necessary for the B-S procedure, will be computed. The appro-
priate rules will be applied and a decision will be made about the
quality of the item; that is, does the item need to be revised? Each
item can then be assigned a "O" or a "1" depending on the outcome of
the application of the rule. A "0" would indicate non-acceptance or
revision required; a "1“ would indicate acceptance or no revision
I required.

There is a limitation in this procedure. The B-S process
requires that the evaluator set various cut-off points. For example,

the evaluator must decide an appropriate cut-off point between a high

59

and low error rate. Comparison of B-S with the R and C-V indices
will be influenced by the selected cut-off values. To minimize the
effect of this limitation various cut-off values will be set and
several comparisons will be made. Point-biserial correlations will
be computed between the B-S value of "O" or "1" and the C-V value
and the R value.

There is also a limitation with the data used in the computa-
tion of the various indices. Retention data are substituted for
the posttest data generally used in the computation of C-V and R
indices. There may be some additional forgetting not normally found
in a more immediately given posttest. However, since both R and C-V
are computed using the same data, the effect on the comparison of the
two should be minimal. The observed frequency of f3 (those who
forget from pretest to retention test) might appear to influence C-V

more than R since this frequency is included in the denominator of

f2 _ f3 .
> . R only 1ncludes
f] + f2 + f3 + f4

) . However, since the only difference

 

C-V as well as the numerator <
f3 in the numerator (1:24:32
1‘1 + 1‘2
in the two indices is the addition of f3 and f4 in the denominator of
‘ C-V, and if f3 gets larger due to the longer time frame, then f1, f2
and/or f4 would get smaller. Since f1 + f2 are the same in both and
the total f1 + f2 + f3 + f4 is N, a constant, then the effect on

both indices should make little difference in the comparison of the
two. This same argument holds for the impact of a decrease or
increase in f4 on the comparison of the two indices. A similar argu-

ment can be made for the individual statistics of the 8-5 procedure.

60

Two additional questions that are of interest are as follows:

Are the three procedures more comparable for items in Mathe-
matics than for items in Reading?; and,

Does the comparability of the three procedures depend on the
grade level?

It is anticipated that the procedures will be more comparable
for Mathematics than for Reading. Items for Mathematics are con-
structed more easily than Reading items because the subject area is
more structured. In addition the items are generally of higher
quality. The Reading items may be more ambiguous than the Mathematics
items. It is also anticipated that the correlations among the indices
would be almost identical for items given in the Upper grades and
for items given in the Middle grades. There is no reason to expect
the correlations to depend on grade level.

Each of the two questions will be analyzed in two steps. A
comparison of the C-V and R values will be considered separately
then a comparison of the B-S with the C-V and R will be made.

The first question can be analyzed in the following manner.
First, a Pearson product moment correlation will be computed between
the C-V and R values for items in Mathematics and separately for
items in Reading. Then a comparison can be made between these two
correlation coefficients. The null hypothesis can be expressed as
Ho: pR = pM with the alternative hypothesis being H1: DR f 0M,
where DM = the population correlation of C-V and R for Mathematics

and pR = the population correlation of C-V and R for Reading. A

61

Fisher's z- transformation will be made for each of the sample cor-
relations and a z-test will be applied.

Secondly, the point-biserial correlation will be computed
between the B-S and C-V values for Reading and Mathematics and
between the B-S and R values for Reading and Mathematics.

The second question will be considered in the same manner
only correlations will be computed between the various indices for
the two grade levels separately.

A final question which may be considered is "Are the three
procedures more comparable for items given in treatment A than for
items given in treatment B?" The analysis of this question is
similar to the analyses proposed above. The prediction for the cor-
relations among the indices is that they will be higher for treatment
B than for treatment A. This is due primarily to the fact that
teachers were assigned objectives in treatment A. Instruction may
not have been needed for these specific objectives or may not have
been given adequately, so the item response data for treatment A
may be unstable. Items from treatment B should more closely fit the
ideal criterion—referenced situation, i.e., no knowledge on pretest

and knowledge on posttest.

Summar
There are two parts to this study. Each part is designed to
answer different questions. The first part, the simulation, will
attempt to determine if the C-V and R indices adequately estimate

the true values, if one technique estimates the true values better

62

than the other and if for some parameter sets the C-V and R indices
are better estimators of the true values. Data will be analyzed
descriptively for the 21 sets of parameter values used in the simula-
tion. Additional questions, such as the stability of estimates for
different sample sizes, will also be considered in the analysis of
the data.

The second part of the study is designed to determine compar-
ability of the three item analysis procedures, R, C-V, and B-S.
C-V and R values will be computed for 128 items and the B-S values
will be computed on 64 of the 128 items. The relationships among
the indices will be determined using correlation coefficients.
Additional questions pertaining to the comparability of the indices
with respect to subject matter, grade and treatment also will be

considered.

CHAPTER V

RESULTS OF THE SIMULATION

The simulation was designed to answer three questions:

1. Do the C-V and R techniques adequately estimate the true
values of the item parameters?

2. Does one technique estimate the true values better than
the other?

3. Do the C-V and R techniques estimate some true values
of the item parameters better than others?

The results of the simulation for the 21 sets of parameter

values (see Table 4.5) are presented in Table 5.1.

The C-V Index: Adequacy and Stability_

Assumptions Met

Consider the statistics of parameter sets 3, 10, 16 and 18
(see Table 5.2). For these parameter sets, the assumptions for the
C-V index are met. Recall that the assumptions for C-V include no
guessing (q2] = q2]' = 0.0) and an individual who knows the answer
will not fail to answer correctly (q22 = qzz' = 1.0). (The parameter
sets 16 and 18 are identical to 3 and 10 respectively except N = 200.)

63

64

 

 

cc. on mn—. omcm. on mamm. omoo. ~0¢~.1 sumo.- mNo_.- mmvo. emoo. _~oo. oeoopo. ovoo. .epﬁmo. mmmw. owom. 000v. nvps. .—~
Nv. 0» ——. mwmo. on ~Ncw. umpo.1 ~v-.- m__~.- meo.- pw¢o. mmoo. mmoo. cwoopo. “coo. pom—co. mmmw. smoe. coco. n¢—s. .ON
at. ca mc—. pawn. 0» mmOm. ——~p.u ~omm.- mm—o. mva. Fmvo. ammo. ~Noo. woecoo. omoo. swampo. Nmmm. mmom. ooov. mops. .m—
wm. cu mNN. nm—s. 0» _~mm. mm~F.- ~m_m.c _Noo. vmmo. move. Nmmo. Nmoo. mooooo. «moo. -oo~o. mpov. wwsm. oooc. nv—s. .op
cm. a» «N. mmum. cu scam. mooo. va~.1 mvmp.u «NMN. ovvo. memo. m_oo. -oooo. omoo. nopooo. “vac. Nmmu. ooov. megs. .5—
m—o. 0» mam. moms. ou mwm. coco.- Nvmo. ~m_~.u ommv.u some. ommo. m—oo. oooooo. mpoo. —ooooo. moom. mowo. oo0m. om~o. .0—
mom. on mmm. ass. on ¢noc. cemo.1 wovP.1 ~mo~.u mwpo.u coco. mmco. m—oo. Nemmpo. «woo. oooooo. moum. ~¢~o. oo0m. ammo. .mp
om. 0» No. pno. cu “moo. Nooo. ccmc.1 mwop.a c~N_. ammo. manp. omoo. Ncnooo. —~Po. ”upwmo. m—om. cmom. oooe. nv—s. .0—
No. an no.1 ppom. ou ~.- coco. wvo¢.u ”moo.- wNOm. woao. Nomp. coco. o—Nmpo. coFO. -m¢oo. v_0~. came. ooov. ncps. .m—
Nm. 0» co.- ocms. 0» —.1 mmmo. momm.- ovo~.u oooo. —voo. ~NM—. wwoo. mco—No. capo. Nomoso. mnmN. pave. oooe. mops. .m—
om. 0» No. a. cu mmmo. mao~.- emm~.u popo.u .o—~.p come. vowp. mmoo. «Nance. oo_o. ~¢n~—c. comm. Npow. ooov. nvps. .——
n. ca .0. «mom. 0» mo. —mmo.- mooc.- sumo. Nmmm. ammo. omo_. mmoo. mpoooo. mopo. -momo. Nvoo. oesm. ooov. mops. .o—
Na. 0» cm. o.— cu m. ~o¢—.u Noom.- moo”. ono. —mno. osmo. peoo. m~oooo. Mmoo. wooooo. mooo. esww. ooom. mama. .o
cw. ow on. o.— 0» mono. Nmo~.1 owmm.u omoo.1 wmmm. _u~o. memo. mmoo. eemmpo. Nmoo. mooooo. wopo. ppmm. comm. omwm. .w
on. a» m—.1 come. 0» m.- Oops. vaN.1 oo—_.- empo.- ammo. m_N—. «Koo. ocmooo. Nv—o. mwoooo. mono. poo—. coop. pppp. .n
Nm. 0» co.u coco. 0» omoo.u ~_N—.1 mmwv.- coo—.1 ocmﬁ. memo. Nmpp. mwoo. monco. mm_o. wommoo. Nomm. onsm. Como. ————. .o
co. ea c—. Noo¢. ou emom. omvo.- N_ec.1 NFc—.1 Noom. MNwo. voo—. omoo. Nmoooo. —o—o. appooo. mace. was. ooov. nv—n. .m
m. o» No.1 o.— o» ao~o.- mm_—.u —mom.1 oNoo.- o~o~._ mmwo. mm¢_. oNoo. Nm—moo. Fwwo. oewooo. Knew. mmoo. 000m. cmwo. .v
Nu. cu mm. mm. cu mo—v. ammo. mmoF. mwm—.- mmom.1 some. mono. oeoo. oooooo. omoo. vooooo. moov. pmwm. ooom. omwo. .M
No. on —. mNow. cu ~mmp. mac—.1 momc.1 mo—F.- Npmm. memo. mmop. mNoo. ommm_o. ~F_o. spoooo. comm. mowo. 000m. ammo. .N
as. 0» o—. o.— 0» muqm. mac—.1 mm¢¢.- NNoo.- mmmv. Nmmo. ammo. oNoo. mnmooo. vNoo. wmooro. NNOe. mmmm. ooom. omNo. .p
AOOOmV
>-O AOOOOV
>uu a >-u a >1u >10 emenzcm a cmemncm >19 a >1u a uom
>1u omega a «menu Omocxmxm Omwczmxm ONOOHL3x O_O0uN:x cm m cm NO> cowumw>mo .a Lo> cowum_>mo cum: com: magh wagh NoquoLcn
muspomaq manpoOa<

 

umm NmuoEONOu zoom Noe OOOOOOOoOm m>_OQOLoOmo

p.m m—nch

65

 

 

ooom. op oONN. mmm_.1 FNoo. Nwoo. opoo. o_o¢. oooo. .op
ompo. op omom. oooo.1 ~o_~.1 mFoo. mooo. moom. ooom. .op
oooN. on ooeo. Fmoo.1 NNmo. oooo. Nooo. Nooo. oooo. .op
ooNN. op oomm. omoo. mmop.1 oeoo. mooo. mooo. ooom. .m
1 >1o coOuOO>mo >1o >1o
> o eo mocmm OOmczmxm OOOoueag LO> mszomo< :Omz wage

 

pm: mN< >1o Noe Ocowuossmm< mews: Opmm prmEOLOO
~.m wpnmh

66

Comparison--Assumptions Met Versus
Assumptions Not Met

 

The average absolute deviation from the true C-V value for
the mean C—V for the sets where the assumptions are met is .0018.

In comparison for the remaining parameter sets, where the assumptions
for the C-V index are not met, the average absolute deviation from
the true C-V for the mean C-V is .1096, a considerable difference.
The variances for the parameter sets where the assumptions are met
range from .0013 to .0088, but the variances for the sets where the
assumptions are not met, range from .0016 to .0094. Ten of these 17
have variances equal to or greater than .0070. The variances are
lowest for those parameter sets (15 through 21) which have sample
sizes of 200 (.0013 to .0023).

The kurtosis values for the distributions of the C-V index
range from -.2957 to .1005. Only four values are positive; two of
these are for those parameter sets where the C-V assumptions are
met. Since the kurtosis values are not very large or very far from
zero, it seems reasonable to describe the distributions as mesokurtic.

The skewness values range from -.2693 to .0938. Fourteen
values are negative. The skewness values are also not very large
or very far from zero, so the skewness for any parameter set is
slight. If the skewness and kurtosis values are considered together,
then the distributions for all the parameter sets can probably be
described as normal.

A comparison of the averages of the absolute deviations from

the kurtosis value of zero for parameter sets with N = 50 and for

67

parameter sets with N = 200 reveals that the latter average is
slightly larger (.11 for N = 50; .12 for N = 200). A similar com-
parison of the averages of the absolute deviations from the skewness
value of zero for parameter sets with N = 50 and for parameter sets
with N = 200 reveals that the latter are less skewed (.11 for N = 50;
.04 for N = 200). The values, however, for N = 50 and N = 200 do
not differ enough for one to infer that the greater sample size
provides a more normal distribution.

Comparison of the average ranges for parameter sets with
N = 50 and N = 200, .54 for N = 50 and .29 for N = 200, does demon-
strate that the C-V estimates are more stable with larger sample
sizes. For those parameter sets that do not meet the C-V assumptions
the average range is .47, N = 17. For those parameter sets that do
meet the assumptions, the average range is .41, N = 4. There appears

to be slight differences in the averages when sample sizes are also

considered. See Table 5.3 below.

Table 5.3
Average Ranges for the C-V Estimates

 

 

All Sample Size = 50 Sample Size = 200
Meet .41 .53 .28
Assumptions (N=4) (N=2) (N=2)
Does Not Meet .47 .54 .29
Assumptions (N=l7) (N=12) (N=5)
All .46 . .54 .29

(N=21) (N=14) (N=7)

 

68

Final evidence of the adequacy is the correlation between
the true C-V value and the mean C-V value. This correlation is .800
(N=21, p <.001). From the evaluation of the other statistics: ranges,
kurtosis and skewness values and variances, in addition to the cor-
relation cited above, one can infer that the C-V technique provides
reasonable estimates of the true C-V value and these estimates are
distributed normally. However, the technique does provide a more

stable estimate for larger sample sizes (N=200).

The R Index: Adequaqy and Stability

Assumptions Met

Consider the statistics of parameter sets 2, 3, 4, 7, 9, 15
and 16 (see Table 5.4). For these parameter sets, the assumptions
for the R index are met. Recall that the assumptions for the R
index are that guessing is the same for the pretest and the posttest
(q2] = q21'), an individual who knows the answer will not fail to
answer correctly (q22 = q22' = 1.0), and an individual who knows
the answer on the pretest does not forget it on the posttest (H3 = 0).
(The parameter sets 15 and 16 are identical to 2 and 3 respectively
except N = 200.)

Comparison--Assumptions Met Versus
Assumptions Not Met

 

The average absolute deviation from the true R for the mean.
R for the sets where the assumptions are met is .0043. In comparison,
for the remaining parameter sets where the assumptions of the R index

are not met, the average absolute deviation from the true R for the

69

 

 

moms. op ommm. memo. omm¢.1 mpoo. mpoo. momo. ommo. .o_
ooNO. on oNoc. Novp.1 oopo.1 «moo. mooo. Nomo. ommo. .mp
oooo.P op oooo. Noom.1 oNoo. mmoo. mpoo. omoo. ammo. .o

oome. op ooom.1 ONmN.1 oopo.1 NOFo. omoo. PooP. FFOF. .N

oooo.F ou oONo.1 Pmoo.1 oNoO.F pmmo. Om—o. mooo. ommo. .O

oomo. o“ moﬁw. mmop. NNom.1 omoo. opoo. meo. ommo. .m

oNoo. oo Nmop. oom¢.1 Npmm. . NOFo. Pooo. ooNo. ommo. .N

a eo moCOm OOmczmxm OOOopezx a LO> :mwwmwmmm cam: womb

 

um: me< x No; OcoOpOEOOO< memzz Opmm ewumsmemo
¢.m mpnm»

70

mean R is .1426. The variances for the R values where the assumptions
are met range from .0015 to .0221 but the variances for the remaining
R values only range from .0026 to .0194. The reverse might have been
expected. It would seem more likely for the R values to be more
stable when the assumptions of the index are met. This does not seem
to be the case; although, the differences in the ranges of the var-
iances are slight.

Other evidence of the stability or lack of stability of the
estimates of the R index can be obtained by consideration of other
distributional statistics; such as skewness, kurtosis and range.

There are 15 total parameter sets where the kurtosis is
positive. A positive value implies that the curve is leptokurtic
(peaked). Two of the positive kurtosis values are near zero (parameter
set #12, K = .0600 and parameter set #18, K = .0584). For these two
parameter sets the distributions can probably be described as meso-
kurtic. The remaining six parameter sets have negative kurtosis
values; three of these are near zero (parameter set #7, K = -.0186;
parameter set #15, K = -.0189; parameter set #10, K = —.0871). A
negative kurtosis value generally indicates that the curve is
platykurtic (flat); however, the three curves whose values are near
zero could be considered mesokurtic. The largest value is 1.7626 for
parameter set #4. This is one set where the assumptions of the R
index are met. Ideally, the 1000 R values should be concentrated in
a narrow range about the true value.

There are 19 total parameter sets where the distribution is

negatively skewed. Only two parameter sets have positively skewed

71

distributions. Parameter set #16 has a skewness value close to zero
(Sk = .0242) which might indicate a non-skewed distribution. When the
kurtosis value is also considered (K = -.4356), it appears that the
distribution is slightly flat. However, the kurtosis value is not
very large so the interpretation of the two statistics could be that
the distribution of R values for parameter set #16 is fairly normal.
Parameter set #16 has a sample size of 200. Parameter set #15, also
with a sample size of 200, has a small negative kurtosis value and a
small negative skewness value. Again one might infer that the dis-
tribution is fairly normal. Perhaps, for parameter sets that meet
the assumptions of the R index and have sample sizes of 200, the R
values are distributed more normally.

If the other five parameter sets with N = 200 (17, 18, 19,
20 and 21) are also considered, the skewness values are greater than
the values for parameter sets #15 and #16. However, the skewness
value for parameter set #21 (Sk = -.1492) is not very different than
the value for #15 (Sk = -.l462). Also the kurtosis value is fairly
small (K = -.1075). The assumptions for the R index are almost met
in #21 except n3 does not equal zero. The kurtosis values for these
five sets, are all small with three positive values and two slightly
negative. It is interesting to note that the highest kurtosis value
(absolute value) of the seven sets with N = 200, is set #16. A com-
parison of #15 and #16 with the remaining five parameter sets with
N = 200 seems to show that if the assumptions of R are met (or nearly

met) the distribution is more nearly normal.

72

If the absolute deviations from the kurtosis value of zero
are averaged for the parameter sets with N = 50 and for the parameter
sets with N = 200 and these two values (.46 and .17, respectively)
are compared, then further evidence is obtained that the distributions
of R values for larger sample sizes are more nearly normal.

A similar consideration of the absolute deviations from the
skewness value of zero reveals that the average for parameter sets
with N = 50 (.47) is larger than the average for parameter sets with
N = 200 (.22). It seems then that for any given parameter set as the
sample size increases the distribution of R values approaches a normal
distribution.

Now consider the ranges of the R values for the 21 parameter
sets. For parameter sets with sample sizes of 50, N = 14, the average
range is .72. For parameter sets with sample sizes of 200, N =.7,
the average range is .36. The ranges, then, were decreased on the
average by one-half when the sample sizes were increased. For those
parameter sets that do not meet the assumptions and with sample sizes
of 50, N = 9, the average range is .74. For parameter sets with
sample sizes of 200, N = 5, the average range is .39. For those
parameter sets that do meet the assumptions, the average range is .67
for sample sizes of 50, N = 5, and .26 for sample sizes of 200, N = 2.
There is some reduction in the ranges when the assumptions are met;
however, just the increase in sample size without meeting the assump-
tions has a marked effect on the stability of R.

Final evidence which might be considered in answering the

question of adequacy is the correlation of the true R values with

73

 

 

Table 5.5

Average Ranges for the R Estimates

A11 Sample Size = 50 Sample Size = 200
Meet .55 .67 .26
Assumptions (N=7) (N=5) (N=2)
Does Not Meet .62 .74 .39
Assumptions (N=14) (N=9) (N=5)
All .60 .72 .36

(N=21) (N=14) . (N=7)

 

the mean of the estimated R values for each parameter set. This cor-
relation is .759 (N = 21, p <.OOl). Even though this correlation is
significant, it must be remembered that for any given parameter set
there were many R values which greatly differed from the true R.
Consideration of ranges, variances, skewness and kurtosis values
reveals that the R technique more adequately estimates the true R
when the sample size is larger, i.e. N = 200. In addition the R
technique more adequately estimates the true R when the assumptions
of R are met. The differences in the adequacy are far more dramatic,
however, when the sample size is increased than when the assumptions

are met.

The C-V Technique Versus the R Technique
When the distributions of the estimates of the C-V and R
values are compared it appears that, in general, the C-V estimates
are distributed more normally than the R estimates. The R values

tend to be higher than the C-V values and there seems to be a ceiling

74

effect, i.e. the R distributions are generally skewed negatively
and in almost every case approach the upper bound (1.0). The C-V
distributions while skewed negatively in 14 cases seem to span a
middle range of values.

Consider the summary statistics for the C-V and R distribu-

tions provided in Table 5.6 and the correlation matrix in Table 5.7.

Table 5.6
Summary Statistics Comparing R to C-V

 

 

 

 

 

Average Absolute Average
11.11.1101. From 52219212; iii-135?: Eié‘iﬁeii Range of
True Value Values
.0013 to -.2957 to -.2693 to
C‘V “089' .0094 .1005 .0938 '45
.0015 to -.4356 to -.8651 to
R '0965 .0221 1.7626 .1022 '60
Table 5.7
Corre1ations
True C-V Mean R Mean C-V
True R .820 .759 .608
True C-V .855 .800
Mean R .891

 

One can infer from these statistics that the C-V technique
estimates the true C-V values better than the R technique estimates

the true R values. The variances of the distributions of the C-V

75

values are smaller. The largest variance for the C-V values is .0094
while the largest variance for the R values is .0221. The range of
the kurtosis values and the range of the skewness values for the C-V
index are considerably smaller than the ranges for the R index. The
average range of values for the parameter sets is smaller for the C-V
index than for the R index. Finally, if the correlations of the mean
index with the true value are considered, the C-V technique provides a
closer estimate of the true C-V value (r = .80 for C-V compared to

r = .76 for R). It is interesting to note that the means of the
estimates of the Rindex are more closely related to the true C-V
value (.86) than they are to the true R values (.76) or than the
means of the estimates of the C-V index are related to the true C-V
values (.80).

In interpretation of the correlations, it must be remembered
that the mgppg of the estimated values for a given parameter set
(over 1000 values) are correlated with the true values. Means are
more stable than the actual estimates. The other statistics, range,
kurtosis, variance, and skewness, must be considered in the evalua-
tion of the adequacy of the techniques. When all statistics are con-
sidered, the C-V technique seems to provide a more stable estimate
of the true value than the R technique and the distributions of the
C-V values seem to be more normally-shaped than the R values for any

given parameter set.

76

Consideration of C-V and R Techniques
By Parameter Set

 

 

The conclusion from the analyses cited in the previous sec-
tions is that the C-V technique provides a more stable estimate of
the true value than does the R technique. Now consider the parameter
sets individually. Perhaps one technique is a better estimator than
the other technique under certain conditions. If so, what are these
certain conditions?

Consider, first, the parameter sets where the assumptions for
the index are met. Table 5.8 gives the summary statistics for R and
C-V.

It is apparent from these data that, over 1000 samples, the
mean estimate for either index, is better when the assumptions are
met than when they are not. (See column one of Table 5.8.) One
perplexing fact is that the variances for those parameter sets where
the R assumptions are met, span a larger range than do those parameter
sets where the R assumptions are not met. However, if the size of
the samples is also considered and only those parameter sets with
N = 200 are compared, then the variances are less when the R assump-
tions are met. Interestingly, the same unexpected result occurs if
the variances of the C-V estimates are considered. Here, for sample
sizes of 50, the range of the variances is slightly greater when the
assumptions are met than when they are not met. This is not true,
however, for sample sizes of 200. Caution must be used in interpret-
ing these results, since the number of parameter sets used is quite

small (see column six of Table 5.8).

77

 

 

 

 

 

 

 

 

 

NN oo. NNoN. ou Nmoo.1 oNoN.N op omm¢.1 NNNo. cu ONoo. Oomo. NN<

N om. NONo. ou Noam.1 OONN. op ommc.1 Nvoo. ou ONoo. Nmoo. ooN u z

e— NN. NNoN. o» Nmoo.1 ONON._ op NNom.1 NNNo. ou Nmoo. NNoo. om u 2
N13

c— No. Now—.1 on «NON.1 NONN.N op mNoN.1 capo. ou ONoo. oNON. NN<

m on. Noqp.1 cu Noon.1 OONN. op mNo_.1 Ncoo. ow ONoo. oNON. ooN u z

o cN. momm.1 op «NON.1 NoNN.N op oooo. capo. cu Nmoo. NoON. om u 2
pm: uoz Oco_uoE:OO<

N OO. NNoN. on Nmoo.1 oNoN._ o» ommc.1 NNNo. o» mNoo. mooo. NN<

N oN. NONo. o» NoON.1 ooNo.1 0» omm¢.1 vNoo. op ONoo. oooo. ooN u z

O No. NNoN. 0» Nmoo.1 oNoN._ 0» NNom.1 NNNo. op Omoo. omoo. om u 2
pm: OcowuoEOOO<

m

NN ow. omoo. ou moON.1 moop. op NmmN.1 «moo. ou NNoo. oooo. __<

N oN. omoo. ou mmN_.1 ONNo. op NmoN.1 ONoo. o» mNoo. ONoo. ooN u z

O— OO. omoo. oa moON.1 moop. ou OOON.1 Omoo. ou oqoo. NNoN. om u z
a

Np NO. mmmo. o» mooN.1 mooN. ow NOoN.1 oooo. ou o_oo. oooN. NN<

m mN. omoo. ou NNN_.1 ONNo. 0» NOON.1 mNoo. ow opoo. NNoo. ooN u 2

NF cm. Ommo. op NOON.1 mooN. o» OOON.1 emoo. op oOoo. ooNN. om u z
um: uoz Ocowuo53OO<

e NO. omoo. ou OON_.1 NNmo. op NNNN.1 oooo. on ONoo. NNoo. NN<

N oN. oooo.1 op OONN.1 _Noo. op NoNN.1 NNoo. ou ONoo. NNoo. ooN u z

N mm. omoo. 0» Nmoo.1 NNmo. op mNoN.1 oooo. oo mOoo. ONoo. om u 2
pm: OcowuaEOOO<

>1o
Oumm OONO> OONN
Omapm> No OOmczmxm ONOopLOx Omocmpcm>
ewumsmgaa . EoNN OONNON>mo
mo gossaz mocma momem>< No moccm mo mmcmm No mocmm waspomno moONm><

 

OcowuaszOO< cam ONNm m—anm No :oNNONmuOOcoo :u_3 >1o com m Now OuNuONqum accessm

w.m m—omh

78

The range of kurtosis values for R is greater when meeting
the assumptions than when not. The opposite is true for the range of
kurtosis values for C-V. Similarly, the range of skewness values for
R is greater when the assumptions are met than when they are not and
the range of the skewness values for C-V is smaller when the assump-
tions are met than when they are not. Finally, the average range of
the respective values is smaller for both indices when the assumptions
are met.

Sample size has a marked effect on the results of the simula-
tion for any parameter set. Noted above was the effect that sample
size had on the range of the variances. Also the mean of the esti-
mated values is closer to the true value for both indices when the
sample size is 200. However, the increase in sample size for both
indices has a greater effect on the mean of the estimated values when
the assumptions are met than when the assumptions are not met. Of
course, the increase in sample size also decreases the range of esti-
mated values for both indices. For R, this average range is reduced
by 61 percent for the parameter sets meeting the assumptions, but
only by 47 percent for those not meeting the assumptions. For C-V,
the reduction in the average range is 47 percent and 46 percent
respectively. The increase in sample size narrows the range of
estimated values considerably.

The ranges of skewness and kurtosis values are much narrower
for the parameter sets where N = 200 than for the parameter sets
where N = 50 for the R index. For C-V, the ranges are closer,

although generally smaller for N = 200. There is one exception; the

79

range of the kurtosis values when the C-V assumptions are met is
greater for N = 200 than for N = 50. However, there were only two
parameter sets included for these categories, so the statistics must
be interpreted cautiously.

Two factors have been considered above; one, whether or not
the assumptions of a particular index were met and two, sample size,
i.e. what happened to the distributions of estimated values when the
sample size was increased from 50 to 200. The analysis of the data
with respect to these two factors seems to indicate that the C-V
method provides a better estimate when the assumptions are met;
although, the technique is still good under other assumptions. The
R method seems to be unstable. The descriptive statistics indicate
that the R method does not provide good estimates even under the best
of circumstances. An increase in sample size helps the R method.

The C-V technique, although a better technique with a larger sample
size, remains stable with smaller sample sizes.

Now consider the individual parameter sets. Consider only
the mean of the estimates, the variances, and the ranges for each
index for each parameter set. Table 5.9 indicates for each parameter
set whether the absolute deviation from the true value is smaller for
R or C-V, the variance is smaller for R or C-V, and the range is
smaller for R or C-V. For each column the letter R or C-V indicates
that the statistic is smaller for that technique.

In 29 percent of the parameter sets the R technique estimates
the true value better than the C-V technique estimates the true

value. In less than 10 percent of the cases the variance for R is

80

 

 

. Table 5.9 1
Compar1son of R and C-V by Parameter Set

P3424219" Assistant?“
l. C-V C-V C-V
2. R C-V C-V
3. C-V C-V C-V
4. R C-V C-V
5. C-V C-V C-V
6. C-V C-V C-V
7. R C-V C-V
8. R R R
9. R R R
10. C-V C-V C-V
ll. C-V C-V C-V
12. C-V C-V C-V
13. C-V C-V C-V
l4. C-V C-V C-V
15. R C-V C-V
16. C-V C-V R
17. C-V C-V C-V
18. C-V C-V C-V
l9. C-V C-V C-V
20. C-V C-V C-V
21. C-V C-V C-V

 

1For each column the letter R or C-V indicates that the
statistic is smaller for that technique.

smaller, and in 14 percent of the cases the range is

smaller.

Consider the two parameter sets where the R technique appears

to be the better technique (#8 and #9). In these parameter sets, it

81

was assumed that 80 percent of the individuals would not know the
answer at pretest but would know it at posttest (112 = .80). (See
Table 4.5 in Chapter IV.) In parameter set #8, it was also assumed
that instruction would improve the chance of guessing (q2] = .25,
q2]' = .50), and that for the pretest there would be some chance that
an individual knowing the answer would fail the item (q12 = .10).
Parameter set #9 assumed only that there was the same chance of
guessing for both pretest and posttest (q2] = q21' = .25). This
parameter set meets the R assumptions. It is interesting to note
that for the parameter sets where an R occurs in any column the
assumptions for the R index are met in six of these seven cases.
Other than the two factors, sample size and meeting the
correct R assumptions, there seems to be no pattern for the estimates
being better for one parameter set than for another. It does appear,
however, that the more assumptions of the R technique that are not
met, the less accurate the estimates. The C-V technique seems to

provide reasonable estimates regardless of sample size or meeting

assumptions.

Summary

Three questions were considered in the designing of the
simulation. These were:
1. Do the C-V and R techniques adequately estimate the true
values?
2. Does one technique estimate the true values better than

the other?

82

3. Do the C-V and R techniques estimate some true values

better than others?

The answers to these questions were discussed in this chapter.
First, the adequacy of a technique (R or C-V) was determined by con-
sideration of a number of descriptive statistics. It was found that
for R, when the assumptions are met, the technique provides a more
stable and accurate estimate. It was also found that when the sample
size is increased from 50 to 200, the stability and accuracy increases
greatly. A correlation coefficient of .759 between the true R value
and the mean R value for 1000 estimates implies that the procedure
provides a reasonable estimate of the true R.

The C-V technique seems to provide a reasonably accurate and
stable estimate regardless of whether the assumptions are met. The
estimates, however, are more stable with larger sample sizes, e.g.
average range is .54 for N = 50 and .29 for N = 200. A correlation
coefficient of .80 between the true C-V and the mean C-V value for
1000 estimates implies that the procedure provides a reasonable esti-
mate of the true C-V.

Second, the C-V technique seems to estimate the C-V true
value better than the R technique estimates the R true value. The
average absolute deviation from the respective true values is smaller
for C-V than for R (.0891 and .0965, respectively). In addition
the range of variances is considerably smaller for the C-V estimates
than for the R estimates (.0013 to .0094 for C-V and .0015 to .0221
for R) and the average range of estimated values is smaller (.46 for

C-V and .60 for R).

83

The third question was primarily answered in considering the
question of adequacy and stability. For both techniques the estimates
are better when the sample size is larger (N = 50 versus N = 200).

In addition, the R approach is better when the assumptions are met.
This is not true for the C-V approach. The C-V approach seems to
provide a good estimate under almost any assumptions.

The next chapter describes the results of the comparison of
the R and C-V approaches using actual data on 128 items. In addition
a third approach, the B-5 method, is also used on 64 of the 128 items

and the results compared to the R and C-V values.

CHAPTER VI

RESULTS OF THE COMPARISON OF THE THREE
INDICES WITH ACTUAL DATA

The purpose of this part of the study was to determine the
comparability of the three item analysis procedures, C-V, R and B-5.
For this comparison, data were obtained from the Michigan Middle
Cities Project. Sixteen objectives were chosen from two subject
areas, Mathematics and Reading. In addition two levels, Middle and
Upper, were considered in the selection of the objectives. These
levels refer to grades three and four, and five and six, respectively.
Each objective was tested by four items on a pretest, four different
items on a posttest and all eight items on a retention test. The
retention test was given approximately 40 days after the posttest.
There were also two treatment groups considered in the selection of
the objectives. In treatment A, teachers were assigned objectives.
In treatment 8, teachers selected the objectives. Diagram 6.1 shows
the complete design.

The major question considered was "Do the C-V, R and B-S
item analysis procedures provide comparable results?" Three other
questions were also considered:

1. Are the three procedures more comparable for items in

Mathematics than for items in Reading?;

84

85

 

Objective

 

 

 

 

 

 

 

 

 

Subject Level ‘ Treatment Number N Items
142 31 1-8
A
116 59 9-16
Middle
112 21 17-24
B
120 20 25-32
Reading
145 66 33-40
A
199 57 41-48
Upper
182 30 49-56
B
166 18 57-64
108 52 65-72
A
111 43 73-80
Middle
107 42 81-88
B
109 37 89-96
Mathematics
198 22 97-104
A
176 46 ' 105-112
Upper
187 16 113-120
B .
167 17 121-128
Diagram 6.1

Design,of Administration of Items

86

2. Does the comparability of the three procedures depend
on the grade level?; and,
3. Are the three procedures equally comparable for items
given in treatment A as for items given in treatment B?
These last three questions are part of the major question and will be

treated as such in the discussion of the results.

Comparability

C-V and R

Consider the testing procedure for each objective. Four items
were given on the pretest, four different items were used on the
posttest, and all eight items were included on the retention test.
For computation of a C-V index or an R index it is necessary to have
data on a given item at two times, preferably a pretest and a post-
test. In this situation, it was necessary to compute the C-V and R
indices from pretest-retention test data and from posttest-retention
test data. The indices can be computed from pretest-posttest data
on parallel items, but the usefulness and meaningfulness of these
data for item selection and revision is questionable. There are 64
items using pretest-retention test data for which C-V and R can be
computed and 64 different items using posttest-retention test data
for which C-V and R can also be computed. These two sets of data
were considered separately in the analyses.

The results of the correlations of C-V and R are presented

in Table 6.1.

87

 

 

Table 6.1
Corre1ations of C-V and R
r(C-V, R) r(C-V, R)
N (Pretest- N# (Posttest-

Retention) Retention)
All 64 .80** 55 .76**
Math 32 .88** 27 .81**
Reading 32 .87** 28 .67**
Upper 32 .81** 28 .62**
Middle 32 .80** 27 .82**
Treatment A 32 .79** 3O .69**
Treatment 8 ' 32 .82** 25 .80**

 

**Significant at p<:.Ol
#Some values of R did not exist because there were no individ-
uals in the combined categories of f1 (fail-fail) and f2 (fail-pass).

The computation of R involves f] + f2 in the denominator and if this
is zero, R does not exist.

The correlations between C-V and R for the indices computed
on pretest and retention test data range from .79 to .88. All these .
correlations are significantly different from zero (p‘<.01). Using
Fisher's Z-transformation, pairwise comparisons of the correlations
between Mathematics and Reading, Upper and Middle, and Treatment A
and Treatment 8 showed no significant differences.

The correlations between C-V and R for the indices computed
on posttest and retention test data range from .62 to .82. Again
all these correlations are significantly different from zero (p<<.01).
Using Fisher's Z-transformation, pairwise comparisons of the correla-
tions between Mathematics and Reading, Upper and Middle, and Treatment

A and Treatment 8 were made. There were no significant differences.

88

For both sets of data (items given on the pretest and reten-
tion test and items given on the posttest and retention test), the
analyses indicate that:

1. The C-V and R values are significantly related and the

procedures would result in similar item selection;

2. The C-V and R values are not more related for Mathematics

than for Reading;

3. The relationship between the C—V and R procedures does

not depend on grade level; and,

4. There is no difference in the relationship between the

C-V and R procedures when treatments are considered.

B-S and C-V

 

The B-S procedure requires that an item be given on a pre-
test and on a posttest. In this situation, it was necessary to apply
the B-S rules to pretest-retention test data only. There are 64
items for which a decision about item revision, using the B-S pro-
cedure, can be made. Using the rules on posttest-retention test data
is not meaningful.

There is also one additional restriction. To apply some of
the decision rules, it is necessary to select cut-off values. The
analyses of the items using the B-S approach were based on an
arbitrary cut-off value of .50 for the error rates. If the error
rate was below .50 the error rate was considered low; if above .50,
the error rate was considered high. The original intent was to select

multiple cut-off values for the error rates. But the data indicated

89

that choosing different cut-off values would not change the decision
about the revision of the items. Only 18 items met the criterion of
no significant positive difference between the theoretical error rates
(TER) and the pretest error rates (BER). These items all had values
of BER greater than .50. TER, since the items were three-option
multiple choice items, is always .67. It would be meaningless to
lower the cut-off for the error rates since the same 18 items would
be chosen. To raise the cut-off would exclude more items but since
the stronger criterion of no significant positive difference between
TER and BER is met for these 18 items the increase in the cut-off
does not seem particularly reasonable.

First the individual statistics, TER, BER, PER, 801 and PDI
were computed for the 64 items. Then the appropriate rules were
applied and a decision was made about the quality of the item; that
is, does the item need to be revised? Each item was assigned a "O"
or a "1" depending on the outcome of the application of the rules.

A "0" indicates revision is required, and a "1" indicates no revision
is required. See Appendix IV for the statistics on the 64 items and
the resulting application of the rules. Application of the rules
resulted in 18 items needing no revision.

Point-biserial correlations were computed between the result-
ing values from the B-S procedure and the C-V values. The correlations
are presented in Table 6.2.

The correlations between C-V and B-S values range from .45

to .84. All these correlations are significantly different from zero

9O

 

 

Table 6.2
Correlations Between B-S and C-V

N rp-bis
All 64 .70**
Mathematics 32 .69**
Reading 32 .50**
Upper 32 .70**
Middle 32 .68**
Treatment A 32 .45**
Treatment B 32 .84**

 

*1:
Significant at the .01 level.

(p <.Ol). Pairwise comparisons reveal the largest difference is
between the point-biserials for treatment A and treatment B.

These analyses indicate that application of the B-S or C-V
procedure results in selection of many of the same items. In
addition, the B-S and C-V procedures are slightly more comparable for
Mathematics than for Reading; the relationship between the procedures
does not depend on grade level; and the B-S and C-V procedures are

considerably more comparable for treatment 8 than for treatment A.

B-S and R

The same restrictions apply to the comparisons of the B-S and
R indices as to the comparisons of the B-S and C-V indices. Point-
biserial correlations were computed between the B-5 values and the R

values. The correlations are presented in Table 6.3.

91

Table 6.3
Corre1ations Between B-S and R

 

 

N rp-bis
All 64 .36**
Mathematics 32 .39*
Reading 32 .24
Upper 32 .37*
Middle 32 .37*
Treatment A 32 . .21
Treatment 8 32 .52**

 

*Significant at the .05 level.

*
T Significant at the .01 level.

The correlations between R and B—S values range from .21 to
.52, considerably smaller than the correlations between the C-V and
B-S values. Only correlations between all the R and B-S values and
the R and B-S values for treatment B are significant at the .01 level.
The correlations for Mathematics, Upper and Middle are significant
at the .05 level. The correlations are not significantly different
from zero for Reading and treatment A. Pairwise comparisons show
that the largest difference is between the correlations for treatment
A and treatment B.

These analyses indicate that the relationship between the
results of the B-S and R procedures is not very strong, but many of
the same items would be selected with either procedure. In addition
the B-S and R procedures are considerably more comparable for Mathe-

matics than for Reading and for treatment 8 than for treatment A.

92

The relationship does not appear to depend on grade

level.

B-S and C-V and R

The correlations between the indices, R, C-V and B-S
for the 64 pretest-retention test items are all significantly differ-
ent than zero (p <.01). The relationship between the R and B-S
values is markedly different than the other two relationships.

Table 6.4 summarizes the three correlations.

 

 

Table 6.4
Corre1ations for All Items
(N=64)
R B-S
C-V .80** .70**

B-S .36**

 

**Significant at the .01 level.

These significant correlations indicate that the three indices are
related. In particular the R and C-V indices are the most similar.
The R index, however, does not appear to give results as similar to
the B-S procedure as does the C-V index.

Consider the correlations between the indices for each subject
area, grade level and treatment for the 64 items (pretest-retention

test). These correlations are reported in Tables 6.5 A and B, 6.6 A

and B and 6.7 A and B.

93

Table 6.5 A
Correlations--Mathematics

 

R

B-S

 

C-V
B-S

.88**
.39*

.69**

 

Table 6.5 B
Correlations--Reading

 

R

 

C-V

.87**
.24

.50**

 

Table 6.6 A
Corre1ations--Midd1e

R

B-S

 

C-V
B-S

.80**
.37*

.68**

 

Table 6.6 B
Corre1ations--Upper

 

R

B-S

 

.81**
.37*

.70**

 

94

Table 6.7 A
Corre1ations--Treatment A

 

R

B-S

 

C—V .79**
B-S .21

.45**

 

Table 6.7 B
Corre1ations--Treatment B

 

B-S

 

C-V .82**
B-S .52**

.84**

 

*Significant at the .05 level.

*TSigniiicant at the .01 level.

Based on these correlations it appears that the three pro-

cedures are more comparable for items in Mathematics than for items

in Reading, and for items given in treatment 8 than for items given

in treatment A.

Although all the procedures are significantly related for

Mathematics the relationship between the B-S procedure and the R index

is markedly different than the R--C-V and C-V--B-S relationships.

This same difference in the size of the relationships appears in all

of the other comparisons, i.e. Reading, Middle, Upper, treatment A,

and treatment B. The difference is less for treatment B correlations

than for the other comparisons.

95

An alternate method of analyzing the comparability of the
three approaches would be to consider the agreement among the three
methods. If a cut-off value for the C-V and R index is chosen as
.50, i.e. those items with an R or C-V value equal or above .50 are
considered to be good items, then 32 items out of 64 items would be
selected based on the R values and 11 items would be selected based
on the C-V values. Of the 32 items selected based on the R values,
13 were also selected using the B-S procedure. All 11 items selected
based on the C-V values were selected using the B-5 procedure.
Similarly all 11 items selected based on the C-V values were selected
using the R procedure (see Table 6.8).

Table 6.9 represents the agreement among the three indices.
There is complete agreement for 39 of the 64 items or 61 percent. Of
the items where there is 100 percent agreement, 21 of the 39 were
Reading items (54 percent); 20 of the 39 were given in the Middle
grades (51 percent) and 15 of the 39 items were used in treatment A
(38 percent). The disagreement among procedures is more noticeable
between treatments. Of the 64 items there is agreement between the
C-V and B-3 procedures for 57 items or 89 percent. There is con-
siderably less agreement between the C—V and R and R and B-S, the

percentage agreement being 69 percent and 64 percent respectively.

Summary

The purpose of this part of the study was to determine the
comparability of three item analysis procedures; C-V, R and B-S.

Sixteen objectives were chosen from two subject areas, Mathematics

96

Table 6.8

B-S, R and C-V Values for Items Given

on the Pretest and Retention Test

 

 

Item Identification* B-S R C-V
R116Gl RAM 0 .1 .03
R116G2 RAM 0 .48 .25
R116G3 RAM 0 .11 .02
R11604 RAM 0 O 0
R12051 RBM O O O
R12052 RBM O O O
R12053 RBM O 1.0 -.25
R12054 RBM O 1.0 -.15
R142Gl RAM 0 1.0 .06
R142G2 RAM 1 1.0 .52
R142G3 RAM 0 .86 .19
R142G4 RAM 1 .88 .48
R11251 RBM O .71 .24
R11252 RBM O .75 .14
R11253 RBM 1 .09 .05
R11254 RBM O .33 .05
M10951 MBM O .33 .03
M10952 MBM O .27 .11
M10953 MBM O .5 .19
M10954 MBM 0 .21 .08
M108G1 MAM 0 .71 .38
M108G2 MAM l .61 .38
M108G3 MAM l .90 .73
M108G4 MAM l .83 .58
M10751 MBM O .06 .024
M10752 MBM O O O
M10753 MBM O .18 .071
M10754 MBM 0 .06 .024
MlllGl MAM O .75 .28
M11162 MAM O .65 .26
M111G3 MAM O .47 .19
M111G4 MAM 0 .53 .21
M18751 MBU l .90 .56
M18752 MBU 1 .91 .63
M18753 MBU l 1.0 .63
M187S4 MBU l 1.0 .75
R18251 RBU O .67 .2
R18252 RBU O 0 O
R18253 RBU O .33 .07
R18254 RBU 0 -.33 -.O3
M16751 MBU 1 .75 .53
M16752 MBU 1 .73 .65
M16753 MBU 1 .8 .71
M16754 MBU 1 .71 .59

97

Table 6.8--Continued

 

 

Item Identification* B-S R C-V
R16651 RBU O 1.0 .33
R16652 RBU 0 .2 .06
R16653 RBU O .67 .ll
R166S4 RBU O .5 .17
M19861 MAU O .5 .14
M19862 MAU O .67 .18
M19863 MAU 0 .7 .32
M19864 MAU O .5 .14
M17661 MAU 1 .23 .15
M17662 MAU l .26 .26
M17663 MAU 1 .04 .04
M17664 MAU l .11 .11
R14561 RAU O .27 .09
R14562 RAU 0 .5 .20
R14563 RAU O .3 .15
R14564 RAU 0 .3 .11
R199Gl RAU 0 .4 -.l4
R19962 RAU O -.5 -.1l
R19963 RAU O -1.78 -.28
R199G4 RAU O -.125 -.05

 

*The last three letters of the Item Identification refer to
subject area (M = Mathematics, R = Reading); treatment (A or B); and
grade level (M = Middle, U = Upper).

Table 6.9
Agreement of the Three Item Indices
100% Agreement

 

 

Items Items
Unacceptable Acceptable Total
All 28 (44%) 11 (17%) 39 (61%)
Mathematics 8 (20%) 10 (26%) 18 (46%)
Reading 20 (51%) 1 ( 3%) 21 (54%)
Midd1e 17 (43%) 3 ( 8%) 20 (51%)
Upper 11 (28%) 8 (21%) 19 (49%)
Treatment A 12 (30%) 3 ( 8%) 15 (38%)
Treatment B 16 (41%) 8 (21%) 24 (62%)

 

98

Table 6.9 A
Agreement of the Three Item Indices
67% Agreement

 

 

Items Items
Unacceptable Acceptable Total
All 23 (36%) 2 ( 3%) 25 (39%)
Mathematics 13 (52%) 1 ( 4%) 14 (56%)
Reading 10 (40%) 1 ( 4%) 11 (44%)
Middle 10 (40%) 2 ( 8%) 12 (48%)
Upper 13 (52%) O ( 0%) 13 (52%)
Treatment A 15 (60%) 2 ( 8%) 17 (68%)

O

Treatment 8 8 (32%) ( 0%) 8 (32%)

 

99

 

 

NOONV N NNOOV N NNOV O NOOO O NOOO O NOOV O NOONV N NNOOV N NOOO O O NOOOOOOON
NOOOV O NONOV NN NOOONVN NOOV O NOOV O NOOONVN NOOOV O NONOV NN NOOO O O OOOENOOON
NOOOV O NNOOV O NOON O NOOV O NOOV O NOOV O NOOOV O NOOOV O NNOV O NOOOO
NOONV N NNOOV O NNOONON NOOV O NNOV O NOOONVN NOONO N NOOOV O NOOV O ONOONz
NOONV N NOOOV O NOOOV N NOOO O NOOV O NOOOV N NOONV N NOOOV O NOOV O OONOOOO
NNOOV O NOOOV O NNOOV N NOOV O NOOV O NOOOV N NNOOV O NOOOV O NOOV O OONNOEOONOE
NOOV O NOONV ON NOOV N NOOV O NOOV O NOOO N NOOV O NOONV ON NOOV O NNO
O-O .O O-O .>-O O-O .O >-O .O O-O .>-O O-O .O O-O .N O-O .>-O O-O .O

NONON

wNompamou< OEONN

mNnONomoumco OEONH

 

pcwsmmgo< NNo

OmuNucN EmpN mmech on» mo pcmamoeo<

m m.o mpnmh

100

and Reading, two grade levels, Middle and Upper, and two treatments,
assigned objectives (treatment A) and selected objectives (treatment
B). A total of 64 items were analyzed using each of the three item
analysis procedures. An additional 64 items were analyzed using only
the C-V and R procedures.

The major question to be answered was do the C-V, R and B-5
item analysis procedures provide comparable results? Three additional
questions also were considered:

1. Are the three procedures more comparable for items in

Mathematics than for items in Reading?;

2. Does the comparability of the three procedures depend on

the grade level?; and,

3. Are the three procedures more comparable for items given

in treatment A than for items given in treatment B?

Correlation coefficients were computed between the indices
for the 64 items given on a pretest and a retention test. The
Pearson product moment correlation coefficient between the R and C-V
indices was significantly different than zero (r = .80, pi<.Ol).

The point-biserial correlation coefficients between the B-5 procedure
and the C-V index and the B-5 procedure and the R index were also
significantly different than zero (r = .70, p<:.Ol and r = .36, p<<.Ol,
respectively). These correlations indicate that the three indices

are related and provide reasonably comparable results.

The separate analyses of the indices for each subject area,
grade level and treatment indicated that the indices were more com-

parable for Mathematics than for Reading, with all significant

101

correlations between the indices for Mathematics [r(R,C-V) = .88,

p <.01; r(C-V, B-S) = .69, p<<.01; r(R,B-S) = .39, p‘<.051 and only
two out of the three correlations significant for Reading [r(R,C-V)

= .87, p‘<.01; r(C-V, B-S) = .50, pi<.01; r(R,B-S) = .24, not signif-
icant]. The indices were also more comparable for treatment 8 than
for treatment A, with all significant correlations for treatment B
[r(R,C-V) = .82, pi<.01; r(C-V, B-S) = .84, p<=.01; r(R,B-S) = .52,

p <.01] and only two out of the three correlations significant for
treatment A [r(R,C-V) = .79, p<I.Ol; r(C-V, B-S) = .45, p<:.Ol;

r(R, B-S) = .21, not significant]. The correlations between the
indices for the grade levels, Middle and Upper, were almost identical,
with r(C-V, R) = .80 for Middle and .81 for Upper, r(C-V, B-S) = .68
for Middle and .70 for Upper, r(R, B-S) = .37 for both Middle and
Upper. Although all the correlations were significant, the correla-
tions between the R index and the B-5 procedure were significant at
the .05 level while the other correlations were significant at the
.01 level.

The comparison of the R and C-V indices on the 64 items given
on a posttest and a retention test provide additional support that the
use of either the R index or C-V index would result in selection of
many of the same items. Although all the correlations were signifi-
cant at the .01 level, the correlations between R and C-V for Mathe-
matics (.81), Middle (.82) and treatment B (.80) were larger than
the correlations between R and C-V for Reading (.67), Upper (.62) and

treatment A (.69).

102

It is interesting to note that the predictions that the
indices would be more related for Mathematics than for Reading and
more related for treatment 8 than for treatment A were supported by
the results. In addition, pairwise comparisons of the indices by
grade level did not reveal any differences. This was also predicted.

An analysis of the agreement among the three item analysis
procedures showed that when a cut-off of .50 for the R and C-V indices
was used for selection of items, there was complete agreement for 39
of the 64 items (61 percent) given on the pretest and retention test.

In the final chapter the results presented in Chapter V and .
Chapter VI are reviewed. The implications of these results for test

development are also discussed in Chapter VII.

CHAPTER VII
SUMMARY AND CONCLUSIONS

Summar

The first purpose of this study was to propose a theoretical
conception of criterion-referenced testing and to explain two basic
item analysis techniques (Cox and Vargas, C-V and Roudabush, R)
theoretically with respect to this general model. The second purpose
was to determine the adequacy of the C-V and R procedures using the
theoretical model. The final purpose was to compare three item anal-
ysis techniques, the C-V, R and the Brennan and Stolurow (B-S), using
real data.

Previous research indicated that the C-V index, defined as
the difference between the proportion of individuals answering the
item correctly at posttest and the proportion of individuals answering
the item correctly at pretest, was an appropriate item analysis tech-
nique for criterion-referenced tests. Most of the comparative
studies did compare the C-V index to other indices, but in general
these other indices were traditional indices rather than other pro-
posed item analysis techniques for criterion-referenced tests.

The other two indices included in the study (R and B-5) had
not been previously researched. The R index is a refinement of the

C-V technique and the B-5 procedure combines traditional methods

103

104

with a set of decision rules to provide a guide for selecting items
which are sensitive to instruction.

A theoretical model for criterion-referenced testing was
proposed. The model includes 12 parameters that completely describe
the pretest-posttest situation. The model is represented in matrix
notation as:

£=QIQ'_I'

where aisymbolizes the Kronecker product, and

I 1 1T
(”11 q12) (Q11 q12) 11]
Q: g Q. = 1 I :1: 2 s

q21 q22 q21 q22 - n3

1'4
and - P]
£= p2
p3
p4

and qij’ qij" "k and pk are defined by the Tables 7.1, 7.2, 7.3 and

 

 

7.4 below.
Table 7.1
Pretest--Actua1
Does Not Know Knows
Fail qn q12
OBSERVED

Pass q21 q22

 

105

- Table 7.2
Posttest--Actua1

 

 

 

 

 

 

 

 

Does Not Know Knows
Fail q11' q12.
OBSERVED
Pass q21 qzz'
Table 7.3
Categories for a Given Item--True Proportions
Posttest
Does Not Know Knows
Does Not Know "1 "2
PRETEST
Knows "3 “4
Table 7.4
Categories for a Given Item-~0bserved
Proportions
Posttest
Fail Pass
Fail p1 p2
PRETEST
Pass P3 p4

 

106

The qij's and qij'TS are conditional probabilities, with qH + q2] =

1. q12 T q22 T 1. “11' T qz1' T " q12' T “22' T 1. the “k'5 are

true proportions and the pk's are observed proportions with
1.

n _ Z _
1 k - 1 and k = 1 pk - 1.

The R and C-V indices can be explained in terms of this

II M:

k

general model by making certain assumptions. The theoretical frame-
work can be simplified by assuming that the pretest and posttest
conditional probabilities are equal, Q = Q'. Additional assumptions
needed for the model to fit the R procedure are that n3 = 0, i.e.
there is no forgetting, and q22 = q22' = 1.0, i.e. someone who knows
the answer to an item can not fail the item. 50 the general model

can be reduced to

p1 T'1
p2 q11 0 q11 0 "2
p3 = I TTO
p4 q21 ' q21 ' n4

which reduces to the four equations used by Roudabush in the develop-
ment of the R index.

The C-V index is a further simplification of the general
model where q2] = 0, i.e. there is no guessing. However, N3 is not

assumed to be zero. 50 for the C-V index, the model is

p1 1'1

P 1 0 1 O n

p2 = I ,2 0r E.= 2;.
3 0 1 o 1 "3

107

There were two parts to this study. The first part attempted
to determine if the C-V and R indices adequately estimated the true
values, if one technique estimated the true values better than the
other and if the C-V and R indices were better estimators of the true
values for some parameter sets. These questions were investigated
by simulating data for 21 different sets of parameter values using
the model described in Chapter III and briefly described above, as
the theoretical framework. For each set of parameter values 1000
samples of size 50 or 200 were generated. The R and C-V indices
were computed for each sample. Descriptive statistics, means, var-
iances, kurtosis and skewness values were computed based on the R
and C-V values for the 1000 samples for each parameter set.

The adequacy of R and C-V was determined by Consideration
of a number of descriptive statistics, including means, variances,
kurtosis, skewness, and range. It was found that for R, when the
assumptions were met, the technique provided a more stable and
accurate estimate than when the assumptions were not met. It was
also found that when the sample size was increased from 50 to 200,
the stability and accuracy increased greatly. A correlation coef-
ficient of .759 between the true R values and the mean R values for
1000 estimates over the 21 parameter sets indicated that the tech-
nique does produce a reasonable estimate of the true R.

The C-V technique seemed to provide a reasonably accurate
and stable estimate regardless of whether the assumptions were met.
The estimates were more stable with larger sample sizes. A correla-

tion coefficient of .80 between the true C-V values and the mean C-V

108

values for 1000 estimates over the 21 parameter sets indicated that
the technique does produce a reasonable estimate of the true C-V.

The C-V technique seemed to estimate the C-V true value
better than the R technique estimated the R true value. The average
absolute deviation from the respective true values was smaller for
C-V than for R (.0891 and .0965, respectively). In addition the
range of variances was considerably smaller for the C-V estimates
than for the R estimates (.0013 to .0094 for C-V and .0015 to .0221
for R) and the average range of estimated values was smaller (.46
for C-V and .60 for R).

The third questions, i.e. were the C-V and R techniques
better estimators of the true values for some parameter sets, was
primarily answered by considering the question of adequacy and sta-
bility. For both techniques the estimates were better when the
sample size was larger. In addition the R approach was better when
the assumptions were met. This was not true for the C-V approach
which seemed to provide a good estimate under almost any assumptions.

The second part of the study was designed to determine the
comparability of the three item analysis procedures, R, C-V and B-5.
C-V and R values were computed for 128 items and B-5 values were
computed for 64 items. These items were testing 16 objectives from
two subject areas, Mathematics and Reading, two grade levels, Middle
and Upper, and two treatments, assigned objectives (treatment A) and

selected objectives (treatment B).

109

The major question to be answered was whether the C-V, R and
B-5 item analysis procedures provide comparable results. Three
additional questions were also considered:

1. Are the three procedures more comparable for items in

Mathematics than for items in Reading?;

2. Does the comparability of the three procedures depend on

the grade level?; and,

3. Are the three procedures more comparable for items given

in treatment A than for items given in treatment 8?

Correlation coefficients were computed between the indices
for the 64 items given on a pretest and a retention test. The
Pearson product moment correlation coefficient between the R and C-V_'
indices was significantly different than zero (r = .80, p<:.01).

The point-biserial correlation coefficients between the B-5 procedure
and the C-V index and the B-5 procedure and the R index were also
significantly different than zero (r = .70, p<<.01 and r = .36,
p<<.01, respectively). These correlations indicated that the three
indices were related and similar results in item selection would be
obtained using any of the three approaches.

The separate analyses of the indices for each subject area,
grade level and treatment indicated that the indices were more com-
parable for Mathematics than for Reading, with all significant cor-
relations between the indices for Mathematics [r(R, C-V) = .88,

p <.01; r(C-V, B-S) = .69, p‘<.01; r(R, B-S) = .39, p‘<.05] and only
two out of the three correlations significant for Reading [r(R,C-V)

= .87, p <.Ol; r(C-V, B-S) = .50, p <.01; r(R, B-S) = .24, not

110

significant]. The indices were also more comparable for treatment

8 than for treatment A, with all significant correlations for treat-
ment 8 [r(R,C-V = .82, p‘<.01; r(C-V, B-S) = .84, p <.Ol; r(R, B-S)

= .52, p <.01] and only two out of the three correlations significant
for treatment A [r(R, C-V) = .79, p <.Ol; r(C-V, B-S) = .45, p‘<.01;
r(R, B-S) = .21, not significant]. The correlations between the
indices for the grade levels, Middle and Upper, were almost identical,
with r (C-V, R) = .80 for Middle and .81 for Upper, r(C-V, B-S) = .68
for Middle and .70 for Upper, r(R, B-S) = .37 for both Middle and
Upper. Although all the correlations were significant, the correla-
tions between the R index and the B-5 procedure were significant at
the .05 level while the other correlations were significant at the
.01 level.

The comparison of the R and C-V indices on the 64 items
given on a posttest and a retention test provided additional support
that the use of either the R index or the C-V index would identify
many of the same items as good or bad. Although all the correlations
were significant at the .01 level, the correlations between R and
C-V for Mathematics (.81), Middle (.82) and treatment B (.80) were
larger than the correlations between R and C-V for Reading (.67),
Upper (.62) and treatment A (.69).

An analysis of the agreement among the three item analysis
procedures showed that when a cut-off of .50 for the R and C-V
indices was used for selection of items, there was complete agreement
for 39 of the 64 items (61 percent) given on the pretest and reten-

tion test.

11]

Conclusions

 

There does exist a general model that explains the pretest-
posttest situation. This general model can be used in the develop-
ment and explanation of item analysis techniques. The R and C-V
techniques fit the general model with some additional assumptions.

The results of the simulation indicate that the C-V and R
techniques adequately estimate the true values of the item parameters.
However, the C-V procedure provides better estimates of the true
values when there are deviations from appropriate assumptions. In
general the C-V technique seems to estimate the C-V true value better
than the R technique estimates the R true value. Therefore the C-V
item analysis technique probably should be used for analyzing items
from pretest-posttest situations. Both techniques improve with an '
increase in sample size. One perhaps can infer from this that in
developing tests it would be best to use a sample size larger than
50.

The major question to be answered in the study with actual
data was, "Do the C-V, R and B-5 item analysis procedures provide
comparable results?" It was found that the three indices were I
related and did provide results that were reasonably comparable.

The comparability was stronger for items in Mathematics and treatment
8 (selected objectives). The R and C-V procedures were more com-
parable than the B-5 and R techniques but the correlations between
the C-V and B-5 procedures were close to the correlations between

the R and C-V techniques.

112

Inswmmy:

1.

A theoretical conception or a general model of criterion-

referenced testing can be defined.

a.

The R technique fits the general model given certain
assumptions.
The C-V technique fits the general model given certain

assumptions.

The C-V and R techniques provide reasonable estimates

of the true values of the respective indices.

a.

The C-V technique estimates the true COV values
better than the R technique estimates the true R
vaers.

The R technique estimates the true R values better
when the assumptions are met and when the sample size
is larger.)

The C-V technique while reasonably accurate under any
assumptions does become more stable as the sample

size increases.

The C-V, R and B-5 item analysis procedures are related

and similar results would be obtained using any of the

three procedures.

a.

b.

The three procedures are more comparable for items
in Mathematics than for items in Reading.
The comparability of the three procedures does not

depend on the grade level.

113

Discussion

 

This study was intended to determine if an accurate and easy-
to-use item analysis technique existed among the three techniques.
The results of the simulation provided evidence that the C-V index
is a reasonably accurate procedure for the estimation of the true
C-V values even when the assumptions are not met. The R index, on
the other hand, is less accurate and less stable.

The result of the comparisons with actual data leads to the
conclusion that the C-V index provides reasonably close approximations
to the other two methods and is the easiest to compute. The B-S
procedure, while providing a generous amount of information, is
tedious and time-consuming. The R procedure, while not necessarily
more difficult to compute than the C-V index, is perhaps less under-
standable to the everyday practitioner and provides no more informa-
tion.

Not only does this study provide information as to which
item analysis technique is most accurate, most stable and easiest
to compute of the three approaches considered, it also suggests a
theoretical framework. The theoretical development explains two
item analysis approaches and demonstrates how the pretest-posttest
situation can be conceptualized. Other studies have failed to explain,
at least so explicitly, the underlying framework of pretest-posttest
situations and the reasoning behind the suggested methods of analyses
of the items.

While this study has suggested one method which most likely

is the best method for analyzing items and has presented a theoretical

114

framework, there is one limitation to be considered. The analyses
of actual data were limited to the extent that the sample sizes

for each item were relatively small. The results of the analyses,
however, were in agreement with the results of the simulation, i.e.
the C-V index is a reasonable method of item analysis for criterion-
referenced tests. This agreement between the two parts of the
research tends to lessen the impact of the small sample sizes.

As was pointed out in Chapter II, there is a need to be
alert to the negative implications of selecting items sensitive to
instruction. If items are selected which are sensitive to instruc-
tion one might argue that the items, over a number of administrations
and revisions, could become very easy or perhaps require only recall
of simple facts. Care must be taken to include items that measure
all aspects of the domain and to ensure that these items are not
only sensitive to instruction but sensitive to the domain. In
addition, items after a number of administrations and revisions
should probably be piloted in a group consisting of individuals with
and without previous instruction. The quality of the items should
be checked using a number of statistical procedures including tradi-
tional statistics. The individuals included in this pilot who have
received instruction should probably have not just received instruc-

tion.

Implications for Future Research
This study, along with other research on item analysis pro-

cedures for criterion-referenced tests, points to a practical, easy-

115

to-use item analysis index, the C-V index. The study does not pre-
dict that tests developed using this index would be the most valid
and reliable (by whatever definition) tests. However, this study
does indicate that the C-V index provides a reasonable estimate of
the sensitivity of the item to instruction. In addition, the study
did show that the index is reasonably comparable to two other more
complicated (or refined) procedures.

It is interesting to note that the R index which is a more
reasonable one, i.e. less restrictive and fewer assumptions are
needed, does not prove to be the better technique. In fact, the
R procedure provides poor estimates of the true R value and is very
unstable.

Additional research should probably be concerned with the
theoretical conceptualization of criterion-referenced measurement
(pretest-posttest situations) that was proposed in this study. The
theoretical framework could provide a basis for future research in
several areas. One of these areas is the estimation of some of the
parameters. Unfortunately, the model contains 12 parameters and with
only three pieces of information, p], p2 and p3, available it is not
possible to estimate the 12 parameters. However, some restrictive
assumptions could be applied and perhaps, then some of the underlying
parameters could be estimated.

Also within the estimation process there is the possibility
of determining the probability of an individual who knows the item
will actually pass the item. This type of information could be

valuable in being confident of an individuals' attainment of some

116

given mastery level. Information such as that suggested above, that
can be obtained from further investigation of the general model

may prove valuable to the improvement of criterion-referenced

measurement.

APPENDICES

117

APPENDIX I

Roudabush's Technique

118

Roudabush's Technique

 

For this model, consider the folowing 2 x 2 table:

 

 

Table 1.]
Categories for a Given Item
Posttest
Failed Passed
Fa1led f1 f2
PRETEST
Passed f3 . f4

 

where f1 equals the number of students who failed the item at both
pretest and posttest; f2 equals the number of students who failed
the item at pretest and passed it at posttest; f3 equals the number
of students who passed the pretest and failed the posttest; and f4
equals the number of students who passed the item at pretest and)
posttest.

Now assume there is some fixed non-zero probability, p, that
a student who does not know the answer to the item will guess the
correct answer. This p-value is determined by the item only and does
not vary from student to student nor from occasion to occasion for
the same student. This fixed p-value suggests that there is no
partial knowledge on the part of the student, and that the student's
responses are independent at pretest and posttest when he does not
know the correct answer and fails to learn it.

119

120

Assume, also, that the only possible result of exposure
between pretest and posttest is that the student learn the correct
response to an item. Assume that the non-zero frequency of f3 is
solely due to guessing, further implying that there is no forgetting
and implying that the "true" value of f3 is zero.

With these assumptions Roudabush derives a number which serves
as an index of the degree to which examinees select the correct
response to the item as a function of the instruction received between
pretest and posttest. This number is called a sensitivity index (5)
by Roudabush.

In order to clarify this procedure, it is necessary to sketch
briefly the derivation of the sensitivity index.

Since we have already assumed that the "true" value of f3 is
zero (f3 = 0) then we can say that f1 is the probability of guessing
wrong twice times the number of students in the sample who do not

learn the answer. We can state this symbolically as
_ 2 A

where f] is the "true“ number of students who do not learn.
Similarly, f2 enumerates those students who learned the
answer after instruction and did not guess correctly at the pretest,
and those who did not learn but were able to guess the correct
response at the posttest but not at the pretest. This would then say

that

f=p(l-p)i1+(l-p)i2 (2)

121

where f2 is the "true" number of students in the sample who did not
know at the pretest but have learned by the posttest.

Next f3, defined as the number of students who passed the
item at pretest but failed it at posttest, enumerates those students
who correctly guessed the item at pretest, but did not learn the
correct answer via instruction and were not able to guess the cor-

rect answer at posttest. Therefore,
f3 = P(1 ’ P) f]- (3)

Finally, f4, defined as the number of students who passed

the item at both pretest and posttest, enumerates students from three
different categories. The first category comprises all the students
who do know the correct answer at pretest and posttest. The second
category consists of those students who learned the correct answer

for the posttest and guessed correctly the answer at pretest. The
third category represents the students who did not know nor learned
the answer, but were able to guess correctly at both pretest and post-

test. We can represent this as

+ p? + p2? (4)

TTT 2 1

4 4

where f4 if the "true" number of students in the sample who know the

correct answer at both pretest and posttest.

122

Using equations (1) and (3) we can solve for p:

-n -h -n
w _.I
ll 11
'U A
A _1
—1I
I
I
'O
'U V
v N

"h
—_l
'U
I
'1')
w
A
—'
I
'U
v

P=——i— (5)

A

Now we need to find solutions for f1, f2 and f4. (Recall we

have already assumed f3 = 0.)

Since f = (1 - )zf and = -——:é——- th n
l p l p f1 + f3 ’ e

 

 

A f + f 2 2
f = 1 3 if] T 1:3) (5)
l 1 f1 f]
A A f3
Also since f2 = p(l - p)f1 + (1 - p)f2 and p = $7—;-?;- and

A

f1 is equal to the above (6), we have

 

 

 

 

 

 

 

 

 

2
f _ T3 T1 (T1 T T3) + f1 1,
2 f1+ f3 f1+ f3 i1 f1+ f3 2
f] ,.
f=f+ f
2 3 f1+f3 2
f1 ,.
f-f- 1c
2 3 f1+f3 2
$ = (f2 - f3) (f1 + f3) (7)
2 f] '
And finally, since f4 = $4 + pfz + p2?1 and using (5), (6) and (7):
2
f = % + f3 (f2 - f3)(f1 + f3) + f3 (i1 + f3)
4 4 f1+ f3 f1 f1+ f3 f1
2
f = f + f3 (f2 - f3) +-:§
4 4 f f
1 1
. f32 - f3f f 2
f=f+ 2-—
4 4 f f
1 1
» 2 2
f _ f1T4 ‘ Tsfz T f3 ' f3
4 f]
ff—ff ff
,._]4 32 ,._ 32
T4‘ f °TT4TT4Tf

Substituting the observed frequencies for the true frequencies we

have:

124

(T2 ' f3W1 T '3'
f1

 

 

R= .
(f1 + f3)z + (f2 - f3)(f1 + f3)
1 f1

(f2 - f3)(f1 + f3)

 

 

R =
2
”1 T '3’ T ”2 ' T3W1 T T3)
R = 12 ‘ T3
f] + f3 + f2 - f3
R _ T2 ' T3
f1+ f2

APPENDIX II

Brennan's and Stolurow's Procedure

125

Brennan's and Stolurow's Procedure

This procedure was suggested by Brennan and Stolurow (AERA,~
February 1971). Using traditional item analysis techniques, they
combine four "error rates" and two discrimination indices with a
set of rules.

The Theoretical Error Rate (TER) is one suggested error rate.
This is the expected proportion of students answering a pretest item

incorrectly simply on the basis of random guessing. If J is the

 

number of possible answers to an item, then TER = (J - 1)/J. A
second error rate is called the Base Error Rate (BER). This is the
observed proportion of students answering a pretest item incorrectly.
The third error rate, Instructional Error Rate (IER) is the error
rate on a terminal test item for a given objective obtained by stu-
dents who have been exposed to instruction. In addition a Posttest
Error Rate (PER) is used. PER is the observed proportion of students
answering a posttest item incorrectly. IER is only used in the
decision rules related to instruction so it will not be included in
further discussions.

The two discrimination indices used are the Base Discrimina-
tion Index (801) and the Posttest Discrimination Index (POI). Dis-
crimination indices are computed using the total score on the appro-
priate test as the criterion. For 801, the criterion will be the
pretest total score. For PDI, the criterion will be the posttest

total score.

126

127

The error rates are classified as high (H) or low (L) with
the evaluator predetermining an appropriate cut—off point between
high and low error rate. In addition the discrimination indices
can be classified as positive, negative or non-discriminating. By
positive and negative indices, it is meant that the indices discrim-
inate significantly (at some a - level) in the positive and negative
directions, respectively. Brennan and Stolurow recommend the phi-
coefficient and the B index (Brennan, 1972) for criterion-referenced
tests.

An abbreviated list of the rules that Brennan and Stolurow

suggest are presented in the following table.

Table 11.1
Rules for Decision-Making

 

 

Rﬁle TER BER 301 PER PDI Item Decisiona
1 H H NR
L L NR
2. L H NR
3. H L R
4. - ,
6. L 0 NR
7. L 2
L - 2
8 H - R
9 H + 7
H 0 7

11. - - R

128

Table II.1--Continued

 

 

Ralf TER BER 301 PER PDI Item Decision3
16. DER*b R

DER(NS)C NR

 

a"NR" means no revision required.

IIRII . . . .
means rev151on 15 T‘EQUTY‘ECI.

II?"

means the data are not sufficient to make a sound judgment
about whether or not revision is required.

bDER is significantly greater than zero at the .05 level for
a one-tailed test of significance.

CDER is not significantly greater than zero at the .05 level
for a one-tailed test of significance.

*DER is defined as TER minus BER and stands for "Difference
Error Rate."

The significance of a positive difference between BER and

TER can be tested by computing:

1
DER - 2N-

(/ TER(1 - TER)/N

 

 

where N is the total number of students in the sample.
According to Brennan and Stolurow, this computed Z value is
then compared with the normal curve standard score at an appropriate

level of significance for a one-tailed test.

APPENDIX III

Further Analyses of C-V and R

129

Further Analyses of C-V and R

The basic model P_= Q I QT p_provides an expected proportion
of individuals for each category. For any observed frequency (Ti)
produced by the simulation the f, are distributed multinomially with
parameters N, p], p2 and p3. (Of course, p4 is understood.) In
other words, fi ~ MN(N, p], p2, p3). Based on this information a
theoretical estimate of the mean and variance of each index can be

determined.

g:y_

Let's consider the C-V index because it is simplest to
understand.

The expected value of the C-V estimates for each parameter
set can be determined by noting, as above, that f1 ~ MN(N, p], p2,
p3). The expected value of fi/N is simply by definition E(fi/N) = pi.
Therefore, the E(f2/N - f3/N) = p2 - p3. The C-V index is defined

f - f
as _Z_N__§ . So the mean (or expected value) of the C-V index is

p2 " p3-

The variance can also be theoretically estimated by
pzqz psqs p2P3
WM) ‘TTi—TT'NT'TTZTNT—
since the variance of a difference is the sum of the variances minus
twiCe the covariance. The covariance in the case of a multinomial

distr1bution 1s Cov fi’ fi = -N pipj.

130

13]

Additionally we can estimate the standard error of the mean
(expected value) as the standard deviation divided by the square
root of n, which is 1000 in our case. Also the standard error of

the variance can be approximated by the square root of

A2 2 1
JSTFL (Kurtosis + 3) (KUY‘tOSTS + 2):

where n = 1000, 82 is the sample variance, and kurtosis is the sample

kurtosis.
If we consider parameter set #1 from the simulation, (see
Table 4.5 for the parameter values), p1 = .1125, p2 = .5075,
p3 = .0375 and p4 = .3425. The expected value of the C-V index is
p2 - p3 or .4700. The reported mean C-V value is .4722 (Table 5.1).
If we also compute the standard error of the mean, sd/vTﬁ, where
n = 1000, we find that the standard error is .0023. The reported
mean C-V of .4722 is within one standard error of the expected value.
Also consider this same parameter set with respect to the
estimate of the variance. For parameter set #1, the expected
variance is .006482. The reported variance is .0070. The standard
error of the variance is .0052. So the reported variance is within

one (approximated) standard error of the expected variance.

B.
A similar analysis of the expected value (mean) of R can be

done. However, this analysis is considerably more complicated. The

 

1The approximation for the standard error of the variance is
based on calculations and formulae from Sampling Techniques, William
C. Cochran, 2nd Edition, 1963, John Wiley & Sons, Inc. and Statistics
in Biology, V01. 1, C.I. Bliss, 1967, McGraw-Hill.

 

132

p2 p3

W WhTCh 15 51mp1y

expected value can be approximated by

saying that the E(x1/x2) is approximately equal to the E(x])/E(x2)
where E(x]) = p2 - p3 and E(x2) = p1 + p2.

There are problems with the R index and any theoretical
derivations since theoretically and practically the denominator can
be zero.

However, for our purposes consider the approximation of the

mean R (p2 ' p3) for parameter #1.
'31sz

Again p1 = .1125, p2 = .5075, and p3 = .0375, then the expected
value (mean) equals .7580. Note the closeness of the reported value
of .7553. If we compute the standard error of the mean (sdl/Fﬁ,
where n = 1000), we find that the standard error = .0027, and the
reported value is within one standard error of the approximated
expected value.

The derivations of the expected variance of the R index, in
principle, should follow from the derivations of the expected vari-
ance of the C-V index but will not be attempted here because of the

complexity of the derivations.

APPENDIX IV

B—S Statistics

Application of the B-5 Decision Rules

133

12311

 

 

Table IV.1
B-S Statistics
@1 8 0
Objective 0 Item I TER BER BO] PER POI DER N
M1875 1 .67H .63H .87"/.91 .0625L 1.0"/1.0 .04 16
2 .69H 1.0*‘/1.0 .0625L 1.0“/1.0 -.02
3 .63H .87"/.91 0.0 L und/0.0 .04
4 .75H .86*'/.80 0.0 L und/0.0 -.08
R182S 1 .3OL .84'*/.84 .lOL .67"/.5 .37" 3O
2 .17L .68"/.56 .17L .45‘/.42 .50"
3 .20L .40'/.35 .13L .54'*/.46 .47“
4 .1OL .27/.17 .13L .78"/.67 .57"
M1675 1 .71H .57'/.8 .18L .6*/.6 - 04 17
2 .BBH . 1.0“/1.0 .24L .83'/.93 -.21
3 .88H l.0"/1.0 .18L 1.0"/1.0 - 21
4 .82H .3l/.37 .24L .83"/.93 - 15
R1665 l .33L .35/.37 0.0L und/0.0 .34" 18
2 .28L 1.0'*/l.0 .22L .84"/.93 .39"
3 .17L . .72*'/.6 .06L .82"/.78 .50"
4 .33L .88"/.92 .17L 1.0"/l.O .34"
M1986 l .27L 1.0"/1.0 .14L l.0"/l.0 .40" 22
2 .27L l.O“/1.0 .O9L .80“/.67 .40"
3 .45L .67"/.75 .14L 1.0"/1.0 .22'
4 .27L 1.0'*/l.O .14L .61"/.61 .40"
M1766 1 .67H und/und .52H .22/.55 0.0 46
2 1.0M und/und .74H .36*/.77 -.33
3 1.0H und/und .96H -.05/-.05 -.33
4 1.0H und/und .89H .61"/.93 -.33
R1456 1 .33L .64"/.61 .24L .68*‘/.62 .34" 66
2 .39L .62**/.61 .20L .48*'/.41 .28“
3 .SH .73*'/.73 .35L .52"/.54 .17H
4 .35L .54"/.52 .24L .6"/.55 .32"
R1996 l .35L .77"/.79 .49L .6B"/.68 .32" 57
2 .21L .76"/.67 .32L .67**/.62 .46"
3 .16L .64**/.5 .44L .66"/.65 .51"
4 .42L .64"/.68 .47L .79"/.79 .25“
R1166 l .34L .43*'/.41 .3OL .47“/.48 .33“ 59
2 .53H .44"/.45 .27L .37"/.36 .14'
3 .15L .42"/.3O .14L .62*'/.47 .52"
4 .29L .69“/.64 .29L .59'*/.59 .38"
R1205 1 .25L .73"/.73 .25L .47'/.42 .42" 20
2 .25L .2/.2 .25L .36/.33 .42"
3 .25L .73*'/.73 .SH .41/.42 .42"
4 .15L .73'*/.6 .3L .58"/.58 .52"
R1426 l .06L .27/.13 0.0L und/0.0 .61" 31
2 .52H .81‘*/.81 0.0L und/0.0 .15
3 .23L .40'/.34 .O3L 1.0"/1.0 .44"
4 .55H .75"/.75 .06L .70*‘/.97 .12
R1125 1 .33L .79‘*/.79 .1OL .45'/.45 .34" 21
2 .19L .69'*/.57 .05L .69"/.50 .48"
3 .52H .67“/.71 .48L .34/.58 .15
4 .14L .58‘*/.43 .10L 1.0*'/1.0 .53"
M1095 l .08L .38*/.21 .OSL -.14/-.07 .59“ 37
2 .41L .94*'/.96 .3L .87*'/.93 .26"
3 .38L 1.0'T/1.0 .19L .85"/.78 .29"
4 .38L 1.0"/1.0 .3L .87*‘/.93 .29"
M1086 1 .54H .48*'/.54 .15L .68"/.77 .13' 52
2 .63H .71"/.77 .25L .63"/.85 .04
3 .81H .8*'/.76 .08L .43“/.43 -.14
4 .69H .82*'/.85 .12L .82'*/.81 -.02
M1075 l .38L l.O*'/l.0 .36L 1.0*'/1.0 .29" 42
2 .38L .9'*/.9 .38L .95"/.96 .29"
3 .4OL .9*‘/.9 .33L .96*‘/.93 .27"
4 .38L 1.0**/1.0 .35L .9"/.89 .29n
M1116 l .37L .7"/.7 .O9L .73"/.57 .30" 43
2 .40L .95“/.96 .14L .91"/.86 .27"
3 .4OL .95“/.96 .21L .86“/.94 .27"
4 .4OL

.95**/.96 .19L .76'*/.80 .27"I

 

ITER = .67 for all items.

6An error rate is designated High (M) if it equals or is greater than .50; otherwise, it is

designated as Low (L).

.Significance at the .05 level.

HSignificance at the .01 level.

For the discrimination indices. 801 and P01. the first number is the phi-coefficient, the second
number is the 8 index. The notation "und" implies that the computation was not possible because

there was a zero in the denominator.
is undefined.

Therefore, the phi-coefficient or the 0 index in these cases

135

 

 

 

 

Table IV.2
Application of the 8-5 Decision Rules
Ob ective I Item I -39'e
T 1 2 3 4 6 7 8 9 11 16 Result
R1166 1 NA NA R NA NA ? NA NA NA R 0
2 NR NA NA NA NA ? NA NA NA R 0
3 NA NA R NA NA ? NA NA NA R 0
4 NA NA R NA NA ? NA NA NA R 0
R1205 1 NA NA R NA NA ? NA NA NA R O
2 NA NA R NA NR NA NA NA NA R O
3 NA NA R NA NA NA NA ? NA R 0
4 NA NA R NA NA ? NA NA NA R O
R1426 1 NA NA R NA NR NA NA NA NA R 0
2 NR NA NA NA NR NA NA NA NA NR 1
3 NA NA R NA NA ? NA NA NA R O
4 NR NA NA NA NA ? NA NA NA NR 1
R1125 1 NA NA R NA NA ? NA NA NA R O
2 NA NA R NA NA ? NA NA NA R 0
3 NR NA NA NA NR NA NA NA NA NR 1
4 NA NA R NA NA ? NA NA NA R 0
M1095 1 NA NA R NA NR NA NA NA NA R O
2 NA NA R NA NA ? NA NA NA R O
3 NA NA R NA NA ? NA NA NA R O
4 NA NA R NA NA ? NA NA NA R O
M1086 1 NR NA NA NA NA ? NA NA NA R 0
2 NR NA NA NA NA ? NA NA NA NR 1
3 NR NA NA NA NA ? NA NA NA NR 1
4 NR NA NA NA NA ? NA NA NA NR 1
M1075 1 1A NA R NA NA ? NA NA NA R O
2 NA NA R NA NA ? NA NA NA R 0
3 NA NA R NA NA ? NA NA NA R 0
4 NA NA R NA NA ? NA NA NA R O
M1116 1 NA NA R NA NA ? NA NA NA R 0
2 NA NA R NA NA ? NA NA NA R 0
3 NA NA R NA NA ? NA NA NA R 0
4 NA NA R NA NA ? NA NA NA R O
M1875 1 NR NA NA NA NA ? NA NA NA NR 1
2 NR NA NA NA NA ? NA NA NA NR 1
3 NR NA NA NA NR NA NA NA NA NR 1
4 NR NA NA NA NR NA NA NA NA NR 1
R1825 1 NA NA R NA NA ? NA NA NA R 0
2 NA NA R NA NA ? NA NA NA R 0
3 NA NA R NA NA ? NA NA NA R 0
4 NA NA R NA NA ? NA NA NA R O
M1675 1 NR NA NA NA NA ? NA NA NA NR 1
2 NR NA NA NA NA ? NA NA NA 1R 1
3 NR NA NA NA NA ? NA NA NA NR 1
4 NR NA NA NA NA ? NA NA NA NR 1
R1665 1 NA NA R NA NR NA NA NA NA R 0
2 NA NA R NA NA ? NA NA NA R O
3 1A NA R NA NA ? NA NA NA R 0
4 NA NA R NA NA ? NA NA NA R 0
M1986 1 NA NA R NA NA ? NA NA NA R 0
2 NA NA R NA NA ? NA NA NA R 0
3 NA NA R NA NA ? NA NA NA R 0
4 NA NA R NA NA ? NA NA NA R O
M1766 1 NR NA NA ? NA NA NA ? ? NR 1
2 NR NA NA ? NA NA NA ? ? NR 1
3 NR NA NA ? NA NA NA ? ? NR 1
4 NR NA NA ? NA NA NA ? ? NR 1
R1456 1 NA NA R NA NA ? NA NA NA R 0
2 NA NA R NA NA ? NA NA NA R 0
3 NR NA NA NA NA ? NA NA NA R 0
4 NA NA R NA NA ? NA NA NA R 0
R1996 1 NA NA R NA NA ? NA NA NA R 0
2 NA NA R NA NA ? NA NA NA R 0
3 NA NA R NA NA ? NA NA NA R 0
4 NA NA R NA NA ? NA NA NA R 0
NRno revision.
Rrevision.
NA

not applicable.

Tdata insufficient to make judgment.

APPENDIX V

Re1iability Estimates of Tests

136

Table V.1
Reliability Estimates of Tests

 

 

Objective # Pretest Posttest Retention A11
Pretest Posttest
R1116 Y%2.7119 7&2.8644 753.00 X=2.8983 7&5.8983
V=.93 V=1.03 V=1.01 V=.97 V=3.21
KR20=.22 KR20=.58 KR20=.39 KR20=.41 KR20=.62
R1205 733.1 7E2.4 752.7 7E2.55 7E5.25
V=1.l9 V=1.44 V=1.21 V=1.1475 V=3.3875
KR20=.56 KR20=.46 KR20=.41 KR20=.63 KRZOé.77
R1426 7E2.6452 X=3.9032 Y53.9032 X‘3.8387 7&7.7419
V=1.26 V=.09 V=.15 V=.14 V=.51
KR20=.56 KR20=0.0 KR20=.53 KR20=0.0 KR20=.63
R1125 7E2.8095 Y53.2381 YE3.2857 X=3.4762 756.7619
V=1.77 V=1.90 V=.78 V=.63 V=2.18
KR20=.78 KR20=.74 KR20=.51 KR20=.49 KR20=.68
M1095 752.7568 X=3.973 YE3.1622 733.8378 YE7.00
V=2.45 V=.03 V=1.38 V=.51 V=2.27
KR20=.91 KR20=0.0 KR20=.73 KR20=.86 KR20=.74
M1086 YE1.3462 YE3.0192 YE3.3654 YE3.2115 XE6.5769
V=2.13 V=1.71 V=1.08 V=1.04 V=3.55
KR20=.79 KR20=.76 KR20=.69 KR20=.54 KR20=.78
M1075 752.45 YE3.55 752.57 7E2.45 755.02
V=3.58 - V=1.15 V=3.44 V=3.49 V=13.26
KR20=.99 KR20=.87 KR20=.98 KR20=.98 KR20=.98
M1116 7E2.44 732.12 733.37 753.09 Ri6.47
V=3.18 V=3.36 V=1.58 V=2.27 V=6.81
KR20=.94 KR20=.94 KR20=.90 KR20=.92 KR20=.94
M1875 Yéi.31 223.0 R23.88 x=3.44 x=7.31
V=3.09 V=0.0 V=.23 V=.25 V=.34
= =* = : =-
KR20 96 KR20 KR20 .67 KR20 0.0 KR20 .08
R1825 723.23 X-3.77 Y23.47 X=3.80 X=7.27
V=.91 V=.25 V=.78 V=.16 V=.86
KR20=.46 KR20=.20 KR20=.88 KR20=-.l7 KR20=.28

137

138

Table V.l--Continued

 

 

. . Retention
Object1ve # Pretest Posttest Pretest Posttest All
M1675 R%.71 123.35 Yé3.18 ié3.47 226.65
v=1.27~ v=1.99 v=1.91 V=1.66 V=6.46
KR20= 74 KR20=.97 KR20=.88 KR20=.97 KR20=.95
R1665 ¥é2.89 X=3.78 x-3.56 Yé3.72 RE7.28
v=1.99 V=.40 V=.80 V=.65 V=2.65
KR20=.81 KR20=.66 KR20=.73 KR20=.82 KR20=.88
M1896 R22.73 X=3.64 123.50 Yé3.14 726.64
V=2.93 v=.32 v=1 43 v=1.3o v=5.05
KR20=.94 KR20=.08 KR20=.92 KR20=.91 KR20=.78
M1766 7%.33 72.98 R%.89 7&1.17 722.07
v=.22 V=.76 v=.75 v=1.71 V=3.63
KR20=0.0 KR20=.26 KR20=.32 KR20=.80 KR20=.75
R1456 7E2.42 722.58 YE2.97 i22.64 755.61
V=1.85 V=1.36 v=1.33 v=1.11 V=4.36
KR20=.66 KR20=.44 KR20=.58 KR20=.28 KR20=.76
R1996 Y22.86 723.33 7&2.28 Yé3.12 Y55.4o
V=1.88 v=1.20 V=2.38 V=1.65 v=3.92
KR20=.78 KR20=.74 KR20=.79 KR20=.78 KR205,66

APPENDIX VI

Sample Tests and Objectives

139

140

NAME DATE
SCHOOL GRADE

 

Objective #116. Given a sentence with a word underlined that is
either unfamiliar or familiar and used in a new way, the learner will
write a definition using context clues to get the meaning.

 

Circle the correct answer in each.

1.

Sam's team score 10 runs. The other team scored 5 runs. Sam's
team score twice as many runs as the other team.

INl§§_means: a. twenty times as much

b. two times as much

c. four times as much
The rest 0f you may leave, but I would like John to stay.
3§§I_means: a. to sleep

b. to be awake

c. everyone else

The farmer waited until the ground was warm before he planted the
seeds.

QRQQNQ_means: a. to make something fine

b. to make meal for bread

6. to have a place to grow seeds
The ball broke the bridge_of his nose.

BRIDGE means: a. something to drive on to go over
to the other side of the river

b. a card game played by a group

c. the bony part of the nose

NAME DATE
SCHOOL GRADE

141

 

 

 

Objective #116. Given a sentence with a word underlined that is
either unfamiliar or familiar and used in a new way, the learner will
write a definition using context clues to get the meaning.

 

Circle the correct answer in each.

1.

The tire was found against the tree.
TIRE means: a. you need some rest
b. you don't want to read any longer

c. you find it on a car

The corn was picked before the silk on the ear turned dark.
EAR means: a. something to hear with
b. what a piece of corn is called

6. earrings are what girls sometimes
wear on them

Please pass over the boards with care.
PASS means: a. to move or walk with your feet
b. to move with your hands

6. to move in your car

We heard a soft foot step in the hall.
STEP means: a. a ladder
b. to go over

6. to walk

DOOM—J

th-d

000W

(1)030

142
#116 Answer Key

Pretest

Post-test

10.

11.

12.

13.

143

Reading:;PrimarygLevel

 

Given oral directions, the learner will run, walk, march, tiptoe,
hop, jum, slide, skate, and skip.

Given aural stimuli (variety of sounds), the learner will describe
sounds as loud or soft, high or low, fast or slow, short or long,
single or repeated.

Given 3 identical stimuli (shapes, pictures, designs or letters,
etc.) and one clearly different, the learner will identify the
one that is different.

Given a simple story read orally, the learner will demonstrate
his understanding of the main idea by retelling, drawing, or des-
cribing main events and characters.

Given a picture and a sentence begun by the teacher, the learner
will complete the sentence with a reasonable action or concept.

When directed to close his eyes and listen to three words spoken
by the teacher, the learner reproduces the words in the sequence
in which they are spoken.

Given a letter and a series of 3 letters the learner will mark the
letter in the series which is the same as the initially given
letter.

Given the 9 basic colors, the learner will state the name of the
colors.

Given a specific environmental sound, the learner will associate
the sound with its source or reproduce sound, based on child's
knowledge of what specific sources are possible.

Given small individual pictures of children, animals and toys, the
learner will categorize the pictures by sorting them into separate
groups, and explain to the teacher the reason he grouped the
pictures together.

Given pictures or objects, grouped categorically, the learner will
supply categories, names or labels to adequately describe each
group.

Given 4 to 6 pictures which illustrate a known story or activity,
the child will arrange them in proper left to right sequence to
follow the story or action.

Given an action picture, the learner will provide a reasonable and
logical description of what might happen next or predict an event
that led up to the situation depicted in the picture.

15.

16.

17.

18.

19.

20.

21.

22.

23.

24.

25.

26.

27.

28.

144

Given a circle, square, triangle and rectangle, the learner will
contrast 2 shapes by describing different characteristics of the
shapes (a square has 4 corners, a triangle has 3 corners, etc.)

Given a verbal direction only once describing body movement
involving three actions, the learner will act as directed.

Given a story read orally, the learner will demonstrate his
ability to identify the selection as fact or fantasy, (real or
unreal).

Given 2 words orally, (i.e. blue and glue, or soap and soup, top
and Tom, or blue and blue) child will distinguish when words are
same or different.

Given a printed upper or lower case letter and a series of 3
letters, the learner will mark the letter in the series which is
the same as the initially given letter but of the opposite case.

Given 4 sets of pictures with 3 similar pictures in each set and
4 additional pictures one for each set in a separate place, the
learner will identify which should go in each row.

Given an oral selection and a series of 3 text-related pictures
the learner will select the event which occurred first, next or
last.

Given a simple story, orally, the learner will retell its event
in sequence so that the story makes sense.

Given an action picture, the learner will respond to a question
about the picture by using a complete sentence.

Given 3 overlapping figures, the learner will use different
colored crayons to trace over figure.

Given scissors and prepared 8 x 11 paper (2 shapes drawn on
paper), the learner will cut geometric and abstract shapes with
no greater than a % inch deviation.

Given a task of naming a group of objects arranged in 4 rows, 4
to a row, the learner will point to and name pictures in left
to right progression beginning with row one, then 2, 3 and 4.

Given a group of 3 or 4 displayed objects which are viewed and
then covered while one object is removed, the learner will
identify missing objects when shown the changed group.

Given a printed letter and a printed series of three words, the
learner will mark the word that begins with that letter.

29.

30.

31.

33.

34.

35.

36.

37.

38.

39.

40.

41.

42.

145

Given directions to print his name, the learner will do so cor-
rectly.

Given oral directions, including words such as color, draw,
circle, and underline, the learner will follow the directions.

Given two related words, read to him by the teacher and asked how
they are alike or related, the learner will state the manner in
which the two objects or concepts are related, accepting as
correct any justifiable and adequate reSponses.

Lower Level

 

Given a letter name orally and a series of three letters in print,
the learner will mark the letter name spoken.

 

Given a printed word and a printed series of three words, one of
which has the same initial consonant as the first word, the
learner will mark the word that begins with the consonant.

Given an oral story which expresses a mood the learner will mark
from a choice of three pictures, the picture which identifies
the mood in the story.

Given three words, orally, two of which have the same beginning
consonant sound, the learner will name the two words which have
the same beginning consonant sound.

Given a word orally and a series of three letters in print, the
learner will mark the letter which represents the beginning
sound of that word.

Given one word verbally and a series of three printed words, the
learner will mark the word that has the same initial consonant
as the verbally given word.

Given written directions including words such as color, make, draw,
circle and underline, the learner will follow these directions.

Given a picture and a series of three words, the learner will
mark the word that rhymes with the picture name.

Given a word orally and a series of three words orally and in
print, one of which rhymes with the orally given word, the learner
will mark the correct word.

Given an oral sentence with one word missing and cued for the
missing word with a card having printed on it the first letter
of that word (m, d, l, s, h), the learner will say a word that
fits the context and begins with that letter sound.

43.

44.

45.

46.

47.

48.

49.

50.

51.

52.

53.

54.

55.

56.

146

Given an oral sentence with one word missing and cused for the
missing word with a card having printed on it the first letter
of the missing word (t, b, p, w, r, f, g, k, j, and n), the
learner will say a word beginning with the letter that fits the
context of the sentence.

Given a word orally, the learner will print the letter which is
the initial consonant for that word.

Given a printed consonant, the learner will supply orally a word
beginning with the consonant sound.

Given a set of the 8 basic colors names, i.e., red, blue, yellow,
green, black, purple, brown and orange, the learner will indicate
the corresponding color.

Given a sentence read orally, the learner will circle the period
or question mark to indicate the punctuation to be used at the
end of the sentence.

Given a list of Basic Dolch words for pre-primer level, the
learner will read it.

Given a word orally, the learner will supply a word which has
the same ending consonant sound as the spoken word.

Given pictures, the learner will write the final consonant for

‘each picture.

Given a word orally, the learner will write the final consonant
sound.

Given a series of nouns with at least one being plural, the learner
will identify the plural nouns.

Given an oral sentence with one word missing and cued for the
missing word with a card having printed on it the first letter of
that word (v, y, or z), the learner will say a word beginning with
the letter that fits the context of the sentence.

Given pictures of a one syllable long vowel word, and the word
in print, without the vowel, the learner will supply the long
vowel.

Given a list of Basic Dolch words for primer level, the learner
will read it.

Given two lists of words, the learner will identify the words
with the same graphemic base.

57.

58.

59.

60.

61.

62.

63.

64.

65.

66.

67.

68.

147

Given pictures of a one syllable short vowel word, and the word
in print without the vowel, the learner will supply the proper
short vowel.

Given a sentence with a word omitted and a series of three words,
the learner will use the context of the sentence to mark the
correct word.

Given a written sentence, the learner will identify material read
as being real or fantasy.

Given a picture of a familiar activity and three sentences, the
learner will select the sentence which best describes the

picture.

Given a list of Basic Dolch words for first level, the learner
will read it.

Given two words which could be changed to a contraction, and a
series of three contractions, the learner will mark the correct
contraction.

Given the beginning part of a word, a picture and a list of
endings, the learner will match the beginning and ending parts to
name the picture.

Given an oral sentence with one word missing and cued for the
missing word with a card having printed on it the blend with which
the word begins (br, cr, dr, fr, gr, tr, or pr), the learner

says a word beginning with the letter blend that fits the context
of the sentence.

Given an oral sentence with one word missing and cued for the
missing word with a card having printed on it the blend with which
the word begins (bl, cl, fl, 91, or $1), the learner says a word
beginning with the letter blend that fits the context of the
sentence.

Given an oral sentence with one word missing and cued for the
missing word with a card having printed on it the blend with which
the word begins (sk, sw, sm, sn, sp, or st), the learner says a
word beginning with the letter blend that fits the context of the
sentence.

Given an oral sentence with one word missing and cued for the
missing word with a card with the §h_or.th_digraph printed on
it, the learner says a word beginning with the digraph that fits
the context of the sentence.

Given a verb with an "ing" ending and a series of three verbs the
learner will mark the base word.

69.

70.

71.

72.

73.

74.

75.

76.

77.

78.

79.

80.

81.

82.

83.

84.

148

Given an oral sentence with one word missing and cued for the
missing word with a card having printed on it the digraph with
which the word begins (ch, wh), the learner will say a word
beginning with the digraph that fits the context of the sentence.

Given a list of words, the learner will write the words in alpha-
betical order by the first letter only.

Given a compound word, the learner will divide them into the two
root words.

Given three printed sentences, the learner will select the two
that have similar ideas.

Given four paragraphs and four sentences, the learner will select
the sentence that implies what the paragraph says.

Given a list of Basic Dolch words for second level, the learner
will read it.

Given a printed short story, the learner will identify the setting.

Given an oral sentence containing a word unknown in meaning and
a direct definition clue to the word's meaning, the learner states
the meaning of the unknown word.

Given a reading selection, the learner will arrange a series of
randomly placed details into chronological order.

Given a reading selection, the learner will determine the main
idea of the selection.

Given a short story read orally to him by the teacher, the learner
provides details about the story which were implied but not stated.

Given the title of a possible story and a series of possible
details, the learner selects the details which would be appro-
priate for the title.

Given a list of words, the learner will be able to categorize
words as (contractions or compound words).

Given a sentence orally, the learner will determine if the
sentence answers the questions, how, where, when, who or what.

Given a sample table of contents, the learner will demonstrate
his ability to interpret a table of contents by selecting from
a set of printed choices, the page number where certain infor-
mation may be found.

Given a word orally, the learner will determine the number of
syllables in the word.

85.

86.

87.

88.

89.

90.

91.

92.

93.

94.

95.

96.

97.

98.

149

Given a sentence in which an affix has been omitted from one
of the words and a choice of three affixes, the learner will
select the affix appropriate to the sentence.

Given two lists, the learner will match equivalent forms (con-
tracted and uncontracted forms, possessives, and parphrases of
same .

Given a list of words in which part of each word is underlined
the learner will form two new words by substituting new letters
for the underlined parts.

Middle Level

 

Given a list of words, the learner will classify words according
to their structural endings (plural or singular, past or present
tense .

Given a list of words, the learner will locate letters in differ-
ent words that stand for the same sound (vowel or consonant)
including multiple spelling variations.

Given a list of words the learner will add the given suffix, ing,
ed, or s,

Given two lists of words the learner will match the antonyms.

Given a list of words, each beginning with the same letter,
the learner will write the words in alphabetical order using
the second letter.

Given a list of Basic Dolch words for third level, the learner
will read it.

Given a written selection, the student will compose a title
suitable to the material.

Given a reading selection, the learner will list characters
inclUded in the selection.

Given a review word, the learner will write a sentence that
defines the word.

Given a paragraph, the learner will select from a list of three
statements the one which most closely describes the main idea
of the paragraph.

Given a table of contents, the learner will correctly find the
chapter headings or titles and the page number.

99.

100.

101.

102.

103.

104.

105.

106.

107.

108.

109.

110.

111.

112.

113.

115.

150

Given sentences, the learner will identify which ones describe
past time and which ones describe present time.

Given words and paraphrases of some forms, the learner will be
able to match the equivalent forms.

Given a reading selection, the learner will identify aspects of
literature by classifying the selection as humorous or not, with
happy or sad endings.

Given sets containing three words, the learner will choose between
words to fit context or match definitions or answering questions
to indicate identification of words and the sounds they contain.

Given paris of synonyms, homonyms and antonyms, the learner will
circle the pairs of synonyms.

Given a list of words and a list of definitions, the learner
will match words with their most appropriate meaning.

Given a group of words containing a specific variety of suffixes
(er, est, 1y, ful, ness, y) the learner will find the suffix in
each word.

Given a selection to read and a list of sentences, the learner
will locate untrue statements about the selections.

Given a reading selection, the learner will identify elements
of content by indicating objects to fit descriptions or answer
riddles.

Given a topic, the learner will write a story of at least
four sentences.

Given a poem to read, the learner will identify patterns of
rhyme or repetition.

Given a short story, the learner will state the main idea.

Given a story beginning, the learner will write an ending in
such a way that the relationship between the beginning and
ending is logical.

Given a dictionary and a group of words the student will iden-
tify in which quarter of the dictionary each word is located.

Given a list of words, with the first two letters identical in
each word, the learner will alphabetize them.

Given a new word, the learner will use a dictionary to locate
its definition.

116.

117.

119.

120.

121.

122.

123.

124.

125.

126.

127.

128.

129.

130.

131.

151

Given a sentence with a word underlined that is either unfamiliar
or familiar and used in a new way, the learner will write a
definition using context clues to get the meaning.

Given a list of topics and a table of contents, the learner will
find the unit titles, chapter headings, sub-titles and page
numbers related to the topic and record them.

Given a pictorial graph (e.g., weather, population, attendance,
scores, etc.) the learner will interpret the data orally or in
writing.

Given a sentence containing an underlined, incomplete root or
base word and a list of prefixes, the learner will select that
prefix from the list which completes the root word according to
the sentence context (dix, in, de, com, en, sub, mis, re, un).

Given a list of known words the learner will add the following
suffixes where appropriate: er, est, ly, ful, ness, y.

Given a reading selection, the learner will demonstrate ability
to read with understanding by answering questions about details.

Given a list of words, the learner will be able to categorize
structural components of words as affixed or possessives.

Given a reading selection, the learner will demonstrate his
ability to read with understanding by answering questions about
sequence of events.

Given a short story to read, the learner will demonstrate ability
to read with understanding by answering questions about main
ideas.

Given a sentence with a missing multi-meaning word, the learner
will use the context of the sentence to supply the missing word.

Given a set of homonyms, synonyms and/or antonyms, the learner
will define each set as directed.

Given a table of contents, the learner will locate specific
information.

Given a reading selection, the learner will skim the selection
to locate specific information.

Given a topic and several book titles, the learner will identify
the one(s) whose contents would cover the topic.

Given a list of personal pronouns, the learner will write the
correct possessive form and use each in a sentence.

132.

133.

134.

135.

136.

137.

138.

139.

140.

141.

142.

143.

145.

146.

152

Given information found within a telephone directory, the learner
will tell where the information is found.

Given a map, the learner will utilize scales and symbols in
answering questions about a given map.

Given a sentence with a verb (sit, set, lay, learn and teach)
omitted, the learner will write in the correct tense of the
verb located in parentheses next to the blank.

Given a paragraph describing a character in a particular situa-
tion, the learner will identify emotions he imagines were exper-
ienced by that character, consequent to the situation described.

Given incomplete sentences and a present tense verb for each,
complete each sentence with the correct past tense form of the
verb given.

Given a set of guide words, the learner will identify from a
list, those words which would be found on a dictionary page
having guide words.

Given a map, the learner will interpret information to answer
questions.

Given an encyclopedia and a dictionary, the learner will identify
two similarities and two differences between them.

Given the lists of reference sources, the learner will select
the appropriate reference sources to obtain specific information.

Given a sample dictionary page, the learner will be able to
discriminate between guide words, entry words, pronunciation key,
and definitions.

Given a textbook and a list of words within its glossary, the
learner will locate the glossary and list the definition it
gives for each word.

Given a paragraph, the learner will locate the topic sentence.

Opper Level

 

Given a problem or question, the learner will identify the key
words he would look up in an index to find information related

to the problem.

Given an index of an encyclopedia, the learner will locate the
volume and page number of a given topic, illustration or map.

147.

148.

149.

150.

151.

152.

153.

154.

155.

156.

158.

159.

160.

161.

153

Given derived words, the learner will be able to use a dictionary
to locate the base words.

Given an article, the learner will outline, in topic form, its
main points.

Given a list of words whose first three letters are exactly the
same, the learner will be able to arrange the words alphabetically.

Given a scrambled set of words or sentences, the learner will
arrange them into a logical order.

Given the names of the days of the week, months, etc., the
learner will identify abbreviations of these given words, by
matching.

Given a list of pronouns the learner will identify or write the
possessive form of a given pronoun in a given situation.

Given a list of words, the learner will be able to demonstrate
his knowledge of verb suffixes by matching each of the following
suffixes to the appropriate root word in a given list: ize, fy
(or - ifY). -ate, en.

Given a list of words, the learner will be able to demonstrate
his ability to divide words according to the rules of syllabica-
tion, by drawing a line between each syllable.

Given a reading selection, the learner will identify key words,
phrases or passages important to the meaning of the selection.

Given sets of sentences, each containing the same word but with
variations in its meaning, the learner will use the context of
the sentences and the dictionary to identify the meaning of the
word in each sentence.

Given a specific word or list of words, the learner will use a
dictionary to find the syllables, parts of speech, meaning and
synonyms for a given word.

Given a section of a dictionary and a list of words, the student
will locate each word and identify what its grammatical abbre-
viation represents.

Given a list of unfamiliar words of three, four or five syllables,
some of which have been extended by the addition of prefixes

and suffixes, the learner will say each word and sound out the
syllables.

Given a paragraph to read, the learner will be able to identify
traits of a character, found in the paragraph.

162.

163.

164.

165.

166.

167.

169.

170.

171.

172.

173.

176.

177.

179.

180.

154

Given a reading selection, the student will select the event
which creates the major conflict or problem in the story.

Given four categories of mood and a list of words, the learner
classifies each word according to the mood it fits.

Given several phrases, the student will be able to identify which
of the five senses (sight, smell, tase, touch and hearing) each
phrase appeals to.

Given an example for an author card in the card catalogue, the
learner will find the title and call number of the book by the
author.

Given an example of a title card, the learner will find its
author and call number in the card catalogue.

Given an example of a subject card, the learner will locate in
the card catalogue the title, author and call number of one or
more books on that topic.

Given sentences, each an example of figurative language, and
given possible interpretations of meaning of each, the learner
will select the meaning.

Given a random group of factual and opinionated statements the
student will classify each one according to those categories.

Given a list of words beginning with prefixes, the student
will identify the prefix of each word and state the meaning of
the prefix.

Given a selection of cause and effect relationships, the student
will identify these relationships, by matching each cause state-
ment with its corresponding effect statement.

Given a paragraph to read, the learner will identify the authors
purpose in writing a selection, (e.g. entertainment, instruc-
tion, or persuasion).

Given a section heading from a textbook, the learner will briefly
explain what that section might be about.

Given a list, the learner will identify the use of chapter over-
views.

Given an index of a given book, the learner will find pages
where information is found.

Given characteristics of Myths and tall tales, the student will
identify them.

181.

182.

183.

184.

185.

186.

187.

189.

190.

191.

192.

193.

194.

195.

155

Given a reading selection the learner will choose from a series
of sentences the one which best describes motive for some action
or activity.

Given a paragraph to read, the learner will answer a related
question whose answer is implied, but not directly stated within
its content.

Given a selection whose content infers a moral or value, the
student will interpret the content by writing an explanation
of its meaning.

Given specific selections, the learner will be able to evaluate
selections read as to type of literature, such as fable, legend,
mystery, poem or biography.

Given a list of derived words that are not entry words, the
learner will use the dictionary to locate the base (root) word,
and define the derived word.

Given a root word, the learner will add a suffix or prefix, making
appropriate spelling changes in the root when necessary to form
new words as directed.

Given a printed selection and an opinion, the learner will skim
to locate information which supports the opinion.

Given a form or application, the learner will correctly follow
instructions to complete the form.

Given a set of sentences containing two omissions and a choice
of two words, one possessive pronoun and one contraction, the

learner will identify the correct word for ea ch omisssion (it
--it's).

Given the comparative and superlative forms of adjectives,
including the irregular forms of good, bad, many and little, the
learner will write sentences, using the correct form.

Given a set of words denoting business or organizational terms,
the learner will supply the abbreivations for each one.

Given statements from reading selections, the learner will iden-
tify examples of similies, metaphors and alliteration.

Given a list of words which have changed in meaning, the learner
will identify both their original and their current meanings.

Given a paragraph and a list of generalizations about the para-
graph, the learner will recognize those generalizations that

are true.

196.

197.

198.

199.

200.

201.

204.

207.

208.

156

Given a list of phrases, the learner will identify descriptive
phrases as describing action, painting visual pictures and/or
denoting sound.

Given a literary selection (essay, poem or biography) the learner
will write five or more phrases that the author has used to des-
cribe how people feel and/or write his own interpretation of the
selection.

Given the following parts of a book: title page, copyright,
dedication, and table of contents, the learner will define them.

Given a new word in context and its etymology the learner will
identify the word's meaning as used in the context.

Given a sentence expressing a definite mood, the learner will
choose a word describing the mood and match another sentence
expressing the same mood with the first sentence.

Given an article from an encyclopedia or a given selection, the
learner will chart materials in outline form (I, A, B, C, D). '

Given a particular situation, the learner will explain why a
person or group of persons often give very different accounts
of the same events.

Given a reading selection, the learner will identify propaganda
techniques, such as persuasion, unstated assumptions, and
emotionally charged statements.

Given a reading selection, the learner will state whether it
is relatively biased or unbiased.

5.5

157

Math--Primary Level

 

Given a set of geometric shapes, the learner will name a circle,
square, triangle and a rectange.

Given a pair of equivalent sets, with Zero to five members, the
learner will be able to achieve a one to one matching between
members of the sets.

Given a set having one to five objects as members, the learner
will pick out sets that have the same number of objects.

Given a set with less than ten objects, the learner will make an
equivalent set by using actual objects.

Given a set of one to six objects, the learner will form another
set that has exactly one more object and tell how many are in
the new set.

Given a set of ten objects, the learner will count the objects
in the set using the numeral name.

Given picture cards showing sets with one to nine members (one set
per card), the learner will arrange cards in sequential order.

Given cards showing numberals one to nin, (one numeral per card),
the learner will arrange cards in sequential order.

Given a numeral from one to nine, the learner will tell the name
of the numeral.

' Given a number line one to nine, the learner will tell the name

of the numeral that comes just before or just after a given
numeral.

Given two sequentially ordered sets of objects, one of which has
one more than the other, the learner will form a third set that
comes next in order.

Given a set of ten objects varying in attributes, the learner
will sort the objects into two sets according to attributes
(color, shape, size, texture).

Given two sets of objects from one to five, the learner will
combine the sets and tell how many members are in the new set.

Given a set of two to six objects, the learner will separate into
two subsets and will tell how many members are in each subset.

Given a set of one to six objects, the learner will take one
away and tell how many are in the new set.

10.

11a.

12.

14.

15.

17a.

18a.

19a.

20.

21.

22.

23.

24.

24.5

24.6

158

Given real or pictured sets, the learner will indicate which set
identified the following quantitative descriptors: full, empty,
greater, less, same, least, most.

Given a penny, nickel, dime and quarter, the learner will give
the name of each.

Given a set of objects, the learner will compare them, identify
and name the heaviest and the lightest.

Given pictures showing fractional divisions, the learner will
color or circle an area to demonstrate the concept of 8.

Given a calendar, the learner will demonstrate its use by iden-
tifying a day, week and month, when asked to do so.

Given a thermometer, the learner will demonstrate its use by
stating what it's used for in various situations.

Given non-standard units of measure, the learner will use these
units to measure objects in the classroom.

Given a meter stick, the learner will demonstrate its use by
showing how you would use it to measure.

Given real or pictured settings, using three-dimensional objects,
the learner will identify other objects that are in the following
position relationships to the given object. Above, below, under,
on, in, top, bottom, in front of, between, in back of, inside,
and outside.

Given the direction to rote count from l-25, the learner will
do so.

Given no model, the learner will write numerals 0-9 in correct
form so that they are recognizable (reversals are acceptable).

Given the number words from l-9, the learner will write corres-
ponding numerals.

Given a row and a column of objects, not exceeding five, the
learner will name the ordinal name of objects in a row and objects
in a column.

Given a pattern using objects of two or more colors, the learner
will duplicate the pattern selecting from a set of similar 1
objects (red-blue-red-blue-red-blue).

Given a pattern using objects of two or more shapes, the learner
will duplicate the pattern, selecting from a set of similar
objects (X-O-X-O-X-O).

25.

26a.

27a.

28.

29.

30.

31.

32.

33.

34.

35.

36a.

37a.

39a.

159

Given a two part pattern, the learner will continue the pattern.
I Ic::—-1 (:) C) (C5 23

Given the direction to recite chronologically the days of the
week, the learner will do so.

Lower Level

 

Given a blank clock face, the learner will writer numerals to 12
in correct place on the clock face.

Given a number line, the learner will demonstrate the relation-
ship between adding numbers and adding objects, by counting steps
on the number line. -

Given pictures of six sets, some of which are empty, the learner
will identify the empty sets.

Given any addition combination, in mixed horizontal or vertical
form, (0+0, to 5+5) the learner will write the sum with the aid
of a manipulative device.

Given a set of nine or fewer small objects, the learner will
separate the given set into at least two pairs of subsets, then
write the number for each subset.

Given a number line, the learner will demonstrate the relation-
ship between subtracting numbers and subtracting objects by count-
ing steps on the number line.

Given two written numerals less than ten, the learner will indicate
which is greater or lesser in value.

Given a set of pictured objects, not exceeding 25, the learner
will count and write in numeral form, the number of objects.

Given any addition combination in mixed horizontal or vertical
form (0+0 to 0+0) the learner will write the sum with the aid
of a manipulative device.

Given a marked clock face, the learner will identify the time
to the hour.

Given a meter stick/or a 20 cm ruler, the learner will measure
objects or lines to the nearest meter or centimeter.

Given the direction to recite chronologically the months of the
year, the learner will do so.

40a.

42.

43.

44.

45.

46a.

47.

48.

49.

50.

51.

52.

53.

54.

55a.

56.

160

Given pictures of a penny, nickel, dime or quarter, the learner
will identify numerical values of each.

Given addition problems in both horizontal and vertical forms,
through sums of 9, the learner will find the missing addends with
aid of a manipulative device.

Given two consecutive even or odd numerals, 0-25, the learner
will write the numeral that comes between the two given numerals.

Given a numeral, 1-25, the learner will write the numerals that
come before and after the given numeral.

Given up to 90 pictorial objects, (the number must be a multiple
of 10) the learner will form sets of l0 and group and label sets
of tens, as two tens, ... 9 tens.

Given a set of no more than 90 objects groups by tens, the
learner will say and write the numeral.

Given subtraction problems with minuends less than 19, subtra-
hends less than ten, written in both horizontal and vertical
form, the learner will find the differences.

Given any missing addends sentence, with sums up to 18, the
learner will be able to give the missing addend.

Given a mathematical statement, the learner will be able to
place the correct sign > , < , or = between two numerals in the

range l-SO.

Given a number sentence with the operation sign (+ or -) missing,
the learner will complete them by writing in the correct sign.

Given a sequence of numerals, involving skip counting, by two's
up to 30, the learner will write the missing numeral.

Given a sequence of numerals, involving skip counting, by five's
and ten's up to 50, the learner will write the missing numeral.

Given an oral numeral, not to exceed 50, the learner will be
able to write it.

Given an oral word problem requiring addition, with sums less than
18, the learner will find the sum.

Given a marked clock face, the learner will state time to nearest
indicated a hour.

Given pictures of a circle, square or triangle, the learner will
identify the shaded portion that corresponds to a, or k.

57.

58.

59.

61.

62a.

63a.

64a.

65.

66.

67.

68.

69a.

70a.

71a.

72a.

73.

161

Given three, one-digit numerals, with the sum less than 21, the

* learner will find the sum.

Given number phrases less than 20, the learner will supply the
appropriate symbol of equality or inequality, > , = , or < .

Given two two digit numerals, requiring no regrouping, the learner
will find the sum.

Given a problem or the form (two digit - one digit with no regroup-
ing). the learner will find the difference).

Given hours, minutes and days, the learner will indicate the
correct relationship between them.

Given line segments and a ruler, the learner will measure the
line segments to the nearest half centimeter.

Given a 20 centimeter ruler, the learner will construct a line
segment of specified length, designated to the nearest half
centimeter.

Given an addition problem, of the form 21 + 34 = 34 + __3 the
learner will give the missing addend.

Given any one, two or three digit numeral, the learner will write
it in expanded notation.

Given two two digit numerals, requiring regrouping, the learner
will find the sum.

Given a problem of the form, two digit minus one digit, the learner
will find the difference, regrouping if necessary.

Given pictures of money or play money, less than or equal to
$1.00, the learner will compare the values between coins.

Given pictures of money or play money, less than or equal to
$20.00, the learner will write the given money value using the
symbols of dollar sign and decimal.

Given a clock face with hands, the learner will write time in
time notation to half hour and quarter hour.

Given cup, pint, quart and liter containers, the learner will
determine experimentally, the number of cups in a pint, pints in
a quart, approximate quart in a liter.

Given oral word problems involving addition and subtraction with
numbers less than 18, the learner will solve them.

162

73.5 Given subtraction problems in both horizontal and vertical forms,

74.

75.

76.

77.

79a.

80.

81.

82a.

83a.

84.

85.

86.

87.

88.

89.

with minuends not to exceed 18, the learner without regrouping,
will find the missing subtrahend.

Given a problem of the form (three digit minus one, two or three
digit), the learner will find the difference when no regrouping
is required.

Given a problem of the form (two digit minus two digit), the
learner will find the difference, regrouping if necessary.

Given a story problem read orally by the teacher, the learner
will tell which he must use to solve the problem (addition or
subtraction).

Given an oral word problem requiring subtraction, requiring
regrouping the learner will state and do what operation is
necessary to find the difference.

Given several objects divided into (N's, l/3's, g's or whole) by
comparing to the whole unit.

Given number sequences in which some of the numbers are omitted,
the learner will complete the number sequences up to 200.

Given two, three digit numerals, the learner will apply the
appropriate symbol between them ( > , < , =).

Given play money, the learner will make change from $1.00 for
any amount up to $.99.

Given drawings of lines, the learner will point out which ones
are (relatively) horizontal and which are (relatively) vertical.

Given two three digit numbers, the learner will find the sum,
regrouping, if necessary.

Given a three digit minuend and a two or three digit subtrahend,
the learner will find the difference, regrouping if necessary.

Given a number and the consecutive multiples of ten or 100 between
which it falls, the learner will choose the nearer estimate.

Given column addition exercises involving three two digit addends,
the learner will find the sum, regrouping if necessary.

Given a pair of numbers or number phrases less than 1,000, the
learner will supply the appropriate symbol > , <, or =.

Given two addends less than 10,000, the learner will find the sum,
regrouping if necessary.

90a.

91a.

92a.

93a.

94.

95.

96.

97.

98.

99.

100.

101.

102.

103.

163

Given a clock face with hands, the learner will read the time
to the nearest minute.

Given a length expressed in centimeters, the learner will express
it as a number of centimeters plus a number of millimeters.

Middle Level

 

Given an object or line segment, the learner will, without the use
of a ruler, choose the correct estimate from a set of answers of

this form.
2 millimeters, 2 centimeters
2 meters, 2 decimenters

Given the terms, centimeter, meter and decimeter, the learner
will state the relationship between them.

Given a repeated addition sentence, the learner will represent
it as a multiplication sentence with its product.

Given multiplication problems using one as a factor, the learner
will find the product.

Given any multiplication combinations, less than 5 x 5, the
learner will write the product.

Given multiplication problems using zero as a factor, the learner
will find the products.

Given sets of not more than 20 elements, the learner will divide
them into equivalent sub-sets.

Given a mathematical sentence of the form (3 x 4 = 4 x __), the
learner will identify the missing factor.

Given basic multiplication problems, the learner will find the
products, using the distributive property of multiplication over
addition.

Given any multiplication combination up to 9 x 9, the learner
will write the product.

Given multiplication problems in which the factors are whole
numbers less than ten, and one factor is missing, the learner
will record the missing factor.

Given the basic division facts through the nines, the learner
will find the quotients.

104.

105.

106.

107.

108.

109a.

110a.

111.

112a.

113a.

114.

115.

116a.

117a.

118.

164

Given a multiplication number sentence with two missing factors,
the learner will supply any two basic factors to make the given
multiplication number sentence true (ex: ___x ___= 16).

Given any number as the dividend, and zero as the divisor, the
learner will indicate that there is no solution to the problem.

Given a word problem requiring multiplication, the learner will
write the correct equation to go with the problem.

Given a set of multiplication equations in which one factor is
a multiple of 10,000 or 1,000, the learner will write related
division equations.

Given a multiplication problem of the form (one digit number x
two digit number) the learner will find the product.

Given a shaded region located on a piece of graph paper or some
other grid, the learner will find the area by counting the num-
ber of square units.

Given pictures or models of geometric figures; cube, cylinder,
sphere, the learner will identify them.

Given two factors which are multiples of ten, the learner will
find the product.

Given a sentence involving the terms "in the morning, in the
afternoon, in the evening," the learner will supply the appro-
priate AM or PM notation.

Given two times to the nearest half hour, the learner will find
the length of the interval between them.

Given two or three whole number addends less than 100,000 in
horizontal or vertical form, the learner will find the sum,
regrouping if necessary.

Given subtraction problems with up to four digit minuends and
subtrahends, the learner will find the differences, regrouping
if necessary.

Given Arabic numerals 1 through 39, the learner will convert them
to Roman numerals.

Given Roman numerals I through XXXIX, the learner will rewrite
them to Arabic numerals.

Given a numeral with up to four digits, the learner will rewrite
the given numerals, using expanded notation.

119.

120.

121.

122.

123.

124.

125a.

126a.

127a.

129.
130.

131.

132.

133.

134.

135.

165

Given a completed division problem, the learner will identify the
divisor, dividend, quotient and remainder.

Given a series of four numbers, the learner will compute the
average.

Given a multiplication problem with two two digit factors, the
learner will find the product.

Given a division problem, with a two digit divident and a one
digit divisor, the learner will determine the quotient, with or
without a remainder.

Given multiplication problems involving a multiple of ten times
a multiple of 100, the learner will find the products.

Given a multiplication problem with a two digit factor and a
three digit factor, the learner will find the product.

Given the length (whole numbers less than 10), of the sides of
a rectangular region, the learner will find the area.

Given a line segment to measure and a 20 cm ruler with milli-
meter markings, the learner will express its measure in whole
centimeters or millimeters.

Given a sequence of metric pre-fixes, the learner will arrange
them in order from smallest to largest.

Given a fraction orally, the learner will write the fraction.

Given a proper fraction, the learner will identify the numerator
and the denominator of the fraction.

Given a denominator, the learner will supply the correct numer-
ator to make the value of the fraction equal to one, without
the use of aids.

Given a proper fraction with a denominator less than nine, the
learner will explain the meaning of each fraction by making a
drawing or by using fractional cut-outs.

Given a simple fraction, the learner will give at least two
equivalent fractions.

Given fractions with like denominators, the learner will add to
the sum of less than one.

Given any five fractions with like denominators, in random
order, the learner will write them in numerical order.

136.

137.

138.

139.

140.

141.

142a.

143.

144a.

145a.

146.

147.

148.

149.

166

Given a division problem with a three digit divident and a one
digit divisor, the learner will determine the wuotient, with or
without a remainder.

Given two factors of up to three digits each, the learner will
estimate the product by rounding both factors to the nearest ten
and multiplying.

Given the decimal fraction of no more than three places, the
learner will name the place value of the digit.

Given an addition or subtraction of decimal problem in vertical
form with no more than five digits and no more than three deci-
mal places, with each problem having the same number of decimal
places, the learner will find the sum or difference and correctly
place the decimal point.

Given an expressed amount of money, the learner will multiple
or divide the given amount by a whole number.

Given numerals between ten and 5,000, the learner will round off
numerals to the nearest 10's, 100's, or l,OOO's place.

Given any numeral from 1,000 to 9,999,999, the learner will
locate and separate the periods with commas.

Given a story problem, with whole numbers and requiring only
one operation (addition, subtraction, multiplication or simple
division), the learner will choose the correct operation and do
the computation.

Given a six digit numeral in oral form, the learner will write
the given six digit numeral.

Given two times to the nearest minute, the learner will find the
time interval.

Given any four digit number, the learner will give the number that
is 100 or 1,000 less than it is without using formal addition
or subtraction.

Given an exercise in multiplication, the learner will multiply
a three or four digit factor by a two or three digit factor.

Given a division problem with a four digit divident and a one
digit divisor, the learner will determine the quotient with or
without a remainder.

Given division problems with multiples of 100 as dividends and
two digit divisors, the learner will estimate the quotient by
rounding off the divisors to the nearest ten.

150a.

151.

152.

153.

154.

155.

157.

158.

159a.

160.

161a.

162.

163.

167

Given pictures or models of prisms, cones and pyramids, the
learner will correctly identify them.

Given a numeral expressed as a power with an exponent less than
five, the learner will express it as an ordinary base ten numeral.

Given an addition or subtraction decimal problem in horizontal or
vertical form, with no more than three decimal places, the
learner will find the sum.

Given a number with no more than three decimal places, the learner
will round to the nearest whole number, tenth or hundredth as
requested.

Given a number less than 100, the learner will identify the
factors of the given number.

Given a number, the learner will identify multiples of the
given number.

Given division problems with two digit dividends and two digit
multiples of ten as divisors, the learner will find the quotients,
with or without a remainder.

Upper'Level

 

Given division problems with three digit dividends and two digit
multiples of ten as divisors, the learner will find the quotients,
with or without a remainder.

Given word problems, requiring division, the learner will give
the equation and find the quotient.

Given the measurement of each side of a polygon, the learner
will find the perimeter of the given polygon.

Given a list of familiar objects, the learner will choose the
volume measure (cu, cm., cu, dm., cu.m.) that would be nearest
in size.

Given a division problem with a two digit divisor, a four digit
dividend, with or without a remainder, the learner will find the
quotient.

Given a story problem, with whole numbers and requiring only
one operation, (addition, subtraction, multiplication or
division) the learner will choose the correct operation and do
the computation.

164.

165.

166.

167.

168.

169.

170.

171.

172.

173.

174.

175.

176.

177.

178.

179.

168

Given a list of numbers, the learner will identify prime
numbers less than 100 (by circling them).

Given a pair of numbers, each less than 60, the learner will
identify their greatest common factor.

Given a fraction, the learner will reduce it to its simplest
form.

Given a number line segment (0, 1) with dots indicating division
of the segment into equal segments, the learner will identify
the fraction correspondong to a particular dot.

Given an improper fraction, the learner will write it as a mixed
number.

Given a mixed number, the learner will write it as an improper
fraction.

Given a whole number and a mixed number, the learner will find
their sum.

(Given a whole number and a mixed number, the learner will find

the difference.

Given a decimal fraction, the learner will rename it as a common
fraction.

Given a common fraction whose decimal equivalent terminates in
two places or less, the learner will write its decimal equivalent.

Given a whole number and a fraction less than one, the learner
will multiply to find the product.

Given a multiplication problem with fractions less than one as
factors, the learner will find the product in simplest form.

Given a fractional number and a mixed number, the learner will
find the product.

Given multiplication problems having two mixed numerals, the
learner will find the product.

Given a common fraction whose decimal equivalent terminates in
three places or less, the learner will rename the common frac-
tion as a decimal fraction a = 5/10 = .5.

Given any six digit numeral, the learner will rewrite it with
expanded notation, first by using place value words, and then
by using numerals.

180.

181.

182.

183.

184.

185.

186.

187.

187.

188.

189.

191.

192.

193.

194.

169

Given a pair of numbers, each less than 20, the learner will
identify their least common multiple.

Given two fractional numbers, that may or may not require renaming,
the learner will find their sum.

Given two mixed numbers that may or may not require renaming,
the learning will find their sum.

Given subtraction problems involving mixed numerals, the learner
will subtract mixed numerals with renaming and find the differ-
ence in simplest form.

Given two unequal fractions, with denominators of 2, 3, 4, 6 or
8, the learner will tell which is greatest in value.

Given a measurement involving two units in the same system, the
learner will multiply the measurement by a whole number and
regroup as necessary.

Given a division problem with a dividend of no more than five
digits and divisor with no more than three places, the learner
will find the quotient, with or without remainder. The remainders
will be written as fractions in simplest form.

Given a list of fractional numbers, the learner will write the
reciprocal of a number.

Given two fractional numbers, less than one, the learner will
find the quotient.

Given a whole number divisor and a fraction, the learner will
find the quotient.

Given a whole number dividend and a fractional divisor, the
learner will find the quotient.

Given a fraction and a mixed number, the learner will find the
quotient.

Given two mixed numbers, the learner will find the quotient.

Given a numeral from .001 through hundred millions, the learner
will read and identify numerals, expanded numerals, or in word
form.

Given a list of numerals from .999 to 1,000,000, the learner
will round off each numeral to the place value indicated in the
heading.

195.

196.

197a.

198a.

199a.

200a.

201a.

202a.

203a.

204a.

205a.

206a.

207a.

208a.

170

Given a multiplication of decimal problems, with no more than
five digits and no more than three decimal places, the learner
will find the product.

Given a decimal division problem in which the divisor and divi-
dent have no more than five digits and no more than three
decimal places, the learner will find the quotient.

Given a set of equations, the learner will label, identify or
compute as indicat-d, using the properties of addition and
multiplication (dist., assoc., 1's and O's).

Given a measurement such as 1.463 meters, the learner will express
1t as one meter + four decimeters + six centimeters + three
millimeters.

Given diagrams or models of points, lines and planes, the learner
will associate each diagram with one of the words: point, line,
plane.

Given drawings of parallel lines and perpendicular lines, the
learner will associate each diagram with the correct words (paral-
lel, perpendicular).

Given a circle and its related parts, the learner will identify
the center, radius, diameter and circumference.

Given diagrams of segments, lines, rays and angles, the learner
will select and name each as requested.

Given a set of pictured angles, the learner will select those
which are right angles.

Given the formula for finding the area of a triangle and the
measures of the base and height of the triangle, the learner
will find the area.

Given an English or a metric table of equivalent measurements,
the learner will convert from one to another within the same
system.

Given a circle with its radius or diameter, the learner will
find its circumference.

Given the formula for finding the area of a circle and the
measurement of the radius or diameter of a circle, the learner
will find its area.

Given a protractor, the learner will read the measure of any
given angle from O0 to 1800--within two degrees.

209a.

210a.

211a.

212.

213.

214.

215.

216.

217.

218.

219.

171

Given a drawing of a rectangular solid with its dimension (small
whole numbers), the learner will compute the volume.

Given a coordinates, the learner will locate the points on the
grid.

Given three pairs of coordinates and a grid, the learner will
construct a line graph.

Given a square subdivided into an area of 10 x 10 unit squares,
some of which are shaded, the learner will state the indicated
ratio and percent represented by the shaded area.

Given a list of ratios, the learner will express an equivalent
ratio of the given ratios.

Given a list of one or two digit decimal numerals less than one,
the learner will express them as percents.

Given a ratio and the numerator or denominator of an equivalent
ratio, the learner will write the missing numerator or denomin-
ator of the equivalent ratio.

Given a set of percents and two digit decimal numerals, the
learner will write them as fractions in simplified form.

Given a set of proportion problems, where the given terms and the
answer are each whole numbers less than 100, the learner will
find the solution.

Given a percent problem, the learner will write the appropriate
proportion needed to solve the problem in the form,
2 n

-..... 11: 25 0.2.25.
8 10 ’8 TOO’ n 100

Given a set of problems, involving the three types of percent,
the learner will write the appropriate proportion and solve
the problem.

APPENDIX VII

Computer Program for the Simulation

172

173

"000 m0<a

Armzvh<UJm\.mmzlwmdvk<04uuhmwauu

ANmz+nuzvhquu\AmuZi~uZ~v<OJGNHWmm

uaznhzou

«+muzumu4

on” Uh uu

~+mu2ﬂmuc

on" 0» cu

«+«uznuug

on" o» 09

«+ouzuqu4

ohJo>vm~

.h4.>~u~

.h4.>.u~

Duo Juqu

H own on
o
o

00 u» 00 gnaw
om Up 00 .Num
Ob Up 00 .na
.>ou.omwmuvm
h~2.~u

o. ubmwauo
oouhmwm
oﬂduc
oumuz
oﬂmuz
onumz
muzonﬂﬁ ocu U0
ooonozaw

o.xnocomquoAXuocommwdc«xHoNonuvco«XNoNov
muzohaz
.c~a.muaomuuou~Q.&QNNO.KQN~O.¢QHNU.1QH«O.N
quuzqm mu amm234 m4<Dum muz

u
w<zu0u

Z”
a oanmu.
hxwd'v
HHCDOQQ
MAGZOQQ
d 'tLo-OHI lH-«NNNNNNQ O'QDO 0X“
H
Q
11
[U
D
0.
P
W

Q~QNQQNNOGNNOfﬂnaiauumcgNNO+N~Q§QQNNU
¢~Q¢aaNuONNNU+m~leQua09NNO+N~QlaQNuU
anQNKQNNUQNnO+nHQNQQ~N09N~O+N~QiuQNNUN
deNKQNuU*NHO+MHQ*1QH~O~Nu0+wuaimawno*

'nuHH+
(IQNQQQQH
HNLLIIKIGZQH
udlchLQOJ-IQ.

Quuiﬁﬂme
MUOuNUUHm~HH\|

A0
muuomnuo«uuoaamuuouuuuuomuoou

Aunvmzoaunquoaouo«vm..OOOuv

Wcﬂowwwu
ZDHmLUZHu

( H.

NM\NO\OO mNNOh H wk<u 2H4; OoN mmdema

NQ.~Q~

om”
co

om
Ch

00“

PH; U
om~

00
"‘0
H"

«o

174

N000 w0<Q

NM\N0\OO

.1ﬁ»<i~¥ﬁv<"<om

moéonNXﬁ omw DO

0.0"OWOZJW

O.OHOWMZDW

Am0£~h<OJm\OIDmNZ(uzQ
“W04vh<UJu\WZDmMZ<m1w

wDZNPZUU

u+>QUIHH>QOIH

oON\An+>QUI-h<OJlN>QUI

>QOIHQUL

mhm Uh UU

H+.J¥vOZH~J¥u04

mmn Oh U0.>QDLoon«XXVwoquQDLoPJoa!¥wm~u~
«M.MNJ¥ 00m 00

OON\A>QDL~vh<OJm">QOI

oON\~QDI-b<OJlNQUL

OIH>QOLH

OnINQUIu

u+>QDIuN>QUI~

oON\Au+>&UI~vh<OJuH>QUL

>QDIHQUL

00m Uh 00

u+AJ¥~mZH~JXVW£

«mm Oh U05>QDLoon~¥¥v<omUoQULohJo.¥¥v<wl~
amouNJX «ON 00

oON\->QOI~vh<UJuH>QUL

oON\~QUI~vh<UJmHQUL

OIN>QOL~
O~INQUL~
mULouNXX mhm UQ
OUaJxvmé
ONAJXuOL

“HoanX mow OD
WDZHhZUU
mummohwoaﬁﬁ~<w
GammokaAﬁﬁvav
qumohJoaﬁﬁv<v
qumopJoaﬁﬁva
muzoNﬂﬁﬁ OON UO
Aw~3H04J<zw
adv<HmJqum
gavDHCJ—m
a-<uw0~m
wDZ~hZUU
hmquQ+OszNOIJm
hmwm+mZDWHm13w
hWMQlOHaﬁvm
hmwmnaﬁvq

now
00“.
26.
1,:

"l
OﬂdJQJ
_JJ"""

mNNON N whdu anz Com

.mhm
Dom

mmm
Nmn

com
«om

awn
own

Odd

Ohu

wquJmu

“U

175

nooo m0<a

hn\N0\oo

04w
wDZuhZOU OON
on” UP 00

OWN Uh Uu “0.0UoQOhmZvun

nu-h<2uom com
QOhmzoomN u<mu

accou.Xuo¢oomoXuo¢oouoxmo¢oomoax«.moomku
.XN.¢oomoAx~.QoomuN.XNocoOuoax".nomm.NGXN.cohm.Xn.¢oom.Xuvh<£uou CNN
Gum.momoum<>oz<w1060w023wn
oOEDmnOJJ<2moOU~moma<>oz<m2moommZDwom2Jm.wJszmawo—m.ONN hzuua

acuuMoXnuh<Zuom Onm
02.0mm h2~uu

«cann.Xn~h<zaou cum
mzoo~m kZHaQ
acomuouuoDx on

o¢o®mo.ﬂm9¥..¢omuo.ﬂoxm.-comm..umxm ovhdzaom owe
ODXu.m3x&.u¥mom1m.On¢ bluan
OomlaamOZuh<UJu¥0u<>§Ou<>v\diONODxa
Oomla“mozuhquqmiwm<>¥wa4>v\¢zmﬂw3xa
A.WUZVthJmioumﬁua<>v\mZOHO¥m
aamOZ.h<UJuimomima<>v\mimNm¥m

Amuzimuzvh<04&\nanDW+OZDm.IuwQZJmAAmO

MDzthDU 00¢

szhUIszMhD+¢ZONQzQ

uQZM»O+mzuNMZU

uzwhOIQAMhOIlzwaN«Q2uhO
((uzolnﬁﬁvaQZuhO
azwbmﬁuazwhm+vzwﬂczw

Hazwhm+m2mﬂmzw

ozwhmidzMPmrozwhmuadzmhm
4<u2w1.ww.<udzm»m

mDé.unﬁﬁ 00¢ U0
Oooﬂczo
cooﬂnzo
Ooonczw
O.ONMIm
rd>vba0mﬂoum
u<>uhuUWHmOm
vh<uJuuNO¢<>

Q

A
.m
K

«moztmuzuh<OJu\..szWNWZDmviumszw§AmU£VP4UJLvau<>

wazuhzuu omN

mom+UmOszHOm023m

AXWVJNAXVHQHQOW

qum+omm23mucmw23m

(~42

ooN wm<mqmu

nu

BIBLIOGRAPHY

176

BIBLIOGRAPHY

Bernknopf, Stanley & Bashaw, R.L. An investigation of criterion-
referenced tests under different conditions of sample vari-
ability and item homogeneity. A paper presented at the
annual meeting of the American Educational Research Associ-
ation, San Francisco, April 1976.

Brennan, R.L. A generalized upper-lower item discrimination index.
Educational & psychological measurement, 1972, 32, 289-303.

Brennan, R.L. & Stolurow, Lawrence M. An elementary decision process
for the formative evaluation of an instructional system. A
paper presented at the annual meeting of the American Educa-
tional Research Association, New York, February 1971.

Cox, R.C. & Vargas, Julie 5. A comparison of item selection tech-
niques for norm-referenced & criterion-referenced tests. A
paper presented at the annual meeting of the National Council
on Measurement in Education, Chicago, 1966.

Crehan, Kevin D. Item analysis for teacher-made mastery tests.
Journal of educational measurement, winter 1974, 11, no. 4,
255-262.

 

Davis, Frederick B. and Diamond, James J. The preparation of criterion-
referenced tests. Problems in criterion-referenced measure-
ment, CSE monograph series in evaluation, #3. Edited by
Chester w. Harris, Marvin C. Alkin & James N. Popham, 1974.

Ebel, Robert L. Evaluation & educational objectives. Journal of
educational measurement, winter 1973, 19, no. 4, 273-279.

 

Edmonston, Leon P.; Randall, Robert S.; & Oakland, Thomas D. A model
for estimating the reliability and validity of criterion-
referenced measures. A paper presented at the annual meeting
of the American Educational Research Association, Chicago,
April 1972.

Haladyna, T.M. Effects of different samples on item & test charac-
teristics of criterion-referenced tests. Journal of educa-
tional measurement, summer 1974, 11, no. 2, 93-99.

 

The paradox of criterion-referenced measurement. A paper
presented at the annual meeting of the American Educational
Research Association, San Francisco, April 1976.

177

178

Haladyna, T.M. and Roid, G.H. The quality of domain-referenced test
items. A paper presented at the annual meeting of the
American Educational Research Association, San Francisco,
April 1976.

Hambleton, Ronald K. & Gorth, William P. Criterion-referenced testing:
issues & applications. Technical report no. 13; University
of Massachusetts, September 1971; ERIC ED 060 025.

Hambleton, Ronald K.; Swaminathan, Hariharan; Algina, James; & Coulson,
Douglas. Criterion-referenced testing & measurement: A
review of technical issues & developments. Symposium presented
at the annual meeting of the American Educational Research
Association, April 1975, Washington, D.C.

Harris, Chester W. Some technical characteristics of mastery tests.
Problems in criterion-referenced measurement, CSE monograph
series in evaluation, #3. Edited by Chester W. Harris,
Marvin C. Alkin & James W. Popham, 1974.

Helmstadter, G.C. A comparison of bayesian & traditional indexes
of test item effectiveness. A paper presented at the annual
meeting of the National Council on Measurement in Education,
Chicago, April 1974. -

Hsu, Tse-Chi. Empirical data on C-R tests. A paper presented at the
annual meeting of the American Educational Research Associ-

ation, New York, February 1971.

Ivens, Stephen H. A pragmatic approach to criterion-referenced
measures. A paper presented at the annual meeting of the
American Educational Research Association & National Council
on Measurement in Education, Chicago, April 1972.

An investigation of item analysis, reliability & validity
in relation to criterion-referenced tests. Unpublished
doctoral dissertation. The Florida State University, 1970.

Kifer, Edward & Bramble, William. The calibration of a criterion-
referenced test. A paper presented at the annual meeting
of the American Educational Research Association, Chicago,
April 1974.

Kosecoff, Jacqueline B. & Klein, Stephen P. Instructional sensitivity
statistics appropriate for objectives-based test items.
CSE report no. 91, Center for the Study of Evaluation, UCLA,

April 1974.

Livingston, Samuel A. Criterion-referenced applications of classical
test theory. Journal of educational measurement, summer 1972,

9, no. 1, 13-26.

179

Marks, Edmund & Noll, Gary A. Procedures & criteria for evaluating
reading & listening comprehension tests. Educational &4psycho-
logical measurement, 1967, 27, 335-348.

Nitko, Anthony J. A model for criterion-referenced tests based on
use. A paper presented at the annual meeting of the American
Educational Research Association, New York, February 1971.

Oakland, Thomas. An evaluation of available models for estimating
the reliability & validity of criterion-referenced measures.
A paper presented at the annual meeting of the American
Educational Research Association, Chicago, April 1972.

Ozenne, Dan Gilbert. Toward an evaluative methodology for criterion-
referenced measures: test sensitivity. CSE report no. 72,
Center for the Study of Evaluation, UCLA, October 1971.

Popham, James W. & Husek, T.R. Implications of criterion-referenced
measurement. Journal of educational measurement, spring 1969,
6, no. 1, 1-9.

Roudabush, G.E. Item selection for criterion-referenced tests. A
paper presented at the annual meeting of the American Educa-
tional Research Association, New Orleans, February 1973.

Saupe, J.L. Selecting items to measure change. Journal of educational
measurement, fall 1966, 3, no. 3, 223-228.

 

Schooley, Daniel E.; Schultz, Daniel W.; Donovan, David L.; & Lehmann,
Irvin J. Quality control for evaluation systems based on
objective-referenced tests. A paper presented at the annual
meeting of the American Educational Research Association,

San Francisco, 1976.

Skager, Rodney W. Generating criterion-referenced tests from
objective-based assessment systems: unsolved problems in
test development, assembly & interpretation. Problems in
criterion-referenced measure, CSE monograph series in evalua-
tion, #3. Edited by Chester W. Harris, Marvin C. Alkin &
James W. Popham, 1974.

Woodson, M.I. Chas. E. The issue of item & test variance for
criterion-referenced tests. Journal of educational measure-
ment, spring 1974, 11, no. 1, 63-64.

"11111111111114“