This is to certify that the dissertation entitled "An Alternative Way of Estimating Test Item Statistics for Test Development in Malaysia" presented by Bong-Cheang Quek has been accepted towards fulfillment of the requirements for the Ph.D. degree in Education. Date: 10/31/89.

AN ALTERNATIVE WAY OF ESTIMATING TEST ITEM STATISTICS FOR TEST DEVELOPMENT IN MALAYSIA

By

Bong-Cheang Quek

A DISSERTATION

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of

DOCTOR OF PHILOSOPHY

Department of Counseling, Educational Psychology and Special Education

1989

ABSTRACT

AN ALTERNATIVE WAY OF ESTIMATING TEST ITEM STATISTICS FOR TEST CONSTRUCTION IN MALAYSIA

By

Bong-Cheang Quek

The purpose of the study was to investigate how accurately experienced Chemistry teachers can estimate the item statistics of the Chemistry test items that will be used in the Malaysian Certificate of Education (MCE) Examination. The study also examined whether the accuracy of estimation can be improved by an intervention program for increasing the competency of the teachers' estimation skills. Also examined were the effects of (1) content area of item, (2) cognitive level of item, (3) difficulty level of item, (4) discrimination power of item, and (5) item type on the accuracy of item statistics estimation.

Thirty experienced Chemistry teachers were randomly assigned to one of two groups: the treatment and the control groups. The treatment group teachers underwent a training session which provided them with an opportunity to develop skills and strategies for estimating item statistics, whereas the control group was not trained. Equating items were embedded among the items whose item statistics were to be estimated. The teachers were each provided with 10 "anchor" items (i.e., items with known population values of item characteristics) to guide them in the estimation.

A double repeated-measures design with one between-subjects factor and two within-subjects factors was used to analyze the data. The data indicated that

1. The treatment was effective in improving the accuracy of p-value estimation but not the accuracy of point-biserial estimation.

2. The factors (a) content area, (b) cognitive level, (c) difficulty level, (d) discrimination level, and (e) item type have a significant effect on the accuracy of p-value estimation by the experienced teachers.

3. There was no significant difference between the accuracy of p-value estimation by the teachers competent in estimation skills and the accuracy of p-value estimation obtained by field-trial of the item pool.

4. The factors (a) content area, (b) cognitive level, and (c) discrimination level significantly affect the accuracy of point-biserial estimation by the experienced teachers.

5. The factors (a) difficulty level and (b) item type do not significantly affect the accuracy of point-biserial estimation by the experienced teachers.

6. The estimated and population values of the point-biserial have a Spearman rank correlation of 0.43.
ACKNOWLEDGEMENTS

Many people have contributed toward the success of this study. First, I am deeply grateful to my wife, Lee-Chin, and our children Wei-Kin and Wei-Schen for their love, emotional support, patient understanding, and many sacrifices throughout my doctoral studies.

I would especially like to thank Dr. Herbert C. Rudman, the Chairman of my Dissertation Committee, my academic advisor and my friend, for his guidance, insightful suggestions and personal interest in the study. Without his professional advice and unfailing support this project would have taken a much longer time to complete.

I wish to express my gratitude to the members of my Dissertation Committee: Dr. William Mehrens for his advice, insightful comments and valuable suggestions; Dr. Stephen W. Raudenbush for his statistical expertise and valuable suggestions in data analysis; Dr. Ralph T. Putnam for his professional expertise and contribution to the early development of the proposal for the study; and Dr. Louis Romano for his contribution as a member of the committee.

Special thanks go to the teachers who participated in the study. Without their cooperation and contributions this study would not have been completed.

I would also like to thank the Malaysian Ministry of Education for the scholarship award which enabled me to pursue the doctoral program; the Director of the Educational Planning & Development Division, Datin Asiah bt. Abu Samah, for granting the approval for the project to be carried out in Malaysia; the Director of the Examinations Syndicate, Dato' Haji Mohd. Ghazali b. Hj. Mohd. Hanafiah, for permitting the use of the facilities in the Examinations Syndicate for the study; and Puan Nik Faizah Mustapha, Datin Hajjah Rapiah Tun A. Aziz, Puan Hajjah Rosni Hamzah, and Mr. Ivan D. Filmer Jr. (senior officers in the Examinations Syndicate) for their assistance and advice which enabled the project to be carried out smoothly.

TABLE OF CONTENTS

LIST OF TABLES . . . viii
LIST OF FIGURES . . . x

Chapter
I. THE PROBLEM . . . 1
   Introduction . . . 1
   Purpose of the Study . . . 3
   Testing and Test Development in Malaysia . . . 5
   Need for the Study . . . 9
   Research Hypotheses . . . 11
   Overview . . . 13
II. REVIEW OF THE LITERATURE . . . 14
   Introduction . . . 14
   Judgment Under Uncertainty . . . 14
   Determinants of Item Difficulty and Discrimination . . . 23
   Empirical Evidence of Accuracy of Estimate . . . 38
   Discussion and Summary . . . 49
III. PROCEDURES AND DESIGN . . . 57
   Sampling Procedure . . . 58
   Sample Item . . . 61
   Design . . . 63
   Hypotheses . . . 89
   Statistical Analysis . . . 90
   Summary . . . 97
IV. RESULTS . . . 99
   Introduction . . . 99
   Accuracy of P-Value Estimation . . . 100
   Accuracy of Point-Biserial Estimation . . . 122
   Summary . . . 132
V. SUMMARY AND CONCLUSIONS . . . 137
   Summary . . . 137
   Conclusions . . . 142
   Discussion . . . 147
   Implications for Further Research . . . 161

Appendix
   A. Teacher Data . . . 164
   B. Explanation (Technical Terms) . . . 166
   C. Determinants of Item Difficulty . . . 168
   D. Bayesian Approach . . . 169
   E. Sample of Scatterplots . . . 172
   F. Tables of Means of Accuracy . . . 178
   G. ANOVA Tables . . . 188

Bibliography . . . 198
LIST OF TABLES

Summary of Findings Related to the Study . . . 54
Examples of Types of Items . . . 62
A Summary of Research Findings on Item Difficulty . . . 67
Estimated and Population P-Values of the Equating Items . . . 80
Estimated and Population Delta-Values of the Equating Items . . . 81
A Double Repeated Measures Design with One Between-Subjects and Two Within-Subjects Factors . . . 92
Competency Indices of Teachers . . . 95
Multiple-Comparisons among the Means of the Different Content Areas (Form A) . . . 103
Multiple-Comparisons among the Means of the Different Content Areas (Form B) . . . 103
Multiple-Comparisons among the Means of the Different Cognitive Levels (Form A) . . . 106
Multiple-Comparisons among the Means of the Different Cognitive Levels (Form B) . . . 106
Multiple-Comparisons among the Means of the Different Difficulty Levels (Form A) . . . 110
Multiple-Comparisons among the Means of the Different Difficulty Levels (Form B) . . . 110
Multiple-Comparisons among the Means of the Different Discrimination Levels (Form A) . . . 112
4.8 Multiple-Comparisons among the Means of the Different Discrimination Levels (Form B) . . . 112
4.9 ANOVA Table for Evaluating the Difference between the Accuracy of Estimation by Teachers and the Accuracy of Field-Trial . . . 116
4.10 Means and Standard Deviations of Accuracy of Estimation for Different Methods of Estimation . . . 121
4.11 Multiple-Comparisons among the Means of the Different Content Areas (Form A) . . . 127
4.12 Multiple-Comparisons among the Means of the Different Content Areas (Form B) . . . 127
4.13 Multiple-Comparisons among the Means of the Different Cognitive Levels . . . 129
4.14 Multiple-Comparisons among the Means of the Different Discrimination Levels (Form A) . . . 131
4.15 Multiple-Comparisons among the Means of the Different Discrimination Levels (Form B) . . . 131
4.16 Summary of Tests of Significance for Item Difficulty Estimation . . . 133
4.17 Summary of Tests of Significance for Point-Biserial Estimation . . . 136

LIST OF FIGURES

Relationship between P-Value and Normal Deviate . . . 80
Frequency Distribution of the Original Accuracy of P-Value Estimates . . . 84
Frequency Distribution of the Original Accuracy of Point-Biserial Estimates . . . 85
Frequency Distribution of the Transformed Accuracy of P-Value Estimates . . . 86
Frequency Distribution of the Transformed Accuracy of Point-Biserial Estimates . . . 87
Interaction of Form and Content on Accuracy of P-Value Estimation . . . 102
Interaction of Form and Cognitive Level on Accuracy of P-Value Estimation . . . 105
Interaction of Form and Difficulty Level on Accuracy of P-Value Estimation . . . 108
Interaction of Form and Discrimination Level on Accuracy of P-Value Estimation . . . 111
Interaction of Form and Item Type on Accuracy of P-Value Estimation . . . 114
Frequency Distribution of the Original Accuracy of P-Value Estimates by Competent Teachers and by Field-Trial . . . 117
Frequency Distribution of the Transformed Accuracy of P-Value Estimates by Competent Teachers and by Field-Trial . . . 118
Interaction of Method of Estimation and Content on Accuracy of P-Value Estimation . . . 119
4.9 Scatterplot between Estimated and Population Point-Biserials . . . 124
4.10 Scatterplot between Point-Biserials from Item Analysis and Population Point-Biserials . . . 125
4.11 Interaction of Form and Content on Accuracy of Point-Biserial Estimation . . . 126
4.12 Interaction of Form and Discrimination Level on Accuracy of Point-Biserial Estimation . . . 130

CHAPTER I

THE PROBLEM

Introduction

The quality of a test is determined, in part, by the extent to which the scores produced by the test are reliable and valid. Mehrens and Lehmann (1987) defined reliability as "the degree of consistency between two measures of the same thing" (p. 54) and validity as "the extent to which certain inferences can be made from test scores or other measurement" (p. 74). Even though reliability and validity are both important indicators of a test's quality, a valid interpretation of the test scores is possible only when there is consistency in the scores. Thus reliability of test scores is a necessary, although not sufficient, condition for valid test score interpretations. Nitko (1983) argued that "we cannot realistically expect...that any test will yield perfectly consistent or reliable scores. Nevertheless, as the degree of reliability of test scores diminishes, so does their degree of validity" (p. 388). This statement underscores the importance of constructing tests that will yield consistent scores so that accurate inferences can be made. While it is not always easy to construct a test that enables valid interpretations of test scores, it is not as difficult to develop tests that yield consistent scores, provided the test developers adhere to the principles of test construction.

Psychometricians have developed theories that relate test characteristics such as reliability, mean, standard deviation, and standard error of test scores to item statistics such as item difficulty level and item discrimination index. For example, the Kuder-Richardson reliability coefficient (KR20) can be computed when the difficulty level and the discrimination index (in this case the point-biserial correlation coefficient) of each item in the test are known. The mean of the test scores is the simple sum of the difficulty levels of all the items in the test. Apart from affecting the test statistics, knowing item statistics also allows the test developer to control the test score distribution to serve certain specific purposes such as awarding scholarships or diagnosing learning difficulties. In all these cases, item difficulty level is defined as the proportion of a defined group of examinees who answer the item correctly, while the point-biserial correlation coefficient is the correlation between the score on the item and the total test score for a defined group of examinees.

A logical conclusion is that to construct a test with desirable qualities, that is, a test with high reliability, suitable difficulty level and suitable test score distribution, the test developer needs to know the difficulty levels and the discrimination indices of the items. These item statistics can be obtained through item analysis. The usual approach to obtain the required information on the items is to try out the items on a sample of the population similar to the population for which the test is intended and analyze the items. This approach of obtaining item statistics is currently practiced by the Examinations Syndicate, Ministry of Education of Malaysia.

Purpose of the Study

This study investigated how accurately experienced chemistry teachers can estimate the item statistics of the Chemistry test items that will be used in the Malaysian Certificate of Education Examination (MCE).
Experienced chemistry teachers are defined as current teachers who have taught examination class chemistry for at least 3 years and have some experience in grading the essay or practical components of the Chemistry examination. Examination classes are classes in which students are taught curricula that will be examined in the MCE Examinations at the end of the academic year, and the teachers teaching these classes are expected to prepare students so that as high a percentage of students as possible will pass the examination in the subjects they teach.

This study will investigate the accuracy with which these experienced examination class teachers estimate the item characteristics. The item characteristics to be investigated are the item difficulty and the item discrimination index. The item difficulty is defined, for purposes of this study, as the percentage of the test population answering an item correctly (p-value). The item discrimination will be defined as the point-biserial correlation coefficient, r_pbis, between the dichotomous item score and the total test score.

In addition to investigating how accurately experienced chemistry teachers can estimate item characteristics, this study will examine whether the accuracy of estimation can be improved by an intervention program aimed at increasing the competency of teachers' estimation. It should be noted that the main focus of the study is not to generalize the findings to the population of experienced chemistry teachers but to investigate the degree of accuracy with which the selected group of teachers are able to estimate item characteristics and to what extent the intervention program can improve accuracy of estimation. If it is found that the intervention program can improve the accuracy of estimation substantially, it suggests that further research should be conducted to investigate whether a more extensive intervention program can improve the accuracy further.

This study, then, will be concerned with the following broad questions:

(a) With what degree of accuracy do experienced chemistry teachers not trained in estimation skills estimate the difficulty levels and discrimination indices of the chemistry items in the MCE Examination?

(b) With what degree of accuracy do experienced chemistry teachers trained in estimation skills estimate the difficulty levels and the discrimination indices of the chemistry items in the MCE Examination?

(c) Will the training program designed to improve the competency of estimation of the experienced chemistry teachers result in an increase in the accuracy of estimation?

(d) Will the accuracy of estimation depend on the item type, the cognitive levels at which the items were written, the subject matter contents of the item, item difficulty, and item discrimination?

(e) Are the discrepancies between the item statistics estimated by the trained teachers and the corresponding item parameters (obtained from post-test analysis) of the same magnitude as the discrepancies between the item statistics obtained in the field-trial and the corresponding item parameters?

Testing and Test Development in Malaysia

A brief description of testing and test development in the Malaysian Ministry of Education will provide some background information that will enhance the understanding of not only the purpose and the need of, but also the method of analysis used in, this study.

Background. The educational system in Malaysia is highly centralized.
The syllabi of school subjects are designed by the Curriculum Development Center of the Ministry of Education, and schools are required to adopt curricula that provide instructional activities adhering closely to the syllabi. Evaluations of pupils' learning, in the form of examinations, are carried out centrally by another body called the Examinations Syndicate, which is a division in the Ministry of Education itself.

Public examinations at the end of primary education, and at the third year, the fifth year and the seventh year of secondary education, are conducted at the end of each academic year. All the students in the respective examination classes are required to take the designated examinations, and the results of the examinations at the secondary level are used for promoting the students to the next level of education. All primary school students are allowed to enter the secondary schools regardless of the results of the examinations which they took at the end of the primary education.

The tests constructed for a specific examination are based on the same curricular contents each year, except when there is a change of syllabus. The test papers are 'open', that is, the papers are made available to the public at the end of the examination period. This means that fresh tests have to be prepared each year for the public examinations. It was reported in 1982 that the Examinations Syndicate constructed annually 34 multiple-choice objective tests and 288 essay tests (Report on the second national seminar on test management system, 1982).

How an Objective Test is Constructed. A typical objective test developed¹ in the Examinations Syndicate goes through the following fourteen steps (Report on the second national seminar on test management system, 1982):

(a) Selection of panel members
(b) Commissioning prospective panel members to write items
(c) Panel meeting to review and to write items
(d) Review of accepted items by test developer
(e) Assembly of accepted items into test booklets for field-trial
(f) Field-trial of item pool
(g) Item analysis of item pool
(h) Processing data
(i) Selecting items
(j) Computing predicted test statistics
(k) Preparing first draft of the test
(l) Review of draft by internal committee
(m) Assembling the final draft
(n) Proofreading

Among the fourteen steps listed above, the steps of particular relevance to the present study are: field-trial of item pool, item analysis of item pool, selection of items, and computing of predicted test statistics from item statistics.

¹ Test development in Malaysia is greatly influenced by the work done at Educational Testing Service (ETS), Princeton. In fact, the test development procedure in the Examinations Syndicate is modified from that practiced at ETS.

The item analysis of the item pool is handled by a mainframe IBM computer. The item analysis reports the difficulty level (the p-value), the point-biserial correlation coefficient (r_pbis), and the mean criterion score (MCS) for each option in the items. The mean criterion score for a particular option is the mean of the criterion scores of all the examinees who chose the option as the answer for the item, while the criterion score is the total score of the examinee on the test converted to a standard score with a mean of 13 and a standard deviation of 4. The p-value of each item is further converted to a Delta-scale which also has a mean of 13 and a standard deviation of 4.
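As a concrete illustration of these item statistics, the sketch below computes a p-value, an ETS-style Delta-value (the conventional normal-deviate transform of the p-value, rescaled to a mean of 13 and a standard deviation of 4), and a point-biserial correlation for a single dichotomously scored item. The function names and toy data are assumptions made for illustration only; this is not the Examinations Syndicate's mainframe item-analysis routine.

```python
# Minimal sketch (illustrative, not the Syndicate's actual item-analysis program).
from statistics import NormalDist, mean, pstdev

def p_value(item_scores):
    """Proportion of examinees answering the item correctly (scores coded 0/1)."""
    return sum(item_scores) / len(item_scores)

def delta_value(p):
    """Delta-scale difficulty: the normal deviate corresponding to p,
    rescaled to mean 13 and standard deviation 4 (hard items get high Deltas)."""
    return 13 + 4 * NormalDist().inv_cdf(1 - p)

def point_biserial(item_scores, criterion_scores):
    """Correlation between the 0/1 item score and the criterion (total) score."""
    p = p_value(item_scores)
    mean_right = mean(c for x, c in zip(item_scores, criterion_scores) if x == 1)
    return (mean_right - mean(criterion_scores)) / pstdev(criterion_scores) * (p / (1 - p)) ** 0.5

# Toy data: five examinees' scores on one item and their hypothetical criterion scores.
item = [1, 0, 1, 1, 0]
criterion = [16, 9, 14, 12, 10]
p = p_value(item)
print(p, delta_value(p), point_biserial(item, criterion))
```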
The Delta-scale is preferred to a p-value for describing the difficulty level of items because a Delta-scale is linear while a p-value scale is not, and it is more meaningful to compute mean difficulty on a linear scale. Items are selected based on whether they fit the table of specifications, and whether they have suitable Delta-values and acceptable point-biserial correlation coefficients. Since the tests constructed by the Malaysian Examinations Syndicate are achievement tests that are scored dichotomously, the Kuder-Richardson formula 20 (KR20) is used to compute the internal-consistency reliability coefficient of the tests.

Post-analysis. After the tests have been administered to the actual student population at the end of the academic year, item analysis is performed on the items in the test. This set of item analysis is referred to as the post-analysis. The post-analysis reports two sets of statistics: one for the test and the other for the items in the test. The test statistics portion of the post-analysis reports the population size on which the analysis is based, the number of items in the test, the raw mean of the test scores, the standard deviation of the raw scores, the mean score on the Delta-scale, the mean of the point-biserial correlation coefficients, the internal-consistency reliability coefficient (KR20), and the standard error of measurement. The item statistics portion of the analysis reports, for each item in the test, the proportion of examinees choosing each option, the mean criterion score of the examinees choosing each option, and the point-biserial for each option.

Need for the Study

The test development process in Malaysia and the examination policy in which test papers are "opened" after each public examination require field-trials of item pools to be carried out annually. The volume of multiple-choice objective tests constructed annually -- 34 as reported in 1982 (Report on the second national seminar on test management system, 1982) -- makes the annual field-trial exercise a formidable task. The same report estimated that about 1,900 'working' items, that is, items that satisfy the tables of specifications and have suitable item difficulty and item discrimination, are required annually to construct the 34 objective tests. According to the report, "To come up with this number of working items will require at least 3,800 draft items. The [Test Development and Research] Unit [of the Malaysian Examinations Syndicate] carries out four trial studies a year covering a total sample of about 20,000 students" (1982, p. 47).

The need to conduct annual field-trials of item pools has created administrative as well as practical problems, not to mention the large cost. The administrative problems have become more acute in recent years due, in part, to the increasing volume of items to be item analyzed and, in part, to the recent trend of schools conducting their regional trial examinations prior to the public examinations. The regional trial examinations are conducted by groups of schools in particular regions. The nature of the tests is similar to that of the public examinations, and the tests are written, in a cooperative effort, by the teachers in the regions. All the schools participating in a particular regional trial examination adopt the same schedule for the examinations.
This means that the field-trial exercise conducted by the Examinations Syndicate has to compete with the school regional trial examinations for the limited time available in the schools for testing. This gives rise to problems in scheduling field-trials.

The cycle of test development requires that item statistics of the item pools be available before the construction of the final drafts of the tests. This requirement creates a "bottle-neck" situation in that tests cannot be constructed before trial studies of item pools, and field-trial exercises can only be carried out at a specific time of the year. Furthermore, the field-trial exercises are plagued with many scheduling problems. Taking all these problems and constraints into consideration, there is, therefore, a need to investigate an alternative approach to estimating the item statistics, namely the item difficulty level and the item discrimination index, of the item pool, with the intention that this alternative approach will be less expensive, will not be constrained by the specific time at which the estimation can be conducted, and will not require representative samples of students.

Research Hypotheses

The research hypotheses of the study are as follows:

1. The accuracy of item difficulty estimation by the experienced teachers trained in estimation skills is better than the accuracy of item difficulty estimation by the teachers not trained in estimation skills.

1a. There is a difference between items of different content areas in the accuracy of p-value estimation by the experienced teachers.

1b. There is a difference between item cognitive levels in the accuracy of p-value estimation by the experienced teachers.

1c. There is a difference between items of different difficulty levels in the accuracy of p-value estimation by the experienced teachers.

1d. There is a difference between items of different discrimination power in the accuracy of p-value estimation by the experienced teachers.

1e. There is a difference between items of different types in the accuracy of p-value estimation by the experienced teachers.

2. The accuracy of item difficulty estimation by the experienced teachers trained in estimation skills is the same as the accuracy of item difficulty estimation obtained from field-trial of the item pool.

3. The accuracy of discrimination index estimation by the experienced teachers trained in estimation skills is better than the accuracy of discrimination index estimation by the teachers not trained in estimation skills.

3a. There is a difference between items of different content areas in the accuracy of point-biserial correlation coefficient estimation by the experienced teachers.

3b. There is a difference between items of different cognitive levels in the accuracy of point-biserial correlation coefficient estimation by the experienced teachers.

3c. There is a difference between items of different difficulty levels in the accuracy of point-biserial correlation coefficient estimation by the experienced teachers.

3d. There is a difference between items of different discrimination power in the accuracy of point-biserial correlation coefficient estimation by the experienced teachers.

3e. There is a difference between items of different types in the accuracy of point-biserial correlation coefficient estimation by the experienced teachers.

Overview

This chapter has presented the problem, purpose, background, need and research hypotheses of the study. In Chapter II a review of the literature related to the study will be presented.
Chapter III describes the procedures and design of the study. The dependent variables and the procedure for transforming the skewed distribution of the original dependent variables to a more nearly normal distribution are also described. Chapter IV presents the analysis of the data obtained. The analysis of the data for item difficulty estimation is presented first, followed by the presentation of the analysis of the discrimination index estimation. Chapter V contains a summary of the study and the findings. The conclusions, a discussion of the findings and the implications of the study are also included.

CHAPTER II

REVIEW OF THE LITERATURE

Introduction

The purpose of this study is to investigate the accuracy with which teachers estimate item statistics and whether teachers trained in estimation skills are able to estimate item statistics more accurately than teachers not trained in estimation skills. Other studies were reviewed to find out the factors that affect the values of item statistics and estimation accuracy, and to develop a rationale for designing the training program.

Research related to the present study was reviewed and grouped into three broad categories. One category concerns judgment under uncertainty. The second category deals with the determinants of item difficulty and item discrimination. The third category describes empirical studies of the accuracy with which subjective judgments of item characteristics were made under different situations.

Judgment under Uncertainty

Extensive research on judgment under uncertainty has been reported in the literature. This body of research has focused on two areas. One area deals with the psychology of prediction and the other with expert judgment under uncertainty.

Psychology of Prediction

Tversky and Kahneman (1974), in a discussion of judgment under uncertainty, suggested that people rely on heuristic principles to reduce the complex task of assessing probabilities and predicting values to simpler judgmental operations, and cautioned that while these heuristic approaches are generally useful they often lead to severe and systematic errors. The writers demonstrated that three heuristics -- representativeness, availability, and adjustment and anchoring -- were employed by people to assess probabilities and to predict values.

Representativeness. The representativeness heuristic is an approach whereby people predict the outcome that appears most consistent with the evidence presented to them. Kahneman and Tversky (1973) illustrated the occurrence of judgment by the representativeness heuristic through the following event: A group was given the description of a person X as, "Mr. X is very shy and withdrawn, invariably helpful, but with little interest in people, or in the world of reality. A meek and tidy soul, he has a need for order and structure, and a passion for detail." They were then asked to assess the probability that Mr. X was engaged in a particular occupation from a number of occupations: farmer, salesman, airline pilot, librarian and physician. In spite of the fact that there are more farmers than librarians, the representativeness heuristic led people to believe that Mr. X was a librarian because he was similar to the stereotype of a librarian. In a series of studies, Kahneman and Tversky (1973) have shown that both naive and sophisticated subjects predict by representativeness.
People (1) are insensitive to prior probability of outcomes, sample size and probability, (2) have misconceptions of chance and regression, and (3) have unwarranted confidence produced by a good fit between the predicted outcome and the input information. These three outcomes were offered as evidence that intuitive predictions follow a representativeness heuristic (Tversky & Kahneman, 1974).

Several other studies supported Kahneman and Tversky's conclusion that judgmental heuristics are quite useful but sometimes lead to systematic errors. For example, Hackman (1982) stated that one of the maxims for institutional researchers is that 'heuristics are not always helpful'. Evidence of representativeness, even though highly unreliable and worthless, will lead people to ignore base-rate knowledge or prior probability while making estimations. Nisbett and Borgida (1975) investigated the influence of a representativeness heuristic on people's reasoning about social behavior. They showed that base rate information about the behavior of most people in a given situation often has little effect on a subject's attributions about the causes of a particular target individual's behavior.

Availability. Tversky and Kahneman (1974) stated that the "availability heuristic is employed when people assess the frequency of a class or the probability of an event by the ease with which instances can be brought to mind" (p. 1127). In a study to demonstrate this effect, different lists of names of both sexes were distributed to different subjects. Each list contained the same number of names of men and women. However, in some of the lists the men were relatively more famous than the women, and in others the women were relatively more famous. When asked to judge whether the list contained more names of men than of women, the subjects erroneously judged that the class (sex) that had the more famous personalities was the more numerous. In general, instances of large classes are recalled better and faster than instances of less frequent classes. Furthermore, in addition to frequency and probability, other factors such as relevance, similarity, salience, familiarity, drama and recency also affect availability (Tversky & Kahneman, 1973, 1974).

Following Tversky and Kahneman's work, a number of investigators have carried out field tests of the availability heuristic. Billings and Schaalman (1980) provided evidence to support the hypothesis that the variables number, relative frequency, relevance, familiarity, drama and recency are related to subjective probability estimates. In a study to assess item-induced availability biases in ratings of leader behavior, Binning and Fernandez (1986) found that the availability of behavior description items, i.e., items that described more specific, imaginable, dramatic, familiar or retrievable behaviors, significantly correlated with the actual ratings of leader behavior. Levi and Pryor (1985) extended the list of variables affecting availability to include reasons for the outcome. They found that the prediction of the outcome of the presidential debate between Reagan and Mondale was affected by the availability of reasons but not by imagery of the outcome.

Hence, according to Tversky and Kahneman (1974), reliance on availability leads to prediction biases such as (a) biases due to the retrievability of instances, (b) biases due to the effectiveness of a search set, (c) biases due to imaginability, and (d) biases due to illusory correlation.

Adjustment and anchoring.
In this judgmental heuristic, estimates are made by assuming an initial value, and adjustments are then made to yield the final answer. Whether the initial values are suggested by the formulation of the problem or obtained from partial knowledge about the problem, the adjustments made are on most occasions insufficient. Tversky and Kahneman (1974) called this phenomenon, in which adjusting estimates from different starting points results in judgments biased toward the starting values, anchoring. The occurrence of an anchoring phenomenon was demonstrated by a study in which two groups of high school students estimated, within 5 seconds, the numerical values of two expressions: 8 x 7 x 6 x 5 x 4 x 3 x 2 x 1 and 1 x 2 x 3 x 4 x 5 x 6 x 7 x 8. The first group estimated the descending sequence while the second estimated the ascending sequence. The median estimate for the descending sequence was 2,250 while that for the ascending sequence was 512. The correct answer was 40,320. Tversky and Kahneman (1974) explained that both groups obtained a rough estimate by computing the first few multiplications and made adjustments to obtain the final values. The starting point for the first group (assuming 3 quick multiplications were made) was 330 and that for the second group was 6. Because subjects anchored at vastly different starting points, they produced vastly different estimates. Tversky and Kahneman (1974) demonstrated with supporting evidence that this judgmental heuristic led to systematic and predictable errors.

An example of estimation based on an anchoring and adjustment heuristic was provided by Hackman (1982), who wrote that in higher education incremental budgeting and in building estimates, both for cost and time required, there were anchors from past-year figures or from initial estimates. "If a decision must be made about a greatly changed department's budget, the past budget amount will inexorably affect the new allocation" (p. 15).

An insight gained from this literature regarding the biases to which these judgmental heuristics lead has a direct bearing on the present study, which concerns subjective estimation of item statistics. For example, better estimates of item statistics may be achieved if judges are aware of the systematic errors due to a representativeness heuristic and are advised to make their judgment based on the characteristics of the population that will be taking the test rather than just on the characteristics of the people or students they are familiar with. Recognizing that it might not be easy for the judges to free themselves from the influence of a representativeness heuristic when making an estimation, a procedure may be developed to rectify the errors caused by this heuristic.

Expert Judgment Under Uncertainty

In addition to research on the psychology of judgment under uncertainty, a number of studies have focused on issues of expert judgment under uncertainty. Several investigators have studied the problem of individual versus consensus expert judgment. Winkler (1971), in a study in which college students and faculty assessed the probabilities of the outcomes of collegiate and NFL football games, found that consensus judgments are more accurate than the average of the individuals' judgments. Winkler (1968) suggested two general methods for arriving at a consensus subjective judgment.
One method is to allow each expert to revise his or her estimate after seeing the estimates of the remaining experts involved in the judging exercise, without actually meeting the experts themselves. The other method is to allow the experts to discuss the issues in a panel in order to arrive at a final judgment. One problem with the second method is, according to Fitzpatrick (1983), the normative effects of opinion exposure. Beach (1975) felt that with this method "the experts may falsify their opinions in the hope of swaying other experts toward their point of view" (p. 13). Berk (1986) suggested that this problem may be overcome by using independent expert estimates, instead of the consensus value, at the final stage of the process. These concerns regarding the issue of individual vs. consensus judgment and Berk's suggestion to overcome the problem seem logical. However, they should be regarded as tentative until they are verified empirically.

Another problem related to consensus judgment, as pointed out by Beach (1975), is that as a group people may make more extreme judgments than anyone in the group would make as an individual. However, this conclusion did not find support in a study by Goodman (1972), who found that of six groups involved in a study to compare the estimation of likelihood ratios by individuals and groups, four made more conservative estimates than the average of the individuals within the groups while the other two groups were less conservative.

The question of conservative versus extreme judgment is related to the error of central tendency in judgment. The error of central tendency, which was identified by Guilford as a source of error as early as 1954, refers to the tendency for people to avoid making extreme judgments (Guilford, 1954). In a study which predicted the difficulty of test items, Tinkelman (1947) found that there was a tendency for the judges to overestimate the difficulty level of easy items and to underestimate the difficulty level of difficult items -- a phenomenon of regression toward the mean of item difficulty.

The question of whether training will improve the accuracy of subjective estimation of probabilities of outcomes is an important one. In a review of the literature on expert judgment under uncertainty, Beach (1975) reported that research has shown that "feedback about their performance relative to the actual state of affairs, whether it is in terms of evaluation scores, point probabilities, or probability distributions, can lead to improvement in performance" (p. 17).

The literature on expert judgment under uncertainty has provided valuable information for the present study. For example, it is helpful to know that judgment based on consensus may give rise to a normative effect in which individual judges abandon their own judgment in favor of the group judgment. Berk's (1986) advice that judges be allowed to form their own independent estimates after group discussion would be appropriate for the present study. Better estimates may also be achieved if judges are advised to guard against the tendency for people to adjust their judgment toward the central value.

Determinants of Item Difficulty and Discrimination

While research on judgment under uncertainty provides theoretical insights into how systematic errors and biases can occur when people make predictions under uncertainty, a number of other studies have focused on the practical issues of what factors influence the difficulty level and the discrimination of test items.
Intrinsic and Extrinsic Determinants

Campbell (1961), in an attempt to isolate the major factors determining the difficulty of nonverbal items involving classification of geometrical figures, divided the factors into two main groups, intrinsic and extrinsic. Intrinsic determinants pertain to the mental processes that the item is intended to measure, and they include complexity, abstractness, and novelty of item contents, while extrinsic determinants are factors that affect the percentage passing the item but are unrelated to the mental process or processes measured by the item. These factors include unfamiliarity of item content, context of the item, and personality variables.

Item Difficulty Model

Scheuneman and Steinhaus (1987) suggested that intrinsic item difficulty be defined in terms of "the item content, context, characteristics or properties and the task demands set by the item which must be met by an examinee with an assortment of skills and abilities in order to produce a correct answer" (p. 2). It is clear that Scheuneman's idea of intrinsic item difficulty encompasses Campbell's (1961) intrinsic and extrinsic determinants of item difficulty. Scheuneman and Steinhaus (1987) embodied the issues of intrinsic and extrinsic determinants of item difficulty and item discrimination, and a host of other variables that affect observed item difficulty and discrimination, in a theoretical framework. This theoretical framework is summarized by an item difficulty model and an item discrimination model. The item difficulty model (the item discrimination model takes a parallel form) is:

    D_ig = θ_g + π_g + β_i + π_gβ_i + E_gi

where

    D_ig = observed difficulty of item i for group g,
    θ_g = the true ability of examinees in group g on the trait the test is intended to measure,
    π_g = other abilities and attributes that may be used by individual examinees in group g in meeting the task demand,
    β_i = the demand of the item on these different abilities and attributes,
    π_gβ_i = the interaction between the abilities not intended to be measured and the level of ability demanded by the task set by the item, and
    E_gi = error.

According to this model, the intrinsic difficulty of an item is a function not only of the ability demanded by the item task but also of all abilities and attributes which are not intended to be measured by the item but may be used by the examinee in meeting the task demand. More specifically, the intrinsic difficulty is represented by the components β_i and π_g in the model. The observed item difficulty, then, is affected by the true ability of the examinees θ_g, the intrinsic item difficulty, and the interaction π_gβ_i between the ability demanded by the item and the unintended abilities.

The value of the model is that it provides a framework for systematic investigations and/or discussions of variables affecting the intrinsic or the observed item difficulty. For example, the model reminds investigators that, apart from intrinsic item difficulty, examinees' ability is an important determinant of observed item difficulty. Various factors that affect item difficulty and item discrimination, such as item content, item context, item format and item complexity, can be examined under the category "Components of Item Task Demand". In discussing the effect of item content (a component of item task demand) on item difficulty, Scheuneman and Steinhaus emphasized the need to consider incidental demands on knowledge as a source of variation in item difficulty.
These incidental demands, in my opinion, have often been overlooked by both novice and professional item writers.

Intrinsic Determinants

It seems helpful to use Campbell's (1961) terminology to group the literature related to intrinsic item difficulty into (a) studies that deal with intrinsic determinants and (b) studies that investigate extrinsic determinants of item difficulty.

Item complexity. Campbell studied the effect on item difficulty of increasing or decreasing the complexity of geometrical classification items by varying the number of figural properties incorporated into the item. The items were administered to a total of 693 children, from 11 to 12.5 years of age, at ten different schools. The results did not support the a priori prediction that an increase in the number of figural properties will increase the item difficulty. Instead they showed that it was the nature of the properties that were used to classify the geometrical figures, rather than the complexity (as reflected by the number of figural properties), that influenced the difficulty of an item. It should be pointed out here that Campbell's finding is restricted to nonverbal items and it might not generalize to verbal items. Furthermore, the complexity of an item is likely to mean different things in other content areas.

In contrast to Campbell's finding, a number of studies have shown a significant positive association between item complexity and item difficulty. For example, Green (1983) found that item complexity was significantly associated with empirical item difficulty level. Pollitt, Entwistle, Hutchinson, and De Luca (1985) reported that the difficulty of an item was dependent on the complexity of the reasoning processes required to answer chemistry test items. However, a study by Crawford (1968) showed that when complexity of an item was defined by the level of intellectual process at which the item was written, there was no direct relationship between the complexity and the difficulty of items.

The conflicting findings indicate that there is disagreement as to what constitutes item complexity; Campbell (1961) defined complexity of a geometrical classification item as the varying number of figural properties incorporated into the item, whereas in Green's (1984) study, complexity of an item referred to the number of steps and amount of information required to answer the item. Crawford (1968) viewed complexity of an item as the level of cognitive ability required to process the task of the item. Pollitt et al. (1985) adopted yet another definition of complexity. Thus interpretation of the findings should be based on the definition of complexity of items perceived by the investigator.

Cognitive processes. In the literature, the results of studies investigating whether there is a relationship between the cognitive processes required to answer an item correctly and the item difficulty are quite mixed. For example, Malpas and Brown (1974) requested two judges to classify 720 General Certificate of Education ordinary-level mathematics items into one of two levels of cognitive demand, i.e., "concrete" and "formal", using criteria derived from Piaget's theory of cognitive development. They found that the classification category correlated significantly with the difficulty index of the item.
Simpson and Cohen (1985) found that items that could be answered by recalling course information, i.e., "knowledge" items, were significantly easier than items requiring reformulation of course information, i.e., "thinking" items. When difficulty was held constant, item discrimination was significantly greater for knowledge items. In contrast, Crawford (1968) reported that there was no significant relationship between the taxonomy of intellectual processes and item difficulty for the items in the Comprehensive Interdepartmental Examinations of the College of Medicine, University of Illinois. In this case, the taxonomy of intellectual processes was: knowledge, generalization, problem-solving of a familiar type, problem-solving of an unfamiliar type, and evaluation.

While the above two studies allowed cognitive processes to be confounded with content, Blumberg et al. (1982) studied the relationship between cognitive processes and item difficulty holding the content constant. In this case only three taxonomy levels were used, i.e., recognition of information, interpretation of data, and application of knowledge. The result showed that there was no significant relationship between taxonomic levels and item difficulty. In reviewing the literature on Bloom's taxonomy, Scheuneman and Steinhaus (1987) reported that research "has failed to demonstrate a clear link between these [cognitive] process variables and item difficulty" (p. 17).

More recently, with the revival of cognitive psychology, researchers began to focus on the cognitive components required to process item tasks as a source of variation in item difficulty. Cognitive components are more specific than the broad taxonomy schemes used by Bloom and others to classify mental processes. Research in this area has concentrated mainly on individual differences in the cognitive components employed for solving verbal analogy items. For example, Whitely (1980) found that, in the process of solving verbal analogy items, there were individual differences in (a) image construction, (b) response evaluation, and (c) event recovery. However, only image construction and response evaluation are related to item difficulty. In a study by Mitchell (1983), the items in the Word Knowledge and Paragraph Comprehension subtests of the Armed Services Vocational Aptitude Battery were rated on cognitive components: (a) perceptual processing, (b) executive processing, (c) short-term storage, (d) long-term storage of information structures, and (e) selection and execution of the response. These cognitive processes were found to be related to Rasch model item difficulty. Research in this area focused mainly on verbal items and individual differences. Hence it is less relevant to the present study.

Extrinsic Determinants

Campbell (1961) considered factors affecting item difficulty, but not related to the mental processes intended by the items, to be the extrinsic determinants of item difficulty. Numerous studies on the extrinsic determinants of item difficulty have been reported in the literature.

Item language. The findings of research on the effects of item language on item difficulty have been quite inconsistent. Millman (1978) used a computer program to generate items in elementary statistics that vary in linguistic presentation but test the same concept. The items were administered to one class of students who took the course. He found that linguistic variations were related to item difficulty.
However, because of the small sample size, he cautioned against generalizing the results. In an investigation using undergraduate students as subjects, Green (1983) found a low association between language difficulty and empirical item difficulty for 10 items selected from a 40-item test in Introductory Astronomy. In a subsequent study, Green (1984) varied the difficulty of item language by increasing sentence length and syntactic complexity, and by replacing more familiar terms with unfamiliar ones in the stem. The subjects were 990 students in 19 separate undergraduate classes at the University of Washington. The results showed that language difficulty of the stem did not affect item difficulty. The author suspected that the manipulation of language difficulty in the study was not effective. In reviewing several studies on item language, Green (1984) reported that variations in item language difficulty have been found to affect item difficulty with samples of young children, but with high school and college samples the results have not been consistent. She suggested that difficulty of language would have no effect on item difficulty once individuals have reached some criterion of verbal proficiency.

Content familiarity. Pollitt, Entwistle, Hutchinson, and De Luca (1985) attempted to determine the factors consistently associated with item difficulty beyond simple content considerations. They analyzed the top and bottom groups of 550 answer scripts of candidates taking the 'O' level Chemistry Examination in Scotland. Pollitt et al. found that if the answer to a chemistry test item was based on a knowledge of the properties of a chemical that was either unfamiliar to the candidates or could not be deduced from a set of more basic knowledge, the difficulty of the item would be high. For example, an item required the candidates to differentiate between the properties of sodium and the properties of barium. If the candidates did not know the properties of barium and these properties could not be deduced from a knowledge of the properties of other chemicals, the difficulty of the item would be very high. The conclusion of the study is consistent with the idea of "incidental demands of an item" introduced by Scheuneman (1987). In this case the item intended to test the knowledge of the properties of sodium. However, candidates could not answer the question unless they also knew the properties of barium. The findings of this study have direct applications for the present study, which requires judges to estimate difficulty levels of chemistry test items. Judges could be advised to focus on the possibility that answers to some of the items may require content knowledge unfamiliar to the examinees, resulting in items becoming excessively difficult.

Item format. Several studies have attempted to ascertain the effects of item format on item characteristics. Dudycha and Carpenter (1973) manipulated item orientation (positive vs. negative stem), structure (closed stem vs. open stem format) and option (presence or absence of an inclusive alternative). The items were administered to 1,124 students taking a regularly scheduled Introductory Psychology test. All three variables were found to significantly affect item difficulty -- negative stems were more difficult than positive stems, the open stem format more difficult than the closed format, and inclusive alternatives more difficult than specific alternatives. No interactions were found between these three independent variables.
However, only the "presence or absence of an inclusive alternative" factor significantly affected item discrimination. Items with specific alternatives discriminate better than items with inclusive alternatives. The results also showed an interaction between the "closed or open" factor and the "positive or negative" factor. The closed-positive and the open-negative formats were slightly more discriminating than the closed-negative and the open-positive formats.

Hughes and Trimble (1965) studied the effect of a complex distractor, "Both 1 and 2 above are correct", on item difficulty and item discrimination. The result showed that this type of complex alternative can increase item difficulty. However, its effect on item discrimination was not clear. More recently, in a series of studies comparing item characteristics for parallel multiple-choice items in three different content areas -- statistical terminology, measurement concepts and synonyms -- Tollefson and Chen (1986) concluded that items using "none of the above" as a correct answer have a higher difficulty level but are not more discriminating than items with a one-correct-answer format. Forsyth and Spratt (1980) found that multiple-choice items with "Not given" as an alternative are more difficult than items not using this alternative. Considering the studies together, the evidence seems to point to the conclusion that item format affects item difficulty and item discrimination.

Multiple-choice options. The effects of the characteristics of item options on item characteristics have been the subject of investigation of numerous researchers (Chase, 1964; Dudycha & Carpenter, 1973; Dunn & Goldstein, 1959; Millman, 1978). Homogeneity of options has been suggested by several authors of measurement textbooks as an important factor affecting an item's difficulty and discrimination index (Chase, 1974; Nitko, 1983). Ebel (1979) stated that "multiple-choice items can be made easier by making the stem more general and the responses more diverse; items can be made harder by making the stems more specific and the responses more similar" (p. 162). Green (1984), studying the effects of item characteristics on multiple-choice item difficulty, reported that when the options of an item became more convergent the item became more difficult. Pollitt et al. (1985) found similar results.

Research has verified the suggestions made in major measurement textbooks that length of options, use of technical options, and the extent of grammatical inconsistencies across stem and options influence the selection of correct answers (Ebel, 1979; Mehrens & Lehmann, 1984; Thorndike, 1971). For example, Strang (1977) found that nontechnical options were more often chosen as correct answers than were technical options, and long options were more often chosen as correct answers than were short options, regardless of whether the options were technical or nontechnical. Long nontechnical options were most often chosen as correct answers while short technical options were least often chosen. The result regarding the length of options is consistent with the findings of Dunn and Goldstein (1959), which showed that items containing extra-long correct alternatives are less difficult. The results of these two studies imply that items with an incorrect option which is longer than the others and is nontechnical will be more difficult, because examinees have a higher tendency to choose this type of option as the key. This finding provides an important clue for judging item characteristics.
Even though the majority of measurement textbooks make "stem-options grammatically consistent" a rule in item writing, the results of studies on the effect of grammatical inconsistencies between stem and distractors on item difficulty are mixed (Board & Whitney, 1972; Dunn & Goldstein, 1959). In a study on grammatical consistencies, Plake and Huntley (1984) reported that there was some evidence of differential sensitivity between males and females toward subtle cues in items which achieved vowel-consonant consistency between stems and the correct alternatives by adding "an" parenthetically to the article "a".

Item context. The literature of educational measurement is replete with research which attempted to ascertain the presence or absence of effects of item context on item statistics. In a review article of some 40 studies, Leary and Dorans (1985) grouped research on the effects of item context into three main categories: (a) single-factor, item order effects, (b) item order interaction effects, and (c) section placement effects. They concluded that there was evidence of item context effects. When tests were speeded, items arranged in easy-to-hard sequence resulted in better test performance than when items were arranged in hard-to-easy sequence. Under power conditions, random rearrangement of items or of sections of items measuring the same contents did not affect item difficulty. Another conclusion drawn by Leary and Dorans was that aptitude test items are more sensitive to item rearrangement than achievement test items. These findings suggest that when judges estimate the item statistics of an item, an important factor to consider is whether the test will be administered under speeded or power conditions and whether the item is likely to appear at the beginning or at the end of the test.

Thus research on what makes test items difficult has not resulted in a clear set of factors affecting item difficulty. A comparison of the methodologies used in this research reveals that from one study to another there is a great deal of difference in the sample sizes and the test instruments used, and in the way variables are defined and manipulated. Under such circumstances, inconsistent research findings are inevitable. In spite of this variability, valuable information can be gained from this research. The fact that various researchers are studying the effects of complexity, cognitive processes, item language, item format, item options and item context on item characteristics indicates that these variables are perceived by measurement researchers as having an influence on item statistics. In areas where the findings showed conclusive relationships between a particular factor and the item statistics, that variable could be incorporated either into the training program (to improve estimation skills) or into the design of the present study. In areas where the findings did not show a conclusive relationship between the variables and the item statistics, judges could still be sensitized to the possible influence of these variables on item statistics.

Empirical Evidence of Accuracy of Estimates

A number of studies have been carried out to investigate how consistently and accurately judges could estimate item statistics. Since this group of studies is similar to the present study, a more detailed review was made with the view of incorporating their findings and methodology into the design of the present study.
Estimation by Professional Examiners

Tinkelman (1947) investigated the degree to which item difficulty could be predicted prior to actual test administration. Thirty experienced and competent professional examiners of public personnel agencies estimated the percentage of candidates likely to answer correctly each of the 100 items in a multiple-choice test for selecting patrolmen. The empirical item difficulties of the items were determined from 1,000 answer papers selected at random from the 30,000 candidates to whom the test was administered. To find out how well each judge could estimate the difficulties of the 100 items, Tinkelman correlated the item difficulties estimated by each judge with the empirically determined item difficulties. He found that the judges could estimate the relative difficulties reasonably well (median correlation = 0.53; range from 0.23 to 0.77). The correlation between the pooled estimates and the empirical item difficulties was 0.76. To investigate the consistency of estimation, Tinkelman divided the test into two halves (one consisting of the 50 odd-numbered items and the other of the 50 even-numbered items) and compared, for each judge, the correlation coefficient computed for the estimates of the 50 odd items with that of the 50 even items. The result showed that judges were consistent in their judgments about relative item difficulties in the two sets of items. The investigator was aware that the two halves (i.e., the two sets of 50 items) might not be equivalent and that there was no time interval between the two estimations. However, these two factors affect the correlation coefficients of the two halves in opposite directions: non-equivalent halves increase the difference between the correlation coefficients, while no time interval reduces the difference. Thus the comparison should be treated with caution. Although judges could estimate the relative difficulties of items well, the investigator discovered that they tended to estimate difficulty toward the center of the scale; that is, difficult items were judged easier than their empirical difficulty and easy items were judged more difficult. This implied that judges were not able to judge the absolute item difficulties well. Tinkelman also investigated the group size of judges that would give optimum accuracy in estimating item difficulties. He selected judgment groups of varying sizes on the basis of the judges' ability to predict the relative difficulties of the set of 50 odd-numbered items. By comparing the accuracy with which these different judgment groups predicted the relative difficulties of the set of 50 even-numbered items, the investigator found that pooling the judgments from only the top three of the most competent judges provided predictions of relative difficulty as accurate as those provided by the entire group of judges. This finding has an important application for the present study: it is the competency and not the group size of the judges that determines the accuracy of estimation. If incompetent judges can be identified by a preliminary trial, improvement of judgment accuracy can be achieved by pooling only the judgments of the competent judges. The same study found no relation between item content and the relative accuracy of estimation. Different item content areas, however, were found to have different constant estimation errors (defined as the difference between the estimated and the empirical item difficulties). In
a similarly motivated study, Bejar (1981) investigated the accuracy with which four experienced professional test development staff at Educational Testing Service estimated item statistics of the items in the Test of Standard Written English (TSWE). In an attempt to increase the accuracy of estimation, a training component was included in the study. The results of the study showed that while the interrater reliabilities for item difficulties and item discrimination indices were very high (about 0.90), the correlation coefficients between the subjectively estimated item statistics and the empirically determined item statistics did not approach the level that would be required to substitute subjective estimation for field trial. This low level of accuracy of estimation could probably be attributed to the fact that the items used in the study attempted to measure writing skills. While in a subject matter like mathematics the difficulty of an item is largely determined by the mathematical operations required to solve the problem, the difficulty of an item measuring writing skills depends on a greater variety of factors. According to Bejar (1981), "it is probably not sufficient to determine what error is present in an ...item [measuring writing skills], for the semantic and syntactic context in which that error is presented may influence item statistics significantly" (p. 303). The question of what type of errors in what semantic and syntactic context will result in higher or lower item difficulty is extremely difficult to judge. The investigator suggested that more raters may be required to achieve a high level of correlation.

Improvement by Anchor Items

The result of the study by Tinkelman (1947), that judges could estimate the relative but not the absolute difficulties of items well, generated interest among other researchers in investigating this problem further. In an attempt to improve estimates of both relative and absolute item difficulty, Lorge and Kruglov (1952) hypothesized that estimates of item difficulty would improve if judges had knowledge of the difficulty of some similar items, and that additional information would make for greater consistency of the judges' estimates. Eight doctoral students rated the difficulties of 150 arithmetic items. Four of the judges were given the 150 items, with 30 items whose actual difficulties were known. The other four judges were given the same 150 items without information about the difficulty of any of the items. The judges were required to rank the items according to difficulty and then estimate the percentage of eighth-grade students passing each item. The product-moment correlation between estimated and actual item difficulties was computed for each group. No significant difference was found. Both groups were also found to overestimate the average difficulty of the items to the same extent. This means that the absolute item difficulties were not estimated more accurately by the group of judges who were given information about the difficulty of similar items than by the group who did not receive that information. The failure of the additional information (about some item difficulties) to improve the accuracy of estimation (both relative and absolute) could be because the judges were all doctoral students and were not especially oriented in the teaching of arithmetic or familiar with the ability of the population taking the test.
In a subsequent study, Lorge and Diamond (1954) defined competent judges as judges whose mean and standard deviation of difficulty estimates for a set of items approximated the empirical mean and standard deviation of the item difficulties. They found that providing "anchor items", i.e., items with known empirical item statistics, improved the accuracy of the estimates of the less competent judges to a greater extent than it improved the accuracy of the competent judges.

Improvement Using Experienced Teachers

In another study, Lorge and Kruglov (1953) used as judges persons experienced as teachers of mathematics at the high school level to estimate the absolute difficulty of mathematics items under two conditions: one in which the difficulties of a subset of items were given and the other with no information about item difficulties. The difference between the average estimated difficulty of the items and the average empirical difficulty was computed for each of the two conditions. The results showed that judges under both conditions underestimated the average difficulty of the items. However, the degree of underestimation was substantially smaller under the condition in which information about the difficulty of a subset of items was given. The result suggested that experienced teachers were able to make use of the additional information to improve the accuracy of their estimation. This finding offers a guideline for selecting judges for the present study: judges should be selected from experienced subject-matter teachers rather than just from people who have some knowledge of the subject matter.

Improvement by Rank Order Prediction

In an extension of the previous study, Lorge and Diamond (1954) assumed that (a) absolute item difficulties were normally distributed, (b) the correlation between the rank order for difficulty and the absolute difficulty in percent is 1.0, and (c) the mean and the standard deviation of the difficulties of the judged group of items were known, or could be estimated with very little error. They then demonstrated that a better estimate of the absolute difficulties of test items could be obtained by predicting from the average rank order assigned by judges than by averaging judges' estimates of the percentage likely to pass each item. The investigators also demonstrated a technique for estimating the mean and standard deviation of the difficulties by including a set of anchor items of known difficulties among the items to be judged. The technique involved extrapolation of the quartiles of the distribution of item difficulties and computation of the mean and standard deviation using the relationship between the median and the mean, and that between the semi-quartile range and the standard deviation. The study also found that providing judges with information about the item difficulties of a subset of the items, to anchor their judgment, improved the accuracy of estimation. This result was consistent with the findings of previous research. The technique used by Lorge and Diamond can be criticized in that the second assumption, that the correlation between the rank order for estimated difficulties and the empirical difficulties is 1.0, is too stringent. In situations where the correlation is not 1.0 and where the rank orders for the estimated difficulties of some of the items do not vary strictly with the empirical difficulties, the mean and the standard deviation of the distribution of the difficulties cannot be estimated. When this happens the technique cannot be applied.
Improvement by Elaborate Written Report

Quereshi and Fisher (1977), in an attempt to gain insight into the question of logical estimation of item difficulty, went beyond the Lorge-Kruglov approach by asking the judges to develop a written report elaborating the processes and the criteria they used to arrive at their estimates of item difficulties. Five judges who had completed two years of a graduate program in psychology and had experience in the administration and interpretation of psychological tests ranked 44 letter-series items and then rated the items on a 1 to 10 point scale (1 being the easiest). Pooled estimates of rank order and rating for each item were computed. The empirical rank order and rating of each item were obtained by administering the test to 186 undergraduates. Spearman rank correlations were computed between the subjectively estimated rankings and the empirical rankings of the items for each judge separately as well as for the pooled estimates of the judges. The same correlations were computed for the ratings of items as well. Interjudge consistency was then studied by computing the Pearson product-moment correlations among the ratings of the five judges and the Spearman rank correlations among the ranks. From the intercorrelation indices and the reports of judges describing the criteria on which they based their estimates, the investigators concluded that the accuracy of estimates depended on how elaborately a judge analyzed the structure and organization of the items. The finding of this study offers a strategy for improving judgment: judges should analyze the structure and organization of the items before making their estimates of item difficulties. The same strategy was also employed by Bejar (1981) in his study of subject matter experts' assessment of item statistics.

Studies Involving Subjective Estimation

A number of studies were related to the ability of judges to subjectively estimate the item difficulties for minimally competent examinees, especially in standard-setting research (Berk, 1986). Melican and Thomas (1984) used Angoff's (1971) Standard Setting Method to identify items whose difficulty levels are hard to estimate accurately. The results of the study suggested that the difficulties of items involving calculation as well as of items with negatively phrased stems were harder to estimate. In both cases judges tended to underestimate the difficulties. In a study using a method based on Nedelsky's (1954) and Angoff's (1971) models to determine the cut-off score for a certification examination, Bernknopf (1979) instructed the judges "to draw upon their experience to construct a hypothetical group of persons, each of whom, in their judgment, has the minimum amount of academic knowledge to perform effectively in the schools, and then to estimate the percentage of the candidates who would know the answer" (p. 8). The group's estimates of item difficulties were found to correlate highly with the empirical item difficulties. The method used by Bernknopf indicated that "appropriate experience" was a crucial element for the success of the approach. A study that attempted to partial out the effect of the content relevance of items from the accuracy of estimation of item difficulties was carried out by Ryan (1968). Fifty-nine secondary-level mathematics teachers estimated the difficulties of 50 multiple-choice mathematics items.
It was found that the ability of the teachers to estimate item difficulties was higher when the content of the items was covered in the instruction. When content relevance was partialed out, only in one of the four subtests was there a substantial decrease in the proportion of teachers having significant correlations (between estimated and empirical item difficulties). This showed that content relevance was not the only criterion on which judgment of item difficulty was based.

Poor Estimation by Unqualified Judges

That unqualified judges make poor estimates of item difficulties is evident in a study conducted by Willoughby (1980). Eight non-physicians independently rated 30 items from a medical examination for format, relevance, difficulty, discrimination, and overall quality on a scale of 1 to 5. Group estimates, for each dimension and for each item, were obtained by computing means across judges. The empirical item statistics were obtained from 345 medical students. No significant correlation between estimated and empirical item difficulties was found. However, there was a significant correlation between estimated and empirical item discrimination. In this study, there was no evidence that the judges had medical knowledge. Neither did they have knowledge about the characteristics of the population taking the test. Since item difficulty depends on both the intrinsic difficulty of the item and the characteristics of the population taking the test, it is not surprising that the estimated item difficulties did not correlate significantly with the empirical item difficulties.

Discussion and Summary

The literature related to the present study has suggested that three judgmental heuristics may be in operation when people make judgments under uncertainty. Judgments under uncertainty are influenced by the representativeness of events, the availability of instances, and the tendency to make adjustments from an intuitive starting point. Tversky and Kahneman (1974) stated that "these heuristics are highly economical and usually effective, but they lead to systematic and predictable errors" (p. 113). Research on expert judgment under uncertainty seems to suggest that consensus judgment may lead to normative effects in which individual members, under the influence of other group members, abandon their own estimates and accept group estimates. A suggestion to overcome this problem is to allow individual members to make independent estimates at the final stage of group discussions. The advantage of this suggestion is that group discussions provide the opportunity for the members to debate and to study the criteria for making judgments, while independent estimation allows individual judgments to contribute to the overall estimation. The tendency for people to avoid making extreme judgments was identified as a source of error. Research has shown that training, in terms of providing feedback about judges' performance relative to the actual values, will improve their performance. An understanding of the psychology of judgment under uncertainty is important for the present study, which requires judges to make subjective estimates of item statistics. Before the judges estimate the item statistics, it would be a good strategy to brief them about the systematic errors and biases that these judgmental heuristics can lead to, so that better estimates of item difficulty can be made. The findings of research on expert judgment also provide valuable information.
An implication of these findings is that group discussions to better understand or to identify the specific determinants of item statistics, followed by independent individual judgments of the estimates, would be a suitable procedure to adopt for the present study. The usefulness of providing feedback of the empirical item statistics during the training sessions (designed to improve estimation skills) has also been indicated by these research findings. Several determinants of item difficulty and item discrimination have been studied by various investigators. The determinants were broadly divided into (a) intrinsic determinants and (b) extrinsic determinants (Campbell, 1961). Intrinsic determinants include item complexity and the cognitive processes/components required to process item tasks, whereas extrinsic determinants include item language, content familiarity, item format, option homogeneity, grammatical inconsistency, option characteristics, and item context. Although the results of these studies have not been conclusive as to what factors affect item statistics, there seems to be evidence that complexity, the cognitive components required to process item tasks, content familiarity, similarity of item options, item format, and item context are closely related to item difficulty. However, the complexity of items was defined differently in different studies. Thus the definition of complexity should be examined carefully before the findings of the studies are applied to other situations. The literature has provided insights into the question of subjective judgment of item statistics in terms of possible factors affecting the accuracy of estimation. This information would be utilized in the present study to design and develop the intervention program to improve the accuracy of estimation. Forming small groups to discuss the possible impact of these factors on item statistics might sensitize the judges to the need to focus on these factors while making estimates. This may lead to more accurate estimates. A number of studies have shown that relative, but not absolute, item difficulties can be estimated well. Accuracy of estimation generally improved when the estimates of judges were pooled; pooling only the estimates of the competent judges provides a more accurate estimation than pooling the estimates of the entire group of judges. It was found that estimates tended to regress toward the mean and that the accuracy of estimation would improve if a subset of items with known item difficulties was provided to enable the judges to anchor their judgments. Lorge and Diamond demonstrated that prediction from the average rank order assigned by judges produced more accurate estimates than estimates obtained by averaging the judges' estimates. However, as mentioned in an earlier section of this review, this procedure requires a stringent assumption which is difficult to satisfy. Hence this method has not been adopted for the present study. Nevertheless, the design of the present study could take advantage of the fact that accuracy of estimation will improve if judges are given examples of similar items with known empirical item difficulties, i.e., the idea of anchor items has been incorporated into the design. Pooling the estimates of only the competent judges rather than the estimates of the entire group of judges is another step that can be taken to improve the accuracy of estimation. In the studies
of subjective estimation of item statistics reviewed here, with the exception of the study by Ryan (1968), there was no evidence that the judges had sufficient knowledge about the ability of the subjects who took the test. Psychometric theory has shown that item difficulty depends both on the intrinsic difficulty of the item and on the characteristics of the population taking the test. Thus, ignorance of the characteristics of the population taking the test could be a reason why the judges were able to estimate only the relative, but not the absolute, item difficulties well. In this connection, it seemed appropriate for the present study to use judges who were experienced subject teachers and who had been involved either in rating students' examination papers or in constructing test papers for public examinations to be taken by their own students. The findings that relate to the present study are summarized in Table 2.

Table 2.--Summary of Findings Related to the Study.

I. Judgment Under Uncertainty

1. Guilford (1954): There is a tendency for people to avoid making extreme judgments.
2. Winkler (1971): Consensus judgments are more accurate than the average of the individuals' judgments.
3. Tversky & Kahneman (1973, 1974): Demonstrated that the representativeness, availability, and anchoring-and-adjustment heuristics are employed by people to predict values; these heuristics may lead to systematic errors.
4. Beach (1975): Experts may falsify their opinions in the hope of swaying other experts toward their points of view; group judgments were more extreme than would be made by any member of the group as an individual.
5. Fitzpatrick (1983): Normative effects of opinion exposure may occur, which result in individuals abandoning their own judgments in favor of consensus.
6. Berk (1986): Suggested that a way to avoid the normative effect is to allow experts to make their own independent judgments after group discussions.

II. Determinants of Item Statistics

7. Campbell (1961): Classified determinants of item difficulty into intrinsic and extrinsic factors; intrinsic determinants pertain to mental processes; extrinsic determinants are factors affecting the difficulty of items but are unrelated to mental processes.
8. Scheuneman & Steinhaus (1987): Proposed an Item Difficulty Model which provides a framework for systematic investigations and discussions of determinants of item statistics.
9. Green (1983): The number of steps and the amount of information required to answer an item affect the difficulty of the item.
10. Pollitt et al. (1985): The difficulty of a chemistry test item depends on the complexity of the reasoning processes required to answer the item; examinees' familiarity with the content affects item difficulty.
11. Hughes & Trimble (1965): Complex distractors such as "Both 1 and 2 above are correct" increase item difficulty.
12. Crawford (1968), Malpas & Brown (1974), Simpson & Cohen (1985), Blumberg et al. (1982): Results regarding the relationships between cognitive processes and item difficulty were mixed.
13. Green (1984): Variations in item language have no effect on item difficulty once individuals have reached some criterion of verbal proficiency; when options become more convergent, the item becomes more difficult.
14. Dudycha & Carpenter (1973): Items with negative stems are more difficult than items with positive stems; the open-stem format is more difficult than the closed format; inclusive alternatives are more difficult than specific alternatives.
15. Strang (1977): Nontechnical options and long options are more often chosen as the correct answers.
16. Dunn & Goldstein (1959): Items with extra-long options as the correct answers are less difficult.
17. Leary & Dorans (1985): Concluded from a literature review that item context has a greater effect on items in a speeded test than on items in a power test; aptitude test items are more sensitive to item rearrangements than achievement test items.

III. Empirical Evidence of Accuracy of Estimates

18. Tinkelman (1947): Judges could estimate the relative, but not the absolute, difficulty of items well; judges tend to regress their estimates toward the center of the scale; it is the competency and not the group size of the judges that determines the accuracy of estimation.
19. Bejar (1981): Interrater consistency for item difficulties and item discrimination indices was high (about 0.9); the correlation coefficients between subjectively estimated item statistics and empirically estimated item statistics were low.
20. Lorge & Kruglov (1952): Judges without experience in teaching arithmetic were not able to make use of the information in the anchor items to improve the accuracy of estimation of arithmetic items.
21. Lorge & Kruglov (1953): Experienced high school mathematics teachers improved their accuracy of estimation for mathematics items when provided with anchor items.
22. Lorge & Diamond (1954a): Providing anchor items improved the accuracy of the estimates of the less competent judges more than it did those of the more competent judges.
23. Lorge & Diamond (1954b): With certain assumptions, better estimates of item difficulties could be obtained by predicting from the average rank order of items assigned by individual judges.
24. Quereshi & Fisher (1977): Accuracy of item difficulty estimation depends on how elaborately a judge analyzes the structure and organization of the items.
25. Melican & Thomas (1984): Difficulties of items involving calculation and of items with negatively phrased stems were harder to estimate.
26. Ryan (1968): Even though the ability of teachers to estimate item difficulties was higher when the content of the items was covered in instruction, content relevance was not the only criterion on which judgment was based.
27. Willoughby (1980): The item difficulties of 30 medical examination items estimated by 8 non-physicians did not correlate significantly with empirically determined item difficulties.

CHAPTER III

PROCEDURES AND DESIGN

The purpose of this study was to investigate how accurately experienced chemistry teachers could estimate the item statistics of the Chemistry test used in the Malaysian Certificate of Education Examination. The accuracy of estimation of the experienced teachers trained in estimation skills was compared with that of the experienced teachers not trained in estimation skills. Further, the questions of whether the accuracy of estimation is dependent on the content areas, the difficulty levels, the discrimination power, the cognitive levels, and the format of the items were examined. Finally, the accuracy of the teachers' estimates was compared with the accuracy of estimation obtained in a field-trial of the item pool.
This chapter includes a description of the sampling procedure, the subjects involved, the test materials used, the design of the study, the hypotheses to be tested, and the statistical procedures for testing the hypotheses of this study.

Sampling Procedure

Sampling of Teachers

A total of 30 teachers participated in this study. These teachers were chosen based on information which indicated that they had a Bachelor's Degree in Chemistry and were currently teaching, or had recently taught, the Chemistry examination classes. In addition, teachers with additional experience in examination work (serving on item-writing panels, as raters of the essay or practical components of the Chemistry Examination, or in administering the Chemistry Practical Examinations) were chosen in preference to teachers who did not have these experiences. Because of the high cost of the attendance allowance, subsistence allowance, and traveling expenses, the total number of teachers was restricted to an affordable number of 30 and to an area within the Federal Territory of Kuala Lumpur, where subjects could commute between their homes and the place of meeting. No single sampling frame was readily available. As a result, a list of 30 teachers who satisfied the conditions mentioned above was compiled from different sources, such as various lists of panel members, examiners, and teachers who had administered the Chemistry Practical Examination. The resulting list contained teachers who held in common the necessary academic qualifications but who differed in the number of years of experience teaching the subject, in the type of experience they had in examination work, and in gender and ethnicity. Only two had experience as item-writing panel members. This research required one treatment group and one control group. Dividing this final list of teachers into two groups using a simple random assignment procedure might not have resulted in two equivalent groups, especially with so small a sample. Thus a procedure similar to stratified random sampling was used. The two teachers who had considerable experience in item writing were of the same gender and ethnicity, and they were grouped as one stratum. The rest of the teachers were grouped into strata of the same gender and ethnicity. Teachers from each stratum were then randomly assigned to one of the two groups. In the first stratum, where there were only two teachers, a coin was tossed to decide their group memberships. In each of the other strata, slips of paper, each bearing the identification number of an individual, were placed in a container and mixed thoroughly. The required numbers of teachers were then drawn one at a time from the container and assigned to either the treatment group or the control group. The sampling process produced two lists of 15 teachers each. Official letters inviting the selected teachers to participate in an Item Statistics Estimation workshop were sent through the Principals of the schools where the teachers worked. The teachers in the treatment group were invited to attend a three-day training/workshop session, while the teachers in the control group were invited to a one-day workshop. The two workshops were scheduled in two separate weeks. The letters specified the nature of the workshop and stated that subsistence and traveling allowances would be paid accordingly. The teachers were requested to return a reply slip confirming their consent to participate in the workshop two weeks prior to the workshop.
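The within-stratum random assignment described above can be summarized in a few lines of code. The sketch below (in Python) is purely illustrative: the roster, stratum labels, and seed are hypothetical, and the random module stands in for the slips of paper drawn from a container; it is not a record of the assignment actually carried out. With an odd-sized stratum the split below favors the control group by one teacher, so in practice the draws would be balanced across strata to reach 15 teachers per group.

import random

def assign_within_strata(teachers, seed=None):
    """Randomly split teachers into treatment and control within each
    (gender, ethnicity) stratum, mirroring the slips-of-paper draw."""
    rng = random.Random(seed)
    strata = {}
    for name, gender, ethnicity in teachers:
        strata.setdefault((gender, ethnicity), []).append(name)

    treatment, control = [], []
    for members in strata.values():
        rng.shuffle(members)             # analogous to mixing the container
        half = len(members) // 2
        treatment.extend(members[:half]) # first half drawn goes to treatment
        control.extend(members[half:])   # the remainder goes to control
    return treatment, control

# Hypothetical roster: six teachers in two gender-by-ethnicity strata.
roster = [("T01", "F", "A"), ("T02", "F", "A"), ("T03", "M", "A"),
          ("T04", "M", "A"), ("T05", "F", "B"), ("T06", "F", "B")]
treatment, control = assign_within_strata(roster, seed=7)
print(treatment, control)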
The relevant data concerning the teachers in the treatment and the control groups are shown in Appendix A.

Sampling of Students

The part of the study concerning discrimination index estimation required an item analysis with a small sample of about 100 students. (A distinction should be made between the sample of students used in the regular item analysis research carried out by the Examinations Syndicate in the process of constructing the tests and the small sample of students used specifically for this study; the regular item analysis research has a much larger sample size.) In this study, two forms of test items were used. Each form was to be tried out with about 100 current Chemistry examination class students. Two schools with medium performance in the subject of Chemistry in the Malaysian Certificate of Education (MCE) Examination were chosen. Medium performance schools were defined as schools whose percentage of passes in MCE Chemistry is similar to the national average. Both schools were located on the outskirts of the city of Kuala Lumpur. One of the schools provided 101 students and the other provided 118 students.

Sample Items

Two "Chemistry Paper 1's" of the Malaysian Certificate of Education (MCE) Examination were used for this research. The criteria used for choosing a particular paper were the availability of the parameter values of the item characteristics and of the item-pool estimated values of the same. The test paper is a 75-minute test consisting of 40 multiple-choice items, each with 5 options. Of the 40 items, about 25 are of the single-answer multiple-choice type and 15 are of the multiple-answer multiple-choice type. An example of each type is shown in Table 3.1. The two test papers used in this study measure the same content areas of chemistry. The internal consistency reliabilities (KR20) of the tests were 0.917 and 0.909, and their standard deviations were 9.324 and 8.769. The population values of the item characteristics were routinely computed in the post-analysis of the tests by the Examinations Syndicate. These population values were available for the present research. Since the focus of this study was on the estimation of item statistics rather than on students' performance, it is more appropriate to use the term "forms" rather than "tests" to refer to the groups of items used. Thus the two tests are referred to as Form A and Form B.

Table 3.1.--Examples of Types of Items.

SINGLE-ANSWER MULTIPLE-CHOICE ITEM

The chloride of metal M has the formula MCl and potassium phosphate has the formula K3PO4. What is the formula of metal M phosphate?

A  MPO4    B  M(PO4)3    C  M3PO4    D  M2PO4    E  M2(PO4)3

MULTIPLE-ANSWER MULTIPLE-CHOICE ITEM

Direction:
A: I, II, III only    B: I, III only    C: II, IV only    D: IV only    E: I, II, III, IV (all four)

H2(g) + I2(g) ⇌ 2HI(g)    Heat change = negative

Which of the following changes will increase the yield of hydrogen iodide in the above equilibrium system?

I   Add more hydrogen into the system.
II  Reduce the temperature of the system.
III Remove hydrogen iodide from the system.
IV  Increase the pressure of the system.

Design

This research utilized an experimental design. Two equivalent groups, each consisting of 15 volunteer experienced chemistry teachers, were formed by a procedure similar to stratified random sampling. One group was assigned to the treatment condition and the other served as a control. The treatment group received training in estimation skills for two days and the control group was not trained.
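The item statistics at the center of this study -- the p-value (proportion of examinees answering an item correctly) and the point-biserial correlation between the item score and the total test score -- together with the KR20 reliabilities quoted above, are all routine computations on a scored 0/1 response matrix. The following sketch is only meant to fix the definitions in computational form; the data are invented, and the code is not the Examinations Syndicate's post-analysis program.

import math

def item_statistics(responses):
    """Compute per-item p-values and point-biserials, plus KR-20.
    responses: list of examinee score vectors (1 = correct, 0 = wrong)."""
    n = len(responses)                 # number of examinees
    k = len(responses[0])              # number of items
    totals = [sum(r) for r in responses]
    mean_t = sum(totals) / n
    sd_t = math.sqrt(sum((t - mean_t) ** 2 for t in totals) / n)

    p_values, point_biserials = [], []
    for j in range(k):
        item = [r[j] for r in responses]
        n_correct = sum(item)
        p = n_correct / n              # item difficulty (p-value)
        if 0 < n_correct < n and sd_t > 0:
            mean_correct = sum(t for t, x in zip(totals, item) if x) / n_correct
            r_pb = (mean_correct - mean_t) / sd_t * math.sqrt(p / (1 - p))
        else:
            r_pb = 0.0                 # undefined when an item has no variance
        p_values.append(p)
        point_biserials.append(r_pb)

    kr20 = (k / (k - 1)) * (1 - sum(p * (1 - p) for p in p_values) / sd_t ** 2)
    return p_values, point_biserials, kr20

# Tiny made-up response matrix: 5 examinees by 4 items.
data = [[1, 1, 0, 1],
        [1, 0, 0, 1],
        [1, 1, 1, 1],
        [0, 0, 0, 1],
        [1, 1, 0, 0]]
p, r_pb, kr20 = item_statistics(data)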
Both groups estimated the item statistics -- the p-value, and the point-biserial correlation coefficient between the item score and the total test score -- for the two forms of Chemistry test items. As mentioned in the previous section, each form contained 40 multiple-choice items.

Procedure

Three days were scheduled for the training/workshop session for the treatment group. The teachers reported to a panel room in the Examinations Syndicate, Ministry of Education, Malaysia, at 8.00 a.m. on each day of the scheduled training/workshop session. The first two days were used to "train" or to help these teachers to develop item statistics estimation strategies and skills. The whole of the third day was reserved for the teachers to actually estimate the item statistics.

Training. The training session consisted of two components, one theoretical and one practical. It involved the following steps:

A. Theoretical component. The purpose of and rationale for carrying out the research were first introduced to the teachers. The implications of the success of the project, such as the possibility of the method developed in this research being applied in future estimation procedures, were also explained. The teachers were requested to try their best and to use the first two days of the training/workshop session to develop some strategies and skills for item statistics estimation. The training session then followed this sequence:

a. Definitions of item statistics were first explained to the teachers. An example in which statistics were used to describe the characteristics of an object or a person was given. The definitions of item difficulty and item discrimination index were then introduced. The formulae for computing these item statistics, i.e., the p-value and the point-biserial correlation coefficient, were also explained and illustrated with examples. A handout to enhance the teachers' understanding of these definitions was prepared and distributed. A copy of the handout is included in Appendix B. The concept of the item discrimination index, which was difficult to understand from just examining the formula for a point-biserial correlation coefficient, was illustrated by a computation of the D-index. By definition, the D-index is the difference in the proportion of correct responses between the group scoring in the top 27 percent on the total test and the group scoring in the bottom 27 percent on the same test (Ebel, 1979; p. 376). To simplify computations, the top and bottom 25 percent were used in the example instead of the top and bottom 27 percent. However, teachers were reminded that although the D-index is similar to the point-biserial correlation coefficient in terms of the concept of discrimination, the two indices are quite different in terms of computation.

b. The teachers were informed of the systematic errors and biases caused by the judgmental heuristics described by Tversky and Kahneman (1974). The three judgmental heuristics -- Representativeness, Availability, and Anchoring & Adjustment -- were briefly discussed. The teachers were cautioned against basing their estimation on adjustments to unrealistic initial estimates. They were also advised to be aware of the tendency for estimates to regress toward the mean of item difficulty.

c. The determinants of item characteristics, such as item complexity, option homogeneity, and familiarity of item content, identified by various researchers were discussed.
Specifically, a summary of research findings on the determinants of item difficulty (Table 3.2) was explained and discussed, and each member of the group was provided with a copy of the summary for reference. After the discussion of the research findings related to the determinants of item statistics, the teachers were divided into three groups to study and analyze a set of 10 sample items with known p-values and point-biserial correlation coefficients, with the aim of discovering what factors determine the values of the item statistics of these items.

Table 3.2.--Summary of Research Findings on Item Difficulty.

General Findings

1. Judges are able to estimate relative item difficulty well, i.e., judges are able to rank items according to difficulty level well.
2. Judges either consistently underestimate or overestimate item difficulty levels, i.e., they tend to make a "constant error" of estimation.
3. Judges tend to judge difficult items easier and easy items more difficult than their actual difficulty levels.
4. Accuracy of estimation of item difficulty levels can be increased when the judges are given "anchor" items to guide them in their estimation.
5. People estimate the value of a variable by assuming a rough initial value and making adjustments to yield the final estimate. The adjustments made are, on most occasions, insufficient.
6. Anchor items can prevent judges from adopting erroneous initial values which may lead to inaccurate estimates.
7. The accuracy of estimation of item difficulty depends on how elaborately a judge analyzes the structure and organization of the items.

Determinants of Item Difficulty

8. When "complexity" of an item is defined as the number of steps and the amount of information required to answer the item correctly, complexity has a direct relationship with item difficulty.
9. If the answer to a chemistry test item is based on knowledge of the properties of a chemical that is either unfamiliar to the candidates or that cannot be deduced from more basic knowledge, the difficulty of the item will be high.
10. Items that require complex reasoning with "unknown" or several reagents are more difficult.
11. Items are difficult if the syllabus content on which the items are based involves concepts that are difficult for the students to grasp.
12. Items that require incidental knowledge or obscure facts are more difficult.
13. "Deductive" problems which involve novel/unusual/new situations tend to be more difficult.
14. An item is harder if the options are more homogeneous and easier if the options are more heterogeneous.
15. If an item has more than one basis for choosing the correct answer, it is easier.
16. An item is easier if the stem is general and the options are diverse.

These 10 sample items were selected from items used in Chemistry test papers during the previous 5 years. They represented items from different content areas, with a wide range of p-values and point-biserials. These 10 sample items are referred to as the "anchor" items in this research. Each item was printed on a 5 x 8 pink paper. Besides the item proper, three statistics -- the rank order of the item difficulty (within this set of items), the p-value, and the point-biserial of the item -- were also printed on each of these 5 x 8 pink papers.
A leader was appointed in each of the three subgroups and was given the responsibility of leading the discussions and recording the findings of his/her group regarding the determinants of the item statistics of the "anchor" items. This studying, analyzing, and discussing of the determinants of item difficulty and item discrimination was guided by the summary of the research findings mentioned above. After each group had completed its list of determinants of item statistics, a discussion involving the total group was held, this time led by the investigator. In the total group discussion, each subgroup leader was requested to explain to the total group how his/her subgroup arrived at a particular determinant. Other members were encouraged to voice their opinions as to whether they agreed or disagreed with the suggested determinants and to give reasons for their opinions. An integrated list of determinants of item difficulty was prepared at the end of the total group discussions. Each member was also provided with a copy of the list for reference when they practiced estimating item statistics, which was the next process in the training program.

B. Practical component. Four sets of items were used for practicing item statistics estimation. The first three sets consisted of 5 items each and the last set contained 10 items. These items were also selected from the previous 5 years' Chemistry test papers. The item parameter values (i.e., item p-values and point-biserials computed from the post-analysis of the items based on the total population of about 40,000 students) of these items were available. The items as a whole were selected to be representative of the content areas of the syllabus and to have a wide range of p-values and point-biserials. These items are referred to as "practice" items in this study. Each of them was also printed on an item card, as was done for the "anchor" items described in the previous section. On each of the item cards, below the item proper, three small boxes were drawn and labeled for the teachers to fill in their estimates of the rank order of item difficulty (of the particular item in the set of items given), the p-value, and the point-biserial of the item.

The practice session proceeded as follows:

a. Each teacher was first given a set of 5 "practice" items and was told that they would be required to practice item statistics estimation shortly. Before the estimation practice began they were advised to recall what they had learned about the definitions of p-values and point-biserials, and about the research findings on item statistics estimation. They were also asked to review the list of determinants of item statistics that the group had discussed and finalized the previous day. They were then requested first to rank the items in order of either increasing or decreasing difficulty, and after that to estimate the item p-value and item point-biserial for each item in the set. The estimates given by each teacher for each item in the set were recorded on the blackboard. The parameter values of the item difficulty and item discrimination index of the items were also recorded on the blackboard. Teachers were advised to compare their own estimates with the parameter values and, for those items which they over- or underestimated, to try to find out why that happened. Teachers who could
Teachers who could 72 estimate the item statistics within : 0.05 of the parameter values for 4 out of the 5 items were identified, and were invited to explain to the whole group how they went about estimating the item statistics. With these new insights, the list of determinants of item statistics was studied again as a group to determine how they could be applied in the actual estimation process. The procedure described in Step (a) was repeated with two more sets of five items each. Each practice was followed by feedback and discussions of the accuracy of estimation. Strong emphasis was put on the importance of the teachers' reflections on their errors of estimations, and their efforts to improve estimation skills. The final practice involved a set of 10 items. The procedure was the same except that more items were involved (10 items as compared to 5 items in the previous three practices). At the end of this final practice, refinements on the list of determinants of item statistics was also made. This refined list was to be used by the teachers when they worked on the actual estimation. A copy of the list was included in Appendix C. 73 Estimating The actual estimation of item statistics by the treatment group was scheduled on the third day of the training/workshop session. Each teacher was given two sets of items. Each set contained 50 items. Each item was again printed on an item card. The first set (call this Set A) consisted of the 40 items from Form A and 10 "equating" items. The second set (Set B) consisted of the 40 items from Form B and the same 10 "equating" items used in the first set (the description of Forms A & B was given in the section on "sample items"). The 50 items in each set were numbered from 1 to 50 with the equating items appearing in every interval of 5. The 10 "equating" items were chosen from past years' Chemistry test papers to represent different topics in the syllabus and to have a wide range of item difficulty levels and item discrimination values. The first set of items were printed on green item cards and the second set on yellow. Two spaces were provided and labeled below each item for the teachers to record their estimates of the item statistics. In addition to the two sets of items, each teacher was provided with a separate set of 10 "anchor" items, each printed on a pink card. These 10 "anchor" items were the same 10 "anchor" items used in the training session. As mentioned in the earlier section, the parameter values of item difficulty and item discrimination of each "anchor" item were also printed on the card bearing the item. The 74 "equating" items have the same properties as the "anchor" items: the only difference between them was that the item statistics of the anchor items were made known to the teachers while the item statistics of the equating items were not known to the teachers. The actual estimating process was divided into two parts: one for item difficulty and the other for discrimination index. Item difficulty. Before teachers began their estimation exercise, they were requested to first review the list of determinants of item statistics that they had prepared the previous day, and then the 10 anchor items. They were then required to estimate the item difficulty of the items in Set A first and then items in Set B. Before being distributed to the teachers, the items in each set were shuffled so that their placements in the set would be random. 
Each set was estimated separately, that is, the teachers did not refer to the estimates given to the first set while working on the estimation of the items in the second set. As a strategy, they were advised to read each item carefully, to answer it (the answers to all the items were provided on a separate sheet), and then to place the item in one of three categories: easy, medium, and difficult. The items in each of the three categories were further divided into three more categories according to difficulty. No constraints were put on the final number of difficulty 75 categories used and the number of items to be placed in each category. Teachers made their own decisions as to how many categories they needed to help them in estimating the item difficulty. After the teachers had grouped the items into different difficulty categories, they were advised to review their categorization to see whether any relocation of items into other categories was necessary. This was followed by giving an estimate of the p-value for each item in each category. They were reminded to refer to the anchor items to guide them in the estimation process. Item discrimination indexu .A review of the literature provided no evidence to indicate that judges were able to estimate accurately the discrimination indices of test items. The review also revealed that very little research on factors influencing the accuracy of subjective estimation of item discrimination has been carried out. Furthermore, the mental processes involved in subjective estimation of item discrimination indices are much more complex than those involved in item difficulty estimation. Thus there are reasons to believe that it will be extremely difficult for the teachers to make accurate estimation of item discrimination index. It is also reasonable to believe that estimating the range within which the population value of the item discrimination would be expected to lie will be easier than estimating the specific value of the discrimination index. It is for this reason that teachers 76 were requested to estimate the pagge rather than the specific vaiue of the item discrimination index (in this study the point-biserial) for each item, igg., each teacher estimated the highest and the lowest values of the range within which the point-biserial of the item would be expected to lie. Once the teachers have estimated the rm; of the point-biserial, the next step is to identify a procedure for obtaining a ppipp estimate of the point-biserial from the estimated range. The Bayesian framework provides such a procedure. However, the Bayesian approach (described in Appendix D) requires, in addition to the estimated range, an empirical estimate of the point-biserial from a small sample of students. In the Bayesian terminology, the range (of point- biserial) estimated by each teacher can be considered as the prior information of the point-biserial distribution. The Bayesian approach then combines the prior information with the information present in the data obtained from the small sample of students to produce a posterior distribution. A final estimate of the point-biserial (for each item) was obtained from the posterior distribution. To obtain an empirical estimate of the point- biserial required in the Bayesian approach, the items were tried out in two schools which provided a total of 219 students for the exercise. Each item was tried out on about 77 110 students. The try out procedure was described as follows. 
All of the items, except the equating items, were assembled into two booklets, i.e., two forms: Form A and Form B. Each form contained 40 items. The forms were arranged in an ABABAB... sequence and were distributed in that order to the students who took part in the field trial. This design (equivalent to matrix sampling) ensured that the two forms were administered to two equivalent groups of students from the two selected schools.

Estimation by the Control Group

The teachers in the control group were invited to attend the workshop for one day. They were given an introduction to the rationale of the research, an explanation of the definitions of the item statistics, and an illustration of the computation of these item statistics, in exactly the same way as was done for the treatment group. However, no training was given to them. The teachers were not informed that they served as a control group. They were requested to estimate the item statistics using the same strategies as described for the treatment group, that is, to group the items into categories of the same item difficulty and then estimate the p-values of the items in each category. They were also reminded to make use of the anchor items to guide them in their estimation. Since no training was given, no information about the determinants of item statistics was imparted to the teachers in the control group. The items whose item statistics were to be estimated by the control group were exactly the same as those estimated by the treatment group.

Dependent Variables

There were two dependent variables in this research: one was the accuracy of p-value estimation and the other was the accuracy of point-biserial estimation.

Accuracy of p-value estimation. The accuracy of p-value estimation of an item was defined as the absolute value of the difference between the equated p-value estimate of the item and the population p-value of the item. The procedure for obtaining the equated p-value estimate is presented here. As described in the section on estimating, teachers were required to estimate the p-values of the items in Forms A and B as well as of the equating items embedded in each form. The information present in the equating items (i.e., the population and the estimated p-values) was used to derive the equated p-value estimate for each item. The population p-values of the equating items were first converted to normal deviates by the normalizing procedure. These normal deviates were then transformed linearly to a scale with a mean of 13 and a standard deviation of 4 (similar to the Delta-scale used at ETS). The estimated p-values of these equating items were also converted to the Delta-scale. For each teacher, a best-fitting line (using the least-squares criterion) was computed to represent the linear relationship between the population and the estimated p-values. This line was then used to equate the estimated p-values to the population p-values. Samples of scatterplots showing the relationship between the estimated and the population Deltas of the equating items are shown in Appendix E. The equating process for a given teacher in the treatment group is illustrated as follows.

An illustration of the equating process. Table 3.3 contains the population p-values for the 10 equating items embedded in Form A, and the p-values estimated by a teacher (call this teacher T1) for the same 10 equating items.
The estimated and the population p-values were converted to normal deviates using the Table of the Unit-Normal Distribution (Glass & Hopkins, 1984, p. 522). The relationship between the p-value and the normal deviate is illustrated in Figure 3.1. The normal deviates are the z-scores corresponding to the values of (1 - p) in a normal distribution. The normal deviates therefore have a mean of 0 and a standard deviation of 1.

Table 3.3.--Estimated and Population P-values for the Equating Items.

Equating Item         1    2    3    4    5    6    7    8    9   10
Population p-value   .82  .75  .66  .58  .52  .44  .40  .32  .32  .23
Estimated p-value    .88  .39  .69  .81  .45  .33  .38  .46  .32  .32

[Figure 3.1.--Relationship between P-value and Normal Deviate. The figure shows a unit-normal density with the normal deviate scale running from -3 to +3 and the p-value corresponding to the area above the deviate.]

Each of the normal deviates corresponding to a particular p-value was transformed to a Delta scale with a mean of 13 and a standard deviation of 4, using the following equation:

    Delta value = 4 x normal deviate + 13                                (3.1)

For example, if the p-value of an item is .60, then its normal deviate is -0.253. Substituting this normal deviate in Equation (3.1) gives a Delta value of 11.99 (i.e., 4 x (-0.253) + 13 = 11.99). The estimated and the population Delta values of the equating items are displayed in Table 3.4, and the scatterplot of these two sets of Deltas is given in Appendix E (Figure E1).

Table 3.4.--Estimated and Population Delta Values of the Equating Items.

Equating Item         1     2     3     4     5     6     7     8     9    10
Population Delta     9.3  10.3  11.3  12.2  12.8  13.6  14.0  14.9  14.9  16.0
Estimated Delta      8.3  14.1  11.9   9.5  13.5  14.8  14.2  13.3  14.9  14.9

Using the usual regression analysis procedure, the regression equation for the data in Table 3.4 was found to be:

    Equated Delta = .634 x Estimated Delta + 4.77                        (3.2)

Equation (3.2) was used to "equate" (for Teacher T1) the estimated p-values of the items in Form A to the population p-values. However, before the "equating" could be carried out, the estimated p-values of the items in Form A were first transformed to the Delta scale, using the procedure described above, i.e., converting the p-values to the corresponding normal deviates, which were then transformed to Delta values using Equation (3.1). The resulting estimated Deltas were then entered in Equation (3.2) to obtain the corresponding equated Deltas. For example, if the estimated Delta values of Item 8 and Item 16 were 10.9 and 12.9 respectively, then substituting these estimated Deltas in Equation (3.2) produced the corresponding equated Deltas as follows:

    (1) Estimated Delta = 10.9, Equated Delta = .634 x 10.9 + 4.77 = 11.7
    (2) Estimated Delta = 12.9, Equated Delta = .634 x 12.9 + 4.77 = 12.9

Finally, the equated Deltas were transformed back to the p-value scale using the Unit-Normal Distribution Table. The resulting p-values are referred to as the "equated p-values."

Accuracy of point-biserial estimation. The estimate of the point-biserial involved the combination of two pieces of information, one from the range of the point-biserial correlation estimated by the teachers and the other from the observed point-biserial obtained in the small-sample field trial. The Bayesian approach (see Appendix D) combined these two pieces of information to obtain an estimate of the point-biserial. The accuracy of point-biserial estimation was defined as the absolute value of the difference between the estimated point-biserial and the parameter value of the point-biserial of the item.
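For concreteness, the Delta transformation and the equating step illustrated above (Equations 3.1 and 3.2) can be summarized in a short computational sketch. The fragment below is only an illustration and not the procedure actually programmed for the study; the function names are introduced here for exposition, and numpy and scipy's unit-normal functions are assumed to be available.

```python
import numpy as np
from scipy.stats import norm

def p_to_delta(p):
    """Delta = 4 * z + 13, where z is the unit-normal deviate for (1 - p) (Equation 3.1)."""
    return 4.0 * norm.ppf(1.0 - np.asarray(p)) + 13.0

def delta_to_p(delta):
    """Transform a Delta value back to the p-value scale."""
    return 1.0 - norm.cdf((np.asarray(delta) - 13.0) / 4.0)

# Population and teacher-estimated p-values of the 10 equating items (Table 3.3).
pop_p = np.array([.82, .75, .66, .58, .52, .44, .40, .32, .32, .23])
est_p = np.array([.88, .39, .69, .81, .45, .33, .38, .46, .32, .32])

# Least-squares line relating the estimated Deltas to the population Deltas.
# For the Table 3.4 data this reproduces, up to rounding, the slope (.634) and
# intercept (4.77) of Equation (3.2).
slope, intercept = np.polyfit(p_to_delta(est_p), p_to_delta(pop_p), deg=1)

def equate(est_p_item):
    """Equate a teacher's estimated p-value for an item to the population metric."""
    equated_delta = slope * p_to_delta(est_p_item) + intercept
    return delta_to_p(equated_delta)
```

In this sketch, calling equate on a teacher's estimated p-value carries out, in one step, the conversion to the Delta scale, the application of the teacher's equating line, and the back-transformation to the p-value scale.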
Transformation of the Dependent Variables

The frequency distributions of the dependent variables defined above were found to be positively skewed (Figures 3.2 & 3.3). Since the hypotheses to be tested in this study are based, in part, on the assumption of normal distributions of the dependent variables, the original dependent variables were changed, through the following logarithmic transformation, to a metric in which the distributions were more nearly normal (let Y and R be the original dependent variables and Y* and R* the transformed variables):

    Y* = -ln(.05) + ln(.05 + Y)
    R* = -ln(.05) + ln(.05 + R)

The histograms for the two transformed variables, Y* and R*, are shown in Figures 3.4 & 3.5. The transformed variables were used as the dependent variables in the hypothesis testing.

[Figure 3.2.--Frequency distribution of the original accuracy of p-value estimates (variable: original Y). Summary statistics: Mean = .120, Std Err = .002, Median = .105, Mode = .105, Std Dev = .093, Variance = .009, Kurtosis = .553, S E Kurt = .100, Skewness = .984, S E Skew = .050, Range = .492, Minimum = .000, Maximum = .492, Sum = 288.138.]

[Figure 3.3.--Frequency distribution of the original accuracy of point-biserial estimates (variable: original R). Summary statistics: Mean = .094, Std Err = .022, Median = .073, Mode = .010, Std Dev = .076, Variance = .006, Kurtosis = 2.170, S E Kurt = .103, Skewness = 1.280, S E Skew = .052, Range = .522, Minimum = .000, Maximum = .522, Sum = 209.548.]

[Figure 3.4.--Frequency distribution of the transformed accuracy of p-value estimates (variable: transformed Y). Summary statistics: Mean = 1.078, Std Err = .011, Median = 1.133, Mode = 1.133, Std Dev = .548, Variance = .300, Kurtosis = -.792, S E Kurt = .100, Skewness = -.045, S E Skew = .050, Range = 2.383, Minimum = .000, Maximum = 2.383, Sum = 2587.090.]
[Figure 3.5.--Frequency distribution of the transformed accuracy of point-biserial estimates (variable: transformed R). Summary statistics: Mean = .926, Std Err = .011, Median = .896, Mode = .183, Std Dev = .507, Variance = .257, Kurtosis = -.652, S E Kurt = .103, Skewness = .073, S E Skew = .052, Range = 2.437, Minimum = .000, Maximum = 2.437, Sum = 2074.955.]

Generalizability of Results

Since the subjects were restricted to those Chemistry teachers who have experience in teaching the examination classes, in rating examination papers, and/or in writing Chemistry test items, and to those who were teaching in the Federal Territory of Kuala Lumpur, the generalizability of the results of this study to other teachers and to items in other subject matters is limited. However, the main concern of this study is not the generalizability to other teachers, but the comparison of the teachers' estimation accuracy of item statistics with the estimation accuracy obtained in the field trial of the item pool, and also the comparison of the estimation accuracy of those teachers who received treatment to improve estimation skills with those who did not receive treatment. The nonrandom selection of teachers did not affect the comparison of the estimation accuracy of the treatment group with that of the control group, because random assignment of subjects had formed two equivalent groups and treatment occurred over a short period of time. In this situation, possible threats to the internal validity (Campbell & Stanley, 1963) of the study such as history, maturation, statistical regression, differential selection, experimental mortality, selection-maturation interaction, and experimental treatment diffusion would be under control. Since no pretest was involved, the threats due to pretesting and measuring instruments did not affect the validity of the study. The fact that the teachers in the treatment group could be identified and included in future item statistics estimation exercises made the need for generalizability of the results to other teachers less crucial.

Hypotheses

The major hypotheses of the study are presented here. Each hypothesis is stated in the alternative hypothesis form. Under hypotheses H1a to H1e and H3a to H3e, statistical tests for interactions between the treatment effect, form, and the respective factor (e.g., content area, cognitive level of item, item type) stated in the hypothesis were carried out, and if significant, the interactions would be taken into consideration in the interpretation of the main effects. Wherever appropriate, Tukey's method of multiple comparisons was also carried out to find out which pairs of factor levels were different.

H1: There is a difference in the accuracy of item difficulty estimation of experienced teachers trained in estimation skills and experienced teachers not trained in estimation skills. The difference favors the teachers trained in estimation skills.
H1a: There is a difference between items of different content areas in the accuracy of p-value estimation by the experienced teachers.

H1b: There is a difference between items of different cognitive levels in the accuracy of p-value estimation by the experienced teachers.

H1c: There is a difference between items of different difficulty levels in the accuracy of p-value estimation by the experienced teachers.

H1d: There is a difference between items of different discrimination power in the accuracy of p-value estimation by the experienced teachers.

H1e: There is a difference between items of different types in the accuracy of p-value estimation by the experienced teachers.

H2: There is a difference in the accuracy of item difficulty estimation of the experienced teachers trained and competent in estimation skills and the accuracy of item difficulty estimation obtained in the item analysis of the item pool.

H3: There is a difference in the accuracy of point-biserial estimation of experienced teachers trained in estimation skills and experienced teachers not trained in estimation skills. The difference favors the teachers trained in estimation skills.

H3a: There is a difference between items of different content areas in the accuracy of point-biserial estimation by the experienced teachers.

H3b: There is a difference between items of different cognitive levels in the accuracy of point-biserial estimation by the experienced teachers.

H3c: There is a difference between items of different difficulty levels in the accuracy of point-biserial estimation by the experienced teachers.

H3d: There is a difference between items of different discrimination power in the accuracy of point-biserial estimation by the experienced teachers.

H3e: There is a difference between items of different types in the accuracy of point-biserial estimation by the experienced teachers.

Statistical Analysis

A double-repeated measures analysis of variance with two within-subjects factors and one between-subjects factor was used to analyze the data for Hypothesis H1 in combination with each of the hypotheses H1a to H1e. For example, the combination of Hypothesis H1 with Hypothesis H1a was tested with treatment/control as the between-subjects factor and form and content area as the two within-subjects factors. The three factors were completely crossed, enabling interaction effects to be tested. However, teachers were nested within the treatment factor. A diagrammatic representation of the design is presented in Table 3.5. A double-repeated measures MANOVA was used because each teacher was considered as a block and the same teacher estimated the items in both forms (i.e., Form A and Form B) and also in all the different content areas (i.e., chemical structure, electricity & energy, rates & equilibrium, and descriptive chemistry). The possibility of unequal correlation between the accuracy of p-value estimates for any two levels of the content area factor favored a multivariate approach (Norusis, 1988). Nevertheless, if the chi-square test of sphericity is nonsignificant, a univariate multiple-comparison procedure would be used to test the pairwise comparisons among the levels of the content area (Kirk, 1982).

The hypotheses H1a to H1e and H3a to H3e implied that the items in each of the two forms, A and B, would be grouped according to the factor tested in the hypothesis.
For example, to test hypothesis H1a, the items would be grouped into four different content areas: chemical structure, electricity & energy, rates & equilibrium, and descriptive chemistry.

Table 3.5.--A Double-Repeated Measures Design with One Between-Subjects and Two Within-Subjects Factors.

                                                Form A               Form B
  Treatment                        Teacher   CA1 CA2 CA3 CA4      CA1 CA2 CA3 CA4
  Trained in estimation skills     T1
                                   T2
                                   ...
                                   T15
  Not trained in estimation        T1
  skills                           T2
                                   ...
                                   T15

  Note: CA1 = Chemical Structure; CA2 = Electricity & Energy; CA3 = Rates & Equilibrium; CA4 = Descriptive Chemistry.

For hypotheses H1b and H3b, the items were grouped into knowledge, comprehension, and application levels. The initial classification of the items into the various cognitive levels by the panel of teachers at the time of test construction was used in this study.

For hypotheses H1c and H3c, the items were grouped into easy, medium, or hard categories according to their population p-values. Items with population p-values greater than or equal to 0.65 were grouped into the easy category, items with p-values less than 0.65 and greater than 0.49 were grouped into the medium category, and items with p-values smaller than 0.49 into the hard category.

For hypotheses H1d and H3d, the items were grouped into a low discriminating category (items with population values of r less than or equal to 0.4), a medium discriminating category (r between 0.4 and 0.5), and a high discriminating category (r greater than 0.5).

For hypotheses H1e and H3e, the items were grouped according to single-answer and multiple-answer type.

For each of the above hypotheses, the mean of the accuracy of estimation for the items in each particular category was computed for each teacher. These means were taken as indicators of the accuracy of estimation of the teachers on the particular category of items. One teacher in each of the treatment and the control groups did not provide estimates of the point-biserials for the items. Hence, for the mean accuracy of point-biserial estimation, data were available for only fourteen teachers in each group. The complete tables of the means for the individual teachers on all dependent variables are presented in Appendix F.

Hypothesis H2, which involved the comparison of the accuracy of p-value estimation of the teachers trained in estimation skills with the item pool accuracy of estimation, was tested by a three-way ANOVA. In this case, an accuracy index was computed for each teacher by averaging the accuracy of p-value estimation (i.e., the absolute difference between the equated p-value estimated by the teacher and the population p-value of the item) across the 80 items. The associated standard deviation of these absolute differences was also computed (Table 3.6). The competency of teachers in estimating item difficulty was evaluated by the size of the accuracy index and the standard deviation: the smaller the values of both indicators, the more competent. The ten most competent teachers were selected and their average equated p-value estimates (p-bar) were computed for each item. The absolute differences between these p-bars and the corresponding population p-values represent the accuracy with which the teachers as a group estimated the item difficulty. The accuracy of estimation obtained in the field trial of the item pool was indicated by the absolute differences between the item pool estimated p-values and the corresponding population p-values.

Table 3.6.--Competency Indices of Teachers.
Teacher     Mean      Std. Dev.
T1         .1088      .1053
T2         .1090      .0849
T3         .1106      .0928
T4         .1113      .0853
T5         .1163      .0857
T6         .1048      .0788
T8         .1200      .0918
T9         .1120      .0855
T10        .1123      .0863
T11        .1163      .1034
T12        .1239      .0886
T13        .1010      .0838
T14        .1060      .0804
T15        .1224      .0917

Hypothesis H2 essentially concerned a comparison of the difference between the accuracy of these two methods of estimation.

The five hypotheses H1a to H1e were based on the same data. This created a problem known as inflation of the alpha level. If the usual alpha level of 0.05 were used in testing the hypotheses, then the chance of making a Type I error would be approximately equal to the sum of the alpha levels across the tests, i.e., 0.25. To avoid this problem, these five hypotheses were tested using an alpha level of 0.01. The experimentwise alpha would therefore be fixed at 0.05. The same alpha level was also used for testing hypotheses H3a to H3e, which have a similar error-rate problem.

In order to provide an indication of the accuracy of the teachers' point-biserial estimation, a Spearman rank correlation coefficient between the estimated and the population point-biserials was computed. In this case, an average estimated range within which the population point-biserial was expected to lie was first computed for each item. This was done by averaging, for each item, the upper limits of the ranges estimated by all the teachers in the treatment group, and then averaging the lower limits. The average upper and the average lower limits then constitute the average estimated range. An estimate of the point-biserial for each item was obtained using the Bayesian approach as described in Appendix D. Thus each of the 80 items (from Forms A & B) has an estimated and a population point-biserial. A Spearman rank correlation coefficient was then computed from these two sets of point-biserials (Glass & Hopkins, 1984).

Summary

The accuracy with which two groups of experienced Chemistry teachers estimated item statistics was compared; one group received training in estimation skills and the other was not trained. Accuracy of estimation was also compared between the trained teachers and the accuracy of estimates obtained in the field trial of the item pool. Thirty experienced Chemistry teachers were randomly assigned to two groups: treatment and control. The items to be estimated by the teachers were grouped into two forms, A and B; each form contained 40 items. Equating items were embedded in each form. The teachers were provided with 10 "anchor" items (i.e., items with known population item characteristics) to guide them in the estimation.

The training program consisted of a theoretical component and a practical component. In the theoretical component, teachers were involved in the identification and study of determinants of item statistics, whereas the practical component focused on practicing the application of these determinants in actual item statistics estimation.

Statistical analysis of the data involved a double-repeated measures design with one between-subjects factor and two within-subjects factors. The treatment-control dimension was the between-subjects factor, and form (A & B) was one of the two within-subjects factors.
The other within-subjects factor was one of the following five dimensions: (a) content area (with 4 levels: chemical structure, electricity & energy, rates & equilibrium, and descriptive chemistry), (b) cognitive level (with 3 levels: knowledge, comprehension, and application), (c) difficulty (with 3 levels: easy, medium, and hard), (d) discrimination power (with 3 levels: low, medium, and high), and (e) type of item (with 2 levels: single-answer type and multiple-answer type).

Apart from comparing the trained teachers' estimation accuracy with the accuracy obtained from an item analysis of the item pool, examination of the effects of treatment, content area, cognitive level, difficulty level, discrimination power, and item type on the accuracy of item statistics estimation was an important target of the study.

CHAPTER IV
RESULTS

Introduction

The results of the study are presented in this chapter. Even though the hypotheses tested with respect to the accuracy of estimation of p-values and point-biserials were similar, the results are presented separately. For the dependent variable accuracy of estimation of p-values, Hypothesis H1 in combination with Hypotheses H1a to H1e was tested using a multivariate double-repeated measures analysis of variance, and Hypothesis H2 was tested with a three-way ANOVA in which method of estimation, form, and content area were the three factors. For the dependent variable accuracy of estimation of point-biserials, Hypothesis H3 in combination with Hypotheses H3a to H3e was tested using a multivariate double-repeated measures analysis of variance. The results of the hypothesis testing are presented in the order in which they are mentioned above. All the ANOVA tables are presented in Appendix G.

Accuracy of P-Value Estimation

Results Concerning the Treatment Effect

The test of Hypothesis H1 was carried out by a multivariate double-repeated measures analysis in which treatment was the between-subjects factor, and form and content area were the two within-subjects factors. The hypothesis was stated as,

H1. The accuracy of item difficulty estimation by experienced teachers trained in estimation skills will be better than the accuracy of estimation by experienced teachers not trained in estimation skills.

The difference in the means between the accuracy of p-value estimation by the trained teachers (mean = 1.05, on a scale which ranged from 0 to 3) and that of the untrained teachers (mean = 1.12) was found to be statistically significant, F(1,28) = 16.67, p < 0.05. Thus this hypothesis was accepted (alpha = 0.05), and it can be concluded that the teachers trained in estimation skills could estimate the p-values of the items more accurately than the teachers not trained in estimation skills. The treatment had an effect size of 0.53.

Results Concerning the Effect of Content

The hypothesis was stated as,

H1a. There will be differences among different content areas in the accuracy of p-value estimation by experienced teachers.

The multivariate test of significance for the content effect indicated that the effect was statistically significant, F(3,26) = 72.8, p < 0.01. The average univariate F test was also significant, F(3,84) = 111.73, p < 0.01. As was explained in the section on statistical analysis in Chapter III, in order to avoid the problem of an inflation of the alpha level, the main effects of content area, cognitive level, difficulty level, discrimination power, and item type were each tested at an alpha level of 0.01.
Thus Hypothesis H1a was accepted at the 0.01 level. The other within-subjects factor, i.e., form, has an observed univariate F-ratio of 7.02 (df = 1,28; p < 0.05). This indicated that the main effect of form was significant (alpha = 0.05). The treatment x form x content interaction has a multivariate F-ratio of 2.25 (df = 3,26; p > 0.05). For the treatment x content interaction, the multivariate test has an observed significance level of 0.41, whereas for the treatment x form interaction, the F-ratio has an observed significance level of 0.65. Therefore the interactions (a) treatment x form x content, (b) treatment x content, and (c) treatment x form were all not significant at the 0.01 level. However, the form x content interaction was found to be significant at the 0.01 level by both the multivariate test, F(3,26) = 38.9, p < 0.01, and the univariate test, F(3,84) = 29.75, p < 0.01. The significant interaction is depicted in Figure 4.1.

[Figure 4.1.--Interaction of Form and Content on Accuracy of P-Value Estimation. The plot shows the mean accuracy for Form A and Form B across the four content areas: chemical structure, electricity & energy, rates & equilibrium, and descriptive chemistry.]

The Mauchly sphericity test involving the content factor has an observed significance level of 0.36 (chi-square = 5.5; df = 5). This indicated that the assumption of sphericity is not violated. The results of the a posteriori test of differences among the means of the levels of the content factor at each level of the form factor are shown in Tables 4.1 and 4.2. In order to avoid the risk of an inflated Type I error, the a posteriori multiple comparisons were based on the studentized range statistic, q. The data showed that for Form A, the mean accuracy of p-value estimation for "rates & equilibrium" items was significantly higher than the means for (a) chemical structure, (b) electricity & energy, and (c) descriptive chemistry, at the 0.01 level. However, for Form B, the means of the accuracy for "rates & equilibrium" and for "chemical structure" were each significantly higher than the means for (a) electricity & energy, and (b) descriptive chemistry. [Note: a higher mean accuracy value implies that the p-values of the items in the particular content area were less accurately estimated than in content areas with a lower mean accuracy value.]

Table 4.1.--Multiple Comparisons among the Means of the Different Content Areas (Form A). Studentized range statistic, q.

Content                 Descrip. Chemistry   Chemical Structure   Electricity & Energy   Rates & Equilm.
                        (Mean = .093)        (Mean = .093)        (Mean = .102)          (Mean = .139)
Descrip. Chemistry                           .15                  2.8                    14.4**
Chemical Structure                                                2.7                    14.2**
Electricity & Energy                                                                     11.6**

** Significant at the 0.01 level; the critical value of the studentized range statistic, q = 4.56, df = 84, 4.

Table 4.2.--Multiple Comparisons among the Means of the Different Content Areas (Form B). Studentized range statistic, q.

Content                 Descrip. Chemistry   Electricity & Energy   Chemical Structure   Rates & Equilm.
                        (Mean = .094)        (Mean = .094)          (Mean = .124)        (Mean = .126)
Descrip. Chemistry                           .18                    9.3**                10.1**
Electricity & Energy                                                7.5**                8.3**
Chemical Structure                                                                       .74

** Significant at the 0.01 level; the critical value of the studentized range statistic, q = 4.56, df = 84, 4.

Results Concerning the Effect of Cognitive Level

The hypothesis was stated as,

H1b. There will be a difference among item cognitive levels in the accuracy of p-value estimation by experienced teachers.
The obtained multivariate F-ratio of 45.66 (df = 2,27) was significant at the 0.01 level, supporting the hypothesis that there were differences among item cognitive levels in the accuracy of p-value estimation. The average univariate F-test gave a similar result, F(2,56) = 58.25, p < 0.01. The interactions (a) treatment x form x cognitive level, (b) treatment x form, and (c) treatment x cognitive level were all nonsignificant. However, both the multivariate and the average univariate F-tests indicated that the interaction between form and cognitive level was significant: multivariate F(2,27) = 53.5, p < 0.01; univariate average F(2,56) = 43.3, p < 0.01. A graphic representation of the interaction is shown in Figure 4.2.

[Figure 4.2.--Interaction of Form and Cognitive Level on Accuracy of P-Value Estimation. The plot shows the mean accuracy for Form A and Form B at the knowledge, comprehension, and application levels.]

The Mauchly sphericity test involving the cognitive level effect has an observed significance level of 0.075 (chi-square = 5.2; df = 2), indicating that the assumption of sphericity of the dependent variable was not violated. The results of an a posteriori test of differences among the means of the cognitive levels at different levels of the form factor are presented in Tables 4.3 and 4.4. The multiple comparisons were based on the studentized range statistic, q. The analyses showed that, for Form A, the mean accuracy of p-value estimation of "comprehension" items was significantly higher than the means for (a) "application" items and (b) "knowledge" items, whereas for Form B, the mean for "knowledge" items was significantly higher than the means for (a) "application" items and (b) "comprehension" items. The mean for "comprehension" items was also significantly higher than the mean for "application" items.

Table 4.3.--Multiple Comparisons among the Means of the Different Cognitive Levels (Form A). Studentized range statistic, q.

Cognitive Level    Application      Knowledge        Comprehension
                   (Mean = .090)    (Mean = .099)    (Mean = .116)
Application                         2.9              8.3**
Knowledge                                            5.4**

** Significant at the 0.01 level; the critical value of the studentized range statistic, q = 4.3; df = 56, 3.

Table 4.4.--Multiple Comparisons among the Means of the Different Cognitive Levels (Form B). Studentized range statistic, q.

Cognitive Level    Application      Comprehension    Knowledge
                   (Mean = .094)    (Mean = .110)    (Mean = .126)
Application                         5.1**            10.2**
Comprehension                                        5.1**

** Significant at the 0.01 level; the critical value of the studentized range statistic, q = 4.3; df = 56, 3.

Results Concerning the Effect of Item Difficulty Level

The hypothesis concerning the effect of item difficulty on the accuracy of estimation was stated as,

H1c. There will be differences among items of different difficulty levels in the accuracy of p-value estimation by experienced teachers.

The multivariate F-test for the main effect of difficulty level was significant, with an F-ratio of 69.9 (df = 2,27; p < 0.01). This supports the hypothesis that there are differences among item difficulty levels in the accuracy of p-value estimation. A similar result was obtained from the univariate average F-test (F = 109.5, df = 2,56; p < 0.01). The univariate F-test indicated that the p-values of the "easy" items were estimated significantly less accurately than the average of the "medium" and "hard" items, F(1,28) = 141.17, p < 0.01. There was no difference in the accuracy of p-value estimation between "medium" and "hard" items, F(1,28) = 0.29, p = 0.60.
The interactions (a) treatment x form x difficulty level, (b) treatment x form, and (c) treatment x difficulty level were all nonsignificant. However, both the multivariate and the univariate average F-tests indicated that the interaction between difficulty level and form on the accuracy of p-value estimation was significant: multivariate F(2,27) = 9.4, p < 0.01; univariate average F(2,56) = 10.6, p < 0.01. Figure 4.3 depicts the interaction graphically.

[Figure 4.3.--Interaction of Form and Difficulty Level on Accuracy of P-Value Estimation. The plot shows the mean accuracy for Form A and Form B for easy, medium, and hard items.]

Since Mauchly's test of sphericity indicated that the assumption of sphericity was not tenable (chi-square = 10.1, df = 2; p < 0.05), the residual mean squares appropriate for the specific contrasts of interest (Kirk, 1982) were used in the a posteriori comparisons among the means at different difficulty levels. Tables 4.5 and 4.6 display the results of the post-hoc comparisons. The analyses showed that, for both forms, "easy" items were estimated significantly less accurately than both "medium" and "hard" items, whereas there was no significant difference in the accuracy of estimation between the "medium" and the "hard" items.

Table 4.5.--Multiple Comparisons among the Means of the Different Difficulty Levels (Form A). Studentized range statistic, q.

Difficulty Level    Hard             Medium           Easy
                    (Mean = .088)    (Mean = .095)    (Mean = .134)
Hard                                 2.9              14.6**
Medium                                                11.2**

** Significant at the 0.01 level; the critical value of the studentized range statistic, q = 4.3, df = 56, 3.

Table 4.6.--Multiple Comparisons among the Means of the Different Difficulty Levels (Form B). Studentized range statistic, q.

Difficulty Level    Medium           Hard             Easy
                    (Mean = .089)    (Mean = .093)    (Mean = .147)
Medium                               1.7              15.4**
Hard                                                  13.9**

** Significant at the 0.01 level; the critical value of the studentized range statistic, q = 4.3, df = 56, 3.

Results Concerning the Effect of Different Discrimination Power

The hypothesis involved in this aspect of the study was stated as,

H1d. There will be differences among items of different discrimination power in the accuracy of p-value estimation by experienced teachers.

Both the multivariate and univariate F-ratios for the main effect of discrimination power have an observed significance level of less than 0.01 (multivariate F = 27.4, df = 2,27; univariate average F = 35.9, df = 2,56). Thus the hypothesis that there would be differences among items of different discrimination power in the accuracy of p-value estimation was accepted at the 0.01 level. The interactions (a) treatment x form x discrimination level, (b) treatment x form, and (c) treatment x discrimination level were all not significant at the 0.01 level. The result of the multivariate F-test indicated that the form by discrimination level interaction was significant, F(2,27) = 7.2, p < 0.01, and the univariate average F-test gave a similar result, F(2,56) = 6.2, p < 0.01. A graphic representation of the interaction is shown in Figure 4.4.

[Figure 4.4.--Interaction of Form and Discrimination Level on Accuracy of P-Value Estimation. The plot shows the mean accuracy for Form A and Form B for items of low, medium, and high discrimination.]

The Mauchly sphericity test for the discrimination power factor has an observed chi-square value of 2.04 (df = 2, p > 0.05), indicating that the assumption of sphericity was tenable. The results of the a posteriori test of differences among the means of items of different discrimination levels are presented in Tables 4.7 and 4.8.
The post-hoc analyses indicated that, for Form A, the mean accuracy of p-value estimation for items with low discrimination power was significantly higher than the means for (a) items with medium discrimination power and (b) items with high discrimination power. For Form B, the mean for items with low discrimination power was significantly higher than the mean for items with high discrimination power, and the mean for items with medium discrimination power was also significantly higher than the mean for items with high discrimination power.

Table 4.7.--Multiple Comparisons among the Means of the Different Discrimination Levels (Form A). Studentized range statistic, q.

Discrimination Level    High             Medium           Low
                        (Mean = .095)    (Mean = .105)    (Mean = .113)
High                                     2.6              4.3**
Medium                                                    2.1**

** Significant at the 0.01 level; the critical value of the studentized range statistic, q = 4.3, df = 56, 3.

Table 4.8.--Multiple Comparisons among the Means of the Different Discrimination Levels (Form B). Studentized range statistic, q.

Discrimination Level    High             Medium          Low
                        (Mean = .094)    (Mean = .11)    (Mean = .120)
High                                     6.3**           6.6**
Medium                                                   .31

** Significant at the 0.01 level; the critical value of the studentized range statistic, q = 4.3, df = 56, 3.

Results Concerning the Effect of Item Type

The hypothesis concerning the effect of item type on the accuracy of estimation was stated as,

H1e. The p-values of the single-answer multiple-choice items will be estimated more accurately than those of the multiple-answer multiple-choice items by experienced teachers.

The paired t test for the difference between the accuracy of p-value estimation of the single-answer type and the accuracy of the multiple-answer type was significant, t(28) = 6.0, p < 0.01, supporting the hypothesis that the p-values of single-answer type items were estimated more accurately than those of the multiple-answer type items. The interactions (a) treatment x form x item type, (b) treatment x form, and (c) treatment x item type were all not significant at the 0.01 level. The interaction between form and item type was significant, F(1,18) = 42.6, p < 0.01. The form by item type interaction is represented in Figure 4.5.

[Figure 4.5.--Interaction of Form and Item Type on Accuracy of P-Value Estimation. The plot shows the mean accuracy for Form A and Form B for single-answer and multiple-answer items.]

For Form A, the mean accuracy of p-value estimation for multiple-answer items was significantly higher than the mean for single-answer items (q = 6.5, p < 0.01). However, there was no significant difference between the means of these two item types in Form B.

Results of Teacher and Item Pool Estimation Comparison

The hypothesis concerning the comparison of the accuracy of estimation by teachers with the accuracy of estimation by the item pool field trial was stated as,

H2. There will be a difference in the accuracy of item p-value estimation by experienced teachers trained and competent in estimation skills and the accuracy of the item p-value estimation obtained in the field trial of the item pool.
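Hypothesis H2 was evaluated with a three-way ANOVA crossing method of estimation, form, and content area, with item accuracy as the dependent variable. The fragment below is only a minimal sketch of how such an analysis can be laid out; the data frame, its column names, and the generated values are hypothetical stand-ins for the study's data, and statsmodels' formula interface is assumed to be available.

```python
from itertools import product
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

rng = np.random.default_rng(1)

# Hypothetical balanced layout: 2 methods x 2 forms x 4 content areas x 10 items,
# standing in for the 160 accuracy values summarized in Table 4.9.
cells = product(["teachers", "field_trial"], ["A", "B"],
                ["structure", "electricity", "rates", "descriptive"])
rows = [{"method": m, "form": f, "content": c, "accuracy": rng.normal(1.0, 0.5)}
        for m, f, c in cells for _ in range(10)]
df = pd.DataFrame(rows)

# Three-way ANOVA: method of estimation x form x content area.
model = ols("accuracy ~ C(method) * C(form) * C(content)", data=df).fit()
print(sm.stats.anova_lm(model, typ=1))
```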
The results of the three-way ANOVA for evaluating Hypothesis H2 are displayed in Table 4.9. The distribution of the original dependent variable (i.e., the absolute difference between the estimated p-value and the parameter value of the item difficulty) was positively skewed, as shown in Figure 4.6. To reduce the skewness, the original dependent variable was transformed to a scale in which the distribution was more nearly normal, by the equation:

    Y* = -ln(.05) + ln(.05 + Y)

Figure 4.7 shows the distribution of the transformed variable (Y*).

Table 4.9.--ANOVA Table for Evaluating the Difference between the Accuracy of Estimation by Teachers and the Accuracy by Field Trial.

Source of Variation                    SS      DF      MS      F     Sig. of F
Main effects
  Method of estimation                1.9       1     1.9     .09      .762
  Form                               13.0       1    13.0     .64      .425
  Content                            62.4       3    20.8    1.0       .387
Two-way interactions
  Method by Form                      3.5       1     3.5     .17      .678
  Method by Content                 247.8       3    82.6    4.0       .009
  Form by Content                    37.6       3    12.5     .61      .608
Three-way interaction
  Method by Form by Content          71.5       3    23.8    1.2       .324
Residual                           2940.3     144    20.4
Total                              3389.9     159    21.3

[Figure 4.6.--Frequency distribution of the original accuracy of p-value estimates by competent teachers and by field trial. Values in the figure were obtained from Y multiplied by 10^2. Summary statistics: Mean = 9.15, Std Err = .53, Median = 8.00, Mode = 5.00, Std Dev = 6.66, Variance = 44.34, Kurtosis = 1.05, S E Kurt = .38, Skewness = 1.08, S E Skew = .19, Range = 32.00, Minimum = .00, Maximum = 32.00, Sum = 1464.00.]

[Figure 4.7.--Frequency distribution of the transformed accuracy of p-value estimates by competent teachers and by field trial. Values in the figure were obtained from Y* multiplied by 10^2. Summary statistics: Mean = 9.36, Std Err = .37, Median = 9.56, Mode = 6.93, Std Dev = 4.62, Variance = 21.32, Kurtosis = -.39, S E Kurt = .38, Skewness = -.03, S E Skew = .19, Range = 20.02, Minimum = .00, Maximum = 20.02, Sum = 1497.00.]

Based on the transformed scale, the difference in the means between the accuracy of estimation by the teachers (mean = 0.946, on a scale which ranged from 0 to 3) and the accuracy of field-trial estimation (mean = 0.925) has an F-ratio of 0.092 (df = 1,144; p = 0.76). This result indicated that the accuracy of estimation by the teachers was not significantly different from the accuracy of estimation obtained in the field trial (alpha = 0.05). The main effects of form and content were also not significant at the 0.05 level. However, the method of estimation by content area interaction was significant, F(3,144) = 4.0, p < 0.05. A graphic representation of the interaction is shown in Figure 4.8.

Efficiency of estimation by teachers. The efficiency of p-value estimation by the teachers was defined, in this study, as the extent to which the accuracy
of estimation by the teachers approximates the accuracy of estimation obtained in the field trial of the item pool.

[Figure 4.8.--Interaction of Method of Estimation and Content on Accuracy of P-Value Estimation. The plot shows the mean accuracy of estimation by teachers and by field trial across the four content areas: chemical structure, electricity & energy, rates & equilibrium, and descriptive chemistry.]

The efficiency of estimation was thus estimated by computing the percent change in the mean accuracy of estimation, with the mean accuracy of estimation obtained in the field trial as the baseline, and then subtracting the percent change from 100%. The means and standard deviations of the accuracy of estimation across the items in Forms A and B, for the trained and the untrained groups as well as for the field trial, are presented in Table 4.10. The computation of the efficiency of estimation was based on the original dependent variable rather than the transformed variable and is illustrated as follows:

    Efficiency = 100% - [(Met - Mft) / Mft] x 100%

where Met = mean accuracy of estimation by teachers across all items, and Mft = mean accuracy of estimation by field trial across all items. The efficiency of estimation by the teachers who had been trained in estimation skills was found to be 91%, whereas the efficiency of estimation of the teachers not trained in estimation skills was 78%.

Table 4.10.--Means and Standard Deviations of Accuracy of Estimation for Different Methods.

                                                   Mean                        Std. Dev.
Mode of Estimation                       Form A   Form B   A & B      Form A   Form B   A & B
Estimation by teachers trained and
  competent in estimation skills          .0873    .104     .0906      .071     .077     .074
Estimation by teachers not trained
  in estimation skills                    .1008    .1125    .1066      .085     .083     .084
Estimation by field trial of item pool    .0850    .0898    .0874      .052     .061     .057

Accuracy of Point-Biserial Estimation

Results Concerning the Treatment Effect

As was done for Hypothesis H1, the test of Hypothesis H3 was carried out by a multivariate double-repeated measures analysis in which treatment was the between-subjects factor, and form and content area of items were the two within-subjects factors. The hypothesis was stated as,

H3. The accuracy of point-biserial estimation by experienced teachers trained in estimation skills will be better than the accuracy by experienced teachers not trained in estimation skills.

The difference in the means between the accuracy of point-biserial estimation by the trained teachers (mean = 0.92, on a scale of 0 to 3) and that by the untrained teachers (mean = 0.94) was not significant, F(1,26) = 0.28, p = 0.60. Hence the hypothesis was not accepted, and it was concluded that there was no difference between the accuracy of the trained and the untrained teachers in their point-biserial estimation at the 0.05 level.

Results of Estimated and Population Point-Biserials Comparison

The Spearman rank correlation between the estimated and the population point-biserials of the 80 items (from Form A and Form B) has a value of 0.43. A scatterplot between the two sets of point-biserials for the 80 items is shown in Figure 4.9. For comparison purposes, a scatterplot between the point-biserials estimated by the item analysis research and the population point-biserials (for the same 80 items) is displayed in Figure 4.10. The two scatterplots seem to show a different spread of points. The "teacher vs population" plot seems to have a fairly even spread of points along the whole range of point-biserial values, whereas the "item analysis vs population" plot seems to show a wider spread of points at the lower end of the range.
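The Spearman rank correlation reported above is a single function call once the two sets of point-biserials are in hand. The fragment below is a minimal sketch of that computation, assuming scipy is available; the arrays are synthetic placeholders standing in for the 80 Bayesian estimates and the 80 population values, not the actual study data.

```python
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(0)

# Placeholder data: 80 population point-biserials and noisy "estimates" of them,
# standing in for the Bayesian estimates and population values of the study items.
population_pbis = rng.uniform(0.30, 0.65, size=80)
estimated_pbis = population_pbis + rng.normal(0.0, 0.08, size=80)

# Spearman rank correlation between the two sets of point-biserials.
rho, p_value = spearmanr(estimated_pbis, population_pbis)
print(f"Spearman rho = {rho:.2f}")   # the study reports rho = 0.43 for its data
```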
Results Concerning the Effect of Content

The hypothesis with respect to the effect of content area on the accuracy of point-biserial estimation was stated as,

H3a. There will be differences among different content areas in the accuracy of point-biserial estimation by experienced teachers.

The data indicated that the main effect of content area has a multivariate F-ratio of 27.28 (df = 3,24; p < 0.01) and thus supported the hypothesis. It can be concluded that the accuracy of point-biserial estimation was different for items of different content areas. The interactions (a) treatment x form x content area, (b) treatment x content area, and (c) treatment x form were all not significant at the 0.01 level. However, the interaction between form and content area was found to be significant, multivariate F(3,24) = 11.8, p < 0.01. The significant interaction is represented graphically in Figure 4.11.

[Figure 4.9.--Scatterplot between the estimated and the population point-biserials for the 80 items (x-axis: estimated point-biserial; y-axis: population point-biserial).]

[Figure 4.10.--Scatterplot between the point-biserials from the item analysis and the population point-biserials (x-axis: estimated point-biserial; y-axis: population point-biserial).]

[Figure 4.11.--Interaction of Form and Content on Accuracy of Point-Biserial Estimation. The plot shows the mean accuracy for Form A and Form B across the four content areas: chemical structure, electricity & energy, rates & equilibrium, and descriptive chemistry.]

The Mauchly sphericity test for the levels of the content factor was not significant, chi-square(5) = 1.14, p > 0.05, again indicating that the assumption of sphericity was tenable. Tables 4.11 and 4.12 display the results of the a posteriori multiple comparisons among the means of the different levels of the content factor. The post-hoc comparisons show that, for Form A, the mean accuracy of point-biserial estimation for "rates & equilibrium" items was significantly higher than the means for (a) "chemical structure" items and (b) "descriptive chemistry" items; the mean for "electricity & energy" items was significantly higher than the mean for "chemical structure" items. However, for Form B, none of the means was significantly different from the other means at the 0.01 level.

Table 4.11.--Multiple Comparisons among the Means of the Different Content Areas (Form A). Studentized range statistic, q.

Content                 Chemical Structure   Descrip. Chemistry   Electricity & Energy   Rates & Equilm.
                        (Mean = .080)        (Mean = .088)        (Mean = .106)          (Mean = .120)
Chemical Structure                           2.0                  6.1**                  9.5**
Descrip. Chemistry                                                4.1                    7.5**
Electricity & Energy                                                                     3.4

** Significant at the 0.01 level; the critical value of the studentized range statistic, q = 4.53, df = 78, 4.

Table 4.12.--Multiple Comparisons among the Means of the Different Content Areas (Form B). Studentized range statistic, q.

Content                 Descrip. Chemistry   Electricity & Energy   Chemical Structure   Rates & Equilm.
                        (Mean = .085)        (Mean = .085)          (Mean = .086)        (Mean = .095)
Descrip. Chemistry                           .02                    .28                  2.4
Electricity & Energy                                                .26                  2.4
Chemical Structure                                                                       2.0

Note: the critical value of the studentized range statistic, q = 4.53, df = 78, 4.

Results Concerning the Effect of Cognitive Level

The hypothesis to be tested with regard to this aspect of the study was stated as,

H3b.
There will be a difference among item cognitive levels in the accuracy of point-biserial estimation by experienced teachers.

This hypothesis was supported by the data (multivariate F-ratio = 5.7, df = 2,25; p < 0.01). It was concluded that the point-biserials of items at different cognitive levels were estimated with significantly different accuracy. None of the interaction effects was statistically significant. The Mauchly test of sphericity was also not significant (chi-square = 2.3, df = 2; p > 0.05). Table 4.13 presents the a posteriori comparisons among the means. The data showed that none of the means was significantly different from the others.

Table 4.13.--Multiple Comparisons among the Means of the Different Cognitive Levels. Studentized range statistic, q.

Cognitive Level    Application      Comprehension    Knowledge
                   (Mean = .087)    (Mean = .093)    (Mean = .096)
Application                         2.8              3.8
Comprehension                                        1.0

Note: the critical value of the studentized range statistic, q = 4.3, df = 52, 3.

Results Concerning the Effect of Item Difficulty Level

The hypothesis concerning this effect was stated as,

H3c. There will be differences among items of different difficulty levels in the accuracy of point-biserial estimation by experienced teachers.

The main effect of difficulty level was not significant, with a multivariate F-ratio of 1.17 (df = 2,25; p > 0.01), and the hypothesis was not accepted. All the two-way as well as the three-way interaction effects among the factors were also nonsignificant.

Results Concerning the Effect of Different Discrimination Power

The hypothesis to be tested was stated as,

H3d. There will be differences among items of different discrimination power in the accuracy of point-biserial estimation by experienced teachers.

The result of the multivariate test of significance has an observed F-ratio of 10.4 (df = 2,25; p < 0.01). Thus the hypothesis was accepted. The interaction between form and discrimination level was also significant, multivariate F(2,25) = 5.8, p < 0.01. Figure 4.12 depicts the significant interaction.

[Figure 4.12.--Interaction of Form and Discrimination Level on Accuracy of Point-Biserial Estimation. The plot shows the mean accuracy for Form A and Form B for items of low, medium, and high discrimination.]

The results of the a posteriori tests of differences among the means of the accuracy of point-biserial estimation at different discrimination levels are presented in Tables 4.14 and 4.15. The data showed that the means were not significantly different from each other.

Table 4.14.--Multiple Comparisons among the Means of the Different Discrimination Levels (Form A). Studentized range statistic, q.

Discrimination Level    Medium           High             Low
                        (Mean = .088)    (Mean = .100)    (Mean = .106)
Medium                                   2.8              4.2
High                                                      1.4

Note: the critical value of the studentized range statistic, q = 4.3, df = 52, 3.

Table 4.15.--Multiple Comparisons among the Means of the Different Discrimination Levels (Form B). Studentized range statistic, q.

Discrimination Level    High             Medium           Low
                        (Mean = .094)    (Mean = .110)    (Mean = .120)
High                                     .33              1.5
Medium                                                    1.1

Note: the critical value of the studentized range statistic, q = 4.3, df = 52, 3.

Results Concerning the Effect of Item Type

The hypothesis was stated as,

H3e. The point-biserials of the single-answer multiple-choice items will be estimated more accurately than those of the multiple-answer multiple-choice items by experienced teachers.
The point-biserials of single- answer multiple-choice items were not estimated more accurately than that of the multiple-answer multiple-choice items. All the two-way as well as the three-way effects were also not significant at 0.01 level. Summary The results of the statistical data analyses for this study were presented in two separately sections, one for the accuracy of p-value estimation and the other for the accuracy of point-biserial estimation. Accuracy of P-value Estimation The tests of all the hypotheses involving the accuracy of p-value estimation were presented in Table 4.16 and the results were summarized as follows: 1. The p-value estimation by experienced teachers trained in estimation skills was significantly more accurate that by experienced teachers not trained in estimation skills. 2a. There was a significant difference among the items of different content areas in the accuracy of p-value estimation by experienced teachers. 2b. There was a significant difference among the items of different cognitive levels in the accuracy of p-value estimation by experienced teachers. 2c. There was a significant difference among the items of different difficulty levels in the accuracy of p-value estimation by experienced teachers. 133 Table 4.16.--Summary of Tests of Significance for Item Difficulty Estimation. Double Sphericity Sig. of F Repeated Test Univ Measures Effect Chi-Sq Multiv (Avr F) Form Treatment - .000 & Form - .013 Content Content .358 .000 .000 Treat. x Form - .649 Treat. x Content .405 .309 Form x Content .050 .000 .000 Treat. x Form x Content .106 .207 Form Treatment - .000 & Form - .000 Cogn. Cognitive Level .075 .000 .000 Level Treat. x Form - .408 Treat. x Cogn. Level .808 .749 Form x Cogn. Level .464 .000 .000 Treat. x Form x Cogn. Level .631 .604 Form Treatment - .000 & Form - .022 Diff. Difficulty Level .006 .000 .000 Level Treat. x Form - .293 Treat. x Diff. Level .094 .047 Form x Diff. Level .806 .001 .000 Treat. x Form x Diff. Level .089 .057 Form Treatment - .000 & Form - .001 Discrim. Discrimination .360 .000 .000 Level Treat. x Form - .460 Treat. x Discrim. .136 .169 Form x Discrim. .449 .003 .004 Treat. x Form x .343 .419 Discrim. Form Treatment - .000 & Form - .001 Item Item Type - .000 Type Treat. x Form - .274 Treat. x Item Type - .884 Form x Item Type - .000 Treat. x Form x Item Type .171 2d. 2e. 3a. 3b. 134 There was a significant difference among the items of different discrimination power in the accuracy of p-value estimation by experienced teachers. The p-values of the single-answer multiple-choice items were estimated significantly more accurately than the multiple-answer multiple- choice items by experienced teachers. There was no significant difference between the accuracy of p-value estimation by the teachers competent in estimation skills and the accuracy of p-value estimation obtained by field-trial of item pool. The efficiency of estimation (as compared with the accuracy of estimation by field trial of item pool) by the teachers trained and competent in estimation skills was estimated to be 91% and the efficiency of estimation by teachers not trained in skills was estimated to be 78%. Accuracy of Point-biserial Estimatign The tests of all hypotheses involving the accuracy of point-biserial estimation were presented in Table 4.17 and the results were summarized as follows: 4. 5a. 
4. There was no significant difference between the accuracy of point-biserial estimation by the teachers trained in estimation skills and that of the teachers not trained in estimation skills.

5a. There was a significant difference among the items of different content areas in the accuracy of point-biserial estimation by experienced teachers.

5b. There was a significant difference among the items of different cognitive levels in the accuracy of point-biserial estimation by experienced teachers.

5c. There was no significant difference among the items of different difficulty levels in the accuracy of point-biserial estimation by experienced teachers.

5d. There was a significant difference among the items of different discrimination power in the accuracy of point-biserial estimation by experienced teachers.

5e. There was no significant difference between the single-answer multiple-choice items and the multiple-answer multiple-choice items in the accuracy of point-biserial estimation by experienced teachers.

6. The Spearman rank correlation between the estimated and the population point-biserials was 0.43.

Table 4.17.--Summary of Tests of Significance for Point-Biserial Estimation.

Double-Repeated        Effect                          Sphericity     Sig. of F    Sig. of F
Measures Analysis                                      Test (Chi-Sq)  (Multiv.)    (Univ., Avr F)
Form & Content         Treatment                            -           .603
                       Form                                 -           .000
                       Content                            .951          .000          .000
                       Treat. x Form                        -           .990
                       Treat. x Content                                 .339          .304
                       Form x Content                     .370          .000          .000
                       Treat. x Form x Content                          .303          .422
Form & Cogn. Level     Treatment                            -           .000
                       Form                                 -           .000
                       Cognitive Level                    .316          .009          .001
                       Treat. x Form                        -           .823
                       Treat. x Cogn. Level                             .296          .352
                       Form x Cogn. Level                 .026          .243          .206
                       Treat. x Form x Cogn. Level                      .921          .958
Form & Diff. Level     Treatment                            -           .357
                       Form                                 -           .000
                       Difficulty Level                   .505          .327          .370
                       Treat. x Form                        -           .772
                       Treat. x Diff. Level                             .195          .144
                       Form x Diff. Level                 .041          .033          .134
                       Treat. x Form x Diff. Level                      .276          .208
Form & Discrim. Level  Treatment                            -           .450
                       Form                                 -           .000
                       Discrimination                     .194          .001          .002
                       Treat. x Form                        -           .867
                       Treat. x Discrim.                                .653          .677
                       Form x Discrim.                    .019          .009          .074
                       Treat. x Form x Discrim.                         .905          .874
Form & Item Type       Treatment                            -           .353
                       Form                                 -           .000
                       Item Type                            -           .666
                       Treat. x Form                        -           .781
                       Treat. x Item Type                   -           .191
                       Form x Item Type                     -           .268
                       Treat. x Form x Item Type            -           .860

CHAPTER V
SUMMARY AND CONCLUSIONS

Summary

The quality of a test is determined, in part, by the extent to which the scores produced by the test are reliable and valid. Even though reliability and validity are both important indicators of a test's quality, a valid interpretation of the test scores is possible only when there is consistency in the test scores. Thus reliability of test scores is a necessary, although not sufficient, condition for valid test score interpretations. Psychometricians have developed theories that relate test characteristics such as reliability, mean, standard deviation, and standard error of test scores with item statistics such as the item difficulty and the item discrimination index. These theories enable the test developer to control the quality of a test through the use of item statistics such as the p-value and the point-biserial correlation coefficient. In the Examinations Syndicate, Malaysia, the p-value and the point-biserial of an item are estimated through annual field trials of an item pool. However, this annual field-trial exercise is plagued with practical as well as administrative problems.

The purpose of this study was to investigate an alternative approach for estimating item statistics.
Specifically, this study investigated how accurately experienced Chemistry teachers can estimate the item statistics of the Chemistry test items that will be used in the Malaysian Certificate of Education (MCE) Examination. In addition, this study examined whether the accuracy of estimation can be improved by an intervention program aimed at increasing the competency of the teachers' estimation.

A review of the literature indicated that studies related to the present research can be grouped into three broad categories: (a) Judgment Under Uncertainty, (b) Determinants of Item Difficulty and Discrimination, and (c) Empirical Studies of Accuracy of Item Statistics Estimation. Judgment under uncertainty dealt with the psychology of prediction and expert judgment. This group of studies revealed that judgmental heuristics -- representativeness, availability, and adjustment and anchoring -- are generally useful but often lead to severe and systematic bias. Studies concerning expert judgment focused on the problems of consensus judgment versus individual judgment. The studies recommended that, in order to avoid a normative effect in which individual judgments were unduly influenced by group judgments, judges should be allowed to form their own independent estimates after group discussion. Research on expert judgments also indicates that feedback about the experts' performance relative to the actual state of affairs can lead to improvement.

According to the studies, determinants of item statistics could be broadly divided into (a) intrinsic determinants and (b) extrinsic determinants. Intrinsic determinants include item complexity and the cognitive processes/components required to process item tasks, whereas extrinsic determinants include item language, content familiarity, item format, option homogeneity, grammatical inconsistency, option characteristics, and item context. Although the results of these studies have not been conclusive as to which of these factors affect item statistics, there seems to be evidence that item complexity, the cognitive components required to process item tasks, content familiarity, similarity of item options, item format, and item context are closely related to item difficulty.

Empirical studies of item statistics estimation seem to indicate that judges could estimate the relative but not the absolute item difficulties well, that the accuracy of estimation generally improved when the estimates of judges were pooled, that pooling only the estimates of competent judges provides a more accurate estimation than pooling estimates from the entire group of judges, and that providing "anchor" items improves the accuracy of estimation.

In the present study, 30 experienced teachers who had taught the examination classes in the subject of Chemistry were randomly assigned to one of two groups: the treatment and the control groups. Examination classes are classes in which the teachers prepared students to take the Malaysian Certificate of Education Examination at the end of the academic year. The teachers in the treatment group were invited to attend a 3-day training/workshop session at the Examinations Syndicate, Ministry of Education, Malaysia. The first two days of the training/workshop sessions were devoted to providing the teachers with the opportunity to develop skills and strategies for estimating item statistics. The training session consisted of a theoretical and a practical component.
The theoretical component provided the teachers with the necessary knowledge about item statistics, informed the teachers of the pitfalls involved in judgment under uncertainty, and presented the research findings on the determinants of item statistics. Teachers analyzed some sample items and were helped to develop their own list of determinants of item statistics. The practical component provided opportunity for the teachers to practice and to sharpen their skills in estimation. Feedback was provided after each practice to move the teachers' estimates, by successive approximations, closer to the parameter values of the item difficulty and item discrimination. The third day of the training session was reserved for the teachers to actually estimate the item statistics. The teachers in the control group estimated the item statistics without being trained in estimation skills, although they were also informed of the definitions and the meaning of the item statistics to be estimated.

The items to be estimated by the teachers were grouped into two forms, A and B; each form contained 40 items. Equating items were embedded in each form. The teachers were each provided with 10 "anchor" items (i.e., items with known population values of item characteristics) to guide them in the estimation.

The dependent variables in this research are: (a) the accuracy of p-value estimation, which was defined as the absolute difference between the equated p-value estimate of the item and the population p-value of the item; and (b) the accuracy of point-biserial estimation, which was defined as the absolute difference between the estimated point-biserial and the parameter value of the point-biserial of the item.

It was hypothesized that the trained teachers would be able to estimate the item statistics more accurately than the teachers not trained in estimation skills, and that the accuracy of estimation by the teachers who had been trained and were competent in estimating item statistics was not different from the accuracy of estimation obtained in field-trials of the item pool. It was also hypothesized that each of the factors (a) content area of item, (b) cognitive level of item, (c) difficulty level of item, (d) discrimination power of item, and (e) item type affected the accuracy of estimation by the experienced teachers.

Statistical analyses of the data involved a double repeated measures design with one between-subjects factor and two within-subjects factors. The treatment-control dimension was the between-subjects factor, and form was one of the two within-subjects factors. The other within-subjects factor was one of the five factors (a) to (e) mentioned above. In other words, a double-repeated measures analysis of variance technique was employed to test the treatment effect, the main effects of form, content area of item, cognitive level of item, difficulty level of item, discrimination of item, and item type, as well as the interactions among the factors. The difference between the accuracy of p-value estimation by the teachers trained and competent in estimation skills and the accuracy of estimation obtained in field-trial of the item pool was tested by a three-way ANOVA in which the factors form and content area were incorporated to increase power and to study possible interactions.

Conclusions

Accuracy of P-value Estimation

1. There was statistical evidence that the intervention program was effective in increasing the accuracy of p-value estimation by the experienced Chemistry teachers.
There was no significant interaction effect between treatment and form. There was also no significant interaction effect between treatment and each of the following factors: content area, cognitive level of item, difficulty level of item, discrimination power of item, and item type. Thus form, content area, cognitive level of item, difficulty level of item, discrimination power of item, and item type do not affect the generalization of the treatment effect.

2a. The p-values of items from different content areas were estimated with different accuracy by the experienced Chemistry teachers. Content area interacted with form. In Form A, the accuracy of estimation increased in the order: "rates & equilibrium", "electricity & energy", "chemical structure", and "descriptive chemistry"; whereas in Form B, the order was: "rates & equilibrium", "chemical structure", "electricity & energy", and "descriptive chemistry".

2b. The p-values of items of different cognitive levels were estimated with different accuracy by the experienced Chemistry teachers. Cognitive level interacted with form. In Form A, the accuracy of estimation increased as the cognitive level of items changed from "comprehension" to "knowledge" and then to "application". However, in Form B, the accuracy increased in the order: "knowledge", "comprehension", and "application".

2c. The p-values of items of different difficulty levels were estimated with different accuracy by the experienced Chemistry teachers. Difficulty level interacted with form. In Form A, "hard" items were most accurately estimated, followed by "medium" items and then "easy" items; whereas in Form B, the accuracy decreased in the order: "medium", "hard", and "easy" items.

2d. The p-values of items of different discrimination power were estimated with different accuracy by the experienced Chemistry teachers. In both forms, the high discriminating items were most accurately estimated, followed by the medium and then the low discriminating items. However, discrimination level interacted with form. The difference in accuracy of estimation between items with high discrimination power and items with medium discrimination power was smaller in Form A as compared to that in Form B.

2e. The p-values of single-answer items were estimated more accurately than those of multiple-answer items. However, item type interacted with form. In Form A, the single-answer items were more accurately estimated, whereas in Form B, the multiple-answer items were slightly more accurately estimated.

3a. There was no statistically significant difference between the accuracy of p-value estimation by the teachers who were trained and competent in estimation skills and the accuracy of p-value estimation obtained by field-trial of the item pool.

3b. The efficiency of estimation by the teachers trained and competent in estimation skills (as compared with the accuracy of estimation by field trial of the item pool) was estimated to be 91%, and the efficiency of estimation by the teachers not trained in estimation skills was 78%.

Accuracy of Point-biserial Estimation

4. There was no statistical evidence that the intervention program increased the accuracy of point-biserial estimation by the experienced Chemistry teachers.

5a. The point-biserials of items of different content areas were estimated with different accuracy. Content area interacted with form.
In Form A, the differences in accuracy of estimation between the "chemical structure" items and the "electricity & energy" items, and between the "rates & equilibrium" items and the "descriptive chemistry" items, were larger than the corresponding differences in Form B.

5b. The point-biserials of items of different cognitive levels were estimated with different accuracy. Even though the mean of accuracy of estimation for the "application" items was the lowest (i.e., most accurately estimated), the post hoc comparisons did not indicate any significant differences among the means of the three cognitive levels.

5c. The point-biserials of items of different difficulty levels were not estimated with significantly different accuracy.

5d. The point-biserials of items of different discrimination power were estimated with significantly different accuracy. Discrimination level interacted with form. In Form A, the accuracy of estimation increased in the order: "low discriminating" items, "high discriminating" items, and "medium discriminating" items; whereas in Form B, the order was: "low discriminating" items, "medium discriminating" items, and "high discriminating" items.

5e. There was no statistically significant difference in the accuracy of point-biserial estimation between single-answer items and multiple-answer items.

6. The Spearman rank correlation between the estimated and the population point-biserials for the 80 items was 0.43.

Discussion

Estimating P-values

The study showed that the experienced Chemistry teachers who had been trained in estimation skills were able to estimate the p-values of the items significantly more accurately than did those experienced Chemistry teachers who were not trained in estimation skills. In other words, the intervention program was effective in improving the accuracy of estimation of item difficulties. The estimated effect size of the treatment was 0.53 standard deviations, which was considered a reasonably large value. It represents a fair amount of improvement in accuracy of estimation. However, whether this improvement is large enough to offset the cost involved in training the experienced teachers was not investigated in this study. Another relevant concern is: Was the accuracy of estimation by the trained teachers high enough that it could be used in place of an empirical method of estimation? This question, although not answerable in absolute terms, will be addressed later.

The fact that treatment did not interact with form, content area, cognitive level of item, difficulty level of item, discrimination power of item, or item type suggests that the treatment effect does not depend on the levels of these factors.

The treatment was effective in spite of the small sample size used in the study. This could be attributed to the fact that the design of the intervention program incorporated the findings of three broad areas of research in subjective estimation, namely: (a) the psychology of estimation, (b) the determinants of item statistics, and (c) the empirical studies of subjective estimation of item statistics. The use of experienced teachers, who are familiar with the curriculum, the characteristics of the student population, and the nature of the examination, as the judges in the study could also be a reason for the effectiveness of the treatment. It is reasonable to believe that judges who are not as familiar with these characteristics may not be able to benefit from the intervention program as much as those who are familiar.
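The accuracy measure and the standardized effect size discussed above can be made concrete with a small sketch. All numbers below are invented for illustration, and the pooled-standard-deviation effect size is an assumed formula, not necessarily the exact computation used in this study.

import numpy as np

# Population (field-trial) p-values of five hypothetical items.
population_p = np.array([0.72, 0.55, 0.40, 0.63, 0.28])

# Hypothetical estimates of the same items by one trained and one untrained teacher.
trained_est   = np.array([0.68, 0.64, 0.45, 0.52, 0.35])
untrained_est = np.array([0.80, 0.67, 0.34, 0.76, 0.17])

# Accuracy as defined in this study: absolute difference from the population value
# (smaller values mean more accurate estimation).
acc_trained   = np.abs(trained_est - population_p)
acc_untrained = np.abs(untrained_est - population_p)

# A standardized effect size for the treatment-control contrast (assumed pooled-SD form).
pooled_sd = np.sqrt((acc_trained.var(ddof=1) + acc_untrained.var(ddof=1)) / 2)
effect_size = (acc_untrained.mean() - acc_trained.mean()) / pooled_sd

print("mean accuracy, trained:  ", acc_trained.mean().round(3))
print("mean accuracy, untrained:", acc_untrained.mean().round(3))
print("standardized effect size:", effect_size.round(2))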
The p-values of items of different content areas were estimated with significantly different accuracy. As indicated in Figure 4.1, the "rates & equilibrium" items were least accurately estimated. A likely explanation might be that the content area "Rates & Equilibrium of Chemical Reactions" involves very abstract and difficult concepts. They are difficult to grasp, especially for students who are being exposed to them for the first time. However, once the fundamental concepts in this area are mastered, the whole topic becomes easy. Teachers who have taught the concepts for a number of years will tend to feel that the topic consists of just a few simple rules which can be easily applied to solve the problems in the items. This feeling may bias their judgments when estimating the difficulty of items in this particular content area. In general, to estimate the p-value of an item, a judge has to make judgments concerning not only the intrinsic difficulty of the item, but also the degree to which the students have mastered the concept or concepts on which the item was based. As indicated earlier, items in this content area could be very difficult if students did not completely understand the few fundamental concepts. However, once these fundamental concepts are mastered, most of the items could be answered correctly by routine applications of a few rules. The items then become very easy. Thus the ability to make an accurate judgment about students' degree of mastery of the fundamental concepts is crucial for an accurate estimation of the difficulty of the items in this content area. A small error in the judgment may result in a large error in the accuracy of estimation. This need for an accurate assessment of students' mastery of the fundamental concepts of "rates & equilibrium of chemical reactions", coupled with the tendency for the teachers to feel that the topic is easy, resulted in the teachers' estimations being least accurate in this content area.

The study showed that the p-values of the "chemical structure" items in Form B were less accurately estimated than those in Form A (Figure 4.1). However, for other content areas the items in Form B were either more accurately estimated or about the same as those in Form A. The two forms were constructed to measure the same content areas of Chemistry. The corresponding test statistics, i.e., means, standard deviations, and reliabilities (KR20), of the two forms are similar (the mean Deltas of Form A and Form B are 12.3 and 12.0, their standard deviations are 9.32 and 8.77, and their KR20's are .917 and .909, respectively). Furthermore, the accuracy of estimation for three out of the four content areas ranks in the same manner in both forms (Figure 4.1). The exceptionally poor accuracy of estimation for "chemical structure" items in Form B was unanticipated.

The p-values of items of different cognitive levels were estimated with different accuracy. An examination of the graph of accuracy versus cognitive levels (Figure 4.2) indicated that the "knowledge" items in Form B have an unexpectedly low accuracy of estimation as compared to the same type of items in Form A. A further investigation showed that three of the "knowledge" items in Form B were exceptionally poorly estimated. These two observations seem to indicate that the variability of the accuracy of p-value estimates for "knowledge" items was larger than that for items of the other two cognitive levels.
This larger variation in the accuracy of p-value estimates for "knowledge" items could be attributed, in part, to the fact that "knowledge" items tend to have a larger range of p-values. Items measuring knowledge that is familiar to the examinees will tend to be easy and hence have very high p-values. On the other hand, if a "knowledge" item measures a certain fact or specific knowledge that the examinees either could not remember or did not learn, its p-value will be very low. It is unlikely that the judges would be able to identify which specific knowledge/facts the examinees had not learned or could not remember. This could then lead to a situation in which, for some of the items, the judged p-values differ greatly from the actual p-values, resulting in the p-values of some items being exceptionally poorly estimated.

The p-values of items of different difficulty levels were estimated with varied accuracy. Specifically, the easy items were least accurately estimated. This result can be explained, in part, by the error of central tendency phenomenon (Tinkelman, 1947; Guilford, 1954), which is the tendency for people to avoid making extreme judgments and check the middle of the scale instead. Hence items with high p-values (i.e., easy items) tend to be rated as medium items. However, the phenomenon of error of central tendency did not seem to affect the accuracy of estimation for "hard" items. A probable reason is that teachers have a natural tendency to find out why certain items are difficult and how to help students answer these items correctly. As a consequence, teachers are sensitized to recognize difficult items rather than easy items. Thus experienced teachers are able to estimate the p-values of difficult items more accurately than those of the easy items.

With regard to items with different discrimination power, the study showed that the p-values of low discriminating items were estimated less accurately than those of the high discriminating items. The result was consistent with what was expected. It is generally true that the discrimination index of an item reflects the quality of the item. If an item has low discriminating power, it generally indicates that some ambiguities are present in the stem and/or the options. In many instances, a low discrimination index is also caused by the presence of extraneous difficulties which make the item difficult for both the high and the low ability groups. Extraneous difficulties are difficulties that may arise as a result of examinees not learning or not being taught certain concepts or facts (which the examinees were assumed to know) that are essential for answering the particular item correctly. When estimating the p-value of an item, the teachers first answer the item themselves and then assess how difficult it will be for the examinees to answer it correctly. It is generally true that teachers are more knowledgeable than the examinees in the content areas measured by the item, and hence are unlikely to experience any difficulties in answering the item even if extraneous difficulties are present in the item. In other words, it is difficult for teachers to detect the presence of extraneous difficulties. If teachers are not able to detect irrelevant or extraneous difficulties, it is unlikely that they will be able to estimate accurately the p-values of low discriminating items, because low discriminating items are items that are ambiguous and/or have extraneous difficulties.
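The argument above can be illustrated with a small simulation (invented parameters, not data from this study): an item whose difficulty comes from the tested content discriminates between high- and low-ability examinees, while an item made hard by an extraneous source of difficulty that affects all examinees alike shows a discrimination index near zero.

import numpy as np

rng = np.random.default_rng(0)
n = 2000
ability = rng.normal(0, 1, n)                    # latent ability
total = ability + rng.normal(0, 0.3, n)          # observed total score (a noisy proxy)

# Content-driven item: the probability of success rises with ability.
p_content = 1 / (1 + np.exp(-(ability - 0.5)))
content_item = rng.random(n) < p_content

# Extraneously difficult item: equally hard for everyone (e.g., ambiguous wording).
extraneous_item = rng.random(n) < 0.35

def point_biserial(item, total):
    return np.corrcoef(item.astype(float), total)[0, 1]

print("content-driven item:  p =", content_item.mean().round(2),
      " r_pbis =", point_biserial(content_item, total).round(2))
print("extraneous item:      p =", extraneous_item.mean().round(2),
      " r_pbis =", point_biserial(extraneous_item, total).round(2))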
The results of the study suggested that the p-values of the single-answer type items were more accurately estimated than those of the multiple-answer type items. A possible explanation could be offered as follows: the single-answer format is less complex than the multiple-answer format. In estimating the p-value of a single-answer type item, the judges read the stem, evaluate the options, and then make a judgment about how difficult it would be for the examinees to select the right answer. However, in a multiple-answer type situation, the judges have to make, for each item, as many such judgments as the number of right alternatives. Furthermore, in the case of the multiple-answer format, the judges also have to make judgments on how the various combinations of alternatives affect the item difficulty. These two factors make the p-value estimation of multiple-answer items more difficult and hence less accurate than in the case of single-answer type items. Although the result of the study is consistent with expectations, the significant interaction effect between item type and form limits the generalizability of the finding.

The results of the data analysis indicated that form had a significant interaction with each of the within-subjects factors -- content area, cognitive level, difficulty level, discrimination level, and item type -- investigated in this study. The presence of interactions indicated that the generalization of the main effect of each of the five within-subjects factors must be qualified. A discussion of the generalizations, taking into consideration the presence of interactions, is presented below:

Form x content area interaction. An examination of Figure 4.1 and Tables 4.1 & 4.2 seemed to indicate that the interaction between the Form and Content Area factors resulted mainly from the differential mean accuracy of estimation of "chemical structure" items in Form A and Form B. In other words, if we disregarded the level "chemical structure", the findings that (a) the "rates & equilibrium" items were less accurately estimated than both the "electricity & energy" and the "descriptive chemistry" items, and (b) there was no significant difference between the accuracy of the "electricity & energy" items and the "descriptive chemistry" items, could be generalized to other forms.

Form x cognitive level interaction. The instability of the accuracy of p-value estimation for "knowledge" items across the two forms (see Figure 4.2) appeared to be the main source of variation that contributed to the significant interaction effect between the Form and Cognitive Level factors. Although the instability of "knowledge" items has complicated the interpretation of the main effect of cognitive level, it does not, however, invalidate the generalization (to other forms) that the "application" items are more accurately estimated than the "comprehension" items.

Form x difficulty level interaction. Even though the interaction between the Form and Difficulty Level factors was significant, an examination of the graphical representation of the interaction effect (Figure 4.3) showed that the conclusion that "easy" items were less accurately estimated than either the "medium" items or the "hard" items can still be generalized to other forms. The results in Tables 4.5 & 4.6 also allow the finding that the "medium" items and the "hard" items are estimated with similar accuracy to be generalized to other forms.

Form x discrimination level interaction.
The situation for the interaction between the Form and Discrimination Level factors (Figure 4.4) is slightly different from that for the Form x Difficulty Level interaction. In this case, in spite of the presence of interaction, the finding that the p-values of the "highly discriminating" items were more accurately estimated than those of the "low discriminating" items can still be generalized to other forms. However, no generalization (to other forms) can be made with regard to either the difference in the accuracy of estimation between the "highly discriminating" and the "medium discriminating" items or between the "medium discriminating" and the "low discriminating" items.

Form x item type interaction. Even though the main effect of item type was significant, the interaction between the Form and Item Type factors was also significant. The presence of an interaction indicates that the effect of item type on the accuracy of p-value estimation depends on form, thus limiting the generalization of the main effect.

A comparison of the accuracy of p-value estimation by the experienced teachers trained and competent in estimation skills with the accuracy of estimation by a field-trial of the item pool was carried out in the study. The purpose of the comparison was to address the question as to whether the estimation by these teachers could be used as a substitute for the estimation obtained from the empirical method. The results suggest that the accuracy of estimation by the trained and competent teachers was not significantly different from the accuracy obtained in the item analysis research of an item pool. This conclusion was arrived at by comparing the mean accuracy of estimation by the 10 most competent teachers (competent in terms of estimation skills) with the accuracy obtained in the field-trial of the item pool, using an ANOVA technique. The method of comparison used in this study was different from the usual approach reported in the literature (Bejar, 1981; Lorge & Diamond, 1954; Lorge & Kruglov, 1953; Quereshi & Fisher, 1977; Tinkelman, 1947; Willoughby, 1980), in which the estimated p-values were typically correlated with the empirically determined p-values of the items under study. A shortcoming of using a correlational technique in this type of study is that a high correlation between subjectively estimated p-values and empirically estimated p-values does not necessarily result in high agreement between the two sets of p-values; conversely, high agreement does not necessarily result in high correlation (a numerical illustration of this point is given at the end of this section).

Using the accuracy of estimation obtained in the field-trial of the item pool as the standard of accuracy, the study showed that the accuracy of estimation by the teachers competent in estimation skills was about 91% of the accuracy obtained in the field-trial of the item pool. Whether or not this degree of accuracy is sufficiently high that subjective estimation by experienced teachers can be used in place of empirical estimation should be determined by a practical rather than a theoretical consideration. Apart from the degree of accuracy, the availability of students for field-trials, the administrative constraints involved in field-trials of the item pool, financial expenses, and the constraints of time are some of the factors that will influence the choice between teacher estimation and the empirical procedure for estimating item statistics in the process of test construction. It is only after these factors have been considered that a choice between the two methods can be made.
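As a numerical illustration of the distinction drawn above between correlation and agreement (all values are invented):

import numpy as np

def report(label, est, emp):
    r = np.corrcoef(est, emp)[0, 1]
    mad = np.abs(est - emp).mean()
    print(f"{label}: correlation = {r:.2f}, mean |difference| = {mad:.3f}")

# Case 1: estimates track the ordering perfectly but carry a constant bias,
# giving a correlation of 1.0 despite poor agreement with the empirical values.
emp1 = np.array([0.25, 0.35, 0.45, 0.55, 0.65, 0.75])
report("biased but well-ordered ", emp1 + 0.20, emp1)

# Case 2: every estimate is within .02 of the empirical value, but the empirical
# values barely vary, so agreement is good while the correlation is weak.
emp2 = np.array([0.49, 0.50, 0.51, 0.50, 0.49, 0.51])
est2 = np.array([0.51, 0.50, 0.49, 0.48, 0.50, 0.49])
report("close but weakly related", est2, emp2)

Because correlation is insensitive to constant shifts and is dominated by rank order, the accuracy measure used in this study (the absolute difference) speaks more directly to whether subjective estimates can replace empirical ones.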
Estimating Point-Biserials

The study showed that treatment did not improve the teachers' accuracy in estimating the point-biserials of the items. The nonsignificant treatment effect could be due to the fact that the estimation procedure, in this case, consisted of two components: (a) the teachers' subjective estimation and (b) a small-sample empirical estimation. The data from the treatment and the control groups were each combined with the same data obtained from the empirical estimation procedure to produce the respective sets of data for the treatment group and the control group teachers. Incorporating a common set of data from the empirical estimation has the effect of reducing the final difference between the mean accuracy of point-biserial estimates of the treatment group and that of the control group. This could then result in a nonsignificant difference. It is perhaps relevant to point out that very little research on subjective estimation of point-biserials has been reported in the literature. As a result, the intervention program was not able to benefit from a review of the literature to gain insights concerning subjective point-biserial estimation.

The accuracy of point-biserial estimation was found to be affected by the content area, the cognitive level of the items, and the discrimination power of the items. However, the accuracy was not affected by the difficulty level of the items or the item type. Regarding content areas, the "rates & equilibrium" items were least accurately estimated. In the case of cognitive levels, it was the "knowledge" items that were most poorly estimated; whereas among the different discrimination levels, the accuracy of the "low discriminating" items was the lowest. It is interesting to note that the accuracy of p-value estimation of items in each of these three areas (i.e., "rates & equilibrium", "knowledge", and "low discriminating") was also the lowest when compared to other levels within the appropriate factor. This similarity helps to shed some light on the question of why the point-biserial estimations of items in these areas are lowest. First of all, the accuracy of point-biserial estimation depends on how accurately the proportion of the high ability group who will answer the item correctly and the proportion of the low ability group who will answer the item incorrectly are estimated. To the extent that these proportions are inaccurately estimated, the accuracy of point-biserial estimation will be low. Since the p-values of the items in these three areas were poorly estimated, it follows that the proportion of the high ability group who will answer the items correctly and the proportion of the low ability group who will answer incorrectly would also be poorly estimated. This results in poor point-biserial estimations. Thus the explanation (which was given in the earlier section) for the low accuracy of p-value estimation of items in these three areas could also account for the low accuracy of the point-biserial estimates in these areas.

Both multivariate and univariate F-tests indicated that the interaction between the Form and Content Area factors was significant at the 0.01 level. However, a study of Figure 4.11 and Tables 4.11 & 4.12 showed that, for Form A, the mean accuracy of "rates & equilibrium" items was significantly higher than that of (a) "chemical structure" items and (b) "descriptive chemistry" items. The mean accuracy of "electricity & energy" items was significantly higher than that of "chemical structure" items.
However, such results were not repeated for the items in Form B. Thus it may not be appropriate to generalize the findings concerning the differences among the means to other forms.

With regard to the interaction between the Form and Discrimination Level factors, even though the multivariate F-test indicated that the interaction was significant, the Tukey's test did not show any significant differences among the means in either form. This discrepancy was probably due to the fact that Tukey's multiple-comparison procedure is a post hoc comparison technique which is known to be less powerful (Glass & Hopkins, 1984).

Implications for Further Research

This study explored two major concerns in item statistics estimation. One concerned whether the accuracy of subjective estimation could be improved by an intervention program, and the other concerned the extent to which the accuracy of subjective estimation approached the accuracy of estimation obtained during item analysis. Although the results of the study with respect to both concerns were quite promising (i.e., the treatment effect was significant for p-value estimation, and the accuracy of estimation by the teachers was not significantly different from that obtained in the field-trial of the item pool), additional research in this area is required before subjective estimation procedures can be used in the place of an empirical estimation procedure.

Several directions can be suggested for future research on subjective estimation. One direction that might be taken is to replicate the study using larger samples of teachers from different geographical regions and using Chemistry test items from other years. The present study has indicated that several properties of the items, such as content area, cognitive level of the item, and item type, interacted with test forms. Replication with other forms can shed more light on the pattern of interaction between each of these factors and form.

Only Chemistry test items and Chemistry teachers were used in this study. A relevant question that can be asked is to what extent the results of the study can be generalized to teachers and test items of other subject matters. Thus another direction which future research can take is to replicate the study using teachers and items from other subject matters such as Physics and Biology.

This study showed that the intervention program was effective with two days of training. If the subjective estimation procedure were to be used in place of the empirical estimation procedure, it would be important to find out whether a more intensive training program could improve the accuracy of estimation over and above the improvement achieved in the original two-day training session. To answer this question, research similar to the present study could be carried out with the intensity and duration of the training program as two of the factors in the research design.

The study has also identified the specific level within each of the factors that was least accurately estimated. For example, among the various content areas, "rates & equilibrium" items were least accurately estimated, and among items of different difficulty levels, the estimation of the item statistics of the "easy" items was least accurate. Further research efforts could be taken to explore and to devise procedures to improve the accuracy of estimation in those specific levels.
One possible approach is to train judges using more items in these areas as practice items and to provide feedback to fine-tune their estimation after each practice session. A study in which the types and quantity of practice items used in the training sessions are manipulated may address this issue.

APPENDICES

APPENDIX A

TEACHER DATA

Table A1.--Description of the Teachers in the Treatment Group.
(Columns: Teachers T1-T15, followed by the group Mean and Std. Dev.)

Age: 35 28 35 35 33 40 38 38 40 31 39 43 32 29 28; Mean 34.9, SD 4.55
Teaching Experience (Years): 9 5 9 11 9 16 14 12 15 4 15 20 7 6 4; Mean 10.7, SD 4.48
Average No. of Years Teaching Forms 4 & 5*: 6 4 7 5 3 15 8.5 7.5 4.5 3 13 6 2 4.5 3.5; Mean 6.18, SD 3.67
No. of Years Teaching Form 6*: 0 0 4 1 0 8 0 3 1 0 0 0 0 0; Mean 1.21, SD 2.24
No. of Years Rating Essay Chemistry Test Papers: 7 1 0 0 0 13 1 2 0 0 4 1 4 0; Mean 2.36, SD 3.58
No. of Years Rating Practical Chemistry Test Papers: 0 0 0 10 0 0 0 6 0 0 12 1 0 0 0; Mean 1.93, SD 3.87
Percent of Students Obtaining a Grade of 6 and Better**: 67 55 80 90 75 96 66 55 93 69 39 60 60 70; Mean 69.6, SD 15.5

* Forms 4, 5 & 6 are equivalent to the 10th, 11th & 12th Grades in the American educational system.
** Grading is based on a 1-9 point system; a grade of 6 is considered as having passed with credit.

Table A2.--Description of the Teachers in the Control Group.
(Columns: Teachers T1-T15, followed by the group Mean and Std. Dev.)

Age: 30 35 42 33 33 35 27 31 40 43 33 37 35 39 29; Mean 34.8, SD 4.55
Teaching Experience (Years): 6 12 15 10 10 9 4 6 15 16 8 12 10 11 5; Mean 9.9, SD 3.59
Average No. of Years Teaching Forms 4 & 5*: 3 5 13 10 2 9 4 1.5 13 8 3.5 6 6.5 11 5; Mean 6.71, SD 3.8
No. of Years Teaching Form 6*: 0 12 10 0 5 7 0 6 0 1 0 0 0 2; Mean 3.07, SD 4.03
No. of Years Rating Essay Chemistry Test Papers: 5 0 3 4 2 6 0 1 5 1 5 2 3 0 2; Mean 2.6, SD 1.96
No. of Years Rating Practical Chemistry Test Papers: 0 0 0 0 3 0 2 0 0 13 0 0 0 0 0; Mean 1.2, SD 3.3
Percent of Students Obtaining a Grade of 6 and Better**: 36 75 so 91 50 52 90 71 58 70 52 90 1.0 66 10.7

* Forms 4, 5 & 6 are equivalent to the 10th, 11th & 12th Grades in the American educational system.
** Grading is based on a 1-9 point system; a grade of 6 is considered as having passed with credit.

APPENDIX B

EXPLANATION

Technical Terms

This research involves several technical terms that have to be explained in detail. You are encouraged to understand these terms before you begin the estimation work.

Item Statistics

In the field of educational testing and measurement, the term "item" is used by test developers to mean a question in an objective question paper. Item statistics are figures that indicate the characteristics of an item. It is common for people to use different types of statistics to describe the properties/characteristics of different things. For example, statistics used to indicate the characteristics of a boxer include height, age, and arm-length. In educational measurement, the statistics used to indicate the characteristics of an item are item difficulty and the discrimination index. The following example will clarify the concept. The following item was used in a Mathematics Test for Standard Three:

Example: 28 + 45 =
A. 19
B. 63
C. 73

Item Difficulty is .45
Discrimination Index is .28

Item Difficulty

Item difficulty is a figure that indicates how difficult an item is. It is the ratio of the number of students who answer the item correctly to the total number of students who answer the item.
For example, suppose 100 students answer the item shown in the example above, and out of the 100 students only 45 students answer it correctly. Hence, the item difficulty of the item is 45/100, that is, the item difficulty is .45. In theory, item difficulty is a number that can take values from 0.00 to 1.00. An item difficulty of 0.00 means that the item is too difficult and no one has answered it correctly. On the other hand, an item difficulty of 1.00 shows that the item is too easy and every student who attempted the item got the right answer. However, in general, useful items have item difficulty in the range of .20 to .80.

Discrimination Index

The discrimination index is an item statistic which is more difficult to understand and also more difficult to estimate accurately. It is a number which indicates the extent to which high ability students answer the item (question) correctly and the extent to which low ability students answer the item wrongly. In theory, a discrimination index can take values from -1.00 to +1.00, but in practice its values generally do not go below -.40 and do not go above .70. An item with a high and positive discrimination index means that the majority of the students who answer the item correctly are those students with high ability and that the majority of students who answer the item wrongly are those students with low ability. In other words, items with a high discrimination index are able to discriminate or differentiate high ability students from low ability students. In contrast, items with low and positive discrimination indices (for example, +.05) are not able to differentiate high ability students from low ability students because approximately equal proportions of high ability students and low ability students answer the item correctly. As explained above, a discrimination index may take negative values. A negative discrimination index (for example, -.30) means high ability students answer the item incorrectly and low ability students answer the item correctly. It is clear, then, that items with a high and positive discrimination index are more useful than items with a low and positive discrimination index, and items with a negative discrimination index should not be included in the test.

APPENDIX C

DETERMINANTS OF ITEM DIFFICULTY

The frequency with which the same type of item appears in previous years' test papers is related to the difficulty of the item.
The time of the year when the topic was taught; items from topics taught earlier tend to be easier.
Items based on unfamiliar (obscure) aspects of the syllabus are more difficult.
Look for options that are too obvious; obvious options reduce item difficulty considerably.
Questions involving calculations are generally more difficult. If items involve calculations which are mechanical, the items tend to be easier.
Items requiring direct recall or testing a direct relationship are easier.
The way the stem is phrased often gives an indication of the difficulty of the item.
Items involving the "mole" concept and the "ratio" concept are generally more difficult.
Items requiring transfer of knowledge tend to be more difficult.
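To make the definitions in Appendix B concrete, the sketch below computes an item difficulty and one common form of discrimination index, the difference in proportion correct between an upper and a lower scoring group. The 27% grouping rule and the simulated responses are illustrative assumptions only; the index used elsewhere in this study (the point-biserial) is computed differently.

import numpy as np

rng = np.random.default_rng(1)
n_students, n_items = 200, 40

# Simulated 0/1 responses: higher ability makes a correct answer more likely.
ability = rng.normal(0, 1, n_students)
difficulty = rng.normal(0, 1, n_items)
prob = 1 / (1 + np.exp(-(ability[:, None] - difficulty[None, :])))
responses = (rng.random((n_students, n_items)) < prob).astype(int)

totals = responses.sum(axis=1)
cut = int(0.27 * n_students)            # assumed 27% upper/lower split
order = np.argsort(totals)
lower, upper = order[:cut], order[-cut:]

item = 0                                # examine the first simulated item
p_upper = responses[upper, item].mean()
p_lower = responses[lower, item].mean()

print("item difficulty (p):", responses[:, item].mean().round(2))
print("discrimination index (upper - lower):", round(p_upper - p_lower, 2))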
APPENDIX D

BAYESIAN APPROACH

The theoretical value of a correlation coefficient lies within the range of +1.00 and -1.00. In the framework of Bayesian theory (Box & Tiao, 1973; Iversen, 1984), if no prior knowledge about the distribution of the population correlation coefficient is available, it essentially means that the prior distribution of the population correlation coefficient $\rho$ is rectangular from -1.00 to +1.00, i.e., the unknown $\rho$ could be anywhere in that range. From the "anchor" items and the experience gained in the training sessions, the teachers were expected to narrow down, for each item, the range in which the item discrimination index would lie. This narrower range represents the information about the prior distribution of the point-biserial $\rho$. In a Bayesian framework, this prior distribution is translated to a normal distribution through the following equation:

$$\theta = \frac{1}{2}\ln\!\left(\frac{1+\rho}{1-\rho}\right) \qquad (1)$$

The variable $\theta$ is normally distributed. Thus, if the upper and lower limits of the range of the value of $\rho$ are denoted by $\rho_U$ and $\rho_L$ respectively, then the upper and lower limits of the range of $\theta$, namely $\theta_U$ and $\theta_L$, are computed as follows:

$$\theta_U = \frac{1}{2}\ln\!\left(\frac{1+\rho_U}{1-\rho_U}\right), \qquad \theta_L = \frac{1}{2}\ln\!\left(\frac{1+\rho_L}{1-\rho_L}\right)$$

The mean ($\mu$) of the prior distribution became:

$$\mu = \frac{\theta_U + \theta_L}{2}$$

The prior standard deviation became:

$$\sigma = \frac{\theta_U - \theta_L}{4}$$

The point-biserial ($r$) obtained from the tryout sample of 110 students was transformed onto the $\theta$ scale using equation (1), and this transformed value was denoted by $z$, where

$$z = \frac{1}{2}\ln\!\left(\frac{1+r}{1-r}\right)$$

With the observed discrimination index transformed to $z$, the random variable $z$ follows a normal distribution with variance

$$\operatorname{var}(z) = \frac{1}{n-3},$$

where $n$ is the size of the tryout sample. The information about the discrimination index given in the prior distribution, with mean $\mu$ and standard deviation $\sigma$, could be combined with the observed sample discrimination index by the following equation:

$$\mu'' = \frac{\dfrac{\mu}{\sigma^{2}} + \dfrac{z}{\operatorname{var}(z)}}{\dfrac{1}{\sigma^{2}} + \dfrac{1}{\operatorname{var}(z)}}$$

$\mu''$ represents the mean of the posterior distribution, which could be converted back to the $\rho$-scale. This is taken as the Bayesian estimate of the point-biserial of the item.

APPENDIX E

SAMPLE OF SCATTERPLOTS

[Figures E1-E6, delta plots for individual teachers with the estimated delta on the horizontal axis, are not reproduced here. Their captions are: Figure E1.--Delta Plot for Teacher No. 1 (Treatment Group, Form A); Figure E2.--Delta Plot for Teacher No. 2 (Treatment Group, Form A); Figure E3.--Delta Plot for Teacher No. 3 (Treatment Group, Form A); Figure E4.--Delta Plot for Teacher No. 4 (Control Group, Form A); Figure E5.--Delta Plot for Teacher No. 5 (Control Group, Form A); Figure E6.--Delta Plot for Teacher No. 6 (Control Group, Form A).]

APPENDIX F

TABLES OF MEANS OF ACCURACY

Table F1.--Mean of Accuracy on Different Content Areasa. (P-Value) Form Treatment Teacher A B CA1 CA2 CA3 CA4 CA1 CA2 CA3 CA4 T1 090 095 147 056 113 066 146 079 Trained T2 096 093 122 063 116 095 112 117 .
T3 077 105 114 094 112 092 119 095 In T4 105 110 132 099 114 077 116 111 Estimation Skills TS 086 092 123 095 112 102 113 092 T6 080 078 153 090 144 099 104 100 T7 071 102 151 079 121 093 141 070 T8 097 078 132 074 124 095 115 065 T9 101 095 143 082 106 110 101 076 T10 091 083 156 087 124 121 147 089 T11 080 084 145 109 125 108 110 079 T12 095 102 128 094 106 112 143 089 T13 089 105 148 085 121 105 126 093 T14 097 099 138 087 103 102 127 097 T15 104 104 154 109 111 088 144 107 T1 087 114 140 097 131 103 132 095 Not T2 100 131 133 102 144 108 130 114 Trained T3 119 143 140 130 142 102 125 087 in T4 083 104 120 098 123 089 129 123 Estimation 15 081 094 161 088 135 127 105 071 Skills T6 089 113 145 107 124 084 127 108 T7 095 113 146 086 132 080 131 094 T8 089 113 121 098 135 103 110 091 T9 118 099 130 085 150 082 136 115 T10 088 114 127 086 125 103 150 092 T11 086 105 146 112 129 106 123 101 T12 102 092 147 106 123 116 132 088 T13 101 101 145 094 113 111 132 080 T14 087 085 136 103 130 096 122 091 T15 115 112 153 089 119 105 132 096 8Values in table have been multiplied by 10 . CA1 = Chemical structure; CA2 = Electricity 8 Energy CA3 Rates 8 Equilibrium; CA4 8 Descriptive Chemistry. 178 179 Table F2.--Hean of Accuracy on Different Difficulty Levelsa. (P-Value) Form Treatment Teacher A B 0A1 DAZ DA3 DA1 0A2 DA3 T1 119 089 077 134 080 082 Trained T2 103 D95 078 107 110 113 _ T3 141 084 076 146 081 081 Esti:ation T4 131 120 077 122 089 099 Skills TS 113 090 088 124 086 101 T6 095 D87 097 117 108 120 T7 123 097 073 127 088 103 T8 119 076 083 145 075 080 T9 100 109 090 120 074 100 T10 106 092 096 128 114 119 T11 154 073 078 162 072 080 T12 152 072 099 147 065 109 T13 151 086 085 172 072 081 T14 149 087 081 140 104 072 T15 143 102 100 138 109 085 T1 138 096 099 180 086 074 Not T2 173 112 074 162 107 103 Trained T3 139 139 123 149 081 113 in T4 128 085 095 145 099 097 Estimation T5 124 090 092 136 093 107 Skills T6 137 110 087 141 086 100 T7 130 113 083 144 107 079 T8 141 103 075 170 088 072 T9 106 109 102 172 084 102 T10 149 095 074 165 104 078 T11 157 094 082 175 073 088 T12 153 081 097 174 082 080 T13 153 084 095 160 068 087 T14 163 065 081 171 063 087 T15 141 107 100 144 108 085 3 8Values in table have been multiplied by 10 . DA1 = Easy Item; 0A2 = Medium Items; DA3 = Hard Items. 180 Table F3.--Hean of Accuracy on Different Cognitive Levelsa. (P-Value) Form Treatment Teacher A B CN1 CNZ CN3 CN1 CNZ CN3 T1 102 091 085 117 095 087 Trained T2 097 089 090 123 113 084 . T3 092 115 075 119 109 077 Esti2ation T4 106 116 106 114 108 084 Skills TS 083 117 078 130 088 102 16 084 111 074 129 109 107 T7 087 118 076 120 105 093 T8 093 102 063 119 096 092 T9 096 119 077 120 099 072 T10 093 106 088 116 131 109 T11 088 114 081 122 105 092 T12 103 118 073 126 107 097 T13 090 122 093 133 109 087 T14 108 101 094 132 100 078 115 096 122 123 103 111 125 T1 093 121 112 126 117 104 Not T2 121 141 071 146 122 102 Trained T3 142 136 117 134 111 105 in T4 084 114 101 131 113 098 Estimation TS 086 115 096 111 119 109 Skills T6 108 132 077 125 109 096 T7 108 118 091 121 106 106 T8 093 124 092 137 106 088 T9 116 106 090 148 119 094 T10 099 108 105 129 111 112 111 097 128 089 134 119 085 T12 110 104 102 130 114 098 T13 095 127 087 127 111 078 T14 097 109 074 127 114 083 T15 104 124 115 138 116 069 8Values in table have been multiplied by 103. CN1 = Knowledge; CN2 = Comprehension; CN3 = Application. 181 Table F4.--Hean of Accuracy on Different Discrimination Levelsa. 
(P-Value) Form Treatment Teacher A B DL1 DL2 DL3 DL1 0L2 DL3 T1 117 088 081 110 110 078 Trained T2 094 096 088 123 099 108 . T3 107 115 074 108 127 071 Esti:ation T4 123 104 105 097 106 112 Skills T5 084 119 085 112 110 091 T6 090 099 088 110 116 123 T7 105 096 092 101 122 098 T8 096 092 083 107 115 083 T9 095 112 096 101 108 088 T10 103 094 095 108 130 125 T11 110 108 077 118 121 080 T12 116 103 091 124 110 097 T13 116 104 093 134 127 065 T14 121 086 101 123 109 082 T15 121 124 096 105 131 093 T1 128 115 088 133 127 083 Not T2 135 116 105 145 111 123 Trained T3 142 132 129 112 130 108 in T4 109 103 090 122 118 104 Estimation T5 109 093 099 111 124 104 Skills T6 103 124 106 122 109 102 T7 129 099 100 115 116 100 T8 106 119 092 130 123 077 T9 082 125 109 136 125 105 T10 120 091 101 126 130 090 T11 119 118 090 138 131 071 T12 122 086 110 130 125 087 T13 136 100 087 122 114 087 T14 136 089 071 129 128 069 T15 130 102 123 135 100 105 8Values in table have been multiplied by 103. DL1 Low Discrimination; DLZ = Medium Discrimination; DL3 = High Discrimination. 182 Table F5.--Hean of Accuracy on Different Item Typea. (P-Value) Form Treatment Teacher A 8 SA HA SA MA T1 077 122 101 101 Trained T2 078 116 103 122 _ T3 094 102 105 105 Esti:ation T4 105 119 101 111 Skills T5 092 102 102 113 T6 091 094 118 113 T7 095 100 103 116 T8 075 114 106 099 T9 103 097 099 103 T10 090 109 135 098 T11 090 109 116 096 T12 092 120 115 105 T13 098 112 109 119 T14 093 118 102 114 T15 109 119 108 117 T1 097 128 118 115 Not T2 108 133 127 125 Trained T3 122 154 116 120 in T4 093 111 117 114 Estimation T5 094 110 116 111 Skills T6 108 116 114 108 T7 091 138 112 110 T8 098 117 111 117 T9 102 113 122 125 T10 098 113 120 114 T11 103 116 118 114 T12 102 113 117 115 T13 097 122 111 106 T14 077 130 114 109 T15 105 131 123 097 8Values in table have been multiplied by 103. SA = Single-answer type; HA 2 Multiple-answer type. 183 Table F6.--Hean of Accuracy on Different Content Areasa. (Point Biserial) Form Treatment Teacher A 8 CA1 CA2 CA3 CA4 CA1 CA2 CA3 CA4 T1 082 099 107 086 072 082 093 091 Trained T2 087 106 132 093 100 086 085 061 . T3 072 087 145 103 105 059 107 089 in . T4 079 115 108 106 084 095 116 074 Estimation Skills T5 092 133 118 106 098 073 081 086 T6 061 105 140 084 082 081 105 075 T7 062 097 111 100 084 072 117 086 T8 065 103 137 051 075 100 098 091 T9 072 094 105 111 092 063 107 099 T10 093 105 116 054 068 083 076 058 T11 060 107 134 075 064 077 084 076 T12 112 114 139 040 085 098 095 109 T13 065 109 101 087 062 064 098 083 T14 107 106 119 079 091 100 108 123 T1 092 136 164 109 099 084 089 091 Not T2 106 096 . 109 079 075 089 089 063 Trained T3 098 088 120 097 089 078 092 088 in T4 086 094 104 075 075 099 081 064 Estimation T5 069 096 145 079 110 137 108 116 Skills T6 079 101 118 114 075 083 092 075 T7 069 108 120 082 089 073 079 114 T8 098 095 130 112 103 079 099 073 T9 038 132 110 071 097 067 103 092 T10 091 085 128 086 090 058 072 067 T11 069 084 094 078 072 072 100 064 T12 070 123 134 118 084 113 101 107 T13 099 124 074 095 097 119 074 091 T14 065 121 100 105 094 097 100 071 _-_--____-____--_-------_-_-_--_-_------_-_§ __________________________ 8Values in table have been multiplied by 10 . CA1 = Chemical Structure; CA2 8 Electricity 8 Energy; CA3 Rates 8 Equilibrium; CA4 = Descriptive Chemistry. 184 Table F7.--Mean of Accuracy on Different Difficulty Levelsa. (Point-Biserial) Form Treatment Teacher A 8 DA1 0A2 DA3 DA1 DAZ DA3 T1 084 079 122 089 073 082 Trained T2 089 090 132 081 090 086 . 
T3 087 085 116 102 081 085 ‘n T4 082 099 128 089 091 093 Estimation Skills T5 135 116 096 096 076 083 T6 104 089 096 080 091 086 T7 074 086 112 086 082 096 T8 090 088 086 068 096 106 T9 098 079 109 111 085 068 T10 089 086 108 069 087 061 T11 101 090 089 075 072 073 T12 086 114 101 076 097 115 T13 109 077 099 068 067 085 T14 080 114 107 115 090 101 T1 138 128 105 094 111 075 Not T2 101 091 102 055 110 082 Trained T3 100 093 101 082 082 095 in T4 064 089 115 085 070 082 Estimation T5 120 084 081 110 144 106 Skills T6 129 091 089 051 103 095 T7 121 108 053 077 096 093 T8 109 094 114 099 077 092 T9 088 092 098 095 081 090 T10 107 078 101 093 077 051 T11 086 068 094 106 062 055 T12 117 096 124 093 084 118 T13 080 108 123 091 113 090 T14 127 O88 094 084 105 089 8values in table have been multiplied by 103. DA1 = Easy Item; DA2 = Medium Items; DA3 = Hard Items. 185 Table F8.--Mean of Accuracy on Different Cognitive Levelsa. (Pointosiserial) Form Treatment Teacher A 8 CM1 CM2 CM3 CM1 CM2 CN3 T1 089 097 094 095 074 079 Trained T2 110 102 091 069 097 091 _ T3 100 098 082 098 091 078 Esti:ation T4 096 106 110 077 105 088 Skills TS 116 118 108 095 079 084 T6 100 091 095 076 102 067 T7 084 095 094 092 092 074 T8 098 093 063 077 090 107 T9 104 096 072 092 096 072 T10 092 090 102 067 078 066 T11 101 093 079 072 066 091 T12 097 095 125 099 086 107 T13 108 095 061 081 080 050 T14 105 110 086 103 102 105 T1 136 119 113 087 095 094 Not T2 098 096 098 075 079 085 Trained T3 103 095 091 087 091 078 in T4 095 O95 070 074 089 071 Estimation T5 096 093 088 123 111 122 Skill T6 119 090 089 076 070 108 T7 109 086 086 101 083 077 T8 100 109 103 091 088 092 T9 095 097 083 083 095 088 T10 104 088 084 096 067 053 T11 082 076 085 100 069 051 T12 125 106 094 114 095 084 T13 105 099 113 091 111 078 T14 113 105 073 075 104 092 8Values in table have been multiplied by 103. CN1 = Knowledge; CNZ = Comprehension; CM3 = Application. 186 Table F9.--Mean of Accuracy on Different Discrimination Levelsa. (Point-Biserial) Form Treatment Teacher A 8 DL1 0L2 DL3 DL1 DL2 DL3 T1 100 104 079 085 078 085 Trained T2 104 094 108 082 099 072 . T3 097 089 099 101 097 068 Esti2ation T4 097 107 104 098 101 068 Skills T5 130 101 115 073 088 099 T6 108 083 096 077 078 105 T7 100 094 080 102 072 093 T8 106 065 094 072 095 103 T9 089 096 095 113 097 049 T10 100 074 105 064 062 093 T11 106 071 101 064 067 095 T12 102 072 129 096 088 104 T13 103 080 094 094 062 064 T14 100 092 114 116 114 072 T1 155 097 123 104 093 074 Not T2 097 081 112 079 071 090 Trained T3 095 108 090 100 076 084 in T4 066 090 108 083 087 065 Estimation T5 137 079 069 100 125 131 Skills T6 106 105 092 069 081 095 T7 115 073 098 087 074 108 T8 113 103 098 109 097 057 T9 088 072 114 093 093 080 T10 113 074 094 089 063 070 T11 097 074 073 112 069 040 T12 130 091 111 099 095 106 T13 095 104 113 095 097 097 T14 115 097 092 084 090 103 aValues in table have been multiplied by 103. DL1 = Low Discrimination; 0L2 8 Medium Discrimination; DL3 = High Discrimination. 187 Table F10.--Mean of Accuracy on Different Item-Type'. 
(Point-Biserial) Form Treatment Teacher A 8 SA MA SA MA T1 094 092 074 095 Trained T2 106 096 085 086 _ T3 098 090 088 095 Esti2ation T4 107 095 092 089 Skills T5 118 111 085 087 T6 097 092 086 083 T7 107 064 096 075 T8 091 083 083 099 T9 100 082 094 082 T10 082 112 075 065 T11 095 089 077 068 T12 096 113 096 094 T13 094 089 075 072 T14 102 104 102 105 T1 127 ‘ 120 087 101 Not T2 090 110 083 073 Trained T3 086 115 086 089 in T4 090 089 078 082 Estimation T5 085 106 116 120 Skills T6 101 101 075 090 T7 096 092 085 093 T8 109 096 080 108 T9 097 087 092 086 T10 088 102 070 080 T11 078 084 081 067 T12 122 090 097 104 T13 110 096 093 102 T14 112 082 096 084 aValues in table have been multiplied by 103. SA = Single-answer type; MA = Multiple-answer type. APPENDIX G ANOVA TABLES Table Gl.--Repeated Measures (Form 8 Content) By Treatment. (P-Value Estimation) Effect Between Subjects Treatment Subj w. Treat. Within Subjects Form Treat. x Form Form x Subj w. Treat. Content Treat. x Content Content x Subj w.Treat. Form x Content Treat. x Form x Content Form x Content x Subj w. Treat. df 28 84 Mean Sq. F 16.68 Sig. of F .000 O 013 .649 .000 .309 .000 .207 188 189 Table GZ.—-Repeated Measures (Form 8 Difficulty Level) By Treatment. (P-Value Estimation) Effect Between Subjects Treatment Subj w. Treat. Within Subjects Form Treat. x Form Form x Subj w. Treat. Difficulty Level Treat. x Diff. Level Diff. x Subj w. Treat. Form x Diff. Level x Form x Diff. Treat. Form x Diff. x Subj w. Treat. df 28 56 Mean Sq. F 20.03 Sig. of F .000 .022 .29 .000 .037 .000 .057 190 Table G3.--Repeated Measures (Form 8 Discrimination) By Treatment. (P-Value Estimation) Effect Between Subjeets Treatment Subj w. Treat. Within Subjects Form Treat. x Form Form x Subj w. Treat. Discrimination x Discrim. Treat. Discrim. x Subj w. Treat. Form x Discrim. Treat. x Form x Discrim. Form x Discrim. x Subj w. Treat. df 28 56 Mean Sq. F 19.49 14.41 .68 Sig. of F .000 .000 .416 .000 .169 .004 .419 191 Table G4.--Repeated Measures (Form 8 Cognitive Level) By Treatment. (P-Value Estimation) Effect Between Subjects Treatment Subj w. Treat. Within Subjects Form Treat. x Form Form x Subj w. Treat. Cognitive Level Treat. x Cogn. Level Cogn. Level x Subj w. Treat. Form x Cogn. Level Treat. x Form x Cogn. Level Form x Cogn. Level x Subj w. Treat. df 28 56 Mean Sq. 43.27 F 17.84 28.16 .71 58.25 .29 42.46 .51 Sig. of F .000 .000 .408 .000 .749 .000 .604 192 Table GS--Repeated Measures (Form 8 Item Type) By Treatment (P-Value Estimation) Effect Between Subjects Treatment Subj w. Treat. Within Subjects Form Treat. x Form Form x Subj w. Treat. Item Type Treat. x Item Type Item Type x Subj w. Treat. Form x Item Type Treat. x Form x Item Type Form x Item Type x Subj w. Treat. df 28 28 Mean Sq. 21.34 .97 .72 27.46 .02 .75 .87 F 21.94 12.98 36.48 .02 Sig. of F .000 .001 .274 O 000 .884 .000 .171 Table G6.--Repeated Measures (Form 8 Content) By Treatment. 193 (Point-Biserial Estimation) Effect Between Subjects Treatment Subj w. Treat. Within Subjects Form Treat. x Form Form x Subj w. Treat. Content Treat. x Cont. Cont. x Subj w. Treat. Form x Cont. Treat. x Form x Cont. Form x Cont. x Subj w. Treat. df 26 78 Mean Sq. F .28 25.31 .00 26.40 12.07 .95 Sig. of F .603 C 000 1.000 .000 .304 .000 .422 194 Table G7.--Repeated Measures (Form 8 Difficulty Level) By Treatment. (Point-Biserial Estimation) Effect df Mean Sq. F Sig. of F Between Subjects Treatment 1 3.09 .88 .357 Subj w. Treat. 26 3.51 Within Subjects Form 1 53.04 28.16 .000 Treat. 
BIBLIOGRAPHY

Angoff, W. H. (1971). Scales, norms, and equivalent scores. In R. L. Thorndike (Ed.), Educational measurement (2nd ed.). Washington, DC: American Council on Education.

Beach, B. H. (1975). Expert judgment about uncertainty: Bayesian decision making in realistic settings. Organizational Behavior and Human Performance, 14, 10-59.

Bejar, I. I. (1981). Subject matter experts' assessment of item statistics (Report No. RR-81-47). Princeton, NJ: Educational Testing Service.

Berk, R. A. (1986). A consumer's guide to setting performance standards on criterion-referenced tests. Review of Educational Research, 56(1), 137-172.

Bernknopf, S. (1979, April). A defensible model for determining a minimal cut-off score for criterion-referenced tests. Paper presented at the Annual Meeting of the National Council on Measurement in Education, San Francisco, CA.

Blumberg, P., Alschuler, M. D., & Rezmovic, V. (1982). Should taxonomic levels be considered in developing examinations? Educational and Psychological Measurement, 42, 1-7.

Billings, R. S., & Schaalman, M. L. (1980). Administrators' estimations of the probability of outcomes of school desegregation: A field test of the availability heuristic. Organizational Behavior and Human Performance, 26, 97-114.

Binning, J. F., & Fernandez, G. (1986, August). Heuristic processes in ratings of leader behavior: Assessing item-induced availability biases. Paper presented at the Annual Convention of the American Psychological Association, Washington, DC.

Box, G. E. P., & Tiao, G. C. (1973). Bayesian inference in statistical analysis. Reading, MA: Addison-Wesley.

Campbell, A. C. (1961). Some determinants of the difficulty of non-verbal classification items. Educational and Psychological Measurement, 21, 899-913.

Campbell, D. T., & Stanley, J. C. (1963). Experimental and quasi-experimental designs for research on teaching. In N. L. Gage (Ed.), Handbook of research on teaching. Chicago: Rand McNally.

Chase, C. I. (1964). Relative length of option and response set in multiple choice items. Educational and Psychological Measurement, 24, 861-866.
Chase, C. I. (1974). Measurement for educational evaluation. Reading, MA: Addison-Wesley Publishing Company.

Crawford, W. R. (1968). Item difficulty as related to the complexity of intellectual processes. Journal of Educational Measurement, 5(2), 103-107.

Dudycha, A. L., & Carpenter, J. B. (1973). Effects of item format on item discrimination and difficulty. Journal of Applied Psychology, 58(1), 116-121.

Dunn, T. F., & Goldstein, L. G. (1959). Test difficulty, validity, and reliability as functions of selected multiple-choice item construction principles. Educational and Psychological Measurement, 19, 171-179.

Ebel, R. L. (1979). Essentials of educational measurement (3rd ed.). Englewood Cliffs, NJ: Prentice-Hall, Inc.

Fitzpatrick, A. R. (1984, April). Social influence in standard setting: The effect of group interaction on individuals' judgments. Paper presented at the annual meeting of the American Educational Research Association, New Orleans, LA.

Forsyth, R. A., & Spratt, K. F. (1980). Measuring problem solving ability in mathematics with multiple-choice items: The effect of item format on selected item and test characteristics. Journal of Educational Measurement, 17(1), 31-43.

Glass, G. V., & Hopkins, K. D. (1984). Statistical methods in education and psychology (2nd ed.). Englewood Cliffs, NJ: Prentice-Hall, Inc.

Goodman, B. C. (1972). Action selection and likelihood ratio estimation by individuals and groups. Organizational Behavior and Human Performance, 7, 121-141.

Green, K. E. (1983). Subjective judgment of multiple-choice item characteristics. Educational and Psychological Measurement, 43, 563-570.

Green, K. E. (1984). Effects of item characteristics on multiple-choice item difficulty. Educational and Psychological Measurement, 44, 551-561.

Guilford, J. P. (1954). Psychometric methods (2nd ed.). New York: McGraw-Hill Book Co.

Hackman, J. D. (1982, May). v ' f ' ' ut'onal esea c e 5° ' co 't ve h es ch. Paper presented at the annual forum of the Association for Institutional Research, Denver, CO.

Hughes, H. H., & Trimble, W. E. (1965). The use of complex alternatives in multiple-choice items. Educational and Psychological Measurement, 25(1), 117-126.

Iversen, G. R. (1984). Bayesian statistical inference. Beverly Hills, CA: Sage Publications.

Kahneman, D., & Tversky, A. (1973). On the psychology of prediction. Psychological Review, 80, 237-251.

Kirk, R. E. (1982). Experimental design (2nd ed.). Monterey, CA: Brooks/Cole Publishing Company.

Leary, L. F., & Dorans, N. J. (1985). Implications for altering the context in which test items appear: A historical perspective on an immediate concern. Review of Educational Research, 55(3), 387-413.

Levi, A. S., & Pryor, J. B. (1985, August). Mediators of the availability heuristic in probability estimates of future events. Paper presented at the Annual Convention of the American Psychological Association, Los Angeles, CA.

Lorge, I., & Diamond, L. K. (1954a). The value of information to good and poor judges of item difficulty. Educational and Psychological Measurement, 14, 29-33.

Lorge, I., & Diamond, L. K. (1954b). The prediction of absolute item difficulty by ranking and estimating techniques. Educational and Psychological Measurement, 14, 365-372.

Lorge, I., & Kruglov, L. (1952). A suggested technique for the improvement of difficulty prediction of test items. Educational and Psychological Measurement, 12, 554-561.
Lorge, I., & Kruglov, L. (1953). The improvement of estimates of test difficulty. Educational and Psychological Measurement, 13, 34-46.

Malpas, A. J., & Brown, M. (1974). Cognitive demand and difficulty of GCE O-level mathematics pretest items. British Journal of Educational Psychology, 44, 155-161.

Marascuilo, L. A., & McSweeney, M. (1977). Nonparametric and distribution-free methods for the social sciences. Monterey, CA: Brooks/Cole Publishing Company.

Mehrens, W. A., & Lehmann, I. J. (1984). Measurement and evaluation in education and psychology. New York: Holt, Rinehart and Winston, Inc.

Mehrens, W. A., & Lehmann, I. J. (1987). Using standardized tests in education (4th ed.). New York: Longman.

Melican, G., & Thomas, N. (1984, April). Identification of items that are hard to rate accurately using Angoff's standard setting method. Paper presented at the Annual Meeting of the American Educational Research Association, New Orleans, LA.

Millman, J. (1978). Determinants of item difficulty: A preliminary investigation (CSE Report No. 114). Los Angeles, CA: Center for the Study of Evaluation, UCLA Graduate School of Education.

Mitchell, K. J. (1983). Cognitive processing determinants of item difficulty on the verbal subtests of the Armed Services Vocational Aptitude Battery. Arlington, VA: Army Research Institute for the Behavioral and Social Sciences.

Nedelsky, L. (1954). Absolute grading standards for objective tests. Educational and Psychological Measurement, 14, 3-19.

Nisbett, R. E., & Borgida, E. (1975). Attribution and the psychology of prediction. Journal of Personality and Social Psychology, 32, 932-943.

Nitko, A. J. (1983). Educational tests and measurement: An introduction. New York: Harcourt Brace Jovanovich.

Norusis, M. J. (1988). SPSS/PC+ Advanced Statistics V2.0. Chicago, IL: SPSS Inc.

Plake, B. S., & Huntley, R. M. (1984). Can relevant grammatical cues result in invalid test items? Educational and Psychological Measurement, 44, 687-696.

Pollitt, A., Entwistle, N., Hutchinson, C., & De Luca, C. (1985). What makes exam questions difficult? Edinburgh: Scottish Academic Press.

Quereshi, M. Y., & Fisher, T. L. (1977). Logical versus empirical estimates of item difficulty. Educational and Psychological Measurement, 37, 91-100.

Report on the second national seminar on test management system. (1982). Kuala Lumpur: The Examinations Syndicate, Ministry of Education, Malaysia.

Ryan, J. J. (1968). Teacher judgments of test item properties. Journal of Educational Measurement, 5(4), 301-306.

Scheuneman, J. D., & Steinhaus, K. S. (1987). A theoretical framework for the study of item difficulty and discrimination (Report No. RR-87-44). Princeton, NJ: Educational Testing Service.

Simpson, D. E., & Cohen, E. B. (1985). Problem solving questions for multiple-choice tests: A method for analyzing the cognitive demands of items. Paper presented at the annual meeting of the American Educational Research Association, Chicago, IL.

Strang, H. R. (1977). The effects of technical and unfamiliar options on guessing on multiple-choice test items. Journal of Educational Measurement, 14, 253-259.

Tinkelman, S. (1947). Difficulty prediction of test items. Teachers College Contributions to Education, No. 941. New York: Bureau of Publications, Teachers College, Columbia University.

Tollefson, N., & Chen, J. S. (1986). A comparison of item difficulty and item discrimination of multiple-choice items using "none of the above" options. Paper presented at the Annual Meeting of the Midwest Educational Research Association, Chicago, IL.

Tversky, A., & Kahneman, D. (1974). Judgment under uncertainty: Heuristics and biases. Science, 185, 1124-1131.
Whitely, S. E. (1980). Modeling aptitude test validity from cognitive components. Journal of Educational Psychology, 72, 750-769.

Whitney, D. R., & Board, C. (1972). The effect of selected poor item writing practices on test difficulty, reliability and validity. Journal of Educational Measurement, 9(3), 225-233.

Willoughby, T. L. (1980). Reliability and validity of a priori estimates of item characteristics for an examination of health science information. Educational and Psychological Measurement, 40, 1141-1145.

Winkler, R. L. (1968). The consensus of subjective probability distributions. Management Science, 15, B61-B75.

Winkler, R. L. (1971). Probabilistic prediction: Some experimental results. Journal of the American Statistical Association, 66, 675-685.