This is to certify that the

thesis entitled
The Utilization of Antecedent Data in
Conjunction with Test Results for

Curricular Decision Making

presented by

Bernhard Darwin Kaufman, Jr.

has been accepted towards fulﬁllment
of the requirements for

_Ph..D..__ degree in Wt and
Evaluation

4:44am [4454” m

Major professor

Date Februarx 13: 1980

THE UTILIZATION OF ANTECEDENT DATA IN
CONJUNCTION WITH TEST RESULTS FOR CURRICULAR
DECISION MAKING

BY

Bernhard Darwin Kaufman, Jr.

A DISSERTATION

Submitted to
Michigan State University
in partial fulfillment of the requirements
for the degree of

DOCTOR OF PHILOSOPHY

Department of Counseling, Personnel Services
and Educational Psychology

1980

ABSTRACT
THE UTILIZATION OF ANTECEDENT DATA IN

CONJUNCTION WITH TEST RESULTS FOR CURRICULAR
DECISION MAKING

BY

Bernhard Darwin Kaufman, Jr.

Decisions about mastery of an achievement domain
are frequently made on the basis of a small sample
of items. Because of the small number of items the
possibility of incorrect decisions is high. One way
of improving these decisions is to utilize additional
information in consort with the test information.

This study sought to determine the efficacy
of incorporating non-test information into test based
decision models. These models were compared, based
on classification accuracy. The non-test information
variables of the study were instructional time history,.
instructional testing history, mathematics achievement
and sex. The history variables were captured from
files maintained on students in a computer managed
instructional program. The standard by which the models
were compared was mastery classification based on a
156 item test concerning a unit on multiplication and

division. This variable also served as the dependent

Bernhard Darwin Kafuman, Jr.

variable in model development.

There were three phases of analysis in the research.
The first used stepwise regression to discover the
relationships which existed among the non-test informa-
tion variables, a set of subtests drawn from the 156
item test and the results of the 156 item test itself.
Also, during this phase, the incremental validity of
subtests was determined as well as the functional length
of subtests combined with instructional time and mathe-
matics achievement.

During phase II least squares and Bayesian models
were develoPed for the purpose of making decisions
about mastery of the domain. The least squares model
contained mathematics achievement and instructional
time as non-test information. In order to apply the
Bayesian model, a parameter indicating the value of
prior information needed to be set. The coefficient
which resulted in the best decision precision established
the value of prior information at 2.75 test items.

The final phase compared the Bayesian and least
squares decision approaches with the raw score or pro-
portion correct approach for making mastery classifications .
Mastery levels of .70, .75, .80, .85, and .90 were
examined. None of the approaches stood out as being

more effective. Comparison of classification based

Bernard Darwin Kaufman, Jr.

the least squares models, containing the non-test infor-
mation variables with and without a six item subset
of the domain, indicated that adding the test information
did not improve classification accuracy.

Four conclusions were reached as a result of
the analysis. First, a six item test does not improve
mastery classification beyond what was possible with
pre-existing information. Second, learning rate repre-
sents information which is independent of mathematics
achievement. Third, neither least squares or Bayesian
approaches improve decision precision over that obtained
uSing raw scores. Finally, decision precision is im-
proved when twelve items are used rather than six.

It was recommended that teachers develop ways
of using pre-existing information as they monitor pupils.
Having measures of achievement and learning rate, one
may need only to keep track of on task behavior. Pupils
behaviors suggesting frustration can be taken to indicate
a need for diagnosis. At such a point, a test of suffi-
cient length to yield accurate decisions can be adminia'
Stered. In sum, if pupils are initially well placed
in the curriculum and instructional methods and materials
are carefully selected, testing can be restricted to
points where diagnosis is indicated by off task behavior

reflecting frustration whose cause the teacher cannot

Bernhard Darwin Kaufman, Jr.

easily identify.

LIST OF
LIST OF

LIST OF

Chapter

I.

II.

III.

TABLE OF CONTENTS

TAB LES O O O O O O O O I O O O 0
F1 GURE S O O O O O O O O O O O O

SYMBOLS O O C O O O O O O 0

THE PROBLEM . . . . . . . . . .

Problem . . . . . . . . . . . .
Solution . . . . . . . . . . .
Need for the Study . . . . . .
Purpose of the Study . . . . .
Definition of terms . . . . . .

REVIEW OF LITERATURE . . . . . .

Definitions . . . . . . . . . .
Estimating Domain Scores . . .
Proportion correct . . . . .
Classical Model II . . . . .
Bayesian Model II . . . . . .
Binomial Model. . . . . . . .
Criterion Referenced Decisions
Validity . . . . . . . . . . .
Domain test length . . . . . .
Summary . . . . . . . . . . . .

DESIGN AND PROCEDURES . . . . . .

Population . . . . . . . . . .
Sample . . . . . . . . . . . .
Variables . . . . . . . . . . .
Methodology . . . . . . . . . .
Phase I . . . . . . . . . . .
Phase II . . . . . . . . . .
Phase III . . . . . . . . . .

ii

iv
vi

vii

Page

ubWNNl-d

10
ll
12
16
23
24
27
30
33

37

37
37
38
42
42
47
49

IV.

Variables
Phase I
Phase II
Phase III

V. INTERPRETATION, CONCLUSIONS AND
RECOMMENDATIONS .

LIST OF REFERENCES

LIST OF NOTES

FINDINGS

iii

53

53
56
66
68

78

87

91

10.

11.

12.

13.

14.

LIST OF TABLES

Bayesian and classical variance components .
Contrasts for the model factor . . . . . . .

Descriptive statistics for all variables
used in the study. . . . . . . . . . . . . .

Descriptive statistics for DOMAIN, SUBTEST(J)
and items comprising the six objectives . .

Intercorrelation of information variables
and domain achievement . . . . . . . . . . .

Stepwise regression statistics for all
permutations of the information variables
with DOMAIN. . . . . . . . . . . . . . . . .
Partial correlations and coefficients of
alienation for the information variables
with DOMAIN. . . . . . . . . ... . . . . . .

Regression statistics for relating TIME
and STEP to DOMIN O O O C O O O O O O O O 0

Statistics for incremental validity
anal-YSiS O I O O O O O O O O O O O O O O I I

Coefficients of correlation and determina-
tion for SUBTEST(J) with DOMAIN. . . . . . .

Regression statistics for TIME and
STEP With DOMAIN O O O O I I O O O O O O O 0

Regression statistics for TIME, STEP
and SUBTEST(6) with DOMAIN . . . . . . . . .

p* values for three values of t for
subtests of length 6, 12 and 18. . . . . . .

Number and percent of correct classification

iv

Page

20

52

54

55

57

59

61

62

62

64

67

67

67

74

15.

16.

17.

Analysis of variance statistics . . . . . . . 74

Means and variances for levels of model
and mastery level . . . . . . . . . . . . . . 75

Scheffe' contrast statistics for the
model factor . . . . . . . . . . . . . . . . 77

LIST OF FIGURES

Reduction of uncertainty for combina-
tions of 3 information variables . . .

The relationship of R2 and subtest length.

The relationship of t to correct
classification for themastery level of

.70 . . . . . . . . . . . . . . . . .

The relationship of t to correct
classification for themastery level of

.75 . . . . . . . . . . . . . . . . .

The relationship of t to correct
classification for themastery level of

.80 . . . . . . . . . . . . . . . . .

The relationship of t to correct
classification for the masterylevel of

.85 O O O O I O O O O O O O O O O O O

The relationship of t to correct
classification for theinasterylevel of

.90 . . . . . . . . . . . . . . . . .

vi

Page

58

65

69

70

71

72

73

KEY TO SYMBOLS AND NOMENCLATURE

An individuals domain proportion correct

Mastery level proportion

An estimate of an individual's domain proportion correct
An individuals mastery/non-mastery classification
An individuals true score

An estimate of an individuals true score

The reliability of a test

An individuals raw score

Mean raw score for a group

Classical true variance

Classical error variance

Mean classical true score for a group

Classical observed variance

Arcsin transformation of Ni

Tukey-Freeman arcsin transformation of Xi

Number of items in a test

Mean of a group of gi's

Classical estimate of true variance where scores
have been subjected to Tukey-Freeman transformation

Classical estimates of observed variance

Classical estimate of error variance where scores
have been subjected to Tukey-Freeman transformation

vii

Classical estimate of arcsin transformation of "i

Inverse chi square
Scale parameter of inverse chi square distribution

Degrees of freedom for inverse chi square
distribution

Mean of the inverse chi square distribution

Test information parameter
Bayesian estimate of arcsin transformation of "i
Mean of arcsin transformed ni's

Number of individuals

Bayesian estimate of the mean of arcsin trans-
formed ni's

Bayesian estimate of true variance where scores
have been subjected to Tukey-Freeman transformation

Bayesian estimate of error variance where scores
have been subjected to Tukey-Freeman transformation

Bayesian estimate of observed variance where
scores have been subjected to Tukey-Freeman
transformation

Bayesian marginal mean estimate of arcsin trans-
formation of “i

Bayesian marginal mean estimate of pr0portion of
true to observed variance

Arcsin transformation of "o

Mastery

Non-mastery

Loss associated with false positive
Loss associated with false negative

viii

ﬂllﬂz:

X1,X

2

TEST:

TIME:

STEP:

SUBTEST(J):

DOMAIN:

SIX:

TWELVE:

YHAT:

YHATP:

BAYES6:

BAYESlZ:

x

N ~6>O

Expected loss
Mean of posterior marginal
Variance of posterior marginal

Proportion boundaries of the indifference
region

Raw score boundaries of the indifference
region

Instructional testing history
Instructional time history

Sequential test of Education Progress
Mathematics Concepts Test

Domain item samples of length J
J = 6, 12, . . .60

Score on 156 item division and multipli-
cation test

Classification based on SUBTEST(6)
Classification based on SUBTEST(12)

Classification based on least square
estimate containing TIME and STEP

Classification based on least square
estimate containing TIME, STEP and
SUBTEST(6)

Classification based on Bayesian marginal
mean estimate containing SUBTEST(6)

Classification based on Bayesian marginal
mean estimate containing SUBTEST(IZ)

Mean of classifications based on C
Scheffe' contrast

Variance of a Scheffe' contrast

ix

CHAPTER I

THE PROBLEM

Problem

Individualized instruction requires frequent
decisions about each person passing through the curri-
culum. The basis for these decisions is often an estimate
of a domain score based on a small sample of items
from the domain. Because of the small sample the possi-
bility of incorrect decisions is great. Millman (1973)
has shown that with a mastery level of eighty percent,
more than a third of those students whose actual domain
achievement is sixty percent will get four of five
items correct and thus be misclassified as having mastered
the objective or unit.

The test data available in such decision situations
is not the only existant information which is pertinent
to the decisions. In fact, there is usually information
present prior to testing. Cronbach and Gleser (1965)
have challenged testers to show that the application
of their instruments result in an improvement in the
quality of decisions. To use Sechrist's (1963) term,
testers should demonstrate the incremental validity
of the tests they employ. No such investigation has

1

been done with domain referenced tests. Thus, it is
not known that the estimates of domain scores based
on small item samples yield new information for decision

making.

Solution

One way of improving the quality of decisions
made with the aid of domain test estimates is to utilize
additional information in conjunction with the estimate.
Such information, once identified may be joined with
test information in a mathematical model which should
yield improved domain estimates.

There are two statistical approaches to modeling;
Bayesian and least square regression. Both of these
will likely yield an improved estimate. No research
has been done in an applied setting with the domain
score known. Therefore there is no empirical basis

for recommending one procedure over the other.

Need for the study

Domain tests are being widely used for decision
making. It is conceivable that decisions based on
short tests alone may be worse than those made knowing
only historical information. While it may not be feasi-

ble to eliminate tests from an instructional sequence,

educators should be alerted to the fact that their

results alone are not a sound basis for decisions.
If test data do not provide information, decision makers
should be so aware.

Further, if the solutions proposed are sound,
this should demonstrate in an applied setting. Then
guidance in the application of the procedures should

be made available to practitioners.

Purpose of the study

There are two components to the research reported
herein. One has to do with the investigation of the
information value of several variables, including test
data, with respect to results on a domain test. Once
these various information relationships were illuminated,
two models were compared to each other and a raw score
for their efficacy as a bases for criterion referenced
decisions. Objectives 1 and 2 below form the first
component. Objective 3 the second.

Specifically stated the objectives were:

1. To determine the information existant in
four antecedent and collateral variables
relative to domain achievement.

2. To couple information with test results
in order to determine:

a. the incremental validity of short
domain tests,

b. if decision precision can be improved
by using antecedent and collateral
data with test results,

c. the functional lengths of several
short domain tests.

3. to compare the Bayesian marginal mean model,
the least square regression model: andthe raw score
approach with respect to decision precision.

Definition of terms

 

Given below are definitions of several terms
which are used throughout this thesis.

Domain test
"Any test consisting of a random or stratified
random sample of items selected from a well
defined set or class of tasks."(Millman, 1974,
p. 315)

Criterion referenced testing
The use of a test to make decisions about a
criterion.

Information
Datum is information if and only
if it reduces the uncertainty involved in
making a decision.

Functional test length
The length of test necessary to provide informa-
tion equivalent to that provided by collateral,
antecedent and test information.

Incremental validity
The extent to which a multiple correlation

is raised by the addition of test results to

a set of prior existing information.

Domain achievement
The prOportion of items correct on a set of
items which comprehensively cover an objective
or set of objectives.

Decision precision
The proportion of correct classifications

made on the basis of a given decision algorithm.

CHAPTER II
REVIEW OF THE LITERATURE

Two excellent reviews have been prepared which
cover criterion referenced testing comprehensively.
These are Millman (1974) and Hambleton, Swaminathan,
Algina, and Coulson (1978). Because of the comprehen-
siveness of these monographs the present review draws
heavily on these two papers.

The topics to be covered in this review are:

1) definitions, 2) estimation of domain scores, 3)
criterion referenced decisions, 4) validity, and 5) test
length. Some of these tOpics are covered in greater
depth than others. The criteria for depth of coverage
was the topic's direct relevance to the research. For
example, the estimation of domain scores is the direct
focus of the study and thus the greatest amount of

space is devoted to this area.

Definitions

 

As Hambleton et a1. (1978) have observed there
is by no means a single accepted definition of a criterion
referenced test. Two quotations which are at opposite
poles of the generality continuum illustrate this.

6

The first is the most restrictive.
"A pure criterion referenced test is
one consisting of a sample of production
tasks drawn from a well defined population
of performances, a sample that may be used
to estimate the proportion of perfor-
mances in that population at which the
student can succeed." (Harris and Stewart,
1971), p. 1)
Ivens defined a criterion referenced test, in
most general terms, as one "comprised of items keyed
to behavioral objectives." (Ivens, 1972, p. 2) Clearly
one must have a referent which is more specifically
defined than is the case if both of these quotations
are allowed within the class of the concept "criterion
referenced test."

The purpose of this section of the review will
be to arrive at a term for and definition of the kind
'of test we are investigating in this research. To
do this we will allude to some terms and corresponding
referents which will help delimit our concept.

Hambleton et al. (1978) point out that criterion
refers to a minimal acceptable level of functioning.
This definition is consistent with Glaser and Nitko
(1971), Millman (1974L and Harris, et al. (1974).
So a criterion referenced test could be one which was

used to make a decision about this minimal acceptable

level of functioning. Herein lies the problem, when

one applies the accepted definition of criterion; cri-
terion referenced implies only that the test has some
relationship to a decision about level of functioning.
Looking at it from this point of view, Iven's definition
seems most appropriate. That is, a test comprised

of items keyed to behavioral objectives defined as

Mager (1962) does would be criterion referenced in

the sense that the results could be used to make a
decision about the minimal acceptable level of function-
ing.

Glaser and Nitko (1971), consistent with Harris
and Stewart (1971), speak of production standard in
their definition of criterion referenced but also,
as do Harris and Stewart, they use the words "well
defined population of performances." So, not only
should these tests measure a level of functioning,
that level should be generalizable to some larger domain
or population. What Harris and Stewart do not allude
to is criterion in the sense of minimal acceptable
level of functioning.

Hively, et al. (1968), Bormuth (1970) and Osburn
(1968) have specified algorithmic procedures for defining
a domain of test items. Popham (1975) describes what
he calls an amplified objective which specifies in

detail the testing situation, response alternatives

and a criterion of correctness, in effect, defining
the domain of items. Baker (1974) also provides pro-
cedures for carefully defining the item domain of an
objective. The direction of the work in this area
seems to underline the importance of the notion of
domain.

As one might suspect the importance of the domain
has motivated the term Domain Referenced Test. Millman
(1974) defines such tests as:

'"any test consisting of a random or

stratified random sample of items

selected from a well defined set or

class of tasks." (Millman, 1974, p. 315)
It should be noted that such a definition does not
refer to a criterion. The definition of a test can
be separated from the specification of a desired level
of functioning (as Harris and Stewart's (1971) definition
also illustratesl In fact, a single domain referenced
test can be used to make decisions about more than
one criterion. Admittedly, there is a connection between
the decision criterion to be addressed with the results
of a domain-referenced test and the definition of the
"set or class of tasks." However, in developing the
test items the emphasis is on content domain, the cri—
terion can be established separately.

Thus, it seems most appropriate to refer to domain

tests. In current practice such tests are most often

10

used to make decisions about a person's status relative
to a criterion. It is appropriate to say that
scores are domain referenced and decisions based
on the scores are criterion referenced.

The use of the term criterion-referenced testing
to describe general approaches whose overall aim is
to make decisions about a criterion is useful. Domain
or objective referenced tests are but tools which can

be employed in this pursuit.
Estimating Domain Scores

The basic problem is; given an individual's ob-
served score on a criterion referenced test, what is
his score on the domain, and further,.does this represent
mastery or non-mastery status (Hambleton and Novick,
1973). To use the symbols which seem to appear most
consistently in the literature (Swaminathan, Hambleton
and Algina, 1975; Hambleton and Novick, 1973; Novick,
Lewis and Jackson, 1973); if Xi (an individual's score
is known, what is “i (the domain score) and further
what is mi (wi=1 if mastery, wi=0 if non-mastery).

So the problem is to obtain ﬂi (an estimate of ﬂi)
and mi (an estimate of mi).
There are five distinguishable procedures which

have been described in the literature for solving this

11

problem. These are: .1) proportion correct, 2) classical
model II, 3) Bayesian model II, 4) Bayesian marginal
mean, and S) the Binomial (Note 1). The first four
of these differ from the fifth in that they provide
a single direct ﬁi. The binomial procedure yields
information about the probability that "i is greater
than some given mastery level no.

The remainder of this section will provide dis—

cussionwof each of these five procedures.

Proportion Correct

The estimate of the proportion correct is the
ratio of correct items to the length of the test. This
value can also be thought of as the raw score multiplied
by a constant which is the inverse of the number of
items. For a small number of items this estimate yields
tenuous results. Millman (1974) has shown that for
a mastery level of 80 percent, more than a third of
those who could achieve only 60 percent of the domain
of items will get at least four of five items correct
and thus the decision of mastery will be in error.
Hambleton,et al. (Note 1) observed that "procedures
which take other information into account are more

desirable."

12

Classical Model II

The Classical Model II and Bayesian Model II
allow for the inclusion of other information into the
decision making process . The classical model includes the
mean of the group in which the individual is a member.
This is collateral information. The Bayesian Model
II considers in addition to the group mean, an investiga-
tor's subjective feeling regarding the prior status
of the group. The remainder of this section discusses
the classical model II in detail.

Jackson (1972) observed that Truman Kelley's
(1927) estimate of true score effectively joined test
results with the collateral data of group mean. Lord
and Novick (1968) state Kelley's formula for the estimate
of true score (T) as

m =

oxxu X + (l-pxx.) ux (1)

Where pxx' is the reliability, X test score and ux

the mean for the group. Thus test data is incorporated
through X and the collateral data by way of “x' Novick
and Jackson (1974) observe that

2 2
0T x + 0E “T

Q = (2)

 

Classical true score theory (Lord and Novick, 1968)

assumes that “T = “A” Thus expression (2) can be

13

rewritten in the form

2 2
GT X+OE ux

 

 

 

T= 2+ 2 (3)
OT GE
2 2 2
Further, true score theory assumes 0X = OT + CE
so that
0 2 o 2
"_ T E
T—02X+02ux (4)
X X

This expression makes clear the fact that Kelley estimates
are "...a weighted sum of two separate estimates, one
based upon the individual's observed score X and the
other based on the mean of the group to which he bee
longs..." (Lord and Novick, 1968, p. 65). It can further
be observed that when the test is highly reliable (i.e.,
0E2 is small) the test data is weighted heavily. If

the test is not highly reliable then the estimate is

more dependent on collateral data namely px.

In order to utilize Kelley's procedure in situa-
tions where binary decisions, such as mastery/non-
mastery, are to be made; Jackson modified the above
procedures. He applied the Tukey-Freeman arcsine trans-
formation to individual scores (Xi) and obtained the»
transformed estimate

1/2

g. = 1/2 [Sin-1(x1i/n+1)1/2 + Sin-1(Xi+l/n+lﬂ (5)

l

14

Under this transformation of Xi' the corresponding trans-
formed variable 7i for the proportion correct ( ni)

is given by the expression

. -1
Yi = Sin Vni (6)

If the number of test items is at least eight then
the distribution for 91 will be approximately normal
with the mean being the transformed value of the prOpor-

tion correct (yi) and the variance (4n»+2)-1 (Anscombe,

1). In classical notation

this can be written as gi ~N(T,0E2).

1948). That is gi ~N(Y (4n+2)’
I

The statement gi~'N(yi,(&1+2)-1) is about a fixed
person (i) under the hypothetical condition of a finite
number of repeated testings. If there is a single
testing of a finite number (N) of persons (i.e., i
= 1,2...N) then Jackson (1972) has shown that the mean

is given by

N
9~= Z gi/N (7)

and the variance (¢c) is

l

N 2 -1 -
$0 = l . 1 (91-9.) - (4n+2) 1(N-1) (8)
1:

This expression can be rewritten as

15

.31 (g.-g.)2 N _1
¢ = 1‘ 1 - (4n+2) (9)
C N-l N-l

 

 

to facilitate the determination of its connection with
true score theory. The first term of the expression
is the observed variance, the second term is error

variance. We can write

 

 

 

65c = ¢gc ¢EC (10)
. 2
and note that ¢c IS the analogue of 0T . Also ¢gc
. ' 2 2
IS analogous to OX and ¢EC to 0E .
Returning to (2), the Kelley formula for the
transformed variables becomes
$ 9. + ¢ g.
Yic = f 1 EC (11)
¢c + ¢EC
or A
A ¢ ¢E
Yic = C . + C (12)

This is clearly a weighted sum of the transformed test
scores and the mean of the scores, the mean's weight
being inversely related to the reliability of the test.
Once the transformed true prOportion correct
(Tic) is obtained, one can return to the original scale

by a sine transformation of §ic' namely

16

f. = (1+.5/n) Sin2 - .25/n (13)

l Yic
This value (ﬁi) is the estimated domain proportion
correct and is based not only on the proportion correct

of a subset of items from the domain but also on a

group's performance on the same subset of items.

Bayesian Model II

 

This model estimate uses test data (Xi), collateral
data (X), as well as prior information. This method
requires setting a prior distribution representing
an investigator's belief prior to testing and then
making revised estimates after testing. These revised
estimates are based on prior beliefs as well as an
individual's test results and the group mean. The
distribution which takes all three pieces of information
into account is called the posterior distribution.

The question of determining the correct prior
distribution has been the subject of considerable theo-
retical study by Novick and his colleagues (Novick
et al., 1973 and Swaminathan, et al., 1975). The current
status of these investigations suggest the following.

a) The specification of the mean is not particu-

larly important and may be represented by
a uniform distribution in which any score

is equally likely.

b)

17

The prior beliefs about variance can be ade-

quately represented by an inverse chi square

distribution with two parameters; scale and

degree of freedom.

i)

ii)

iii)

The degree of freedom parameter (0)

should be set at 8.

The scale parameter (A) can then be solved
for in an equation with a single unknown

namely, the variance. The equation is

A = (v-2)$bm (14)

The necessary estimate of the variance

(8b ) can be obtained as follows:

a) mSpecify the true prOportion correct
for the typical examinee in the sample.

b) "...Specify the number of test items,
t, that would have to be administered
to the examinee in order to obtain

as much information about "i as is

deemed to be available (Note 1, p. 31).

c) $b is then defined by the equation
m
A _ -1
- (4t+2) (15)
¢bm

d) The true proportion correct (Yib)

is then estimated by

18

1+z(vi-v.)2 1
gi[ N-V-l

1+2(yi-y.)2

 

] + mb‘4t‘t'2) (16)

-<>
(.1.
0‘

] + [4t+2]-1

 

N-v-l

and the mean of the proportion correct

 

(Y-b) by
A 2Y1
Y'b N (17)

e) Novick et al., (1973) observe that this

is equivalent to

 

x -1
§ = gi $12) + Y‘b (4t+2) (18)
lb $b + (4n+2)-1
A -l 2
where ¢b = (N+v—l) [A+z(yi-y.) ] (l9)

$b is the Bayesian true variance estimate
for Iib’ ¢Eb is the Bayesian error variance
estimate for §ib, and $gb is the Bayesian
observed variance estimate for §ib.

Using this notation (16) can be rewritten
as

$ ¢
__2__ + ¢Eb y.b (20)
¢gb gb

 

Yib =

As Novick et al., (1973) indicate, this estimate

has a form analogous to Kelley's true score estimation

l9

procedure. The differences between 91b as estimated
by equation (18) and f1 as estimated by equation (12)
result from the procedures used for determining the
several variance components and the use of y.b as the
true mean rather than g.. Table 1 allows a comparison
of the Bayesian and Classical variance estimates. Ex-
aminationcﬂfthe formuli in the table indicates that
prior information is incorporated into the estimate

of yb by the estimation procedure for 3b . A is deter-

mined by

1

1 = (v- 2)(4t + 2)‘ (21)

where t is the number of test items that would need
to be administered to the examinee to obtain as much
information about ”i as is deemed available prior to
testing. Further, because of the iterative nature
of the solution of equation (16), the y.b obtained
for the concluding iteration will have been influenced
by the value of t.

Thus differences in estimated values for y are
a function of differing amounts of regression due to
the variance estimates as well as a different "true"
mean on which the regressions occur. Theoretically,
the advantage of the Bayesian Model II procedure rests

on an improvement in the estimates of true variance,

20

 

 

 

 

 

 

m
a and u a nme + aw u n e om>mmmmo
NAoml.Ov N.
z
nim+ﬂac Huz u one Am+uas n nme mommm
H 2 HI
HIZ H I Z U H and Q
nim+aav z n a ( @ HNA.>).>V m +4_H)AH(>+zsu a meme
mi.m).msw z
OUCMHHm>
qaonmano zaHmmwmm
Hmpoz

mmumEHumm mocmﬁnm> HMUHmmMHO pom cmwmmmmm mo comwnmmﬁoo d

.H manna

21

observed variance and the true mean accomplished by

incorporating prior information through the parameter A.

Bayesian Marginal Mean
Lewis, Wang and Novick (1973) observe that if
one wishes to make overall decisions about all groups,
joint estimates such as those of Bayesian Model II
are apprOpriate. However, they note that for individua-
lized instruction, decisions about each individual
are usually.desired and therefore marginal estimates
are indicated.
Hambleton,et al., (Note 1) note that the Bayesian
Model II requires complicated iterative solutions.
Tables prepared by Wang (1973) allow relatively easy
computation of marginal estimates. The procedure demands
that the degree of freedom parameter be set (again
8, according to Novick, et al. (1973)) and ¢ib is
determined by specifying t in the manner descriged
above. With these values p* can be read from Wang's
table and the estimate of ?i is

bm

A = t -
Yibm g. + p (91 g.) (22)

which can then be transformed to “i by equation (13).
The marginal mean procedure is an extension of
the Bayesian Model II and as such effectively considers

the three types of data; test, collateral, and prior

22

beliefs. It should be understood that all of the Bayesian
estimates have been designed for use when one's knowledge
of prior status can at best be represented by subjective
belief about t. It is this subjective belief which

is quantified by the method described for establishing

the Bayesian true variance.

The parameter p* is an estimate of

A

c)
¢+¢E

a reliability indicator. Lewis, Wang and Novick (1973)
report that an empirical study of
33).
$b + $Eb

and p* indicate that "...p* is substantially larger

 

than p for moderate n." (p. 12) As the number of items
increase, the discrepancy between p and p* becomes
smaller (p. 13), and thus estimates of yb and ybm be-
come increasingly similar.

One might expect that if the Bayesian methods
do allow for a meaningful incorporation of prior infor-
mation into the computation of p, then these values
would be larger than for the corresponding classically
computed values. However, in at least one empirical
study this was not the case (see Novick et al., 0973,

pp. 39—41)). In this instance, the investigators

23

questioned the estimates of $i' It would seem that
dilemmas such as this are best addressed by studying

the quality of decisions made by various estimates.

Binomial Model

 

One may use the binomial model discussed by Mill-
man (1974) for making probability statements about
the true achievement status of an individual. In order
to do this three parameters are needed; minimum passing
score, number of items and the level of certainty required
for establishing mastery.

With these values specified, mastery/non-mastery
decisions can be made to a prescribed probability
level knowing only the actual score on a test. Tables
prepared by Millman (1972) make this model very simple
to apply.

As Millman (1974) has observed, all Bayesian
approaches yield a regressed estimate of domain scores.
That is, if an individual's obtained score is below
the group's mean, her estimated domain score will be
higher than her obtained score. Analogously, if her
obtained score is above the mean, her estimated score
will be lower. .These statements also hold for classi-
cal model II. Such statements do not hold for Millman's

binomial model.

24

Criterion Referenced Decisions

Only Hambleton, Novick and their colleagues (Ham-
bleton, et al., 1973; Hambleton, 1974; Swaminathan
et al., 1975; Hambleton, et al., 1978) seem to have
given attention to the problem of making decisions
based on domain estimates. It will be recalled from
the previous review of procedures for estimating domain
scores that once "i is obtained one must determine
the appropriate value of wi' In the binary classification;
if YiEEYothen mi = l or if yi<<yo then wi = 0. Both
Yi and ”i are true values and in practice must be esti-
mated. Hambleton et al., (1978) and Swaminathan et
al., (1975) have presented a method for ascertaining
P (mi = 1) on the basis of Bayesian posterior distribu-
tions.

Whenever a decision is made in the face of uncer-
tainty, there will be misclassification. In the case
of mastery/non-mastery classification there are two
decision actions available. We will call them a1
(mastery) and a2 (non-mastery). In this binary case
there are also two kinds of error to be made. If the

action is a and mi = 0, this is called a false positive

1
error. If the action is a2 and mi = 1, this is referred
to as a false negative. With each error type some

loss is incurred. In the testing setting these may

25

be unnecessary consumption of materials, teacher time,
or student affect such as boredom or frustration. The
loss associated with false positives can be symbolized

by L (mi = 0, a1) = 1 The false negative loss is

01'
given by L (mi = 1, a2) = 110. The aim of decision
classification is to minimize the expected loss (EL)
associated with the action. Thus EL ( w, a) is to

be minimized. The two loss functions are:

EL (w) a1) = 110 P (yi<:yo) and (23)
EL (0), 62) = 101 P (YiZYo)' (24)
The decision rules are:
a1 if 101 P (YiZYo) < 110 P (yi<yo) (25)
a2 if 101 P Wig-Yo) > 110 P (yi<y0) (26)
if 101 P (yii'yo) = 110 P (yi<yo) (27)

one is equally well off with either decision.

No one appears to have tackled the problem of
estimating l for the two forms of misclassification
possible in mastery/non-mastery decisions. Hambleton
(1974) has speculated on the matter. He feels that
false positive error is more serious than false negative
error since a student will have a second chance in

most systems. Further, if the subject matter is

26

hierarchical a false positive will likely be frustrated
by attempts to achieve future objectives.

Hambleton et al., (1978) describe procedures
for determining the probability of mastery given a
domain score for each of the Bayesian models reviewed
earlier. First one must calculate the mean ( pi) and
variance (012) of the posterior marginal distribution
(i.e., posterior for each score) by using formulas
given in Note 1 . Then a z score is calculated
for each individual by z = Yo - ui/oi. This result
can then be used with any table of normal deviates
to find the probability that an individual's "i is
above the matery level (no).

The final step is to combine loss values with
probabilities of mastery (P (yi 3 yo)) and non-mastery

(P (Yi < 70)). By comparing

EL (w, al) 101 P (Yi 3 Yo) with

EL (w, a2) 110 P (yi < yo).

and taking the action corresponding to the smaller
of the two one makes the decision with the smallest
expected loss.
All of this work is theoretical. No reports
of attempts to find actual values for 1 have been pu-

blished. Such investigations are necessary. In the

27

absence of supported rationale to the contrary, the

practice of setting 101 = 110 = 1 seems most sensible.

Validity

Cronbach (1971) states:

"The phrase validation of a test

is a source of much misunderstanding.
One validates, not a test, but an
interpretation of data arising from a
specified procedure." (p. 447)

 

The same authority also points out that there are "...
two uses of tests; (a) for making decisions about peOple
tested and (b) for describing these peOple (p. 445).
In the criterion-referenced testing situation where
domain tests are employed, it seems that these two
uses suggest three validity questions for domain tests
utilized for criterion referenced decisions. (1) Is
the test content valid? (2) Is the test domain valid?
(3) Is the test criterion valid? Hambleton et al.,
(Note 1) argue that the question of content validity
is inextricably twined with domain specification and
thus the validity of the content is a function of the
adequacy of these specifications. Two procedures for
systematically specifying the content domain are item
form and amplified objectives (Millman, 1974).

"An item form has the following charac-

teristics: 1) it generates items with a
fixed syntactical structure; 2) it contains

28

one or more variable elements; and 3) it

defines a class of item sentences by

specifying the replacement sets for the

variable elements" (Osborn, 1968, p. 97)
Such procedures seem best suited for mathematical and
scientific content areas. The second procedure which
Hambleton et al., (Note 1) believe:

"provides an excellent balance between

the clarity achieved with item genera-

tion schemes and the practicality

of behavioral objectives" (p. 15)
are what Popham (1975) calls amplified objectives.
These are

"...expanded statements of an educa-

tional goal which provides boundary

specifications regarding testing situa-

tions, response alternatives, and criteria

of correctness." (Millman, 1974, p. 335)
In fact an amplified objective can contain an item
form as defined by Osborn. While both approaches are
tedious, Popham's approach would seem to provide a
means of overcoming Ebel's (1971) concern that only
trivial domains can be specified.

The question of domain validity can be answered,
as Millman (1974) observes, by determining the relation-
ship between scores on tests X and Y when X is composed
of a randomly selected set of items from Y, the set
of all items in the domain.

To answer question (3) one must determine the

adequacy of criterion related decisions based on the

29

domain score or an estimate of same. Within the mastery
learning model such decisions are about mastery/non-
mastery. In such a setting one might think that domain
and criterion validity would be the same since the
decision directly relates to the content. However,
since small differences near the decision threshold

are less significant for domain validity than for cri-
terion validity, these values may differ markedly.

One should note that a domain test is itself
a procedure involving item specification rules, items,
and sampling plans. Domain and content validation
determine the adequacy of these procedures. However
it seems that these two types of validation are indepen-
dent of interpretation of results or estimates of the
domain. Certainly, criterion validity seeks to address
interpretation issues, namely the adequacy of decisions
based on a test's results.

Sechrist (1963) used the term Incremental validity
to refer to the extent to which a variable raises the
multiple correlation when it is included in a set of
predictor variables. Cronbach and Gleser (1965) suggested
that testers justify the use of instruments by showing
that an improvement in some decision resulted from
application of tests and further that the magnitude

of improvement warranted the cost involved. Thus they

30

were challenging testers to determine the incremental
validity of their instruments. In the context of cri-
terion referenced testing, if prior information is
available, does the data in the form of an estimate

of the domain score actually reduce uncertainty below
the level possible with the prior data alone? The
issue of incremental validity is a special case of

the criterion validity question in that it relates

to decisions based on estimates of the domain score.

Domain Test Length

 

As Hambleton et al., (Note 1) observe;
"The problem of determining test length
(in the criterion-referenced situation)
is related to the size of the misclassi-
fication errors one is willing to toler-
ate." (p. 63)
In general, the longer the test the smaller the size
of misclassification error. However, the reality of
objective based curriculum systems which use a number
of domain tests to make criterion decisions is that
the feasible length of tests is quite restricted. Novick
and Lewis (1974) feel that twenty items per objective
is too large. However, in practice, criterion decisions

are often made based on results of tests of five or

fewer items.

31

Two avenues have been explored as means of speci-
fying test length given the specific magnitude of mis-
classification error one is willing to tolerate. Novick
and Lewis (1974) have developed the Bayesian solution
to the problem while Millman (1972) and Fhaner (1974)
have used strict binomial methods. Each of the two
approaches will be discussed in the remainder of this
section.

Millman's procedure yields the proportion of
misclassifications given an examinee's true proportion
correct ("i)' a mastery level (no) and the number of
items on a test. By applying the tables (Millman,

1972) one can determine the number of items necessary

to make a decision which is accurate to a given proba-
bility level. A disadvantage is that to use the method
one must have the true score if specific recommendations
about test length are to be made. Millman's tables

are important in that they show, theoretically, the

high degree of uncertainty in making criterion decisions
based on curriculum embedded tests. For example, the
probability of a false negative when "0 = .8 and

"i = .8 and n = 15 is .35.

Fhaner (1974) model is based on the same binomial
theory as Millman's. One must specify an indifference

region (n1<ﬁi<w2) about the cutoff score within which

32

classification errors are considered unimportant. Then,
with the acceptable probabilities of misclassification
specified, the necessary test length and X1 and X2
corresponding to Hi and u: may be solved for using
a normal approximation to the binomial. Unfortunately,
using this procedure if N1 = .70 and n2 = .80, then
n = 121: clearly an unacceptably long test. Looking
at the problem in reverse, Fhaner's empirical investi-
gations show that if between 12 and 17 items are used
n2 = n1 = :3. This seems quite large for most applica-
tions. Simply, both binomial models suggest that tests
of the typical length used in criterion referenced
situations will lead to many misclassifications. The
notion of effective lengthening of tests by utilizing
prior knowledge is hopeful.

Novick et al., (1973) suggest that the kind of
subjective prior information they foresee being used
as worth between six and fifteen additional items.
Coupled with their recommendations that a test length
of twelve or less is "very desirable" (Novick and Lewis,
1974, p. 158), it is suggested that the application
of Bayesian methods outlined in the earlier section
on estimating domain scores could result in reasonable
length testing sessions and decision certainty equivalent
to that achieved with tests containing eighteen to

thirty-seven items.

33

Novick and Lewis (1974) have explored a Bayesian
model with the prior distributions having specific
means. They observed that when "...the average test
score of the group is high (i.e., above the criterion
level) and there is little variation among individuals,
shorter tests become feasible." (p. 148) These authors
have developed tables which yield test length and minimum
Xi for advancement. These tabular recommendations
are based on prior distributions with known mean. Values
are recommended for several loss ratios. In addition
one is able to take into consideration his feelings
about the extent of dispersion in the prior. Hambleton
et al., (Note 1) caution that the recommendations hold
for the Bayesian Beta Binomial Model only and the cpti-
mality of the recommendations for the Bayesian models

reviewed earlier is not known.

Summary

It was concluded from the review of definitional
issues about criterion referencing and domain referencing
that decisions are criterion referenced and tests are
domain or objective referenced.

The literature reflects a concern about using
proportion correct or raw score for criterion referenced

decisions when only a few items are present. Several

34

scholars have sought to develop effective procedures
which allow information in addition to test scores

to be considered in the decision algorithm. There

has been investigations of two approaches for accomplish-
ing this.

The first is Truman Kelley's classical true score
model. This approach makes use of information about
the status of the group from which individuals are
drawn to arrive at the best estimate of an individuals
standing on the variable being measured. The extent
to which group status is considered is a function of the
proportion of true variance accounted for by the test.
Since domain or objective referenced tests often yield
scores based on a few items and thus do not account
for the desired level of true variance, many applications
of this model to criterion referenced situations will
result in the incorporation of collateral data, namely,
the group mean, into the estimation process.

A second procedure for adding non-test information
to the estimation process follows the Bayesian model.
This methodology also takes group mean into the estima-
tion. In addition, this approach attempts to incorporate
antecedent information (t) by asking the investigator
to set the number of items which would provide information

equal in amount to that which he has about the subjects

35

prior to testing. As with Kelley's approach, the extent
of inclusion of the group mean is a function of the
amount of error variance. This error factor is a single
valued function of t.

Some attention has been given to the theoretical
questions of making decisions based on domain estimates.
An algebra for incorporating loss values into the decision
process has been developed, however, this writer could
find no work which provided insights into the problem
of how to best set loss values.

Content, domain and criterion validity have been
delineated in this chapter. The first two are primarily
dependent on the adequacy of content or domain Specifi-
cations. The domain validity of a subset of items
which meets Specifications is the correlation of results
on that subset with results over the entire domain.
Criterion validity reflects the precision with which
decisions are made about reaching or not reaching a
standard.

Another validity issue addresses the question
of whether test information improves decision making
and if so, to what extent. This issue is an important
one for those contemplating use of short domain or

objective referenced tests.

36

The final section of this chapter addressed re-
search on domain test length. The use of the binomial
model to make decisions about test length were discoura-
ging. The binomial literature suggests a need for tests
much longer than seem practical in most instructional
applications. The use of antecedent and collateral data
may help reduce test length requirements.

The research which is described in the following
chapters aimed to determine if decision precision can
be improved by use of antecedent and/or collateral infor-
mation. The research proceeded in three stages. In
the first, information present in several antecedent
and collateral variables was determined. Then, least
squares and Bayesian domain estimation models were de—
veloped. Finally, the mastery/non-mastery classifica-
tions based on the least squares, Bayesian and raw score

(proportion correct) approaches were compared.

CHAPTER III

DESIGN AND PROCEDURES

Population

 

The population for this study was all fifth grade
students involved in the MICA (Managed Instruction with
Computer Assistance) project in the Madison (Wisconsin)
Public Schools during the year of 1976. Six of the thirty-
three elementary schools in the district participated
in the project. Of the six, four were located on the
east side of the cities' isthmus and two on the west
side. Madison has a high percentage of professional
and white collar workers. The west side of town is pri—
marily residential with some large business enterprises
such as insurance companies and financial institutions.
The east side of Madison contains the city's industry.
The oldest residential areas are on the east side and,
in general, property values are lower in the eastern
part of the city. Most blue collar workers live on the
east side.

The schools which participated in the MICA project
were selected primarily because of the interest of their
staffs and administrators computer managerial instruction.
The final judgment regarding which schools were selected
was made by the MICA project director using his

37

38

knowledge of the teachers and principals of the schools

which were interested in joining the project.

Sample

Four of the six schools in the project were selected
for the study. Both schools on the west side were chosen
along with two of the four on the east side. These
two were picked at random. The four schools had a total
of 225 students in the MICA mathematical project in the
fifth grade. For the purpose of this study values for
all of the variables defined in the next section needed
to be available for each subject. This was the use for
172 of the 225 students. These 172 comprised the sample
for the study. Since a portion of the research required
a cross validation group, the sample was randomly dicho-
tomized into sub samples of seventy-five and ninety-
seven. The smaller of the sub samples was used for cross

validation purposes.

Variables

 

Each subject took a one hundred fifty-six item
test which covered material on a single instructional
unit called Introduction to Multiplication and Division.
This unit was comprised of six objectives. The test

contained twenty-six items for each of the objectives.

39

For each student in the two groups the student
history file of the MICA system was queried to amass
testing history, number of school days of instruction
per objective, and sex. The student history file was
automatically maintained and updated by MICA software.
Each time a student took a test the score was inputed
by computer terminal. Records of the date of initial
pretesting and successful post testing were kept by the
system. Thus accurate information about testing and
rate of objective achievement were available for each
subject in the MICA project. In addition, standardized
achievement results were gathered for each child.

The variables of the study can be divided into
three types: those containing information about domain
achievement, the actual domain achievement, and those
containing information about decisions (mastery/non-
mastery) about domain mastery.

The information variables are:

1. Instructional Testing History (TEST)

2. Instructional Time History (TIME)

3. Sequential Test of Educational Progress
(STEP)

4. Sex (SEX)

5. Domain Item Samples (SUBTEST (J) )

The measure of actual domain achievement was a
156 item test (DOMAIN).

The decision variables were designations of mastery

or non-mastery on the basis of several decision criteria.

40

The criteria were:

below.

1. Score on SUBTEST (J), 12, 18, ... 60

2. A least squares estimate of DOMAIN containing
a subset of the information variables but
not SUBTEST (J).

3. A least squares estimate of DOMAIN containing
SUBTEST (J), J=6 and 12.

4. A Bayesian estimate of DOMAIN, based on
SUBTEST (J), J=6 and 12.

A definition of each of the variables is written

1. Instructional Testing History (TEST):

This variable was the mean number of instruc-
tional tests per objective taken during the four month
school period preceeding testing on the large multi-
plication and division domain test.

2. Instructional Time History (TIME):

This variable was the mean number of school
days that each subject in the sample spent per ob-
jective during the four month school period
preceeding testing on the domain test.

3. Sequential Test of Educational Progress (STEP):

This variable was the raw score on the STEP Math-
ematical Concepts Test taken prio to domain testing.

4. SEX
This variable was the sex of the subject.
5. Domain Item Samples (SUBTEST (J) ):
There were several variables in this category.

Each was the raw score on a sample of items

403

drawn from the items in the domain. Ten such
subtests of length 6, 12, ...60 were created.

Each was the result of stratified random sampling
(without replacement) from the 156 items. The
stratification factor was objective. Since

the sampling process was done separately for

each of the J subtests, it was possible for

an item to be present in more than one of the
subtests.

6. Domain Achievement (DOMAIN)

.This variable was the raw score on the

156 item unit Test. This test was comprised of

156 items covering a unit on multiplication and
division. The unit contained six objectives for
which there were 26 items each. The items on the
domain test were typically used as pre, post and
review tests in the MICA system. The development
of these items began in the late 1960's. They were
written by teachers and underwent continual content
review. Since 1972, when the items became part of
the MICA math program, the items have undergone
analysis to assure they were keyed correctly and that
the foils were plausible. In addition the content
validity was again assessed by members of the MICA

project staff.

41

Each of the classification variables was calculated
for five mastery levels: seventy, seventy-five, eighty,
eighty-five and ninety percent. The proportion corres-
ponding to each mastery level was multiplied by the
length of the test and this figure was rounded to the
nearest integer to obtain the various criteria. Then
the values of the classification criteria were compared
with individual subject values to obtain the classi-
fication variable values. These were zero of one in
all cases. There were six bases for classification,
which are defined below.

7. Mastery or non-masteqrbased on the six

item domain sample (SIX).

8. Mastery or non—mastery based on the twelve
item domain sample (TWELVE).

9. Mastery or non-mastery based on a least
squares estimate which contains the variables
TEST, TIME and STEP. The deve10pment of
this variable is discussed later in this
chapter (YHAT).

10.. Mastery or non-mastery based on a least
squares estimate composed of those variables
listed for (9) plus SUBTEST (6). (YHATP).

11. Mastery or non-mastery based on a Bayesian
estimate which utilizes the information of the variables
listed for (9) above and is based on SUBTEST

(6) and SUBTEST (12). (BAYES6 and BAYESIZ)

42

Several descriptive statistics were calculated
for the variables of the study. Means, standard devia-
tions and ranges were calculated for all variables.

In addition, Hoyt reliabilities and standard errors
of measurement were obtained for DOMAIN and SUBTEST

(J) (J=6, 12,...60) as well as the six objectives of

the domain test.

Methodology
There were three stages of analysis for this
study. The first was undertaken to discover the infor-

mation relationship which existed between the antecedent

and concommitant variables and the domain achievement.

The second was to develop least squares and Bayesian
models for making estimates about domain achievement.

In the third phase, classifications based on the estimates
of the models developed in Phase II were compared.

The remainder of this section will discuss the methods

used in each of the three phases.

Phase I

The approach used in this phase is based on the
assumption that data represents information if and
only if it reduces uncertainty involved in making a
decision. Least squares stepwise regression (Draper

and Smith (1966), Kerlinger and Pedhauzer (1973), Rose-

43

boom (1966), and Cohen and Cohen (1975)) was employed
to reveal the information which existed between STEP,
TEST, TIME, SEX and SUBTEST (J) (J=6, 12,...60).

The first step in determining the information
value of the information variable was to examine the corre-
lation matrix of these variables and DOMAIN. This
was followed by a series of stepwise regression analyses
to determine the most parsimonious set of antecedent
variables which contained information about

Specifically, the partial coefficients of correla-
tion and alienation were examined. Partial correlation

provides an index of the amount of information in one

variable (say X) relative to a second (say Y) which
is distinct from that information present in yet a
third variable (call it W) relative to the second (Y).
Conversely, the partial coefficient of alienation is
an index of the uncertainty present when a decision
about one variable (Y) is made on the basis of the
information in a second variable (X) which is distinct
from that in yet a third variable (W).

Phase one identified the interrelationships between
the variables relative to domain achievement. The
zero, first and second order partials for the antecedent
and concommitant variables relative to the criterion

were examined for the purpose of describing the infor-

mational relationships. Based on this examination

and the t-tests of significance for the various partial
regression weights those information variables which
contain information relative to domain achievement

were identified. Then a regression equation was develOped.
All of the partial regression weights of this equation
were significant at the .05 level. The partial corre-
lations indicated the distinct information present

in each variable. The coefficient of alienation (1-R2)
was the proportion that uncertainty was reduced by
considering the set of independent variables as infor-

mation about domain achievement.

The concepts of incremental validity and functional
length Of a test were used to illuminate the value of
the prior information relative to test information.
Incremental validity was determined by adding domain
item samples of ascending size (SUBTEST (J) J=1,6,12,...
60) to the regression equation while at the same time
deleting the sample of the previous size (SUBTEST (J-l)
with J¢1). The incremental validity was the change
in R2 when each domain item sample was considered with
the prior information. The t-test of the regression
weight for the item sample variable indicated if the
incremental validity was significantly different from

zero. The difference between the coefficient of alie—

45

nation prior and subsequent to adding item sample results
indicated the degree of uncertainty reduction in the
domain estimate accomplished by use of test results.

The incremental validity of three tests of lengths
6, 12, and 18 items were assessed. The tests were
generated by stratified random sampling from the original
domain of one hundred and fifty-six items. Stratifi-
cation was based on the six objectives. Sampling was
with replacement of items to the domain pool after
each test Was generated so that items could appear
on more than one test. In effect, these equations

were studied in a stepwise fashion. The base variables

were those found to be significant in the earlier portion
of Phase I. The stepped in variable was, in each case,
the domain item sample raw score. The regression weight
of each test was evaluated using an F-test. Multiple
correlations and coefficients of alienation were calcula-
ted for the equation containing only the base variables
and also for each of the three equations with test
results added. This allowed determination of 1) the
existence of incremental validity, 2) the magnitude
of existent incremental validity, and 3) the extent
of uncertainty reduction attributable to the test results.
The functional lengthening of a test refers to

a process of utilizing prior data in conjunction with

46

test results for improving the quality of decisions.
When three types of data were incorporated into

a decision model, the functional length of the test

when coupled with the other data was equal to the length
of the hypothetical test which provided the same amount
of information when used alone.

The functional length of the 6, 12 and 18 item
subtests were investigated. Stratified random item
samples of the respective lengths were generated. Then
three equations were constructed; each contained the
results of one of the three short tests and the base

variables of the equation develOped earlier. The

functional length of a test is defined as the length

of the test whose correlation with the domain is equal
to the multiple correlation of the model containing

the information variables and the results of SUBTEST
(J). So if the correlation between SUBTEST (12) and
DOMAIN is equal to the multiple correlation of the
model containing the information variables and the
results of SUBTEST (6), then the functional length

of SUBTEST (6) is twelve. Linear interpolation was

used to find functional length for the intervals between

6 and 12 and 12 and 18.

47

Phase II

The purpose of this phase of the analysis was
to develop least squares and Bayesian models in order
to make decisions about mastery and non-mastery of
the domain. The development of the least squares model
followed directly from the work done in phase I.

Least squares model deve10pment aimed to identify
the most parsimonious set of variables for predicting
domain achievement, namely the domain score. The desired
model has the form of a linear equation whose partial
coefficients were each significant at the .05 level.
Various permutations of the variables were considered
in a stepwise fashion. This assured that the final

equation was parsimonious while at the same time

containing the maximal information about the criterion
variable.

A second least square model was developed by
simply adding SUBTEST (6) to the information model.

To assure that the errors of estimation would
not be correlated with the independent variable, the
least squares statistics were based on a different
sample than the one to be used in phase III. The model
building sample contained 97 subjects drawn at random
from the 172 subjects sample of the project.

The Bayesian model used in this study is the

48

model II which was derived by Novick, Lewis and Jackson
(1973) and Lewis, Wang and Novick (1973) and discussed
by Millman (1974) and Hambleton, et al., (1978). The

Domain Score estimate was

1:
ll

0* gj + (1-p*) g. (30)

where

g. = Sin-1[(x. + 3/8)/(n + 3/4)]1/2, (x.

J J J
and n representing j's raw score and the number of
items respectively).

m
g. = Z g./m (mirathe number of subjects)
j=1 3'
and p* is the Bayesian estimate of the proportion of

true to observed variance.

The statistic p* is a function of the prior information
about the sample expressed as the length of the test
whose sum of correct responses would represent the
same amount of information as was available without
testing.

In order to get some empirical notion of the
effect of this information factor on classification,
p* was calculated for 62 values of t ranging from
2.75 to 18 in increments of .25.

Based upon the fluctuation of the classification

variable as a function of t, the decision about which

49

t values would be used in phase II of the analysis

was made. Another factor considered the decision process
leading to the selection of the precise Bayesian equa-
tions to be used for phase III was the data obtained

in phase I analysis pertaining to incremental validity.
Based on the point of View of Millman (1974) and Swa-
minathan, et al., (1975) the most apprOpriate t

value is the number of test items which would yield

as much information as was available without testing.
From this,one would deduce that the value of the classi-
fication variable should reach a maximum at about the
points where t equals the test length value of the

prior information.

Phase III

 

Phase III of the analysis focused on the comparison
of several approaches for making decisions about mastery
or non-mastery of this achievement domain. There were
three specific approaches studied. These were referred
to as: l) the raw score approach, 2) the least square
approach and 3) the Bayesian approach. Within the
raw score approach, scores on SUBTEST (6) and SUBTEST
(12) were the criterion for classification. Within
the least square approach two models were used. The
independent variables of the first were STEP and TIME.

The second model used these two independent variables

50

as well as SUBTEST (6). Two Bayesian models were exa-
mined: one based on SUBTEST (6) and a second based
on SUBTEST (12) Both attemphai to incorporate
information of the decision variables through specifica-
tions of t. Thus, in all, there were six separate
models for making classification decisions.
Classifications were made at five mastery levels
for each of the models. As a result there were a total
of 30 decision criteria, five for each model. The
classifications made on the basis of each of the 30
criteria were compared to the mastery - non-mastery
classification made on the basis of the 156 item domain

score. These comparison resulted in 30 unique vectors

of 75 zeros and ones. Each element of a vector repre-
sented a subject. The value of the element indicated
concordance of model classification and domain classi-
fication.

These data were represented as a two-way design
with repeated measures on each of the factors. The
fixed levels of Factor 1 were the six models. The
five mastery levels comprised the levels of Factor
2. The variance was analyzed to determined if there
were model effects or mastery level effects. The two

F tests for main effects were:

MS
mastery

Fmastery = MS and

 

within

51

MS

F = model
model MS

 

within

Under the assumption of homogeneous correlations of

all pairs of levels on each fixed factor these values

have F distributions with 4 and 20, and 5 and 20 degrees

of freedom respectively. If the homogeneity assumption

is not met, the most conservative F distribution of

the ratios will have 1 and 5 and 1 and 4 degrees of

freedom respectively. (Greenhouse and Geisser, 1959).
In this study, the homogeneity assumption was

not tenable. Thus, the most conservative F distribution

was used to determine if there were mastery or model
effects at the .05 level. If calculated F values exceeded
those at the .05 level of the appropriate conservative
distribution the selection of model or mastery level
was deemed to affect different classification success
levels.

In addition to learning if there were main effects,
the study sought to determine if approaches differed.
In order to determine if approaches as well as specific
levels of significant factors differed, Scheffe's method
of multiple comparisons was used. The contrasts of

interest are given in Table 2.

52

Table 2. Contrasts for the model factor

Contrast
(KSIX + iTWELVE)/2 I (ivHAT + YYHATP)/2
(ESIX + iTWELVE)/2 ' (EBAYES6 + EBAYE812)/2
(ETHAT + i YHATP)/2 ” (EBAYESG + EBAYESlZ)/2
YSIX " iTWELVE
3ESIX ' SETHAT
X - X

SIX YHATP

XBAYES6 - XBAYESlZ

CHAPTER IV
FINDINGS

This chapter contains the results of the analyses
which were described in Chapter III. The findings
are presented in five sections of which four are parallel
to the discussion of the analysis and design in Chapter
III. The first section gives the statistics which
describe the variables of the study. Sections two
through four will present the results of the three
phases of analysis. The final section of this chapter
will summarize the findings in terms of the three objec-

tives which were stated in Chapter I.

Variables

 

Table 3 shows the descriptive statistics that
were calculations for the twenty variables considered in
this study. The statistics of the test variables of
the study are given in Table 4. Since the domain test
serves as the criterion for much of the analysis of
this research, its reliability of .9777 is of particular

importance.

53

54

 

.H ) o mnvﬂ. poem. some. mﬁmmxcm
.H ( o «cam. meow. mmmw. omm><r
.H I o «mam. mmmv. ommo. ma<:>
.H I o unﬁm. mmmv. exam. 9<=>
.H u o omwa. mmov. mean. m>cuze
.H a o mmmm. hmmq. name. me
.vm~(.v mo.w~m Hmm.m~ mm.m- z~<zoo
.oo) .5 mm.omH mhm.o~ chm.hv .ocvemusmau
.vm) .m so.moH NmH.o~ omm.av Avmvemmemam
.mv) .v mmo.mh vmmn.m vn~.hm .mvvemmemsn
.mv) .N NoH.mm hamm.h ~vo.mm Avaemuamzm
.om) .m mnm.om momm.m ~mm.mm .omvemmemsm
.omu .m va.N~ mmhh.v mm~.v~ AOMVEmmemsm
.qmu .v mmo.o~ mmo~.m vhm.m~ AvNVEmmemDm
.mau .H ~m~.- mmmm.m hmo.qﬁ .mnvemmemam
.mH) .N ooem.¢ m5m~.~ mmom.m .mavemmEmam
.w u .H «mmv.H mam~.H mmmo.m .o.emmem:m
.N n .H MMHmN. «nuom. vmmv.~ xmm
o.mvno.m v-.mm moe«.m om~.m~ imam
m.~mummmm.m Nmm.~m momo.v oaom.m mzHe
o.vno.H momvm. mvnmm. hnmm.m Ewes
cocoa .coccwuo> Godu0w>ma com: @532
cunccoum wannaun>

>c=um ecu CH new: meancwun> Ham HOu mowumwumum m>Humﬁuommo .m canoe

55

 

maam.m mmvm. mm»m.oa mamm.nv Iomvemmemom
mamm.m momm. mmma.oa mmmm.ae Aemsemmemom
mmem.~ mmmm. ammn.m vena.am imesemmemom
mmma.m mama. nﬁmm.n noeo.mm imecemmemom
Hmam.a mmmm. mmmm.m «Hmm.m~ Ammsemmemom
mmaw.ﬁ ooem. mmnn.v namﬁ.v~ Iomcemmemom
aaon.a mvmm. mmoa.m venﬂ.mﬁ iemcemmemcm
amma.a mmom. mmmm.m memo.ea Amﬁsemmemom
man.H comm. Hn-.m mmmm.m Imasemmemom
same. mmam. mHNN.H mmmo.m Amvemmemom
mmmm.H seem. nmmm.o nomn.na o m>Heomemo
onR.H oemm. mmme.m mmvm.mﬂ m m>Heombmo
momm.H mmmm. mmnv.m mmmo.o~ a m>Heomnmo
aomm.~ «mam. mamm.m nmmm.om m m>Heomnmo
ommm.H ommm. mmmm.a nmmm.m~ m m>Heomemo
amam.a mmmm. hmmm.e mmae.am H m>Heomemo
mme~.a mama. mamm.mm Hmmm.mmH zHazoo
Honum hawHMQMHHmm coﬂpma>ma new: umwa
cnmocmpm neon ouncemnm

mm>Huommno me on» mnemHHmEoo
mamua cam Iecammemom .zHazoo now monumanmum m>anaanommo .v magma

56

Phase 1

Based on very low correlations of SEX with the
other information variables (see Table 5), it was elimi-
nated from further consideration in the study.

Figure 1 depicts graphically the information
value of various combinations of the variables STEP,
TEST, and TIME. The vertical axis represents the coef-
ficient of alienation while the horizontal indicates
the configuration of variables under consideration.

By following the lines on the graph from left to right
one can gain insight into the uncertainty reduction
which will accrue by adding the indicated variable.

If one compares the slopes of the line segments with
the same initial point but different ending points,
the relative informational value of the added variable
will be apparent. For example, comparing the slope

of 08 with on indicates that STEP provides more infor-
mation about the dependent variable than does TEST.

In fact, examination of the segments representing the
addition of TEST to equations, indicates that TEST
contributes very little (if any) information. TIME
appears to provide some information, but not as much
as STEP. Another way of looking at the value of a
variablehs information is seen in Table 6.

Study of the Sixth table confirms that STEP is

57

ompuﬂﬁo mum mucwom Hmﬁﬂomp HHm a

 

pm pm mm am mm em om mm mm mm mm) mm) an Ho Loosemmemom .mH
mm mm mm am mm as mm mm mm hm) mm) mm Ho- Asmsemmemom .eH

mm mm Hm mm om mm om mm mm) mm) em mo imvcemmemom .mH
mm mm. mm om am He mm am) am- mm Ho imasemmemom .NH

mm om am am mm mm mm- am- mm mo .mmsemmemom .HH

mm mm mm am am mm) mm- mm mo Aomoemmemom .OH

om mm mm mm mm) «m: mm so Aemsemmemom .m

Hm mm mm mm) mm) mm mo imHsemmemom .m

mm Hm mH) hm) mm mo AmHsemmemom .5

am mHu on. em mo- Hmsemmemom .m

mm- am) mm mo zHazoo .m

Ho Hm) mo amme .s

hm) mo- mzHe .m

«NHI AMBm .N

xmm .H

mH «H mH NH HH oH m m a m m a m m H
mﬂﬁmﬁmxwmﬂﬂuw GHMEOU 6G0 WOHQMHHM> COﬂﬂMEHOMGﬂ MO ﬁOﬂUﬂHwHHOUHmuhHH om OHQMB

Uncertainty

J.
.90

080‘

.704

 

STEP
TEST
TIME

-<II>Q

Figure l.

58

 

w

 

U11»

None 1: 2

Number of variables combined

Reduction of uncertainty for combinations
of 3 information variables

59

 

ooo. mmvo. mmmm. onmm. mzHB m
ooo. omvN. mHNm. ommm. Amam m
Hoo. mhmo. mhmo. mmmN. Emma w
000. MONN. momm. 05mm. mmem m
000. hmmo. Nmma. ommm. mEHB m
Hoo. mnoo. mnoo. mmmm. Emma m
hmv. HNoo. mmmm. onmm. Emma v
000. HONN. mwmm. mmmm. mmam v
000. Nva. Nvma. mmmm. MSHB w
000. momm. mmmm. onmm. mmam m
mmm. omoo. NmMH. ommm. Emma m
000. NvMH. Nvma. mmwm. MSHB m
hmv. HNoo. mmmm. onmm. EmmB N
000. Nomo. mvmm. mmmm. MEHB N
coo. avom. avom. mamm. mmem N
Hoo. mmeo. momm. 05mm. MEHB H
NmH. vmoo. mNHm. ommm. Emma H
ooo. Hvom. Heom. mHmm. mmem H
vm coaumcﬂenmuma mo .mmou coaumcﬂEHouma cOHumamHHoo mmHQMHHm> mmpm coﬂumsvm

ca mmcmnu

no “#88338

mHoHpHsz

szzoo nuaz meanneun> coﬂumEHOMcﬂ can mo

mGOHuMHSEHmQ Ham MOM moﬂumwumum coﬁmmmummn cmﬁ3mmum

.m manna

.60

the most informative variable. Time is the second most
informative. Examination of the results of the six
equations (especially equation 4) suggests that TEST
has no useful relationship with DOMAIN which is inde-
pendent of STEP and TIME. Table 7 gives additional
insight into the relationships of the information in-
herent in the four variables. In particular, the zero
order partials for the three information variables
indicate that each has a significant information factor
relative to DOMAIN. However, the lack of significance
of the first and second order partials

rDB .o, ’rDB.y

and rEB- Y suggest that the information in TEST relative
to DOMAIN is accounted for by STEP and TIME.

Based on the data presented in Tables 6 and
7 and the earlier elimination of sex as a useful variable,
the most parsimonious regression equation relating
non-test informational variables to DOMAIN included
the independent variables TIME and STEP only. The
basic statistics for this equation are presented in
Table 8.

In order to determine the utility of including
test information in the decision process regarding
domain achievement, incremental validity was explored.

Table 9 provides the data for assessing this incremental

validity. The base, non-test, information accounts

Table 7.

61

Partial correlations and coefficients of
alienation for the information variables with DOMAIN

 

 

 

_ =_ * =- *
rDa .551* rDB .260 rDY .366
KDa— 835 KDB= .966 KDY- .931
Zero - 313* 273* 613*
Order raB- . ray— . rsy— .
K = .950 = .962 K - .790
dB aY BY
rDB .110 rDyoa .269 erd .577
K = .994 K = .963 K - .817
. = -_-_-_. * :
First rDa.3 .513* rDY-B .271 raYJ3 .1081
Order
K = 858 K = .963 K - .9941
_ * = _
rDa.Y .504 rDB-Y .048 ran' .192
K = .864 K = .999 K = .981
=-— * = = . *
Second rDy.aB .253 rDB.aY .057 rDa.BY 505
Order K = .967 K = .998 K = .863

 

*Significant at .05 level

Domain
STEP
TEST

TIME

62

 

 

 

 

OHm.mmm« comm. Nth. OHmm. mmmm. AmHvBmmBmDm
www.mnoe NOHw. hmHm. hHmm. omam. ANHvBmmemDm
nom.¢n« Nmmm. man. Hmmm. Hmvh. Amvammemom
onm.m¢« mmom. mvmm. mmmm. AMZHB .mmemv mmmm
m coaumcmﬂaé mo muwpﬂam> coepccwﬁhmuwa coaucawuuou moHQMHHm>
ucoﬂowwmmoo HmucmﬁmnocH mo pcmﬂowmmmoo mHmﬂuasz
mammHmcm NHHUHHM> HmucmEmHUGH MOM moﬂpmwumum .m manna
vooo. mmmmm.m1 ommm.) MSHB
oooo. Shomm.h mhmw. mmam
Hm>mq mon>IB pcmﬂoﬂmmmou manmwnm>
cocooHMHcmHm scammmummm
powwcHMUcmum

zHazoo on imam can mzHe mcHumHmn 90m moHumHumum GOHmmmnmmm .m mHnme

63

for over thirty-five percent of the variance in the
dependent variable, domain achievement. The six item
subtest result accounts for an additional twenty percent.
If it is assumed that the relationship between information
and test length is approximately linear in the interval
between six and twelve items, a ten item test will augment
the information in the base variables by an amount equal
to that contained by the base variables. Since the
F tests listed in Table 9 are for the partial regression
coefficients relative to the dependent variable DOMAIN,
each test significantly augments the base variables.
Tables 9 and 10 may be used to deduce the func-
tional length of SUBTEST(6) coupled with STEP and TIME.
Table 9 shows that the coefficient of determination
(R2) for the base variables plus SUBTEST(6) is .5521.
Reference to Table]I)allows one to see that this R2
value lies between the r2 for SUBTEST(6) and SUBTEST(12).
The graph of Figure 2 shows the relationship between
the length of the subtests and the corresponding r2
with domain. Based on this graph, it seems reasonable
to obtain the functional length of the six item test
by linear interpolation. This process yields a value
of 8.03 which is the functional length of the 6 item
subtest augmented by the two information variables.

Clearly, the coefficients of determination in Tables

64

 

«no. nmm. Homsemmemom

 

com. 0mm. ismsemmemom
com. mmm. Amecemmemom
cam. mam. imasemmemom
omm. oma. Imm.emmemom
mam. cmm. Aomsammemom
OHm. «mm. .qmsemmamom
omm. mmm. .mHsemmemom
mmm. nom. HmHsemmemom
mHe. mam. Aesemmemom
COH¢MCﬂEHO¥OQ MO m¢CmHUHMH®OU meQMHHm>
mpcmﬂoﬂwmmoo coaumamnuoo

ZH4EOQ £#H3 AhvBmMBmDm HOW GOﬂ#MCﬂEH®#0U UGM COHﬁMHOHHOO MO mHCQHUHMMOOU

.oH wanme

653

2
Figure 2. The relationship off! and subtest length

1.00)

.90

.80

.70

.604

.50‘

.40

.30‘

.204

.10

 

 

 

12 18 24 10 36 42 48 54 60

ON

Subtest length

66

9 and 10 suggest that the antecedent variables provide

-a decreasing amount of information in combination with
test data as the number of items in subtests of the domain
increases. In effect, the functional lengths of subtests
containing 12 or more items is the same as the length

of the specific subtest.

Phase II

Tables 11 and 12 contain the basic statistics
of the two least squares regression models which are
to be the basis for classification. The first of these
two tables contains only the information variables
STEP and TIME. Table 12 presents the statistics for
the information model with SUBTEST(6) added. The stan-
dard errors of these two models are 22.42 and 19.40
respectively.

The statistic of the Bayesian model which is
roughly analogous to regression weights of the least
square model iS(wh Table 13 presents values of 0*
for three values of t which span the range of t values
used in this study. Numbers are given for SUBTEST(6),
SUBTEST(12), and SUBTEST(18). Reference to equation
(22) of Chapter II suggests that as t increases, the
influence of the mean becomes larger. Also, in all

cases the influence of a subjects score becomes greater

67

 

 

ommmm. mmonHm. Hesse. emuH cmmuanm
oamme. amass. Haoam. emuH w>Hmze
mmHmm. mmmmm. amemv. smuH me
mH m.m mb.N mNﬁm
up up up camp

mH UCM Nﬂ am ##OCQH MO WHmmgﬂgm (HOW (... MO WGSHM> mwhﬂu. .HOM mmﬂdﬂnr «.0 .MH MHQMB

 

 

oooo. mnmon.m amov. Amsemmemom
NNoo. mmomH.m( meN.) mzHB
oooo. wmhmo.m mmme. mmem
Hm>mH mon>|B ucmﬂoﬂwmmou mHQMHHm>
cocooHMHcmﬂm coﬂmmmummm
cmuﬂphmccmum

zHazoc cqu imvemmemcm can imam .mzHe now moHanumum conmmnmmm .NH mHnme

 

 

Hooo. mNHmH.¢) mmHm.) mzHB
oooo. Hemom.> seem. mmem
Hm>mH mon>IB ucmﬁoﬂmmmou mHQMHHm>
cosmoHMHcmHm scammmummm
cmuﬂoumocmum

zHazoo nqu mean can mzHa now moHumHumnm eonmmnmmm .HH mHnme

68

as test length increases. Figures 3, 4, 5, 6 and 7
illustrate the effect of the t parameter on the classi-
fication of the Bayesian Model. One can see that in
all cases the most accurate classification can be achieved
with t set equal to 2.75. Thus for the purpose of com-
paring models, it was judged apprOpriate to use the
Bayesian Model with t equal to 2.75. This is the
apparent "best" Bayesian model available for the present
data.

The means and variances of the two raw score

decision criterion are given previously in Table 1.

Phase III

 

The concluding set of findings yield information
about the relative effectiveness of the three approaches
for making decisions about mastery or non-mastery of
the achievement domain. Table 14 presents the number
and percentage of correct classifications for each of
the six models at each mastery level. The remainder
of this section discusses results of the statistical
analysis of these data.

As is shown in Table 15, the analysis of variance
yieldedsignificant mastery level and model effects.

With respect to the mastery level factor, the proportion
of correct classifications appears to decrease as

mastery levels increases. This can be seen in Table 16.

69

 

 

 

- when m.vm m.NH m.oH m.m m.o m.v mh.N
Io"
.... NHII-z
((I man
am.
In
I
O
. d
O
J
3
O t.
.00 w
O
T:
D
O
1
I
w
05. 3
3
TL
P
s
s
r.
3
:5. m.
9
......O.........‘O'...’..I...O..OI..........I..........O.... 1
C t.
......O.....'...C....... o
u
I I ..cm.

 

on. no Ho>oH pounce on» ecu :oHuconHmmmHo uumuuoo Cu u we mHzmcoHucHeu 029 .m cucmwm

7O

 

 

 

. m.wH «”...a; m.NH m.c.n m.m m.o m.v mh.N
Ill o H
.... NHHZ
I) mHuz
om.
d
1
o
d
o
z
.4
. M
ow U
c
To.
3
o
I
1
m
/// on. 1
3
T.
D.
a:
S
t.
’l’
. .21. m.
9
- 1
... ... .. ... .. ... ..... ......C. .. ............. .. .. ... ........I. .... C ........... . '...... m.
IIIIIIIIIIIIIIIIIIIIIIIIIIIIII J;(IIIIII(III))IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII u
so.

 

mp. mo Hw>eH umanne ecu HOu :oHucoHMHmmmHo uoouuoo o» u uo mHzmcoHuchH 028 .v ousmﬂh

'71

 

 

- .....3 as): m: 3: 2 m... m... m:
.())( e uz
.... NHuz
--- mHuz
om.

d

I

0

d

O

I

.4

. t

.30 m

0

J

3

O

I

I

w

or. 3

3

I

P

S

S

T.

a 7:

.om. m.

p.

3

T.

00......OI...O.......I.........OOCOOUOIOOOCOCOO.........I........I.......O.I.0.0...O O

U

,sm.

om.

 

no He>mH noumnE ecu HOL :oHumoHuHmmmHo womuuoo Do u we lwzmcoHuchH m2? .m oocmqm

 

 

72

) when mﬂva m.Nm m.o~ m.m m.w m.v mh.N

Ill-ll o "H

.... NHJZ

--- 3%

cm.
.d
I
0
d
O
1
.4
. W
.ow U
0
T:
3
O
I
I
a
C O . O I . .. 0. . O . .. . C . .. .. ... . .. .. .. ... .. ... .. ... Oh OF. 3
H D
o TL
. P
o S
o s
O T.
o T?
llllllllllllllllllllllllllllllllllllllllllllll scallllllllllllllllllllllllllllllllllilli om. ﬂ.
. P
. ‘4
o t
. O
c U
.om.

mm.

wo Hm>mH umummE

 

ecu HON coHumoHWHmmcHo uowuuoo Cu u uo oHcmcoHochu are .m muzmHm

73

 

 

) mHmH mHVH m.~H m.oH m.m m.e m.v m~.~
IIIIW"
.... ~Huz
--- mHuz
om.
d
I
0
d
O
I
3
. M
OOOOOOOIIOOOC” vow u
. O
o I:
u 3
. O
. 1
. I
. H
............. “------a)-------((-------)(-)))---(--(-----(--)(--)------)uu---(------ ca. 3
u 3
. .l
. 9
o S
o S
0 T.
OMCOOOOOIOOOIOCOI.....IOOOOOOOOOOOOOO.....OOODIOOIOOOOOCQOOIOOOOIOOCOO 7’
low. m.
P
3
TL.
0
U
.oa.

cm. «0 Hc>wH sconce ecu new coHooomemmmHo uoouuoo 0o u wo dHcmcoHumHon one

 

.5 mucmHm

74

 

Hm>mH mo. um namoHuHamHm. HmmmH. ommm cHnuHe

mm. Hummﬂ. 0N coﬁuomnwucH

 

He.m u m Hmmo *mn.m HHmo~.H a Hm>mH Humane:
§ 0

Ha.“ n a Hmmo «os.OH sommm.H m choz

:oHuannano m a m: we monsom

m>Hum>Hmmcoo

moﬂumﬂumum moccanm> mo mamaHmcd .mH manna

 

 

mm me Hm mm me He 2
h.mh o.om o.mm m.mh h.Nm n.¢m w om
, we be be me He me 2
m.mm h.Nm n.Nm o.vm m.Hm o.om w mm
me on me e No cm 2
n.Nm n.0w m.mm m.mm h.Nm h.@@ w om
mm mm em mm mm mm 2
h.Nm m.m> o.Nh m.mw h.Nm m.mn m mu
em om mm mm «0 mm 2
m.mm o.om h.mn m.hb m.mm o.¢m w on
NHmmwdm mmmwdm memmw Bdmw 0>Hw38 me Hm>mq
Hmpoz mnmummz

macaumoHMHmmmao uomHHoo mo pamoumm can Hmnﬁoz .vH manna

75

 

 

NNHN. mmmm. om.
HMHN. mmmm. mm.
ovON. wmﬁh. om.
mama. Nth. mh.
mmvH. thm. on. Hm>mq humummz
mHvH. mme. NHmMW¢m
NmHN. mmmw. mmmwdm
NMHN. mmmo. madmw
mNON. boon. Emmw
mHaH. mammh‘ m>qmza
HmHN. Mbhm. me HTGOE
moccaum> cmmz Hm>mH Houomm

Hm>mH wumumme can HcUoE mo mam>ma How mQUGMHHm> can meme:

.mH manna

76

The Scheffe' contrasts in Table 17 suggests that the
model effect stems from the classification differences
between the two raw score models and the two Bayesian
models. There are no significant differences between
the three decision approaches. It is notable that the
variances for correct classification by SUBTEST(12)

and BAYESlZ is considerably lower than is the case for
the other four models. This fact is appropriately con-
sidered in consort with change in R2 values between
SUBTEST(6) and SUBTEST(12) in Table 6.

This chapter has summarized the findings of the
three phases of analysis. Initially,.the utility of
the information variables for reducing uncertainty about
domain achievement was reported. Then the parameters
of the decision models were presented. Finally,
the results of the statistical comparisons of the six
models were given. The final chapter of this thesis
discusses the implications of the findings and presents

conclusions which can be drawn from them.

 

77

Table 17. Scheffe' contrast statistics for the model factor

Variance of

 

 

 

 

 

 

 

Contrast (W) Contrast a? T/ag
32' + X X + _
51x TWELVE _ YHAT XYHATP .0022 1.1304
2 2
i + — X + 3?
SIX XTWELVE _ BAYESG BAYESlZ .0022 _ .08483
2 2
i + 2' X + X
BAYESG BAYESlZ _ YHAT YHATP .0022 _1.2153
2 2
__ - _ _ *
XSIX XTWELVE .0272 5.588
szx - XYHAT .0272 -1.081
._ - — _ *
XBAYES6 XBAYESlZ '0272 5°294
g- - g .0272 - .293

SIX BAYESG

* “A =
Wow >.05F1,4 7'71

 

CHAPTER V
INTERPRETATION, CONCLUSIONS, RECOMMENDATIONS

In this chapter the data presented in the previous
chapter are evaluated in terms of the objectives given
in Chapter I. Conclusions based on the evaluation
are also given along with recommendations for practice
and subsequent research.

The first objective of this project was:

to determine the information existent in
four antecedent and collateral variables
relative to domain achievement.

It should be recalled that data are considered
information if and only if it reduces the uncertainty
involved in making a decision. Analysis of the four
information variables suggested that only two truly
yielded information. Sex was unrelated to any of the
variables of the study. TEST, while correlated with
domain achievement, contained no information not present
in TIME. The other variable which contained information
relative to DOMAIN was STEP achievement.

The two significant information variables indi-
cate prior mathematics achievement and learning rate.
The relationship between prior mathematics achievement

and subsequent test performance was certainly expected.
78

79

The fact that learning rate had predictive utility
independent of achievement is of interest. This finding
is consistent with.Carroll's (1963) hypothesis that time is a
central factor in achievement. The findings of this
research suggest that if two pupils have identical

prior achievement and different prior learning rates,
the student with the higher rate will be expected to
score higher on subsequent achievement measures. Thus,
in terms of estimating posterior scores everything

else being equal, quicker students should surpass the
less quick ones. In addition, students with slightly
inferior achievement but higher learning rates should

be expected to catch pupils with higher achievement

Vbut lower learning rates. It seems reasonable to conclude
that in the long run if opportunity and motivation

are equal the advantage will always be with the quicker
student.

The classroom teachers trying to summarize the
useful prior information they possess relevant to subse-
quent achievement should consider both achievement
and rate of learning. Achievement level seems to be
most important; however, rate, being a dynamic variable,
should be considered in terms of the length of time
which has passed since the last appraisal of achievement

level. Gettinger and White (1979) have recently reported

80

an approach to measuring time to learn which would
allow teachers in traditional classroom settings to
easily appraise learning rate. It is recommended that
teachers familiarize themselves with their approach
and apply it routinely.

The procedure is as follows: pupils study stan-
dard materials,which they have not mastered,for a speci-
fied length of time and are then tested. This is repeated
until mastery at some arbitrary upper limit has been
reached. Time to learn is then said to be the number
of trials required. The cited authors had students
follow the process for six types of tasks and set time
to Learn as the mean number of trials needed for mastery.

The second objective of this study aimed to
determine:

1) the incremental validity of short domain
tests,

2) if decision precision can be improved
by using antecedent and collateral data
with test results, and

3) the functional lengths of several short
domain tests.

Incremental validity refers to the extent to
which a multiple correlation is raised by the addition

of test results to a set of prior existing information.

81

Thus the incremental validity of SUBTEST(6), SUBTEST(12),
and SUBTEST(18) is .1478, .3167, and .3272
respectively. The incremental validity of the six
item test is less one quarter of the base information
(assuming no prior information with respect to the
bases). Cronbach and Gleser (1965) have written that
"tests should be judged by the increase in validity
which they offer." In terms of information, as this
study has defined it, the six item test does provide
some. In order to determine if the amount of informa-
tion is meaningful with respect to mastery-nonmastery
decisions, the decision precision based on the prior
information and the prior information combined with
the six items subtest was compared. (It should be
recalled that "decision precision" has previously been
defined as the proportion of correct classifications
made on the basis of a given decision algorithm. The
decision based on the application of the algorithm
to the domain achievement score is the correct one.)
The results of this comparison indicated that
decision precision was not improved by using the six
item test. The implication of this finding is clear.
Test data do not provide decision relevant information
that was not available prior to testing. Thus, while

use of the tests might be justified on instructional

82

grounds, a decision to test with six items is not justi-
fiable as a means of improving decisions about mastery-
nonmastery. This is true regardless of whether the
prior information is incorporated by least squares

or Bayesian approach.

The number of test items necessary to provide
information equivalent to that of the collateral, ante—
cedent and test information already available is referred
to as the functional length of a test. Thus TEST,

TIME and SUBTEST(6) have a functional length of 8.03.

One could use Figure 2 to set a functional length for

the base prior information. The value would be slightly
more than five. It is clear that if one considers

the prior information and then the six item test, the
information value of the test is reduced to that of

about three items. The findings of phase III of the
analysis suggest that this is not a sufficient number

of items to improve decision precision, vis-a-vis mastery-
nonmastery, significantly.

For subtests of 12 items or more the functional
length is the same as the actual length. Thus one
would expect that the decision precision of an algorithm
incorporating prior information would be the same as
one based solely on test score.

To address the final objective of this research,

83

comparisons of the three decision approaches were made.
With respect to decision precision the three approaches
do not differ.
In order to spur insights into the result of
no difference among the approaches it is useful to
compare the approaches in detail. While each of the
models is linear, the least squares approach is not
directly comparable algebraically to the other two.
However, the Bayesian and raw score approach are analogous
and comparison of their algebraic basis is instructive.
In order to do this, one should recall the Kelley
model for estimating true scores. The Kelley model

is

T = pXX, X + (l-pxx,)X

Where pxx' is the proportion of true to observed variance,
X is an observed score and X is the mean of such scores
(T = X). The raw score approach is the specific case
where pxx, = 1 and thus T = X.

The Bayesian Marginal Mean Model has the same
form as Kelley's Model. Like Kelley's approach, it
contains a parameter which is, in part, a function
of score variance. However, this parameter is also
influenced by prior subjective estimates about the

sample in question. Specifically, this prior information

84

is incorporated into the model by specification of
a value for prior information in terms of the number
of test items the information is worth. Table 13 indi-
cates that p* is clearly a function of t. However,
Figures 3 through 7 as well as the results of the Scheffe'
contrasts suggest that the decision about mastery-
nonmastery is not particularly sensitive to t. It
appears that for the purpose of classifications of
mastery or non-mastery, incorporation of prior informa-
tion by means of t has little value. For after the
complex calculations of the Bayesian Model are completed
it functions as the raw score form of Kelley's Model.
For making the kinds of decisions made most
frequently by educators, the raw score model is clearly
indicated because of its simplicity.
The following three points summarize the comparison
of the models.
1. Decision precision was the same for
the six item raw score model and the
least square model containing only ante-
cedent and concommitant information.
2. Decision precision was improved when
12 items were used rather than six.
3. The raw score model is preferred to the

Bayesian model.

85

All of the preceding discussion holds for mastery
levels of .70, .75, .80, .85, and .90. However, across
all models the precision decreases as the mastery level
is increased. This trend does not appear to be uniform
for all models. It seems as though the models containing
the least information decline most in precision. Both
the raw and Bayesian approaches using six items show
the greatest consistent decline.

The finding of this research which seems to have
the greatest utility for current classroom practice is
that selected prior information appropriately weighted,
can be used to yield decisions about subsequent achievement
which are as accurate as decisions based on a six item
test. This fact can be useful as teachers informally
monitor pupils on a day to day or even minute to minute
basis. Assuming that a teacher has prior measures of
achievement and rate of learning is invariant (at least
within a subject and group of pupils) it may be sufficient
to keep track only of students on task behavior to assure
they are progressing. Perhaps students can be taught
that frustration in learning attempts signals a diagnosis
point where they should ask for help. If the teacher
can't easily identify the problem, then a test of suffi—
cient length to diagnose the difficulty is called for.

It may be that the frequent tests called for by current

86

individualized programs are unnecessary. What may be
called for instead is a sound initial placement of instruc-
tional materials and methods based on learning rate.
After this start subsequent testing can be done when
frustration is indicated by off task behavior or iden-
tified by the student.

Such an approach would probably result in some
students taking frequent tests and others taking very
few. It would reduce unnecessary assessment and assure
that when a test was given its purpose would be clear
to both teacher and student. Hopefully, it would allow
tests with sufficient items to assure infrequent errors
in instructional decisions. These suggestsion will need
further investigation.

This research cannot be generalized beyond the
curriculum and grade level of focus. Such extension
would require further research. It is suggested that
efforts be focused on issues related to classroom practice
as discussed in the previous paragraphs rather than the

replication of the present study.

LIST OF REFERENCES

LIST OF REFERENCES

Anscombe, F.J. The Transformation of Poisson, Binomial
and Negative Binomial Data. Biometrika, 1948,
35, 246-254.»

Baker, E.L. Beyond objectives: Domain-referenced tests
for evaluation and instructional improvement.
EDUC. TECH., 1974, 14, 10-16.

Bormuth, J.R. On the Theory 9f Achievement Tests,

Chicago: University of Chicago Press, 1970.

Carroll, J.B. A model for school learning. Teachers
College Record, 1963, 64, 723-733.

Cohen, J. and Cohen, P. Applied Multiple Regression]
Correlation Analysis for the Behavioral Sciences.
Hilldale, New Jersey: Lawrence Erlbaum Associates,
1975.

Cronbach, L.J. and Gleser, G.C. Psychological tests and
Personnel Decisions. University of Illinois
Press, Urbana, 1965.

Cronbach, L.J. Test Validation In R.C. Thorndike (ed.)
Educational Measurement: Second Edition,
Washington, D.C.: American Council on Education,
1971.

Draper, N.R. and Smith, H. Applied Regression Analysis.
New York: John Wiley & Sons, Inc., 1966.

Ebel, R.L. Criterion referenced measurements: limita-
tions. School Review, 1971, 62, 282-288.

Fhaner, S. Item sampling and decision making in achieve-
ment testing. British Journal 2f Statistical
Psychology, 1974, 21, 172-176.

 

Gettinger, M. and White, M.A. Which is the StrongeSt
correlates of school learning? time to learn or
measured intelligence: Journal 9f Educational
Psychology, 1979, 11, 405-412.

 

87

88

Glaser, R., and Nitko, A.J. Measurement in learning
and instruction. In R.L. Thorndike (ed.) Educa-
tional Measurement. Washington: American Council

on Education, 1971.

 

Greenhouse, S.W. and Geisser, S. On methods in the
,analysis of profile data. Psychometrika, 1959,

23, 95—112.

Harris, C.W., Alkin, M.C. and Popham, W.J. Problems
in Criterion-referenced Measurement, CSE monograph
series in evaluation. No. 3. Los Angeles: Center
for the Study of Evaluation, University of
California, 1974.

Harris, M.L. and Steward, D.M. Application of classical
strategies to criterion referenced test construc-
tion. A paper presented at the Annual Meeting of
the American Educational Research Association, 1971.

Hambleton, R.R. Testing and decision making procedures
for selected individualized instructional programs.
Review gf Educational Research, 1974, 44,’37l-400.

Hambleton, R.R., Novick, M.R. Toward an integration of
theory and method for criterion-referenced tests.
Journal of Educational Measurement, 1973, 12,

1593170."’

Hambleton, R.R., Swaminathan, H., Algina, J., and Coulson,
D.B. Criterion-referenced testing and neasurement:
A review of technical issues and developments.
Review 9f Educational Research, 1978, 48, 1-47.

Hively, W., Patherson, H.L., and Page, S.A. A "universe-
defined" system of arithmetic achievement tests.
Journal of Educational Measurement, 1968, 5,

275-290._‘

Ivens, S.H. An investigation of item analysis, reliability
and validity in relation to criterion-referenced
tests. Unpublished Doctoral Dissertation.

Florida State University, 1972.

Jackson, P.H. Simple approximations in the estimation
of many.parameters. British Journal 2f Mathematical
and Statistical Psychology, 1972, 25, 213-229.

 

 

89

Kelley, T.L. Interpretation 9f Educational Measurements.
Yonkers on Hudson, New York: World Book, 1927.

Kerlinger, F.N. and Pedhazur, E.J. Multiple Regression
in Behavioral Research. New York: Holt, Rinehart

and Winston, Inc., 1973.

Lewis, C., Wang, M.M. and Novick, M.R. Marginal distri-
butions for the estimation of prOportions in
m groups. ACT Technical Bulletin No. 13. Iowa
City, Iowa: The American College Testing Program,
1973.

Lord, F.M. and Novick, M.R. Statistical Theories
2f Mental Test Scores. Reading, Mass.: Addison-
Wesley, 1968.

 

Mager, R.F._ Preparing Instructional Objectives. Palo
Alto, California: Pearson Publishers, Inc., 1962.

Millman, J. Determining test length: Passing scores
and test lengths for objective-based tests.
Instructional Objectives Exchange, Los Angeles,
California, 1972.

Millman, J. Criterion-referenced measurement. In
W.J. POpham (ed.) Evaluation in Education: Current
Applications. Berkeley, California: McCutchan
Publishing Co., 1974.

 

 

Millman, J. Passing scores and test lengths for domain-
referenced measures. Review prEducational
Research, 1973, 43, 205-216.

Novick, M.R. and Jackson, P.H. Statistical Methods for
Educational and Psychological Research. New York:
McGraw-Hill, 1974.

Novick, M.R., Lewis, C. and Jackson, P.H. The estimation
of proportions for M groups. Psychometrika,
1973, 38, 19—45.

Novick, M.R. and Lewis, C. Prescribing test length for
criterion-referenced measurement. In C.W. Harris,
M.C. Alkin.and W.J. Popham (eds.) Problems in
Criterion-Referenced Measurement. Monograph
Series in Evaluation No. 3. Los Angeles, Center
for the Study of Evaluation, University of
California, 1974.

9O

Osborn, H.G. Item Sampling for.achievement testing.
pg. and Psych. News, 1968, 28, 95-104.

 

Popham, W.J. Educational Evaluation, Englewood Cliffs,
New Jersey: Prentice Hall, 1975.

 

Rozeboom, W.W. Foundations 9f the Theory 9f Prediction.
Homewood, Illinois: The Dorsey Press, 1966.

 

 

 

Sechrest, L. Incremental validity: A recommendation.
Educational and Psychological Measurement, 1963,
33, 153—158.

 

 

Swaminathan, H., Hambleton, R.R., Algina, J.. A Bayesian
decision-theoretic procedure for use with criterion-
referenced tests. Journal 9f Educational
Measurement, 1975, 12, 87-98.

 

 

Wang, M.M. Tables of constants for the posterior
marginal estimates of proportions in M groups.
ACT Technical Bulletin No. 14. Iowa City, Iowa:
The American College Testing Program, 1973.

LIST OF NOTES

91

LIST OF NOTES

1. Hambleton, R.R., Swaminathan, H., Algina J., and
Coulson, D. Criterion Referenced Testing and
Measurement: A Review E: Technical Issues and
Developments. Unpublished Manuscript, University
of Massachusetts, 1975.