This is to certify that the

thesis entitled

RESIDUAL GAIN SCORES AS A
CRITERION FOR CHANGE:
INFERENTIAL PROBLEMS

presented by

Khalil Elaian

has been accepted towards fulfillment
of the requirements for

Ph.D. degree in Educational Statistics and Research Design; Department
of Counseling, Educational Psychology and Special Education

   

Major professor

Date //2§/X?
MSU is an Affirmative Action/Equal Opportunity Institution

 

 


 

 

RESIDUAL GAIN SCORES AS
A CRITERION FOR CHANGE:

INFERENTIAL PROBLEMS

by

Khalil Elaian

A DISSERTATION

submitted to
Michigan State University
in partial fulfillment of the requirements for the degree of

DOCTOR OF PHILOSOPHY

Department of Counseling, Educational Psychology,
and Special Education

1984

COPYRIGHT BY

KHALIL ELAIAN
1984

ABSTRACT

RESIDUAL GAIN SCORES AS A CRITERION FOR CHANGE:
INFERENTIAL PROBLEMS
by

Khalil Elaian

To determine the effect of teacher behaviors, W, in process-product
research, residual gain scores, Z, are often used as the criterion. Significant
correlations between class means of residualized gain scores and teacher
behaviors, the $r_{Z_iW}$'s, are taken as evidence of teacher effects. The purposes
of the study were to determine the conditions under which testing
$H_0: \rho_{ZW} = 0$ is, in fact, equivalent to testing for no teacher behavior
effect, and also to investigate the appropriateness of using different
definitions of residual gain scores in testing the null hypothesis. Five
different forms of residualized gains were considered, based on the total
regression coefficient, $Z_1$; the between regression coefficient, $Z_2$; the
within regression coefficient, $Z_3$; a newly derived estimate of the regression
of posttest class effects on pretest class effects, $Z_4$; and finally the
parameter for the class effects regression coefficient, $Z_5$.

A linear structural model was built to determine the conditions under
which testing $\rho_{ZW} = 0$ is equivalent to testing no teacher behavior effect on
student achievement. The analytic results showed that the two null hypotheses
are equivalent if either of the following conditions is met: (a) there is no
initial confounding of teacher behavior and class composition, or (b) the slope
of posttest class effect on pretest class effect, $\beta_1$, is equal to the slope
of posttest on pretest within classes, $\beta_2$, and the pretest is perfectly
reliable. When the conditions are not met, however, the two null hypotheses are
equivalent only for $Z_4$ and $Z_5$.

A Monte Carlo approach was taken to investigate the appropriateness of
using different $r_{Z_iW}$'s in testing the hypothesis of no teacher behavior
effect. Three criteria were considered: (a) the mean estimates of the
$\rho_{Z_iW}$'s, (b) empirical Type I error rates, and (c) empirical power.
Parameters varied in the study were the degree of initial confounding, the
reliability of the pretest, the number of classrooms, and the number of
students in a classroom.

The results of the study showed that when there was a substantial amount
of initial confounding, the test statistics using $r_{Z_1W}$, $r_{Z_2W}$, and
$r_{Z_3W}$ were only valid in a few situations. These tests, particularly the
tests using $r_{Z_1W}$ and $r_{Z_3W}$, tended to be too liberal in situations
where $\beta_1 = \beta_2$ or $\beta_1 > \beta_2$ and too conservative when
$\beta_1 < \beta_2$. Parallel results for the tests using $r_{Z_1W}$ and
$r_{Z_3W}$ were obtained with increasing sample size. However, the test
statistics using $r_{Z_4W}$ and $r_{Z_5W}$ were the only tests which remained
valid as initial confounding, sample size, and class size increased and in the
presence of errors of measurement. Also, the results indicated that increasing
sample and class size increased the empirical power of both $r_{Z_4W}$ and
$r_{Z_5W}$ in situations where $\beta_1 = \beta_2$ or $\beta_1 > \beta_2$.

It was concluded that procedures used by process-product researchers in
forming residual gain scores typically provide misleading results. Sometimes the
test statistics used are too liberal and other times they are too conservative.
Therefore, it is recommended that process-product researchers who wish to test
for no teacher behavior effect use $Z_4$. In addition to yielding valid Type I
error rates across all conditions investigated, the procedure had reasonable
power and does not have the unrealistic requirement of knowing the value of a
parameter a priori.

DEDICATION

To my wife,
Nasrin Bakir,
and my son,
Rami Elaian.

ACKNOWLEDGEMENTS

I am deeply grateful to Professor Andrew C. Porter, my academic advisor
and dissertation chairperson, for his generous and endless advice,
encouragement, support, and patience. I would like to thank Professor Richard
Houang for serving as a member of my dissertation committee and for the
generous help he has given me with the dissertation. I would also like to thank
Professors Robert Floden and James Stapleton for serving on my committee.

Working in the Office of Research Consultation (ORC) has provided me
with invaluable experience. Many thanks to Professor Joe L. Byers, who hired me
in the ORC.

I would further like to thank the University of Jordan for four years of
financial support.

I wish to acknowledge the moral support of my parents, wife, and friends.

Finally, I wish to thank Barbara Reeves for her typing of the final
manuscript.

TABLE OF CONTENTS

List of Tables
List of Figures

CHAPTER I: STATEMENT OF THE PROBLEM
Definition of Residual Gain Scores
Research Questions

CHAPTER II: REVIEW OF THE LITERATURE
Alternative Indices of Responses
Use of Indices of Responses

CHAPTER III: THE ANALYTIC CHAPTER
A Linear Structural Model for Process/Product Research
Defining Values of K
Relationships Between Regression Coefficients and the Structural Model
Conditions Under Which $\gamma\,((\beta_1 - \beta_2)\,\sigma^2_{V_0} + \beta_1\,\sigma^2_{e_0}) = 0$
Distributions of Test Statistics

CHAPTER IV: SIMULATION PROCEDURE
Simulation Parameters
Data Generation Routine

CHAPTER V: RESULTS OF THE EMPIRICAL INVESTIGATION
Mean Estimates of $\rho_{ZW}$ when $\beta_3 = 0$
Initial Confounding Effects
Effects of Presence of Errors of Measurement ($\rho_{y_0y_0} \neq 1$)
Sample and Class Size Effect
Empirical Type I Errors for One and Two Tailed T-Test When Testing $H_0: \rho_{ZW} = 0$
Initial Confounding Effect
Empirical Type I Errors of Test Statistics when the Premeasure Contains Errors of Measurement
Sample Size Effect
Class Size Effect
Empirical Power
Effects of Initial Confounding
Effects of Presence of Errors of Measurement in the Premeasure
Sample Size Effect
Class Size Effect

CHAPTER VI: SUMMARY AND CONCLUSIONS

Bibliography

LIST OF TABLES

1. The Total Variance-Covariance Matrix ($\Sigma$)

2. Relationships of the Regression Coefficients and $\rho_{ZW}$ to the
Structural Coefficients

3. Design of the Study

4. Between ($\Sigma_B$), Within ($\Sigma_W$) and Errors of Measurement
($\Sigma_e$) Variance-Covariance Matrices

5. Means of Empirical Sampling Distributions of $\rho_{Z_iW}$'s for Different
Combinations of $\beta_1$, $\beta_2$ and $\gamma$ for c = 30, s = 20,
$\rho_{y_0y_0} = .8$ and $\beta_3 = 0.00$

6. Means of Empirical Sampling Distributions of $\rho_{Z_iW}$'s for
$\rho_{y_0y_0} = .8$

7. Effects of Sample Size on Mean Estimates of $\rho_{Z_iW}$'s for
$\gamma = .2$

8. Effects of Class Size on Mean Estimates of $\rho_{Z_iW}$'s for
$\gamma = .2$, $\rho_{y_0y_0} = .8$, c = 30, and $\beta_3 = 0$

9. Effects of Initial Confounding on Empirical Type I Errors for the
One-Tailed Tests of $\rho_{Z_iW}$'s where c = 30, s = 20 and
$\rho_{y_0y_0} = .8$

10. Effects of Initial Confounding on Empirical Type I Errors for the
Two-Tailed Tests of $\rho_{Z_iW}$'s where c = 30, s = 20 and
$\rho_{y_0y_0} = .8$

11. Effects of the Presence of Errors of Measurement in the Premeasure on the
Empirical Type I Errors for One-Tailed Tests of $\rho_{Z_iW}$'s where c = 30,
s = 20 and $\gamma = .2$

12. Effects of the Presence of Errors of Measurement in the Premeasure on the
Empirical Type I Errors for Two-Tailed Tests of $\rho_{Z_iW}$'s where c = 30,
s = 20 and $\gamma = .2$

13. Effects of Sample Size on Empirical Type I Errors for One-Tailed Tests of
$\rho_{Z_iW}$'s where s = 20, $\gamma = .2$ and $\rho_{y_0y_0} = .8$

14. Effects of Sample Size on Empirical Type I Errors for Two-Tailed Tests of
$\rho_{Z_iW}$'s where s = 20, $\gamma = .2$ and $\rho_{y_0y_0} = .8$

15. Effects of Class Size on Empirical Type I Errors for One-Tailed Tests of
$\rho_{Z_iW}$'s where c = 30, $\gamma = .2$ and $\rho_{y_0y_0} = .8$

16. Effects of Class Size on Empirical Type I Errors for Two-Tailed Tests of
$\rho_{Z_iW}$'s where c = 30, $\gamma = .2$ and $\rho_{y_0y_0} = .8$

17. Effects of Initial Confounding on Empirical Power for One-Tailed Tests of
$\rho_{Z_iW}$'s = 0 where c = 30, s = 20, $\beta_3 = .1$ and
$\rho_{y_0y_0} = .8$

18. Effects of Initial Confounding on Empirical Power for Two-Tailed Tests of
$\rho_{Z_iW}$'s = 0 where c = 30, s = 20, $\beta_3 = .1$ and
$\rho_{y_0y_0} = .8$

19. Effects of Errors of Measurement in the Premeasure on Empirical Power for
the One-Tailed Tests of $\rho_{Z_iW}$'s = 0 where c = 30, s = 20,
$\gamma = .2$, and $\beta_3 = .1$

20. Effects of Errors of Measurement in the Premeasure on Empirical Power for
the Two-Tailed Tests of $\rho_{Z_iW}$'s = 0 where c = 30, s = 20,
$\gamma = .2$, and $\beta_3 = .1$

21. Effects of Sample Size on Empirical Power for One-Tailed Tests of
$\rho_{Z_iW}$'s where s = 20, $\gamma = .2$, $\beta_3 = .1$ and
$\rho_{y_0y_0} = .8$

22. Effects of Sample Size on Empirical Power for Two-Tailed Tests of
$\rho_{Z_iW}$'s where s = 20, $\gamma = .2$, $\beta_3 = .1$ and
$\rho_{y_0y_0} = .8$

23. Effects of Class Size on Empirical Power for One-Tailed Tests of
$\rho_{Z_iW}$'s where c = 30, $\gamma = .2$, $\beta_3 = .1$ and
$\rho_{y_0y_0} = .8$

24. Effects of Class Size on Empirical Power for Two-Tailed Tests of
$\rho_{Z_iW}$'s where c = 30, $\gamma = .2$, $\beta_3 = .1$ and
$\rho_{y_0y_0} = .8$

LIST OF FIGURES

1. A structural model

CHAPTER I

STATEMENT OF THE PROBLEM

Residual gain scores (RGS) are often used as a criterion for measuring
change in educational research. For example, in reference to evaluating teacher
effectiveness, Veldman and Brophy (1974) state "it is generally accepted that
residual gain scores are superior to simple pretest-posttest difference scores as
measures of teacher influence" (p. 321). Process/product research on teaching
can be used to illustrate the practice of setting residual gain scores as a
criterion for study (Gage, 1977). Process refers to teaching behavior and product
refers to student learning. Residual gain scores are used as a product variable
which is meant to control for initial differences among classrooms in their
compositions of students. The residual gain scores are typically constructed
from student pre- and post-instruction achievement test scores.

Brophy and Evertson's (1974) two-year replicated study conducted at the
University of Texas provides a specific example of using residualized gain scores
as the criterion in process/product research. Thirty teachers were included in
the first year, and 28 in the second year. Classroom observations were made to
assess teacher behavior. Scores on five subtests of the Metropolitan
Achievement Test (MAT) were available for each student. The MAT obtained in
the first year was used as the pretest and the MAT for the second year as the
posttest. For each student, predicted values of the posttest scores were
determined from the pretest scores based on the total sample regression line.
Residual gain scores were computed by subtracting predicted posttest scores
from the actual posttest scores.

To determine the effects of teacher behaviors, Pearson's product-moment
correlation coefficients between process variables and average residual gain
scores, aggregated by teacher, were obtained. The sample correlations were
tested for significance by t-tests, with c - 2 degrees of freedom, where c refers
to the number of teachers. The null hypothesis that Brophy and Evertson
intended to test was that teacher behavior had no effect on student
achievement. The aim of the present study is to investigate the appropriateness
of using residualized gain scores in order to determine the process/product
relationship.
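
The steps of this procedure can be sketched in code. What follows is a minimal illustration under stated assumptions, not Brophy and Evertson's actual computation: the function names are hypothetical, the data are randomly generated toy values, and the t statistic is the standard one for a correlation with c - 2 degrees of freedom.

```python
import math
import random

def mean(values):
    return sum(values) / len(values)

def slope(x, y):
    """Least-squares slope of y on x."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx

def pearson_r(x, y):
    """Pearson product-moment correlation."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def class_mean_residual_gains(pre, post, class_ids):
    """Residualize posttest on pretest with the total-sample regression
    line, then average the residuals within each class."""
    b = slope(pre, post)
    a = mean(post) - b * mean(pre)
    by_class = {}
    for y0, yt, cid in zip(pre, post, class_ids):
        by_class.setdefault(cid, []).append(yt - (a + b * y0))
    return {cid: mean(z) for cid, z in sorted(by_class.items())}

def t_for_correlation(r, c):
    """t statistic for a sample correlation, c - 2 degrees of freedom."""
    return r * math.sqrt((c - 2) / (1.0 - r * r))

# Hypothetical toy data: 6 classes of 5 students each.
rng = random.Random(1)
class_ids, pre, post = [], [], []
for cid in range(6):
    for _ in range(5):
        y0 = rng.gauss(50, 10)
        class_ids.append(cid)
        pre.append(y0)
        post.append(0.8 * y0 + rng.gauss(12, 5))
behavior = {cid: rng.gauss(0, 1) for cid in range(6)}  # hypothetical W per teacher

z_bar = class_mean_residual_gains(pre, post, class_ids)
r_zw = pearson_r([z_bar[c] for c in z_bar], [behavior[c] for c in z_bar])
t_stat = t_for_correlation(r_zw, len(z_bar))
```

Note that the adjustment slope here is estimated from the same data that are then correlated with W, which is the practice whose validity the study examines.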

Definition of Residual Gain Scores

 

Consider the model used in forming residual gain scores:

(1) $Z = Y(t) - K\,Y(0)$

where Y(0) is the measure at time 0,
Y(t) is the measure at time t, and
K is an adjustment coefficient.

As described previously, Z, the constructed residual gain score aggregated by
teacher, is correlated with a measure of teacher behavior, W. Let $r_{ZW}$ denote
the sample correlation coefficient between Z and W, and $\rho_{ZW}$ be the
corresponding parameter. For a test of $H_0: \rho_{ZW} = 0$ to be appropriate, not
only must the variables, Z and W, be linearly related but, in addition,
$\rho_{ZW}$ must be only a function of change in achievement caused by W. If
either one of the above conditions is false, a test of $H_0: \rho_{ZW} = 0$ will
lead to spurious conclusions.

As Z is constructed from equation 1, the appropriateness of $\rho_{ZW} = 0$ for a
null hypothesis depends upon the choice of K, the adjustment coefficient. While
K is assumed to be a known constant, in practice, this is seldom the case.
Usually K is estimated from the relationship between Y(0) and Y(t), in terms of
a regression coefficient. Because the nature of the data on student performance
is hierarchical (i.e., students are nested within classrooms), three regression
coefficients are available: the between classroom regression coefficient, the
within classrooms regression coefficient, and the total regression coefficient.
For most educational data, these three regression coefficients are not
interchangeable. Further, it will be shown that in some situations, none of the
three coefficients estimates an appropriate correction parameter.
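
To make the distinction among the three coefficients concrete, the sketch below computes each one from the usual partition of sums of squares and cross-products into between-class and pooled within-class parts. The function names and the six-student data set are hypothetical and serve only as an illustration.

```python
def mean(values):
    return sum(values) / len(values)

def sums_of_squares_and_products(pre, post, class_ids):
    """Return (Sxx, Sxy) for the total, between-class, and pooled
    within-class parts, where x is the pretest and y the posttest."""
    gx, gy = mean(pre), mean(post)
    groups = {}
    for x, y, cid in zip(pre, post, class_ids):
        groups.setdefault(cid, []).append((x, y))
    total = [sum((x - gx) ** 2 for x in pre),
             sum((x - gx) * (y - gy) for x, y in zip(pre, post))]
    between, within = [0.0, 0.0], [0.0, 0.0]
    for members in groups.values():
        n = len(members)
        cx = mean([x for x, _ in members])
        cy = mean([y for _, y in members])
        between[0] += n * (cx - gx) ** 2
        between[1] += n * (cx - gx) * (cy - gy)
        for x, y in members:
            within[0] += (x - cx) ** 2
            within[1] += (x - cx) * (y - cy)
    return {"total": total, "between": between, "within": within}

def three_slopes(pre, post, class_ids):
    """Total, between-class, and within-class slopes of post on pre."""
    parts = sums_of_squares_and_products(pre, post, class_ids)
    return {name: sxy / sxx for name, (sxx, sxy) in parts.items()}

# Hypothetical data: two classes of three students.
pre = [1, 2, 3, 4, 5, 6]
post = [2, 3, 4, 3, 4, 5]
class_ids = ["A", "A", "A", "B", "B", "B"]
slopes = three_slopes(pre, post, class_ids)
# within = 1.0, between = 1/3, total = 8.5 / 17.5 (about 0.486)
```

Even in this tiny example the three slopes differ, so the residual gain score Z depends on which coefficient is substituted for K.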

To further complicate matters, the sampling distribution of $r_{ZW}$ will be a
function of the estimator for K. Unfortunately, the nature of the sampling
distribution of $r_{ZW}$ is unknown (at least for most situations), and the use of
the t-distribution to test $H_0: \rho_{ZW} = 0$, as in Brophy and Evertson's study,
may not be valid even when the sample regression coefficient estimates an
appropriate correction parameter (Draper & Smith, 1981).

Research Questions

 

The intent of the present study was to investigate the appropriateness of
using a t-test to test $\rho_{ZW} = 0$ as evidence for no teacher behavior effect on
student achievement. More specifically, the following research questions were
investigated.

1. What are the conditions under which testing $\rho_{Z_iW} = 0$ is
equivalent to testing no teacher behavior effect for different
methods of defining Z?

2. Given no teacher behavior effect, how well does the distribution
of the "t" statistic based on each of several different methods of
defining Z approximate the theoretical t-distribution for varying
amounts of (a) initial confounding, (b) presence of errors of
measurement in the premeasure, (c) number of classrooms, and
(d) class size?

The investigation was conducted in two steps. First, the conditions under
which $\rho_{ZW} = 0$ if and only if there is no effect of teacher behavior on
student achievement were determined analytically. Second, a simulation study was
conducted to investigate empirically the distribution of "t" statistics using
different methods of testing $\rho_{ZW}$.

In Chapter II, relevant literature will be reviewed. In order to examine the
situation thoroughly, a structural model is introduced in Chapter III. Chapter IV
presents the design of the simulation study. The results obtained from the
empirical study and the conclusions reached are presented in Chapters V and VI.

CHAPTER II

REVIEW OF THE LITERATURE

In experimental research the experimenter manipulates variables of
interest and observes the manner in which the manipulation affects the variation
of the dependent variables. In order to be reasonably sure that the observed
variation in the dependent variable is indeed due to the manipulated variables,
the experimenter must control all possible confounding variables. Porter and
Chibucos (1974) suggested two categories for these possible confounding
variables in the context of program evaluation:

1. Systematic differences in the dependent variable dimensions that
are present in the units of analysis at the outset of program
participation.

2. Systematic differences that occur in the dependent variable
dimensions during program participation which are not a function of
program participation. (p. 440).

While randomization is one of the most powerful methods to control
confounding variables of the first category, it does not ensure control of
confounding variables of the second category. To the extent both categories of
possible confounding variables are controlled, arguments for causal relationships
between independent and dependent variables are strengthened.

Studies of natural variation are also used to investigate the possibility of
causal relationships among independent and dependent variables. As was the
case for experimental research, the investigator must be concerned about both
types of confounding variables. In studies of natural variation, however,
randomization is by definition not a part of the design and so other methods must
be employed to guard against confounding. One general method, which has
enjoyed considerable use, involves the formulation of an index of response, of
which residualized gain scores, the focus of this study, represent a specific
type.

Alternative Indices of Responses

 

The index of response is defined by $Z_{ij} = Y(t)_{ij} - K\,Y(0)_{ij}$, where
Y(0) and Y(t) are pre and post measures for the ith individual in the jth group
and K is some known constant. In addition to requiring scores on the measure of
interest at two points in time to formulate Z, K has to be set to an a priori
known value. However, the value K should take depends on knowledge regarding the
natural growth model which adequately describes the data if there were no
effects of the independent variable.

The most commonly used index of response is raw gain scores, where K is
set to unity,

$D_{ij} = Y(t)_{ij} - Y(0)_{ij}$

where $D_{ij}$ is the raw gain for the ith individual in the jth group.

In other words, raw gain scores are created by taking the difference of the post
measure and premeasure scores on the dependent variable dimension.

Raw gain scores as a measure of individual change have been criticized in
the literature for having low reliability and for correlating negatively with
premeasure scores (Cronbach & Furby, 1970; Linn & Slinde, 1977; Lord, 1963).
Cronbach and Furby have also questioned the use of raw gain as a strategy to
measure group change in studies of natural variation, agreeing with Lord (1967)
that gain scores are an inappropriate strategy to control for confounding
variables in natural variation studies of causal relationships. In contrast, Porter
(1973) has suggested that under certain assumptions gain scores may provide the
best technique for natural variation studies. Porter argued that, given that
treatment effects are additive, that the pre and posttest measure the same
variable in a common metric, and that there is no change in variances from
pretest to posttest, it can be shown that the gain score strategy does provide a
reasonable approach to data analysis in natural variation studies. Bryk and
Weisberg (1977) showed that under natural growth (i.e., no treatment effect) this
gain score strategy provides an unbiased estimate of the treatment effect if and
only if the group growth patterns are parallel (which is equivalent to Porter's
assumptions).

Standardized gain scores represent yet another form of index of response
that has been used to analyze data from studies of natural variation. K in the
index of response is set to one of $\sigma_{Y_t}/\sigma_{Y_0}$,
$\sigma_{T_{Y_t}}/\sigma_{T_{Y_0}}$, or $s_{Y_t}/s_{Y_0}$, where $\sigma^2_{Y_t}$
and $\sigma^2_{Y_0}$ are the population variances of the dependent variable
dimension at posttest and pretest, $\sigma^2_{T_{Y_t}}$ and $\sigma^2_{T_{Y_0}}$
are the population variances for the true variables, and $s_{Y_t}$ and $s_{Y_0}$
are the sample estimates of $\sigma_{Y_t}$ and $\sigma_{Y_0}$.

Using ANOVA of standardized gain scores as a strategy to control initial
confounding was introduced by Kenny (1975). Even though Kenny did not
distinguish between the different types of standardized gain scores, he argued
that when individuals were assigned to a program based on sociological or
demographic variables, standardized gain scores provide the best index of
response for controlling initial confounding. Olejnik and Porter (1981) clarified
Kenny's recommendations by showing that the validity of standardized gain
scores is dependent upon the model of natural growth that applies in the absence
of treatment effects. They also pointed out that the two alternative ratios of
population standard deviations are equivalent if the reliabilities of the pretest
and posttest are equal. Finally, and perhaps most importantly, they pointed out
that using a ratio of sample standard deviations followed by ANOVA is an
incorrect procedure that results in misleadingly small standard errors.

Residual gain scores are yet another form of index of response that has
been used in studies of natural variation. Three different types of residual gain
scores appear in the literature on measuring change. The first, called true
residual gain scores, is defined by setting K in the index of response to the
slope of the true posttest on the true pretest. True residual gain scores were
suggested by Tucker et al. (1966) and called a "base free measure of change."
The second, called observed residual gain scores, Z, sets $K = \beta_{Y_tY_0}$
(i.e., the slope of the manifest variables). The third, called estimated residual
gain scores, $\hat{Z}$, sets $K = \hat{\beta}_{Y_tY_0}$, where
$\hat{\beta}_{Y_tY_0}$ is the sample estimate of the slope of $Y_t$ on $Y_0$.
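
The indices reviewed so far differ only in the value substituted for K. The sketch below, with hypothetical function names and made-up scores, forms the raw gain (K = 1), a standardized gain using the ratio of sample standard deviations, and an estimated residual gain using the sample slope. It follows the index of response exactly as defined above, without centering.

```python
import statistics

def index_of_response(pre, post, k):
    """Z = Y(t) - K * Y(0) for each individual."""
    return [yt - k * y0 for y0, yt in zip(pre, post)]

def sample_slope(pre, post):
    """Least-squares slope of the posttest on the pretest."""
    mx, my = statistics.mean(pre), statistics.mean(post)
    sxy = sum((x - mx) * (y - my) for x, y in zip(pre, post))
    sxx = sum((x - mx) ** 2 for x in pre)
    return sxy / sxx

# Hypothetical scores for four individuals.
pre = [1.0, 2.0, 3.0, 4.0]
post = [2.0, 3.0, 5.0, 4.0]

raw_gain = index_of_response(pre, post, 1.0)             # K = 1

k_std = statistics.stdev(post) / statistics.stdev(pre)   # K = s_Yt / s_Y0
standardized_gain = index_of_response(pre, post, k_std)

k_hat = sample_slope(pre, post)                          # K = estimated slope
estimated_residual_gain = index_of_response(pre, post, k_hat)
```

In the last two cases K is a sample estimate rather than a known constant, which is the source of the standard error problems noted by Olejnik and Porter (1981).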

Residual gain scores as measures of individual change have been
characterized in the literature as uncorrelated with initial status but suffering
from low reliability (Kessler, 1977; Linn & Slinde, 1977). Using ANOVA on
observed residual gain scores as an analysis strategy in natural variation studies
is comparable to using analysis of covariance. The only difference between the
two procedures is that ANCOVA estimates the value of K from the data while
ANOVA on the observed residual gain scores requires that K be set a priori to
$\beta_{Y_tY_0}$.

ANOVA on true residual gain scores is parallel to estimated true scores
analysis of covariance, originally developed by Porter (1967). Again the
distinction is that the true residual gain score approach requires that a
population slope be known a priori while estimated true score ANCOVA estimates
that slope from the data. Performing ANOVA on the estimated residual gain
scores raises at least two problems. First, the expected value of the estimated
residual gain score is unknown (Draper & Smith, 1981), making it difficult to
determine whether the strategy provides unbiased estimates of the causal
relationship of interest. Second, the procedure suffers from the same bias of
standard errors that Olejnik and Porter (1981) noted for standardized gain scores
using sample standard deviations.

Uses of Indices of Responses

 

Using an index of response in lieu of randomization in natural variation
studies has been a controversial topic. Perhaps the best known antagonist of
their use is Lord (1967, 1969), who has stated "with the data usually available for
such studies, there is simply no logical or statistical procedure that can be
counted on to make proper allowance for uncontrolled preexisting differences
between groups" (Lord, 1967, p. 35). More recently Cronbach and Furby (1970)
have indicated basic agreement with Lord's pessimistic view of the utility of
using statistical adjustment in natural variation studies. On the other hand,
Elashoff (1969), Hornquist (1968), and Porter and Chibucos (1974) hold a more
optimistic view. Hornquist (1968) has stated

    Even if the initial standing of the subjects is controlled by means
    of a number of relevant variables, there will always be room for
    uncontrolled differences that may be important. The
    investigator, who because of the nature of his problem cannot
    use random or systematic assignments of subjects to treatments,
    has to live with an insecurity in that respect . . . and try to
    behave intelligently within the limitations of his design . . . or
    leave the scene of non-experimental research (p. 57).

Porter (1973) has stated ". . . the interpretation of results from designs lacking
random assignment requires yet another degree of tentativeness above and
beyond what would have been required had random assignment been employed"
(p. 41).

Research on teacher effectiveness is one of the areas in which residual
gain scores have been used most heavily. For some researchers (e.g.,
Rosenshine, 1970) residual gain scores are considered the definition of teacher
effectiveness and so the logical dependent variable in studies to identify
desirable teacher behaviors. Known as process/product research (Dunkin &
Biddle, 1974), studies of effective teacher behavior obtain pre and posttests of
student achievement to form the dependent variable and observations of
teachers to form independent variables. The residual gain scores are computed
for each student and then aggregated to the classroom/teacher level. The
correlations of class means on residual gain scores and teacher behaviors are
computed and tested for significance. Significant correlations are taken as
evidence that teacher behavior affects student achievement. Examples of
process/product research using residualized gain scores to control for
confounding variables are Brophy and Evertson, 1974; Creemer, 1974; Creemer and
Weeda, 1974; Soar, 1966; and Veldman and Brophy, 1974. In all of these studies
$\beta_{Y_tY_0}$ was unknown and so was estimated to define the "constant" in the
residualized gain scores. The researchers, however, ignored this distinction when
conducting their tests of significance of the correlation between teacher
behavior and residualized gains. That a test statistic using $r_{ZW}$ is
appropriate to test $H_0: \rho_{ZW} = 0$ does not necessarily imply that the
parallel test statistic using $r_{\hat{Z}W}$ is also a valid test of
$\rho_{ZW} = 0$.

Testing $\rho_{Z_iW} = 0$ as a test for no teacher behavior effect was
investigated in the present study. The investigation was in two parts, analytic
and empirical. The analytic part was conducted to determine the conditions under
which testing $\rho_{Z_iW} = 0$ is equivalent to testing no effect of a teacher
behavior on student achievement. The investigation considered several different
possible formulations of Z. The empirical investigation was conducted to
investigate the appropriateness of a "t" test statistic to test
$H_0: \rho_{Z_iW} = 0$ when sample estimates rather than population parameters
were used to define the residualized gain scores. A Monte Carlo method was used
to simulate the sampling distributions of the different test statistics based on
different formulations of residualized gain scores. These were then compared to
the theoretical reference distributions to determine the validity of each test
statistic under study.

 

CHAPTER III

THE ANALYTIC CHAPTER

In this chapter, a linear structural model that defines the problem of
measuring change in studies of process/product research will be presented. The
model incorporates the aggregated characteristics of the data and the possibility
of measurement errors. Given the model, the conditions under which
$\rho_{ZW} = 0$ is equivalent to no teacher behavior effect on achievement will be
identified.

A Linear Structural Model for Process/Product Research

 

As in equation 1, residual gain scores are constructed from Y(0) and Y(t),
the pre- and post-measures of student achievement. The proposed structural
model attempts to elucidate the relationships among Y(0), Y(t), and W, a variable
representing teacher behavior.

For student i in class j, the observed score $Y(L)_{ij}$ can be decomposed into:

(2) $Y(L)_{ij} = \eta(L)_{ij} + e(L)_{ij}$, L = 0, t

where $\eta(L)_{ij}$ is the part of $Y(L)_{ij}$ which is free from errors of
measurement, and $e(L)_{ij}$ represents measurement error. The $\eta(L)_{ij}$ is
further decomposed into two components: the class effect and the deviation of
the student's score from his class mean,

(3) $\eta(L)_{ij} = A(L)_j + V(L)_{ij}$, L = 0, t

where $A(L)_j$ is the class effect at time L, and $V(L)_{ij}$ represents the
deviation of the ith student's score from the mean of the jth class. Combining
the two equations, $Y(L)_{ij}$ can be written as

(4) $Y(L)_{ij} = A(L)_j + V(L)_{ij} + e(L)_{ij}$, L = 0, t

The measure of teacher behavior can also be decomposed into

(5) $W_j = \xi_j + e_{\xi j}$,

where $\xi_j$ is the true measure of the behavior of teacher j assigned to class j
and $e_{\xi j}$ represents measurement error.
I
Schematically, the structural relationships among the three variables are
shown in Figure 1.

Figure 1. A structural model.

The $\beta$'s are the structural coefficients; $\gamma$ represents the reciprocal
relationship between $\xi_j$ and $A(0)_j$. $H_j$, $G_{ij}$, $O_j$ and $A_j$ are
residuals or specification errors. The structural equations for $A(t)_j$ and
$V(t)_{ij}$ are

(6) $A(t)_j = \beta_1 A(0)_j + \beta_3 \xi_j + H_j$

(7) $V(t)_{ij} = \beta_2 V(0)_{ij} + G_{ij}$

Within class j, $V(t)_{ij}$ is linearly related to $V(0)_{ij}$. This is equivalent
to the assumption of a linear growth operating within each class at the
individual level. The same rate of growth, $\beta_2$, occurs within each class.
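
Equations 2 through 7 can be read as a recipe for generating data. The sketch below is one possible reading, offered as an illustration rather than as the data generation routine of Chapter IV. It assumes normally distributed components, treats $\gamma$ as the correlation between $A(0)_j$ and $\xi_j$, and draws $V(0)_{ij}$ with mean zero instead of centering it exactly within each class. All function and parameter names are hypothetical.

```python
import math
import random

def simulate_classes(c, s, beta1, beta2, beta3, gamma,
                     sd_A0=1.0, sd_xi=1.0, sd_V0=1.0, sd_H=0.5, sd_G=0.5,
                     sd_e0=0.5, sd_et=0.5, sd_exi=0.0, seed=0):
    """Generate c classes of s students from the structural model."""
    rng = random.Random(seed)
    rows = []
    for j in range(c):
        # Class effect A(0)_j and true teacher behavior xi_j,
        # correlated gamma (assumption: gamma is a correlation).
        z1, z2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
        A0 = sd_A0 * z1
        xi = sd_xi * (gamma * z1 + math.sqrt(1.0 - gamma ** 2) * z2)
        At = beta1 * A0 + beta3 * xi + rng.gauss(0.0, sd_H)  # equation 6
        W = xi + rng.gauss(0.0, sd_exi)                      # equation 5
        for i in range(s):
            V0 = rng.gauss(0.0, sd_V0)
            Vt = beta2 * V0 + rng.gauss(0.0, sd_G)           # equation 7
            Y0 = A0 + V0 + rng.gauss(0.0, sd_e0)             # equation 4, L = 0
            Yt = At + Vt + rng.gauss(0.0, sd_et)             # equation 4, L = t
            rows.append({"class": j, "student": i, "A0": A0, "xi": xi,
                         "V0": V0, "W": W, "Y0": Y0, "Yt": Yt})
    return rows

# Hypothetical parameter values chosen only for illustration.
rows = simulate_classes(c=30, s=20, beta1=0.8, beta2=0.5, beta3=0.1, gamma=0.2)
```

Setting `beta3=0.0` gives data with no teacher behavior effect, and setting `gamma=0.0` removes the initial confounding between class composition and teacher behavior.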

The decomposition of IKL)” into A(L)j and V(L)ij also implies that the class
effect is additive (i.e., A(L)j is a constant effect for all students in the same

Class). The effect of the teacher behavior, W, on student achievement is the

 

14

same for all students in the same class. Teacher behavior may, however, have a
direct effect (83) on A(t) and a reciprocal relationship (Y) with A(O). The former
will result in changes in performance (for the class as a whole) as a consequence
of being exposed to the teacher behavior of interest. The reciprocal
relationships (Y) represents confounding between initial class composition and
teacher behavior. In school settings, students are virtually never randomly
assigned to classes, and so substantial class effects exist before the start of the
school year. Importantly, these differences may be at least in part a
consequence of having teacher j in class j. This will have some impact on A(t)j
through A(O). Also, this reciprocal relationship represents the possibility that
the composition of the class may affect the way the teacher teaches (Doyle,
1979) which can affect A(t)j.
Given the following two assumptions, $\beta_3$ represents the effect of the teacher behavior on student achievement:
1. Prior to the study, there is no other teacher behavior, $\xi_1$, that is correlated with $\xi$ and which has some effect on $A(0)_j$ and/or $A(t)_j$.
2. During the study, there is no other teacher behavior, $\xi_2$, that is correlated with $\xi$ and which has some effect on $A(t)_j$.
These two assumptions are necessary so that $\beta_3 \neq 0$ can be interpreted clearly as a function of the effect of $W$ and not of some other teacher behavior variables.

The Relationship Between $\rho_{zw}$ and $\beta_3$

 

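The observed variables $Y_t$, $Y_0$ and $W$ are assumed to have a multivariate normal distribution with a mean vector of zero and a variance-covariance matrix, $\Sigma$ (see Table 1).

Table 1
The Total Variance-Covariance Matrix ($\Sigma$)

        Y(t)    Y(0)    W
Y(t)    $\sigma^2_{A_t} + \sigma^2_{V_t} + \sigma^2_{e_t}$
Y(0)    $\beta_1\sigma^2_{A_0} + \beta_3\gamma\sigma_{A_0}\sigma_\xi + \beta_2\sigma^2_{V_0}$    $\sigma^2_{A_0} + \sigma^2_{V_0} + \sigma^2_{e_0}$
W       $\beta_1\gamma\sigma_{A_0}\sigma_\xi + \beta_3\sigma^2_\xi$    $\gamma\sigma_{A_0}\sigma_\xi$    $\sigma^2_\xi + \sigma^2_{e_\xi}$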

In the structural model, errors of measurement and specification errors are assumed to be uncorrelated among themselves and with the latent variables, the $V$'s, $A$'s and $\xi$.
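The entries of Table 1 can be assembled mechanically from the structural parameters. The following is a minimal sketch, not the author's program: the function name `total_covariance`, its argument names, and the numeric values in the usage lines are illustrative assumptions.

```python
import math


def total_covariance(b1, b2, b3, gamma, var_a0, var_v0, var_xi,
                     var_at, var_vt, var_et, var_e0, var_exi):
    """Return the 3x3 matrix Sigma for (Y(t), Y(0), W) following Table 1.

    All names are hypothetical; var_* arguments are variances.
    """
    # Covariance between the initial class effect A(0) and teacher behavior xi.
    cov_a0_xi = gamma * math.sqrt(var_a0) * math.sqrt(var_xi)
    cov_yt_y0 = b1 * var_a0 + b3 * cov_a0_xi + b2 * var_v0
    cov_yt_w = b1 * cov_a0_xi + b3 * var_xi
    var_yt = var_at + var_vt + var_et
    var_y0 = var_a0 + var_v0 + var_e0
    var_w = var_xi + var_exi
    return [[var_yt, cov_yt_y0, cov_yt_w],
            [cov_yt_y0, var_y0, cov_a0_xi],
            [cov_yt_w, cov_a0_xi, var_w]]


# Illustrative (assumed) parameter values.
sigma = total_covariance(b1=0.7, b2=0.7, b3=0.0, gamma=0.2,
                         var_a0=0.24, var_v0=0.56, var_xi=0.5,
                         var_at=0.24, var_vt=0.56, var_et=0.2,
                         var_e0=0.2, var_exi=0.5)
for row in sigma:
    print([round(x, 4) for x in row])
```

The matrix is symmetric by construction, so only the lower triangle shown in Table 1 needs to be specified.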

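The coefficient $\rho_{zw}$ can be written as

$\rho_{zw} = \sigma_{zw} / (\sigma_z \sigma_w)$.

To determine the relationship between $\rho_{zw}$ and $\beta_3$, the variances and covariance are expressed in terms of the structural coefficients.
The covariance between $Z$ and $W$ can be written as

$$\begin{aligned}
\sigma_{zw} &= E(ZW) - E(Z)E(W) \\
 &= E(\bar{y}(t)W) - K\,E(\bar{y}(0)W) \\
 &= E[(A_t + \bar{V}_t + \bar{e}_t)(\xi + e_\xi)] - K\,E[(A_0 + \bar{V}_0 + \bar{e}_0)(\xi + e_\xi)] \\
 &= E(A_t\xi) + E(\bar{V}_t\xi) - K\,E(A_0\xi) - K\,E(\bar{V}_0\xi)
\end{aligned}$$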

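Since the $V$'s are defined at the individual level and $\xi$ at the class level,

$E(\bar{V}(L)\xi) = E\,E_j(\xi_j V(L)_{ij}) = E[\xi_j E_j(V(L)_{ij})] = 0$,

because the deviations $V(L)_{ij}$ average to zero within each class.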

 

Table 2

[Table 2 is not legible in the source. As referenced in the text, it expresses the regression coefficients and the correlations $\rho_{zw}$ in terms of the components of the structural model.]

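$$\begin{aligned}
\sigma_{zw} &= E(A_t\xi) - K\,E(A_0\xi) \\
 &= \beta_3\sigma^2_\xi + \mathrm{Cov}(A_0, \xi)(\beta_1 - K) \\
 &= \beta_3\sigma^2_\xi + \gamma\sigma_{A_0}\sigma_\xi(\beta_1 - K)
\end{aligned}$$

Then

(8) $\rho_{zw} = \dfrac{\beta_3\sigma^2_\xi + \gamma\sigma_{A_0}\sigma_\xi(\beta_1 - K)}{\sigma_z\sigma_w}$.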
Equation (8) indicates that if
1. $\gamma = 0$ (i.e., no initial confounding) and/or
2. $\beta_1 = K$,
the statement $\rho_{zw} = 0$ is equivalent to $\beta_3 = 0$ (provided that the variances are all greater than zero).
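The two conditions can be checked numerically from the numerator of equation (8). This is a small sketch under stated assumptions: `cov_zw` is a hypothetical helper name, and the parameter values are illustrative only.

```python
import math


def cov_zw(b1, b3, gamma, var_a0, var_xi, k):
    """Numerator of equation (8): covariance between Z and W.

    b1, b3 are structural coefficients, gamma the initial confounding,
    k the adjustment constant K (all names are illustrative).
    """
    return b3 * var_xi + gamma * math.sqrt(var_a0) * math.sqrt(var_xi) * (b1 - k)


# No teacher effect (b3 = 0) and no confounding: covariance vanishes for any K.
print(cov_zw(b1=0.7, b3=0.0, gamma=0.0, var_a0=0.24, var_xi=0.5, k=0.3))
# No teacher effect, confounding present, K equal to beta_1: still zero.
print(cov_zw(b1=0.7, b3=0.0, gamma=0.4, var_a0=0.24, var_xi=0.5, k=0.7))
# No teacher effect, confounding present, K different from beta_1: not zero.
print(cov_zw(b1=0.7, b3=0.0, gamma=0.4, var_a0=0.24, var_xi=0.5, k=0.5))
```

Only in the third case does a zero teacher effect coexist with a nonzero covariance, which is the situation the choice of $K$ has to guard against.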

Defining Values of K

 

In practice, the regression coefficient for predicting $y(t)$ from $y(0)$ is the value most frequently chosen to represent $K$. Because of the nested nature of the data, however, there are three such regression coefficients. In order to examine the appropriateness of using any one of these coefficients for $K$, the relationships between each of the coefficients and the structural coefficients are derived and shown in the following section.

Relationships Between Regression Coefficients and the Structural Model

The total regression coefficient ($\beta_T$), the between regression coefficient ($\beta_B$) and the within regression coefficient ($\beta_W$) can be expressed in terms of the model components as follows (Table 2).

By definition,

$\beta_T = \dfrac{E\,E_j\,(Y(t)_{ij} - M_{Y(t)})(Y(0)_{ij} - M_{Y(0)})}{E\,E_j\,(Y(0)_{ij} - M_{Y(0)})^2}$

As before, both $M_{Y(t)}$ and $M_{Y(0)}$ are zero. The numerator is $\mathrm{Cov}(Y(t)_{ij}, Y(0)_{ij})$ and the denominator is $\mathrm{Var}(Y(0)_{ij})$. Thus,

$\beta_T = \dfrac{\mathrm{Cov}(Y(t)_{ij}, Y(0)_{ij})}{\mathrm{Var}(Y(0)_{ij})}$

Substituting the corresponding values from Table 1 for $\mathrm{Cov}(Y(t)_{ij}, Y(0)_{ij})$ and $\mathrm{Var}(Y(0)_{ij})$ yields

$\beta_T = (\beta_1\sigma^2_{A_0} + \beta_2\sigma^2_{V_0} + \beta_3\gamma\sigma_{A_0}\sigma_\xi)/(\sigma^2_{A_0} + \sigma^2_{V_0} + \sigma^2_{e_0})$

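Similarly, the between regression coefficient is

$\beta_B = \dfrac{E(\bar{Y}(t)_j - M_{Y(t)})(\bar{Y}(0)_j - M_{Y(0)})}{E(\bar{Y}(0)_j - M_{Y(0)})^2} = \dfrac{\mathrm{Cov}(\bar{Y}(t)_j, \bar{Y}(0)_j)}{\mathrm{Var}(\bar{Y}(0)_j)}$.

By using equation 4 to obtain the class means of $Y(t)_{ij}$ and $Y(0)_{ij}$, and by substitution,

$\beta_B = \dfrac{\frac{1}{s}\beta_2\sigma^2_{V_0} + \beta_1\sigma^2_{A_0} + \gamma\beta_3\sigma_{A_0}\sigma_\xi}{\sigma^2_{A_0} + \frac{1}{s}\sigma^2_{V_0} + \frac{1}{s}\sigma^2_{e_0}}$

Similarly, the within regression coefficient is

$\beta_W = \dfrac{E_j(Y(t)_{ij} - M_{Y(t)j})(Y(0)_{ij} - M_{Y(0)j})}{E_j(Y(0)_{ij} - M_{Y(0)j})^2} = \beta_2\sigma^2_{V_0}/(\sigma^2_{V_0} + \sigma^2_{e_0})$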
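The three population regression coefficients can be evaluated side by side for a given set of structural parameters. The sketch below is illustrative only: the function name `regression_coefficients` and the numeric values are assumptions, not taken from the author's program.

```python
import math


def regression_coefficients(b1, b2, b3, gamma, var_a0, var_v0, var_e0, var_xi, s):
    """Return (beta_T, beta_B, beta_W) from the structural parameters.

    s is the number of students per class; names are hypothetical.
    """
    cov_a0_xi = gamma * math.sqrt(var_a0 * var_xi)
    beta_t = ((b1 * var_a0 + b2 * var_v0 + b3 * cov_a0_xi)
              / (var_a0 + var_v0 + var_e0))
    beta_b = ((b2 * var_v0 / s + b1 * var_a0 + b3 * cov_a0_xi)
              / (var_a0 + var_v0 / s + var_e0 / s))
    beta_w = b2 * var_v0 / (var_v0 + var_e0)
    return beta_t, beta_b, beta_w


# With an unreliable premeasure (var_e0 > 0) all three fall below beta_1 = beta_2.
print(regression_coefficients(0.7, 0.7, 0.0, 0.2, 0.24, 0.56, 0.2, 0.5, s=20))
# With a perfectly reliable premeasure and beta_1 = beta_2 they coincide.
print(regression_coefficients(0.7, 0.7, 0.0, 0.2, 0.30, 0.70, 0.0, 0.5, s=20))
```

The between coefficient is attenuated least because the error variance enters its denominator divided by $s$.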

As shown in Table 2, when $\beta_3 = 0$, $\rho_{zw}$ equals zero if

(9) $\gamma((\beta_1 - \beta_2)\sigma^2_{V_0} + \beta_1\sigma^2_{e_0}) = 0$

irrespective of the choice of regression coefficient. For $\rho_{zw} = 0$ to be equivalent to $\beta_3 = 0$, equation 9 is both the necessary and sufficient condition.
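As a check on equation 9, the case in which $K$ is set to the within regression coefficient can be worked through from equation 8. This is a sketch that uses only quantities already defined and assumes $\beta_3 = 0$ throughout.

```latex
\begin{aligned}
\beta_1 - \beta_W
  &= \beta_1 - \frac{\beta_2\sigma^2_{V_0}}{\sigma^2_{V_0}+\sigma^2_{e_0}}
   = \frac{(\beta_1-\beta_2)\sigma^2_{V_0}+\beta_1\sigma^2_{e_0}}
          {\sigma^2_{V_0}+\sigma^2_{e_0}}, \\
\sigma_{zw}
  &= \gamma\,\sigma_{A_0}\sigma_{\xi}\,(\beta_1-\beta_W)
   = \frac{\gamma\,\sigma_{A_0}\sigma_{\xi}\,
           \bigl[(\beta_1-\beta_2)\sigma^2_{V_0}+\beta_1\sigma^2_{e_0}\bigr]}
          {\sigma^2_{V_0}+\sigma^2_{e_0}} .
\end{aligned}
```

Since the denominator and $\sigma_{A_0}\sigma_\xi$ are positive, $\sigma_{zw}$ vanishes exactly when the left side of equation 9 does. The same numerator appears when $K$ is the total or the between coefficient; only the denominator changes.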

Conditions under which $\gamma((\beta_1 - \beta_2)\sigma^2_{V_0} + \beta_1\sigma^2_{e_0}) = 0$

When $\gamma = 0$

If $\gamma = 0$, irrespective of the relationships among $\beta_1$, $\beta_2$, $\sigma^2_{V_0}$ and $\sigma^2_{e_0}$ or the choice of $K$, equation 9 will be true. Put another way, when $\gamma = 0$, there is no problem of adjusting the achievement criterion for initial confounding with the teacher behavior.

When $\gamma \neq 0$

If $\gamma$ does not equal zero, for equation 9 to hold, $(\beta_1 - \beta_2)\sigma^2_{V_0} + \beta_1\sigma^2_{e_0}$ must equal zero. This can happen when

1. $\beta_1 = \beta_2(\sigma^2_{V_0}/(\sigma^2_{V_0} + \sigma^2_{e_0}))$, or
2. $\beta_1 = \beta_2$ and $\sigma^2_{e_0} = 0$.

The former can happen only under unlikely circumstances. The latter can happen if a perfectly reliable premeasure is used (so that $\sigma^2_{e_0} = 0$) and when subjects are randomly assigned to classrooms (so that $\beta_1 = \beta_2$).

In examining the relationship between $\rho_{zw}$ and $\beta_3$, none of the conditions identified seems likely to obtain in practice. Random assignment can rarely be achieved in practice and perfectly reliable achievement measures rarely exist.

An alternative to using a regression coefficient as a method for defining $K$ would be to estimate $\beta_1$ directly. For example, from Table 2,

$\beta_T = \dfrac{\beta_2\sigma^2_{V_0} + \beta_1\sigma^2_{A_0}}{\sigma^2_{V_0} + \sigma^2_{A_0} + \sigma^2_{e_0}}$ when $\beta_3 = 0$; thus, $\beta_1 = \dfrac{\beta_T\sigma^2_{Y_0} - \beta_2\sigma^2_{V_0}}{\sigma^2_{A_0}}$.

 

 

Since $E(MS_{By_0}) = s\sigma^2_{A_0} + \sigma^2_{V_0} + \sigma^2_{e_0}$ and $E(MS_{Wy_0}) = \sigma^2_{V_0} + \sigma^2_{e_0}$,

$\hat{\sigma}^2_{A_0} = \frac{1}{s}(MS_{By_0} - MS_{Wy_0})$.

From Table 1, $\beta_W = \beta_2(\sigma^2_{V_0}/(\sigma^2_{V_0} + \sigma^2_{e_0}))$, so that $\beta_2\sigma^2_{V_0}$ can be estimated by $\hat{\beta}_W MS_{Wy_0}$. Then

$\hat{\beta}_1 = \dfrac{\hat{\beta}_T[(MS_B - MS_W)/s + MS_W] - \hat{\beta}_W MS_W}{(MS_B - MS_W)/s}$

or

$\hat{\beta}_1 = \dfrac{\hat{\beta}_T MS_B + [(s-1)\hat{\beta}_T - s\hat{\beta}_W]MS_W}{MS_B - MS_W}$.

Distributions of Test Statistics

Even under conditions where $\beta_3 = 0$ implies $\rho_{zw} = 0$, a t-test of $\rho_{zw} = 0$ could still be inappropriate due to the effect of having estimated the value of $K$ from sample data rather than setting $K$ a priori to a known constant. Thus what remains to be done is to determine the effects on the distribution of "t" due to estimating $K$ in each of the several ways.

The "t" distribution of $r_{zw}$ is defined as the sampling distribution of the t ratio with $c - 2$ degrees of freedom which is obtained from $r_{zw}$ using the equation

(10) $t = \dfrac{r_{zw}\sqrt{c-2}}{\sqrt{1 - r^2_{zw}}}$ (Hays, 1973, p. 661).
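The direct estimator of $\beta_1$ derived earlier in this section can be sketched as a short function. This is an illustration under stated assumptions, not the author's code: `estimate_beta1` and its argument names are hypothetical, and the guard against a non-positive variance estimate is an added assumption that the source does not discuss.

```python
def estimate_beta1(beta_t_hat, beta_w_hat, ms_between, ms_within, s):
    """Estimate beta_1 from the total and within slopes and the premeasure mean squares.

    ms_between, ms_within: between- and within-class mean squares of Y(0).
    s: number of students per class.
    """
    var_a0_hat = (ms_between - ms_within) / s   # estimate of the class-effect variance
    if var_a0_hat <= 0:
        # Assumption: the estimator is undefined when MS_B does not exceed MS_W.
        raise ValueError("between mean square must exceed within mean square")
    var_y0_hat = var_a0_hat + ms_within         # estimate of the total variance of Y(0)
    return (beta_t_hat * var_y0_hat - beta_w_hat * ms_within) / var_a0_hat


# Feeding in population values (illustrative: var_A0 = .24, var_V0 = .56,
# var_e0 = .2, s = 20, beta_1 = .7, beta_2 = .3) recovers beta_1.
beta_t = (0.7 * 0.24 + 0.3 * 0.56) / 1.0
beta_w = 0.3 * 0.56 / 0.76
print(estimate_beta1(beta_t, beta_w, ms_between=20 * 0.24 + 0.76, ms_within=0.76, s=20))
```

With sample quantities in place of population values the result is only an estimate, and its sampling behavior is exactly what the simulation is designed to examine.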

 

Since the exact nature of the "t" distributions of the $r_{zw}$'s could not be determined, a simulation study was conducted. In addition to using estimates of $\beta_T$, $\beta_B$ and $\beta_W$ to form residual gain scores $Z_1$, $Z_2$ and $Z_3$ respectively, the proposed estimate of $\beta_1$ was used to form residual gain score $Z_4$. For comparison, another form of residual gain score, $Z_5$, was formed by setting $K$ to an a priori known constant (i.e., $K = \beta_1$).

CHAPTER IV

SIMULATION PROCEDURE

As shown in the previous chapter, testing $\rho_{zw} = 0$ is equivalent to testing $H_0$: $\beta_3 = 0$ if either of the following conditions is met: 1) $\gamma = 0$, or 2) $\beta_1 = \beta_2$, given a perfectly reliable premeasure. Interestingly, it was found that for both of these situations the equivalence between $\rho_{zw} = 0$ and $\beta_3 = 0$ holds regardless of whether $Z$ is defined using $K$ set to the total, between or within regression coefficient, or any other value of $K$ for that matter. However, in practice, the parametric values of $\beta_T$, $\beta_B$, $\beta_W$ and $\beta_1$ are seldom known. Thus, the purpose of the simulation study was to investigate the appropriateness of using a t-test to test $\rho_{zw} = 0$ in situations where estimates are used for $\beta_T$, $\beta_B$, $\beta_W$ and $\beta_1$. The empirical sampling distributions of the "t" statistics for each of the four methods of defining residual gain scores were simulated and compared with the central t-distribution. The means of the empirical sampling distributions, empirical Type I error rates, and empirical powers were used to determine the appropriateness of using a t-test to test $\rho_{zw} = 0$.

The procedures employed in this empirical study will now be discussed. First, the simulation parameters are described, and then the data generation routine.

Simulation Parameters

 

As stated previously, this investigation required the study of random sampling distributions of "t" based on $r_{z_1w}$, $r_{z_2w}$, $r_{z_3w}$, $r_{z_4w}$ and $r_{z_5w}$. Empirical generation of the random sampling distributions was done by repeatedly taking random samples from a known population, an approach typically referred to as Monte Carlo. The parameters of interest were the number of classes per sample, the number of students within each class, the value of $\beta_1$ relative to $\beta_2$, the reliability of the premeasure, the magnitude of initial confounding, and the central and non-central cases.

As previously stated, the means of the manifest variables, $Y_t$, $Y_0$ and $W$, were set equal to zero. Also, without loss of generality, $\sigma^2_{Y_t}$, $\sigma^2_{Y_0}$ and $\sigma^2_W$ were set equal to 1. $Y_t$, $Y_0$ and $W$ were assumed to have a multivariate normal distribution.

Both the number of classes, $c$, and the number of students per class, $s$, were allowed to vary so that effects on the distributions of the various "t" statistics could be investigated. The number of classes was set at 10, 30 and 50. Ten classes (or teachers) were chosen as an easily obtainable sample size. Fifty classrooms were chosen as an unusually large sample size. The number of students per class was set at 10, 20 and 30. The size of 10 was chosen as a lower bound for classroom size which might occur through loss of data. Class sizes of 20 and 30 are typical of schools today.

While $\beta_2$ represents the within class regression slope, given a perfect premeasure, $\beta_1$ does not represent exactly the between slope, as shown in Table 2. Consequently the exact magnitude of $\beta_1$ relative to $\beta_2$ cannot be decided. Therefore, three different combinations of $\beta_1$ and $\beta_2$ were selected. First, $\beta_1$ was set equal to $\beta_2$, with both equal to .7. Second, $\beta_1$ was set greater than $\beta_2$, with values .7 and .3 respectively. Third, $\beta_1$ was set smaller than $\beta_2$, with values .3 and .7 respectively. The last situation was included for comparison in spite of the fact that it is rarely encountered in practice (e.g., Cronbach, 1976).

Table 3

[Table 3 is not legible in the source. It lays out the six design dimensions of the simulation study, with the number of classes at 10, 30 and 50 and the number of students per class at 10, 20 and 30; asterisks in the original mark the cells examined.]

When $\beta_1 \neq \beta_2$ the ratio of between variation to within variation varies with the number of students per class. That is, the intraclass correlation, $\rho_I$, gets smaller as the number of students per class increases. The intraclass correlation was set at .30 regardless of $c$ and $s$ for the present study. This value was chosen because there is evidence, for example in school mathematics, that school variation accounts for 30 percent of the variation in student achievement on MAT mathematics scores (Haney, 1974).

Since $Y_t$ and $Y_0$ contain errors of measurement, the estimators of the different adjustment coefficients (i.e., $\beta_T$, $\beta_B$ and $\beta_W$) will be biased. The magnitude of bias is proportional to the reliability of $Y_0$. In other words, the bias depends on the premeasure only (Porter, 1971). The reliability of both the pre- and post-measure was set to .8. This value was chosen as a moderate reliability for achievement tests (Ebel, 1979). Since measurements of teacher behavior have lower reliability (Brophy, 1974), .5 was selected as the reliability coefficient of $W$.

As a result of setting $\sigma^2_{Y_t}$, $\sigma^2_{Y_0}$ and $\sigma^2_W$ to 1, $\rho_{Y_tY_t}$ and $\rho_{Y_0Y_0}$ to .8, and the reliability of $W$ to .5, the values taken by $\sigma^2_{e_t}$, $\sigma^2_{e_0}$, $\sigma^2_{e_\xi}$ and $\sigma^2_\xi$ were .2, .2, .5 and .5, respectively. Also, as a result of setting $\rho_I = .3$, the values taken by $\sigma^2_{A_0}$ and $\sigma^2_{V_0}$ were .24 and .56, respectively, in the presence of errors of measurement.
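The arithmetic behind these component variances can be reproduced in a few lines. The sketch below assumes, as the reported values .24 and .56 imply, that the intraclass correlation partitions the true-score variance (reliability times total variance) rather than the total variance; the function name and dictionary keys are illustrative.

```python
def variance_components(var_y=1.0, rel_y=0.8, var_w=1.0, rel_w=0.5, icc=0.3):
    """Split total variances into error, class-effect and within-class parts."""
    true_var_y = rel_y * var_y              # variance free of measurement error
    return {
        "var_e": var_y - true_var_y,        # error variance of the achievement measure
        "var_a": icc * true_var_y,          # class-effect variance
        "var_v": (1.0 - icc) * true_var_y,  # within-class deviation variance
        "var_xi": rel_w * var_w,            # true teacher-behavior variance
        "var_e_xi": (1.0 - rel_w) * var_w,  # error variance of W
    }


for name, value in variance_components().items():
    print(name, round(value, 2))
```

Setting `rel_y=1.0` gives the error-free case, in which the class-effect and within-class variances become .30 and .70 under the same assumption.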

Three levels of initial confounding were considered: $\gamma = 0$ to indicate no confounding, $\gamma = .4$ to indicate substantial confounding, and $\gamma = .2$ as an intermediate level of initial confounding.

Lastly, both the central and non-central cases were included in the study to examine the probability of Type I and Type II errors. For the purpose of this study, $\beta_3$ was set equal to 0.00 and 0.10; .10 was chosen as an arbitrary value to indicate the non-central case. Table 3 illustrates all possible combinations of the six design dimensions included in the simulation study. An "*" marks the cells examined. These cells were selected to facilitate the investigation of the effects of initial confounding, presence of errors of measurement in the premeasure, relative magnitude of $\beta_1$ to $\beta_2$, and sample and class sizes on the distribution of the "t" statistics for different methods of defining $r_{zw}$. One thousand samples were simulated for each of the selected cases.

Data Generation Routine

 

Three manifest variables were generated: $Y_t$, $Y_0$ and $W$. The three variables were generated to have a multivariate normal distribution with a mean vector of zeros and a variance-covariance matrix $\Sigma$ (see Table 1). As shown in equations 4 and 5 in the analytic chapter, the manifest variables are defined as

$Y_t = A_t + V_t + e_t$,

$Y_0 = A_0 + V_0 + e_0$,

$W = \xi + e_\xi$,

where all the components have been defined previously. Thus, $\Sigma$ can be decomposed into $\Sigma_W$, $\Sigma_B$ and $\Sigma_e$, the within, between and errors of measurement variance-covariance matrices respectively, as shown in Table 4. Having identified the set of parameters for each population, the Cholesky factor was computed for the between and within population variance-covariance matrices. These were used to transform generated between and within standard normal (0, 1) variates into between and within components with the desired vector of means and variance-covariance matrix.

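A FORTRAN program was written to generate the sample data and compute summary statistics for each sample. In order to generate the sample data, the between, within and errors of measurement components needed to be generated.

Table 4
Between ($\Sigma_B$), Within ($\Sigma_W$) and Errors of Measurement ($\Sigma_e$) Variance-Covariance Matrices

Between ($\Sigma_B$)
        Y(t)    Y(0)    W
Y(t)    $\sigma^2_{A_t}$
Y(0)    $\beta_1\sigma^2_{A_0} + \beta_3\gamma\sigma_{A_0}\sigma_\xi$    $\sigma^2_{A_0}$
W       $\beta_1\gamma\sigma_{A_0}\sigma_\xi + \beta_3\sigma^2_\xi$    $\gamma\sigma_{A_0}\sigma_\xi$    $\sigma^2_\xi$

Within ($\Sigma_W$)
        Y(t)    Y(0)    W
Y(t)    $\sigma^2_{V_t}$
Y(0)    $\beta_2\sigma^2_{V_0}$    $\sigma^2_{V_0}$
W       0    0    0

Errors of Measurement ($\Sigma_e$)
        Y(t)    Y(0)    W
Y(t)    $\sigma^2_{e_t}$
Y(0)    0    $\sigma^2_{e_0}$
W       0    0    $\sigma^2_{e_\xi}$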

Concerning the between components, two basic steps were used to generate $A_t$, $A_0$ and $\xi$. First, a vector of independent normal variates, $L$, was generated by calling the function GGNQF three times, once for each latent variable. This function, which is adapted by IMSL (1982), generates one pseudo-random normal (0, 1) deviate every time it is called. Second, the obtained normal variates were transformed into a vector of $A_t$, $A_0$ and $\xi$. This was done by multiplying $L$ by the transpose of the Cholesky factor of $\Sigma_B$ (denoted $T'$). This can be summarized as

$(A_t, A_0, \xi)' = T' L$

 

 

Steps one and two were repeated as many times as the number of classes in the sample, $c$. The obtained $A_t$, $A_0$ and $\xi$ had a multivariate normal distribution with a mean vector of zero and variance-covariance matrix $\Sigma_B$. The within components $V_t$ and $V_0$ were generated in the same way as the between components, except that $\Sigma_W$ was used instead of $\Sigma_B$.

GGNQF was also used to generate the normal deviates used to form errors of measurement for the manifest variables. The normal deviates were then multiplied by the standard error of measurement.

Having generated the between, within and error components, each manifest variable was obtained by addition of its component parts.
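The generation steps above can be sketched as follows. This is not the original FORTRAN program: `random.gauss` stands in for GGNQF, the hand-written `cholesky` returns the lower-triangular factor directly (which plays the role of $T'$), and all function names and the input matrices used in any call are illustrative assumptions.

```python
import math
import random


def cholesky(m):
    """Lower-triangular factor low with low * low^T = m (m symmetric positive definite)."""
    n = len(m)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            acc = m[i][j] - sum(low[i][k] * low[j][k] for k in range(j))
            low[i][j] = math.sqrt(acc) if i == j else acc / low[j][j]
    return low


def draw(low, rng):
    """Transform independent N(0, 1) deviates into one correlated vector."""
    z = [rng.gauss(0.0, 1.0) for _ in low]
    return [sum(low[i][k] * z[k] for k in range(i + 1)) for i in range(len(low))]


def generate_sample(sigma_b, sigma_w, sd_e, c, s, rng):
    """Generate c classes of s students.

    sigma_b: 3x3 covariance of (A_t, A_0, xi); sigma_w: 2x2 covariance of (V_t, V_0);
    sd_e: standard errors of measurement keyed "t", "0" and "w".
    Returns (students, w): students is a list of (class index, Y_t, Y_0),
    w holds one observed teacher-behavior score per class.
    """
    low_b, low_w = cholesky(sigma_b), cholesky(sigma_w)
    students, w = [], []
    for j in range(c):
        a_t, a_0, xi = draw(low_b, rng)              # between (class-level) components
        w.append(xi + sd_e["w"] * rng.gauss(0.0, 1.0))
        for _ in range(s):
            v_t, v_0 = draw(low_w, rng)              # within (student-level) components
            y_t = a_t + v_t + sd_e["t"] * rng.gauss(0.0, 1.0)
            y_0 = a_0 + v_0 + sd_e["0"] * rng.gauss(0.0, 1.0)
            students.append((j, y_t, y_0))
    return students, w
```

Passing a seeded `random.Random` instance as `rng` makes a run reproducible, which mirrors the role of the seed number in the original program.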

A subroutine was written to compute the different forms of the $r_{zw}$'s. The obtained sample correlation coefficients were transformed into a t-ratio with $c - 2$ degrees of freedom using equation 10. Throughout this dissertation the empirical t-sampling distribution of $r_{z_1w}$ will be denoted as $t_{z_1w}$, $r_{z_2w}$ as $t_{z_2w}$, $r_{z_3w}$ as $t_{z_3w}$, $r_{z_4w}$ as $t_{z_4w}$ and $r_{z_5w}$ as $t_{z_5w}$.
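The transformation from a sample correlation to a t-ratio can be sketched as below. Equation 10 is taken here to be the usual conversion $t = r\sqrt{c-2}/\sqrt{1-r^2}$, which is an assumption based on the cited source (Hays, 1973); the function name `t_from_r` is hypothetical.

```python
import math


def t_from_r(r, c):
    """Convert a correlation computed over c classes into a t-ratio with c - 2 df."""
    if c <= 2:
        raise ValueError("need more than two classes")
    if not -1.0 < r < 1.0:
        raise ValueError("correlation must lie strictly between -1 and 1")
    return r * math.sqrt(c - 2) / math.sqrt(1.0 - r * r)


# Example with c = 30 classes.
print(t_from_r(0.5, 30))
print(t_from_r(-0.5, 30))
```

The ratio is an odd function of $r$, so positive and negative correlations of equal size map to t-ratios of equal magnitude and opposite sign.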


Another subroutine was written to obtain empirical Type I and Type II errors for the $t_{zw}$'s at nominal values of .005, .01, .025, .05, .1, .995, .99, .975, .95, and .90. This allowed consideration of fit for both one and two tailed tests of the null hypothesis $\rho_{zw} = 0$.
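The tallying of empirical error rates can be sketched as a simple count of replications falling beyond a critical value. The function name `empirical_rates` and the toy inputs are assumptions for illustration; the critical value is supplied by the caller (for example from a t table with $c - 2$ degrees of freedom).

```python
def empirical_rates(t_values, critical):
    """Proportion of simulated t-ratios beyond +critical and below -critical.

    'both' is the two-tailed rejection rate for a test whose nominal level is
    twice the one-tailed level associated with `critical`.
    """
    n = len(t_values)
    upper = sum(1 for t in t_values if t > critical) / n
    lower = sum(1 for t in t_values if t < -critical) / n
    return {"upper": upper, "lower": lower, "both": upper + lower}


# Toy example with five replications and a critical value of 2.0.
print(empirical_rates([-3.0, -1.0, 0.0, 1.0, 3.0], 2.0))
```

Under a true null hypothesis these proportions estimate Type I error; under a false one, the same counts estimate power, and one minus power gives the Type II error rate.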

In order to check the accuracy of the computer program written to calculate summary statistics, the simulated data for the design with 5 classes of 5 students each were printed out and analyzed separately using the SPSS statistical package. The results of the two sets of calculations agreed perfectly. The simulation portion of the program was verified by executing the program to obtain Type I errors for a set of parameters in which $Y_t$, $Y_0$ and $W$ were perfectly reliable, $\gamma = 0$ and $\beta_1 = \beta_2$. Under these conditions the different $t_{zw}$'s all have a central t-distribution. The empirical Type I errors of the $t_{zw}$'s were in close agreement with their corresponding nominal alphas. For example, the empirical Type I errors of $t_{z_1w}$, $t_{z_2w}$, $t_{z_3w}$, $t_{z_4w}$ and $t_{z_5w}$ were .049, .049, .051, .048 and .051 for upper tail $\alpha = .05$ and .054, .052, .056, .052 and .050 for lower tail $\alpha = .05$; .100, .096, .100, .099 and .101 for upper tail $\alpha = .10$ and .105, .107, .106, .110 and .105 for lower tail $\alpha = .10$.

For each cell identified in Table 3 the program was run once. The seed number for every run was the random number generated after the last one used by the preceding run.

CHAPTER V
RESULTS OF THE EMPIRICAL INVESTIGATION

In Chapter III, it was shown that $H_0$: $\rho_{zw} = 0$ is equivalent to $H_0$: $\beta_3 = 0$ if either of the following conditions is met: 1) $\gamma = 0$, or 2) $\beta_1 = \beta_2$ and a perfectly reliable premeasure. When these conditions are not met, however, $\rho_{zw} = 0$ is equivalent to $H_0$: $\beta_3 = 0$ only for $Z_4$ and $Z_5$. This chapter demonstrates empirically the Type I error and power of these five test statistics of $H_0$: $\rho_{zw} = 0$ for situations which are common in educational research.

The variables of interest in the empirical investigation were: magnitude of initial confounding, reliability of the premeasure, relative magnitude of $\beta_1$ to $\beta_2$, number of classes per sample (sample size), and number of students within each class. Any combination of levels of the above variables identifies a sampling distribution for each of the several $r_{zw}$'s. The specific sampling distributions investigated were selected according to a design which facilitated investigation of the effects of each of the several design variables while holding the other variables constant. The subset of sampling distributions chosen for study is represented by asterisks in the six dimensional matrix in Table 3.

The effects of initial confounding, presence of errors of measurement in the premeasure, and sample and class sizes on the mean estimates of the $\rho_{zw}$'s, the empirical Type I errors and the empirical power of the one and two tailed tests of the $\rho_{zw}$'s are presented in this chapter.

In general, the results of the study showed that when there was a substantial amount of initial confounding, the test statistics $t_{z_1w}$, $t_{z_2w}$ and $t_{z_3w}$ were only valid in a few situations. These tests, particularly $t_{z_1w}$ and $t_{z_3w}$, tended to be too liberal in situations where $\beta_1 = \beta_2$ or $\beta_1 > \beta_2$ and too conservative when $\beta_1 < \beta_2$. Parallel results for $t_{z_1w}$ and $t_{z_3w}$ were obtained with increasing sample size. However, the test statistics $t_{z_4w}$ and $t_{z_5w}$ were the only tests which remained valid across all levels of initial confounding, presence of errors of measurement, and sample and class sizes. Furthermore, the results of the study indicated that increasing sample and class size and presence of errors of measurement increased the empirical power of both $t_{z_4w}$ and $t_{z_5w}$ in situations where $\beta_1 = \beta_2$ or $\beta_1 > \beta_2$.

 

Mean Estimates of $\rho_{zw}$ when $\beta_3 = 0$

Initial Confounding Effects

 

By examining the equations in column 5 of Table 2, one can predict that when $\gamma = 0$ and $\beta_3 = 0$ each of the five $\rho_{zw}$'s under investigation has an expected value of zero. The numerators of these equations are

$\gamma\sigma_{A_0}\sigma_\xi((\beta_1 - \beta_2)\sigma^2_{V_0} + \beta_1\sigma^2_{e_0})$ for $\rho_{z_1w}$, $\rho_{z_2w}$ and $\rho_{z_3w}$,

0 for $\rho_{z_4w}$,

and $\gamma\sigma_{A_0}\sigma_\xi(\beta_1 - K)$ for $\rho_{z_5w}$.

Given these numerators, one can see that all $\rho_{zw}$'s increase as $\gamma$ increases, holding other variables constant. Inspection of the numerators also makes clear that the sign and magnitude of the $\rho_{zw}$'s are affected by the relationship of $\beta_1$ to $\beta_2$. For example, when $\beta_1 > \beta_2$ and $\gamma$ is large, the mean estimates of $\rho_{z_1w}$, $\rho_{z_2w}$ and $\rho_{z_3w}$ are expected to depart positively from zero. Similarly, when $\beta_1 = \beta_2$ (and errors of measurement are present) the departure of these mean estimates will be in the positive direction but not as far as was the case when $\beta_1 > \beta_2$. In situations where $\beta_1 < \beta_2$ and $\gamma$ is large, the departure of the mean estimates of $\rho_{z_1w}$, $\rho_{z_2w}$ and $\rho_{z_3w}$ will be negative given $|\beta_1 - \beta_2|\sigma^2_{V_0} > \beta_1\sigma^2_{e_0}$. The mean estimate of $\rho_{z_2w}$ is expected to be smaller in absolute value than the mean estimates of $\rho_{z_1w}$ and $\rho_{z_3w}$. This is because all three share the same numerator but $\rho_{z_2w}$ has the largest denominator. The denominators, as shown in Table 2, are:

$\sigma_{z_1}\sigma_w(\sigma^2_{A_0} + \sigma^2_{V_0} + \sigma^2_{e_0})$ for $\rho_{z_1w}$,

$\sigma_{z_2}\sigma_w(s\sigma^2_{A_0} + \sigma^2_{V_0} + \sigma^2_{e_0})$ for $\rho_{z_2w}$,

$\sigma_{z_3}\sigma_w(\sigma^2_{V_0} + \sigma^2_{e_0})$ for $\rho_{z_3w}$.

In summary, given that $\gamma$ is large, it is predicted that the empirical sampling distributions of $r_{z_1w}$ and $r_{z_3w}$ will be centered to the right of the central t-distribution in situations where $\beta_1 > \beta_2$ and to its left when $\beta_1 < \beta_2$ (though these also depend on the magnitude of errors of measurement). Also, it is expected that the empirical sampling distributions of $r_{z_4w}$ and $r_{z_5w}$ will be the closest to the central t-distribution across all combinations of $\beta_1$ and $\beta_2$.

Table 5 shows the effect of initial confounding on the mean estimates of $\rho_{z_1w}$, $\rho_{z_2w}$, $\rho_{z_3w}$, $\rho_{z_4w}$ and $\rho_{z_5w}$ under the three different combinations of $\beta_1$ and $\beta_2$, where sample size and class size were held constant at 30 and 20 respectively, and $\rho_{Y_0Y_0} = .8$. As expected, the means of the empirical sampling distributions of the $r_{zw}$'s were all near zero when $\gamma = 0$. As $\gamma$ increased to .2 the mean estimates of $\rho_{z_1w}$ and $\rho_{z_3w}$ increased to .026 and .033 when $\beta_1 = \beta_2$ and to .058 and .07, respectively, when $\beta_1 > \beta_2$. However, their values decreased to -.02 and -.028 when $\beta_1 < \beta_2$. Increasing $\gamma$ to .4 caused the sampling distribution mean estimates of $\rho_{z_1w}$ and $\rho_{z_3w}$ to depart far from zero, particularly in situations where $\beta_1 > \beta_2$. Their values were .05 and .066 when $\beta_1 = \beta_2$, -.047 and -.059 when $\beta_1 < \beta_2$, and .119 and .147, respectively, when $\beta_1 > \beta_2$.

While the sampling mean estimates of $\rho_{z_2w}$ remained relatively close to zero across all levels of $\gamma$ and across all combinations of $\beta_1$ and $\beta_2$, there was a slight increase in the mean of the $r_{z_2w}$'s in situations where $\beta_1 > \beta_2$ as $\gamma$ increased. Mean $r_{z_2w}$'s were .0015 at $\gamma = 0$, .012 at $\gamma = .2$, and .024 at $\gamma = .4$. However, these mean estimates were close to zero only because the specific values of $\rho_{Y_0Y_0}$ and $(\beta_1 - \beta_2)$ were such that the two parts of the numerator in $\rho_{z_2w}$ compensated each other.

Table 5
Means of Empirical Sampling Distributions of $r_{zw}$'s for Different Combinations of $\beta_1$, $\beta_2$ and $\gamma$ for c = 30, s = 20, $\rho_{Y_0Y_0} = .8$ and $\beta_3 = 0$

[The body of Table 5 is not legible in the source.]

The sampling mean estimates of $\rho_{z_4w}$ and $\rho_{z_5w}$ remained the closest to zero across all levels of $\gamma$ and across all combinations of $\beta_1$ and $\beta_2$.

Effects of presence of errors of measurement ($\rho_{Y_0Y_0} \neq 1$)

 

$\sigma^2_{e_0}$ is a common component shared by the numerators of $\rho_{z_1w}$, $\rho_{z_2w}$ and $\rho_{z_3w}$. Since $\sigma^2_{e_0}$ has a positive or zero value, its presence should increase the departure of the mean estimates of $\rho_{z_1w}$, $\rho_{z_2w}$ and $\rho_{z_3w}$ from zero in situations where $\beta_1 = \beta_2$ or $\beta_1 > \beta_2$. However, this departure decreases in situations where $\beta_1 < \beta_2$. Due to the absence of $\sigma^2_{e_0}$ from the equations of $\rho_{z_4w}$ and $\rho_{z_5w}$, errors of measurement were expected to have no effect on their sampling mean estimates.

Table 6 reports the effect of the presence of errors of measurement in the premeasure on the sampling mean estimates of the $\rho_{zw}$'s for the three different combinations of $\beta_1$ and $\beta_2$ for c = 30, s = 20 and $\gamma = .2$.

As expected, the mean estimates of $\rho_{z_1w}$, $\rho_{z_2w}$ and $\rho_{z_3w}$ increased due to the presence of errors of measurement when $\beta_1 = \beta_2$. While their values were all equal to .003 when $\rho_{Y_0Y_0} = 1.0$, they became .026, .005 and .033, respectively, when $\rho_{Y_0Y_0} = .8$. Also, as expected, presence of errors of measurement brought the mean estimates of $\rho_{z_1w}$ and $\rho_{z_3w}$ closer to zero in situations where $\beta_1 < \beta_2$. Their values were -.041 and -.056 when $\rho_{Y_0Y_0} = 1.0$ and became -.02 and -.028 when $\rho_{Y_0Y_0} = .8$. However, in situations where $\beta_1 > \beta_2$ and $\rho_{Y_0Y_0} = .8$, the mean estimates of $\rho_{z_1w}$ and $\rho_{z_3w}$ did not increase as expected. Their values were .057 and .075 when $\rho_{Y_0Y_0} = 1$ and .058 and .07 when $\rho_{Y_0Y_0} = .8$.

The mean estimates of $\rho_{z_4w}$ and $\rho_{z_5w}$ remained the closest to zero in the presence and absence of errors of measurement and across all combinations of $\beta_1$ and $\beta_2$.

Table 6
Means of Empirical Sampling Distributions of $r_{zw}$'s for Different Combinations of $\beta_1$, $\beta_2$ and $\rho_{Y_0Y_0}$ for c = 30, s = 20, $\gamma = .2$ and $\beta_3 = 0$

[The body of Table 6 is not legible in the source.]

Sample and Class Size Effect

 

Due to the presence of $s$ in its denominator, $\rho_{z_2w}$ was not only expected to have a smaller mean estimate than $\rho_{z_1w}$ and $\rho_{z_3w}$ but was also expected to get smaller as $s$ increased. $c$ is not part of any of the equations of the $\rho_{zw}$'s; therefore, it was expected that the mean estimates of the $\rho_{zw}$'s would not be affected by changing sample size.

Table 7 shows the mean estimates of the $\rho_{zw}$'s across different levels of sample size where $\gamma = .2$, $\rho_{Y_0Y_0} = .8$ and s = 20.

As expected, the mean estimates of all $\rho_{zw}$'s were not affected by increasing $c$ across combinations of $\beta_1$ and $\beta_2$. For example, the mean estimates of $\rho_{z_1w}$, $\rho_{z_2w}$, $\rho_{z_3w}$, $\rho_{z_4w}$ and $\rho_{z_5w}$ were .06, -.01, .07, -.0045 and .0073 for c = 10, .058, .012, .07, .0008 and .0074 for c = 30, and .053, .01, .07, .0006 and .0012 for c = 50 in situations where $\beta_1 > \beta_2$. Table 8 shows the mean estimates of the $\rho_{zw}$'s across different levels of class size where $\gamma = .2$, $\rho_{Y_0Y_0} = .8$, c = 30 and $\beta_3 = 0$.

As expected, the mean estimates of $\rho_{z_1w}$, $\rho_{z_3w}$, $\rho_{z_4w}$ and $\rho_{z_5w}$ were not affected by increasing $s$. The mean estimate of $\rho_{z_2w}$ decreased slightly as $s$ increased. For example, the mean estimates of $\rho_{z_2w}$ were .018 for s = 10, .012 for s = 20 and .0048 for s = 30 in situations where $\beta_1 > \beta_2$.

Empirical Type I Errors for One and
Two Tailed t-Tests When Testing $H_0$: $\rho_{zw} = 0$

 

Table 7
Effects of Sample Size on Mean Estimates of $\rho_{zw}$'s for $\gamma = .2$, $\rho_{Y_0Y_0} = .8$, s = 20 and $\beta_3 = 0$

[The body of Table 7 is not legible in the source.]

Table 8
Effects of Class Size on Mean Estimates of $\rho_{zw}$'s for $\gamma = .2$, $\rho_{Y_0Y_0} = .8$, c = 30 and $\beta_3 = 0$

[The body of Table 8 is not legible in the source.]

To evaluate the validity of the t-test in testing $H_0$: $\rho_{zw} = 0$, the empirical values of the tests for the $t_{zw}$'s were compared to the critical values obtained from the t-distribution with $c - 2$ degrees of freedom for selected levels of significance. When the null hypothesis is true (i.e., $\beta_3 = 0$), the observed relative frequency of data sets having values of $t_{z_1w}$, $t_{z_2w}$, $t_{z_3w}$, $t_{z_4w}$ and $t_{z_5w}$ greater than the critical values in the upper tail, or smaller than the same critical values in the lower tail, yields the empirical levels of significance. Comparison to the selected or nominal levels of significance gives an indication of whether the test used is conservative, liberal, or correct. Comparisons were made at three nominal levels of significance which are commonly used by educational researchers: .01, .05 and .1. Observed levels of significance were in all cases based on calculating the $t_{zw}$'s for 1000 replications from a multivariate normal distribution with specified characteristics. To facilitate comparison of empirical and nominal levels of significance, 95% probability intervals were computed using the normal approximation of the binomial distribution with n = 1000 and P equal to the selected level of significance. Thus, if the selected level of significance was .05, the 95% probability interval would be $.05 \pm 1.96((.05)(1 - .05)/1000)^{1/2} = .05 \pm .014$. The probability limits for the nominal alphas are presented in Tables 9 through 16. If the empirical Type I error exceeded the upper probability limit, this indicated a liberal test. On the other hand, if it was less than the lower probability limit, this indicated a conservative test; otherwise the t-test was considered valid. The .05 nominal alpha will be used throughout this chapter as the primary basis for comparison of the different situations.
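The probability limits and the resulting classification rule can be sketched in a few lines. The function names `probability_limits` and `classify` are illustrative, and the empirical rates passed in the usage lines are made-up inputs rather than reported results.

```python
import math


def probability_limits(alpha, n=1000, z=1.96):
    """95% probability interval for an empirical rate under a nominal alpha
    (normal approximation to the binomial with n replications)."""
    half_width = z * math.sqrt(alpha * (1.0 - alpha) / n)
    return alpha - half_width, alpha + half_width


def classify(empirical, alpha, n=1000):
    """Label a test liberal, conservative or valid from its empirical Type I error."""
    lower, upper = probability_limits(alpha, n)
    if empirical > upper:
        return "liberal"
    if empirical < lower:
        return "conservative"
    return "valid"


for alpha in (0.01, 0.05, 0.10):
    low, high = probability_limits(alpha)
    print(alpha, round(low, 4), round(high, 4))

# Hypothetical empirical rates at the .05 level.
print(classify(0.080, 0.05), classify(0.045, 0.05), classify(0.030, 0.05))
```

The half-width for the .05 level is about .0135, which the text rounds to .014.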

Initial Confounding Effect

 

It was argued earlier in this chapter that, given $\gamma$ is large, the empirical sampling distributions of $r_{z_1w}$, $r_{z_2w}$ and $r_{z_3w}$ will be located to the right of the central t-distribution in situations where $\beta_1 = \beta_2$ or $\beta_1 > \beta_2$, and to its left when $\beta_1 < \beta_2$. As mentioned earlier, this prediction did not hold for the sampling distribution of $r_{z_2w}$. Also, it was argued that the empirical sampling distributions of $r_{z_4w}$ and $r_{z_5w}$ would be the closest to zero. As a consequence, given $\gamma$ is large, it was expected that using the test statistics $t_{z_1w}$ and $t_{z_3w}$ to test $\rho_{zw} = 0$ would result in liberal tests in situations where $\beta_1 = \beta_2$ or $\beta_1 > \beta_2$, and in conservative tests when $\beta_1 < \beta_2$. However, both $t_{z_4w}$ and $t_{z_5w}$ were expected to result in a valid test of the hypothesis of interest.

Table 9 shows the empirical Type I errors of the one-tailed tests of ρzw = 0
across three levels of initial confounding and across three combinations of β1
and β2 for c = 30, s = 20 and ρYoYo = .8. Comparable results for the two-tailed
tests are shown in Table 10. It should be mentioned that here and throughout
this paper, only the positive tail was considered for the one-tailed tests.

β1 = β2

All the empirical Type I errors of the one-tailed tests for the tzw's were
within 1.96 standard errors of their corresponding nominal alphas when γ = 0 and
γ = .2. As γ increased to .4, most of the empirical Type I errors for the
one-tailed tests for tz1w and tz3w were, as expected, greater than the upper
limits of their corresponding nominal alphas. The other tzw's were not affected.
For example, at the .05 level of significance, the empirical Type I errors for
the one-tailed tests for tz1w, tz2w, tz3w, tz4w and tz5w were .082, .048, .097,
.043 and .047, respectively.

Table 9

Effects of Initial Confounding on Empirical Type I Errors for the One-Tailed
Tests of ρzw = 0 where c = 30, s = 20 and ρYoYo = .8

[Table body not legible in the source. Entries were flagged when greater than
the upper limit or smaller than the lower limit of the corresponding nominal
alpha.]

Table 10

Effects of Initial Confounding on Empirical Type I Errors for the Two-Tailed
Tests of ρzw = 0 where c = 30, s = 20 and ρYoYo = .8

[Table body not legible in the source. Entries were flagged when greater than
the upper limit or smaller than the lower limit of the corresponding nominal
alpha.]

While the empirical Type I errors of the two-tailed tests for the tzw's were in
close agreement with those of the one-tailed tests when γ = 0 and γ = .2, they
differed as γ increased to .4. For example, at .05 nominal alpha, the empirical
Type I errors of the two-tailed tests for tz1w, tz2w, tz3w, tz4w and tz5w were
.043, .029, .056, .038 and .041, respectively (Table 10). The two-tailed tz1w
and tz3w were valid only because of a compensating lack of fit in each tail.
Thus for rz1w, rz2w and rz3w, either the one-tailed test was too liberal or the
two-tailed test was too conservative. In contrast, both the one- and two-tailed
tests for tz4w and tz5w were valid in testing the hypothesis of interest across
all levels of γ.
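
The compensating lack of fit in the two tails can be illustrated with a small
calculation. The sketch below assumes, purely for illustration, that a test
statistic is normally distributed with unit variance and mean `shift` instead of
following the central t-distribution; the values of `shift` are hypothetical and
are not estimated from the study.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()  # stand-in for the central t-distribution (assumption)


def tail_rates(shift, alpha=0.05):
    """Rejection rates under H0 when the statistic is N(shift, 1), not N(0, 1).

    Returns (one_tailed_upper, two_tailed_upper, two_tailed_lower)."""
    z_one = STD_NORMAL.inv_cdf(1.0 - alpha)        # one-tailed critical value
    z_two = STD_NORMAL.inv_cdf(1.0 - alpha / 2.0)  # two-tailed critical value
    shifted = NormalDist(mu=shift, sigma=1.0)
    one_tailed = 1.0 - shifted.cdf(z_one)
    upper = 1.0 - shifted.cdf(z_two)
    lower = shifted.cdf(-z_two)
    return one_tailed, upper, lower


if __name__ == "__main__":
    for shift in (0.0, 0.2, 0.4):
        one, upper, lower = tail_rates(shift)
        print(f"shift={shift:.1f}  one-tailed={one:.3f}  "
              f"two-tailed={upper + lower:.3f} "
              f"(upper {upper:.3f}, lower {lower:.3f})")
```

Under these assumptions a rightward shift inflates the upper-tail rate and
deflates the lower-tail rate, so the two-tailed total departs from the nominal
level less than the one-tailed rate does.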

β1 < β2

As expected, all the empirical Type I errors of the one-tailed tests were
within 1.96 standard errors of their corresponding nominal alphas when γ = 0. As
γ increased to .2, the one-tailed tests for tz1w and tz3w became slightly
conservative (e.g., empirical Type I errors of .036 and .034 for a nominal alpha
of .05); as γ increased to .4, the degree of conservativeness increased, with
empirical Type I errors of .019 and .016 at .05 nominal alpha. As expected, the
one-tailed test for tz2w also became conservative with increased initial
confounding, but less so than either rz1w or rz3w.

While the one-tailed tests using tz1w, tz2w and tz3w were conservative when
γ = .4, only the two-tailed test using tz2w was conservative. Its empirical
Type I error was .035 at .05 nominal alpha. It should be mentioned that both the
one- and two-tailed tests using tz4w and tz5w were valid in testing H0: ρzw = 0
across all levels of γ.

β1 > β2

As expected, all the empirical Type I errors of the one-tailed tests were
within 1.96 standard errors of their corresponding nominal alphas when γ = 0. As
γ increased to .2 and to .4, the empirical Type I errors for the one-tailed
tests using tz1w and tz3w increased to .091 and .103 when γ = .2, and to .148
and .193, respectively, when γ = .4 at .05 nominal alpha. However, the
one-tailed test using tz2w was not liberal at .05 as γ increased, but was
liberal at .1 nominal alpha (e.g., .114 when γ = .2 and .127 when γ = .4). None
of the two-tailed tests were liberal when γ = 0 and γ = .2 at .05 nominal alpha.
But as γ increased to .4, the two-tailed tests using tz1w and tz3w became
liberal (e.g., empirical Type I errors of .094 and .116, respectively, at .05
nominal alpha).

Again, the one- and two-tailed tests using tz4w and tz5w were valid across all
combinations of β1 and β2 and across all levels of γ.

Effects on Empirical Type I Errors of Test Statistics
When the Premeasure Contains Errors of Measurement

 

 

As mentioned earlier, the presence of errors of measurement was expected to
push the empirical t-sampling distributions of rz1w, rz2w and rz3w to the right
of a central t-distribution when β1 = β2 or β1 > β2. Also, errors of measurement
were expected to bring the empirical sampling distributions of rz1w, rz2w and
rz3w closer to the central t-distribution in situations where β1 < β2.

Table 11 shows the empirical Type I errors of the one-tailed tests of the ρzw's
when ρYoYo = 1.0 and .8 across the three combinations of β1 and β2 for c = 30,
s = 20 and γ = .2. Comparable results for the two-tailed tests are shown in
Table 12.

β1 = β2

All the empirical Type I errors of the one- and two-tailed tests using the
tzw's across both levels of ρYoYo were within 1.96 standard errors of their
corresponding nominal alphas, except for the two-tailed tests using tz2w and
tz4w, where the empirical Type I errors were conservative. In contrast to what
was expected, a pretest reliability of at least .8 does not invalidate the
rz1w, rz2w and rz3w procedures when β1 = β2.

β1 < β2

As expected, the one-tailed tests using tz1w, tz2w and tz3w were less
conservative when β1 < β2 and the reliability of the premeasure was less than
perfect. The empirical Type I errors were .013, .015 and .026 when ρYoYo = 1.0
and .036, .043 and .034 when ρYoYo = .8 at .05 nominal alpha. The empirical
Type I errors for the one- and two-tailed tests using tz4w and tz5w remained
within 1.96 standard errors of their corresponding nominal alphas across both
levels of ρYoYo. While the presence of errors of measurement did not have any
noticeable effect on the two-tailed tests for tz2w, tz4w and tz5w, less than
perfect reliability of the premeasure appeared to make the tz1w and tz3w tests
slightly too liberal for nominal .01 (e.g., the empirical Type I errors were
both .018).

Table 11

Effects of the Presence of Errors of Measurement in the Premeasure on the
Empirical Type I Errors for One-Tailed Tests of ρzw = 0 where c = 30, s = 20
and γ = .2

[Table body not legible in the source. Entries were flagged when greater than
the upper limit or smaller than the lower limit of the corresponding nominal
alpha.]

Table 12

Effects of the Presence of Errors of Measurement in the Premeasure on the
Empirical Type I Errors for Two-Tailed Tests of ρzw = 0 where c = 30, s = 20
and γ = .2

[Table body not legible in the source. Entries were flagged when greater than
the upper limit or smaller than the lower limit of the corresponding nominal
alpha.]

β1 > β2

In contrast to what was expected, the one-tailed tests using tz1w and tz3w
became less liberal in the presence of errors of measurement at the .01 and .05
nominal alphas (e.g., empirical Type I errors of .098 and .117 for ρYoYo = 1.0
but .091 and .103, respectively, for ρYoYo = .8 at .05). The expected increase
in liberalness due to errors of measurement in the pretest did occur, however,
for nominal alpha .1.

A similar decrease in liberalness was found for the two-tailed tests using
tz1w and tz3w (e.g., empirical Type I errors of .071 and .077 for ρYoYo = 1.0
but .053 and .063 for ρYoYo = .8).

Once again, the one- and two-tailed tests using tz4w and tz5w were valid across
all combinations of β1 and β2 and across both ρYoYo = 1.0 and .8.

Sample Size Effect

 

It was expected that increasing c should result in increased power. This
should not affect Type I error rates for valid tests but should increase
problems for tests that are too liberal (and maybe even for tests that are too
conservative).

Table 13 shows the empirical Type I errors of the one-tailed tests of ρzw = 0
across three levels of sample size and across three combinations of β1 and β2
for γ = .2, s = 20 and ρYoYo = .8. Comparable results for the two-tailed tests
are shown in Table 14.

Table 13

Effects of Sample Size on Empirical Type I Errors for the One-Tailed Tests of
ρzw = 0 where s = 20, γ = .2 and ρYoYo = .8

[Table body not legible in the source. Entries were flagged when greater than
the upper limit or smaller than the lower limit of the corresponding nominal
alpha.]

β1 = β2

As expected, the one-tailed tests for tz1w and tz3w became increasingly
liberal as c increased (e.g., the empirical Type I errors were .052 and .051
for c = 10, .054 and .06 for c = 30, and .071 and .081 for c = 50 at .05
nominal alpha). The other one-tailed tests remained valid across all levels of
c and across all levels of nominal alpha.

All the empirical Type I errors for the two-tailed tests were within 1.96
standard errors of .05 nominal alpha, indicating that sample size has little to
no effect on the empirical Type I errors of the two-tailed tests when β1 = β2
at .05 nominal alpha.

β1 < β2

As c increased, the one-tailed tests for tz1w, tz2w and tz3w became
increasingly conservative, particularly at .1 nominal alpha. While Type I
errors were within acceptable bounds for c = 10, the empirical Type I errors
were .075, .087 and .074 for c = 30 and .081, .1 and .071 for c = 50.
Increasing sample size did not, however, cause a problem for the validity of
tz4w and tz5w.

The effect of increasing sample size on the validity of the tests for tz1w,
tz2w and tz3w was just the opposite for two-tailed tests. For example, the
empirical Type I errors were .112, .113 and .121 at .1 nominal alpha for
c = 50, indicating that all three tests were liberal.

β1 > β2

As expected, most of the empirical Type I errors of the one-tailed tests for
tz1w, tz2w and tz3w were beyond the upper probability limits of their
corresponding nominal alphas for c = 30 and c = 50. The liberalness of these
tests increased as c increased. For example, the empirical Type I errors of the
one-tailed tests for tz1w, tz2w and tz3w were .062, .049 and .063 for c = 10,
.091, .051 and .103 for c = 30, and .112, .067 and .129 for c = 50 at .05
nominal alpha.

All of the two-tailed tests were valid when c = 10 and 30, except for tz4w,
which was conservative for c = 10 at .05 nominal alpha. As c increased to 50,
the two-tailed tests for tz1w and tz3w became liberal at .05 nominal alpha
(e.g., empirical Type I errors of .073 and .074). Surprisingly, at .1 nominal
alpha even the two-tailed tests for tz4w and tz5w became too liberal.

Table 14

Effects of Sample Size on Empirical Type I Errors for the Two-Tailed Tests of
ρzw = 0 where s = 20, γ = .2 and ρYoYo = .8

[Table body not legible in the source. Entries were flagged when greater than
the upper limit or smaller than the lower limit of the corresponding nominal
alpha.]

Class Size Effect

 

On a priori grounds it was difficult to predict the effect that varying class
size might have on the validity of the several test statistics under
investigation. As reported earlier, only the formula for ρz2w was a function of
class size, s, and there it appeared in the denominator.

Table 15 reports the empirical Type I errors of the one-tailed tests for the
tzw's across three levels of class size and across three combinations of β1 and
β2 for γ = .2, c = 30 and ρYoYo = .8. Comparable results for the two-tailed
tests are shown in Table 16.

β1 = β2

All empirical Type I errors for both one- and two-tailed tests were within
1.96 standard errors of .05 nominal alpha. Further, the liberalness of tz1w and
tz3w remained stable as s increased. These results indicate that increasing
class size does not have an effect on the validity of the tests.

 

Table 15

Effects of Class Size on Empirical Type I Errors for the One-Tailed Tests of
ρzw = 0 where c = 30, γ = .2 and ρYoYo = .8

[Table body not legible in the source. Entries were flagged when greater than
the upper limit or smaller than the lower limit of the corresponding nominal
alpha.]

Table 16

Effects of Class Size on Empirical Type I Errors for the Two-Tailed Tests of
ρzw = 0 where c = 30, γ = .2 and ρYoYo = .8

[Table body not legible in the source. Entries were flagged when greater than
the upper limit or smaller than the lower limit of the corresponding nominal
alpha.]

β1 < β2

All one- and two-tailed tests for tz1w, tz2w, tz3w, tz4w and tz5w were valid
across all levels of s at .05, except the one-tailed test for tz3w when s = 20
and 30, which was conservative (i.e., an empirical Type I error of .034 in each
case).

β1 > β2

At .05 nominal alpha, the one-tailed test for tz2w was liberal for s = 10 but
became valid as s increased (e.g., Type I error of .065 for s = 10, .051 for
s = 20 and .06 for s = 30). While this appears to be a positive effect of
increasing class size, it is important to note that the trend in Type I errors
was not monotonic at .05 nominal alpha and was not present at the other nominal
alpha levels investigated.

Empirical Power

 

The value of 0.10 was chosen for β3 in order to illustrate and contrast the
power of the five test statistics (the tzw's) for testing the hypothesis
H0: β3 = 0. The empirical powers of these test statistics were determined for
significance levels of .01, .05 and .10. Since the results on the empirical
powers of the five test procedures were similar for the three significance
levels, only the results for .05 nominal alpha are reported.
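
How an empirical Type I error or an empirical power is obtained from
replications can be sketched in a much simpler setting than the one studied
here. The example below substitutes a one-sample t-test on normal data for the
five test statistics; the sample size of 30, the effect of 0.5 standard
deviations, the seed, and the function name `empirical_rejection_rate` are all
assumptions made for illustration. The critical value 1.699 is the upper .05
point of the t-distribution with 29 degrees of freedom.

```python
import math
import random
import statistics

N = 30                     # observations per replication (assumed)
T_CRIT_ONE_TAILED = 1.699  # upper .05 point of t with N - 1 = 29 df


def empirical_rejection_rate(true_mean, replications=1000, seed=1):
    """Proportion of replications in which a one-tailed, one-sample t-test
    of H0: mean = 0 rejects at the .05 level."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(replications):
        sample = [rng.gauss(true_mean, 1.0) for _ in range(N)]
        std_err = statistics.stdev(sample) / math.sqrt(N)
        if statistics.fmean(sample) / std_err > T_CRIT_ONE_TAILED:
            rejections += 1
    return rejections / replications


if __name__ == "__main__":
    # A true mean of 0 gives an empirical Type I error;
    # a nonzero true mean gives an empirical power.
    print("empirical Type I error:", empirical_rejection_rate(0.0))
    print("empirical power:", empirical_rejection_rate(0.5))
```

The same counting applies whether the null hypothesis is true or false; only the
interpretation of the resulting proportion changes.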

In this section, only the tests which were found to have empirical Type I
errors within two standard errors of the nominal alphas when ρzw = 0 (i.e.,
when β3 = 0) are discussed. A "+" mark is used throughout the following tables
to identify these tests. The power of a test is defined as the probability of
correctly rejecting the null hypothesis. In general, five factors affect the
power of a test statistic: sample size, the discrepancy between the null and
alternative hypotheses, the error term, the size of nominal alpha, and whether
the test is a one- or two-tailed test. The conditions under which the rzw's
increase or decrease can be determined by examining column 6 of Table 2. The
population values of the ρzw's when β3 ≠ 0 are determined by

 

 

 

 

 

[Expressions for the population values of the ρzw's; the equations are not
legible in the source.]

 

Since s is not present in the formulas of ρz4w and ρz5w, one can expect that
class size will have no effect on the empirical power of the tests for tz4w and
tz5w. However, increases in class size are expected to lower the power of the
test for tz2w, particularly when β1 < β2.

Effect of Initial Confounding

 

Table 17 reports the results of the effect of initial confounding on the
empirical power of the one-tailed tests for the tzw's for c = 30, s = 20 and
ρYoYo = .8. Comparable results for the two-tailed tests are shown in Table 18.

β1 = β2

It is important to remember that only the one- and two-tailed tests for tz4w
and tz5w and the one-tailed test for tz2w were valid at .05 nominal alpha.

The empirical power of the one- and two-tailed tests for tz5w remained
essentially constant across all levels of γ (e.g., the one-tailed test's
empirical powers were .173 for γ = 0, .173 for γ = .2 and .168 for γ = .4). In
contrast, the empirical power of the one- and two-tailed tests for tz4w
decreased as γ increased (e.g., for the one-tailed test the empirical powers
decreased from .164 to .157 to .132 as γ increased from 0 to .2 to .4).
Similarly, there was evidence that the power of the one-tailed tz2w suffered
with increasing γ (e.g., empirical power of .167 for γ = 0, .171 for γ = .2 and
.148 for γ = .4).

β1 < β2

Earlier it was shown that only the one- and two-tailed tests for tz4w and tz5w
were valid given the null hypothesis.

Once again the empirical power of the one- and two-tailed tests for tz5w
remained essentially constant as γ increased (e.g., the empirical powers of the
one-tailed test were .131 for γ = 0 and .124 for both γ = .2 and γ = .4). And
again the empirical power of the one-tailed test for tz4w decreased from .139
to .120 to .10 as γ increased from 0 to .2 to .4. It should be noted that the
two-tailed tz4w and tz5w were less powerful than the one-tailed tests.

Table 17

Effects of Initial Confounding on Empirical Power for the One-Tailed Tests of
ρzw = 0 where c = 30, s = 20, β3 = .1 and ρYoYo = .8

[Table body not legible in the source. A "+" marks a valid test, i.e., one with
empirical Type I errors within two standard errors of the nominal alpha when
β3 = 0.]

Table 18

Effects of Initial Confounding on Empirical Power for the Two-Tailed Tests of
ρzw = 0 where c = 30, s = 20, β3 = .1 and ρYoYo = .8

[Table body not legible in the source. A "+" marks a valid test, i.e., one with
empirical Type I errors within two standard errors of the nominal alpha when
β3 = 0.]

β1 > β2

Only the one- and two-tailed tests for tz4w and tz5w and the two-tailed test
for tz2w were valid under the null hypothesis.

In contrast to previous results, the empirical power of the one- and two-tailed
tests for tz5w increased slightly as γ increased (e.g., empirical powers for
the one-tailed test of .157, .171 and .173 as γ increased from 0 to .2 to .4).
The empirical power of the one-tailed test for tz4w did not have a clear
relationship to γ, but the two-tailed test had essentially constant power as γ
increased (e.g., the empirical powers were near .09).

Similarly, the empirical power of the two-tailed test for tz2w was not much
affected by varying γ (e.g., empirical powers of .09, .102 and .102 as γ
increased from 0 to .2 to .4).

Effects of Errors of Measurement in the Premeasure

 

Due to the absence of error of measurement components from the formulas of
ρz4w and ρz5w, one can expect that errors of measurement in the premeasure will
not affect the empirical power of the test statistics using rz4w and rz5w.
Errors of measurement were expected to increase the power of the test
statistics for tz2w.

Table 19 reports the results of the effect of errors of measurement on the
empirical power of the several one-tailed tests for the tzw's for c = 30,
s = 20 and γ = .2. Comparable results for the two-tailed tests are shown in
Table 20.

 

 

 

Table 19

Effects of Errors of Measurement in the Premeasure on Empirical Power for the
One-Tailed Tests of ρzw = 0 where c = 30, s = 20, γ = .2 and β3 = .1

[Table body not legible in the source. A "+" marks a valid test, i.e., one with
empirical Type I errors within two standard errors of the nominal alpha when
β3 = 0.]

Table 20

Effects of Errors of Measurement in the Premeasure on Empirical Power for the
Two-Tailed Tests of ρzw = 0 where c = 30, s = 20, γ = .2 and β3 = .1

[Table body not legible in the source. A "+" marks a valid test, i.e., one with
empirical Type I errors within two standard errors of the nominal alpha when
β3 = 0.]

β1 = β2

As seen earlier, only the one- and two-tailed tests for tz2w, tz4w and tz5w
were valid, given the null hypothesis. The empirical powers of the one- and
two-tailed tests for tz2w, tz4w and tz5w increased in the presence of errors of
measurement (e.g., empirical powers of .095 and .171 for tz2w, .094 and .157
for tz4w, and .117 and .173 for tz5w).

β1 < β2

Only the one- and two-tailed tests for tz4w and tz5w were valid under the null
hypothesis.

The empirical power of the one- and two-tailed tests for tz5w decreased in the
presence of measurement error (e.g., empirical power for the one-tailed test
was .139 for ρYoYo = 1.0 and .124 for ρYoYo = .8). Similarly, the empirical
power of tz4w decreased slightly in the presence of errors of measurement
(e.g., .126 for ρYoYo = 1.0 and .120 for ρYoYo = .8).

β1 > β2

Earlier it was shown that only the one- and two-tailed tests for tz4w and tz5w
were valid given the null hypothesis.

The empirical power of both the one- and two-tailed tests for tz4w and tz5w
increased in the presence of errors of measurement (e.g., .089 and .102 for
ρYoYo = 1.0 and .159 and .171 for ρYoYo = .8).

Sample Size Effect

 

It was predicted that the empirical power of all the one- and two-tailed tests
for the tzw's would increase as c increased. Table 21 gives the results of the
effect of sample size on the empirical power of the one-tailed tests for the
tzw's for s = 20, γ = .2 and ρYoYo = .8. Comparable results for the two-tailed
tests are shown in Table 22.
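
The predicted growth of power with sample size can be illustrated with the
Fisher z approximation for a one-tailed test that a correlation is zero. This
is a textbook approximation rather than the procedure used in the study, and it
treats `n` as a number of independent observations; the population correlation
of .3 and the function name `approx_power` are hypothetical.

```python
import math
from statistics import NormalDist


def approx_power(rho, n, alpha=0.05):
    """Approximate power of a one-tailed test of H0: correlation = 0
    against a true correlation rho, using Fisher's z transformation."""
    z_crit = NormalDist().inv_cdf(1.0 - alpha)
    # atanh(r) is approximately normal with standard error 1 / sqrt(n - 3)
    noncentrality = math.atanh(rho) * math.sqrt(n - 3)
    return 1.0 - NormalDist().cdf(z_crit - noncentrality)


if __name__ == "__main__":
    for n in (10, 30, 50):  # same values as the three levels of c
        print(f"n = {n}: approximate power = {approx_power(0.3, n):.3f}")
```

When `rho` is zero the function returns the nominal alpha for every `n`, which
mirrors the expectation that sample size should not change the Type I error of
a valid test.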

Table 21. Effects of sample size on empirical power for the one-tailed tests of
t_zw's (s = 20, γ = .2, ρ_Y0Y0 = .8). [Table body not recoverable from the scan.]

Table 22. Effects of sample size on empirical power for the two-tailed tests of
t_zw's (s = 20, γ = .2, ρ_Y0Y0 = .8). [Table body not recoverable from the scan.]

Earlier it was shown that only the one- and two-tailed tests for t_z2w, t_z4w
and t_z5w were valid under the null hypothesis when β1 = β2 or β1 < β2.

The empirical power of these tests increased as c increased (e.g., the
empirical powers of the one-tailed tests for t_z2w, t_z4w and t_z5w at c = 50
were 234, 243 and 214 percent of their powers at c = 10 in situations where
β1 = β2).

Only the one- and two-tailed tests for t_z4w and t_z5w were valid, given the
null hypothesis, in situations where β1 > β2. Again, the empirical power of
these tests increased as c increased (e.g., the empirical powers of the
one-tailed tests for t_z4w and t_z5w were .083, .082 for c = 10; .159, .171 for
c = 30; and .217, .231 for c = 50).
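
The growth in power with the number of classes can also be expressed relative
to the c = 10 baseline. The following is a minimal sketch of that arithmetic,
not part of the original analysis. It takes the c = 10 powers for t_z4w and
t_z5w to be .083 and .082 (an assumption: they must lie below the c = 30 values
if power increases with c), and the names `powers` and `relative_power` are
illustrative only.

```python
# One-tailed empirical powers quoted in the text, keyed by number of classes c.
# Each tuple is (t_z4w, t_z5w). The c = 10 values are assumed to be .083, .082.
powers = {10: (0.083, 0.082), 30: (0.159, 0.171), 50: (0.217, 0.231)}


def relative_power(powers, baseline):
    """Express every power as a rounded percentage of the power at `baseline`."""
    base = powers[baseline]
    return {
        c: tuple(round(100 * p / b) for p, b in zip(values, base))
        for c, values in powers.items()
    }


ratios = relative_power(powers, baseline=10)
for c in sorted(ratios):
    print(c, ratios[c])
```

Under these assumed inputs, the powers at c = 50 are roughly 2.6 and 2.8 times
the corresponding powers at c = 10.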

Class Size Effect

 

It was expected that increased class size would have no effect on the
empirical power of the tests for t_z4w and t_z5w, but that the empirical power
of the test for t_z2w would drop as s increased.

Table 23 reports the results for the effect of the number of students per class
on the empirical power of the one-tailed tests for the t_zw's for c = 30,
γ = .2 and ρ_Y0Y0 = .8. Comparable results for the two-tailed tests are given
in Table 24.

 

As was shown earlier, only the one- and two-tailed tests for t_z2w, t_z4w and
t_z5w were valid, given the null hypothesis.

The changes in the empirical power of the one-tailed tests for t_z4w and
t_z5w were not large, but in each case power increased monotonically with s
(e.g., empirical powers at s = 30 were 112 and 119 percent of the empirical
powers when s = 10, and 102 and 103 percent of the empirical powers when
s = 20). There was, however, no clear relationship between power and class
size for t_z2w (empirical powers of .160, .171 and .164 as s increased).
Similar but less pronounced relationships between power and class size were
found for the two-tailed tests.

Table 23. Effects of class size on empirical power for the one-tailed tests of
t_zw's (c = 30, γ = .2, ρ_Y0Y0 = .8). [Table body not recoverable from the scan.]

Table 24. Effects of class size on empirical power for the two-tailed tests of
t_zw's (c = 30, γ = .2, ρ_Y0Y0 = .8). [Table body not recoverable from the scan.]

β1 < β2

Only the one- and two-tailed tests for t_z2w, t_z4w and t_z5w had empirical
Type I errors within two standard errors of the nominal values when β3 = 0.
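
The validity criterion used throughout this chapter (an empirical Type I error
rate within two standard errors of the nominal alpha) can be sketched as
follows. This is an illustration only: the binomial standard error
sqrt(alpha(1 - alpha)/R) is assumed, the number of Monte Carlo replications
R = 1000 is a hypothetical value rather than one taken from this study, and the
function names are not from the original.

```python
import math


def type1_band(alpha, n_reps, width=2.0):
    """Return (lower, upper) bounds: alpha +/- `width` binomial standard errors."""
    se = math.sqrt(alpha * (1.0 - alpha) / n_reps)
    return alpha - width * se, alpha + width * se


def classify(empirical_rate, alpha, n_reps):
    """Label a test as conservative, valid or liberal at the nominal alpha."""
    lower, upper = type1_band(alpha, n_reps)
    if empirical_rate < lower:
        return "conservative"
    if empirical_rate > upper:
        return "liberal"
    return "valid"


# Hypothetical empirical rejection rates under beta3 = 0, with R = 1000 assumed.
n_reps = 1000
band = type1_band(0.05, n_reps)
labels = [classify(rate, 0.05, n_reps) for rate in (0.030, 0.055, 0.080)]
print(band, labels)
```

With these assumed settings the band at the .05 level runs from about .036 to
.064, so a rate of .055 would count as valid while .030 and .080 would be
labeled conservative and liberal, respectively.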

The empirical powers of the one-tailed tests for t_z2w, t_z4w and t_z5w
tended to drop slightly as s increased, particularly from 20 to 30 (e.g.,
empirical powers of .114, .113, .106 for t_z2w; .121, .120, .106 for t_z4w; and
.126, .124, .112 for t_z5w as s increased from 10 to 20 to 30). However, the
relationship between power and class size for the two-tailed tests for t_z2w,
t_z4w and t_z5w was not clear (e.g., empirical powers of .077, .066, .074 for
t_z2w; .082, .071, .076 for t_z4w; and .075, .082, .080 for t_z5w as s
increased from 10 to 20 to 30).

β1 > β2

Only the one- and two-tailed tests for t_z4w and t_z5w had empirical Type I
errors within two standard errors of the nominal alphas when β3 = 0.

The empirical power of t_z4w and t_z5w tended to increase with class size,
though this relationship was most evident for the one-tailed tests at alpha = .1
(e.g., empirical powers of .211, .257, .271 for t_z4w and .234, .272, .286 for
t_z5w).

 

 

CHAPTER VI
SUMMARY AND CONCLUSIONS

The purpose of this investigation was to determine the conditions under
which testing H0: ρ_zw = 0 is equivalent to testing for no teacher behavior
effect. Five different methods for defining Z were investigated under a variety
of conditions defined by varying (a) the amount of initial confounding, (b) the
presence of errors of measurement in the premeasure, (c) sample size, (d) class
size and (e) the relationship between β1 (i.e., the structural slope of class
effect at time t on class effect at time 0) and β2 (i.e., the structural slope
of within-class deviation at time t on within-class deviation at time 0).

A linear structural model which incorporates the hierarchical nature of the
data and the possibility of measurement errors was provided in Chapter III to
determine analytically the conditions under which testing ρ_zw = 0 is
equivalent to testing β3 = 0. The results showed that the two null hypotheses
are equivalent if either of the following conditions is met: (1) γ = 0 (i.e., no
initial confounding of teacher behavior and class composition), or (2) β1 = β2,
given a perfectly reliable premeasure. This equivalence between ρ_zw = 0 and
β3 = 0 holds regardless of whether Z is defined using K set to the total,
between or within regression coefficient.
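
The three choices of K named above can be illustrated with a short sketch. It
assumes the usual residual gain form Z = Y_t - K * Y_0 (post-measure minus K
times premeasure), which is not restated in this chapter. The function names
and the toy data are hypothetical, and the sketch does not reproduce the five
definitions of Z examined in the study. The toy data are built so that class
means follow a slope of .3 and within-class deviations a slope of .7.

```python
import numpy as np


def regression_slopes(pre, post, class_id):
    """Total, between-class and within-class slopes of post on pre."""
    pre = np.asarray(pre, dtype=float)
    post = np.asarray(post, dtype=float)
    _, idx = np.unique(np.asarray(class_id), return_inverse=True)
    counts = np.bincount(idx)
    # Class means, expanded back to one value per student.
    pre_mean = (np.bincount(idx, weights=pre) / counts)[idx]
    post_mean = (np.bincount(idx, weights=post) / counts)[idx]

    def slope(x, y):
        return float(np.dot(x, y) / np.dot(x, x))

    return {
        "total": slope(pre - pre.mean(), post - post.mean()),
        "between": slope(pre_mean - pre.mean(), post_mean - post.mean()),
        "within": slope(pre - pre_mean, post - post_mean),
    }


def residual_gain(pre, post, k):
    """Residual gain score Z = post - k * pre (assumed form)."""
    return np.asarray(post, dtype=float) - k * np.asarray(pre, dtype=float)


# Toy data: 3 classes of 3 students each.
class_id = np.repeat([0, 1, 2], 3)
class_mean = np.array([1.0, 2.0, 4.0])[class_id]
deviation = np.tile([-1.0, 0.0, 1.0], 3)
pre = class_mean + deviation
post = 0.3 * class_mean + 0.7 * deviation

slopes = regression_slopes(pre, post, class_id)
z_between = residual_gain(pre, post, slopes["between"])
print(slopes)
```

In this toy example the total slope falls between the between-class and
within-class slopes, which is why the choice of K matters whenever the two
structural slopes differ.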

A Monte Carlo approach was taken to investigate the appropriateness of
the different test statistics, the t_zw's, for testing the hypothesis of no
teacher behavior effect on student achievement. As expected, when γ = 0 and
β3 = 0, all of the mean estimates of the ρ_zw's were near zero. Further, the
empirical distributions of the t statistics for the different forms of the
r_zw's were close to their theoretical t-distribution across all combinations
investigated. Finally, all of the test statistics were valid, and t_z1w, t_z3w
and t_z5w had greater empirical power than t_z2w and t_z4w.
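
For reference, a common form of the t statistic for testing that a correlation
is zero is t = r * sqrt(n - 2) / sqrt(1 - r^2), referred to a t-distribution
with n - 2 degrees of freedom. The sketch below illustrates that textbook form
only. Whether the t_zw's in this study were computed in exactly this way, and
with n equal to the number of classes, is an assumption, and the function name
and data are hypothetical.

```python
import math


def corr_t_statistic(x, y):
    """Return (r, t, df) for the test of zero correlation between x and y."""
    n = len(x)
    if n != len(y) or n < 3:
        raise ValueError("x and y must have the same length, at least 3")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    r = sxy / math.sqrt(sxx * syy)
    if abs(r) >= 1.0:
        raise ValueError("t is undefined when |r| = 1")
    t = r * math.sqrt((n - 2) / (1.0 - r * r))
    return r, t, n - 2


# Hypothetical class-level values: residual gain means (z) and a
# teacher behavior measure (w) for five classes.
z = [1.0, 2.0, 3.0, 4.0, 5.0]
w = [2.0, 1.0, 4.0, 3.0, 5.0]
r, t, df = corr_t_statistic(z, w)
print(r, t, df)
```

In a Monte Carlo study of this kind, the empirical Type I error rate or power
is the proportion of replications in which such a statistic exceeds the
critical value.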

Increasing the amount of initial confounding, γ, caused the mean
estimates of ρ_z1w, ρ_z2w and ρ_z3w to depart from zero, but did not affect the
mean estimates of ρ_z4w and ρ_z5w. Results for the empirical Type I error rates
paralleled, for the most part, the empirical results for the values of the
ρ_zw's. Increasing γ caused t_z1w and t_z3w to be centered to the right of the
theoretical t-distribution when β1 = β2 or β1 > β2 and to its left when
β1 < β2. This caused the tests to be too liberal in the first two cases and too
conservative in the third case for one-tailed tests. For two-tailed tests,
t_z1w and t_z3w were again too liberal when β1 > β2 but valid for the other two
relationships between β1 and β2. Results for the empirical Type I error rates
indicated that increasing γ caused the one-tailed test for t_z2w to be too
conservative when β1 = β2, the one- and two-tailed tests to be too conservative
when β1 < β2, and the one-tailed test to be too liberal when β1 > β2. The only
tests for which empirical Type I error rates were not affected by increasing
the amount of initial confounding were t_z4w and t_z5w. It can be concluded
that as γ increased, only t_z2w, t_z4w and t_z5w had empirical Type I errors
within two standard errors of the nominal alphas when β1 = β2, and only t_z4w
and t_z5w did so for the other relationships between β1 and β2. However, in
situations where t_z2w was a valid test, it had greater power than t_z4w but
less than t_z5w.

Errors of measurement in the premeasure caused the mean estimates of
ρ_z1w, ρ_z2w and ρ_z3w to depart slightly from zero when β1 = β2 and to become
closer to zero, at least for t_z1w and t_z3w, when β1 < β2. However, errors of
measurement did not affect the mean estimates of ρ_z4w or ρ_z5w. The one-tailed
tests using t_z1w and t_z3w became less conservative as a result of the
presence of errors of measurement when β1 < β2.

The effects of errors of measurement on the two-tailed tests were not the
same as those for the one-tailed tests. For example, errors of measurement
brought the empirical Type I errors for t_z1w and t_z3w closer to the nominal
alphas when β1 > β2. Concerning the power of the tests with valid Type I errors,
power for t_z4w and t_z5w tended to increase in the presence of errors of
measurement in the premeasure when β1 = β2 or β1 > β2 but decreased when
β1 < β2. Further, t_z5w had greater power than t_z4w across all combinations of
β1 and β2.

Sample size was found to have no effect on the mean estimates of the ρ_zw's
across all combinations of β1 and β2. Increasing sample size affected empirical
Type I error rates for the one-tailed tests using t_z1w and t_z3w (i.e., the
tests were too liberal when β1 = β2 or β1 > β2 but too conservative when
β1 < β2). While the results for the empirical Type I error rates for the
two-tailed tests paralleled those for the one-tailed tests when β1 > β2, they
differed in situations where β1 = β2 or β1 < β2. Except for the one-tailed
tests when β1 > β2, increasing sample size did not affect the empirical Type I
error for t_z2w across all combinations of β1 and β2 at the .05 nominal alpha.
Statistics t_z4w and t_z5w were the only tests to remain valid as sample size
increased. The power of these two tests increased with sample size, and in most
cases t_z5w had greater power than t_z4w.

The number of students within each class had no effect on the mean estimates
of any of the ρ_zw's. It also had no effect on the one- and two-tailed tests
across all combinations of β1 and β2: in general, the tests which were valid,
conservative or liberal when classes were small remained so when class size
increased. Statistics t_z4w, t_z5w and, in some cases, t_z2w were the only
tests which had empirical Type I errors within two standard errors of the
nominal alpha when β3 = 0. For these tests, the empirical power tended to
increase with class size when β1 = β2 or β1 > β2 and to drop slightly when
β1 < β2.

In conclusion, when students are randomly assigned to classrooms, or when
β1 = β2 and the premeasure has perfect reliability, testing H0: ρ_zw = 0 is
equivalent to testing for no teacher behavior effect. This equivalence holds
regardless of whether Z is defined using K set to the total, between or within
regression weight. However, when students are not randomly assigned to
classrooms (i.e., γ ≠ 0), which is typically the case in practice, the test
statistics t_z1w, t_z2w and t_z3w were valid only in a few situations. In
general these tests, particularly t_z1w and t_z3w, tended to be too liberal in
situations where β1 > β2 (the typical case in education) and too conservative
when β1 < β2. Interestingly, the only tests that were valid for all conditions
investigated were the tests for t_z4w and t_z5w. Since K is usually unknown in
practice, the procedure of choice should be t_z4w. In addition to being valid,
it affords an estimate of K rather than requiring K to be known a priori.

Increasing sample size and the presence of errors of measurement increased
the empirical power of t_z4w and t_z5w. Their empirical power increased slightly
with class size when β1 = β2 or β1 > β2 but decreased slightly when β1 < β2.
While the empirical power of t_z5w remained constant as γ increased when
β1 = β2 or β1 < β2, the empirical power of t_z4w decreased. In situations where
β1 > β2, the empirical power of t_z5w increased but that of t_z4w remained
constant. The results for the two-tailed tests were not in complete agreement
with those for the corresponding one-tailed tests. A possible explanation is
that the distributions of the test statistics may be skewed.

The results of this investigation are limited to the parameter values
chosen. In other words, if some parameter values were changed, such as ρ_Y0Y0
and the relative magnitudes of β1 and β2, some of the results would be
different. For example, the satisfactory results based on using t_z2w were
functions of the chosen parameter values. If the chosen values of ρ_Y0Y0, β1
and β2 had been .9, .3 and .9 instead of .8, .3 and .7 when β1 < β2, the value
of ρ_z2w would change from .003 to .06. The test statistic t_z2w might then be
too conservative instead of valid for these new parameters.

The results of this study indicate that procedures used by process-product
researchers in forming residual gain scores typically provide misleading results.
Sometimes the test statistics used are too liberal and at other times they are
too conservative. Therefore, it is recommended that process-product researchers
who wish to test for no teacher behavior effect use t_z4w, which involves
setting K = β̂1. In addition to yielding valid Type I error rates across all
conditions investigated, the procedure had reasonable power (though not as good
as if K were a known constant).

 

 

 

 

BIBLIOGRAPHY

 

 

BIBLIOGRAPHY

Brophy, J. E., & Evertson, C. M. Process-product correlations in the Texas
teacher effectiveness study; final report. Report No. 74—4, University of
Texas at Austin, 1974.

Bryk, A. 5., 6c Weisberg, H. Use of the non-equivalent control group design when
subjects are growing. Psychological Bulletin, 1977, g, 950-962.

Creemers, B. Evaluation of styles of teaching initial reading: An investigation
into the relationship between teachers' use of a specific method and pupil
achievement (with summary in English). Utrecht: Drukkerij Elinkwijk B.
V., 1974.

Creemers, B. or Weeda, P. Evaluation of teaching styles: An investigation into
the relationship between pupil achievement and teachers' use of a method
for teaching initial reading. Unpublished manuscript, State University of
Utrecht, Netherlands, 1974.

Cronbach, L. I]. Research on classrooms and scholars: Formulation of questions,
design and analysis. An occasional paper of the Stanford Evaluation
Consortium, Stanford University, 1976.

Cronbach, L. 3., 6c Furby, L. How should we measure "change"-—or should we?

Psychological Bulletin, 1970, B, 68-80.

Doyle, W. Classroom tasks and students‘ ability. In P. L. Peterson 6c H. J.
Walberg (Eds.), Research on teaching. Berkley, CA: McCutchan, I978.

Draper, N. R., 6: Smith, H. Applied regression analysis. New York: John Wiley
6r Sons, 1981.

 

Dunkin, M. 3., dc Biddle, B. J. The study of teaching. New York: Holt, Rinehart,
8r Winston, 1974.

Ebel, R. L. Essentials of educational measurement. Englewood Cliffs, NJ:
Prentice-Hall, 1979.

 

Elashoff, J. D. Analysis of covariance: A delicate instrument. American
Educational Research Journal, 1968, _6, 383—401.

 

Gage, R. M. Essentials of learning for instruction. Hinsdale: The Dryden Press,
1977.

 

Haney, W. Units of analysis issues in the evaluation of project follow through or
there must be heresy in there some place (Contract No. OEC-0-74-0394).
Cambridge, MA: Huron Institute, 1974.

73

 

 

 

 

 

74

Hays, W. Statistics for the Social Sciences. New York: Holt, Rinehart, 6c
Winston, 1973.

 

Hornquist, K. Relative changes in the intelligence form 13 to 18. Scandinavian
Journal of Psychology, 1968, 2, 50-82.

IMSL Library. The IMSL Libarary, Vol. 3. Houston: International mathematical
and Statistical Libraries, 1982.

Kenny, D. A quasi-experimental approach to assessing treatment effects in the
non-equivalent control group design. Psychological Bulletin, 1975, 8_2, 345-
362.

Kessler, R. C. The use of change scores as criteria in longitudinal survey

research. Quality and quantity, 1977, l_l, 43-66.

Linn, R. L., (St Slinde, J. A. The determination of the significance of change
between pre— and posttesting periods. Review of Educational Research,
1977, g, 121—150.

 

Lord, F. M. Elementary models for measuring change. In E. W. Harris (Ed.),
Problems in measuring change. Madison, WI: Unviersity of Wisconsin
Press, 1963.

 

Lord, F. M. A paradox in the interpretation of group comparisons. Psychological
Bulletin, 1967, Q, 304—305.

Lord, F. M. A paradox of interpretation of group comparisons. Psychological
Bulletin, 1969, 68, 304-305.

Olejink, S. F., 6: Porter, A. C. Bias and mean square errors of estimators as
criteria for evaluating competing analysis strategies in quasi-experiments.
Journal of Educational Statistics, 1981, _6, 33-53.

 

Porter, A. C. Analysis strategies for some common evaluation paradigms. Paper
presented at the Annual Convention of the American Educational Research
Association, New Orleans, 1973

Porter, A. C. The effcts of using fallible variables in the analysis of covariance.
Unpublished Ph.D. dissertation, University of Wisconsin, 1967.

 

Porter, A. C. How errors of measurement affect ANOVA, regression analysis,
ANCOVA, and factor analysis. Paper presented at the Annual Convention
of the American Educational Research Association, New York, 1971.

Porter, A. C., (St Chibucos, T. R. Selecting an analysis strategy. In G. Borich
(Ed.), Evaluation Educational Programs and Products. New York:
Educational Technology Press, 1974.

 

Rosenshine, B. The stability of teacher effects upon student achievement.
Review of Educational Research, 1970, Q, 647-662.

 

75

Soar, R. S. An integrative approach to classroom learning (Grant No. 5-Rll MH
020045). Philadelphia: Temple University, 1966.

Tucker, L. R., Damarin, F., or Messick, S. A base-free measure of change.
Psychometrika, 1966, 3_l, 457-473.

 

Veldman, D. C., 6: BrOphy, J. C. Measuring teacher effects on pupil
achievement. Journal of Educational Psychology, 1974, 3, 319-324.

 

 

 

 

   
      
