This is to certify that the thesis entitled

RESIDUAL GAIN SCORES AS A CRITERION FOR CHANGE: INFERENTIAL PROBLEMS

presented by

Khalil Elaian

has been accepted towards fulfillment of the requirements for the Ph.D. degree in Educational Statistics and Research Design; Department of Counseling, Educational Psychology, and Special Education

Major professor

MSU is an Affirmative Action/Equal Opportunity Institution

RESIDUAL GAIN SCORES AS A CRITERION FOR CHANGE: INFERENTIAL PROBLEMS

by

Khalil Elaian

A DISSERTATION

submitted to Michigan State University in partial fulfillment of the requirements for the degree of

DOCTOR OF PHILOSOPHY

Department of Counseling, Educational Psychology, and Special Education

1984

COPYRIGHT BY KHALIL ELAIAN 1984

ABSTRACT

RESIDUAL GAIN SCORES AS A CRITERION FOR CHANGE: INFERENTIAL PROBLEMS

by

Khalil Elaian

To determine the effect of teacher behaviors, W, in process-product research, residual gain scores, Z, are often used as the criterion. Significant correlations between class means of residualized gain scores and teacher behaviors, $r_{zw}$'s, are taken as evidence of teacher effects.
The purposes of the study were to determine the conditions under which testing $H_0: \rho_{zw} = 0$ is, in fact, equivalent to testing for no teacher behavior effect, and to investigate the appropriateness of using different definitions of residual gain scores in testing the null hypothesis. Five different forms of residualized gains were considered, based on the total regression coefficient, $Z_1$; the between regression coefficient, $Z_2$; the within regression coefficient, $Z_3$; a newly derived estimate of the regression of posttest class effects on pretest class effects, $Z_4$; and the parameter for the class effects regression coefficient, $Z_5$.

A linear structural model was built to determine the conditions under which testing $\rho_{zw} = 0$ is equivalent to testing no teacher behavior effect on student achievement. The analytic results showed that the two null hypotheses are equivalent if either of the following conditions is met: (a) there is no initial confounding of teacher behavior and class composition, or (b) the slope of posttest class effect on pretest class effect, $\beta_1$, is equal to the slope of posttest on pretest within classes, $\beta_2$, and the pretest is perfectly reliable. When the conditions are not met, however, the two null hypotheses are equivalent only for $Z_4$ and $Z_5$.

A Monte Carlo approach was taken to investigate the appropriateness of using different $r_{zw}$'s in testing the hypothesis of no teacher behavior effect. Three criteria were considered: (a) the mean estimates of $\rho_{zw}$'s, (b) empirical Type I error rates, and (c) empirical power. Parameters varied in the study were the degree of initial confounding, the reliability of the pretest, the number of classrooms, and the number of students in a classroom.

The results of the study showed that when there was a substantial amount of initial confounding, the test statistics using $r_{z_1w}$, $r_{z_2w}$, and $r_{z_3w}$ were valid in only a few situations.
These tests, particularly the tests using $r_{z_1w}$ and $r_{z_3w}$, tended to be too liberal in situations where $\beta_1 = \beta_2$ or $\beta_1 > \beta_2$ and too conservative when $\beta_1 < \beta_2$. Parallel results for the tests using $r_{z_1w}$ and $r_{z_3w}$ were obtained with increasing sample size. However, the test statistics using $r_{z_4w}$ and $r_{z_5w}$ were the only tests which remained valid as initial confounding, sample size, and class size increased and in the presence of errors of measurement. Also, the results indicated that increasing sample and class size increased the empirical power of both $r_{z_4w}$ and $r_{z_5w}$ in situations where $\beta_1 = \beta_2$ or $\beta_1 > \beta_2$.

It was concluded that procedures used by process-product researchers in forming residual gain scores typically provide misleading results. Sometimes the test statistics used are too liberal and other times they are too conservative. Therefore, it is recommended that process-product researchers who wish to test for no teacher behavior effect use $Z_4$. In addition to yielding valid Type I error rates across all conditions investigated, the procedure had reasonable power and does not have the unrealistic requirement of knowing the value of a parameter a priori.

DEDICATION

To my wife, Nasrin Bakir, and my son, Rami Elaian.

ACKNOWLEDGEMENTS

I am deeply grateful to Professor Andrew C. Porter, my academic advisor and dissertation chairperson, for his generous and endless advice, encouragement, support, and patience.

I would like to thank Professor Richard Houang for serving as a member of my dissertation committee and for the generous help he has given me with the dissertation. I would also like to thank Professors Robert Floden and James Stapleton for serving on my committee.

Working in the Office of Research Consultation (ORC) has provided me with invaluable experience. Many thanks to Professor Joe L. Byers who hired me in the ORC.

I would further like to thank the University of Jordan for four years of financial support.
I wish to acknowledge the moral support of my parents, wife, and friends.

Finally, I wish to thank Barbara Reeves for her typing of the final manuscript.

TABLE OF CONTENTS

List of Tables
List of Figures
CHAPTER I: STATEMENT OF THE PROBLEM
    Definition of Residual Gain Scores
    Research Questions
CHAPTER II: REVIEW OF THE LITERATURE
    Alternative Indices of Responses
    Uses of Indices of Responses
CHAPTER III: THE ANALYTIC CHAPTER
    A Linear Structural Model for Process/Product Research
    Defining Values of K
    Relationships Between Regression Coefficients and the Structural Model
    Conditions Under Which $\gamma((\beta_1 - \beta_2)\sigma^2_{V_0} + \beta_1\sigma^2_{e_0}) = 0$
    Distributions of Test Statistics
CHAPTER IV: SIMULATION PROCEDURE
    Simulation Parameters
    Data Generation Routine
CHAPTER V: RESULTS OF THE EMPIRICAL INVESTIGATION
    Mean Estimates of $\rho_{zw}$ when $\beta_3 = 0$
    Initial Confounding Effects
    Effects of Presence of Errors of Measurement ($\rho_{y_0y_0} \neq 1$)
    Sample and Class Size Effect
    Empirical Type I Errors for One and Two Tailed T-Test When Testing $H_0: \rho_{zw} = 0$
    Initial Confounding Effect
    Empirical Type I Errors of Test Statistics when the Premeasure Contains Errors of Measurement
    Sample Size Effect
    Class Size Effect
    Empirical Power
    Effects of Initial Confounding
    Effects of Presence of Errors of Measurement in the Premeasure
    Sample Size Effect
    Class Size Effect
CHAPTER VI: SUMMARY AND CONCLUSIONS
Bibliography
LIST OF TABLES

1. The Total Variance-Covariance Matrix ($\Sigma$)
2. Relationships Between Regression Coefficients and the $\rho_{zw}$ to the Structural Coefficients
3. Design of the Study
4. Between ($\Sigma_B$), Within ($\Sigma_W$) and Errors of Measurement ($\Sigma_e$) Variance-Covariance Matrices
5. Means of Empirical Sampling Distributions of $\rho_{zw}$'s for Different Combinations of $\beta_1$, $\beta_2$ and $\gamma$ for c = 30, s = 20, $\rho_{y_0y_0}$ = .8 and $\beta_3$ = 0.00
6. Means of Empirical Sampling Distributions of $\rho_{zw}$'s for $\rho_{y_0y_0}$ = .8
7. Effects of Sample Size on Mean Estimates of $\rho_{zw}$'s for $\gamma$ = .2
8. Effects of Class Size on Mean Estimates of $\rho_{zw}$'s for $\gamma$ = .2, $\rho_{y_0y_0}$ = .8, c = 30, and $\beta_3$ = 0
9. Effects of Initial Confounding on Empirical Type I Errors for the One-Tailed Tests of $\rho_{zw}$'s where c = 30, s = 20 and $\rho_{y_0y_0}$ = .8
10. Effects of Initial Confounding on Empirical Type I Errors for the Two-Tailed Tests of $\rho_{zw}$'s where c = 30, s = 20 and $\rho_{y_0y_0}$ = .8
11. Effects of the Presence of Errors of Measurement in the Premeasure on the Empirical Type I Errors for One-Tailed Tests of $\rho_{zw}$'s where c = 30, s = 20 and $\gamma$ = .2
12. Effects of the Presence of Errors of Measurement in the Premeasure on the Empirical Type I Errors for Two-Tailed Tests of $\rho_{zw}$'s where c = 30, s = 20 and $\gamma$ = .2
13. Effects of Sample Size on Empirical Type I Errors for One-Tailed Tests of $\rho_{zw}$'s where s = 20, $\gamma$ = .2 and $\rho_{y_0y_0}$ = .8
14. Effects of Sample Size on Empirical Type I Errors for Two-Tailed Tests of $\rho_{zw}$'s where s = 20, $\gamma$ = .2 and $\rho_{y_0y_0}$ = .8
15. Effects of Class Size on Empirical Type I Errors for One-Tailed Tests of $\rho_{zw}$'s where c = 30, $\gamma$ = .2 and $\rho_{y_0y_0}$ = .8
16. Effects of Class Size on Empirical Type I Errors for Two-Tailed Tests of $\rho_{zw}$'s where c = 30, $\gamma$ = .2 and $\rho_{y_0y_0}$ = .8
17. Effects of Initial Confounding on Empirical Power for One-Tailed Tests of $\rho_{zw}$'s = 0 where c = 30, s = 20, $\beta_3$ = .1 and $\rho_{y_0y_0}$ = .8
18. Effects of Initial Confounding on Empirical Power for Two-Tailed Tests of $\rho_{zw}$'s = 0 where c = 30, s = 20, $\beta_3$ = .1 and $\rho_{y_0y_0}$ = .8
19. Effects of Errors of Measurement in the Premeasure on Empirical Power for the One-Tailed Tests of $\rho_{zw}$'s = 0 where c = 30, s = 20, $\gamma$ = .2, and $\beta_3$ = .1
20. Effects of Errors of Measurement in the Premeasure on Empirical Power for the Two-Tailed Tests of $\rho_{zw}$'s = 0 where c = 30, s = 20, $\gamma$ = .2, and $\beta_3$ = .1
21. Effects of Sample Size on Empirical Power for One-Tailed Tests of $\rho_{zw}$'s where s = 20, $\gamma$ = .2, $\beta_3$ = .1 and $\rho_{y_0y_0}$ = .8
22. Effects of Sample Size on Empirical Power for Two-Tailed Tests of $\rho_{zw}$'s where s = 20, $\gamma$ = .2, $\beta_3$ = .1 and $\rho_{y_0y_0}$ = .8
23. Effects of Class Size on Empirical Power for One-Tailed Tests of $\rho_{zw}$'s where c = 30, $\gamma$ = .2, $\beta_3$ = .1 and $\rho_{y_0y_0}$ = .8
24. Effects of Class Size on Empirical Power for Two-Tailed Tests of $\rho_{zw}$'s where c = 30, $\gamma$ = .2, $\beta_3$ = .1 and $\rho_{y_0y_0}$ = .8

LIST OF FIGURES

1. A structural model

CHAPTER I

STATEMENT OF THE PROBLEM

Residual gain scores (RGS) are often used as a criterion for measuring change in educational research. For example, in reference to evaluating teacher effectiveness, Veldman and Brophy (1974) state "it is generally accepted that residual gain scores are superior to simple pretest-posttest difference scores as measures of teacher influence" (p. 321).

Process/product research on teaching can be used to illustrate the practice of setting residual gain scores as a criterion for study (Gage, 1977).
Process refers to teaching behavior and product refers to student learning. Residual gain scores are used as a product variable which is meant to control for initial differences among classrooms in their compositions of students. The residual gain scores are typically constructed from student pre- and post-instruction achievement test scores.

Brophy and Evertson's (1974) two-year replicated study conducted at the University of Texas provides a specific example of using residualized gain scores as the criterion in process/product research. Thirty teachers were included in the first year, and 28 in the second year. Classroom observations were made to assess teacher behavior. Scores on five subtests of the Metropolitan Achievement Test (MAT) were available for each student. The MAT obtained in the first year was used as the pretest and the MAT for the second year as the posttest. For each student, predicted values of the posttest scores were determined from the pretest scores based on the total sample regression line. Residual gain scores were computed by subtracting predicted posttest scores from the actual posttest scores. To determine the effects of teacher behaviors, Pearson's product-moment correlation coefficients between process variables and average residual gain scores, aggregated by teacher, were obtained. The sample correlations were tested for significance by t-tests with c - 2 degrees of freedom, where c refers to the number of teachers. The null hypothesis that Brophy and Evertson intended to test was that teacher behavior had no effect on student achievement.

The aim of the present study is to investigate the appropriateness of using residualized gain scores in order to determine the process/product relationship.

Definition of Residual Gain Scores

Consider the model used in forming residual gain scores:

(1) $Z = Y(t) - K\,Y(0)$

where $Y(0)$ is the measure at time 0, $Y(t)$ is the measure at time t, and K is an adjustment coefficient.
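The procedure described above (total-sample regression, residuals averaged by teacher, correlation with W, and a t statistic on c - 2 degrees of freedom) can be sketched in Python as follows. This is only an illustrative sketch, not the code used in any of the cited studies: the function name `residual_gain_test`, the simulated data, and every numeric value in the usage lines are assumptions introduced here. An intercept is included because the simulated scores are not centered at zero.

```python
import numpy as np

def residual_gain_test(y0, yt, class_id, w):
    """Correlate class-mean residual gains with a teacher measure w.

    w must hold one value per class, ordered like np.unique(class_id).
    Returns (k_hat, r, t, df), where k_hat is the total-sample slope.
    """
    y0 = np.asarray(y0, dtype=float)
    yt = np.asarray(yt, dtype=float)
    class_id = np.asarray(class_id)
    # Total-sample regression of posttest on pretest (K is estimated, not known).
    k_hat = np.cov(y0, yt, ddof=1)[0, 1] / np.var(y0, ddof=1)
    intercept = yt.mean() - k_hat * y0.mean()
    # Residual gain: actual posttest minus predicted posttest.
    z = yt - (intercept + k_hat * y0)
    # Aggregate residual gains to the class (teacher) level.
    classes = np.unique(class_id)
    z_bar = np.array([z[class_id == c].mean() for c in classes])
    # Pearson correlation between class-mean residual gain and teacher behavior.
    r = np.corrcoef(z_bar, w)[0, 1]
    df = len(classes) - 2
    t = r * np.sqrt(df / (1.0 - r ** 2))
    return k_hat, r, t, df

# Hypothetical data: 30 classes of 20 students, no teacher effect built in.
rng = np.random.default_rng(0)
c, s = 30, 20
class_id = np.repeat(np.arange(c), s)
w = rng.normal(size=c)
y0 = rng.normal(size=c)[class_id] + rng.normal(size=c * s)
yt = 0.7 * y0 + rng.normal(scale=0.5, size=c * s)
k_hat, r, t, df = residual_gain_test(y0, yt, class_id, w)
```

The returned t would then be referred to a t distribution with df degrees of freedom; whether that reference distribution is actually valid when K is estimated is the question this study examines.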
As described previously, Z, the constructed residual gain score aggregated by teacher, is correlated with a measure of teacher behavior, W. Let $r_{zw}$ denote the sample correlation coefficient between Z and W, and $\rho_{zw}$ be the corresponding parameter. For a test of $H_0: \rho_{zw} = 0$ to be appropriate, not only must the variables, Z and W, be linearly related but, in addition, $\rho_{zw}$ must be only a function of change in achievement caused by W. If either one of the above conditions is false, a test of $H_0: \rho_{zw} = 0$ will lead to spurious conclusions.

As Z is constructed from equation 1, the appropriateness of $\rho_{zw} = 0$ for a null hypothesis depends upon the choice of K, the adjustment coefficient. While K is assumed to be a known constant, in practice this is seldom the case. Usually K is estimated from the relationship between $Y(0)$ and $Y(t)$, in terms of a regression coefficient. Because the nature of the data on student performance is hierarchical (i.e., students are nested within classrooms), three regression coefficients are available: the between-classroom regression coefficient, the within-classrooms regression coefficient, and the total regression coefficient. For most educational data, these three regression coefficients are not interchangeable. Further, it will be shown that in some situations none of the three coefficients estimates an appropriate correction parameter.

To further complicate matters, the sampling distribution of $r_{zw}$ will be a function of the estimator for K. Unfortunately, the nature of the sampling distribution of $r_{zw}$ is unknown (at least for most situations), and the use of the t-distribution to test $H_0: \rho_{zw} = 0$, as in Brophy and Evertson's study, may not be valid even when the sample regression coefficient estimates an appropriate correction parameter (Draper & Smith, 1981).
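To make the distinction among the three candidate coefficients concrete, the following Python sketch computes the total, between-class, and within-class slopes of posttest on pretest for nested data. It is an illustration under stated assumptions rather than the author's procedure: the function name `regression_slopes` is hypothetical, classes are assumed to be of equal size, and the example data are constructed (noise-free, with within-class deviations centered) so that the between and within slopes are known exactly.

```python
import numpy as np

def regression_slopes(y0, yt, class_id):
    """Return (total, between, within) slopes of posttest on pretest."""
    y0 = np.asarray(y0, dtype=float)
    yt = np.asarray(yt, dtype=float)
    _, idx = np.unique(np.asarray(class_id), return_inverse=True)
    n = np.bincount(idx)
    m0 = np.bincount(idx, weights=y0) / n   # class means, pretest
    mt = np.bincount(idx, weights=yt) / n   # class means, posttest
    d0 = y0 - m0[idx]                       # within-class deviations
    dt = yt - mt[idx]
    c0 = y0 - y0.mean()
    ct = yt - yt.mean()
    total = np.sum(c0 * ct) / np.sum(c0 ** 2)
    between = (np.sum((m0 - m0.mean()) * (mt - mt.mean()))
               / np.sum((m0 - m0.mean()) ** 2))
    within = np.sum(d0 * dt) / np.sum(d0 ** 2)
    return total, between, within

# Hypothetical noise-free data: class effects grow at 0.9, deviations at 0.5.
rng = np.random.default_rng(1)
c, s = 30, 20
a = rng.normal(size=c)                      # pretest class effects
v = rng.normal(size=(c, s))
v -= v.mean(axis=1, keepdims=True)          # center deviations within class
y0 = (a[:, None] + v).ravel()
yt = (0.9 * a[:, None] + 0.5 * v).ravel()
class_id = np.repeat(np.arange(c), s)
total, between, within = regression_slopes(y0, yt, class_id)
```

In this constructed case the between slope recovers 0.9, the within slope recovers 0.5, and the total slope falls between the two, which illustrates why the three coefficients are not interchangeable as estimates of K.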
Research Questions

The intent of the present study was to investigate the appropriateness of using a t-test to test $\rho_{zw} = 0$ as evidence for no teacher behavior effect on student achievement. More specifically, the following research questions were investigated.

1. What are the conditions under which testing $\rho_{zw} = 0$ is equivalent to testing no teacher behavior effect for different methods of defining Z?

2. Given no teacher behavior effect, how well does the distribution of the "t" statistic based on each of several different methods of defining Z approximate the theoretical t-distribution for varying amounts of (a) initial confounding, (b) presence of errors of measurement in the premeasure, (c) number of classrooms, and (d) class size?

The investigation was conducted in two steps. First, the conditions under which $\rho_{zw} = 0$ if and only if there is no effect of teacher behavior on student achievement were determined analytically. Second, a simulation study was conducted to investigate empirically the distribution of "t" statistics using different methods of testing $\rho_{zw}$.

In Chapter II, relevant literature will be reviewed. In order to examine the situation thoroughly, a structural model is introduced in Chapter III. Chapter IV presents the design of the simulation study. The results obtained from the empirical study and the conclusions reached are presented in Chapters V and VI.

CHAPTER II

REVIEW OF THE LITERATURE

In experimental research the experimenter manipulates variables of interest and observes the manner in which the manipulation affects the variation of the dependent variables. In order to be reasonably sure that the observed variation in the dependent variable is indeed due to the manipulated variables, the experimenter must control all possible confounding variables. Porter & Chibucos (1974) suggested two categories for these possible confounding variables in the context of program evaluation:

1. Systematic differences in the dependent variable dimensions that are present in the units of analysis at the outset of program participation.

2. Systematic differences that occur in the dependent variable dimensions during program participation which are not a function of program participation. (p. 440).

While randomization is one of the most powerful methods to control confounding variables of the first category, it does not ensure control of confounding variables of the second category. To the extent both categories of possible confounding variables are controlled, arguments for causal relationships between independent and dependent variables are strengthened.

Studies of natural variation are also used to investigate the possibility of causal relationships among independent and dependent variables. As was the case for experimental research, the investigator must be concerned about both types of confounding variables. In studies of natural variation, however, randomization is by definition not a part of the design and so other methods must be employed to guard against confounding. One general method, which has enjoyed considerable use, involves the formulation of an index of response, of which residualized gain scores, the focus of this study, represent a specific type.

Alternative Indices of Responses

The index of response is defined by

$Z_{ij} = Y(t)_{ij} - K\,Y(0)_{ij}$

where $Y(0)_{ij}$ and $Y(t)_{ij}$ are pre and post measures for the ith individual in the jth group, and K is some known constant. In addition to requiring scores on the measure of interest at two points in time to formulate Z, K has to be set to an a priori known value. However, the value K should take depends on knowledge regarding the natural growth model which would adequately describe the data if there were no effects of the independent variable.

The most commonly used index of response is raw gain scores, where K is set to unity,

$D_{ij} = Y(t)_{ij} - Y(0)_{ij}$

where $D_{ij}$ is the raw gain for the ith individual in the jth group.
In other words, raw gain scores are created by taking the difference of the post measure and premeasure scores on the dependent variable dimension. Raw gain scores as a measure of individual change have been criticized in the literature for having low reliability and for correlating negatively with premeasure scores (Cronbach & Furby, 1970; Linn & Slinde, 1977; Lord, 1963). Cronbach and Furby have also questioned the use of raw gain as a strategy to measure group change in studies of natural variation, agreeing with Lord (1967) that gain scores are an inappropriate strategy to control for confounding variables in natural variation studies of causal relationships.

In contrast, Porter (1973) has suggested that under certain assumptions gain scores may provide the best technique for natural variation studies. Porter argued that, given that treatment effects are additive, the pre- and posttest measure the same variable in a common metric, and there is no change in variances from pretest to posttest, it can be shown that the gain score strategy does provide a reasonable approach to data analysis in natural variation studies. Bryk and Weisberg (1977) showed that under natural growth (i.e., no treatment effect) this gain score strategy provides an unbiased estimate of the treatment effect if and only if the group growth patterns are parallel (which is equivalent to Porter's assumptions).

Standardized gain scores represent yet another form of index of response that has been used to analyze data from studies of natural variation. K in the index of response is set to one of $\sigma_{Y_t}/\sigma_{Y_0}$, $\sigma_{T_{Y_t}}/\sigma_{T_{Y_0}}$, or $s_{Y_t}/s_{Y_0}$, where $\sigma^2_{Y_t}$ and $\sigma^2_{Y_0}$ are the population variances of the dependent variable dimension at posttest and pretest, $\sigma^2_{T_{Y_t}}$ and $\sigma^2_{T_{Y_0}}$ are the population variances for the true variables, and $s_{Y_t}$ and $s_{Y_0}$ are the sample estimates of $\sigma_{Y_t}$ and $\sigma_{Y_0}$. Using ANOVA of standardized gain scores as a strategy to control initial confounding was introduced by Kenny (1975).
Even though Kenny did not distinguish between the different types of standardized gain scores, he argued that when individuals were assigned to a program based on sociological or demographic variables, standardized gain scores provide the best index of response for controlling initial confounding. Olejnik and Porter (1981) clarified Kenny's recommendations by showing that the validity of standardized gain scores is dependent upon the model of natural growth that applies in the absence of treatment effects. They also pointed out that the two alternative ratios of population standard deviations are equivalent if the reliabilities of the pretest and posttest are equal. Finally, and perhaps most importantly, they pointed out that using a ratio of sample standard deviations followed by ANOVA is an incorrect procedure that results in misleadingly small standard errors.

Residual gain scores are yet another form of index of response that has been used in studies of natural variation. Three different types of residual gain scores appear in the literature on measuring change. The first, called true residual gain scores, is defined by setting K in the index of response to the slope of the true posttest on the true pretest. True residual gain scores were suggested by Tucker et al. (1966) and called a "base free measure of change." The second, called observed residual gain scores, $Z$, sets $K = \beta_{Y_tY_0}$ (i.e., the slope of the manifest variables). The third, called estimated residual gain scores, $\hat{Z}$, sets $K = \hat{\beta}_{Y_tY_0}$, where $\hat{\beta}_{Y_tY_0}$ is the sample estimate of the slope of $y_t$ on $y_0$.

Residual gain scores as measures of individual change have been characterized in the literature as uncorrelated with initial status but suffering from low reliability (Kessler, 1977; Linn & Slinde, 1977). Using ANOVA on observed residual gain scores as an analysis strategy in natural variation studies is comparable to using analysis of covariance.
The only difference between the two procedures is that ANCOVA estimates the value of K from the data while ANOVA on the observed residual gain scores requires that K be set a priori to $\beta_{Y_tY_0}$. ANOVA on true residual gain scores is parallel to estimated true scores analysis of covariance, originally developed by Porter (1967). Again the distinction is that the true residual gain score approach requires that a population slope be known a priori while estimated true scores ANCOVA estimates that slope from the data.

Performing ANOVA on the estimated residual gain scores raises at least two problems. First, the expected value of the estimated residual gain score is unknown (Draper and Smith, 1981), making it difficult to determine whether the strategy provides unbiased estimates of the causal relationship of interest. Second, the procedure suffers from the same bias of standard errors that Olejnik and Porter (1981) noted for standardized gain scores using sample standard deviations.

Uses of Indices of Responses

Using an index of response in lieu of randomization in natural variation studies has been a controversial topic. Perhaps the best known antagonist of their use is Lord (1967, 1969), who has stated "with the data usually available for such studies, there is simply no logical or statistical procedure that can be counted on to make proper allowance for uncontrolled preexisting differences between groups" (Lord, 1967, p. 35). More recently Cronbach and Furby (1970) have indicated basic agreement with Lord's pessimistic view of the utility of using statistical adjustment in natural variation studies. On the other hand, Elashoff (1969), Hornquist (1968), and Porter & Chibucos (1974) hold a more optimistic view. Hornquist (1968) has stated, "Even if the initial standing of the subjects is controlled by means of a number of relevant variables, there will always be room for uncontrolled differences that may be important.
The investigator, who because of the nature of his problem cannot use random or systematic assignments of subjects to treatments, has to live with an insecurity in that respect . . . and try to behave intelligently within the limitations of his design . . . or leave the scene of non-experimental research" (p. 57). Porter (1973) has stated ". . . the interpretation of results from designs lacking random assignment requires yet another degree of tentativeness above and beyond what would have been required had random assignment been employed" (p. 41).

Research on teacher effectiveness is one of the areas in which residual gain scores have been used most heavily. For some researchers (e.g., Rosenshine, 1970) residual gain scores are considered the definition of teacher effectiveness and so the logical dependent variable in studies to identify desirable teacher behaviors. Known as process/product research (Dunkin & Biddle, 1974), studies of effective teacher behavior obtain pre- and posttests of student achievement to form the dependent variable and observations of teachers to form independent variables. The residual gain scores are computed for each student and then aggregated to the classroom/teacher level. The correlations of class means on residual gain scores and teacher behaviors are computed and tested for significance. Significant correlations are taken as evidence that teacher behavior affects student achievement.

Examples of process/product research using residualized gain scores to control for confounding variables are Brophy & Evertson, 1974; Creemer, 1974; Creemer and Weeda, 1974; Soar, 1966; and Veldman and Brophy, 1974. In all of these studies $\beta_{Y_tY_0}$ was unknown and so was estimated to define the "constant" in the residualized gain scores. The researchers, however, ignored this distinction when conducting their tests of significance of correlation between teacher behavior and residualized gains.
The fact that a test statistic using $r_{zw}$ is appropriate for testing $H_0: \rho_{zw} = 0$ does not necessarily imply that the parallel test statistic using $r_{\hat{z}w}$ is also a valid test of $\rho_{zw} = 0$.

Testing $\rho_{zw} = 0$ as a test for no teacher behavior effect was investigated in the present study. The investigation was in two parts, analytic and empirical. The analytic part was conducted to determine the conditions under which testing $\rho_{zw} = 0$ is equivalent to testing no effect of a teacher behavior on student achievement. The investigation considered several different possible formulations of Z. The empirical investigation was conducted to investigate the appropriateness of a "t" test statistic to test $H_0: \rho_{zw} = 0$ when sample estimates rather than population parameters were used to define the residualized gain scores. A Monte Carlo method was used to simulate the sampling distributions of the different test statistics based on different formulations of residualized gain scores. These were then compared to the theoretical reference distributions to determine the validity of each test statistic under study.

CHAPTER III

THE ANALYTIC CHAPTER

In this chapter, a linear structural model that defines the problem of measuring change in studies of process/product research will be presented. The model incorporates the aggregated characteristics of the data and the possibility of measurement errors. Given the model, the conditions under which $\rho_{zw} = 0$ is equivalent to no teacher behavior effect on achievement will be identified.

A Linear Structural Model for Process/Product Research

As in equation 1, residual gain scores are constructed from $Y(0)$ and $Y(t)$, the pre- and post-measures of student achievement. The proposed structural model attempts to elucidate the relationships among $Y(0)$, $Y(t)$, and W, a variable representing teacher behavior.
For student i in class j, the observed score $Y(L)_{ij}$ can be decomposed into:

(2) $Y(L)_{ij} = \eta(L)_{ij} + e(L)_{ij}$, L = 0, t

where $\eta(L)_{ij}$ is the part of $Y(L)_{ij}$ which is free from errors of measurement, and $e(L)_{ij}$ represents measurement error. The $\eta(L)_{ij}$ is further decomposed into two components, the class effect and the deviation of the student score from his class mean:

(3) $\eta(L)_{ij} = A(L)_j + V(L)_{ij}$, L = 0, t

where $A(L)_j$ is the class effect at time L, and $V(L)_{ij}$ represents the deviation of the ith student's score from the mean of the jth class. Combining the two equations, $Y(L)_{ij}$ can be written as

(4) $Y(L)_{ij} = A(L)_j + V(L)_{ij} + e(L)_{ij}$, L = 0, t

The measure of teacher behavior can also be decomposed into

(5) $W_j = \xi_j + e_{\xi j}$,

where $\xi_j$ is the true measure of the behavior of teacher j assigned to class j and $e_{\xi j}$ represents measurement error.

Schematically, the structural relationships among the three variables are shown in Figure 1.

Figure 1. A structural model.

The $\beta$'s are the structural coefficients; $\gamma$ represents the reciprocal relationship between $\xi_j$ and $A(0)_j$. $H_j$, $G_{ij}$, $O_j$ and $A_j$ are residuals or specification errors. The structural equations for $A(t)_j$ and $V(t)_{ij}$ are

(6) $A(t)_j = \beta_1 A(0)_j + \beta_3 \xi_j + H_j$,

(7) $V(t)_{ij} = \beta_2 V(0)_{ij} + G_{ij}$.

Within class j, $V(t)_{ij}$ is linearly related to $V(0)_{ij}$. This is equivalent to the assumption of a linear growth operating within each class at the individual level. The same rate of growth, $\beta_2$, occurs within each class. The decomposition of $\eta(L)_{ij}$ into $A(L)_j$ and $V(L)_{ij}$ also implies that the class effect is additive (i.e., $A(L)_j$ is a constant effect for all students in the same class).

The effect of the teacher behavior, W, on student achievement is the same for all students in the same class. Teacher behavior may, however, have a direct effect ($\beta_3$) on $A(t)$ and a reciprocal relationship ($\gamma$) with $A(0)$.
The former will result in changes in performance (for the class as a whole) as a consequence of being exposed to the teacher behavior of interest. The reciprocal relationship ($\gamma$) represents confounding between initial class composition and teacher behavior. In school settings, students are virtually never randomly assigned to classes, and so substantial class effects exist before the start of the school year. Importantly, these differences may be at least in part a consequence of having teacher j in class j. This will have some impact on $A(t)_j$ through $A(0)_j$. Also, this reciprocal relationship represents the possibility that the composition of the class may affect the way the teacher teaches (Doyle, 1979), which can affect $A(t)_j$.

Given the following two assumptions, $\beta_3$ represents the effect of the teacher behavior on student achievement:

1. Prior to the study, there is no other teacher behavior, $\xi_1$, that is correlated with $\xi$ and which has some effect on $A(0)_j$ and/or $A(t)_j$.

2. During the study, there is no other teacher behavior, $\xi_2$, that is correlated with $\xi$ and which has some effect on $A(t)_j$.

These two assumptions are necessary to leave the interpretation of $\beta_3 \neq 0$ clearly a function of the effect of W and not some other teacher behavior variables.

The Relationship Between $\rho_{zw}$ and $\beta_3$

The observed variables $Y_t$, $Y_0$ and W are assumed to have a multivariate normal distribution with a mean vector of zero and a variance-covariance matrix, $\Sigma$ (see Table 1).

Table 1
The Total Variance-Covariance Matrix ($\Sigma$)

        Y(t)        Y(0)        W
Y(t)    $\sigma^2_{A_t} + \sigma^2_{V_t} + \sigma^2_{e_t}$
Y(0)    $\beta_1\sigma^2_{A_0} + \beta_3\gamma\sigma_{A_0}\sigma_\xi + \beta_2\sigma^2_{V_0}$        $\sigma^2_{A_0} + \sigma^2_{V_0} + \sigma^2_{e_0}$
W       $\beta_1\gamma\sigma_{A_0}\sigma_\xi + \beta_3\sigma^2_\xi$        $\gamma\sigma_{A_0}\sigma_\xi$        $\sigma^2_\xi + \sigma^2_{e_\xi}$

In the structural model, errors of measurement and specification errors are assumed to be uncorrelated among themselves and with the latent variables, $\eta$'s, V's, A's and $\xi$. The coefficient $\rho_{zw}$ can be written as

$\rho_{zw} = \dfrac{\sigma_{zw}}{\sigma_z\sigma_w}$

To determine the relationship between $\rho_{zw}$ and $\beta_3$, the variances and covariance are expressed in terms of the structural coefficients.
The covariance between $Z$ and $W$ can be written as

$\sigma_{ZW} = E(ZW) - E(Z)E(W)$
$\quad = E(y(t)w) - K\,E(y(0)w)$
$\quad = E[(A_t + V_t + e_t)(\xi + e_\xi)] - K\,E[(A_0 + V_0 + e_0)(\xi + e_\xi)]$
$\quad = E(A_t\xi) + E(V_t\xi) - K\,E(A_0\xi) - K\,E(V_0\xi)$

The terms involving errors of measurement drop out because those errors are uncorrelated with everything else in the model. Since the $V$'s are defined at the individual level as deviations from the class mean and $\xi$ is defined at the class level,

$E(V(L)\xi) = E_j[\xi_j\,E_i(V(L)_{ij})] = 0$

[Table 2: table body illegible in the source scan.]

Then

$\sigma_{ZW} = E(A_t\xi) - K\,E(A_0\xi)$
$\quad = \beta_3\sigma^2_{\xi} + \mathrm{Cov}(A_0, \xi)(\beta_1 - K)$
$\quad = \beta_3\sigma^2_{\xi} + \gamma\sigma_{A_0}\sigma_{\xi}(\beta_1 - K)$

and therefore

(8) $\rho_{ZW} = \dfrac{\beta_3\sigma^2_{\xi} + \gamma\sigma_{A_0}\sigma_{\xi}(\beta_1 - K)}{\sigma_Z\sigma_W}$

Equation (8) indicates that if

1. $\gamma = 0$ (i.e., no initial confounding), and/or
2. $\beta_1 = K$,

the statement $\rho_{ZW} = 0$ is equivalent to $\beta_3 = 0$ (provided that the variances are all greater than zero).

Defining Values of K

In practice, the regression coefficient for predicting $y(t)$ from $y(0)$ is the value most frequently chosen to represent $K$.
Because of the nested nature of the data, however, there are three such regression coefficients. In order to examine the appropriateness of using any one of these coefficients for $K$, the relationships between each of the coefficients and the structural coefficients are derived and shown in the following section.

Relationships Between Regression Coefficients and the Structural Model

The total regression coefficient ($\beta_T$), the between regression coefficient ($\beta_B$) and the within regression coefficient ($\beta_W$) can be expressed in terms of the model components as follows (Table 2). By definition,

$\beta_T = \dfrac{E\,E_j\,(Y(t)_{ij} - M_{Y(t)})(Y(0)_{ij} - M_{Y(0)})}{E\,E_j\,(Y(0)_{ij} - M_{Y(0)})^2}$

As before, both $M_{Y(t)}$ and $M_{Y(0)}$ are zero. The numerator is $\mathrm{Cov}(Y(t)_{ij}, Y(0)_{ij})$ and the denominator is $\mathrm{Var}(Y(0)_{ij})$. Thus,

$\beta_T = \dfrac{\mathrm{Cov}(Y(t)_{ij}, Y(0)_{ij})}{\mathrm{Var}(Y(0)_{ij})}$

Substituting the corresponding values from Table 1 for $\mathrm{Cov}(Y(t)_{ij}, Y(0)_{ij})$ and $\mathrm{Var}(Y(0)_{ij})$ yields

$\beta_T = \dfrac{\beta_1\sigma^2_{A_0} + \beta_2\sigma^2_{V_0} + \beta_3\gamma\sigma_{A_0}\sigma_{\xi}}{\sigma^2_{A_0} + \sigma^2_{V_0} + \sigma^2_{e_0}}$

Similarly, the between regression coefficient is

$\beta_B = \dfrac{E\,(\bar{Y}(t)_j - M_{Y(t)})(\bar{Y}(0)_j - M_{Y(0)})}{E\,(\bar{Y}(0)_j - M_{Y(0)})^2} = \dfrac{\mathrm{Cov}(\bar{Y}(t)_j, \bar{Y}(0)_j)}{\mathrm{Var}(\bar{Y}(0)_j)}$

By using equation 4 to obtain the class means of $Y(t)_{ij}$ and $Y(0)_{ij}$, and by substitution,

$\beta_B = \dfrac{\frac{1}{s}\beta_2\sigma^2_{V_0} + \beta_1\sigma^2_{A_0} + \gamma\beta_3\sigma_{A_0}\sigma_{\xi}}{\sigma^2_{A_0} + \frac{1}{s}\sigma^2_{V_0} + \frac{1}{s}\sigma^2_{e_0}}$

Similarly, the within regression coefficient is

$\beta_W = \dfrac{E_j\,(Y(t)_{ij} - M_{Y(t)j})(Y(0)_{ij} - M_{Y(0)j})}{E_j\,(Y(0)_{ij} - M_{Y(0)j})^2} = \dfrac{\beta_2\sigma^2_{V_0}}{\sigma^2_{V_0} + \sigma^2_{e_0}}$

As shown in Table 2, when $\beta_3 = 0$, $\rho_{ZW}$ equals zero if

(9) $\gamma\,[(\beta_1 - \beta_2)\sigma^2_{V_0} + \beta_1\sigma^2_{e_0}] = 0$

irrespective of the choice of regression coefficient. For $\rho_{ZW} = 0$ to be equivalent to $\beta_3 = 0$, equation 9 is both the necessary and sufficient condition.

Conditions under which $\gamma\,[(\beta_1 - \beta_2)\sigma^2_{V_0} + \beta_1\sigma^2_{e_0}] = 0$

When $\gamma = 0$

If $\gamma = 0$, equation 9 will be true irrespective of the relationships among $\beta_1$, $\beta_2$, $\sigma^2_{V_0}$ and $\sigma^2_{e_0}$ or the choice of $K$. Put another way, when $\gamma = 0$, there is no problem of adjusting the achievement criterion for initial confounding with the teacher behavior.
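The algebra behind equations 8 and 9 can be checked numerically. The following is a minimal sketch, not the dissertation's program: the function names are hypothetical, and the parameter values are the ones the simulation chapter later reports ($\sigma^2_{A_0} = .24$, $\sigma^2_{V_0} = .56$, $\sigma^2_{e_0} = .2$, $\sigma^2_{\xi} = .5$) combined with one illustrative choice of $\beta_1 = .7$, $\beta_2 = .3$, $\gamma = .4$ and $s = 20$.

```python
import math

def regression_coefficients(b1, b2, b3, gamma, var_a0, var_v0, var_e0, var_xi, s):
    """Population total, between and within slopes in terms of the model components."""
    cov_a0_xi = gamma * math.sqrt(var_a0 * var_xi)  # gamma * sigma_A0 * sigma_xi
    beta_t = (b1 * var_a0 + b2 * var_v0 + b3 * cov_a0_xi) / (var_a0 + var_v0 + var_e0)
    beta_b = (b1 * var_a0 + b2 * var_v0 / s + b3 * cov_a0_xi) / (
        var_a0 + var_v0 / s + var_e0 / s
    )
    beta_w = b2 * var_v0 / (var_v0 + var_e0)
    return beta_t, beta_b, beta_w

def cov_zw(k, b1, b3, gamma, var_a0, var_xi):
    """Numerator of equation 8: Cov(Z, W) for Z = Y(t) - K * Y(0)."""
    cov_a0_xi = gamma * math.sqrt(var_a0 * var_xi)
    return b3 * var_xi + cov_a0_xi * (b1 - k)

# Illustrative parameter values (see the assumptions stated above).
b1, b2, b3, gamma = 0.7, 0.3, 0.0, 0.4
var_a0, var_v0, var_e0, var_xi, s = 0.24, 0.56, 0.2, 0.5, 20

beta_t, beta_b, beta_w = regression_coefficients(
    b1, b2, b3, gamma, var_a0, var_v0, var_e0, var_xi, s
)

# Bracketed term of equation 9; beta_1 minus each slope is this term
# divided by that slope's own denominator.
common = (b1 - b2) * var_v0 + b1 * var_e0

for name, k in [("beta_T", beta_t), ("beta_B", beta_b), ("beta_W", beta_w), ("beta_1", b1)]:
    print(name, round(k, 4), "Cov(Z, W) =", round(cov_zw(k, b1, b3, gamma, var_a0, var_xi), 4))
```

With $\beta_3 = 0$ and $\gamma \neq 0$, the covariance vanishes only for $K = \beta_1$ in this example, which is the point equation 9 makes about the three regression coefficients.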
When $\gamma \neq 0$

If $\gamma$ does not equal zero, for equation 9 to hold, $(\beta_1 - \beta_2)\sigma^2_{V_0} + \beta_1\sigma^2_{e_0}$ must equal zero. This can happen when

1. $\beta_1 = \beta_2\,\sigma^2_{V_0} / (\sigma^2_{V_0} + \sigma^2_{e_0})$, or
2. $\beta_1 = \beta_2$ and $\sigma^2_{e_0} = 0$.

The former can happen only under unlikely circumstances. The latter can happen if a perfectly reliable premeasure is used (so that $\sigma^2_{e_0} = 0$) and subjects are randomly assigned to classrooms (so that $\beta_1 = \beta_2$).

In examining the relationship between $\rho_{ZW}$ and $\beta_3$, none of the conditions identified seems likely to obtain in practice. Random assignment can rarely be achieved in practice and perfectly reliable achievement measures rarely exist.

An alternative to using a regression coefficient as a method for defining $K$ would be to estimate $\beta_1$ directly. For example, from Table 2,

$\beta_T = \dfrac{\beta_2\sigma^2_{V_0} + \beta_1\sigma^2_{A_0}}{\sigma^2_{A_0} + \sigma^2_{V_0} + \sigma^2_{e_0}}$ when $\beta_3 = 0$;

thus,

$\beta_1 = \dfrac{\beta_T(\sigma^2_{A_0} + \sigma^2_{V_0} + \sigma^2_{e_0}) - \beta_2\sigma^2_{V_0}}{\sigma^2_{A_0}}$

Since

$E(MS_{B,y_0}) = s\sigma^2_{A_0} + \sigma^2_{V_0} + \sigma^2_{e_0}$ and $E(MS_{W,y_0}) = \sigma^2_{V_0} + \sigma^2_{e_0}$,

$\hat{\sigma}^2_{A_0} = \frac{1}{s}(MS_{B,y_0} - MS_{W,y_0})$

From Table 2, $\beta_W = \beta_2\,\sigma^2_{V_0} / (\sigma^2_{V_0} + \sigma^2_{e_0})$, so that $\beta_2\sigma^2_{V_0}$ is estimated by $\hat{\beta}_W MS_{W,y_0}$. Writing $MS_B$ and $MS_W$ for the premeasure mean squares,

$\hat{\beta}_1 = \dfrac{\hat{\beta}_T\,[(MS_B - MS_W)/s + MS_W] - \hat{\beta}_W MS_W}{(MS_B - MS_W)/s}$

or

$\hat{\beta}_1 = \dfrac{\hat{\beta}_T MS_B + [(s-1)\hat{\beta}_T - s\hat{\beta}_W]\,MS_W}{MS_B - MS_W}$

Distributions of Test Statistics

Even under conditions where $\beta_3 = 0$ implies $\rho_{ZW} = 0$, a t-test of $\rho_{ZW} = 0$ could still be inappropriate due to the effect of having estimated the value of $K$ from sample data rather than setting $K$ a priori to a known constant. Thus what remains to be done is to determine the effects on the distribution of "t" due to estimating $K$ in each of the several ways. The "t" distribution of $r_{ZW}$ is defined as the sampling distribution of the t ratio with $c - 2$ degrees of freedom which is obtained from $r_{ZW}$ using the equation

(10) $t = \dfrac{r_{ZW}\sqrt{c - 2}}{\sqrt{1 - r^2_{ZW}}}$ (Hays, 1973, p. 661).

Since the exact nature of the "t" distributions of the $r_{ZW}$'s could not be determined, a simulation study was conducted.
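The estimator of $\beta_1$ built from the premeasure mean squares can be sketched as follows. This is a hedged illustration rather than the original FORTRAN: the function names are hypothetical, and the check simply plugs in the population values of the mean squares and slopes (with $\beta_3 = 0$) to confirm that both algebraic forms return the $\beta_1$ that generated them.

```python
def beta1_hat(beta_t_hat, beta_w_hat, ms_b, ms_w, s):
    """Estimate beta_1 from the total and within slopes and the premeasure mean squares."""
    var_a0_hat = (ms_b - ms_w) / s  # estimated between-class variance of Y(0)
    return (beta_t_hat * (var_a0_hat + ms_w) - beta_w_hat * ms_w) / var_a0_hat

def beta1_hat_compact(beta_t_hat, beta_w_hat, ms_b, ms_w, s):
    """Same estimator after multiplying numerator and denominator by s."""
    return (beta_t_hat * ms_b + ((s - 1) * beta_t_hat - s * beta_w_hat) * ms_w) / (ms_b - ms_w)

# Population check with illustrative values (beta_3 = 0): these stand in for sample estimates.
b1, b2, s = 0.7, 0.3, 20
var_a0, var_v0, var_e0 = 0.24, 0.56, 0.2
ms_b = s * var_a0 + var_v0 + var_e0          # E(MS_B) for the premeasure
ms_w = var_v0 + var_e0                       # E(MS_W) for the premeasure
beta_t = (b1 * var_a0 + b2 * var_v0) / (var_a0 + var_v0 + var_e0)
beta_w = b2 * var_v0 / (var_v0 + var_e0)

print(beta1_hat(beta_t, beta_w, ms_b, ms_w, s))
print(beta1_hat_compact(beta_t, beta_w, ms_b, ms_w, s))
```

In a real sample $MS_B$ can fall at or below $MS_W$, in which case the denominator is zero or negative and the estimate is unusable; how the original program handled that case is not stated in the text.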
In addition to using estimates of $\beta_T$, $\beta_B$ and $\beta_W$ to form residual gain scores $Z_1$, $Z_2$ and $Z_3$, respectively, the proposed estimate of $\beta_1$ was used to form residual gain score $Z_4$. For comparison, another form of residual gain score, $Z_5$, was formed by setting $K$ to an a priori known constant (i.e., $K = \beta_1$).

CHAPTER IV
SIMULATION PROCEDURE

As shown in the previous chapter, testing $\rho_{ZW} = 0$ is equivalent to testing $H_0$: $\beta_3 = 0$ if either of the following conditions is met: 1) $\gamma = 0$, or 2) $\beta_1 = \beta_2$, given a perfectly reliable premeasure. Interestingly, it was found that in both of these situations the equivalence between $\rho_{ZW} = 0$ and $\beta_3 = 0$ holds regardless of whether $Z$ is defined using $K$ set to the total, between or within regression coefficient, or any other value of $K$ for that matter. However, in practice, the parametric values of $\beta_T$, $\beta_B$, $\beta_W$ and $\beta_1$ are seldom known. Thus, the purpose of the simulation study was to investigate the appropriateness of using a t-test to test $\rho_{ZW} = 0$ in situations where estimates are used for $\beta_T$, $\beta_B$, $\beta_W$ and $\beta_1$. The empirical sampling distributions of the "t" statistics for each of the four methods of defining residual gain scores were simulated and compared with the central t-distribution. The means of the empirical sampling distributions, empirical Type I error rates, and empirical powers were used to determine the appropriateness of using a t-test to test $\rho_{ZW} = 0$.

The procedures employed in this empirical study will now be discussed. First, the simulation parameters are described, and then the data generation routine.

Simulation Parameters

As stated previously, this investigation required the study of random sampling distributions of "t" based on $r_{Z_1W}$, $r_{Z_2W}$, $r_{Z_3W}$, $r_{Z_4W}$ and $r_{Z_5W}$. Empirical generation of the random sampling distributions was done by repeatedly taking random samples from a known population, an approach which is typically referred to as Monte Carlo.
The parameters of interest were the number of classes per sample, the number of students within each class, the value of $\beta_1$ relative to $\beta_2$, the reliability of the premeasure, the magnitude of initial confounding, and the central and non-central cases.

As previously stated, the means of the manifest variables, $Y_t$, $Y_0$ and $W$, were set equal to zero. Also, without loss of generality, $\sigma^2_{Y_t}$, $\sigma^2_{Y_0}$ and $\sigma^2_W$ were set equal to 1. $Y_t$, $Y_0$ and $W$ were assumed to have a multivariate normal distribution.

Both the number of classes, $c$, and the number of students per class, $s$, were allowed to vary so that effects on the distributions of the various "t" statistics could be investigated. The number of classes was set at 10, 30 and 50. Ten classes (or teachers) were chosen as an easily obtainable sample size. Fifty classrooms were chosen as an unusually large sample size. The number of students per class was set at 10, 20 and 30. The size of 10 was chosen as a lower bound for classroom size which might occur through loss of data. Class sizes of 20 and 30 are typical of schools today.

While $\beta_2$ represents the within class regression slope given a perfect premeasure, $\beta_1$ does not represent exactly the between slope, as shown in Table 2. Consequently the exact magnitude of $\beta_1$ relative to $\beta_2$ cannot be decided. Therefore, three different combinations of $\beta_1$ and $\beta_2$ were selected. First, $\beta_1$ was set equal to $\beta_2$ with value equal to .7. Second, $\beta_1$ was set greater than $\beta_2$ with values .7 and .3 respectively. Third, $\beta_1$ was set smaller than $\beta_2$ with values .3 and .7 respectively. The last situation was included for comparison in spite of the fact that it is rarely encountered in practice (e.g., Cronbach, 1976).

Table 3
Design of Study
[Table body illegible in the source scan.]

When $\beta_1 \neq \beta_2$ the ratio of between variation to within variation varies with the number of students per class. That is, the intraclass correlation, $\rho_I$, gets smaller as the number of students per class increases. The intraclass correlation was set at .30 regardless of $c$ and $s$ for the present study. This value was chosen because there is evidence, for example in school mathematics, that actual school variation accounts for 30 percent of the student achievement of MAT mathematics scores (Haney, 1974).

Since $Y_t$ and $Y_0$ contain errors of measurement, the estimators of the different adjustment coefficients (i.e., $\beta_T$, $\beta_B$ and $\beta_W$) will be biased. The magnitude of bias is proportional to the reliability of $Y_0$. In other words, the bias depends on the premeasure only (Porter, 1971). The reliability of both pre and post measures was set to .8. This value was chosen as a moderate reliability for achievement tests (Ebel, 1979). Since measurements of teacher behavior have lower reliability (Brophy, 1974), .5 was selected as the reliability coefficient of $W$.

As a result of setting $\sigma^2_{Y_t} = \sigma^2_{Y_0} = \sigma^2_W = 1$, $\rho_{Y_tY_t} = \rho_{Y_0Y_0} = .8$ and the reliability of $W$ to .5, the values taken by $\sigma^2_{e_t}$, $\sigma^2_{e_0}$, $\sigma^2_{e_\xi}$ and $\sigma^2_{\xi}$ were .2, .2, .5 and .5, respectively. Also, as a result of setting $\rho_I = .3$, the values taken by $\sigma^2_{A_0}$ and $\sigma^2_{V_0}$ were .24 and .56, respectively, in the presence of errors of measurement.

Three levels of initial confounding were considered: $\gamma = 0$ to indicate no confounding, $\gamma = .4$ to indicate substantial confounding, and $\gamma = .2$ as an intermediate level of initial confounding.

Lastly, both the central and non-central cases were included in the study to examine the probability of Type I and Type II errors.
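The variance components quoted above follow from simple arithmetic on the reliabilities and the intraclass correlation. A minimal sketch, assuming (as the reported values .24 and .56 imply) that the intraclass correlation of .3 is applied to the true-score variance rather than to the total variance; the function name is hypothetical.

```python
def variance_components(total_var, reliability, intraclass):
    """Split a total variance into between-class, within-class and error parts."""
    true_var = reliability * total_var       # variance free of measurement error
    error_var = total_var - true_var         # measurement-error variance
    between_var = intraclass * true_var      # class-effect variance
    within_var = true_var - between_var      # student-within-class variance
    return between_var, within_var, error_var

# Premeasure Y(0): unit variance, reliability .8, intraclass correlation .3.
var_a0, var_v0, var_e0 = variance_components(1.0, 0.8, 0.3)

# Teacher behavior W: unit variance, reliability .5 (class-level, so no within part).
var_xi = 0.5 * 1.0
var_e_xi = 1.0 - var_xi

print(round(var_a0, 2), round(var_v0, 2), round(var_e0, 2), var_xi, var_e_xi)
```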
For the purpose of this study, $\beta_3$ was set equal to 0.00 and 0.10. The value .1 was chosen as an arbitrary value to indicate the non-central case.

Table 3 illustrates all possible combinations of the six design dimensions included in the simulation study. An "*" marks the cells examined. These cells were selected to facilitate the investigation of the effects of initial confounding, presence of errors of measurement in the premeasure, relative magnitude of $\beta_1$ to $\beta_2$, and sample and class sizes on the distribution of the "t" statistics for different methods of defining $r_{ZW}$. One thousand samples were simulated for each of the selected cases.

Data Generation Routine

Three manifest variables were generated: $Y_t$, $Y_0$ and $W$. The three variables were generated to have a multivariate normal distribution with a mean vector of zeros and a variance-covariance matrix $\Sigma$ (see Table 1). As shown in equations 4 and 5 in the analytic chapter, the manifest variables are defined as

$Y_t = A_t + V_t + e_t$, $\quad Y_0 = A_0 + V_0 + e_0$, $\quad W = \xi + e_\xi$

where all the components have been defined previously. Thus, $\Sigma$ can be decomposed into $\Sigma_W$, $\Sigma_B$ and $\Sigma_e$, the within, between and errors of measurement variance-covariance matrices respectively, as shown in Table 4.

Having identified the set of parameters for each population, the Cholesky factor was computed for the between and within population variance-covariance matrices. These were used to transform generated between and within normal (0, 1) variates into between and within components with the desired vector of means and variance-covariance matrix. A FORTRAN program was written to generate the sample data and compute summary statistics for each sample. In order to generate the sample data, the between, within and errors of measurement components needed to be generated.
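The Cholesky transformation described above can be sketched with NumPy. This is an illustration under stated assumptions, not the original FORTRAN routine: `generate_components` is a hypothetical name, the between-class matrix uses $\beta_1 = .7$, $\beta_3 = 0$, $\gamma = .2$, $\sigma^2_{A_0} = .24$ and $\sigma^2_{\xi} = .5$ from the design, and $\sigma^2_{A_t} = .24$ is an assumed value because the text does not report it.

```python
import numpy as np

def generate_components(sigma, n, rng):
    """Draw n rows with covariance `sigma` by transforming independent N(0, 1) variates."""
    lower = np.linalg.cholesky(sigma)              # sigma = lower @ lower.T
    z = rng.standard_normal((n, sigma.shape[0]))   # independent standard normal deviates
    return z @ lower.T                             # each row is lower @ z_row

# Between-class matrix for (A_t, A_0, xi); var_at is an ASSUMED illustrative value.
b1, b3, gamma = 0.7, 0.0, 0.2
var_at, var_a0, var_xi = 0.24, 0.24, 0.5
cov_a0_xi = gamma * np.sqrt(var_a0 * var_xi)
sigma_b = np.array([
    [var_at,                         b1 * var_a0 + b3 * cov_a0_xi, b1 * cov_a0_xi + b3 * var_xi],
    [b1 * var_a0 + b3 * cov_a0_xi,   var_a0,                       cov_a0_xi],
    [b1 * cov_a0_xi + b3 * var_xi,   cov_a0_xi,                    var_xi],
])

rng = np.random.default_rng(0)
between = generate_components(sigma_b, 30, rng)    # one row per class, c = 30
print(between.shape)
```

NumPy returns the lower-triangular factor, which plays the role of the transposed factor in the text's notation; the within-class components would be drawn the same way from $\Sigma_W$, once per student.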
Table 4
Between ($\Sigma_B$), Within ($\Sigma_W$) and Errors of Measurement ($\Sigma_e$) Variance-Covariance Matrices

$\Sigma_B$ (lower triangle, ordered $A_t$, $A_0$, $\xi$):
$\sigma^2_{A_t}$
$\beta_1\sigma^2_{A_0} + \beta_3\gamma\sigma_{A_0}\sigma_{\xi}$ ; $\sigma^2_{A_0}$
$\beta_1\gamma\sigma_{A_0}\sigma_{\xi} + \beta_3\sigma^2_{\xi}$ ; $\gamma\sigma_{A_0}\sigma_{\xi}$ ; $\sigma^2_{\xi}$

$\Sigma_W$ (lower triangle, ordered $V_t$, $V_0$, with no within component for $W$):
$\sigma^2_{V_t}$
$\beta_2\sigma^2_{V_0}$ ; $\sigma^2_{V_0}$
0 ; 0 ; 0

$\Sigma_e$ (diagonal): $\sigma^2_{e_t}$, $\sigma^2_{e_0}$, $\sigma^2_{e_\xi}$

Concerning the between components, two basic steps were used to generate $A_t$, $A_0$ and $\xi$. First, a vector of independent normal variates, $L$, was generated by calling the function GGNQF three times, once for each latent variable. This function, which is adapted from IMSL (1982), generates one pseudo-random normal (0, 1) deviate every time it is called. Second, the obtained normal variates were transformed into a vector of $A_t$, $A_0$ and $\xi$. This was done by multiplying $L$ by the transpose of the Cholesky factor of $\Sigma_B$ (denoted $T'$). This can be summarized as

$(A_t, A_0, \xi)' = T'L$

Steps one and two were repeated as many times as the number of classes in the sample, $c$. The obtained $A_t$, $A_0$ and $\xi$ had a multivariate normal distribution with a mean vector of zero and variance-covariance matrix $\Sigma_B$.

The within components $V_t$ and $V_0$ were generated in the same way as the between components, except that $\Sigma_W$ was used instead of $\Sigma_B$.

GGNQF was also used to generate the normal deviates used to form errors of measurement for the manifest variables. The normal deviates were then multiplied by the standard error of measurement.

Having generated the between, within and error components, each manifest variable was obtained by addition of its component parts.

A subroutine was written to compute the different forms of $r_{ZW}$. The obtained sample correlation coefficients were transformed into a t-ratio with $c - 2$ degrees of freedom using equation 10. Throughout this dissertation the empirical t-sampling distribution of $r_{Z_1W}$ will be denoted as $t_{Z_1W}$, that of $r_{Z_2W}$ as $t_{Z_2W}$, $r_{Z_3W}$ as $t_{Z_3W}$, $r_{Z_4W}$ as $t_{Z_4W}$ and $r_{Z_5W}$ as $t_{Z_5W}$.

Another subroutine was written to obtain empirical Type I and Type II errors for the $t_{ZW}$'s at nominal values of .005, .01, .025, .05, .1, .995, .99, .975, .95 and .90.
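The last two subroutines can be sketched in a few lines. This is a hedged illustration, not the original code: it assumes that equation 10 is the usual transformation $t = r\sqrt{c-2}/\sqrt{1-r^2}$, the function names are hypothetical, and the critical values 1.701 and 2.048 are the standard tabled one- and two-tailed .05 points of the t-distribution with 28 degrees of freedom ($c = 30$).

```python
import numpy as np

def t_ratio(r, c):
    """Transform a correlation based on c classes into a t ratio with c - 2 df."""
    r = np.asarray(r, dtype=float)
    return r * np.sqrt(c - 2) / np.sqrt(1.0 - r ** 2)

def rejection_rate(t_values, critical, tail="upper"):
    """Proportion of replications whose t falls in the chosen rejection region."""
    t_values = np.asarray(t_values, dtype=float)
    if tail == "upper":
        return float(np.mean(t_values > critical))
    if tail == "lower":
        return float(np.mean(t_values < -critical))
    return float(np.mean(np.abs(t_values) > critical))   # two-tailed

# Toy t values standing in for the 1000 replications of one design cell.
toy_t = [-3.0, -1.0, 0.0, 1.0, 2.0, 3.0]
upper_05 = rejection_rate(toy_t, 1.701, "upper")   # one-tailed .05, df = 28
two_05 = rejection_rate(toy_t, 2.048, "two")       # two-tailed .05, df = 28
print(t_ratio(0.5, 30), upper_05, two_05)
```

When $\beta_3 = 0$ the rejection rate estimates the Type I error; when $\beta_3 = .1$ it estimates power, and one minus it the Type II error.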
This allowed consideration of fit for both one- and two-tailed tests of the null hypothesis $\rho_{ZW} = 0$.

In order to check the accuracy of the computer program written to calculate summary statistics, the simulated data for the design with 5 classes of 5 students each were printed out and analyzed separately using the SPSS statistical package. The results of the two sets of calculations agreed perfectly.

The simulation portion of the program was verified by executing the program to obtain Type I errors for a set of parameters in which $Y_t$, $Y_0$ and $W$ were perfectly reliable, $\gamma = 0$ and $\beta_1 = \beta_2$. Under these conditions the different $t_{ZW}$'s all have a central t-distribution. The empirical Type I errors of the $t_{ZW}$'s were in close agreement with their corresponding nominal alphas. For example, the empirical Type I errors of $t_{Z_1W}$, $t_{Z_2W}$, $t_{Z_3W}$, $t_{Z_4W}$ and $t_{Z_5W}$ were .049, .049, .051, .048 and .051 for upper tail $\alpha = .05$ and .054, .052, .056, .052 and .050 for lower tail $\alpha = .05$; they were .100, .096, .100, .099 and .101 for upper tail $\alpha = .10$ and .105, .107, .106, .110 and .105 for lower tail $\alpha = .10$.

For each cell identified in Table 3 the program was run once. The seed number for every run was the random number generated after the last one used by the preceding run.

CHAPTER V
RESULTS OF THE EMPIRICAL INVESTIGATION

In Chapter III, it was shown that $H_0$: $\rho_{ZW} = 0$ is equivalent to $H_0$: $\beta_3 = 0$ if either of the following conditions is met: 1) $\gamma = 0$, or 2) $\beta_1 = \beta_2$ and a perfectly reliable premeasure. When these conditions are not met, however, $\rho_{ZW} = 0$ is equivalent to $H_0$: $\beta_3 = 0$ only for $Z_4$ and $Z_5$. This chapter demonstrates empirically the Type I error and power of the five test statistics of $H_0$: $\rho_{ZW} = 0$ for situations which are common in educational research.
The variables of interest in the empirical investigation were: magnitude of initial confounding, reliability of the premeasure, relative magnitude of $\beta_1$ to $\beta_2$, number of classes per sample (sample size), and number of students within each class. Any combination of levels of the above variables identifies a sampling distribution for each of the several $r_{ZW}$'s. The specific sampling distributions investigated were selected according to a design which facilitated investigation of the effects of each of the several design variables while holding the other variables constant. The subset of sampling distributions chosen for study is represented by asterisks in the six dimensional matrix in Table 3.

The effects of initial confounding, presence of errors of measurement in the premeasure, and sample and class sizes on the mean estimates of the $\rho_{ZW}$'s, the empirical Type I errors and the empirical power of the one- and two-tailed tests of the $\rho_{ZW}$'s are presented in this chapter.

In general, the results of the study showed that when there was a substantial amount of initial confounding, the test statistics $t_{Z_1W}$, $t_{Z_2W}$ and $t_{Z_3W}$ were only valid in a few situations. These tests, particularly $t_{Z_1W}$ and $t_{Z_3W}$, tended to be too liberal in situations where $\beta_1 = \beta_2$ or $\beta_1 > \beta_2$ and too conservative when $\beta_1 < \beta_2$. Parallel results for $t_{Z_1W}$ and $t_{Z_3W}$ were obtained with increasing sample size. However, $t_{Z_4W}$ and $t_{Z_5W}$ were the only tests which remained valid across all levels of initial confounding, presence of errors of measurement, and sample and class sizes. Furthermore, the results of the study indicated that increasing sample and class size and presence of errors of measurement increased the empirical power of both $t_{Z_4W}$ and $t_{Z_5W}$ in situations where $\beta_1 = \beta_2$ or $\beta_1 > \beta_2$.
Mean Estimates of $\rho_{ZW}$ when $\beta_3 = 0$

Initial Confounding Effects

By examining the equations in column 5 of Table 2, one can predict that when $\gamma = 0$ and $\beta_3 = 0$ each of the five $r_{ZW}$'s under investigation has an expected value equal to zero. The numerators of these equations are

$\gamma\sigma_{A_0}\sigma_{\xi}\,[(\beta_1 - \beta_2)\sigma^2_{V_0} + \beta_1\sigma^2_{e_0}]$ for $\rho_{Z_1W}$, $\rho_{Z_2W}$ and $\rho_{Z_3W}$,
0 for $\rho_{Z_4W}$, and
$\gamma\sigma_{A_0}\sigma_{\xi}(\beta_1 - K)$ for $\rho_{Z_5W}$.

Given these numerators, one can see that all $\rho_{ZW}$'s increase as $\gamma$ increases, holding other variables constant. Inspection of the numerators also makes clear that the sign and magnitude of the $\rho_{ZW}$'s are affected by the relationship of $\beta_1$ to $\beta_2$. For example, when $\beta_1 > \beta_2$ and $\gamma$ is large, the mean estimates of $\rho_{Z_1W}$, $\rho_{Z_2W}$ and $\rho_{Z_3W}$ are expected to depart positively from zero. Similarly, when $\beta_1 = \beta_2$ (and errors of measurement are present) the departure of these mean estimates will be in the positive direction but not as far as when $\beta_1 > \beta_2$. In situations where $\beta_1 < \beta_2$ and $\gamma$ is large, the departure of the mean estimates of $\rho_{Z_1W}$, $\rho_{Z_2W}$ and $\rho_{Z_3W}$ will be negative, given $|(\beta_1 - \beta_2)\sigma^2_{V_0}| > \beta_1\sigma^2_{e_0}$.

The mean estimate of $\rho_{Z_2W}$ is expected to be smaller in absolute value than the mean estimates of $\rho_{Z_1W}$ and $\rho_{Z_3W}$. This is because all three share the same numerator but $\rho_{Z_2W}$ has the largest denominator. The denominators, as shown in Table 2, are:

$\sigma_{Z_1}\sigma_W(\sigma^2_{A_0} + \sigma^2_{V_0} + \sigma^2_{e_0})$ for $\rho_{Z_1W}$,
$\sigma_{Z_2}\sigma_W(s\sigma^2_{A_0} + \sigma^2_{V_0} + \sigma^2_{e_0})$ for $\rho_{Z_2W}$,
$\sigma_{Z_3}\sigma_W(\sigma^2_{V_0} + \sigma^2_{e_0})$ for $\rho_{Z_3W}$.

In summary, given $\gamma$ is large, it is predicted that in situations where $\beta_1 > \beta_2$, the empirical sampling distributions of $r_{Z_1W}$ and $r_{Z_3W}$ will be centered to the right of the central t-distribution, and to its left when $\beta_1 < \beta_2$ (though these also depend on the magnitude of errors of measurement). Also, it is expected that the empirical sampling distributions of $r_{Z_4W}$ and $r_{Z_5W}$ will be the closest to the central t-distribution across all combinations of $\beta_1$ and $\beta_2$.
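The predicted direction of departure can be illustrated by evaluating the shared numerator for the three $\beta_1$, $\beta_2$ combinations in the design. A minimal sketch using the design values reported earlier ($\sigma^2_{A_0} = .24$, $\sigma^2_{V_0} = .56$, $\sigma^2_{e_0} = .2$, $\sigma^2_{\xi} = .5$) and $\gamma = .4$; the function name is hypothetical, and only the numerator is computed, so the values show sign and ordering rather than the size of the correlation.

```python
import math

def shared_numerator(b1, b2, gamma, var_a0, var_v0, var_e0, var_xi):
    """Numerator common to rho_Z1W, rho_Z2W and rho_Z3W when beta_3 = 0."""
    cov_a0_xi = gamma * math.sqrt(var_a0 * var_xi)   # gamma * sigma_A0 * sigma_xi
    return cov_a0_xi * ((b1 - b2) * var_v0 + b1 * var_e0)

cases = {"b1 = b2": (0.7, 0.7), "b1 > b2": (0.7, 0.3), "b1 < b2": (0.3, 0.7)}
numerators = {
    label: shared_numerator(b1, b2, 0.4, 0.24, 0.56, 0.2, 0.5)
    for label, (b1, b2) in cases.items()
}
for label, value in numerators.items():
    print(label, round(value, 4))
```

The ordering (largest for $\beta_1 > \beta_2$, smaller but positive for $\beta_1 = \beta_2$, negative for $\beta_1 < \beta_2$) is the pattern the paragraph above predicts.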
Table 5 shows the effect of initial confounding on the mean estimates of $\rho_{Z_1W}$, $\rho_{Z_2W}$, $\rho_{Z_3W}$, $\rho_{Z_4W}$ and $\rho_{Z_5W}$ under the three different combinations of $\beta_1$ and $\beta_2$, where sample size and class size were held constant at 30 and 20 respectively, and $\rho_{Y_0Y_0} = .8$.

As expected, the means of the empirical sampling distributions of the $r_{ZW}$'s were all near zero when $\gamma = 0$. As $\gamma$ increased to .2, the mean estimates of $\rho_{Z_1W}$ and $\rho_{Z_3W}$ increased to .026 and .033 when $\beta_1 = \beta_2$ and to .058 and .07, respectively, when $\beta_1 > \beta_2$. However, their values decreased to -.02 and -.028 when $\beta_1 < \beta_2$. Increasing $\gamma$ to .4 caused the sampling distribution mean estimates of $\rho_{Z_1W}$ and $\rho_{Z_3W}$ to depart far from zero, particularly in situations where $\beta_1 > \beta_2$. Their values were .05 and .066 when $\beta_1 = \beta_2$, -.047 and -.059 when $\beta_1 < \beta_2$, and .119 and .147, respectively, when $\beta_1 > \beta_2$.

Table 5
Means of Empirical Sampling Distributions of $r_{ZW}$'s for Different Combinations of $\beta_1$, $\beta_2$ and $\gamma$ for $c = 30$, $s = 20$, $\rho_{Y_0Y_0} = .8$ and $\beta_3 = 0$
[Table body illegible in the source scan.]

While the sampling mean estimates of $\rho_{Z_2W}$ remained relatively close to zero across all levels of $\gamma$ and across all combinations of $\beta_1$ and $\beta_2$, there was a slight increase in the mean of the $r_{Z_2W}$'s in situations where $\beta_1 > \beta_2$ as $\gamma$ increased. Mean $r_{Z_2W}$'s were .0015 at $\gamma = 0$, .012 at $\gamma = .2$, and .024 at $\gamma = .4$. However, these mean estimates were close to zero only because the specific values of $\rho_{Y_0Y_0}$ and $(\beta_1 - \beta_2)$ were such that the two parts of the numerator in $\rho_{Z_2W}$ compensated each other.

The sampling mean estimates of $\rho_{Z_4W}$ and $\rho_{Z_5W}$ remained the closest to zero across all levels of $\gamma$ and across all combinations of $\beta_1$ and $\beta_2$.
Effects of Presence of Errors of Measurement ($\rho_{Y_0Y_0} \neq 1$)

$\sigma^2_{e_0}$ is a common component shared by the numerators of $\rho_{Z_1W}$, $\rho_{Z_2W}$ and $\rho_{Z_3W}$. Since $\sigma^2_{e_0}$ has a positive or zero value, its presence should increase the departure of the mean estimates of $\rho_{Z_1W}$, $\rho_{Z_2W}$ and $\rho_{Z_3W}$ from zero in situations where $\beta_1 = \beta_2$ or $\beta_1 > \beta_2$. However, this departure decreases in situations where $\beta_1 < \beta_2$. Due to the absence of $\sigma^2_{e_0}$ from the equations of $\rho_{Z_4W}$ and $\rho_{Z_5W}$, errors of measurement were expected to have no effect on their sampling mean estimates.

Table 6 reports the effect of the presence of errors of measurement in the premeasure on the sampling mean estimates of the $\rho_{ZW}$'s for the three different combinations of $\beta_1$ and $\beta_2$ for $c = 30$, $s = 20$ and $\gamma = .2$.

As expected, the mean estimates of $\rho_{Z_1W}$, $\rho_{Z_2W}$ and $\rho_{Z_3W}$ increased due to presence of errors of measurement when $\beta_1 = \beta_2$. While their values were all equal to .003 when $\rho_{Y_0Y_0} = 1.0$, they became .026, .005 and .033, respectively, when $\rho_{Y_0Y_0} = .8$. Also, as expected, presence of errors of measurement brought the mean estimates of $\rho_{Z_1W}$ and $\rho_{Z_3W}$ closer to zero in situations where $\beta_1 < \beta_2$. Their values were -.041 and -.056 when $\rho_{Y_0Y_0} = 1.0$ and became -.02 and -.028 when $\rho_{Y_0Y_0} = .8$. However, in situations where $\beta_1 > \beta_2$ and $\rho_{Y_0Y_0} = .8$, the mean estimates of $\rho_{Z_1W}$ and $\rho_{Z_3W}$ did not increase as expected. Their values were .057 and .075 when $\rho_{Y_0Y_0} = 1$ and .058 and .07 when $\rho_{Y_0Y_0} = .8$.

Table 6
Means of Empirical Sampling Distributions of $r_{ZW}$'s for Different Combinations of $\beta_1$, $\beta_2$ and $\rho_{Y_0Y_0}$ for $c = 30$, $s = 20$, $\gamma = .2$ and $\beta_3 = 0$
[Table body illegible in the source scan.]

The mean estimates of $\rho_{Z_4W}$ and $\rho_{Z_5W}$ remained the closest to zero in the presence and absence of errors of measurement and across all combinations of $\beta_1$ and $\beta_2$.
Sample and Class Size Effect

Due to the presence of $s$ in its denominator, $\rho_{Z_2W}$ was not only expected to have a smaller mean estimate than $\rho_{Z_1W}$ and $\rho_{Z_3W}$ but was also expected to get smaller as $s$ increased. $c$ is not part of any of the equations of the $\rho_{ZW}$'s; therefore, it was expected that the mean estimates of the $\rho_{ZW}$'s would not be affected by changing sample size.

Table 7 shows the mean estimates of the $\rho_{ZW}$'s across different levels of sample size where $\gamma = .2$, $\rho_{Y_0Y_0} = .8$ and $s = 20$. As expected, the mean estimates of all $\rho_{ZW}$'s were not affected by increasing $c$ across combinations of $\beta_1$ and $\beta_2$. For example, in situations where $\beta_1 > \beta_2$, the mean estimates of $\rho_{Z_1W}$, $\rho_{Z_2W}$, $\rho_{Z_3W}$, $\rho_{Z_4W}$ and $\rho_{Z_5W}$ were .06, -.01, .07, -.0045 and .0073 for $c = 10$; .058, .012, .07, .0008 and .0074 for $c = 30$; and .053, .01, .07, .0006 and .0012 for $c = 50$.

Table 8 shows the mean estimates of the $\rho_{ZW}$'s across different levels of class size where $\gamma = .2$, $\rho_{Y_0Y_0} = .8$, $c = 30$ and $\beta_3 = 0$. As expected, the mean estimates of $\rho_{Z_1W}$, $\rho_{Z_3W}$, $\rho_{Z_4W}$ and $\rho_{Z_5W}$ were not affected by increasing $s$. The mean estimate of $\rho_{Z_2W}$ decreased slightly as $s$ increased. For example, the mean estimates of $\rho_{Z_2W}$ were .018 for $s = 10$, .012 for $s = 20$ and .0048 for $s = 30$ in situations where $\beta_1 > \beta_2$.

Table 7
Effects of Sample Size on Mean Estimates of $\rho_{ZW}$'s for $\gamma = .2$, $\rho_{Y_0Y_0} = .8$, $s = 20$ and $\beta_3 = 0$
[Table body illegible in the source scan.]

Table 8
Effects of Class Size on Mean Estimates of $\rho_{ZW}$'s for $\gamma = .2$, $\rho_{Y_0Y_0} = .8$, $c = 30$ and $\beta_3 = 0$
[Table body illegible in the source scan.]

Empirical Type I Errors for One and Two Tailed t-Tests When Testing $H_0$: $\rho_{ZW} = 0$

To evaluate the validity of the t-test in testing $H_0$: $\rho_{ZW} = 0$, the empirical values of the tests for the $t_{ZW}$'s were compared to the critical values obtained from the t-distribution with $c - 2$ degrees of freedom for selected levels of significance. When the null hypothesis is true (i.e., $\beta_3 = 0$), the observed relative frequencies of data sets having values of $t_{Z_1W}$, $t_{Z_2W}$, $t_{Z_3W}$, $t_{Z_4W}$ and $t_{Z_5W}$ greater than the critical values in the upper tail, or smaller than the same critical values in the lower tail, yield the empirical levels of significance. Comparison to the selected or nominal levels of significance gives an indication of whether the test used is conservative, liberal, or correct.

Comparisons were made at three nominal levels of significance which are commonly used by educational researchers: .01, .05 and .1. Observed levels of significance were in all cases based on calculating the $t_{ZW}$'s for 1000 replications from a multivariate normal distribution with specified characteristics.

To facilitate comparison of empirical and nominal levels of significance, 95% probability intervals were computed using the normal approximation of the binomial distribution with $n = 1000$ and $P$ equal to the selected level of significance. Thus, if the selected level of significance was .05, the 95% probability interval would be

$.05 \pm 1.96\,[(.05)(1 - .05)/1000]^{1/2} = .05 \pm .014$

The probability limits for the nominal alphas are presented in Tables 9 through 16. If the empirical Type I error exceeded the upper value of the probability limit, this indicated a liberal test. On the other hand, if it was less than the lower value of the probability limit, this indicated a conservative test; otherwise the t-test was considered valid.
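The probability limits and the liberal/conservative/valid rule described above amount to a short calculation. A minimal sketch with hypothetical function names, using the normal approximation to the binomial with $n = 1000$ replications as stated in the text:

```python
import math

def probability_limits(alpha, n=1000, z=1.96):
    """95% probability interval for an empirical rejection rate under nominal alpha."""
    half_width = z * math.sqrt(alpha * (1.0 - alpha) / n)
    return alpha - half_width, alpha + half_width

def classify(rate, alpha, n=1000):
    """Label a test by where its empirical Type I error falls relative to the limits."""
    lower, upper = probability_limits(alpha, n)
    if rate > upper:
        return "liberal"
    if rate < lower:
        return "conservative"
    return "valid"

for alpha in (0.01, 0.05, 0.10):
    lower, upper = probability_limits(alpha)
    print(alpha, round(lower, 3), round(upper, 3))
```

For example, `classify(0.082, 0.05)` labels an empirical rate of .082 as liberal, while a rate of .048 at the same nominal level is labeled valid.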
The .05 nominal alpha will be used throughout this chapter as the primary basis for comparison of the different situations.

Initial Confounding Effect

It was argued earlier in this chapter that, given $\gamma$ is large, the empirical sampling distributions of $r_{Z_1W}$, $r_{Z_2W}$ and $r_{Z_3W}$ will be located to the right of the central t-distribution in situations where $\beta_1 = \beta_2$ or $\beta_1 > \beta_2$, and to its left when $\beta_1 < \beta_2$. As mentioned earlier, this prediction did not hold for the sampling distribution of $r_{Z_2W}$. Also, it was argued that the empirical sampling distributions of $r_{Z_4W}$ and $r_{Z_5W}$ would be the closest to zero. As a consequence, given $\gamma$ is large, it was expected that using the test statistics $t_{Z_1W}$ and $t_{Z_3W}$ to test $\rho_{ZW} = 0$ would result in liberal tests in situations where $\beta_1 = \beta_2$ or $\beta_1 > \beta_2$, and in conservative tests when $\beta_1 < \beta_2$. However, both $t_{Z_4W}$ and $t_{Z_5W}$ were expected to result in a valid test of the hypothesis of interest.

Table 9 shows the empirical Type I errors of the one-tailed test of $\rho_{ZW}$ across three levels of initial confounding and across three combinations of $\beta_1$ and $\beta_2$ for $c = 30$, $s = 20$ and $\rho_{Y_0Y_0} = .8$. Comparable results for the two-tailed tests are shown in Table 10. It should be mentioned that here and throughout this paper, only the positive tail was considered for the one-tailed tests.

$\beta_1 = \beta_2$

All the empirical Type I errors of the one-tailed tests for the $t_{ZW}$'s were within 1.96 standard errors of their corresponding nominal alphas when $\gamma = 0$ and $\gamma = .2$. As $\gamma$ increased to .4, most of the empirical Type I errors for the one-tailed tests for $t_{Z_1W}$ and $t_{Z_3W}$ were, as expected, greater than the upper limits of their corresponding nominal alphas. The other $t_{ZW}$'s were not affected. For example, at the .05 level of significance, the empirical Type I errors for one-tailed tests for $t_{Z_1W}$, $t_{Z_2W}$, $t_{Z_3W}$, $t_{Z_4W}$ and $t_{Z_5W}$ were .082, .048, .097, .043 and .047, respectively.
While the empirical Type I errors of the two-tailed tests for the $t_{ZW}$'s were in close agreement with those of the one-tailed tests when $\gamma = 0$ and $\gamma = .2$, they differed as $\gamma$ increased to .4. For example, at .05 nominal alpha, the empirical Type I errors of the two-tailed tests for $t_{Z_1W}$, $t_{Z_2W}$, $t_{Z_3W}$, $t_{Z_4W}$ and $t_{Z_5W}$ were .043, .029, .056, .038 and .041, respectively (Table 10). The two-tailed $t_{Z_1W}$ and $t_{Z_3W}$ tests were valid only because of compensating lack of fit in each tail. Thus for $r_{Z_1W}$, $r_{Z_2W}$ and $r_{Z_3W}$, either the one-tailed test was too liberal or the two-tailed test was too conservative. In contrast, both the one- and two-tailed tests for $t_{Z_4W}$ and $t_{Z_5W}$ were valid in testing the hypothesis of interest across all levels of $\gamma$.

Table 9
Effects of Initial Confounding on Empirical Type I Errors for the One-Tailed Tests of $\rho_{ZW} = 0$ where $c = 30$, $s = 20$ and $\rho_{Y_0Y_0} = .8$
[Table body illegible in the source scan.]

Table 10
Effects of Initial Confounding on Empirical Type I Errors for the Two-Tailed Tests of $\rho_{ZW} = 0$ where $c = 30$, $s = 20$ and $\rho_{Y_0Y_0} = .8$
[Table body illegible in the source scan.]

$\beta_1 < \beta_2$

As expected, all the empirical Type I errors of the one-tailed tests were within 1.96 standard errors of their corresponding nominal alphas when $\gamma = 0$. As $\gamma$ increased to .2, the empirical Type I errors for the one-tailed tests for $t_{Z_1W}$ and $t_{Z_3W}$ became slightly conservative (e.g., .036 and .034 for nominal alpha of .05). As $\gamma$ increased to .4, the degree of conservativeness increased, with empirical Type I errors of .019 and .016 at .05 nominal alpha.
As expected, the one-tailed test for t_z2w also became conservative with increased initial confounding, but less so than either r_z1w or r_z3w. While the one-tailed tests using t_z1w, t_z2w and t_z3w were conservative when γ = .4, only the two-tailed test using t_z2w was conservative. Its empirical Type I error was .035 at .05 nominal alpha. It should be mentioned that both the one- and two-tailed tests using t_z4w and t_z5w were valid in testing H0: ρ_zw = 0 across all levels of γ.

β1 > β2

As expected, all the empirical Type I errors of the one-tailed tests were within 1.96 standard errors of their corresponding nominal alphas when γ = 0. As γ increased, the empirical Type I errors of the one-tailed tests using t_z1w and t_z3w increased to .091 and .103, respectively, when γ = .2 and to .148 and .193 when γ = .4 at .05 nominal alpha. However, the one-tailed test using t_z2w was not liberal at .05 as γ increased, but was liberal at .1 nominal alpha (e.g., .114 when γ = .2 and .127 when γ = .4). None of the two-tailed tests was liberal when γ = 0 and γ = .2 at .05 nominal alpha. But as γ increased to .4, the two-tailed tests using t_z1w and t_z3w became liberal (e.g., empirical Type I errors of .094 and .116, respectively, at .05 nominal alpha). Again the one- and two-tailed tests using t_z4w and t_z5w were valid across all combinations of β1 and β2 and across all levels of γ.

Effects on Empirical Type I Errors of Test Statistics When the Premeasure Contains Errors of Measurement

As mentioned earlier, the presence of errors of measurement was expected to push the empirical sampling distributions of the t statistics for r_z1w, r_z2w and r_z3w to the right of a central t-distribution when β1 = β2 or β1 > β2. Also, errors of measurement were expected to bring the empirical sampling distributions for r_z1w, r_z2w and r_z3w closer to the central t-distribution in situations where β1 < β2.
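The validity criterion used throughout these results, an empirical Type I error within 1.96 standard errors of its nominal alpha, can be sketched as follows. This is a minimal illustration only: the number of Monte Carlo replications is not stated in this section, so `n_reps = 1000` is a hypothetical value, and the function names are likewise illustrative.

```python
import math


def type1_band(alpha, n_reps, z=1.96):
    """Return (lower, upper) limits of alpha +/- z standard errors.

    The standard error is that of a binomial proportion with true
    rejection probability alpha over n_reps replications.
    """
    se = math.sqrt(alpha * (1.0 - alpha) / n_reps)
    return alpha - z * se, alpha + z * se


def classify(empirical, alpha, n_reps):
    """Label an empirical Type I error as conservative, valid or liberal."""
    lower, upper = type1_band(alpha, n_reps)
    if empirical < lower:
        return "conservative"
    if empirical > upper:
        return "liberal"
    return "valid"


if __name__ == "__main__":
    n_reps = 1000  # hypothetical; not given in the text
    # Empirical Type I errors quoted in the text at .05 nominal alpha.
    for rate in (0.019, 0.043, 0.148):
        print(rate, classify(rate, 0.05, n_reps))
```

With a different number of replications the limits widen or narrow accordingly, so borderline values may be classified differently.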
Table 11 shows the empirical Type I errors of the one-tailed tests of the ρ_zw's when ρ_Y0Y0 = 1.0 and .8 across the three combinations of β1 and β2 for c = 30, s = 20 and γ = .2. Comparable results for the two-tailed tests are shown in Table 12.

β1 = β2

All the empirical Type I errors of the one- and two-tailed tests using the t_zw's across both levels of ρ_Y0Y0 were within 1.96 standard errors of their corresponding nominal alphas, except for the two-tailed tests using t_z2w and t_z4w, where the empirical Type I errors were conservative. In contrast to what was expected, a pretest reliability of at least .8 does not invalidate the r_z1w, r_z2w and r_z3w procedures when β1 = β2.

β1 < β2

As expected, the one-tailed tests using t_z1w, t_z2w and t_z3w were less conservative when β1 < β2 and the reliability of the premeasure was less than perfect. The empirical Type I errors were .013, .015 and .026 when ρ_Y0Y0 = 1.0 and .036, .043 and .034 when ρ_Y0Y0 = .8 at .05 nominal alpha. The empirical Type I errors for the one- and two-tailed tests using t_z4w and t_z5w remained within 1.96 standard errors of their corresponding nominal alphas across both levels of ρ_Y0Y0. While the presence of errors of measurement did not have any noticeable effect on the two-tailed tests for t_z2w, t_z4w and t_z5w, less than perfect reliability of the premeasure appeared to make the t_z1w and t_z3w tests slightly too liberal at nominal .01 (e.g., the empirical Type I errors were both .018).

[Table 11. Effects of the Presence of Errors of Measurement in the Premeasure on the Empirical Type I Errors for the One-Tailed Tests of ρ_zw = 0. Table body not legible.]

[Table 12. Effects of the Presence of Errors of Measurement in the Premeasure on the Empirical Type I Errors for the Two-Tailed Tests of ρ_zw = 0. Table body not legible.]

β1 > β2

In contrast to what was expected, the one-tailed tests using t_z1w and t_z3w became less liberal in the presence of errors of measurement at .01 and .05 nominal alphas (e.g., empirical Type I errors of .098 and .117 for ρ_Y0Y0 = 1.0 but .091 and .103, respectively, for ρ_Y0Y0 = .8 at .05).
The expected increased liberalness due to errors of measurement in the pretest did occur, however, at nominal alpha .1. A similar decrease in liberalness was found for the two-tailed tests using t_z1w and t_z3w (e.g., empirical Type I errors of .071 and .077 for ρ_Y0Y0 = 1.0 but .053 and .063 for ρ_Y0Y0 = .8). Once again the one- and two-tailed tests using t_z4w and t_z5w were valid across all combinations of β1 and β2 and across both ρ_Y0Y0 = 1.0 and .8.

Sample Size Effect

It was expected that increased c should result in increased power. This should not affect Type I error rates for valid tests but should increase problems for tests that are too liberal (and maybe even for tests that are too conservative). Table 13 shows the empirical Type I errors of the one-tailed tests of ρ_zw across three levels of sample size and across three combinations of β1 and β2 for [...]

[Table 13. Effects of Sample Size on Empirical Type I Errors for the One-Tailed Tests of ρ_zw = 0. Table body not legible.]

[Table 14. Effects of Sample Size on Empirical Type I Errors for the Two-Tailed Tests of ρ_zw = 0. Table body not legible.]

β1 > β2

As expected, most of the empirical Type I errors of the one-tailed tests for t_z1w, t_z2w and t_z3w were beyond the upper probability limits of their corresponding nominal alphas for c = 30 and c = 50. The liberalness of these tests increased as c increased. For example, the empirical Type I errors of the one-tailed tests for t_z1w, t_z2w and t_z3w were .062, .049 and .063 for c = 10; .091, .051 and .103 for c = 30; and .112, .067 and .129 for c = 50 at .05 nominal alpha.
All of the two-tailed tests were valid when c = 10 and 30, except for t_z4w, which was conservative for c = 10 at .05 nominal alpha. As c increased to 50, the two-tailed tests for t_z1w and t_z3w became liberal at .05 nominal alpha (e.g., empirical Type I errors of .073 and .074). Surprisingly, at .1 nominal alpha even the two-tailed tests for t_z4w and t_z5w became too liberal.

Class Size Effect

On a priori grounds it was difficult to predict the effect that varying class size might have on the validity of the several test statistics under investigation. As reported earlier, only the formula for ρ_z2w was a function of class size, s, and there it appeared in the denominator. Table 15 reports the empirical Type I errors of the one-tailed tests for the t_zw's across three levels of class size and across three combinations of β1 and β2 for γ = .2, c = 30 and ρ_Y0Y0 = .8. Comparable results for the two-tailed tests are shown in Table 16.

β1 = β2

All empirical Type I errors for both one- and two-tailed tests were within 1.96 standard errors of .05 nominal alpha. Further, the liberalness of t_z1w and t_z3w remained stable as s increased. These results indicate that increasing class size does not have an effect on the validity of the tests.

[Table 15. Effects of Class Size on Empirical Type I Errors for the One-Tailed Tests of ρ_zw = 0. Table body not legible.]

[Table 16. Effects of Class Size on Empirical Type I Errors for the Two-Tailed Tests of ρ_zw = 0. Table body not legible.]

[Table 17. Effects of Initial Confounding on Empirical Power for the One-Tailed Tests. Table body not legible.]

[Table 18. Effects of Initial Confounding on Empirical Power for the Two-Tailed Tests. Table body not legible.]

[...] one-tailed test were .131 for γ = 0 and .124 for both γ = .2 and γ = .4). And again the empirical power of the one-tailed test for t_z4w decreased from .139 to .120 to .10 as γ increased from 0 to .2 to .4. It should be noted that the two-tailed t_z4w and t_z5w were less powerful than the one-tailed tests.

β1 > β2

Only the one- and two-tailed tests for t_z4w and t_z5w and the two-tailed test for t_z2w were valid under the null hypothesis. In contrast to previous results, the empirical power of the one- and two-tailed tests for t_z5w increased slightly as γ increased (e.g., empirical powers for the one-tailed test of .157, .171 and .173 as γ increased from 0 to .2 to .4). The empirical power of the one-tailed test for t_z4w did not have a clear relationship to γ, but the two-tailed test had essentially constant power as γ increased (e.g., the empirical powers were near .09). Similarly, the empirical power of the two-tailed test for t_z2w was not much affected by varying γ (e.g., empirical powers of .09, .102 and .102 as γ increased from 0 to .2 to .4).
Effects of Errors of Measurement in the Premeasure

Due to the absence of error of measurement components from the formulas for ρ_z4w and ρ_z5w, one can expect that errors of measurement in the premeasure will not affect the empirical power of the test statistics using r_z4w and r_z5w. Errors of measurement were expected to increase the power of the test statistic t_z2w. Table 19 reports the results of the effect of errors of measurement on the empirical power of the several one-tailed tests for the t_zw's for c = 30, s = 20 and γ = .2. Comparable results for the two-tailed tests are shown in Table 20.

[Table 19. Effects of Errors of Measurement in the Premeasure on Empirical Power for the One-Tailed Tests. Table body not legible.]

[Table 20. Effects of Errors of Measurement in the Premeasure on Empirical Power for the Two-Tailed Tests. Table body not legible.]

β1 = β2

As seen earlier, only the one- and two-tailed tests for t_z2w, t_z4w and t_z5w were valid, given the null hypothesis. The empirical powers of the one- and two-tailed tests for t_z2w, t_z4w and t_z5w increased in the presence of errors of measurement (e.g., empirical powers of .095 and .171 for t_z2w, .094 and .157 for t_z4w, and .117 and .173 for t_z5w).

β1 < β2

Only the one- and two-tailed tests for t_z4w and t_z5w were valid under the null hypothesis. The empirical power of the one- and two-tailed tests for t_z5w decreased in the presence of measurement error (e.g., empirical power for the one-tailed test was .139 for ρ_Y0Y0 = 1.0 and .124 for ρ_Y0Y0 = .8). Similarly, the empirical power of t_z4w decreased slightly in the presence of errors of measurement (e.g., .126 for ρ_Y0Y0 = 1.0 and .120 for ρ_Y0Y0 = .8).

β1 > β2

Earlier it was shown that only the one- and two-tailed tests for t_z4w and t_z5w were valid given the null hypothesis. The empirical power of both the one- and two-tailed tests for t_z4w and t_z5w increased in the presence of errors of measurement (e.g., .089 and .102 for ρ_Y0Y0 = 1.0 and .159 and .171 for ρ_Y0Y0 = .8).

Sample Size Effect

It was predicted that the empirical power of all the one- and two-tailed tests for the t_zw's would increase as c increased. Table 21 gives the results of the effect of sample size on the empirical power of the one-tailed tests for the t_zw's for s = 20, γ = .2 and ρ_Y0Y0 = .8. Comparable results for the two-tailed tests are shown in Table 22.
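The prediction that power increases with the number of classes c can be illustrated with a small Monte Carlo sketch. It is deliberately simplified and is not the hierarchical structural model used in this study: class-level values of W and Z are drawn directly from a bivariate normal distribution with a hypothetical population correlation `rho`, and the usual t statistic for a correlation on c - 2 degrees of freedom is assumed. The critical values are one-tailed .05 points from standard t tables.

```python
import math
import random

# One-tailed .05 critical values of Student's t for df = c - 2.
T_CRIT_05 = {10: 1.860, 30: 1.701, 50: 1.677}


def rejection_rate(c, rho, n_reps, rng):
    """Proportion of replications in which H0: rho = 0 is rejected.

    Each replication draws c (w, z) pairs with population correlation
    rho, computes r and t = r * sqrt(c - 2) / sqrt(1 - r^2), and
    rejects when t exceeds the one-tailed .05 critical value.
    """
    crit = T_CRIT_05[c]
    rejections = 0
    for _ in range(n_reps):
        w = [rng.gauss(0.0, 1.0) for _ in range(c)]
        z = [rho * wi + math.sqrt(1.0 - rho * rho) * rng.gauss(0.0, 1.0)
             for wi in w]
        mw, mz = sum(w) / c, sum(z) / c
        swz = sum((a - mw) * (b - mz) for a, b in zip(w, z))
        sww = sum((a - mw) ** 2 for a in w)
        szz = sum((b - mz) ** 2 for b in z)
        r = swz / math.sqrt(sww * szz)
        t = r * math.sqrt(c - 2) / math.sqrt(1.0 - r * r)
        if t > crit:
            rejections += 1
    return rejections / n_reps


if __name__ == "__main__":
    rng = random.Random(1984)
    for c in (10, 30, 50):
        # rho = 0 gives an empirical Type I error; rho = .3 (hypothetical)
        # gives an empirical power.
        print(c, rejection_rate(c, 0.0, 2000, rng),
              rejection_rate(c, 0.3, 2000, rng))
```

Under `rho = 0` the rejection rate estimates the Type I error, and under a nonzero `rho` it estimates power, which is how the two kinds of table in this chapter are related.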
[Table 21. Effects of Sample Size on Empirical Power for the One-Tailed Tests. Table body not legible.]

[Table 22. Effects of Sample Size on Empirical Power for the Two-Tailed Tests. Table body not legible.]

Earlier it was shown that only the one- and two-tailed tests for t_z2w, t_z4w and t_z5w were valid under the null hypothesis when β1 = β2 or β1 < β2. The empirical power of these tests increased as c increased (e.g., the empirical powers of the one-tailed tests for t_z2w, t_z4w and t_z5w at c = 50 were 234, 243 and 214 percent of their power when c = 10 in situations where β1 = β2). Only the one- and two-tailed tests for t_z4w and t_z5w were valid, given the null hypothesis, in situations where β1 > β2. Again, the empirical power of these tests increased as c increased (e.g., the empirical powers of the one-tailed tests for t_z4w and t_z5w were .083 and .082 for c = 10, .159 and .171 for c = 30, and .217 and .231 for c = 50).

Class Size Effect

It was expected that increased class size would have no effect on the empirical power of the tests for t_z4w and t_z5w, but that the empirical power of the test for t_z2w would drop as s increased. Table 23 reports the results of the effect of number of students per class on the empirical power of the one-tailed tests for the t_zw's for c = 30, γ = .2 and ρ_Y0Y0 = .8. Comparable results for the two-tailed tests for the t_zw's are given in Table 24.

As was shown earlier, only the one- and two-tailed tests for t_z2w, t_z4w and t_z5w were valid, given the null hypothesis.
The changes in the empirical power of the one-tailed tests for t_z4w and t_z5w were not large, but in each case power increased monotonically with s (e.g., empirical powers at s = 30 were 112 and 119 percent of the empirical powers when s = 10 and 102 and 103 percent of the empirical powers when s = 20). There was, however, no clear relationship between power and class size for t_z2w (empirical powers of .160, .171 and .164 as s increased). A similar but less pronounced relationship between power and class size was found for the two-tailed tests.

[Table 23. Effects of Class Size on Empirical Power for the One-Tailed Tests. Table body not legible.]

[Table 24. Effects of Class Size on Empirical Power for the Two-Tailed Tests. Table body not legible.]

β1 < β2

Only the one- and two-tailed tests for t_z2w, t_z4w and t_z5w had empirical Type I errors within two standard errors of the nominal values when β3 = 0. The empirical powers of the one-tailed tests for t_z2w, t_z4w and t_z5w tended to drop slightly as s increased, particularly from 20 to 30 (e.g., empirical powers of .114, .113 and .106 for t_z2w; .121, .120 and .106 for t_z4w; and .126, .124 and .112 for t_z5w as s increased from 10 to 20 to 30). However, the relationship between power and class size for the two-tailed tests for t_z2w, t_z4w and t_z5w was not clear (e.g., empirical powers of .077, .066 and .074 for t_z2w; .082, .071 and .076 for t_z4w; and .075, .082 and .080 for t_z5w as s increased from 10 to 20 to 30).

β1 > β2

Only the one- and two-tailed tests for t_z4w and t_z5w had empirical Type I errors within two standard errors of the nominal alphas when β3 = 0. The empirical power of t_z4w and t_z5w tended to increase with class size, though this relationship was most in evidence for the one-tailed test at alpha .1 (e.g., empirical powers of .211, .257 and .271 for t_z4w and .234, .272 and .286 for t_z5w).

CHAPTER VI

SUMMARY AND CONCLUSIONS

The purpose of this investigation was to determine the conditions under which testing H0: ρ_zw = 0 is equivalent to testing for no teacher behavior effect.
Five different methods for defining Z were investigated under a variety of conditions defined by varying (a) the amount of initial confounding, (b) the presence of errors of measurement in the premeasure, (c) sample size, (d) class size and (e) the relationship between β1 (i.e., the structural slope of class effect at time t on class effect at time 0) and β2 (i.e., the structural slope of within-class deviation at time t on within-class deviation at time 0).

A linear structural model which incorporates the hierarchical nature of the data and the possibility of measurement errors was provided in Chapter Three to determine analytically the conditions for which testing ρ_zw = 0 is equivalent to testing β3 = 0. The results showed that equivalence of the two null hypotheses does occur if either of the following conditions is met: (1) γ = 0 (i.e., no initial confounding of teacher behavior and class composition); (2) β1 = β2, given a perfectly reliable measure. Such equivalence between ρ_zw = 0 and β3 = 0 holds regardless of whether Z is defined using K set to the total, between or within regression coefficients.

A Monte Carlo approach was taken to investigate the appropriateness of the different test statistics, the t_zw's, in testing the hypothesis of no teacher behavior effect on student achievement. As expected, when γ = 0 and β3 = 0 all of the mean estimates of the ρ_zw's were near zero. Further, the empirical distributions of the t statistics for the different forms of r_zw were close to their theoretical t-distribution across all combinations investigated. Finally, all of the test statistics were valid, and t_z1w, t_z3w and t_z5w had greater empirical power than t_z2w and t_z4w.

Increasing the amount of initial confounding, γ, caused the mean estimates of ρ_z1w, ρ_z2w and ρ_z3w to depart from zero, but did not affect the mean estimates of ρ_z4w and ρ_z5w.
Results of the empirical Type I error rates paralleled, for the most part, the empirical results for values of the ρ_zw's. Increasing γ caused t_z1w and t_z3w to be centered to the right of the theoretical t-distribution when β1 = β2 or β1 > β2 and to its left when β1 < β2. For one-tailed tests, this caused the tests to be too liberal in the first two cases and too conservative in the third case. For two-tailed tests, t_z1w and t_z3w were again too liberal when β1 > β2 but valid for the other two relationships between β1 and β2. Results of the empirical Type I error rates indicated that increasing γ caused the one-tailed test for t_z2w to be too conservative when β1 = β2, the one- and two-tailed tests to be too conservative when β1 < β2 and the one-tailed test to be too liberal when β1 > β2. The only tests for which empirical Type I error rates were not affected by increasing the amount of initial confounding were t_z4w and t_z5w.

It can be concluded that as γ increased, only t_z2w, t_z4w and t_z5w had empirical Type I errors within two standard errors of the nominal alphas when β1 = β2, and only t_z4w and t_z5w for the other relationships between β1 and β2. However, in situations where t_z2w was a valid test, it had greater power than t_z4w but less than t_z5w.

Errors of measurement in the premeasure caused the mean estimates of ρ_z1w, ρ_z2w and ρ_z3w to depart slightly from zero when β1 = β2 and to become closer to zero, at least for t_z1w and t_z3w, when β1 < β2. However, errors of measurement did not affect the mean estimates of ρ_z4w or ρ_z5w. The one-tailed tests using t_z1w and t_z3w became less conservative as a result of the presence of errors of measurement when β1 < β2. The effects of errors of measurement on the two-tailed tests were not the same as those for the one-tailed tests. For example, errors of measurement brought the empirical Type I errors for t_z1w and t_z3w closer to the nominal alphas when β1 > β2.
Concerning the power of the tests with valid Type I errors, power for $t_{z_4w}$ and $t_{z_5w}$ tended to increase in the presence of errors of measurement in the premeasure when $\beta_1 = \beta_2$ or $\beta_1 > \beta_2$ but decreased when $\beta_1 < \beta_2$. Further, $t_{z_5w}$ had greater power than $t_{z_4w}$ across all combinations of $\beta_1$ and $\beta_2$.

Sample size was found to have no effect on the mean estimates of the $\rho_{zw}$'s across all combinations of $\beta_1$ and $\beta_2$. Increasing sample size affected empirical Type I error rates for the one-tailed tests using $t_{z_1w}$ and $t_{z_3w}$ (i.e., the tests were too liberal when $\beta_1 = \beta_2$ or $\beta_1 > \beta_2$ but too conservative when $\beta_1 < \beta_2$). While the empirical Type I error rates for the two-tailed tests paralleled those for the one-tailed tests when $\beta_1 > \beta_2$, they differed in situations where $\beta_1 = \beta_2$ or $\beta_1 < \beta_2$. Except for the one-tailed tests when $\beta_1 > \beta_2$, increasing sample size did not affect the empirical Type I error for $t_{z_2w}$ across all combinations of $\beta_1$ and $\beta_2$ at the .05 nominal alpha. Statistics $t_{z_4w}$ and $t_{z_5w}$ were the only tests to remain valid as sample size increased. The power of these two tests increased with sample size, and in most cases $t_{z_5w}$ had greater power than $t_{z_4w}$.

The number of students within each class had no effect on the mean estimates of any of the $\rho_{zw}$'s. It also had no effect on the one- and two-tailed tests across all combinations of $\beta_1$ and $\beta_2$. In general, the tests that were valid, conservative or liberal when classes were small remained so when class size increased. $t_{z_4w}$, $t_{z_5w}$ and, in some cases, $t_{z_2w}$ were the only tests with empirical Type I errors within two standard errors of the nominal alpha when $\beta_3 = 0$. For these tests, the empirical power tended to increase with class size when $\beta_1 = \beta_2$ or $\beta_1 > \beta_2$ and to drop slightly when $\beta_1 < \beta_2$.

In conclusion, when students are randomly assigned to classrooms, or when $\beta_1 = \beta_2$ and the premeasure has perfect reliability, testing $H_0\colon \rho_{zw} = 0$ is equivalent to testing no teacher behavior effect.
This equivalence holds regardless of whether Z is defined using K set to total, between or within regression weights. However, when students are not randomly assigned to classrooms (i.e., $\gamma \neq 0$), which is typically the case in practice, the test statistics $t_{z_1w}$, $t_{z_2w}$ and $t_{z_3w}$ were valid only in a few situations. In general these tests, particularly $t_{z_1w}$ and $t_{z_3w}$, tended to be too liberal in situations where $\beta_1 > \beta_2$ (the typical case in education) and too conservative when $\beta_1 < \beta_2$. Interestingly, the only tests that were valid for all conditions investigated were $t_{z_4w}$ and $t_{z_5w}$. Since K is usually unknown in practice, the procedure of choice should be $t_{z_4w}$. In addition to being valid, it affords an estimate of K rather than requiring K to be known a priori.

Increasing sample size and the presence of errors of measurement increased the empirical power of $t_{z_4w}$ and $t_{z_5w}$. Their empirical power increased slightly with class size when $\beta_1 = \beta_2$ or $\beta_1 > \beta_2$ but decreased slightly when $\beta_1 < \beta_2$. While the empirical power of $t_{z_5w}$ remained constant as $\gamma$ increased when $\beta_1 = \beta_2$ or $\beta_1 < \beta_2$, the empirical power of $t_{z_4w}$ decreased. In situations where $\beta_1 > \beta_2$, the empirical power of $t_{z_5w}$ increased but that of $t_{z_4w}$ remained constant.

The results for the two-tailed tests were not in complete agreement with those for the corresponding one-tailed tests. A possible explanation is that the distribution of the test statistics may be skewed.

The results of this investigation are limited to the parameter values chosen. In other words, if some parameter values were changed, such as $\rho_{Y_0Y_0'}$ and the relative magnitude of $\beta_1$ and $\beta_2$, some of the results would be different. For example, the satisfactory results based on using $t_{z_2w}$ were functions of the chosen parameter values. If the chosen values of $\rho_{Y_0Y_0'}$, $\beta_1$ and $\beta_2$ had been .9, .3 and .9 instead of .8, .3 and .7 when $\beta_1 < \beta_2$, the value of $\rho_{z_2w}$ would have changed from .003 to .06.
The test statistic $t_{z_2w}$ may be too conservative instead of valid for these new parameters.

The results of this study indicate that the procedures used by process-product researchers in forming residual gain scores typically provide misleading results. Sometimes the test statistics used are too liberal and other times they are too conservative. Therefore, it is recommended that process-product researchers who wish to test for no teacher behavior effect use $t_{z_4w}$, which involves setting $K = \hat{\beta}_1$. In addition to yielding valid Type I error rates across all conditions investigated, the procedure had reasonable power (though not as good as if K were a known constant).