Michigan State University

This is to certify that the dissertation entitled

EXAMINATION OF THE USDE NORM-REFERENCED EVALUATION MODEL

presented by

Irene Mary Leland

has been accepted towards fulfillment of the requirements for the Ph.D. degree in Educational Psychology

EXAMINATION OF THE USDE NORM-REFERENCED EVALUATION MODEL

by

Irene Mary Leland

A DISSERTATION

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of

DOCTOR OF PHILOSOPHY

Department of Counseling, Educational Psychology and Special Education

1987

Copyright by IRENE MARY LELAND 1987

ABSTRACT

EXAMINATION OF THE USDE NORM-REFERENCED EVALUATION MODEL

By

Irene Mary Leland

The norm-referenced model replaces the control group with the norming population of a standardized test. It assumes natural growth follows the pattern observed in this norming population.
The model lacks validity when local patterns of growth differ systematically from those of the norming population. Because the model is widely used in evaluating Chapter 1 Projects, additional research is needed to provide a clearer understanding of it. In particular, it is necessary to determine whether local conditions result in patterns of growth sufficiently different from the national norms to invalidate the model.

This study approached the norm-referenced model as implying a model of growth in measured achievement over successive school years. Patterns of growth implied by the model and the norms for two standardized tests were compared.

Local data from six school districts, selected for differences in use of out-of-level testing, effectiveness of overall school program and proportion of low SES students, were examined. Local patterns of growth were used to test the norm-referenced model. Consequences of differences between local and national growth patterns on apparent gains, Type I errors and power were determined. For two districts longitudinal data were compared with expectations based on cross-sectional norms.

The patterns of growth implied by the norm-referenced model and norms of the standardized tests studied were found to be curvilinear, with the rate of growth highest in the early grades. Significant differences were found in the growth curves for the two tests.

The model is robust with respect to differences in percentile rank. Significant differences between tests were found at Grade 6 in reading. In mathematics significant differences were found for specific grade levels and districts.

Local patterns of growth affect both Type I errors and power. Small true gains are unlikely to be detected at the local level but educationally significant gains usually will be.
Results of the longitudinal analyses indicate at least some cases where local longitudinal data do not match the results of cross-sectional data for the national or local populations.

This study identifies several potential sources of bias that should be considered when interpreting local data.

Dedicated to my parents
Walker M. and Emily S. Dawson

ACKNOWLEDGEMENTS

I would like to acknowledge all of those who have helped me in this endeavor. First, I would like to thank Dr. Andrew C. Porter, who served as my major advisor and as chairman of my committee. His encouragement enabled me to achieve more than I would have thought possible before I started this program. Then I want to thank the members of my committee: Dr. Norman T. Bell, who encouraged me to believe I could undertake and complete a doctoral program at this point in my life; Dr. William W. Farquhar, who helped me to relate the research methods I studied in my coursework to my work in evaluation at the Michigan Department of Education; Dr. Roy V. Erickson, who helped with the statistical problems I encountered in the analyses for this dissertation; and Dr. Lawrence I. O'Kelly, who was always willing to listen when I needed to clarify my thoughts on some problem.

Next I want to thank all my coworkers at the Michigan Department of Education who encouraged me in this undertaking. Special thanks go to Dr. David L. Donovan and Dr. Daniel E. Schooley. Without their encouragement and support I could not have completed this program.

I also want to thank Ms. Caryl Basel, Ms. Marsha Bowers, Mr. LaVerne Dittenber, Mr. George Johnson, Dr. Franci Moorman and Mr. Donald Peters for their assistance in collecting the data for this study.

Then I want to thank my family for their understanding and support during the time I have been working on this project. Without their help, especially that of my two youngest children, Robert and Carolyn, I could not have completed this endeavor.
And finally, I want to thank my many friends who have encouraged me in this undertaking.

TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES

Chapter

I. INTRODUCTION
   Need for this Research
   Overview of the Study
   Definitions of Important Terms

II. THE NORM-REFERENCED MODEL AND ITS ASSUMPTIONS ABOUT NATURAL GROWTH

III. REVIEW OF THE LITERATURE

IV. PROCEDURES
   Development and Comparison of Patterns of Growth Using the Assumptions of the Norm-referenced Model
   Effect of Variables Hypothesized to Affect the Norm-referenced Model's Validity at the Local Level
      Selection of Districts
      Collection and Processing of Local Data
      Development of Growth Curves Based on Local District Data
   Effects of Local District Growth Patterns on the Validity of the Norm-referenced Model
      Determination of No-treatment Gains
      Probability of Type I Errors
      Probability of Detecting True Gains
   Comparison of Longitudinal Data with Expectations Based on Cross-sectional Norms

V. RESULTS OF THE STUDY
   Comparison of Growth Curves for the CAT and MAT
   Analysis of the Growth Curves for the CAT and MAT
   Comparison of National and Local Norms
   Stability of Local Norms
   Effect of Local Growth Patterns on Gains Measured by the Norm-referenced Model
      Effects on Expected No-treatment Growth
      Effects on Likelihood of Type I Errors
      Effects on Power to Detect True Gains
      Effects on Size of Sample Needed to Detect True Gains
   Comparison of Longitudinal Data with Cross-sectional Norms

VI. DISCUSSION
   Assumptions the Norm-referenced Model Makes About Natural Growth
   Comparison of Growth Curves for the CAT and MAT
   Robustness of the Norm-referenced Model with Respect to Variations Found in Local Patterns of Growth
   Stability of Local Growth Patterns from Year to Year
   Adequacy of Cross-sectional Norms for Estimating Longitudinal Patterns of Growth
   Implications of These Findings
   Areas Needing Additional Research

VII. SUMMARY AND CONCLUSIONS
   Conclusions

Appendix

A. GRAPHS OF LOCAL GROWTH CURVES
B. YEARLY DISTRICT DATA

REFERENCES

LIST OF TABLES

Table
1. Summary of Powers' (1983) Data
2. Total Number of Students Tested in 1983, 1984 and 1985
3. Reading 4-way ANOVA Summary
4. Mathematics 4-way ANOVA Summary
5. CAT and MAT Scale Scores, Reading and Mathematics
6. CAT and MAT Rescaled Scale Scores
7. Differences Between CAT and MAT Growth Curves
8. National and Local Growth Curves
9. Comparison of National and Local Percentile Ranks
10. Size of District and Stability of Local Norms
11. Significance of Expected Zero Gain Scores
12. Effect on Power of Local Growth Patterns
13. Effect on Sample Size of Local Growth Patterns
14. Longitudinal Data
15. Yearly District Data: CAT - Reading
16. Yearly District Data: MAT - Reading
17. Yearly District Data: CAT - Mathematics
18. Yearly District Data: MAT - Mathematics

LIST OF FIGURES

Figure
1. Distribution of Hypothetical Group
2. CAT Reading Growth Curves
3. MAT Reading Growth Curves
4. CAT Math Growth Curves
5. MAT Math Growth Curves
6. Rescaled Reading Growth Curves: CAT vs MAT
7. Rescaled Math Growth Curves: CAT vs MAT
8. National and Local Growth Curves: CAT - Reading
9. National and Local Growth Curves: MAT - Reading
10. National and Local Growth Curves: CAT - Math
11. National and Local Growth Curves: MAT - Math
12. Expected Zero Gain Scores: Reading
13. Expected Zero Gain Scores in Math: CAT
14. Expected Zero Gain Scores in Math: MAT
15. Expected Zero Gain Scores in Math: Out-of-Level Districts
16. Expected Zero Gain Scores in Math: Improving MEAP Districts
17. Expected Zero Gain Scores in Math: Low SES Districts
18. COL Growth Curves: Reading
19. CIM Growth Curves: Reading
20. CLS Growth Curves: Reading
21. MOL Growth Curves: Reading
22. MIM Growth Curves: Reading
23. MLS Growth Curves: Reading
24. COL Growth Curves: Mathematics
25. CIM Growth Curves: Mathematics
26. CLS Growth Curves: Mathematics
27. MOL Growth Curves: Mathematics
28. MIM Growth Curves: Mathematics
29. MLS Growth Curves: Mathematics

EXAMINATION OF THE USDE NORM-REFERENCED EVALUATION MODEL

CHAPTER I

INTRODUCTION

A number of evaluation models have been suggested for use in estimating program effects when a randomly assigned control group is not feasible. In response to legislative mandate, RMC Research Corporation developed several models for use in evaluating federally funded programs (Tallmadge and Wood, 1976). One of these, the norm-referenced model, has since become almost universally used in evaluating Chapter 1* Projects (Reisner et al., 1982). The norm-referenced model replaces the control group with the norming population of a standardized test. It assumes that natural growth (i.e. growth that would occur in the absence of any special program) follows the same pattern as that observed in the norming population.
The norm-referenced model lacks validity when local patterns of natural growth differ systematically from those of national norming populations. When national norms are used, a number of conditions can affect the apparent growth observed in a local group (e.g. the similarity between local and national populations, local and national school practices, test scheduling and the match of the test to the local school curriculum). There is a need to determine whether local conditions actually do result in patterns of natural growth sufficiently different from those of the national norms to invalidate this method. It is important to estimate the effects of specific conditions and the amount of bias likely to be incurred under each. Local school district norms need to be compared to national norms to determine whether there is evidence of bias.

* Formerly Title I. In July 1982 ESEA Title I was replaced by ECIA Chapter 1. The models originally developed under Title I have continued to be used under Chapter 1.

The norm-referenced model compares longitudinal program data with cross-sectional norming data. There is a need to determine whether cross-sectional data provide a useful measure of the growth patterns found in longitudinal data.

This study had a threefold purpose.

1. To examine the assumptions of the norm-referenced model with respect to the nature of natural growth and to determine whether the growth curves implied by different standardized tests are congruent.

2. To describe conditions other than program effects that would be expected to affect the gains actually observed, and to examine, using local district data, their effects on local patterns of growth and the implications for the validity of the norm-referenced model.

3. To compare the growth actually found in longitudinal data with that implied by cross-sectional norms.
Need for this Research

This study is important because the norm-referenced model has come to enjoy wide use for evaluating program effects. If it is found to have questionable validity, then its popularity is not warranted. It is important to examine the robustness of the norm-referenced model with respect to local deviations from conditions pertaining to national norming populations. School practices and other conditions that can affect the pattern of local natural growth need to be identified. Examination of the validity of using cross-sectional norms to evaluate longitudinal data is also important.

Overview of the Study

The following research questions represent the primary thrust of the project.

1. What assumptions does the norm-referenced model make about natural growth? More specifically, what patterns of natural growth are implied by the assumptions?

2. Are the growth curves implied by the model the same for all standardized tests?

3. How robust is the norm-referenced model with respect to variations found in local patterns of growth?

4. How stable are the local patterns from year to year?

5. How adequate are cross-sectional norms for estimating longitudinal patterns of growth?

From these questions the following plan for testing the robustness of the model was developed.

1. The patterns of growth implied by national norms for two tests would be determined.

2. Selected local districts would be examined for differences on factors believed to influence patterns of growth. Local patterns of growth implied by local norms for these districts would be determined.

3. Consequences of differences between local and national growth patterns on observed normal curve equivalent (NCE) gains would be calculated and the effect on Type I errors and power of tests for program effects determined.

4. For two sample districts patterns of growth based on longitudinal data would be compared with patterns of growth obtained using cross-sectional data.
The local data would be used to test the norm-referenced model. If the national norms provide a valid comparison group for students in local compensatory education programs, the expected growth of the local students should be the same whether local or national norms are used. The research hypothesis is that students growing in achievement at the rate implied by the local norms would maintain the same percentile rank from year to year when evaluated with national norms.

Definitions of Important Terms

The following terms are used in this paper.

1. Natural growth - The growth in achievement that would take place normally in an educational setting in the absence of any special program.

2. Expanded standard scores or scale scores - Scores on a continuous scale designed in such a way that scores at each test level are normally distributed (as are standard scores) and that scores on different levels at successive time points represent growth across levels. The scaling method is based on Thurstone's (1925) absolute scaling technique. Expanded standard scores are called by different names in the various test manuals. The manuals for the tests used here refer to this type of score as scale scores, so that term will be used in this paper. The scales are considered to be equal interval, so that a difference of 5 points represents the same difference anywhere on the scale and so that successive scores can be used as a measure of growth over time.

3. Growth curve - A curve depicting the relationship between achievement in scale scores and grade level placement.

4. Normal curve equivalents (NCE's) - Scores converted to a normalized standard score scale with a mean of 50 and a standard deviation of 21.06. NCE's are the same as percentile ranks at the 1st, 50th and 99th percentiles. Between these points they are distributed on an interval scale whereas percentiles are not.

5. Functional level testing - Testing students with a test level on which they may be expected to answer between 30 and 80 percent of the items correctly. In some cases this may require out-of-level testing (i.e. testing with a test level other than that recommended for a student's grade level placement, or use of different test levels at pretest and posttest times). Scores for students tested out-of-level are converted to scale scores. These in turn are converted to in-level percentiles or NCE's. Thus, the performance of students tested out-of-level is still compared with that of their grade-level peers.

CHAPTER II

THE NORM-REFERENCED MODEL AND ITS ASSUMPTIONS ABOUT NATURAL GROWTH

The norm-referenced model uses students' growth relative to that implied by national norms as the basis for evaluating the effectiveness of an educational program. The no-treatment expectation is that students will maintain the same relative status (i.e. percentile rank or NCE score) over a period of time.

The model as approved by the U.S. Department of Education (USDE) has several requirements designed to improve the validity of the results. First, students are required to be pretested and posttested at dates near the empirical norming dates of the test being used. If the evaluation is based on a fall-to-spring testing schedule, a test with both fall and spring empirical norms must be used.

Second, to counter the effects of statistical regression the model requires that students be selected for program participation on some basis other than their pretest scores.

Finally, to be properly implemented the model requires functional level testing. For many low-achieving students this means using a test level that was normed on students in a different (lower) grade. When tests are normed, test levels are administered to students on the basis of their grade placement rather than their functional level.
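The NCE scale and the no-treatment expectation described above can be illustrated with a minimal sketch. It is not taken from the USDE documentation; the function names are hypothetical, and the conversion simply assumes that an NCE is a normalized standard score with mean 50 and standard deviation 21.06, as stated in the definition of NCE's.

```python
from statistics import NormalDist, mean

NCE_MEAN = 50.0
NCE_SD = 21.06  # standard deviation stated in the definition of NCE's


def percentile_to_nce(percentile_rank):
    """Convert a percentile rank (strictly between 0 and 100) to an NCE.

    Assumption: the NCE is obtained by taking the normal deviate that
    corresponds to the percentile rank and rescaling it to mean 50, SD 21.06.
    """
    z = NormalDist().inv_cdf(percentile_rank / 100.0)
    return NCE_MEAN + NCE_SD * z


def nce_gain(pretest_nces, posttest_nces):
    """Observed gain under the norm-referenced model.

    The no-treatment expectation for the posttest is the group's mean
    pretest NCE, so the gain is mean posttest NCE minus mean pretest NCE.
    """
    return mean(posttest_nces) - mean(pretest_nces)


if __name__ == "__main__":
    # The scale agrees (approximately) with percentile ranks at 1, 50 and 99.
    for pr in (1, 50, 99):
        print(pr, round(percentile_to_nce(pr), 2))
    # Hypothetical scores for a small project, for illustration only.
    print(nce_gain([30.0, 40.0], [35.0, 45.0]))
```

Under the model, a group showing only natural growth would be expected to produce a gain near zero from this calculation; any positive value is attributed to the program.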
Certain assumptions about natural growth are implicit in the norm-referenced model. The norm-referenced model compares observed growth with an estimate of what students' achievement would have been under conditions of natural growth. It assumes cross-sectional norms provide a valid estimate of natural growth. For example, consider a group of students who achieved at the 30th percentile at the end of third grade. Their expected achievement at the end of fourth grade would be the same as that of the norm group students at the end of fourth grade who achieved at the 30th percentile. Thus, the difference between the achievement of third and fourth grade students at the 30th percentile represents a year's expected growth for students at that percentile rank and grade level.

Put another way, under the equipercentile assumption percentile ranks and NCE scores for a group of students would be expected to remain constant across grade levels in the absence of special treatment effects. Students in a grade-level cohort with normal growth would be expected to show zero NCE gains from year to year. Grade level placement determines the norm group with which students' scores are compared and, hence, their percentile ranks and expected achievement.

CHAPTER III

REVIEW OF THE LITERATURE

Models suggested in the literature for use in estimating program effects when a randomly assigned control group is not feasible and subjects are expected to change in the absence of any treatment are basically of two types:

1. models that use a non-equivalent comparison group for estimating expected gains, and

2. models that use information about the students' prior performance to estimate expected gains.

The models developed by RMC Research Corporation (Tallmadge and Wood, 1978) for evaluating Title I programs are examples of the first type. Olejnik's (1977) model and the value-added model (Bryk and Weisberg, 1976; Bryk, Strenio and Weisberg, 1980) are examples of the second type.
The three RMC models all evaluate gains using a type of standard score, the normal curve equivalent or NCE (Tallmadge and Wood, 1978). Thus, all make use of some kind of comparison group.

The RMC model referred to in the Title I literature as Model A is the norm-referenced model, the model being evaluated here. The model assumes that, without treatment, students' percentile ranks and hence NCE scores will remain the same over time. The no-treatment expectation for the posttest score is the group's mean pretest NCE score. Tallmadge and Wood (1978) state (p. 44), "Those children in the (norm) sample who obtained the same test scores as Title I children serve as a kind of control group." The analogy with a control group is not completely accurate because: 1) in nearly all cases the same children are not included in both the pretest and posttest norming samples, and 2) the composition and treatment with respect to relevant variables of the subsample being compared with the Title I children is not known. Nevertheless, data from the norming population are used as a substitute for control group data in the norm-referenced model.

The norm-referenced model was developed and refined by RMC Research Corporation for use with the Title I Evaluation and Reporting System or TIERS (Reisner et al., 1982). In a sense it was an outgrowth of earlier Title I evaluation decisions based on the concept of a "year's growth" using test scores expressed as grade equivalent units (Tallmadge, 1982).

The model was intended to provide information for Congress on the effectiveness of federally funded programs. A major concern was in obtaining data that could be aggregated to provide an overview of program effects at the state or national level. The data so aggregated include results for large numbers of students involved in a variety of programs.
Analyses of data at the national level identify small differences in gains as statistically significant while glossing over large differences among the programs aggregated (Reisner et al., 1982). In contrast, at the local level, where single projects are evaluated, small sample sizes may lead to large standard errors. Reisner et al. (1982) describe the typical project as having fewer than 30 students and note that many projects have fewer than ten. With the large amount of variance present in student gains and the typically small sample sizes, statistical power is often a problem.

The norm-referenced model as approved for use by USDE relies solely on the difference between observed and expected means and makes no provision for testing interaction effects. For example, Reisner et al. (1982) note that the analyses performed when using the model give no information about how effective programs are in raising the floor of the distribution as opposed to the mean. Since the implied purpose of any compensatory education program is to remediate educational deficiencies, it seems possible that in many cases educational programs having their greatest effects on the lowest achieving students have been implemented. Such programs might result in substantial gains for the neediest students even though the mean program gain is not large.

Of particular importance to the present study, the model makes no provision for variations in the effect of the regular school program. Where the regular program is markedly more or less effective than the national average, patterns of natural growth in a district would differ from those of the national norm group. The varying effectiveness of regular school programs is perhaps one of the most important variables affecting the validity of the model. Tallmadge (1982) explores the question of school effects on the model in one of his analyses using scores for treatment and control students matched by school building.
He found evidence of school effects in the data for Grade 2 but not for Grades 4 and 6. He notes that the treatment evaluated by the norm-referenced model is the students' total school experience. Thus an ineffective Chapter 1 program in an effective school system might show gains larger than those in the national norms. In contrast, for an effective program in an ineffective system the observed gains might be less than the no-treatment expectation.

A number of other concerns about the validity of the model have been raised in the literature. The problem receiving the most attention is that of regression to the mean. When the model was first considered, regression was recognized as a problem if the same test results were used both for selection of students into the program and as pretest scores (Tallmadge and Horst, 1976; Tallmadge, 1977). As a result the model approved by USDE for use with Title I programs requires selection of students on the basis of different test scores than those used for the pretest (Tallmadge and Wood, 1976).

Concern has been expressed by a number of writers that there is still a residual regression artifact present in the observed gains so long as the selection test correlates more closely with the pretest than with the posttest (Linn, 1980; Roberts, 1980; Trochim, 1982). An adjustment could be made for this residual regression if the selection-pretest and selection-posttest correlations were known. For the norming population they are not. Even the pretest-posttest correlation for the norming population is not known, since the same individuals are not involved in the pretest and posttest norming. In the case of local studies, the correlations could be calculated and the adjustment made for any residual regression. This procedure has rarely been followed in actual practice.
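The residual regression artifact described above can be sketched numerically. The following is an illustration under simplifying assumptions, not the adjustment procedure of any of the cited authors: it assumes linear regression, equal standard deviations on the NCE scale for the selection test, pretest and posttest, a population mean of 50, a group selected solely on the selection test, and no treatment effect. The function name and the example correlations are hypothetical.

```python
def residual_regression_artifact(selection_mean_nce,
                                 r_selection_pretest,
                                 r_selection_posttest,
                                 population_mean_nce=50.0):
    """Expected spurious NCE gain from regression toward the mean.

    Under the stated assumptions, a group whose mean selection score lies
    `deviation` points from the population mean is expected to lie
    r * deviation points from the mean on a test that correlates r with
    the selection test. The artifact is the difference between the
    expected posttest and expected pretest means.
    """
    deviation = selection_mean_nce - population_mean_nce
    expected_pretest = population_mean_nce + r_selection_pretest * deviation
    expected_posttest = population_mean_nce + r_selection_posttest * deviation
    return expected_posttest - expected_pretest


if __name__ == "__main__":
    # Hypothetical low-achieving group: mean selection NCE of 30, with the
    # selection test correlating .80 with the pretest and .70 with the posttest.
    artifact = residual_regression_artifact(30.0, 0.80, 0.70)
    print(round(artifact, 2))  # apparent gain expected with no treatment at all
```

When the two correlations are equal the artifact vanishes, which is why the concern arises only when the selection test correlates more closely with the pretest than with the posttest. An adjusted gain would subtract this quantity from the observed gain.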
A number of studies have addressed the validity of the equipercentile assumption, that in the absence of special treatment students will maintain the same relative positions with respect to one another over time. Included are studies by Tallmadge (1985, 1982), Hiscox and Owen (1979), Powers et al. (1983), and Linn (1979). In several cases evidence was found to dispute the validity of the equipercentile assumption in school and district data.

Linn (1979) used data originally collected by Van Hove et al. (1970). The Van Hove study looked at achievement test results at two grade levels for schools in six urban school districts. Unweighted average percentile ranks for schools categorized by percent of minority students were compared. While the main purpose of the original study was to compare performance in the six school districts, Linn used the data to consider the validity of the equipercentile assumption. He converted the average percentile ranks from Van Hove's study into NCE's and examined the implications of the results for the norm-referenced model. He found that in nearly all of the cities the NCE scores for the later grade (6) were lower than for the earlier grade (either 3 or 4). He notes that the data are cross-sectional rather than longitudinal and that averaging percentiles across parts of a battery may conceal "interesting" trends. More to the point, averaging percentile ranks is not considered an acceptable measurement procedure. While it is difficult to estimate the effect that averaging percentile ranks had on them, these results do raise doubts about the validity of assuming that urban minority students would maintain a constant NCE score (or percentile rank) from year to year in the absence of a compensatory education program.

Hiscox and Owen (1978) found evidence of significant variability in the percentile ranks of Portland (Oregon) students over a four year period.
Using longitudinal data from the Portland Public Schools they followed the Comprehensive Test of Basic Skills (CTBS) scores for groups of low-scoring fourth and seventh grade students for up to four years. Students were grouped both by the percentile rank of their first year's score and by the number of years (zero to four) that they were eligible for Title I services. Hiscox and Owen imply that all eligible students received Title I services. However, they provide no evidence that they checked beyond students' test scores to determine that this was actually so. Included in their data were scores for students attending eight elementary schools offering Title I programs. No information is given about the nature of the Title I programs at these schools or even whether all the schools provided the same type of Title I program. Their results showed year-to-year differences in the groups' percentile standings ranging from zero to nine percentiles, with most differences in the two to three percentile range. From this they conclude that the "noise" level in a local norm-referenced study may be greater than the potential student gains. This suggests that results from data for a single year (or cohort) may, at the local level, be unduly influenced by the presence of local variation.

Powers et al. (1983), in their study of Tucson (Arizona) students, found that groups of these students gained consistently in percentile rank over a fall-to-spring interval. Seventh and ninth grade students, from schools that did not participate in Title I projects, were grouped on the basis of stanines and also into deciles. When students were grouped by stanines into low (1-3), middle (4-6) and high (7-9) achieving students, the mean NCE gains for all the groups (low, middle and high at each grade level) were consistently greater than zero. From this Powers et al. concluded that use of the equipercentile assumption is inappropriate.
Tallmadge (1985) reanalyzed Powers' data and concluded that they actually provide support for the equipercentile assumption. He notes that when the selection test scores are included in the analysis a different picture results. Table 1 gives the overall mean seventh and ninth grade scores for the selection test, pretest and posttest along with the corresponding N's and standard deviations as reported by Powers (1983, p. 301).

Table 1. Summary of Powers' (1983) Data

Grade     N      Test         Mean Score     SD
  7      1327    Selection      59.24       18.59
                 Pretest        58.61       19.20
                 Posttest       62.11       18.11
  9      1897    Selection      59.45       19.14
                 Pretest        58.88       18.33
                 Posttest       61.03       18.86

Normally in using the norm-referenced model only the pretest and posttest scores would be considered. However, in this case the selection test scores were used only to subdivide the group for analysis and not for selecting students into the study. Further, the selection test scores were obtained two years prior to the pretest. Under these conditions, Tallmadge argues, it is reasonable to consider the mean selection scores for the total seventh and total ninth grade groups as providing evidence for or against the validity of the equipercentile assumption. He finds the selection and pretest scores obtained two years apart to be "virtually identical." The posttest scores obtained the following spring are, however, significantly higher. Tallmadge examines a number of possible alternative hypotheses. He concludes that the most likely explanation is one that attributes the apparent gains to an artifact resulting from misapplication of the norm-referenced model, e.g. "stakeholder" bias. Such an artifact biases the results when the data are interpreted using the norm-referenced model.

A study by David and Pelavin (1978) found evidence that fall-to-spring gains were not maintained over the summer.
Analysis of Title I data nationally (Reisner et al, 1982) suggests that gains based on fall-to-spring testing are artificially inflated relative to those obtained when following an annual testing schedule. A number of sources of bias that may contribute to the inflation of fall-to-spring gains are identified. Thus, Powers' findings may reflect sources of bias present in gains based on fall-to-spring testing that are not present when growth is measured annually.

An earlier study by Tallmadge (1982) using a national data base provides support for the validity of the equipercentile assumption. He uses data from the Sustaining Effects Study* and from the 1977 norming of the California Achievement Tests. An unknown number of students in the CAT norming file were participants in compensatory education programs. Students were selected in a manner designed to simulate the procedures in implementing the norm-referenced model and the randomized control group design. Specifically, he divided students in the data base into high- and low-achieving groups on the basis of a selection test cutoff score. The low-achieving group was randomly assigned to simulated treatment and control groups. (Of course, neither group actually received any treatment.) The data used were for students tested in both the fall and spring of the same year. Tallmadge reports the results of three analyses using scores for students in grades 2, 4 and 6. In all of these he compared gains for the treatment group obtained using a simulated control group model. Although his results for the norm-referenced model indicate a positive bias of about 1 NCE for the Title I groups, he found them to be comparable in degree of accuracy with those obtained from the simulated control group model. (The variation for the randomized control group was, however, about equally positive and negative.) As a result of his study he concluded that "the norm-referenced model yields gain estimates that are reasonably comparable to those derived from the randomized control group design" (p. 110).

*The Sustaining Effects Study (SES) is a longitudinal study of compensatory education funded by the U.S. Department of Education. The SES data analysis file used by Tallmadge was one made up of scores of students who had not participated in compensatory education programs.

One aspect of the validity of the equipercentile assumption concerns the use of cross-sectional norms to generate the no-treatment expectation for project evaluation. Local data collected for the norm-referenced model are longitudinal with pretest and posttest scores for the same students. Norming data are usually based on scores of different students for each grade level and norming date (Reisner et al, 1982).

The norm-referenced model assumes that the cross-sectional data provide a valid representation of longitudinal growth. Specifically, the use of cross-sectional norms assumes that the students in the norm population represent the same population at different points in time (Murray et al, 1979). When evaluating local projects the use of the model assumes that the norm population is representative of the local population at the pretest and posttest time points, or at least that any differences are constant across the pretest to posttest time period. Thus, the growth rate of students in a project would, in the absence of any special program, be the same as that for students in the norming population.
Variables that have been addressed as possibly affecting the validity of using a national norming population as a control group include: differences in school policies relative to promotion of students (Tallmadge, 1977), differences in the racial/ethnic composition (Tallmadge, 1977), differences resulting from comparing longitudinal project data with cross-sectional norming data (Reisner et al, 1982; Trochim, 1982), and use of the appropriate test level (Tallmadge & Wood, 1978). Other studies in contexts unrelated to the norm-referenced model have examined factors that affect patterns of growth in school achievement. For example, Langer et al (1984) found age at school entry related to the pattern of growth in later school achievement. Bryk et al (1980) found achievement at day care centers related to a number of background variables including sex, race and SES as well as to age. Porter et al (1978) found that standardized tests differ in content and hence in match to any given curriculum. The match between test and curriculum could affect the pattern of growth observed. The studies reported by Linn (1979), Powers et al (1983) and Tallmadge (1982) suggest that the effectiveness of the overall school program may be an important factor in the results obtained.

Nationwide 99 percent of school systems use the norm-referenced model to evaluate Chapter 1 programs. Most of those using another model use RMC's special regression model, Model C (Reisner et al, 1982). Trochim (1982) found enough school districts in Florida using Model C to permit a study comparing results obtained using the norm-referenced and special regression models. On the basis of a meta-analysis of evaluations of districts using either of the two models Trochim found that programs evaluated with the norm-referenced model tended to show positive NCE gains while those evaluated with the special regression model were more likely to show zero or negative NCE gains.
Though Trochim concludes that the norm-referenced model overestimates the effectiveness of programs while the regression model underestimates it, his study provides no evidence of the extent to which the difference in results is due to sources of bias present in each of the models or to bias in the districts electing to use a given model.

None of the above studies approaches the norm-referenced model as providing a model of natural growth in measured achievement. Though the assumptions of the norm-referenced model when combined with the percentile norms and expanded standard score scale for a given test imply a model of expected growth in the achievement measured, this aspect of the model has not received the attention of researchers. Yet it is important. As Linn (1981, p. 183) notes, "Most work on the measurement of change has devoted little or no attention to models of growth. But good models of growth seem crucial to measuring and interpreting measures of change. Developing sound models of growth, especially for growth in measured achievement over the school years, will require considerable research."

CHAPTER IV

PROCEDURES

The procedures for carrying out this study can be grouped into three parts: 1) those involved in development and comparison of the patterns of growth implied by the norm-referenced model; 2) those involved in collecting and analyzing the local cross-sectional data; and 3) those involved in collecting and analyzing the longitudinal data.

Development and Comparison of Patterns of Growth Using the Assumptions of the Norm-referenced Model

Although not explicitly specified by the norm-referenced model, expected patterns of growth are implied when the equipercentile assumption is applied to empirical data.
Defining these patterns of growth required data from multiple time points, scores expressed on a developmental scale across grade levels and an equivalence between percentile ranks at each grade level and scores on a developmental scale, such as the expanded standard score scale used for the scale scores.

Procedures specified under the model for comparing students tested out-of-level with in-level norms require that raw scores be converted to in-level NCE's (Tallmadge & Wood, 1976). The scores are converted using scale scores. This assumes that for every scale score there is an equivalent in-level NCE and at any grade level there is a scale score equivalent of every NCE score. Both NCE's and scale scores are designed to provide equal interval scales at all grade levels. NCE's remain the same across grade levels, whereas scale scores permit a measure of growth over time. The expected "normal growth" pattern across grade levels for students attaining a given percentile rank or NCE score on a pretest can be represented by the scale score equivalents of that rank or score. No-treatment growth can be expressed in terms of expanded scale score equivalents across grade levels for any NCE score or percentile rank.

Two nationally standardized test batteries, the California Achievement Tests, 1978 edition (CAT) and the Metropolitan Achievement Tests, 1978 edition (MAT), were selected for use in this study. These two tests are among the tests most widely used for evaluating Michigan Chapter 1 programs. For these tests the patterns of growth implied by the equipercentile assumption and the tests' norms tables (California Achievement Tests, 1978; Prescott, Balow, Hogan and Farr, 1978) were examined. Scores for Total Reading and Total Math batteries were used. Using empirical spring norms the scale scores corresponding to the 10th, 30th, and 50th percentile ranks at each grade level were determined.
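The relation between percentile ranks and NCE's that this procedure relies on can be illustrated with a short Python sketch. It is not part of the original study. It assumes normally distributed scores and the conventional NCE scale (mean 50, standard deviation 21.06, the value used later in this chapter); the function names are hypothetical.

```python
from statistics import NormalDist

NCE_MEAN = 50.0
NCE_SD = 21.06  # standard deviation of the NCE scale (assumed conventional value)

_STD_NORMAL = NormalDist()  # standard normal distribution: mean 0, sd 1


def percentile_to_nce(percentile_rank: float) -> float:
    """Convert a percentile rank (strictly between 0 and 100) to an NCE."""
    z = _STD_NORMAL.inv_cdf(percentile_rank / 100.0)
    return NCE_MEAN + NCE_SD * z


def nce_to_percentile(nce: float) -> float:
    """Convert an NCE back to a percentile rank under the same assumptions."""
    z = (nce - NCE_MEAN) / NCE_SD
    return 100.0 * _STD_NORMAL.cdf(z)


if __name__ == "__main__":
    # The three percentile ranks examined in this study.
    for rank in (10, 30, 50):
        print(rank, round(percentile_to_nce(rank), 2))
```

Because NCE's are a linear function of the normal deviate, equal NCE differences correspond to equal distances on the underlying normal scale, which is not true of percentile ranks.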
These percentile ranks cover the range of achievement usually encountered in Chapter 1 programs. Scale scores were listed and plotted across grade levels for the selected percentile ranks. The scale scores represent expected achievement on the tests' empirical spring norming dates each year.

Use of spring norms assumes a spring-to-spring test interval which provides a more valid determination of yearly growth than fall-to-spring testing. It eliminates both the fluctuations that result from differences in the growth rate during periods of vacation and schooling and what Tallmadge (1985) refers to as "stakeholder bias." This is the bias that results when the classroom teacher who has a stake in the students' achievement gains administers both the pretest and the posttest. Approximately 80 percent of Michigan school districts evaluate Chapter 1 programs using a spring-to-spring test schedule. Although fall-to-spring testing is more common in other states, USDE recommends that districts be encouraged to use annual testing for program evaluation (Reisner et al, 1982). Rarely do school systems test students district-wide more than once a year.

Scale scores for the 10th, 30th and 50th percentiles based on spring empirical norms for grades Kindergarten through 12 were determined for the CAT and MAT. The scale scores for each percentile rank were then plotted against grade level to produce a set of growth curves for each test. Next the resulting growth curves for these two tests were compared to determine whether they appeared to be similar. The two reading tests purport to measure achievement in the same subject and so do the two mathematics tests. It was hypothesized that they would show similar growth curves. Differences between the curves for the two tests would indicate differences in content between the two tests or lack of an interval scale for one or both tests.
If there are differences the success of a local program being evaluated with the norm-referenced model would depend on the test used and on its match to the local curriculum. Further, the norm-referenced model is used to allow reading and mathematics data to be aggregated statewide and nationally across tests and at the state level to compare the achievement of districts using different tests. Implicit in these procedures is the assumption that the tests used measure achievement in a common content for which there is a common growth curve on an interval scale.

Since the two tests use different scales it was impossible to compare the expected scores for the two tests directly. To make possible a comparison of the scores for the two tests they were rescaled by setting the 50th percentile score for Grade 2 at 600 and that for Grade 6 at 800. The difference between the 50th percentile scores for Grades 2 and 6 on the new scale (200) was divided by the differences between the scores for the same points on the original scales to obtain the coefficients for converting other points on the curves to the new scale. The following formulae were used in the rescaling.

CAT reading:       $Y = 600 + 1.4286(X - 360)$
MAT reading:       $Y = 600 + 1.7094(X - 620)$
CAT mathematics:   $Y = 600 + 1.4388(X - 352)$
MAT mathematics:   $Y = 600 + 1.0582(X - 507)$

With the scores on the same scale it was possible to compare the variance (from the 50th percentile to the 30th and 10th) as well as the configurations of the growth curves for the two tests.

To test the null hypothesis that the growth curves are the same for both tests it was necessary to test both the configurations and variances. To test whether the configurations were the same Lord's (1957) test that the disattenuated correlation between two tests is 1.00 was used ($H_0\colon \rho_{xy} = 1.00$). Lord's test requires in addition to the correlation between the two tests a measure of the reliability of each of them.
The problem here was that the growth curves used in this study were taken directly from the tests' norms tables and could be assumed to have a reliability of 1.00. Any difference would then be significant. What was really wanted was a test of whether the growth curves at a given percentile rank for the two tests differed significantly more from each other than they did from the growth curves for other percentile ranks for the same test. That is, the correlations between curves for different percentile ranks for a single test represent the expected variation among curves for that test. Therefore, the correlation between points on the growth curves for two different percentile ranks for one test was used in place of the usual reliability coefficient. In the case of the 50th percentile curve the correlations between it and the 30th and 10th percentile curves were calculated and averaged. These correlations for the CAT and the MAT were then used with Lord's test in place of the reliability coefficients.

Let
$r_{xx}$ = the average correlation between the CAT curve being tested and the other CAT curves;
$r_{yy}$ = the average correlation between the MAT curve being tested and the other MAT curves; and
$r_{xy}$ = the correlation between the CAT and MAT curves being tested.

Under $H_0$ assume $\rho^0_{xx} = \rho^0_{yy} = \rho^0_{xy} = \rho^0$. $\rho^0$ cannot be measured directly but Lord (1957) demonstrates that $\rho^0$ can be estimated using:

$$\hat{\rho}^0 = \tfrac{1}{6}\,(r_{xx} + r_{yy} + 4r_{xy}).$$

The alternative hypothesis is that the curves are not the same ($H_1\colon \rho_{xy} \neq 1.00$). Under $H_1$ assume $\rho^1_{xx} = r_{xx}$, $\rho^1_{yy} = r_{yy}$, and $\rho^1_{xy} = r_{xy}$.

The formula for Lord's test is then:

$$\chi^2 = (N-1)\,\log_e \frac{(1-\hat{\rho}^0)^2\left[(1+\hat{\rho}^0)^2 - 4(\hat{\rho}^0)^2\right]}{(1-r_{xx})(1-r_{yy})\left[(1+r_{xx})(1+r_{yy}) - 4r_{xy}^2\right]}$$

To test the hypothesis that the levels of the two tests' growth curves are the same the SPSSx subprogram T-TEST was used with a paired-samples design. The hypothesis tested was $H_0\colon \mu_{x-y} = 0$ for each of the percentile ranks.
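The rescaling and the paired comparison of levels described in this section can be sketched in Python as follows. This is an illustration only, not the SPSSx procedure used in the study. The scores are the 50th percentile Total Reading scale scores listed in Table 5; all function names are hypothetical, and the slope is computed from the unrounded ratio 200/(Grade 6 score - Grade 2 score) rather than the four-decimal coefficients quoted above.

```python
import math
from statistics import mean, stdev

# 50th percentile Total Reading scale scores, Kindergarten through Grade 12,
# as listed in Table 5 (index 0 = Kindergarten, index 2 = Grade 2, ...).
CAT_READING_50 = [235, 303, 360, 401, 443, 473, 500, 521, 553, 574, 596, 619, 633]
MAT_READING_50 = [363, 506, 620, 661, 695, 718, 737, 754, 775, 791, 815, 827, 846]
GRADE_2, GRADE_6 = 2, 6


def rescale(curve, anchor_curve):
    """Rescale so the anchor curve's Grade 2 score maps to 600 and Grade 6 to 800.

    The anchor is the test's 50th percentile curve; the same transformation
    can then be applied to that test's 30th and 10th percentile curves.
    """
    low, high = anchor_curve[GRADE_2], anchor_curve[GRADE_6]
    slope = 200.0 / (high - low)
    return [600.0 + slope * (x - low) for x in curve]


def pearson_r(xs, ys):
    """Product-moment correlation between corresponding points on two curves."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def lord_rho0(rxx, ryy, rxy):
    """Estimate of the common correlation under H0: (rxx + ryy + 4*rxy) / 6."""
    return (rxx + ryy + 4.0 * rxy) / 6.0


def paired_t(xs, ys):
    """Paired-samples t: mean difference, its standard error, and t (df = n - 1)."""
    diffs = [x - y for x, y in zip(xs, ys)]
    d_bar = mean(diffs)
    se = stdev(diffs) / math.sqrt(len(diffs))
    return d_bar, se, d_bar / se


if __name__ == "__main__":
    cat = rescale(CAT_READING_50, CAT_READING_50)
    mat = rescale(MAT_READING_50, MAT_READING_50)
    print("r_xy =", round(pearson_r(cat, mat), 4))
    print("mean d, SE, t =", [round(v, 2) for v in paired_t(cat, mat)])
```

For CAT reading the slope is 200/140, about 1.4286, which agrees with the coefficient given above. Because the rescaling is linear, the correlation between the two curves is the same before and after rescaling; only the test of levels depends on the common scale.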
Effect of Variables Hypothesized to Affect the Norm-referenced Model's Validity at the Local Level

A number of variables may affect the validity of the model's assumptions about natural growth. Three of them were selected for this study. The first variable was the use of out-of-level testing. Since the model requires functional level testing many districts test their low-achieving students out-of-level. The results are compared with norming data obtained from students tested in-level. To test the validity of this practice two school districts that test their Chapter 1 students with a test level below that recommended for their grade placement were selected for inclusion in this study.

Another factor was the effectiveness of the overall school program in reducing the proportion of low-achieving students in a school district. Districts showing a pattern of improving achievement over a period of years might be expected to show changing patterns of growth. Two districts that have shown a decreasing proportion of students achieving less than 50 percent of the objectives on the Michigan Educational Assessment Program (MEAP) tests over the period from 1982 to 1984 were selected for inclusion in this study.

A variety of background variables (e.g. sex, race and SES) may affect patterns of growth if composition of the local population differs markedly from that of the norming population on any of them. A high proportion of low SES students has been most closely linked to Title I (now Chapter 1) programs. Eligibility for Chapter 1 funding is based on the proportion of low SES students in any given school building. Schools participating in test norming studies may be assumed to have an average number of low SES students overall. Two school districts with a high proportion of students eligible for Chapter 1 were included in the study.
Selection of Districts

A total of six school districts were selected, one that exemplifies each of three variables for each of the two tests. This resulted in the following design:

          Out-of-Level      Improving       High Prop. of
          Testing Used      MEAP Scores     Low SES Students
CAT            1                 1                 1
MAT            1                 1                 1

The following abbreviations will be used to identify the districts in this study.

COL - CAT, out-of-level testing used
CIM - CAT, improving MEAP scores
CLS - CAT, high proportion of low SES students
MOL - MAT, out-of-level testing used
MIM - MAT, improving MEAP scores
MLS - MAT, high proportion of low SES students

The six districts were selected from those that used either the CAT or MAT both for reporting Chapter 1 achievement data and for district-wide testing in Grades 2 through 6. A list of 120 districts which used the CAT and 119 which used the MAT for Chapter 1 evaluations in the 1982-83 school year was obtained. These were evaluated for each of the three variables in the design. From 12 to 22 districts were identified as ranking high on one of the variables and average or low on the other two. These districts were then contacted to determine whether or not they would have the data needed for the study. They were asked whether they did district-wide testing, which tests they used and at what grade levels. Those that did district-wide testing with either the CAT or MAT at three or more grade levels from Grades 2 through 6 were considered candidates for the study. From three to seven districts in each cell were identified as having the needed data. Willingness to cooperate and ability to provide the data in a usable form were also evaluated.
Where more than one district met the criteria for a cell and the districts were equally able and willing to provide the desired data, final selection was made by a random drawing. Reading and mathematics data were requested for Grades 2 through 6 for the years 1983, 1984 and 1985.

The final sample consisted of three suburban districts, COL, CIM and MIM, and three rural (largest town had 2,185 population) districts, CLS, MOL and MLS. These districts served from 7.25 (MIM) to 34.71 (MLS) percent of their students in Grades 2 through 6 in Chapter 1 programs. Four of the districts had Chapter 1 programs in both reading and mathematics. One, MIM, had a program in reading only and one, MOL, had programs in reading and language arts.

Collection and Processing of Data

Table 2 shows the total number of student scores obtained by subject, grade level and district. One district, MOL, had test data for Grade 6 only for 1985. For all other districts and grade levels three years of data were obtained.

Table 2. Total Students Tested in 1983, 1984 and 1985

                           Grade Level
District          2       3       4       5       6
Reading
  COL           2559    2626    2625    2766    2901
  CIM            887     808     821     777     775
  CLS            484     486     494     477     522
  MOL            178     166     208     204      66
  MIM            225     247     251     244     275
  MLS            107      93     107      92     111
Mathematics
  COL           2549    2625    2630    2766    2903
  CIM            887     806     823     776     740
  CLS            484     484     496     476     523
  MOL            183     166     207     203      67
  MIM            198     247     251     244     275
  MLS            108      95     106      92     111

Only one of the districts had computer facilities that enabled it to aggregate district-wide frequency distributions for submission. Two of the districts aggregated their data by building before submitting it. The remaining three districts submitted individual pupil data. The data received from the local districts were entered into a Honeywell (GCOS) computer, aggregated into district-wide frequency distributions where necessary and analyzed using SPSSx Version 2.0.
Examination of the size of the districts participating in this study reveals that size of district is confounded with the test used. A review of the Michigan districts that used the CAT as opposed to those that used the MAT to measure Chapter 1 achievement in 1982-83 reveals that the 120 districts using the CAT reported data for more than twice as many students as the 119 districts using the MAT (CAT, 3,871; MAT, 1,662). The districts used in this study reflect the differences in size of districts using the tests in Michigan. Surprisingly, in both cases the low SES districts were the smallest in the sample using their respective tests.

Development of Growth Curves Based on Local District Data

Growth curves were developed using local data for the six school districts. Local 10th, 30th and 50th percentile rank scores were identified for both reading and mathematics for each of the three years and for a composite of the three years' combined data.

Growth curves based on local data for the six school districts were compared with those for the national norms of the tests used by these districts. The question of whether to use the scores of all students tested in a district or only the scores of students who did not receive Chapter 1 services in developing local norms was considered. On the one hand inclusion of Chapter 1 students' scores in the local norms would tend to adjust out any Chapter 1 treatment effect. On the other hand omission of Chapter 1 students' scores would bias the distribution because nearly all the omitted scores would be taken from the low end of the distribution. The resulting local norms would be based on distributions with few or no values in the range where Chapter 1 students would be expected to score. National norms for all tests currently used to evaluate Chapter 1 programs in Michigan are based on populations that include Chapter 1 students.
It was decided to base the local norms on scores for all students tested in a district. This means that Chapter 1 students were included as they are in the national norms. The proportion of Chapter 1 students in the local norming population ranged from 7.25% to 34.71% with the median being 13.74%. The proportion in the national populations could not be determined though it is usually assumed to be comparable to the proportion of Chapter 1 students in the school population nationwide.

The Norm-referenced Model

Determination of No-treatment Gains

The local percentile ranks corresponding to the national 10th, 30th and 50th percentile ranks were identified and the expected posttest scores for each if the local percentile rank was maintained were determined. From this it was calculated what the apparent no-treatment gains would be under the model if growth occurred in accordance with local rather than national norms. With data for five grade levels gains for four spring-to-spring intervals or grades could be calculated.

At this point the data could be grouped into a four-way design (2 tests x 3 district types x 3 percentile ranks x 4 grades). To determine whether the number of categories could be reduced four-way ANOVA's using the spring-to-spring gain scores as the dependent variables and the pooled grade by percentile rank interactions as the error terms were run separately for reading (Table 3) and for mathematics (Table 4). (SPSSx subprogram MANOVA was used for the ANOVA runs.) A criterion of $\alpha = .05$ was used to determine whether there were significant effects for a variable. On the basis of the results of the ANOVA runs, as shown in Tables 3 and 4, it was determined that there were no significant main effects or interactions involving percentile rank. The data for the three percentile ranks, therefore, were combined into a single category with a mean of 37.3 NCEs or at the 27th percentile rank. The resulting 2x3x4 design was used in the remaining mathematics analyses.

Table 3. Reading 4-Way ANOVA Summary

Source of Variation                              SS      df      MS        F
Constant                                       20.06      1    20.06     1.77
Test                                           83.59      1    83.59     7.37*
District Type                                  12.48      2     6.24      .55
Grade                                         192.33      3    64.11     5.65**
Percentile Rank                                  .06      2      .03      .00
Test by District Type                          20.23      2    10.12      .89
Test by Grade                                 139.33      3    46.44     4.09*
Test by Percentile Rank                        15.72      2     7.86      .69
District Type by Grade                         70.07      6    11.68     1.03
District Type by Percentile Rank               15.46      4     3.87      .34
Test by District Type by Grade                125.02      6    20.84     2.13
Test by District Type by Percentile Rank       39.08      4     9.77      .86
Residual                                      408.50     36    11.35
Total                                        1141.93     72

*  Significant if $\alpha = .05$.
** Significant if $\alpha = .01$.

Table 4. Mathematics 4-Way ANOVA Summary

Source of Variation                              SS      df      MS        F
Constant                                        3.32      1     3.32      .32
Test                                            1.50      1     1.50      .14
District Type                                 100.09      2    50.05     4.77*
Grade                                         210.65      3    70.22     6.69**
Percentile Rank                                 2.44      2     1.22      .12
Test by District Type                         157.07      2    78.54     7.48**
Test by Grade                                  97.07      3    32.36     3.08*
Test by Percentile Rank                          .16      2      .08      .01
District Type by Grade                        141.08      6    23.51     2.24
District Type by Percentile Rank               94.60      4    23.65     2.25
Test by District Type by Grade                228.52      6    38.09     3.63**
Test by District Type by Percentile Rank       73.96      4    18.49     1.76
Residual                                      377.82     36    10.50
Total                                        1488.28     72

*  Significant if $\alpha = .05$.
** Significant if $\alpha = .01$.

An additional reduction in categories was possible with the reading data. Since there were no significant effects associated with district type in reading the data for the three district types were combined. This resulted in a 2x4 design which was used in the reading analyses.

Probability of Type I Errors

The gains were then tested for significance to determine whether under simulated zero-gain conditions they would result in a Type I error if the model were used to evaluate hypothetical groups of 10, 20 or 30 students.
The test statistic typically used in evaluating the effectiveness of Chapter 1 programs is a one-tailed t-test. For the purpose of comparing local and national norms the formula is:

$$t = \frac{\bar{X} - \mu}{\sigma / \sqrt{n}}$$

where:
$\bar{X}$ = the score that would be observed on the basis of the local norms;
$\mu$ = the expected score on the basis of the national norms;
$\sigma$ = the standard deviation of the national norming population, and
$n$ = the number of students in the group being evaluated.

While the standard deviation for that portion of the norming population with a mean score at the 10th or 30th percentiles is not included in the technical data published for either of the tests used it is possible to calculate what it would be given the standard deviation for the total population and the following assumptions.

1) The scores are normally distributed.
2) Groups were established by setting a cut score and including all students scoring below it in the group.

The following procedure was used. First convert the desired percentile rank to its equivalent z-score. Assume that a = the mean score of the distribution and that b = the cut score that will result in a distribution with mean a. (See Figure 1.)

[Figure 1. Distribution of Hypothetical Group]

To find the variance for a population with mean a it is first necessary to find the cut point b. The mean z-score for the population with scores x < b equals the ratio of the ordinate of the curve at x = b to the area under the curve for x < b, taken with a negative sign since the group lies below the cut. These values were found by successive iteration first for the 10th and 30th percentile ranks and later for the 27th percentile. Since a population with a mean at the 50th percentile would include the entire distribution its standard deviation was assumed to be known. The variance for the desired distribution is then given by the formula:

$$\operatorname{Var}(X_b) = 1 + ab - a^2$$

Taking the square root gives the standard deviation in z-scores.
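The successive iteration described above can be sketched in Python. This is a hedged illustration rather than the procedure actually run for the study: it assumes a standard normal distribution truncated above at the cut score, uses bisection as the iteration method (the original method is not specified), takes the usual one-sample form of the t statistic, and uses hypothetical function names throughout.

```python
import math
from statistics import NormalDist

NCE_SD = 21.06  # standard deviation of NCE scores, used to convert from z-scores


def ordinate(z):
    """Height of the standard normal curve at z."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def area_below(z):
    """Proportion of the standard normal distribution below z."""
    return 0.5 * math.erfc(-z / math.sqrt(2.0))


def truncated_mean(b):
    """Mean z-score of the group scoring below the cut score b."""
    return -ordinate(b) / area_below(b)


def find_cut(a, lo=-6.0, hi=6.0, tol=1e-10):
    """Bisection for the cut score b whose below-cut group has mean a (a < 0)."""
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if truncated_mean(mid) < a:  # group mean too low, so raise the cut
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def group_sd_nce(percentile_rank):
    """SD in NCE's of a below-cut group whose mean lies at the given percentile."""
    a = NormalDist().inv_cdf(percentile_rank / 100.0)
    b = find_cut(a)
    variance = 1.0 + a * b - a * a  # Var(X_b) = 1 + ab - a^2
    return NCE_SD * math.sqrt(variance)


def t_statistic(observed_mean, expected_mean, sd, n):
    """One-sample t: (observed - expected) / (sd / sqrt(n))."""
    return (observed_mean - expected_mean) / (sd / math.sqrt(n))


if __name__ == "__main__":
    for rank in (10, 27, 30):
        print(rank, round(group_sd_nce(rank), 2))
```

For a group with its mean at the 27th percentile this sketch gives a standard deviation of roughly 13.9 NCE's, close to the 13.92 reported in the text; small differences would be expected from rounding in the original hand iteration.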
Multiplying by 21.06 (the standard deviation for NCE scores) converts the standard deviation to NCE's. The standard deviation for the 27th percentile rank was calculated to be 13.92 NCE's. That is, for the combined groups with a mean at the 27th percentile of 37.3 NCE's the standard deviation would be 13.92 NCE's.

The effect on the Type I error rate of using national norms when the local norms represent the true normal pattern of growth was tested for programs resulting in zero true gains with n's of 10, 20, and 30 and for $\alpha = .05$ and .01.

Probability of Detecting True Gains

The study also examined the effect on the power of the test to detect true gains under each of the above conditions when a program results in gains of 1.0, 4.0 and 7.0 NCE's. In practical terms the effect of differences in power is most apparent in the size of the sample that would be needed to assure detecting real gains if they exist. The sample size that would be needed to detect true gains of 1.0, 4.0 and 7.0 NCEs with $\alpha = .05$ and .01 and with a probability of $p \geq .50$ was calculated. Since the t-value varies with the sample size successive iterations were used where necessary to match the sample size required with the t-value.

Comparison of Longitudinal Data with Expectations Based on Cross-sectional Norms

This study next examined the similarity between longitudinal and cross-sectional data for two of the six districts. Two districts were selected which were able to identify matched non-Chapter 1 scores from the data used earlier in the study. Those selected happened to be the two that used out-of-level testing (COL and MOL). This portion of the study was limited to data from students not in Chapter 1 programs. The districts were asked to provide matched data for the three years, 1983, 1984 and 1985 for a sample of non-Chapter 1 students.
Students in the longitudinal sample belonged to one of three cohorts:

Cohort 1 - students with test scores for Grades 2, 3 and 4;
Cohort 2 - students with test scores for Grades 3, 4 and 5; and
Cohort 3 - students with test scores for Grades 4, 5 and 6.

The matched data were used to compare the observed growth found in longitudinal data with that expected on the basis of the cross-sectional norms. Since the longitudinal study was limited to students with test data available for three years the samples were considerably more limited than the group on which the cross-sectional analyses were based. Elimination of students served by Chapter 1 programs further restricted the group. In particular the numbers of low-achieving and highly mobile students were severely reduced in the longitudinal sample. These restrictions result in serious confounding that limits the interpretation that can be drawn from these data.

Changes in NCE scores (gains) from 1983 to 1984, 1984 to 1985 and 1983 to 1985 were computed and tested (using SPSSx subprogram MANOVA) for significance. The purpose of this analysis was to determine whether the pattern of growth differs when longitudinal rather than cross-sectional data are used. This provided a further test of the validity of the equipercentile assumption.

CHAPTER V

RESULTS OF THE STUDY

The results of this study will be presented in the same three groupings as were used in Chapter 4.

Comparison of Growth Curves for the CAT and MAT

Table 5 lists the scale scores corresponding to the 10th, 30th and 50th percentile ranks for reading and mathematics from Kindergarten through Grade 12, for the CAT and the MAT. The growth curves for the CAT and MAT for reading and mathematics are shown in Figures 2-5. There appear to be obvious differences in the growth curves, particularly for the reading tests. Both tests imply a higher rate of growth in the early grades and a tendency to level off at higher grade levels.
They differ, however, in the proportion of total growth that is assumed to occur at the early grade levels. On the basis of the MAT scale, 53 percent of total growth in reading occurs by the end of Grade 2. Using the CAT norms and scale, 53 percent of total reading growth does not occur until the end of Grade 4. In mathematics a similar though less pronounced difference exists. The MAT scores indicate that 36 percent of total growth occurs by the end of Grade 2, while on the basis of the CAT scores it would appear that 36 percent of total growth is not attained [...]

Table 5. CAT and MAT Scale Scores, Reading and Mathematics

                          CAT                          MAT
Subj.     Gr.   10%-ile  30%-ile  50%-ile   10%-ile  30%-ile  50%-ile
Reading   K       185      214      235       271      326      363
          1       245      280      303       412      466      506
          2       290      332      360       524      576      620
          3       324      370      401       566      621      661
          4       357      406      443       602      655      695
          5       385      438      473       628      682      718
          6       405      460      500       632      693      737
          7       424      484      521       642      709      754
          8       447      509      553       645      725      775
          9       463      528      574       674      746      791
          10      481      548      596       704      773      815
          11      497      573      619       724      787      827
          12      505      582      633       746      806      846
Math      K       246      259      271       239      289      329
          1       278      299      311       327      378      418
          2       309      336      352       403      468      507
          3       343      374      394       454      521      567
          4       372      406      428       501      578      622
          5       395      436      463       546      624      664
          6       418      462      491       564      646      696
          7       425      482      517       597      681      731
          8       458      514      555       615      700      759
          9       469      533      579       637      717      772
          10      485      551      601       663      738      789
          11      499      566      612       682      757      800

Figure 2. CAT Reading Growth Curves (scale scores by grade level)
Analysis of the Growth Curves for the CAT and MAT

Table 7 shows the results of the comparison of the growth curves for the two tests. The values for rxy range from .9653 to .9917. Those for the second coefficient (symbol illegible in the source) vary from .9768 to .9941. These high values can be attributed to several factors. First, nearly all of the variation in the scores for any single curve is the result of yearly gains in achievement. Second, there is only one point on each curve at each grade level. And finally, the scores comprising each curve were obtained from national norms tables based on scores for several thousand students.

Table 7. Differences Between CAT and MAT Growth Curves

Subject       %-ile   rxy     [...]   chi-sq.   p          d       s(d)    t(12)   p
Reading        50     .9711   .9806   8.3391    <.0001**   30.54   21.08   1.[...] [...]
               30     .9720   .9813   9.0148    <.0001**   48.31   23.05   2.10    .058
               10     .9653   .9768   8.1262    <.0001**   74.15   26.34   2.82    .016*
Mathematics    50     .9831   .9884   6.4193    <.0001**   26.15    8.67   3.02    .011*
               30     .9868   .9910   7.2590    <.0001**   30.00    7.62   3.[...] [...]
               10     .9917   .9941   5.0935    <.0001**   41.62    9.06   4.59    .001**

For an overall level of α = .05 the levels for the three percentile ranks must be individually set at α = .017. With regard to the configurations, the probabilities of the growth curves being the same for the two tests at the three percentile ranks in reading and in math are all less than .001. The growth curves for the CAT and MAT were found to be significantly different in shape. The test that the levels of the growth curves for reading and math for the two tests are the same showed significant differences at the 10th percentile in reading and at all levels in math (p<.017). These results indicate that the hypothesis that the growth curves for the two tests are the same can be rejected.

Comparison of National and Local Norms

Table 8 shows the comparison of scale scores for the growth curves for the six local districts from Grade 2 through Grade 6 in reading and mathematics with the national norms.
The most readily apparent difference between the local and national norms is that a number of the local growth curves are much higher than those based on the national norms. This is made clearer in Table 9, which shows the national percentile equivalents of the local percentile ranks.

In reading the scores for all three districts using the CAT are above the national norms at all percentile ranks. COL 10th percentile scores nearly match those for the national CAT 30th percentile while CLS scores exceed them. Both COL and CLS 30th percentile students achieved higher scores than the 50th percentile nationally. In several other cases the local 30th percentile scores approach those for the national 50th percentile. Figure 8 shows the 30th percentile reading growth curves for the districts using the CAT.

Table 8. National and Local Growth Curves

Reading
%-ile                      Scale Score Equivalents
Rank*   Gr.   CAT   COL   CIM   CLS   MAT   MOL   MIM   MLS
10      2     290   332   322   345   524   545   561   510
        3     324   365   351   376   566   582   607   546
        4     357   404   389   416   602   631   637   616
        5     385   434   427   441   628   658   660   627
        6     405   458   431   464   632   681   686   675
30      2     332   363   351   370   576   596   608   571
        3     370   399   393   404   621   629   647   630
        4     406   440   426   446   655   655   683   653
        5     438   476   460   476   682   697   693   688
        6     460   506   471   506   693   721   725   717
50      2     360   385   372   392   620   624   644   625
        3     401   423   412   432   661   645   670   664
        4     443   470   449   466   695   691   719   685
        5     473   505   481   496   718   723   718   703
        6     500   540   500   532   737   751   769   747

Mathematics
10      2     309   326   338   337   403   464   464   425
        3     343   361   362   372   454   498   526   484
        4     372   394   389   410   501   553   578   511
        5     395   429   422   438   546   579   619   591
        6     418   452   442   468   564   604   637   639
30      2     336   347   356   354   468   501   507   478
        3     374   386   387   396   521   541   575   531
        4     406   423   414   429   578   588   626   569
        5     436   458   452   459   624   628   658   632
        6     462   484   473   500   646   676   704   678
50      2     352   355   373   365   507   525   541   507
        3     394   403   406   407   567   571   608   569
        4     428   441   432   443   622   618   666   617
        5     463   477   474   478   664   660   688   656
        6     491   509   495   524   696   721   741   722

*National percentile ranks for CAT and MAT; local percentile ranks for all others.

Table 9. Comparison of National and Local Percentile Ranks

Reading
Local %-ile          National Percentile Rank Equivalents
Rank    Gr.   COL   CIM   CLS   MOL   MIM   MLS
10      2     30    22    38    18    24     6
        3     26    20    33    15    24     5
        4     28    21    34    20    22    14
        5     28    25    32    19    20    10
        6     29    17    32    25    27    23
30      2     51    42    56    38    44    28
        3     49    44    52    34    43    34
        4     48    41    52    30    44    29
        5     52    42    52    38    36    33
        6     53    35    53    42    44    40
50      2     67    58    71    52    62    52
        3     65    58    70    42    56    52
        4     67    54    64    49    62    45
        5     70    56    65    53    50    42
        6     70    50    66    57    66    55

Mathematics
10      2     20    32    30    28    28    15
        3     20    21    29    23    32    18
        4     22    19    34    22    30    12
        5     25    21    31    16    28    19
        6     25    20    34    18    27    28
30      2     42    56    51    47    50    35
        3     41    42    51    38    54    34
        4     44    37    49    34    52    27
        5     45    41    46    32    47    34
        6     44    37    55    42    53    43
50      2     55    73    64    60    68    50
        3     58    61    62    52    68    51
        4     59    52    61    48    70    48
        5     60    58    61    48    60    46
        6     61    52    71    60    68    60

The reading scores for two (MOL and MIM) of the three districts using the MAT are above the national norms for the 10th and 30th percentile ranks. MIM scores are above the national norms for the 50th percentile as well. Grade 6 scores for all three districts are higher than the national norms for all percentile ranks. Figure 9 shows the 30th percentile reading growth curves for the districts using the MAT.

In mathematics the scores for all three districts using the CAT are also above the national norms at all percentile ranks. CLS 10th percentile scores exceed those for the national CAT 30th percentile and CLS 30th percentile scores exceed those for the national 50th percentile. Figure 10 shows the 30th percentile mathematics growth curves for the districts using the CAT.

Mathematics scores for all three districts using the MAT are above the national norms at the 10th and 30th percentiles. For MIM they also exceed the national norms for the 50th percentile rank.
MIM 10th percentile scores in mathematics are close to the national MAT 30th percentile norms and MIM 30th percentile scores are close to the national 50th percentile norms. Figure 11 shows the 30th percentile mathematics growth curves for the districts using the MAT.

[Figures: Expected Zero Gain Scores in Math, by district type]

[...] apparent gain of 5.58 NCE's. Given this difference and assuming that the measured mean gains would have a t(29) distribution, the probability of a measured gain high enough to result in a Type I error where n = 30 and α = .05 is p > .60.
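
The probability quoted above can be checked roughly with a short calculation. What follows is a minimal sketch under stated assumptions, not the procedure used in the study: it assumes a one-tailed test, substitutes a normal approximation for the t(29) distribution, and uses the standard deviation of 13.92 NCE's given earlier. The function name prob_significant is hypothetical.

```python
from math import sqrt
from statistics import NormalDist

SD_NCE = 13.92  # standard deviation reported in the study for the combined groups


def prob_significant(expected_gain, n, alpha=0.05, sd=SD_NCE):
    """Approximate probability that the mean gain tests as significant.

    Assumptions (not taken from the study): one-tailed test and a normal
    approximation to the t distribution.
    """
    z = NormalDist()
    standard_error = sd / sqrt(n)
    critical = z.inv_cdf(1 - alpha)        # one-tailed critical value
    shift = expected_gain / standard_error  # expected gain in standard-error units
    return 1 - z.cdf(critical - shift)


if __name__ == "__main__":
    # Expected apparent gain of 5.58 NCE's with n = 30 and alpha = .05
    print(round(prob_significant(5.58, 30), 2))
```

With an expected apparent gain of 5.58 NCE's and n = 30 the sketch gives a probability of roughly .7, in line with the p > .60 stated above. When the expected gain is zero the function returns α itself, which is the nominal Type I error rate.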
For MLS Grade 6 mathematics programs the result of the differences between the local and national norms is an expected apparent gain of 8.94 NCE's. Given this difference the probability of a measured gain high enough to result in a Type I error given n = 30 and α = .05 is greater than .90.

Effects on Power to Detect True Gains

It can also be seen that significant differences from zero will affect the power of the model to detect true gains when they are present. A true gain of 1 NCE will rarely be detected (p<.15) with groups of 30 or fewer students. However, if the pattern of local growth is such that apparent positive gains are expected without true additional gains then the likelihood of detecting a true 1 NCE gain as a positive program effect is greatly increased. Table 12 shows the expected observed gains and the probabilities of detecting true gains of 1, 4 and 7 NCE's with n = 30 and α levels of .05 and .01 for the local data in Table 11. Only in those cases where Type I errors are likely is there a high probability of detecting a true 1 NCE gain. The probability of detecting a 1 NCE gain in Grade 6 reading with the MAT is .8 if α = .05 and .6 for α = .01. A 1 NCE gain made by MLS students in Grade 6 would be detected with p > .95 when α = .05 and with p = .9 for α = .01.

Table 12. Effect on Power of Local Growth Patterns

                      +1 NCE   p, α =          +4 NCE   p, α =          +7 NCE   p, α =
Tst/         0 NCE
Dst    Gr    Gain     Gain     .05    .01      Gain     .05    .01      Gain     .05    .01

Reading
CAT    3     -2.28    -1.28    <.05   <.01      1.72    .2     <.05      4.72    .6     .3
       4       .49     1.49    .1     <.05      4.49    .5     .2        7.49    .9     .7
       5       .82     1.82    .2     <.05      4.82    .6     .3        7.82    .9     .7
       6     -1.23     -.23    <.05   <.01      2.77    .3     .1        5.77    .7     .4
MAT    3     -2.03    -1.03    <.05   <.01      1.97    .2     .1        4.97    .6     .3
       4      2.25     3.25    .3     .1        6.25    .8     .5        9.25    >.95   .9
       5       .62     1.62    .2     <.05      4.62    .5     .3        7.62    .9     .7
       6      5.58     6.58    .8     .6        9.58    >.95   .9       12.58    >.99   >.99

Mathematics
COL    3      1.30     2.30    .2     .1        5.30    .6     .4        8.30    .9     .8
       4       .00     1.00    .1     <.05      4.00    .5     .2        7.00    .8     .6
       5      1.83     2.83    .3     .1        5.83    .7     .4        8.83    >.95   .8
       6      -.20      .80    .1     <.05      3.80    .4     .2        6.80    .8     .6
MOL    3     -4.67    -3.67    <.01   <.01      -.67    <.05   <.01      2.33    .2     .1
       4     -9.23    -8.23    <.01   <.01     -5.23    <.01   <.01     -2.23    <.01   <.01
       5      -.37      .63    .1     <.05      3.63    .4     .2        6.63    .8     .6
       6      1.78     2.78    .3     .1        5.78    .7     .4        8.78    >.95   .8
CIM    3     -5.33    -4.33    <.01   <.01     -1.33    <.05   <.01      1.67    .2     <.05
       4     -3.67    -2.67    <.01   <.01       .33    .1     <.05      3.33    .4     .1
       5      2.90     3.90    .4     .2        6.90    .8     .6        9.90    >.95   .9
       6     -3.09    -2.09    <.01   <.01       .91    .1     <.05      3.91    .4     [...]
MIM    3      2.54     3.54    .4     .2        6.54    .8     .5        9.54    >.95   .9
       4       .60     1.60    .2     <.05      4.60    .5     .3        7.60    .9     .7
       5     -1.63     -.63    <.05   <.01      2.37    .2     .1        5.37    .7     .4
       6       .53     1.53    .1     <.05      4.53    .5     .3        7.53    .9     .7
CLS    3      1.59     2.59    .3     .1        5.59    .7     .4        8.59    .9     .8
       4      -.44      .56    .1     <.05      3.56    .4     .2        6.56    .8     .5
       5     -2.56    -1.56    <.05   <.01      1.44    .1     <.05      4.44    .5     .2
       6      3.36     4.36    .5     .2        7.36    .9     .7       10.36    >.95   .9
MLS    3      1.00     2.00    .2     .1        5.00    .6     .3        8.00    .9     .8
       4     -3.81    -2.81    <.01   <.01       .19    .1     <.05      3.19    .3     .1
       5      3.47     4.47    .5     .2        7.47    .9     .7       10.47    >.95   .9
       6      8.94     9.94    >.95   .9       12.94    >.99   >.99     15.94    >.99   >.99

Even the likelihood of detecting a true 4 NCE gain is small with a group of 10 to 20 students.
However, a Chapter 1 program that produces a 4 NCE gain on top of an expected 4 NCE gain from the regular program has a high probability of being detected. With an n of 30 the probability of detecting a true 4 NCE gain is .5 if α = .05 and .2 when α = .01. In those cases where a Type I error is likely, a true 4 NCE gain will be detected most of the time. The probability of detecting a 4 NCE gain with the MAT in Grade 6 reading is > .95 if α = .05 and .9 for α = .01. A true 4 NCE gain in mathematics by Grade 6 MLS students will be detected with p > .99 with an α level of either .05 or .01. Where the expected no-treatment gain is negative the probability of detecting a true 4 NCE gain is small. In mathematics for MOL and CIM Grade 3 students the probability is < .05 when α = .05 and < .01 if α = .01. For MOL Grade 4 mathematics the probability is < .01 for an α level of either .05 or .01.

A 7 NCE gain, in contrast, has a high probability of being detected for a group of 20 or 30 students and is likely to be identified even for as few as 10 students provided the expected rate of growth for the group is at least as great as that for the national norm group. Where the expected no-treatment gain is negative even a true 7 NCE gain in Chapter 1 students' achievement is unlikely to be detected. For MOL Grade 4 mathematics students the probability of detecting such a gain remains < .01 for either α = .05 or α = .01. For MOL and CIM Grade 3 mathematics a true 7 NCE gain would be detected only with p = .2 if α = .05 and with p = .1 and p < .05 respectively if α = .01.

Effects on Size of Sample Needed to Detect True Gains

Table 13 shows the effect of local growth patterns on the size of the sample needed to detect true gains of 1.0, 4.0 and 7.0 NCE's with α = .05 and .01. If the expected no-treatment gain plus the true gain results in an expected observed gain that is negative the positive program gains will not be detected no matter how large the sample involved.
Where the expected observed gain in NCE's is positive but less than 1.0 the size of the sample needed is very large. However, where the expected observed gain is large (7.0 NCE's or more) the sample needed may be less than 10.

Table 13. Effect on Sample Size of Local Growth Patterns

                      +1 NCE   n, α =           +4 NCE   n, α =           +7 NCE   n, α =
Tst/         0 NCE
Dst    Gr    Gain     Gain     .05     .01      Gain     .05     .01      Gain     .05    .01

Reading
CAT    3     -2.28    -1.28    ---     ---       1.72     178     356      4.72     25     50
       4       .49     1.49     238     474      4.49      28      55      7.49     11     22
       5       .82     1.82     159     318      4.82      24      48      7.82     11     20
       6     -1.23     -.23    ---     ---       2.77      70     139      5.77     18     35
MAT    3     -2.03    -1.03    ---     ---       1.97     138     271      4.97     23     46
       4      2.25     3.25      51     103      6.25      15      30      9.25      8     16
       5       .62     1.62     201     401      4.62      27      52      7.62     11     21
       6      5.58     6.58      14      28      9.58       8      15     12.58      6     10

Mathematics
COL    3      1.30     2.30     101     199      5.30      21      40      8.30     10     19
       4       .00     1.00     528    1052      4.00      35      69      7.00     13     25
       5      1.83     2.83      67     134      5.83      17      34      8.83      9     17
       6      -.20      .80     824    1644      3.80      38      76      6.80     13     26
MOL    3     -4.67    -3.67    ---     ---       -.67    ---     ---       2.33     98    194
       4     -9.23    -8.23    ---     ---      -5.23    ---     ---      -2.23    ---    ---
       5      -.37      .63    1329    2650      3.63      42      83      6.63     14     27
       6      1.78     2.78      70     138      5.78      18      35      8.78      9     17
CIM    3     -5.33    -4.33    ---     ---      -1.33    ---     ---       1.67    189    377
       4     -3.67    -2.67    ---     ---        .33    4844    9660      3.33     49     98
       5      2.90     3.90      36      72      6.90      13      25      9.90      7     14
       6     -3.09    -2.09    ---     ---        .91     637    1270      3.91     36     72
MIM    3      2.54     3.54      44      88      6.54      14      28      9.54      8     15
       4       .60     1.60     206     411      4.60      27      53      7.60     11     21
       5     -1.63     -.63    ---     ---       2.37      95     187      5.37     20     40
       6       .53     1.53     225     449      4.53      28      54      7.53     11     22
CLS    3      1.59     2.59      81     158      5.59      19      37      8.59      9     17
       4      -.44      .56    1682    3354      3.56      43      87      6.56     14     28
       5     -2.56    -1.56    ---     ---       1.44     254     507      4.44     28     57
       6      3.36     4.36      29      58      7.36      12      23     10.36      7     13
MLS    3      1.00     2.00     133     263      5.00      23      45      8.00     10     20
       4     -3.81    -2.81    ---     ---        .19   14613   29140      3.19     53    107
       5      3.47     4.47      28      56      7.47      11      22     10.47      7     13
       6      8.94     9.94       7      14     12.94       5      10     15.94      4      8

Comparison of Longitudinal Data with Cross-sectional Norms

Table 14 shows the mean NCE scores for each of the three cohorts (Cohort 1 - students tested in Grades 2, 3 and 4; Cohort 2 - students tested in Grades 3, 4 and 5; and Cohort 3 - students tested in Grades 4, 5 and 6) for the two districts over the three-year period for which data were collected.

Table 14. Mean Scores, Longitudinal Data

                                                           Gains
                 Cohort   N     1983    1984    1985    1983-84  1984-85  1983-85

COL - Reading    1        72    51.59   54.78   54.35    3.19     -.43     2.76
                 2        201   57.00   57.55   58.80     .55     1.25     1.80
                 3        191   56.38   57.75   59.04    1.37     1.29     2.66
                 Total    464   55.91   57.20   58.19    1.29     1.01     2.30**

COL - Mathematics
                 1        72    55.48   [...]
                 2        201   60.55   [...]
                 3        191   60.17   [...]
                 Total    464   59.61   [...]

MOL - Reading    1        29    [...]
                 2        28    [...]
                 3        33    [...]
                 Total    90    [...]

MOL - Mathematics
                 1        29    [...]
                 2        28    [...]
                 3        33    [...]
                 Total    90    [...]

*  Significant if α = .05
** Significant if α = .01

For COL there were small positive gains (1.3 and 1.0 NCE's) in reading for each of the two years. Individually these gains were not significant but they were both positive and when taken together resulted in a significant overall two-year gain. In mathematics significant gains of approximately 4 NCE's were found for all cohorts the first year and for Cohort 3 the second year as well. The mathematics gains were all positive, resulting in significant overall two-year gains for all cohorts and the total sample.

For MOL the results were mixed. In reading Cohort 1 showed a significant loss the first year followed by a significant gain the second year. The resulting two-year loss was not significant. This may be a reflection of the year-to-year variation that was apparent in the yearly data, particularly for the smaller districts. (See Table 11 and Appendix B.) MOL Cohort 3 showed nonsignificant positive gains for the two years individually that added up to a significant total gain. In mathematics the results showed no significant overall change.
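
The sample sizes in Table 13 can be approximated along the following lines. This is a minimal sketch under stated assumptions rather than the study's own computation: it assumes a one-tailed test, takes the 13.92 NCE standard deviation given earlier, and replaces printed t tables with a series approximation to the t critical value. At a detection probability of .50 the expected observed gain must equal the critical value times the standard error, so n must be at least (t x SD / gain) squared; because t depends on n, the code steps n upward until the two agree. The names t_critical and sample_size_needed are hypothetical.

```python
from math import ceil
from statistics import NormalDist

SD_NCE = 13.92  # standard deviation reported in the study for the combined groups


def t_critical(alpha, df):
    """Approximate one-tailed t critical value.

    Uses the first terms of the standard series expansion of the t quantile
    around the normal quantile (an approximation, weakest for very small df).
    """
    z = NormalDist().inv_cdf(1 - alpha)
    g1 = (z**3 + z) / 4
    g2 = (5 * z**5 + 16 * z**3 + 3 * z) / 96
    return z + g1 / df + g2 / df**2


def sample_size_needed(expected_observed_gain, alpha=0.05, sd=SD_NCE):
    """Smallest n at which the gain is detected with probability of about .50.

    Returns None when the expected observed gain is zero or negative,
    since no sample size would then detect a positive program effect.
    """
    if expected_observed_gain <= 0:
        return None
    z = NormalDist().inv_cdf(1 - alpha)
    # The normal-theory answer is a lower bound because t exceeds z.
    n = max(2, ceil((z * sd / expected_observed_gain) ** 2))
    # Step upward until n is consistent with the t value for n - 1 df.
    while n < (t_critical(alpha, n - 1) * sd / expected_observed_gain) ** 2:
        n += 1
    return n


if __name__ == "__main__":
    for gain in (1.49, 4.49, 7.49):
        print(gain, sample_size_needed(gain), sample_size_needed(gain, alpha=0.01))
```

For an expected observed gain of 1.49 NCE's at α = .05 the sketch returns a value close to the 238 shown in Table 13. Results for small samples can differ slightly from the tabled values because the t approximation and rounding conventions are assumptions here.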
CHAPTER VI

DISCUSSION

This study set out to answer certain questions related to the validity of the norm-referenced model and its usefulness in evaluating local programs. In this chapter the results of the study and their implications relative to these questions will be discussed.

An examination of the norm-referenced model and its assumptions found that they do yield a method for depicting graphically the patterns of growth implied by the model and a test's norms. The model is based on the equipercentile assumption that without intervention students' percentile rank standings would remain constant from year to year. It further assumes that percentile ranks can be converted to NCE's and to a test's scale scores. When given percentile ranks were converted into corresponding scale scores at each grade level it was possible to depict graphically expected growth curves for the two tests examined in this study, the CAT and the MAT. The resulting growth curves for both tests in reading and mathematics are curvilinear with the most rapid growth in the lower grades and decreasing rates of growth in the higher grades. This indicates that the skills measured by these tests are learned primarily in the lower grades.

Comparison of Growth Curves for the CAT and MAT

Comparison of the graphs indicates the growth curves for the CAT and MAT are not the same. They differ significantly in configuration for both reading and mathematics. Significant differences in level found for mathematics at all percentile ranks and for reading at the 10th percentile indicate there is greater spread from the 50th to the 10th percentile for the MAT than for the CAT. Differences in both the rate and proportion of total growth expected at certain grade levels imply that either the scales are not both equal interval or the tests are measuring different skills. Either makes the practices of aggregating data and comparing programs across tests invalid.
In addition, if the scale scores for a test are not equal interval the procedure for converting out-of-level test scores to in-level NCE's is also invalid for that test. It is beyond the scope of this study to determine whether the scale scores for these two tests are actually equal interval. Possibly the application of Item Response Theory methods to data from these tests would provide the answer.

Robustness of the Norm-referenced Model with Respect to Variations Found in Local Patterns of Growth

When the growth curves implied by the norm-referenced model were examined using local data the most apparent difference found between the local and national growth curves was in the levels of the corresponding percentile ranks. The local growth curves in most cases appear to match the national curves but for a higher percentile rank. To the extent that they match the national curves in shape and are only displaced vertically the estimates of expected posttest scores and hence the gains will still be unbiased when national norms are used. Since the analysis of expected no-treatment gains found no significant differences due to percentile rank the hypothesis that the percentile ranks are parallel for the grade levels examined cannot be rejected. This indicates the norm-referenced model is robust with respect to differences in percentile rank.

No systematic bias was found for any of the district types. The variables identified here do not appear to have produced common effects across tests on the achievement of students in the selected districts. Use of out-of-level testing with low achieving students cannot be rejected as inappropriate for use with this model. The effects of an improving total school program are also not a sufficient basis for rejecting use of national norms in evaluating student gains. Even a high proportion of low SES students does not result in differing patterns of growth at all grade levels.
This indicates the model is robust with respect to the variables identified.

Where significant differences were found they were specific to particular grade levels. The test by grade level interactions found in the local data suggest that the differences between the tests are specific to particular test levels. In reading the local norms for the MAT are significantly different from the national norms at Grade 6 only. The local norms are higher than the MAT national norms, which underestimate student achievement.

The finding of significant bias in reading for the MAT at Grade 6 indicates that for these Michigan districts use of the MAT Grade 6 norms does not provide a valid expectation of no-treatment reading achievement. Specifically the MAT norms underestimate the reading achievement of their Grade 6 students by more than 5 NCE's. Use of the norm-referenced model in this situation would overestimate any program effect.

In mathematics the significant differences found were specific to given districts and to one or two grades in those districts. MLS Grade 6 local norms were higher than the MAT national norms. Local norms for CIM Grade 3 and MOL Grades 3 and 4 were lower than the national norms for their respective tests. These differences would result in apparent gains of 9 NCE's for MLS at Grade 6, of -5 NCE's for CIM and MOL at Grade 3 and of -9 NCE's for MOL at Grade 4 when the norm-referenced model with national norms is used to evaluate student achievement that by local norms would show a zero NCE gain. The district by grade level interaction suggests that any significant differences are the result of local district variables. It can be hypothesized that there are local conditions or practices not identified in this study that do affect the validity of the model when used for local evaluation. One possibility may be differences in the degree to which the local curriculum matches the test content at each grade level.
All of the districts in this study that used the CAT are larger than any of the districts using the MAT. Therefore, any effects of district size are confounded with the effects of the test used. It is impossible to tell from the data obtained what part, if any, size played in these findings.

The only instances where Type I errors are significantly more likely to occur using the model with national norms are those where the no-treatment expectation yields positive gain scores. From the data reported here this occurs in Grade 6 reading evaluated with the MAT and in Grade 6 math in district MLS.

Any bias in the no-treatment expectation will affect the power of the model to detect true gains that do occur. Local districts with 30 or fewer Chapter 1 students at a grade level were found unlikely to detect small gains of 1 to 4 NCE's. These gains are of the magnitude usually found in statewide and national aggregations of achievement data. Gains of 7 NCE's, which would generally be considered to be educationally significant, would be detected even with small groups. An exception would be in instances of significant negative no-treatment expected gains, as in the cases of Grade 3 math in districts MOL and CIM and Grade 4 math in district MOL. In these cases even gains as large as 7 NCE's are unlikely to be detected.

Stability of Local Growth Patterns from Year to Year

Districts that used the CAT showed less year-to-year variation and, therefore, more stability in their growth patterns than did those that used the MAT. Stability of the yearly norms is one effect that might be hypothesized to be related to district size. In this study it is impossible to separate the effects of district size on stability from the effects of the test used.
If the difference in stability of the yearly curves for the two tests is the result of differences in size the combining of three years' data should provide composite growth curves for the MAT districts comparable in reliability to the yearly curves of the smaller CAT districts. If a year-to-year variation equivalent to less than 7 NCE's is considered acceptable then the data for the CAT districts are within acceptable limits.

Validity of Cross-sectional Norms for Evaluating Longitudinal Patterns of Growth

The examination of the longitudinal data provides mixed signals regarding the use of cross-sectional norms to evaluate programs where observed achievement is measured longitudinally. The MOL mathematics data support the model. So do the MOL reading data overall, though they indicate that, for local projects at least, allowance must be made for cohort differences. The COL reading data support use of the model for a single year's evaluation but indicate it is invalid for evaluations over periods of two or more years. On the basis of the COL longitudinal math data, however, the model would have to be rejected.

The positive gains found in the COL math data represent gains significantly above what would have been expected on the basis of either the national or the local district norms. Though Chapter 1 students were excluded from the longitudinal sample it is possible that local district programs or staff provided extra help to students in non-Chapter 1 buildings. In any case it is clear that data from the longitudinal sample do not necessarily reflect either national or district norms.

The students in the longitudinal sample were limited to those tested annually in the same school district over a three-year period. It may be that these students in districts such as COL improve relative to the total district population over time. It may even be that the entire population is gradually improving over time.
This study cannot determine whether either of these possibilities holds. The higher local norms found in several of the districts suggest that for some districts the entire population may be improving.

Implications of These Findings

In summary these outcomes indicate that the norm-referenced model is robust with respect to percentile rank. It is also robust with respect to the district variables identified in this study in reading. Due to the small sample size with only one district per cell it is impossible from the data collected to separate the test by district type interaction found in mathematics from the effects of other district variables. The hypothesis that the differences found in mathematics are due to district variables other than those identified in this study cannot be rejected.

The growth curves developed from the national norms for the two tests indicate that there are significant differences between them. They imply that either the tests are measuring different skills at certain grade levels or the scale for at least one of the tests is not equal interval. Overall it appears that the MAT is more likely to detect gains at Grade 6 than is the CAT. In mathematics it was impossible to determine whether the significant differences found are the result of differences between the national norms at certain levels of the two tests or of local variables, such as differences in the curriculum, which affected achievement relative to the national norms in an individual district. In any case further research on the impact of the differences between tests on data aggregated across tests at the state and national level is needed. National norms may be appropriate for evaluating nationally aggregated data for a given test but not be a valid basis for aggregating data across tests.

The effect of district size also needs to be studied further.
In this study district size and test used were confounded so that it was not possible to separate their effects in the data. The relationship between number of students and stability of local norms is especially important. Although possible effects of test used cannot be completely discounted it appears from the data that districts with fewer than one hundred students per grade would be well advised to pool more than one year's data if they want to use local norms.

The longitudinal data raise serious questions about the validity of using cross-sectional norms with longitudinal data. The sample used differed from the district population in two important ways. Students were included in the longitudinal sample only if they 1) were tested in the same district on three separate dates over a span of two years and 2) had not received Chapter 1 services. Scores used in the cross-sectional norms include both Chapter 1 and non-Chapter 1 students who were tested on a single test date. It might be expected that merely limiting the sample to students who remain in the same school district for one or two years would result in gains relative to a national norming population that includes students who move more frequently.

Areas Needing Additional Research

This study examined only two of the dozen or so standardized achievement tests in wide use by school districts. Other standardized tests need to be examined to determine what differences exist in the patterns of growth which the norm-referenced model implies for them. If gain scores are to be aggregated across tests, there is a need to determine whether there are differences in content and scale which will bias the results. The question of the validity of aggregating achievement data across tests needs to be further investigated.

Data from a sample of only six districts and longitudinal data from only two districts were analyzed in this study.
Still, it identified an important number of instances in which the model does not work. Until the variables which are the source of the observed bias are identified, the effort needed to determine whether or not the model will work in the case of a given local program is more than would be involved in implementing a better method of evaluation in the first place.

District variables other than the ones identified here which may affect the validity of the model in evaluating mathematics programs need to be identified. Local curriculum variables and the match between the test used and the curriculum taught probably are more crucial in mathematics than in reading. Reading skills are more independent of the curriculum and are not all learned in school.

The effect of district size needs to be more thoroughly studied. What effect does it have on the impact of local variables such as those identified in this study? In particular, to what extent are the differences observed between the tests in this study a result of differences in the size of the districts using each?

The question of the validity of cross-sectional norms for measuring longitudinal data is particularly important. The results may reflect differences in achievement of students who remain in the same district for at least two years as opposed to the achievement of all students present for a single test administration. If this is the case they indicate an important difference between cross-sectional and longitudinal data as measures of expected achievement. Another possible hypothesis is that they reflect differences in the achievement of non-Chapter 1 students, particularly those attending schools not eligible for Chapter 1, as opposed to the average achievement of all students in a district. If this is the case they would not affect the validity of using cross-sectional norms to evaluate the achievement of Chapter 1 students.
It is also possible that true local effects on student achievement may become apparent only when longitudinal data collected over a number of years are analyzed.

CHAPTER VII

SUMMARY AND CONCLUSIONS

Although there have been numerous studies of the norm-referenced model, there is still disagreement about its validity. Because of its wide usage, there has remained a need for additional research to provide a clearer understanding of the model. This study examined the validity of the model in the following ways.

1. It approached the norm-referenced model as implying a model of growth in measured achievement over successive school years.

2. It compared patterns of growth implied by the norm-referenced model for two widely used standardized tests, the CAT and the MAT.

3. It examined patterns of growth in six school districts selected as exemplifying factors thought likely to affect the validity of the norm-referenced model. Local patterns of growth were compared with those implied by the national norms for the test used. District growth patterns were analyzed and used to estimate gains that would be observed under no-treatment conditions.

4. It evaluated the statistical effects of using local rather than national norms in testing the effectiveness of local programs. Likelihood of Type I errors and effects on power to detect true gains were analyzed.

5. It compared the results obtained from longitudinal data with those expected on the basis of cross-sectional norms.

Conclusions

The patterns of growth implied by the norm-referenced model and the norms of the standardized tests studied are curvilinear, with the highest rate of growth in the early grades and a decreasing rate in the higher grades. There are significant differences in the growth curves implied by the norms of the CAT and MAT. These affect the results obtained with the model differently at different grade levels. The model is robust with respect to differences in percentile rank.
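The decreasing rate of growth can be illustrated with the national-norm rows for reading at the lowest percentile level listed in Tables 15 and 16 (Appendix B). The following is a minimal sketch, not part of the study's analysis; the function and variable names are illustrative. Because the CAT and MAT use different score scales, the increments are comparable within a test but not across tests.

```python
# National-norm scale scores for reading, Grades 2 through 6, at the lowest
# percentile level tabulated in Appendix B (Table 15: CAT, Table 16: MAT).
GRADES = [2, 3, 4, 5, 6]
NATIONAL_READING = {
    "CAT": [290, 324, 357, 385, 405],
    "MAT": [524, 566, 602, 628, 632],
}


def yearly_growth(scores):
    """Return the grade-to-grade differences in scale score."""
    return [later - earlier for earlier, later in zip(scores, scores[1:])]


growth = {test: yearly_growth(scores) for test, scores in NATIONAL_READING.items()}

for test, steps in growth.items():
    for grade, step in zip(GRADES[1:], steps):
        print(f"{test}: Grade {grade - 1} to Grade {grade}: {step:+d} scale-score points")
```

Within each test the increments shrink as grade level rises, which is the curvilinear pattern described in the conclusions.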
In reading the model is also robust with respect to the district variables identified in this study. The mathematics results are more complex, but it appears probable that the differences observed were due to local variables other than those identified in the design of the study.

Results of the longitudinal analyses are inconclusive. Data from one district allow the hypothesis that the cross-sectional norms provide a valid expectation for longitudinal data to be rejected. The second district's data do not. Clearly in some cases local non-Chapter 1 data may not match the results of cross-sectional data for the national or local district populations.

From the analysis of the statistical effects of local patterns of growth it is apparent that they affect the probability of Type I errors and the power of the model to detect true gains. When the local expectation is significantly higher than that based on the national norms, the probability of Type I errors is significantly increased. Small true gains are unlikely to be detected at the local level unless the local rate of growth is greater than in the national norms. Educationally significant gains of 7 NCEs or more will usually be detected at the local level except in cases where the local rate of growth is significantly less than that in the national norms.

The norm-referenced model evaluates students' total educational experience, including both their regular school program and any special programs such as Chapter 1. While this study involved only a small sample of districts, it does identify several areas that should definitely be considered when interpreting local data and that may cause bias at the national level. It is, however, possible that the norm-referenced model yields valid aggregate estimates at the national level even though the estimates of effectiveness for specific programs are biased.
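The direction of these effects on Type I errors and power can be sketched with a simple normal approximation. This is an illustration under stated assumptions, not the procedure used in the study: the observed mean NCE gain is taken to be normally distributed around the true gain plus the difference between local and national expected growth, the test is one-tailed, and the standard error and growth differences below are hypothetical values chosen only for demonstration.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal distribution (mean 0, sd 1)


def rejection_probability(true_gain, growth_difference, standard_error, alpha=0.05):
    """Probability that a one-tailed z test declares the observed gain significant.

    Assumption (illustrative): observed gain is normal with mean
    true_gain + growth_difference, where growth_difference is local expected
    growth minus national expected growth, in NCEs.
    """
    critical_z = STD_NORMAL.inv_cdf(1 - alpha)
    expected_z = (true_gain + growth_difference) / standard_error
    return 1 - STD_NORMAL.cdf(critical_z - expected_z)


STANDARD_ERROR = 2.0  # hypothetical standard error of the mean gain, in NCEs

for growth_difference in (-3.0, 0.0, 3.0):  # hypothetical local minus national growth
    type_i = rejection_probability(0.0, growth_difference, STANDARD_ERROR)
    power_small = rejection_probability(2.0, growth_difference, STANDARD_ERROR)
    power_large = rejection_probability(7.0, growth_difference, STANDARD_ERROR)
    print(
        f"difference {growth_difference:+.0f} NCE: "
        f"Type I rate {type_i:.3f}, "
        f"power for 2-NCE gain {power_small:.3f}, "
        f"power for 7-NCE gain {power_large:.3f}"
    )
```

With no true gain, a positive growth difference raises the rejection rate above the nominal alpha; a negative difference lowers the power to detect even a 7-NCE gain.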
Much work is still needed to identify the significant variables that affect the growth of local populations of students. Further research is needed to develop local models of growth in reading and mathematics achievement as measured by standardized tests.

APPENDIX A

GRAPHS OF LOCAL GROWTH CURVES

[Figures: growth curves in reading and in mathematics for each district; horizontal axis: Grade Level; vertical axis: Scale Scores.]

APPENDIX B

YEARLY DISTRICT DATA

Table 15.
Yearly District Data: CAT - Reading

Test/Norms/District   Year           Grade Level
                                      2     3     4     5     6

CAT - 10%
  National                          290   324   357   385   405
  COL                 Composite     332   365   404   434   458
                      1983          326   364   400   424   458
                      1984          338   365   404   434   460
                      1985          330   365   404   435   458
  CIM                 Composite     322   351   389   427   431
                      1983          312   346   378   425   427
                      1984          322   355   393   422   438
                      1985          326   362   393   427   438
  CLS                 Composite     345   376   416   441   464
                      1983          349   370   419   441   455
                      1984          342   383   410   456   460
                      1985          348   370   419   440   482

CAT - 30%
  National                          332   370   406   438   460
  COL                 Composite     363   399   440   476   506
                      1983          358   394   435   472   502
                      1984          365   402   444   480   510
                      1985          363   402   444   479   506
  CIM                 Composite     351   393   426   460   471
                      1983          348   380   424   454   464
                      1984          351   395   430   463   471
                      1985          354   395   430   460   481
  CLS                 Composite     370   404   446   476   506
                      1983          373   402   448   476   498
                      1984          373   408   440   482   506
                      1985          366   408   450   464   510

CAT - 50%
  National                          360   401   443   473   500
  COL                 Composite     385   423   470   505   540
                      1983          383   423   460   496   534
                      1984          388   425   472   507   548
                      1985          385   425   470   507   544
  CIM                 Composite     372   412   449   481   500
                      1983          369   403   449   476   493
                      1984          375   417   449   484   493
                      1985          372   423   451   484   508
  CLS                 Composite     392   432   466   496   532
                      1983          392   425   469   496   528
                      1984          392   432   460   504   535
                      1985          387   432   469   493   532

Table 16.
Yearly District Data: MAT - Reading

Test/Norms/District   Year           Grade Level
                                      2     3     4     5     6

MAT - 10%
  National                          524   566   602   628   632
  MOL                 Composite     545   582   631   658   681
                      1983          584   621   631   654
                      1984          516   566   625   673
                      1985          560   558   631   628   681
  MIM                 Composite     561   607   637   660   686
                      1983          568   608   626   649   701
                      1984          546   625   646   649   690
                      1985          561   597   639   668   669
  MLS                 Composite     510   546   616   627   675
                      1983          489   521   640   613   678
                      1984          488   603   456   637   680
                      1985          534   540   618   600   675

MAT - 30%
  National                          576   621   655   682   693
  MOL                 Composite     596   629   655   697   721
                      1983          620   643   651   704
                      1984          581   630   652   711
                      1985          596   598   665   686   721
  MIM                 Composite     608   647   683   693   725
                      1983          616   649   669   691   753
                      1984          604   657   691   690   725
                      1985          612   636   689   701   708
  MLS                 Composite     571   630   653   688   717
                      1983          552   615   678   688   724
                      1984          560   657   650   690   712
                      1985          603   606   647   688   714

MAT - 50%
  National                          620   661   695   718   737
  MOL                 Composite     624   645   691   723   751
                      1983          646   676   709   732
                      1984          608   643   671   736
                      1985          620   629   703   714   751
  MIM                 Composite     644   670   719   718   769
                      1983          646   672   711   718   786
                      1984          630   685   727   716   755
                      1985          652   653   727   721   749
  MLS                 Composite     625   664   685   703   747
                      1983          615   658   698   709   753
                      1984          617   680   660   706   744
                      1985          625   642   695   703   753

Table 17. Yearly District Data: CAT - Mathematics

Test/Norms/District   Year           Grade Level
                                      2     3     4     5     6

CAT - 10%
  National                          309   343   372   395   418
  COL                 Composite                       429   452
                      1983                            423   450
                      1984                            430   452
                      1985                            432   454
  CIM                 Composite                       422   442
                      1983                            417   440
                      1984                            422   446
                      1985                            428   444
  CLS                 Composite                       438   468
                      1983                            425   463
                      1984                            447   462
                      1985                            436   481

CAT - 30%
  National                          336   374   406   436   462
  COL                 Composite                       458   484
                      1983                            454   480
                      1984                            458   484
                      1985                            459   486
  CIM                 Composite                       452   473
                      1983                            451   467
                      1984                            448   473
                      1985                            457   479
  CLS                 Composite                       459   500
                      1983                            455   494
                      1984                            468   497
                      1985                      433   453   509

CAT - 50%
  National                          352   394   428   463   491
  COL                 Composite                       477   509
                      1983                            475   503
                      1984                            480   509
                      1985                            477   509
  CIM                 Composite                       474   495
                      1983                            474   490
                      1984                            469   495
                      1985                            474   500
  CLS                 Composite                       478   524
                      1983                            475   522
                      1984                            480   519
                      1985                            478   533

Table 18.
Yearly District Data: MAT - Mathematics

Test/Norms/District   Year           Grade Level
                                      2     3     4     5     6

MAT - 10%
  National                          403   454   501   546   564
  MOL                 Composite     464   498   553   579   604
                      1983          496   497   550   586
                      1984          436   509   555   604
                      1985          472   497   537   569   604
  MIM                 Composite     464   526   578   619   637
                      1983          460   532   563   609   676
                      1984          459   526   588   609   631
                      1985          464   526   578   619   637
  MLS                 Composite     425   484   511   591   639
                      1983          418   487   554   609   631
                      1984          399   469   486   542   646
                      1985          435   486   524   571   639

MAT - 30%
  National                          468   521   578   624   646
  MOL                 Composite     501   541   588   628   676
                      1983          517   541   583   636
                      1984          483   552   592   636
                      1985          507   531   574   624   676
  MIM                 Composite     507   575   626   658   704
                      1983          492   588   618   656   723
                      1984          516   588   626   640   698
                      1985          533   546   646   688   682
  MLS                 Composite     478   531   569   632   678
                      1983          486   524   600   632   660
                      1984          460   562   539   635   696
                      1985          493   515   569   632   696

MAT - 50%
  National                          507   567   622   664   696
  MOL                 Composite     525   571   618   660   721
                      1983          545   593   626   656
                      1984          517   588   618   683
                      1985          541   551   618   660   721
  MIM                 Composite     541   608   666   688   741
                      1983          517   628   651   683   754
                      1984          541   608   666   668   741
                      1985          558   573   681   730   717
  MLS                 Composite     507   569   617   656   722
                      1983          519   569   643   660   695
                      1984          486   584   591   653   731
                      1985          515   558   608   667   741