MSU RETURNING MATERIALS: P1ace in book drop to LJBRARJES remove this checkout from —_ your record. FINES Win be charged if book is returned after the date stamped be10w. "NF ", ,2 ' .. V. '9’?!) c 5 F V; a L 6i§Kfi7 '3. - L . 2 ’v‘ 9‘- \J, ~ g a 0A; A Comparison of Four Methods for Setting C min. and C max. in Hofstee's Compromise Standards Setting Model bY Robert C. Korte A DISSERTATION Submitted to Michigan State University in partial fulfillment of the requirements for the degree of DOCTOR OF PHILOSOPHY Department of Counseling and Educational Psychology 1987 ABSTRACT A COMPARISON OF FOUR METHODS FOR SETTING C MIN. AND C MAX. IN HOFSTTE'S COMPROMISE STANDARDS SETTING MODEL BY Robert C. Korte The purpose of the study was to examine four approaches to setting the values of C min. and C max. in Hofstee's compromise standard setting model. The approaches were the Angoff, direct estimate, and Zieky and Livingston's contrasting groups and borderline methods. Each approach was applied to Reading, Math, and Reference Skills tests used for high school graduation in a mid-western urban school district. The results of the study were first presented by approach and then contrasted across approaches in terms of cutting scores yielded and relationships to varying F min’. and F max. by each method for each test. The results of these approaches were also compared to the results obtained from traditional Angoff, contrasting groups, and borderline methods. There were eleven key findings in the study. The cutting scores that resulted from the four approaches for setting C min./C max. and the traditional approaches tended to be reliable, evidenced little significant difference between scores, generalized well between groups and evidenced robustness relative to changes in F min./F max. values. Relative to specific approaches, the Angoff method, in addition to evidencing small C min./C max. differences, consistently produced the highest scores, the direct estimation was the least complex methodologically and the contrasting groups method tended to set lower cutting scores with the traditional method. In addition with the contrasting groups method, large differences in terms of the selected points in the distribution yielded small cutting score differences and the borderline group compromise method yielded higher scores than the traditional method. The study also indicated that in several instances the line (P1, P2) failed to intersect fc and that the percent of students who failed the tests changed greatly with relatively small changes in cutting scores. ACKNOWLEDGEMENTS Without the patience, understanding and love of Susan, Benjamin and David, this study would never have been completed. I thank them and look forward to repaying them for all those times that i had to be away. iii TABLE OF CONTENTS Chapter I - The Problem . . . . . . . . . . . . . . Introduction . . . . . . . . . . . . . Purpose of the Study . . . . . . . . . Overview of Study. . . . . . . . . . . Chapter II - Related Literature . . . . . . . . . . . Standard Setting Models . . . . . . . Establishing a Range of Acceptable Cutting Scores . . . . . . . . . . . Comparison to Other Compromise Models. Methods for Setting Cutting Scores . . Models for Determining C min./C max. . Chapter III - Methodology. . . . . . . . . . . . . . setting. 0 I O O O O O O O O O 0 Procedures . . . . . . . . . . . Instruments and Data Collection. cm‘ ttees O O O O I O O O O I 0 Modified Angoff Method . . . . . Zieky and Livingston Procedures. Direct Estimation. . . . . . . . Determination of F min. and F max. iv PAGE NO. 10 IO 13 1h 18 21 24 2h 26 32 3h 35 36 37 38 Chapter III - Methodology continued Chapter Chapter TABLE OF CONTENTS continued Traditional Procedures Within the Angoff, Contrasting Groups and Borderline Groups Determination of the Traditional Contrasting Groups Cutting Scores. . . Analyses O O O O O O O O O O O O O O O 0 IV - RCSUICS. a o a o o a o o a a o o o o o o 0 Description of Scores and Distribution in Terms of f (c). . . . . . . . . . . Angoff Method. . . . . . . Contrasting Groups Method. Borderline Groups Method . Direct Estimation. . . . . Influence of Values for F min on the Cutting Score . . . Comparison of Methods. V - Discussion . . . . . . Angoff . . . . . . . Contrasting Groups . Borderline Group . . Direct Estimation. . F min./F max. Value Changes. Comparisons Between Methods. . . Summary and Implications of Findings Limitations and Suggestions for Further Research . . . . . . . . . . . . . . . Concluding Remarks . . . . . . . . . . . V PAGE NO. . 39 . 40 . #1 . 48 . #8 . 58 . 7A . 85 . 9h . 99 . 112 . 132 . 132 . 135 . 136 . 136 . 137 . 137 . 138 . 1A6 . 1h8 TABLE OF CONTENTS continued PAGE NO. Appendix A . . . . . . . . . . . . . . . . . . . . . . . . . . 151 Append ‘ x B O O O O O O O O O O O O O O O O O O O O O 0 O O O O 156 List Of References 0 O O O O O O O O O O O O O O O O O O O O O 16] vi Table Table Table Table Table Table Table Table Table Table Table Table Table Table 10 ll 12 13 1% LIST OF TABLES F min./F max. Estimates by the Superintendent. Arbitrary Selection of F min./F max. . . . . . Results of Field Test and 9th Grade Groups . . Rater Means - Reading. . . . . . . . . . . . . Rater Means - Math . . . . . . . . . . . . . . Rater Means - Reference Skills . . . . . . . . Means, Standard Deviation and Cutting Scores Across Raters. . . . . . . . . . . . . . . . Average Absolute Error and Estimate of Consistency - 9th Grade Angoff Rating - Reading. 0 O O O O I O O O O O O O I O O O 0 Average Absolute Error and Estimate of Consistency - 12th Grade Angoff Rating - Read‘ng. I O I O O O O O O O O O O I O I O 0 Average Absolute Error and Estimate of Consistency - 9th Grade Angoff Rating - Math 0 O O O I O O O O O O O O O O O I O O 0 Average Absolute Error and Estimate of Consistency - 12th Grade Angoff Rating - Math 0 O O O O O O O O O O O O O O O O O O 0 Average Absolute Error and Estimate of Consistency - 9th Grade Angoff Rating - Reference Skills . . . . . . . . . . . . . . Average Absolute Error and Estimate of Consistency - 12th Grade Angoff Rating - Reference Skills . . . . . . . . . . . . . . Average Absolute Error and Estimate of Consistency Across Judges. . . . . . . . . . vii- PAGE NO. . 38 . 39 . 49 . 58 . 59 - 59 . 6o . 61 . 61 . 62 . 62 o 63 . 63 . 64 Table Table Table Table Table Table Table Table Table Table Table Table Table Table Table Table Table 15 16 I7 18 19 20 21 22 23 2h 25 26 27 28 29 30 31 LIST OF TABLES continued Summary of C min. and C max. Values. . . . . Cutting Scores, Po and k Resulting From the Angoff Method Based on Solutions 1-h . . . Cutting Scores, Po and k Resulting From the Traditional Angoff Method. . . . . . . . . Counts, Mean Scores and Standard Deviation by Rated Category of Mastery . . . . . . . Percent of Students by Category by Test - Test Sample vs. Total Sample . . . . . . . Cutting Scores Set at Various Points in the Master/Non-master Distributions. . . . . . Cutting Scores, Po and k Resulting From the Contrasting Groups Method. . . . . . . Optimal Cutting Scores for Each Test Based on the QDF O O O O O I O O O O O O I O O 0 Distribution of Students and Mean Scores Borderline Students. . . . . . . . . . . . C min./C max. Values of Borderline Group . . Cutting Scores, PO and k Resulting From the Borderline Group Method. . . . . . . . Cutting Scores, Po and k Based on the Mean . C min. and C max. Scores - Reading . . . . . C min. and C max. Scores - Math. . . . . . . C min. and C max. Scores - Reference Skills. Cutting Scores, PO and k Resulting From the Direct Estimation Method . . . . . . . PAGE NO. . 6h - 73 o 7“ - 75 . 75 . 8o . 84 . 85 . 85 . 89 . 93 . 9h . 9h . 95 . 95 O 99 Cutting Scores Given Various F min./F max. Values Angoff Method for Setting C min./C max.. . . . . 108 viii LIST OF TABLES continued PAGE NO. Table 32 Cutting Scores Given Various F min./F max. Values Contrasting Groups Method for Setting c minOIC max.. 0 O O O O O O O O O O O O O O O O 109 Table 33 Cutting Scores Line 0/100 vs. Line 25/50 - Field Test 0 I O O I O O O O O O O O O O O O O O '10 Table Bk Differences of Cutting Scores - Field Test . . . . 111 Table 35 Cutting Scores, Po and k, SEM and Failure Rates Across Methods . . . . . . . . . . . . . . 113 Table 36 Cutting Scores, Po and k, SEM and Failure Rates Based on Selected Procedures . . . . . . . 116 Table 37 Cutting Scores, Po and k, SEM and Corres- ponding Failure Rates Based on Traditional Procedures 0 O O I O O O O O I O I O O O O O O O 118 Table 38 Comparison Between Field Test and 9th Grade Groups Cutting Score/Failure Rate. . . . . . . . 119 Table 39 Comparison Between Field Test and 9th Grade Failure Rates Based on Generalization of cutting scores 0 O O O I O O O O O O O O O O O O 120 Table 40 Comparison of 9th Grade Failure Rates Between Generalization of C min./C max. and Cutting scores 0 O O O I O O O O O O O O O O I O O O O 0 121 Table hi Comparisons of k Across Methods for Selected scores I O O O O O O O I O O O O O O O O O O O O 123 Table #2 Percent of False Positives and False Negatives Based on Master/Non-master Groups. . . .. . . . 125 Table #3 Cutting Scores and Failure Rates Based on Selected Procedures and Traditional Methods. . . 126 Table Ah Probability of Overlapping Confidence Intervals at P - .50, .68, and .95 - Reading . . . . . . . 129 Table 45 Probability of Overlapping Confidence Intervals at P - .50, .68, and .95 - Math . ; . .. . . . . 130 Table #6 Probability of Overlapping Confidence Intervals at P - .50, .68, and .95 - Reference Skills. . . 131 Figure Figure Figure Figure Figure Figure Figure Figure Figure Figure Figure Figure Figure Figure Figure Figure Figure 10 11 12 13 14 15 16 17 LIST OF FIGURES Illustration of Hofstee's Compromise Model . . Illustration of Beuk's Compromise Model. . . . Illustration of De Gruljter's Compromise Model Language Arts Competencies . . . . . . . . Math Competencies. . . . . . . . . . . . . Master/Non-master Distributions. . . . . . Borderline Distribution. . . . . . . . . . Field Test fc Distributions by Grade Reading. . . . . . . . . . . . . . . . . Field Test fc Distributions by Grade "8th 0 O O O O C O O O O O I O O O O O 0 Field Test fc Distributions by Grade Reference Skills . . . . . . . . . . . . A Comparison of Field Test and Ninth Grade fc Distributions - Reading a o e o o o e A Comparison of Field Test and Ninth Grade fc Distributions - Math. 0 o a o e a e a A Comparison of Field Test and Ninth Grade fc Distirbutions - Reference Skills. . . Angoff Cutting Score Plots - Reading . . . Angoff Cutting Score Plots - Math. . . . . . . Angoff Cutting Score Plots - Reference Skills. Solution to the Problem of Non-intersection With fc - Reading. 0 O O O O O O O O O O O O X 1 PAGE NO. . 7 . 16 . 18 . 25 . 26 e 30 . 31 . 51 . 52 . 53 - 55 . 56 . 57 . 66 . 67 . 68 . 70 Figure Figure Figure Figure Figure Figure Figure Figure Figure Figure Figure Figure Figure Figure Figure Figure Figure Figure 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 3h 35 LIST OF FIGURES continued Solution to the Problem of Non-intersection With fc - Hath o o a o o a a o a a o o o a Solution to the Problem of Non-intersection With fc - RGFGFCHCG Skills 0 e a o a a o a Competency Distributions - Reading . . . . . Competency Distributions - Math. . . . . . . Competency Distributions - Reference Skills. Contrasting Groups Reading. 0 O O O O O O O O O O O O Contrasting Groups "8th 0 I O O O O O O O O O O C O I Contrasting Groups Reference Skills . . . . . . . . . Borderline Borderline Borderline Borderline Reading. Borderline Math . . Borderline Group Group Group Group Group Group - Reading . . . . . "8th. 0 O O O O O - Reference Skills. Cutting Score Plots Cutting Score Plots Cutting Score Plots Reference Skills . . . . . . . . . Direct Estimation Cutting Score Plots - Cutting Score Plots Cutting Score Plots Cutting Score Plots Reading. . . . . . . . . . . . . . . . Direct Estimation Cutting Score Plots - Hath O O O O O O O O O O O O O O O O 0 Direct Estimation Cutting Score Plots - Reference Skills . . . . . . . . . . . Angoff Comparisons Across Various F min./F values - Reading a a a o o a a o a a a a a o e a ‘02 xi max 0 PAGE NO. .. 71 .. 72 - - 77 .. 78 . . 79 . . 81 . . 82 . . 83 . . 86 . . 87 . . 88 . . 90 . . 91 . 92 . . 96 . . 97 . . 98 LIST OF FIGURES continued PAGE NO. Figure 36 Angoff Comparisons Across Various F min./F max. values - Hath. o a o e e a a a o e e a a o a a o 103 Figure 37 Angoff Comparisons Across Various F min./F max. Values - Reference Skills. . . . . . . . . . . . 10h Figure 38 Contrasting Groups Comparisons Across Various F min./F max. Values - Reading . . . . . . . . . 105 Figure 39 Contrasting Groups Comparisons Across Various F min./F max. Values - Math. . . . . . . . . . . 106 Figure 40 Contrasting Groups Comparisons Across Various F min./F max. Values - Reference Skills. . . . . 107 xii CHAPTER I THE PROBLEM Introduction The use of a test to determine a person's competence or to grant certification based on the attainment of specified or desired levels of competence in a specific area usually necessitates that a single passing or cutting score be determined for the test. This score represents a standard: people who score above that score can be considered to be masters of some specified domain and people who score below that score can be considered to not be masters. Standards are not only used to make judgments about an individual person's competence but they also relate to, and are a part of, a larger social and often political setting. When the test is used to determine personally important events (graduation from high school, entrance into a profession, etc.), the integrety/validity of the cutting score is extremely important. The purpose of a competency test is to make the determination of which persons meet or exceed the "standard" and which do not. In educational settings, for example, the purpose is "to communicate with individuals outside of the classroom something about the students' competence in a particular curricular sequence" (Majestic, 1979). As 2 such, the test and its accompanying cutting score must represent the intent of the standard. The standard itself, in the case of competency testing, represents the judgment by a person or group of persons of where on a continuum of levels of proficiency a line should be drawn above which are masters/competent individuals and below which are non-masters/non-competent individuals. A further judgment is then made regarding what test scores represent that standard. In other words, determining what score on a test represents an a priori standard is a matter of judgment. This reliance on judgment in setting cutting scores on tests led Glass (1978) and Burton (1978) to feel that the use of standards and the use of cutting scores is unsound since being judgmental in nature they are arbitrary. In separate rebuttals Popham (1978), Block (1978) and Hambleton (1978) argued that while standard setting does in fact require human judgment, it is not necessarily arbitrary in a capricious sense. They indicated that determining standards on a test can be done in a deliberate and reasonable manner and that the classification of people for various reasons is unavoidable and should be done in as defensible a manner as possible. The important point in the debate is that the determining of standards and the subsequent setting of cutting scores on tests is both a judgmental process and unavoidable in practical situations. Mehrens (1978) summarizes this point quite well: 3 'There is no question but that we make categorical decisions in life. If some students graduate from high school and others do not, a categorical decision has been made whether or not one uses a minimal competency exam. Even if everyone graduated, there would still be a categorical decision if there were the possibility (philosophical or practical) of not graduating. If one can conceptualize performance so poor that the performer should not graduate, then theoretically a cut off score exists. What proponents of minimal competency exams seem to be saying is that they believe, at least philoSOphically, that a level of incompetence too low to tolerate does exist; and thatione ought to define that level so as to make decisions less abstract, less subjective, and hopefully a little less arbitrary than they are currently (p. #9). Popham (1986) describes the "dangers" inherent in setting passing scores as being political and legal. Tuckman and Nadler (1979) describe competency testing as being a "highly politicized process" (p.3) in their examination of the relationships between the socioeconomic level of a school district and the setting of standards. The cost of remediation for students who receive scores below the standard, for example, can result in wealthier districts setting more stringent standards since they would have a ”considerably lesser number of truly 'incompetent' students" (p. 6). Huynh (1979) presents a model for taking budgetary constraints into account when setting cutting scores. His model examines both the budgetary constraints and the factor that the cost of remediation varies with ability (greater ability costs less to remediate) in providing cost information for making decisions about determined cutting scores. Anderson (1979) lists additional costs that have political implications: educational programming, psychological impact of failing (particularly for specific segments of the population) and social impact (again i. particularly for specific segments of the population). Regardless of the accuracy or validity of the cutting score, if specific segments of the intended population are disproportionably affected, there may be important political ramifications that must be addressed. One aspect of the setting of cutting scores being done in a political setting is that usually there are expectations for the number or percent of persons who will "pass" or "fail" to meet specified standards. For example, the recent movement to set specific skill standards for high school graduation was motivated at least in part by the impression by communities supporting schools that too many students were graduating from high school without "sufficient skills." A standard for high school graduation that allowed everyone to pass would, in this case, most likely be unacceptable to such groups. In many other instances, or to other groups in the same setting, a standard that permitted no one to graduate would be equally unacceptable. As Scriven (1978) indicates, there are different audiences of competency tests each with their own needs. The setting of an absolute standard that resulted in either extreme would in effect not be politically viable. This would seem to imply that there is a range of acceptable passage rates that the standard may yield. As such, standards are implicitly normative. Shepard (198A) and Hofstee (1973: as reported by De Gruljter, 1985) both argue that standards are based on an often implicit understanding of what is the expected: in other words, typical performance. On the other hand, ignoring absolute standards in favor of taking some set percent of examinees who should pass or fail a specified test as the standard is equally unacceptable. This would represent a truly 5 capricious approach to the problem in that it would avoid any reference to the meaning or content of the standards. Shepard (198k) suggests that "both normative and absolute comparisons establish the boundaries within which plausible standards can be set." In other words, both the absolute standard and the implied normative standard need to be considered in determining cutting scores on tests. As Skakum and Kling (1980) found in their study comparing normative and absolute methods, use of the two approaches can lead to different cutting scores. This implies that there is a need for specific procedures for using both the absolute and normative standards in setting cutting scores. Typically, their simultaneous use has led to cutting scores being set in an absolute manner and then they are more or less arbltarily adjusted until they fall within the a priori normative range. Unfortunately, such adjustment may be done in an unsystematic and capricious manner. This reinforces Glass's criticisms of cutting scores and their use. There is clearly a need to effect a systematic compromise between absolute and implied normative standards. De Gruijter (1985) describes three models that are compromise procedures that systematically employ both absolute andiunwmtive standards in the determination of cutting scores. 0f the three models, one developed by I-Iofstee appears clearly to be preferable. The Hofstee model identifies a range of acceptable cut scores; the outside parameters of which are compromise scores determined by both absolute and normative criteria. The point in the range that intersects with the distributions of failures given various cutting scores represents the cutting score for the test. 6 Using the Hofstee model, the standard setting committee or person first identifies a maximum cutting score (C max.) that represents the score "set high enough so that you would believe it signified mastery even if every student taking the test attained this score" (Shepard 19811, p. 187). The committee or person then identifies a minimum score (C min.) "set low enough so that every student scoring below it should surely fail, even if every student scored below it" (p.187). Finally, the committee, person or third source (Board of Education, etc.) sets the maximum failure rate (F max.), the "maximum acceptable percentage of failures" (p. 187) and the minimum failure rate (F min.), "the minimum acceptable percentage of failures" (p. 187). The range of acceptable scores is then bounded by P] which is determined by C min. and F max. and P2 which is determined by C max. and F min. The cumulative frequency distribution, f(c), of failure rates given various cutting scores at corresponding raw scores (X) with a ”failure percentage equal to the cumulative percentage of raw score X-I" (De Gruijter 1985, (p. 265) is determined and plated relative to P1 and P2 as illustrated in Figure 1. The actual cutting score is then determined by the intersection of f(c) with a line defined by P1 and P2. There is, then, a need for a compromise model between setting cutting scores based on absolute standards and the realities of the situation in terms of the percent of individuals who pass and fail a competency test. The Hofstee compromise method appears to meet this need. There is a problem, however, relative to the selections of the best method(s) for setting the minimum and maximum scores (C min. and 7 C max.) that define the range of acceptable scores. Three approaches appear to be viable answers to this score determination problem: 1. directly estimating the values; 2. the use of item-content judgment (in which a modification of Angoff's method appears to be the best for our purposes); and 3. the use of empirical data (in which Zieky and Livingston's contrasting and borderline groups methods appear to be best). f . f(c) Pruu__ P1 1“ ------------ , I I I I F MN I I PH I l I a I CHIN c CHM! c ILLUSTRATION OF HOFSTEE'S COMPROMISE MODEL FIGURE 1 Purpose of the Study The purpose of this study was to examine four methods of determining the C max. and C min. values in Hofstee's compromise model for setting test cutting scores. The four methods were: 1. direct estimation of both C max. and C min.; 2. Zieky and Livingston contrasting groups methods setting C min. and C max. based on the means and selected points in the master/non-master distributions; 3. Zieky and Livingston borderline groups setting C min. and C max. at the mean and selected points in the positive direction of the distributions; and A. modification of the Angoff method to set a C min. based on the concept of a minimally competent individual and a C max. based on the concept of a maximally competent individual. The four methods were examined relative to the final cutting scores based on fixed values of F min. and F max. and values of C max. and C min. as determined by each method. The methods were compared relative to their similarity and differences in terms of setting the final cutting score. In addition, the methods were compared to the cutting scores yielded by the application of traditional methods (Angoff method corrected for normative data, Quadratic Discriminant Function for the contrastng groups procedure and the mean of the borderline group). Comparisons to the traditional methods were made in order to examine each of the methods for setting C min. and C max. relative to points (in this case cutting scores) outside of the compromise model. Overview of Study Chapter II examines the literature that is related to both setting cutting scores in general and more specifically the concepts incorporated in the Hofstee compromise model. Specifically, the literature suggests and supports the four approaches for setting C min./C max. examined in the study. In Chapter III, in the Methodology Section, the setting for the study is presented and the specific procedures that were used are explained. This is followed by a presentation of the results in Chapter IV, in narative, tabular and figurative form. These results have a number of interesting Implications which are then discussed in Chapter V. Chapter V also presents limitations of the study, suggestions for further research and a brief speculation on setting cutting scores in general. CHAPTER II RELATED LITERATURE Standard Setting Models Berk (1986), in an extensive review of 38 methods for setting performance standards, presents a taxonomy for classifying various standard setting/cutting score models. He defines five categories of models of his taxonomy: State: These models assume that mastery competence or true score performance is an all-or-nothing state. The true standard is set at 1002, and deviations from this true state are presumed attributable to false mastery/competency or false nonmastery/incompetency errors. Once these errors are considered, the models can be used to adjust the standard to values less than 1002 (e.g., 98%). Glass (1978) referred to these models as J'counting backwards from 1002“ (p. 2AA). Continuum: These models assume that mastery competency is a continuously distributed ability that can be viewed as an interval circumscribing the boundaries for mastery/competence. Judgmental: These methods are based entirely on the judgments of one or more persons. Where several judges are involved, the decisions may be reached independently or from a panel discussion. No performance data are made available to the judges. Judgmental-empirical: These methods are based primarily on the judgments of one or more persons with performance data made available to guide those judgments. Where several judges are involved, the decisions may be reached independently or from a panel discussion. Some of these methods may be IO 11 viewed as a compromise procedures between absolute methods, which do not take into account performance data, and relative methods, which incorporate those data. Other judgmental-empirical methods are composite or combination procedures that reconcile differences between judgmental and empirical- judgmental methods. Empirical-judgmental: These methods are based primarily on performance data from one or more groups of examinees and statistical analysis of those data. Judgment is used to define the criteria for mastery/competency and non- mastery/incompetency, to select individuals for the group(s), to specify a loss ratio expressing the relative benefits and costs of the decision outcomes, as well as to determine other aspects of the procedures. Some of the empirical judgmental methods use the information to set performance standards; others use the Information to estimate false mastery/competency and false non- mastery/incompetency error rates and then to adjust the standard after a consideration of those rates. ,(Note that these techniques presume the "best" judgmental standard has been chosen before any adjustments can be made; they are not methods for setting standards) (p. 139-140). Of the models listed by Berk, fifteen (15) were either state models (8) or models to ”adjust standards" (7) based upon a known initial standard. These methods have little or no utility to the problem of setting standards. For our purposes, the "state" models have little use since they assume that the trait(s) in question is truly dichotoums in nature. In contrast to the state approach, the assumptions of virtually all educational competency testing is that there Is an underlying skill continuum. Also, the seven models to "adjust standards” have little use since they assume that the standard has already been identified. The remaining models fall under the continuum levels of the taxonomy. 12 The remaining twenty-three (23) methods discussed by Berk, hi fact, fall under the continuum level of his taxonomy. They, in turn, fall into either judgmental, judgmental-empirical or empirical- judgmental categories. The majority of the judgmental methods are actually variations or extentions of the Angoff (1971) and Ebel (1979) models. The only six (6) exceptions are models by Beuk (1984), Cangelosi (1984), Mills and Barr (1983), Nedelsky (1954), Yalow and Popham (1983) and Hofstee (1983). Regarding the judgmental and judgmental-empirical methods, Berk concluded that: 1. different methods produce different standards when applied to the same test either by the same judges or by a randomly parallel sample of judges; 2. the Nedelsky method yields consistently lower standards than the Angoff or Ebel methods; and 3. there is inconsistency among judgments within a sample method as well as across methods. In general, he concluded that the ”Angoff method appears to offer the best balance between technical adequacy and practicability. If a judgmental approach is desired for a particular test application, then this method could be appropriate" (p. 147). van der Linden (1982) describes the Angoff method as follows: It is suited for dichotomously scored items and consists of the following few steps: A content specialist is asked to imagine a student just meeting the requirements as formulated in the learning objectives. This may be a hypothetical as well as a real student. Keeping this borderline student in mind, he/she is requested to inspect the test, item by item, and to specify for each item the probability that the student will answer it correctly. The standard is equal to the sum of the probabilities (p. 296). 13 Berk describes five empirical-judgmental methods: borderline -group (Livingston and Zieky, 1982), contrasting groups (Livingston and Zieky, 1982), criterion groups (Berk, 1976), educational consequences (Block, 1972) and norm-referenced criterion (Garcia-Quintana and Mappas, 1980). Of these methods in general, he states "the net effect of incorporating an empirical base in standard setting is that it improves the technical adequacy of the method, but, in most cases, reduces its practicability" (p. 153). Specificallyg he gives the contrasting groups method the highest overall rating. In his discussion of the various methods, Berk goes on to state that ”the compromise methods, especially, deserve serious attention by standard-setting researchers. They were developed only recently by the Dutch researchers Beuk and Hofstee and, therefore, have not been scrutinized by certification test specialists in this country" (p. 153). Koffler (1980) describes the contrasting groups method as follows: The contrasting groups method' sets a standard at the test score that best separates those students judged to be masters from those students judged to be non-masters. To implement the contrasting groups procedure one must determine the mastery/non-mastery status and the corresponding test score for a sample of examinees (p. 171). The borderline group method cutting score is then defined as the mean of students who fall between the master and non-master groups on the borderline of the total distribution. 14 Establishing a Range of Acceptable Cutting Scores 'Two of the few test specialists who have scrutinized the Hofstee methods, Mills and Mellcan (1986) state that: The Hofstee method evaluates "worst case" possibilities: "based on the information provided in the responses to the cut score and failure rate questions, we would be willing to accept a cut score as high as C max. provided the failure rate did not exceed F maxu Further, we would accept a cut score as low as C min. provided the failure rate was at least F min.". The points on the line connecting the extremes C max., F min. and C min, F max. represents the acceptable alternative combination of cut scores and passing rates (p. 7)." It is interesting to note that the line described by Mills and Melican represents a range of acceptable cutting scores as opposed to a single score as Is yielded by most other commonly used methods. There is considerable support in the literature for the notion of a cutting score range. Scriven (1978) in his response to Glass indicates that different procedures may in fact set different standards within the same continuum of abilities. Lockwood, Halpin and McLean (1986) take this notion one step further In their description of a sample distribution of cutting scores across procedures where the cutting scores yielded by various procedures form a normal distribution and are subject to standard notions of sampling error, etc. Given the existence of a population standard (II), Fhavers (as described by Hsu, 1980; Wilcox, 1976; and van der Linden, 1982a) purposed an "indifference zone" which Hsu defined as "a range of 15 values of 11 for which the examiner is indifferent concerning classification of the examinee" (p. 710). Hsu goes on to describe the zone as being bounded by: IEz-the minimum value of II which should lead to the classification of an examinee as having fulfilled the course requirements, and Irj-the maximum value of II: which should lead to the classification of an examinee as not having fulfilled the course requirements. The zone is based on the notion that errors closer to II are of less importance than those further away. Probabilities are then estimated for being outside the zone. In addition, van der Linden states that the zone need not be symetrical around 11 since one type of misclassification may be more important than the other for a given situation. While evidencing specific procedural problems, Emrick also proposed a model (as described by Wilcox and Harris, 1977) for estimating the probability of incorrect classification based on specified population parameters. While describing the value of using a range of acceptable cutting scores in probabilistic terms, the above authors fail to provide an efficient and operative means of determining the parameters for the range aside from mathematical models to minimize the probability of misclassification given selections of It, 111 and/or 112. The advantages of using Hofstee's method is that its application is fairly straight forward and that it uses or implies a range of cutting scores. However, one seemingly critical problem with Mofstee's model is setting the values for the parameters of the range (C min. and C max.). In describing his model, Hofstee (1983) fails to 16 indicate precisely how C min. and C max. are to be operationally determined beyond being set by a single person or committee given the criteria stated above. The implication is that the two scores are directly set by an individual or committee familiar with both the test and the desired competency level in absolute terms. Caparlson to Other Cuproulse Models A second compromise model establishing a range of scores was proposed by Beuk (1984) and involves the committee directly stating the "preferred" values of c and f. The means and standard deviation of these values are then determined and denoted as c, f, Sf and sc. The group standard (0) is determined by c and f. This point is plotted and a line drawn through it with a slope equal to - Sf/Sc the intersection of which with f(c) determines the actual cutting score for the test. Figure 2 illustrates Beuk's approach. f(c) I. I I I l l I l I I - c c ILLUSTRATION OF BEUK'S COMPROMISE MODEL FIGURE 2 17 De Gruljter's critique of Beuk's model appears to be appropriate. He states that in using the ratio Sf/Sc "the committee members may have quite similar ideas about, for example, f, possibly resulting in a small value Sf/Sc. At the same time experience may have taught them not to be too certain of the exact value of f, causing them to have a preference for an absolute rather than a relative standard" (p. 265). The third model, prepared by De Gruijter himself, avoids Beuk's problem but appears to be complex and not readily understandable to persons who are setting standards or using the resulting cutting scores to make decisions. The procedure stipulates that each committee member or person identify the ideal cutting score (co) and the corresponding failure percentage (f0) and then estimate the uncertainty with respect to the true values of<:and f. The uncertainty estimates are denoted as uc and Uf and form a ratio r which equals Uf/Uc. All values of c, f and r in combination form the circumference of an ellipse r2(c-co)2+(f-fo)2-d2 "where d is half the length of the ellipse in the vertical direction” (p. 266). As illustrated in Figure 3, the contact point between f(c) and the ellipse represents the actual cutting score. f(c) Q t' .c ILLUSTRATION OF DE GRUIJTER'S COMPROMISE MODEL FIGURE 3 Methods for Setting Cutting Scores While Hofstee's rather general approach to the problem of determining the range is preferable to Beuk's and DeGruijter's models, it would appear that the model may benefit from more systematic and perhaps more meaningful procedures for estimating C min. and C max. As indicated above, and confirmed in numerous studies such as those by Koffler (1980), Mills and Barr (1983), and Skakum and I(iing (1980), different methods for setting cutting scores often lead to different results. This points to the need as Koffler (1980) indicated to examine different methods for determining cutting scores. 19 This may be no less true in examining different options for setting C thin. and C max. Therefore, various approaches should be examined and those with the greatest potential applied to the Hofstee model and compared in the process of determining cutting scores for actual tests. As a part of the examination, it is interesting to note that there appears to be a "high level of agreement in the determination of a cut-off scores using the same method across two teams of judges" (Hambleton, 1978, p. 284). This finding may indicate that using a number of different teams of judges for specific methods may not be necessary in examining and/or comparing methods. Shepard, in a review of standard setting approaches, proposes two categories for methods. Shepard (1984) refers to Berk's judgmental and judgmental-empirical categories as "standard-setting procedures based on judgments of test content" and the empirical-judgmental as those that rely on "empirical information." It should be noted, however, as Berk (1986) indicates, the empirical methods are not entirely free of subjective judgment. A third general approach falling into Berk's empirical-judgmental category and not designated by Shepard in her article is based on a beta-binomial distribution and includes the Bayesian, and Empirical Bayes approaches (described by Huynh; 1979) and the related binomial approach described by Wilcox (1979). These beta-binomial and related approaches set the optimal cutting score for a test given an identified standard in terms of known ability. The approaches, however, have two drawbacks that preclude their use in the current study. First, they are dependent on the identification of an a priori standard in ability terms. As Berk (1986) indicates, these are 20 designed to adjust standards in terms of observed scores once a ”true score" standard is known and not to set standards. Second, they assume a strict homogeneity of content within the tests. Given the nature of competency tests such as those used for high school graduation, the standard is not usually identified outside of the cutting score process itself (due to the considerations mentioned earlier in this paper) and test content is purposefully heterogeneous inithin a broad area. Two additional probabilistic models, the Macready and Dayton model (1977) and the Goodman model (described by Bergan, Cancelli and Luiten, 1980) based on the use of "latent classes" also have limited value to the current study due to strict assumptions of content homogeneity. Of the two general approaches described by Shepard, the test- content procedures include the Angoff (1971), Ebel (1979), Jaeger (1978) and Nedelsky'(1954) methods and the empirical methods include Zieky and Livingston's (1977) contrasting groups and borderline group methods and Berk's (1976) instructed/uninstructed group methods. Of the test-content methods, the Angoff method appears to be the most straightforward and the least problematic. Ebel's method involves the fairly complex task of asking judges to rate items on two dimensions lat the same time and as Shepard (1984) states, ”judges do not seem to be able to keep the two dimensions distinct" (p. 176). The Jaeger approach asks that each item be judged as to whether a high school graduate should be able to answer the item or not which as Shepard (1984) Indicates limits the probability for each item to 02 or 1002. 'The Nedelsky method also appears to have "serious" problems as indicated by Brennen and Lockwood (1980) and van der Linden (1982). It 21 makes the seemingly unwarranted assumption of random guessing and limits the possible probabilities that can be assigned to each item. Of the test-content methods, it would appear that the Angoff method would be the most appropriate to use. This contention is supported by Shepard (1984); Cross, Impara, Frary and Jaeger (1984) and Berk (1986) all of which found the Angoff method to be preferable over both the Nedelsky and Jaeger approaches. Cross et.al. (1984) further argued "in favor of providing normative feedback if the Angoff method is used" (p. 128). This is further supported by Fitzpatrick (1984) and Berk (1986) in support of informed judgment. Both offer specific support for the use of informed judgment when applying the Angoff method. In particular, they recommended that item difficulties be presented to judges after the initial rating is completed and that items be rated again by the judges with reference to the new information. They then recommended that failure rates based on the new cutting score be examined and the scores again adjusted as deemed appropriate by the judges. Models for Determining C linJC lax. The Angoff method corrected for normative data would be a potential means of effecting a compromise between absolute standards and the reality of examinee performance. While including reference to an absolute standard and the distribution of scores both in terms of actual item difficulties and cutting score failure rates (analogous to f(c)), the corrected Angoff method does not directly include the range of acceptable failure rates. A modification of the method, however, 22 may offer a viable means of setting the C min. and C max. values in Hofstee's model. One possible application of the Angoff method was described by Mills and Melican (1986) in their examination of the compromise models described in De Gruijter (1985). Mills and Melican used the Angoff method to set C min. and C max. After having a team of judges rate each item, "the highest cut resulting from the Angoff data collection was used as the maximum acceptable cut, the lowest Angoff cut was taken as the minimum acceptable cut" (p. 11). Their procedure appears to be unduly influenced by potential outliers. In other words, C min. and C max. appear to have been set by the lowest and highest rating judges. It would seem to be more plausable to ask the judges to rate items in a manner similar to the Angoff method but ask them to estimate the proportion of persons passing each item at two carefully specified levels of competency. Specifically, judges could be asked to rate the probability of passing an item based on some specified minimum level of competency (below which a person could not be considered to be competent) for a person just at that level of competency. Summed across items, this would represent C min.: a score "set low enough so that every student scoring below it should surely fail. . ." Judges could also be asked to rate the probability of passing an item based on some specified maximum level of competency (above which a person could not be considered to be incompetent) for a 'person at that level of competency. Summed across items, this would represent C max.: a score "high enough so that you would believe it signified mastery even if every student taking the test attained this score." For example, given a high school competency test the C min. could be estimated for an individual who has a "ninth" grade skill 23 level and C max. estimated for an individual who evidences a "twelfth" grade level of the same skills. While the modified Angoff minimum and maximum scores would not appear to need to be corrected, since the Hofstee method in effect accounts for normative data; it would be interesting to compare the final cutting score derived from the Hofstee method with the score derived from using the traditional corrected Angoff method. The effects of using the corrected method are by no means clear, however. Cross, Impara, Frary and Jaeger (1984) found differences between the Angoff and corrected Angoff methods while Mills and Barr (1983) did not. Subkoviak and Huff (1986), on the other hand, found that the corrected Angoff resulted in greater intrajudge consistency than the Angoff. Of the empirical methods, the Zieky and Livingston (1977) contrasting groups and borderline groups methods appear to be preferable to Berk's method. Berk (1986) himself states in his comparison of different methods that \of the empirical methods, the contrasting groups method receives the "highest rating." Berk's instructed/uninstructed method would seem to have a very narrow application since most outside of the classroom tests requiring cutting scores attempt to measure complex skills that span a number of learning settings precluding the identification of instructed vs. IMiinstructed groups. Both of the Zieky and Livingston methods (contrasting groups and borderline groups), however, appear to be worthy of investigation relevant to setting C max. and C min. CHAPTER III METHODOLOGY Setting The study was conducted at a middle size (student population approximately 20,000) urban school district in the midwest with a minority percentage of approximately 332. The district at the time of the study was in the process of implementing a high school graduation competency testing program. Five tests had been developed under the direction of the district Supervisor of Assessment and the Office of Curriculum Planning and Evaluation based on a set of standards for graduation adopted by the Board of Education. The competency testing program was to be implemented for the first time with 9th grade students in the 1985-86 school year. The standards consisted of sets of skills in the areas of Lanuage Arts and Math in which students needed to evidence mastery at the "ninth" grade level of competency in order to graduate from high school. The skills are reproduced in Figure 4 and Figure 5. 24 10. 11. 12. 13. 14. 15. I6. 17. 18. 19. 25 Identify the main idea. Follow a given set of written directions. Identify and obtain information from reference sources (dictionary, encyclopedia and text books. Demonstrate the ability to use a table of contents in locating information. Read and answer questions about articles from newspapers and magazines. Read and answer questions about information contained in charts, tables, graphs, and maps. Read and complete various forms (applications, order forms and simple tax forms). Demonstrate the ability to spell words correctly. Demonstrate the ability to write all types of sentences (command, declarative, exclamatory, etc.). Demonstrate the ability to write various types of letters (business, friendly, application, etc.). Demonstrate the ability to write a summary of a reading selection. Demonstrate the ability to complete accurately requested information on various forms (applications, checks, orders). Demonstrate the ability to use appropriate form in writing (paragraph, indentation, etc.). Demonstrate correct usage of punctuation in a written selection. Demonstrate proper grammar usage in a written selection. Demonstrate the ability to prepare a neat and legible paper. Demonstrate the ability to prepare a clearly understandable paper. Demonstrate the ability to express ideas and opinions in writing. Be able to prepare a resume for an entry level job (i.e., summer employment). LANGUAGE ARTS COMPETENCIES Figure 4 26 1. Be able to add, subtract, multiply and divide whole numbers. 2. Be able to add, subtract and multiply fractions. 3. Be able to add, subtract, multiply and divide decimals. 4. Be able to perform calculations using percents. 5. Be able to determine the appropriate mathematical operation and find the solution to application problems. 6. Be able to express measurements in a variety of units (metric and non-metric). 7. Be able to use standard measuring devices for length, weight, capacity and volume (metric and non-metric). 8. Be able to apply basic mathematical operations to monetary situations (make change, add prices, calculate wages, etc.). 9. Be able to apply mathematical operations in completing a short income tax form. 10. Be able to calculate an average. 11. Be able to maintain a checkbook given a set of records. 12. Be able to calculate the best price given the price per item and the number of units. MATH COMPETENCIES Figure 5 Procedures The four methods for setting C min./C max. values (Angoff, contrasting groups, borderline and direct estimation) were implemented in the school district and the resulting cutting scores were compared. In addition to comparing the different results with a standard setting student population, the resulting cutting scores were applied to a second population which consisted of the first group for whom passage of the tests was necessary for graduation. F min. and F 27 max. were both set at various fixed values (e.g., 102 and 752) across methods and at points for each test that were determined by the Superintendent of Schools in the school district where the study was completed. Different fixed values of F min. and F max. were used in order to examine the effect of varying values of F min. and F max. across methods. In addition, the traditional methods for setting cutting scores within each method were computed yielding a single cutting score for each test for the first three methods. The traditional methods included a corrected Angoff, Quadratic Discriminant Function with the contrasting groups procedure and the mean of the borderline group. The direct estimation, modified Angoff and corrected Angoff methods were implemented by two committees for Reading, Reference Skills, and Math Tests. The committees rated the items per the Angoff method with C min. being the probability of a student at the "ninth" grade level of minimal competency getting the item correct and C max. being the probability of a student at‘the "twelfth" grade level of competency getting the item correct. The "ninth" and "twelfth" grade levels of competency were defined as the "average" or "typical" level of performance that would be expected given a specific or defined skill domain. In other words, minimally and maximally competent were defined by "ninth" and "twelfth" grade levels of ability. Because the judges were all familiar with students in both "ninth" and "twelfth" grades, an attempt was made to center their judgments on the skill levels commonly expected at those grades. Based upon their experiences with students at both competency levels, the committee members were asked first to imagine the "average" or "typical" ninth 28 grade student's performance in an area or domain as defined by the specified skills. Item ratings were then completed for each item in order to assess the probabilities of that student passing each item. The procedure was repeated for a student at the twelfth grade level. As such, the definitions of "ninth" and "twelfth" grade levels were based on the committee members' conceptions of an "average" or "typical" student at those levels resulting from their cumulative experiences of interacting in the classroom with such students. The "ninth" grade level was selected as an estimate of C min. because the school district in which the study was completed had a policy for graduation which stipulated the "ninth" grade skill level in the areas tested as the minimum acceptable level needed for high school graduation. This is comparable to a grade equivalent which Is the medium at that grade level. The "twelfth" grade level was selected for two reasons. First, it represented the highest level or referent of "typical" performance at the high school level. Second, it represented a level of general performance that is often taken by society to be the ideal or perferred level of performance for high school graduates. The committees were then asked to directly set C min. and C max. This was done before their having received information about the total means of the individual ratings. Corrected Angoff scores were determined subsequent to the direct estimates. Details of this are provided later. As illustrated by Figure 6, the contrasting groups method has four possible outcomes. Figure 6(a) indicates the result where the two groups evidence moderate overlap. C min. could then be set at 29 some defined points in the master distribution (e.g., -ISD, -250, -3 SD) and C max. se¢.in the positive direction of the non-master distribution (e.g., +150, +250, +350). In addition, C min. and C max. could be set using the'means of the non-master and master groups, respectively. Figure 6(b) indicates no overlap in scores between groups where C min. is set in the positive direction of the non-master group and C max. in the negative direction of the master group. Figure 6(c) indicates a similar pattern to Figure 6(a) except that the greater overlap results in C min. being less than the mean of the non- master group and C max. greater than the mean of the master group. Figure 6(d) indicates almost complete overlap which essentially would preclude the setting of either C min. or C max. 30 Non-master Master 1 I - I I. X C C X N Min Max M (a) Non-master Master (1:) Non-master Master Min N M Max Non-master Master 0 i X X N) MASTER/NON-MASTER DISTRIBUTIONS FIGURE 6 31 Figure 7 illustrates the use of the borderline group method to set C min. and C max. C min. could be viewed as the mean or some point III the negative director of the distribution (e.g., -150, -250, -350) and C max. as the mean in some point in the positive direction of the distribution (e.g., +150, +250, +350). X C C Min Max BORDERLINE DISTRIBUTION FIGURE 7 32 The contrasting and borderline groups data were computed and various points in the distribution, as described above, were determined. Teachers were asked to rate their students on' the skills that wereImeasured on the tests that were used as being masters, non- masters or borderline students (too close to call) between the two. As described later, ratings of the standards setting group (10th, 11th and 12th graders) were used to identify the masters, non-masters and borderline students for use in setting the initial ranges. The ratings were completed and collected at the time of testing but before actual results of students' scores were known to the teachers who did the rating. This was done so the teachers' ratings would be done without direct influence of students' performance on the tests. The data from the committees and teacher ratings were then tabulated and the values of P1, and P2 and the intersections determined for the three approaches to include variations within approaches (based on differing C min. and C max. points in the distribution, etc.). The data from the standard- setting group was tabulated and the values applied to the ninth grade population taking the test for the first time in March of 1986. lnstruents and Data Collection The five tests designed to measure the competencies included tests of Math, Reading, Reference Skills, Writing and Forms Completion. The first three tests were validated and field tested with 887, 615, and 777 students, respectively, per test and the remaining two with approximately 500 students per test. Of the five 33' tests, only the Math, Reading and Reference Skills tests lent themselves to all three cutting score approaches and were the only tests used in the study. The Writing test was holistically scored and cutting scores directly set in reference to the rubrics and established marker papers. The Forms Completion test was scored on a three point scale for each item with a score of £3533 representing all information being accurate and complete on the form, a score of £1312 representing correct and sufficient information for the form to be processed and a score of 235 representing insufficient and/or incorrect information for processing. As of the time of the study, cutting scores had not been determined. Since the tests were to be administered to all ninth graders in the system, the field test consisted of only students in grades 10, 11 and 12. All students in those grades (excepting those students in one of four high schools who did not take the test) were rated by their math, English or reading teacher as to their level of competency relative to the respective established competency standards for graduation. Each math and English teacher was given a set of the competencies (Math or Reading/Reference Skills respectively) and a list of the students in their classes with the letters M, B, or N next to each name representing master, borderline and non-master status. The teachers were then asked, based on the skills list, to rate each student as to whether he or she was a master or non-master of the skills as a whole or on the borderline between. Rather than ask English teachers to separately rate reading and reference skills competencies on students to whom they directly taught neither (the predominance of course work 34 being in grammer, composition writing and literature) they rated students on a common language arts set. The specific skills from the total list of competencies for each of the three tests are as follows: Language Arts (Reading/Reference Skills) 1. Identify the main idea 2. Read and answer questions about articles from newspapers and magazines. 3. Identify and obtain information from reference sources (dictionary, encyclopedia and text books) 4. Demonstrate the ability to use a table of contents in locating information 5. Read and answer questions about information contained in charts, tables, graphs, and maps. Math 1. Be able to add, subtract, multiply and divide whole numbers 2. Be able to add, subtract and multiply fractions 3. Be able to add, subtract, multiply and divide decimals 4. Be able to perform calculations using percents 5. Be able to determine the appropriate mathematical operation and find the solution to application problems 6. Be able to express measurements in a variety of units (metric and non-metric) 7. Be able to use standard measuring devices for length, weight, capacity and*volume (metric and non-metric) 8. Be able to apply basic mathematical operations to monetary situations (make change, add prices, calculate wages, etc.) 9. Be able to calculate an average 10. Be able to calculate the best price, given the price per item and the number of units. Col-lttees Two separate committees were established to set the cutting scores consisting of eight (8) members for the Math committee and seven (7) members for the Language Arts committee. Members consisted of professionals in each of the subject areas tested. All committee 35 members had at least five (5) years of teaching experience in their respective areas at the secondary level and were selected on the recommendations of the district's Supervisors of Reading and Math. The Math committee set cutting scores for the Math test and the Language Arts committee set cutting scores for the Reading and Reference Skills Tests. Modified Angoff Method For the modified Angoff method, the committee members were asked to establish the probability for passing each item first by a minimally competent individual ("ninth" grade level) and second by an individual at a specified higher level ("twelfth" grade level) of competency. Committee members were asked regarding each item for each level to answer the question: "What is the probability of the typical student at the 'ninth' ('twelfth') grade level of competency, as you have imagined that person, getting the item correct?" In order to help clarify the task, the question was restated for each committee as: "What percent of typical students at the 'ninth' ('twelfth') grade level, as you have imagined them, would you expect to pass this item?" Both the ninth and twelfth grade ratings were done independently by each group member. Only the independent rating was used in this study because as Fitzpatrick (1984) strongly suggests, based upon a review of social- psychology literature, judges' opinions are affected by group interaction. She further states that the effect is not necessarily in the direction of group consensus but often is in the direction of 36' greater group polarization. Fitzpatrick goes on to state: "These findings suggest that, because of its normative effects, opinion exposure is a procedure that is not desirable to use in standard setting situations" (p. 16). Based upon this suggestion, the study only reported judges' ratings before group discussion. Zieky and Livingston Procedures For the Zieky and Livingston approaches, English and math teachers of 10th, 11th, and 12th grade students were asked to rate students (in their own classes) by indicating whether they were masters, non-masters or borderline (too close to call) relative to the stated list of competencies at the ninth grade level of competency. Ratings at the time of the field test were completed on 2729 (Math) and 3550 (Language Arts) 10th, 11th, and 12th grade students. The field test was administered to a representative sample of 10th, 11th and 12th grade students. The distribution‘of scores from the students designated to be masters were plotted with the distribution of scores from the students designated as non-masters. Depending upon the relationship between the distributions (see Figure 6), points in the distributions were selected to represent C min. and C max. In fact, different points were selected and reported separately. Points representing C min. were (given results like those illustrated in Figure 6(a) ) the mean of the non-master groups and the scores at -1, -2, and -3 standard deviation in the master group. C max. was represented by the mean of the master group and scores at +1, +2, and +3 standard deviation in the non-master group. The distribution of 37 scores from those students designated as being borderline were plotted with C min. being designated as being at the mean of the distribution and C max. at some point in the positive direction of the distribution (see Figure 7). As with the contrasting groups approach, different points in the distribution were selected for C max. and reported separately. Points representing C max. were +1, +2, and +3 standard deviations. The values of C min. and C max. for both contrasting and borderline groups methods were then plotted with F max., F min. and f(c) and a cutting score set for each set of C min. and C max. values. Direct Estimation For the direct estimation approach, the comittees were asked to directly set C min. and C max. values. The committee members were first asked to individually determine the two scores defined as: l. a score "low enough so that every student scoring below it should surely fail, even if every student scored below it"; and 2. a score "set high enough so that you would believe it signified mastery even if every student taking the test attained this score." The mean of each of the two scores was then calculated. It should be noted that these estimates were completed without knowledge of the passing scores set with the modified and corrected Angoff methods or the contrasting groups and borderline distribution. The estimated values were then plotted as above with F min., F max. and f(c) and the resulting cutting score established. 38 Determination of F min. and F max. The effects of variations in F min. and F max. were examined using two procedures for setting the estimates. The first procedure was to ask the Superintendent of Schools to identify three sets of F min./F max. for each of the three tests. The first set conformed to the following operational definition: "For the students who will be taking the test(s) for the first time in ninth grade, what are the maximum and minimum percents of students who should fail the test(s)?" He was then asked to estimate the overall passing rate at the end of twelfth grade for all students who begin high school in ninth grade. In other words: "What percent of students would you expect to graduate (eventually pass the test(s)) of all those who enter high school?" Once he made these estimates, he was asked to estimate the overall failure rate for the same students given remedial programs for students who originally failed the tests hirfinth grade. 1H5 estimates by tests are presented in Table i. TABLE 1 F MIN./F MAX. ESTIMATES BY THE SUPERINTENDENT Math Reading Reference Skills F min. F max. F min. F max. F min. F max. Ninth graders 152 492 152 352 153 ' 353 Overall 52 702 52 702 52 703 Overall with Remediation 02 102 02 102 02 102 39 The second procedure involved the arbitrary selection of values for F min. and F max. at various ranges in order to examine the influence of these values on the final cutting scores across the methods and across the three tests. Table 2 presents the arbitrary values that were examined: TABLE 2 ARBITRARY SELECTION OF F MlN./F MAX. F min. F max. 0 100 10 90 25 75 40 60 Traditional Procedures Within the Angoff, ContrastingiGroups and Borderline Groups In addition to rating the items based on identified grade levels, the committees employed the traditional Angoff procedure and rated the items based on their judgment of the probability of a minimally competent (as they perceive minimally competent for high school graduation) student getting each item correct without reference to a specific grade anchor. As part of their rating, they were presented with item difficulties for each item before performing the ratings of each item. The corrected ratings were then used to calculate cutting scores. The cutting scores were determined for the contrasting groups method using the Quadratic Discriminant Function (QDF) described by 4O Koffler (1980). For the borderline group method, the mean of the borderline group represented the traditional cutting score. These scores were then compared to the cutting scores yielded by using the various methods of setting C min. and C max. Determination of the Traditional Contrasting Groups Cutting Scores In order to identify the single most appropriate cutting score based on contrasting groups data for setting the traditional cutting score, Koffler (1980) describes procedures for estimating the cutting score which minimize the probability of misclassification when using the contrasting groups approach. Of those he describes, the Quadratic Discriminant Function (QDF) appears to be the most appropriate to use with the current study since it assumes normal distributions but unequal variances. The QDF takes the form of: — - —2 —2 2 X X X ~ X 5 z- .5; - 22- (5.154)-; g1- g3 . gm, .3. 3132 2 1 2 1 2 31 where: Z = the test score 'X1= mean of the master group §i= mean of the non-master group = variance of the master group = variance of the non-master group 41 The optimal cutting score is, then, the smallest test score such that the above equation is greater than log (2%); where q2 is the percent of students in the non-mastery group and ql is the percent of students in the mastery group.‘ Ana IEes The major considerations in the comparison of the approaches for setting cutting scores with the Hofstee compromise model are, first, do the approaches yield significantly different cutting scores and second, if different, is there a score that yields more reliable results relative to the classification of students into mastery or non-mastery groups? The first consideration assumes that there is some parameter by which comparisons can be made. A logical parameter would appear to be some estimate of the standard error. As Lord (1984) indicates, however, different cutting scores have different standard errors. The second consideration, in fact, makes use of this by examining the differing reliability of classification at the different cutting scores. These two considerations for analysis indicate the need for a means of determining the reliability of classifications into mastery and non-mastery states for a variety of cutting scores. Berk (1980a) discussed several approaches to determining the reliability of tests given specific cutting scores. These include the threshold loss function and the squared-error loss function. Berk defined the threshold loss function indices as the "measures of agreement between categorical data sets based on mastery-non-mastery 42 classifications." Essentially, this results in an index of agreement of which there are a number of specific approaches (Hambleton and Novick, 1973; Swaminathan, Hambleton and Algina, 1974; Subkoviak, 1980; and Huynh, 1976). The threshold loss function is contrasted with the squared-error loss function supported by Livingston (1972), Brennan (1980) and Brennan and Kane (1977). The difference between the threshold and squared-error loss function centers on the losses associated with decision errors. For the threshold loss approach, all losses are of equal seriousness where as the squared-error loss examines all decisions and losses of "students who are far above or below cutting score as more serious than losses related to other misclasslfied students" (p. 326). Since the ultimate interest of a cutting score used for certificate purposes is its ability to classify students as either masters or non-masters, the threshold loss function is clearly more useful for our purposes. In our case, classification errors are equally serious regardless of their size. Regardless of the specific approach used,‘the threshold loss function involves two agreement indices; Po, the proportion of individuals consistantly classified as masters and non-masters across (classically) parallel test forms and k, the proportion of individuals consistantly classified beyond that expected by chance (Berk, 1980a, p. 327). There are relative advantages and disadvantages with the use of these two indices. Po is preferred to k for shorter tests and for criterion-referenced tests with absolute cutting scores while k is appropriate for tests "where relative cutting scores are set according to the consequences of passing or failing a particular proportion of the students" (p. 333). Since the current study involves cutting 43 scores that result from both relative and absolute standards, both indices will be reported. There are two basic threshold loss function approaches for deternfiuiing PO and k: the one and two test administration approaches. IAs Berk states, the Hambleton and Novick (1973) or Swaminathon et al. (1974) two test methods are recommended. As in the current study, however, many instances do not offer parallel tests or two test administrations with which to determine estimates of agreement indices. A viable alternative using one test administration was developed by Huynh (1976). Though conservatively biased, his approach provides "relative precise" estimates of PO and k. One draw-back of Huynh's approach is that it tends to be computationally complex. Subkoviak (1980), however, developed a method based on nearly equivalent assumptions to Huynh that is computationally simpler. Peng and Subkoviak (1980) found this method to yield similar results to those estimated by the Huynh. The Subkoviak threshold loss procedure was used in this study to calculate estimates for P0 and k. ,1; ‘ianI - 2 (PX - szll 0 N where: 44 N - sample of examinees "x - number of persons at score x A Px - the probability that the examinee will correctly answer some specified number of items or more, where the proportion of items in the universe that a person with test score x would be expected to answer correctly (pt) 1s: A 9x ' 320 (8111) + (1 - 620) (uln) N - sample of examinees x - test score n . number of test items u - estimated means 020 - KR20 and k as: P - P o c k' I-P C where: ’c'1'2IHEP“-(”;’*)’] - the probability of consistent classification across the entire 85°09 In order to compare cutting scores, each with a unique error of measurement, the Standard Error of Measurement (SEM) was calculated for each cutting score. Because the purpose of the SEM was to compare cutting scores across the range of possible scores, a formula for computing SEM that contained the standard deviation of the total test score was used. As an approximation to the traditional notion of realibility, Po values were used to compute the Standard Error of Measurement as: sen - sx \rITPo' where: Sx - total test standard deviation. #5 van der Linden (1982a) discussed the problem of inconsistency in setting standards when rating items using the Angoff method. He attributes this inconsistency to two possible sources: 1. interjudge inconsistency due to different interpretations of learning objectives; and 2. intrajudge inconsistency (p. 296). These two sources could have influenced the ratings that were obtained. As such, it would appear to be useful to examine their potential effects. lnterjudge inconsistency was examined by determining the distribution of judges' scores relative to their mean. A large standard deviation would tend to indicate relatively greater inconsistency. van der Linden also described a method to examine intrajudge inconsistency which he further defined as: lntrajudge inconsistency arises when the judge specifies probabilities of success on the items which are not compatible with each other and, consequently, imply different standards. An example is an Angoff judge assigning a low probability of success for a borderline student on an easy item but a large probability on a difficult item. Obviously, these two judgments are inconsistent; the former implies a low standard whereas the latter indicates that a high standard should be set. Inconsistencies may be caused by items being perceived differently from the way they actually function (p. 296). His method is based on the assumption that judges, when using the Angoff method, translate "the performance of a borderline student who just meets the learning objectives into a cut off score on the 'true' score scales of the given test." The method compares the difference between the probability of the student at the cutting score 'ability' level getting the item correct and each judges' ratings for that item. v 46 This analysis indicates the "average absolute error" (E) and the estimate of consistency (C1). (8) e1 =- P1 - Pelee) where: P(+|O ) is derived from a latent trait model such as the Rasch equation 1 1+e - (O-b) Pi(s) 8 judges ratings for item 1 van der Linden, based on the above comparison, developed an index of consistency which then takes the form: where: a (S) E XIPi -P(+|Oc)I/n = average absolute error M = X ei(U)/n and (n) _ ei - Max {Pi’ l—Pi} 147 in the current study P (+Ie) was calculated using the Rasch model because of its availability at the time of study. Though van der Linden describes his method using the three-parameter logistic model, he states: "The choice is not essential, however, any other latent trait model could be used as well"(p. 298). It was felt that the data used in the study would be compatable with the Rasch assumptions of equal discrimination) no guessing and the fit to the model for the Rasch model to provide acceptable results. Values of E and C1, are presented in tabular form for each judge and then averaged across judges. Unfortunately, as Van Der Linden states, there are no recognized standards for evaluating C}, because the method has only recently been developed. CHAPTER IV RESULTS The results of the study are first presented by approach and then contrasted across approaches in terms of cutting scores yielded and relationships to varying F min. and F max. values by each method for each test. In addition, results will be presented separately for the field test and 9th grade samples within the discussion of approaches. The results by test will be presented within the discussion of each approach. Before the presentation of results by approach, it would be helpful to examine the results of the tests themselves and the distribution of scores in terms of a cutting score plot. Description of Scores and Distributions in Terms of- f (c) The results of each test are presented in Table 3 by field test and 9th grade groups. The table includes the mean score, standard deviation, KRZO and standard error for each test. 48 49 TABLE 3 RESULTS OF FIELD TEST AND 9TH GRADE GROUPS Field Test 9th Grade Reading N 615 1381 Mean 21.773 21.067 Standard Deviation 6.693 6.388 Reliability .868 .848 Standard Error 2.431 2.491 Math N 837 1395 Mean 31.982 31.578 Standard Deviation 10.309 10.046 Reliability .914 .905 Standard Error 3.023 3.096 Reference Skills N 717 1367 Mean 36.308 34.425 Standard Deviation 8.320 8.710 Reliability .889 .889 Standard Error 2.772 2.902 The mean distribution across grades (10th, 11th-and 12th) was used to report the field test groups‘results since, as Figures 8, 9 and 10 illustrate, the distributions across grades are similar on all three tests. There appears to be little, if any, advantage in reporting these distributions by grade in the examination of the three approaches”. Lt is interesting to note, however, that the figures do appear to evidence a trend in terms of the abilities of the three grades increasing slightly from 10th through 12th grades for each of the tests. The distribution of scores are plotted in terms of the percent of students who would fail the test at all possible cutting scores for the test per Hofstee's approach (i.e., cummulative percents 50‘ at )v-l). Mean attainment rates for each item are presented in Tables 1-A, 2-A, and 3-A which appear in Appendix A. oomsasmgaeassssaasasa§ 51 TDTAL 10TH 11TH 12TH T T r T I I 1 T rTTTr T 1 T I FIELD TEST fC DISTRIBUTIONS BY GRADE READING FIGURE 8 causesmeseaeassaasagaé 52 101 ll. , IDIH 11111 12111 FIELD TEST f BY DISTRIBUTIONS BY GRADE c Mm FIGURE 9 camsasmsasasmsaaasaeaé TOTAL w I I I I I l D I I I I I I I I /./,.. ’J/ If ’0 I / I’.. '., /”s / " ,1.- ”I Alllll IAIILLLLILLL llllLLLLLLLL FIELD TEST fc DISTRIBUTIONS BY GRADE REFERENCE SKILLS FIGURE 10 45 50 54 Figures 11, 12, and 13 present both the overall mean field test distriInItion and the 9th grade distribution. These figures represent the basic distributions for the groups that will be used to graphically'examine each of the three approaches and make comparisons between both the approaches and F min./F max. values. 2% 53 ER 2; ER 53 ER 23 ER 23 50 40 30 20 15: 102 55 FIELD TEST GRADE 9 FIDIREIUJE A COMPARISON OF FIELD TEST AND NINTH GRADE fc DISTRIBUTIONS READING FIGURE 11 H EH 53 EB 23 ER 23 56?- FIELD TEST GRADE 9 FADJREIUJE £3 £3 £3 £5 £3 29 ER 53 EB 23 H m O50 F l"l \ \ \ \ \ \ 01 """"V 'V' V I \ \ \ ILLJALLLLALLLILLLLIALLLL+LIALLALL O P p I' r 20 2530351045 so :3 L31 p... o 5..- (J1 CUNIMSSUFE A COMPARISON OF FIELD TEST AND NINTH GRADE fc DISTRIBUTIONS MATH FIGURE 12 FIELD TEST FAILURE RATE r'[ I V I t T 1" V V r I r Y 'U'U'U'UVIUUVV 'V'U' 'V'U'VU'U'U'V'I'V'U'I'U' asaaasassé U! C ' V I'VV‘ I I 'I'I'V'UIII'U p..- mBIB‘é’ 'I'I'IV'U'U'I'VI '1 T r r’V V 57 1 1 1 1 l L 1 1,1 l 1 1 1 11l1j_1 1 20 25 CUTTING SCORE r“r [ T r Tit I TIT I U 3O 35 A COMPARISON OF FIELD TEST AND NINTH GRADE fC REFERENCE SKILLS FIGURE 13 1 l 1 111,1 111111111 40 45 DISTRIBUTIONS 11111111111 111111111111 111111 50 58 Anggff Method For each test, Angoff ratings were completed for each item at the 9th and 12th grade levels of ability. The mean item rating was then calculated for each rater for each test at the two levels. The mean across raters was then used as C min. (9th grade level) and C max. (12th grade level). These ratings were done after they had access to the field test results for each item. The mean 9th and 12th grade ratings for each item are presented in Appendix B in Tables 4-3, 5-8, and 6-8. The mean rating, standard deviation and cutting scores across items is presented for each rater for each test in Tables 4, 5, and 6. Table 7 then summarizes this data across raters. The interjudge consistency appears to be sufficient across the three tests to warrant the use of the data in this study to set cutting scores based on the Hofstee model. TABLE 4 RATER MEANS ‘ READING 9th Grade 12th Grade Mean Cut Score Mean Cut Score Rater 1 .84 29 .95 32 Rater 2 .68 23 .77 25 Rater 3 .60 20 .87 30 Rater 4 .57 19 .77 26 Rater 5 .59 20 .72 24 Rater 6 .85 29 .91 ' 31 Rater 7 .91 31 .82 28 Mean .72 .83 St. Dev. .14 .08 Cut Score 24 28 59 TABLE 5 RATER MEANS MATH 9th Grade 12th Grade Mean Cut Score Mean Cut Score Rater 1 .88 45 Rater 2 .67 34 -77 39 Rater 3 .62 32 .93 47 Rater 4 .92 47 .87 44 Rater 5 .94 48 .97 49 Rater 6 .81 41 .83 42 Rater 7 .78 40 Rater 8 .74 38 Mean .78 .86 St. Dev. .13 .07 Cut Score 40 44 TABLE 6 RATER MEANS REFERENCE SKILLS 9th Grade 12th Grade Mean Cut Score Mean Cut Score Rater 1 .88 44 .95 48 Rater 2 .64 32 .77 39 Rater 3 .64 32 .86 43 Rater 4 .65 33 .76 38 Rater 5 .59 30 .72 36 Rater 6 .80 40 .89 45 Rater 7 .90 45 .82 41 Mean .73 .82 St. Dev. .13 .08 Cut Score 37 41 _60 TABLE 7 MEANS, STANDARD DEVIATION AND CUTTING SCORES ACROSS RATERS 9th Grade 12th Grade Reading Mean .72 .83 St. Dev. .14 .08 Cutting Scores 24 28 Math Mean .78 .86 Ste DeVe e13 e07 Cutting Scores 40 44 Reference Skills Mean .73 .82 St. Dev. .13 .03 Cutting Scores 37 41 In order to examine the intrajudge inconsistencies in making each of the ratings, the I'index of consistency" (van der Linden, 1982) was computed. The ninth grade sample was used to determine P (+Ie) values since it represents a larger number of students and was the target population for the test. Both E and 61 values are presented in Tables 8-13 for each judge for each test in addition to the mean values across judges. Table 14 then summarizes these data. Though there are no standards for interpretation, the indices of consistency appear to represent sufficient intrajudge agreement to warrant the further use of the data to set cutting scores based on the Angoff method within this study. 61 TABLE 8 AVERAGE ABSOLUTE ERROR AND ESTIMATE OF CONSISTENCY 9th GRADE ANGOFF RATING READING Rater E C I .16 .79 2 017 '77 3 17 -77 18 e23 069 5 .18 .76 6 .18 .76 7 .21 .72 MEAN .19 .75 TABLE 9 AVERAGE ABSOLUTE ERROR AND ESTIMATE OF CONSISTENCY 12th GRADE ANGOFF RATING READING Rater E C l .14 ‘ S .83 2 .13 .84 3 .10 .88 4 .15 .82 5 e111 083 6 .11 .87 7 .11 .87 MEAN .13 .35 62 TABLE 10 AVERAGE ABSOLUTE ERROR AND ESTIMATE OF CONSISTENCY 9th GRADE ANGOFF RATING MATH Rater E C 2 .09 .89 3 .14 .82 4 .11 .86 5 .16 .80 6 .18 .77 8 .15 .81 MEAN .14 .83 TABLE 11 AVERAGE ABSOLUTE ERROR AND ESTIMATE OF CONSISTENCY 12th GRADE ANGOFF RATING MATH Rater E C 1 .05 .94 2 .10 « »- .88 3 .08 .91 4 .07 .92 5 .12 .86 6 .07 .92 7 .10 .88 MEAN .08 .90 63 TABLE 12 AVERAGE ABSOLUTE ERROR AND ESTIMATE OF CONSISTENCY 9th GRADE ANGOFF RATING REFERENCE SKILLS Rater E C 1 .18 .76 2 .18 .76 3 .16 .79 4 .15 .80 5 .19 .75 6 .15 .80 7 .21 .72 "EA" 017 '77 TABLE 13 AVERAGE ABSOLUTE ERROR AND ESTIMATE OF CONSISTENCY 12th GRADE ANGOFF RATING REFERENCE SKILLS Rater E C 1 .16 .81 2 .12 .85 3 .10 .88 4 .12 .85 5 .13 .84 6 .11 .87 7 .11 .87 MEAN .12 .85 64 TABLE 14 AVERAGE ABSOLUTE ERROR AND ESTIMATE OF CONSISTENCY ACROSS JUDGES E CI Reading 9th Grade .19 .75 12th Grade .13 .85 Math 9th Grade .14 .83 12th Grade .08 .90 Reference Skills 9th Grade .17 .77 12th Grade .12 .85 Based on Table 7, Table 15 summarizes the C min. and C max. values for the three tests based on the Angoff method. TABLE 15 SUMMARY OF C MIN. AND C MAX. VALUES C min. C max. Reading 24 28 Math 40 44 Reference Skills 37 41 Using the F. min. and F max. values identified by the Superintendent for ninth graders (Reading: 152 and 352; Math: 152 and 492; Reference Skills: 15% and 35%), Figures 14, 15, and 16 represent estimates for P1 and P2, the range of acceptable cutting scores and 65 the resulting cutting score at the point of intersection for both the field test and ninth grade groups. 100 MIN £51 23 151 80 70 29 iii EB 13% 45 4O 30 20 10 '66 FIELD TEST FIILIIE RIIE VVVIVVII I'VVW" 5 10 15 20 CUTTING SCORE ANGOFF CUTTING SCORE PLOTS READING FIGURE 14- FAILUE RATE 67 FIELD TEST i .. smsaeagasaaasasas New 15 ’- U! I Pa. ANGOFF CUTTING SCORE PLOTS MATH FIGURE 15 FAILUIE RATE 68 FIELD TEST - GRADE 9 rrrr rrv I’ rrTr T V ff U T I V r I I’ r rFI—r I I '— l I ~ I I I r I I I rr firm? . P ’ 1 d D 1 D - D I D .. 1 b d P CI b 1 I 1 h i D 1 D d D I b 1 D m I 1 - - D d I d D 1 I 1 - - m d m I P 1 I d b d P 1 P '1 D I P 1 D - I 1 I d m d h d - d 'l' 17" "" 'l_1 LJlLkl—l L111 11 11111111111 111—11111111L111111111L1LL11111L111 5 10 Cum [em 15 20 5 30 fi 40 45 50 CUTTING SCORE ANGOFF CUTTING SCORE PLOTS REFERENCE SKILLS FIGURE 16 69 As Figures 14-16 illustrate, the line (P1, P2) does not intersect the distribution of either score group (fc) for the three tests. This presents a special problem with the Hofstee model. Aside from again implementing the ratings of items, there appear to be at least four possible solutions. The first soluthMIis to extend the line (P1, P2) beyond P1, until it intersects fc. This, however, would place the cutting score below C min. which was defined as the point below which a person would definitely be considered to be a non-master. it also places the failure rate above F max., there-by violating both of these assumptions. The second solution is to accept C min. as the cutting score. This, however, artificially raises the failure rate above the F max. originally set. The third solution is to draw a line from P1 to the vertical axis of the graph to the Intersection with fc which sets the cutting score much below C min. A fourth potential solution is to deliberately change F max. to some point above fc. Figures 17-19 illustrate these solutions and the corresponding cutting scores for the three tests. , 70 FIELD TEST GRADE 9 FAILLRE RATE .—O_I --——.—-‘—-—- ~.——-—_.¢o CUTTING SCORE SOLUTION TO THE PROBLEM OF NON-INTERSECTION W READING FIGURE 17 ITH fc 71 FIELD TEST GRADE 9 FAILIIE RATE '3 II E 0' 1a 111 [cm1c4‘ O 5 10 15 20 25 30 35 40 45 50 SOLUTIONS TO THE PROBLEM OF NON-INTERSECTION WITH fc MATH FIGURE 18 72 FIELD TEST GRADE 9 FAILIIE RATE .rr’rl'ri'rrtiTr'rrrrerrTIIUUUIrr'UIV—Y'rTIVYU'rI'F D " I A m I ‘ X em, *1 """“V‘I“" "' l \ \ \ \ #- H (.11 c: "1' \ \ CUTTING SCORE SOLUTIONS TO THE PROBLEM OF NON-INTERSECTION WITH fc REFERENCE SKILLS FIGURE 19 73 Table 16 presents the cutting scores for both samples (field test and 9th grade) for the scores resulting from the four above solutions to the problem of non-intersection with fc. Based on these results, the values for P0 and k are also presented in Table 16 for each test. TABLE 16 CUTTING SCORES, Po AND R RESULTING FROM THE ANGOFF METHOD BASED ON SOLUTIONS 1-4 Fieid Test ' 9th Grade C Po k C Po Reading 1. Extention of line 22 .81 .62 21 .82 .63 2. Accept C min. 24 .81 .60 24 .81 .60 3. P1 to vertical axis 20 .83 .64 18 .85 .63 4. Change F max. (75) 26 .82 .55 25 .82 .59 Math 1. Extention of line 38 .88 .70 38 .88 .70 2. Accept C min. 40 .90 .72 40 .90 .72 3. P1 to vertical axis 32 .86 .72 31 .86 .71 4. Change F max. (80) 41 .90 .71 41 .90 .71 Reference Skills 1. Extention of line 36 .85 .70 --35 .85 .70 2. Accept C min. 37 .84 .68 37 .84 .68 3. P1 to vertical axis 34 .86 .71 31 .88 .71 4. Change F max. (75) 39 .84 .65 38 .84 .67 Table 17 presents the results of the traditional Angoff ratings in terms of the cutting score for each test and the values Po and k. 74 TABLE 17 CUTTING SCORES, Po AND k RESULTING FROM THE TRADITIONAL ANGOFF METHOD Cutting Score Po k Reading 24 .81 .60 Math 44 .92 .65 Reference Skills 37 .84 .68 ContrestingGroups Method Based upon the lists of competencies needed for graduating, teachers were asked to identify which of those students who took the field tests they Considered to be masters, non-masters or on the borderline. In total, ratings were obtained on 594 students who took the Reading and Reference Skills field tests and on 432 students who took the Math field test. Table 18 presents the number of students in each category of mastery, the mean test score for each group and standard deviation of each test score for that group. In addition to the students who participated in the field tests, a much larger sample of students (N-2729 for Reading and Reference Skills and N-BSSO for Math) from the same grades and schools were rated. As presented in Table 19, a comparison between the field test group classification and the total group classification by the percent in each group for each test indicates that the field test sample is similar to the total students in those grades. 75 TABLE 18 COUNTS, MEAN SCORES AND STANDARD DEVIATION BY RATED CATEGORY OF MASTERY N Mean St. Dev. Reading Total 594 22.5 6.0 Mastery 392 24.4 5.2 Non-mastery 67 17.8 5.7 Borderline 135 19.4 6.0 Math Total 432 32.3 9.4 Mastery 262 36.1 8.3 Non-mastery 47 21.6 6.2 Borderline 123 28.4 7.9 Reference Skills Total 552 36.2 8.6 Mastery 348 38.2 7.4 Non-mastery 29 29.4 8.5 Borderline 175 32.4 9.9 TABLE 19 PERCENT OF STUDENTS BY CATEGORY BY TEST TEST SAMPLE VS. TOTAL SAMPLE S Field Test Total Reading Mastery .66 .68 Non-Mastery .11 .09 Borderline .23 .18 Math Mastery .61 .58 Non-Mastery .11 .15 Borderline .28 .25 Reference Skills ‘ Mastery .63 .68 Non-Mastery .05 .09 Borderline .32 .18 76 For the field test sample the mastery and non-mastemy groups' distributions were then plotted: test score by percent. Figures 20-22 represent the distributions for each group by each test. om m¢:u_m m2 _ 93¢ mzo_._.:m.¢._.m_o >uzuhmmzou mmoum hmmh wzHDmwhm<2zoz >mwhmuzmhmmzou mmoom hmwh Ihmmhmmmhmuzuhmn—zou mmmouw I—mmh mJJHv—m wuzmmmmmm on mv ow mm om mm om mu 2 m sup-4t Ibiza: Js‘ 111‘ I- I ’JJII‘II‘I‘III‘I‘I . Ii 111‘ I‘ .HDII‘ ‘I‘I- ‘ J I I I- O‘s‘sd IIOlq .VJIIIII‘I‘I ‘41I‘l‘d‘ I II J l I [I O I 0 \\ s ’1’, s e s s s \‘\ e a ” e “\ 1 O I o \‘ I e . I . I; . \K 1.. II . ~\ I s s ‘ I s . s s” \ 1 e s i “\ s e. i o s O ~‘ . . . e 1 0e 00 I'll,” \\ e I \ . . /\ wozmaommm 3 cm on ow on cm cm on cm 2:. 80 Using the master and non-master distributions to set C min. and C max. at defined points in the distributions, produced the array of values shown in Table 20. TABLE 20 CUTTING SCORES SET AT VARIOUS POINTS IN THE HASTER/NON-HASTER DISTRIBUTIONS Means lilSD 1250 $350 Reading C min. 18 20 1h 9 C max. 25 24 30 34 Math C min. 22 28 20 ll C max. 37 28 3“ hi Reference Skills C min. 30 31 2h 16 C max. 39 33 “7 50 Figures 23-25 illustrate the plots of these pairs by test. 100 FAILURE RATE 81 FIELD TEST GRADE 9 I'm. T'V'IVVI U'Il' 1 A g L L 1_‘l 1 n l A m A n l n 1 EM C”"&W m (Em C C'15 ; 20 “2% 30 CUTTING SCORE CONTRASTING GROUPS CUTTING SCORE PLOTS READING FIGURE 23 I i 533 82 FIELD TEST GRADE 9 mum am: 100 : D l h b b D I CONTRASTING GROUPS CUTTING SCORE PLOTS MATH FIGURE 24 83 FIELD TEST GRADE 9 FAILIIE HATE 100.Virr‘f"TrriifirY'rrlY'rYIrrt'ITTY' p 95 90 85 80 75 70 I‘VV "" "Y' "' I l I I ‘1' 'U 55 60 55'. 50 45 i 40 IL L==35~E 30 25 20 1o 5 D ’ 0’1111LLLsLA 4......4 VVI‘I" I I '1‘ 'vmrrv ‘I'V'U cm (35m c({ came cum-lbw "t" o 5 1o 15 20 25 ao’as‘ao 4550 ummsscom CONTRASTING GROUPS CUTTING SCORE PLOTS REFERENCE SKILLS FIGURE 25 8h Table 2l presents the resulting cutting scores Po's and k's resulting from Figures 23-25. There is not a score reported for the ninth grade Reading test at11lSD since the line (P1, P2) did not cross the ninth grade distribution (fCI- TABLE 21 CUTTING SCORES, Po AND k RESULTING FROM THE CONTRASTING GROUPS METHOD Means .t 1 SD :1: 2 SD :h 3 SD Reading Field Test 20 .83 .64 2l .82 .63 19 .84 .64 19 .8h .6h 9th Grade 18 .85 .63 - - - 18 .85 .63 17 .86 .61 Math Field Test 28 .87 .72 28 .87 .72 27 .87 .71 27 .87 .71 9th Grade 27 .87 .7l 28 .87 .72 26 .88 .70 26 .88 .70 Reference Skills Field Test 32 .87 .71 33 .87 .72 32 .87 .71 32 .87 .71 9th Grade 31 .88 .71 31 .88 .71 30 .88 .70 29 .89 .71 Table 22 presents the optimal cutting score for each of the tests based on the traditional procedure. 85 TABLE 22 OPTIMAL CUTTING SCORES FOR EACH TEST BASED ON THE QDF Cutting Score Reading 18 Math 26 Reference Skills 27 Borderline Groups Method Borderline students, as described above, were identified in each skill area with the following distribution of students and mean scores for each category (Table 23). TABLE 23 DISTRIBUTION OF STUDENTS AND MEAN SCORES BORDERLINE STUDENTS N Mean St. Dev. Reading 135 19.h 6.0 Math 123 28.4 7.9 Reference Skills 175 32.h 9.9 The percent distributions for each test of the borderline groups are illustrated in Figures 26-28. 86 mm maze: oz.oUszCm—g cu om om ow on on ON. 00 cm can 87 an RN wane: 1.2: macaw wz _ damn—mom mmoum pmm _. I :1 9. mm on x mm om .fl 2 m ' JJJIlq-Ijlnfllldu.lqu 1.1!..1 .1JJ..4I1J 1‘ I‘- Ji +14 1 d 4 d dII‘ 1q ‘14 SJ fiJ‘ q 4 d T d 11 VUZMDOWE— 0« cm cm 0.» on om on 00 om co“ mm mzao.u mag—gm uuzmzmmmm mzoxu mz_4¢mo¢om mmmoum hmmp mAAme wozmmmmwm mv av mm on mm cu m« a“ m o X 88 I ddjflqddu1111qldw$1q11131111411111-1.4 “#114141: O 1 cm 1 cm 1 av Jon L on 1 on 1 on L on FIbbiPPbPPbe-bPDPPPPPPPPPPFPLP-FPPPPLPbebPLPLbIP 0°“ Szmaommm 89 C min. and C max. values, based on the mean of the borderline group representing C min. and points at +1, +2 and +3 standard deviations representing C max., respectively, are presented in Table 24 for each test. TABLE 28 C MlN./C MAX. VALUES OF BORDERLINE GROUP C min. C max. mean +lSD +250 +350 Reading 20 26 32 34 Math 29 37 #5 51 Reference Skills 33 #3 50 50 This results in three pairs of C min./C max. values for each test. Figures 29-31 illustrate the plots for these values. 100 E?! 23 iii 23 I53 70 60 45 40 'Y '7‘ 302 20 10 FAILURE RATE 90 FIELD TEST GRADE 9 VI '1“ VI'U'I '“V'V'IY'V'V‘ V'I“ 'V‘I'VI' l l l A l l L L L l L V . 1 . . L . . I new [cm cmlcm 1o 15 2T 25““ 30 W ('J CUTTING SCORE BORDERLINE GROUP CUTTING SCORE PLOTS READING FIGURE 29 s.- 81881331888238 3 23 $913 as 5119 FAILUE RATE ,91 FIELD TEST GRADE 9 V'VV'IW‘IUVVV'V‘ 'V'VYUI‘VIUI 'IY‘U'U'V BORDERLINE GROUP CUTTING SCORE PLOTS MATH FIGURE 30 V 92 FIELD TEST GRADE 9 FAILIK RATE 100 , 95E 90:- as:— NI} 50E- 50:- «5% «IE 30E 20E- 1o . o’LLLLllkL‘ BORDERLINE GROUP CUTTING SCORE PLOTS REFERENCE SKILLS FIGURE 31 93 As the previous figures illustrate, the P1, P2 lines on Figures 29 and 31 do not intersect the ninth grade distribution lines, fc- In an effort to not repeat the discussion relative to the problem presented earlier, the line has been extended as a compromise between maintaining F max. and C min. values. Table 25 presents the cutting scores, based on Figures 31-33. TABLE 25 CUTTING SCOREs,P° AND k RESULTING FROM THE BORDERLINE GROUP METHOD +150 +250 +350 C Po k C PD k C Po k Reading Field Test 21 .82 .63 21 .82 .63 21 .82 .63 9th Grade* 19 .8“ .6h 19 .8h .6h 19 .8h .64 Math Field Test 31 .86 .71 32 .86 .72 32 .86 .72 9th Grade 30 .87 .73 31 .86 .71 31 .86 .71 Reference Skills Field Test 34 .86 .71 3h .86 .71 3h .86 .71 9th Grade* 32 .87 .71 32 .87 .71 32 .87 .71 *Compromise values A standard means of determining a cutting score based on the borderline method is to use the mean of the distribution. Table 26 presents the means and corresponding Po's and k's. 9h TABLE 26 CUTTING SCORES, Po AND R BASED ON THE MEAN THE BORDERLINE GROUP METHOD c Po k Reading 20 .83 .6h Hath 29 .87 073 Reference Skills 33 .87 .72 Direct Estimation Each of the committees was asked to directly set C min. and C max. values for each of the tests. At the time that this rating was completed, three members of the Language Amts committee were unavailable so the ratings for Reading and Reference Skills represents the imput of only four committee members. Tables 27-29 present the direct ratings for each rater and the mean ratings for each test. TABLE 27 C MIN. AND C MAX. SCORES READING C MIN. C MAX. 22 27 14 26 12 25 10 25 Mean 15 25 95 TABLE 28 c MIN. AND G MAX. SCORES MATH c MIN. c MAX. 36 #2 28 38 26 AT AT 48 A0 45 38 “5 AD 46 Mean 35 AA TABLE 29 C MIN. AND C MAX. SCORES REFERENCE SKILLS C MIN. C MAX. 3h 40 20 ~ 35 I5 37 20 112 Mean 22 38 Figures 32-3h illustrate the plots of these values with fc. 96 FIELD TEST : GRADE 9 FAILIEE RATE um I'V‘“I"" 812% 8353 EB 2323 EB 8328 ER 40;- 30:- 20:- zszfiLE I 10? : DIRECT ESTIMATION CUTTING SCORE PLOTS READING FIGURE 32 97 FIELD TEST GRADE 9 FAILURE RATE um U V '"f 80 “" 71:- D b p 60- p 1- 4o:- 30:- 20 W15 ORTDGEWUE DIRECT ESTIMATION CUTTING SCORE PLOTS MATH FIGURE 33 100 95 90 85 80 75 70 65 60 55 50 «ii 40 ““”L€EFn , 30 25 20 £315.15. 10 5 0 0 98 'EIELD TEST GRADE 9 FAILURE RATE rrlTIrVVTIrTTrITYr'IF—rr'IrtrfrtriirrtrtrUUUr'IL I'IVW‘VVVV'U" \ 'YTY \ 'V'rv'“" 'I'V'Y'I'VY‘II'VV'VVVVIUVVUI"I' \ \ ‘ I U'YYUVI‘ II I'V‘YT'V‘IV'VU \ V'Y'V‘ ' D \ DIRECT ESTIMATION CUTTING SCORE PLOTS REFERENCE SKILLS FIGURE 34' 99 Table 30 presents the cutting scores, Po's and k's for each test. The line P1, P2 failed to intersect the line fc on the plot for the Math test. As a compromise between maintaining the F max. value or the C min. value, the line has been extended to intersect fc. TABLE 30 CUTTING SCORES, Po AND R RESULTING FROM THE DIRECT ESTIMATION METHOD C Po k Reading Field Test 19 .84 .6A 9th Grade 17 .86 .61 Math Field Test* 33 .86 .72 9th Grade* 33 .86 .72 Reference Skills Field Test 30 .88 .70 9th Grade 28 .89 .68 *Compromise value ~ I Influence of values for F sin. and F max. on the Cutting Score There appeared to be five ways that F min. and F max. values could influence the cutting scores yielded by the Hofstee method: 1. slope of the line (P1, P2) as determined by the difference in F min./F max. values; 2. relationship of the line (P1, P2) to fc; 3. the range of acceptable cutting scores as determined by the difference in C min./C max. values; A. the magnitude of F min.; and 100 5. the magnitude of F max. First, F min./F max. values effect the slope of the line (P1, P2); greater differences between F min./F max. values yield a line with more slope and smaller values less slope. Changes in slope influence the point of intersection between (P1, P2) and fc. Second, for (P1: P2) ranges that fail to intersect fc, as was seen with the Angoff methods, greater changes in F min./F max. may be necessary to effect any change in the cutting score. Third, greater length of (P1, P2) due to larger C min./C max. differences would appear to yield greater changes relative to the intersection with fc than less length with common changes to F min./F max. Fourth and fifth, the absolute values of F min. and F max. would appear to influence changes, particularly in cases where the slope was not necessarily changed but the line (P1, P2) moved up or down relative to the failure rate scale. In order to examine these assumptions, seven pairs of values were plotted for the three tests for both the modified Angoff and contrasting group methods of determining C min./C max. Since the influence of F min./F max. change does not appear to be dependent on the type of method, but instead on the relative values for F min./F max. vs. C min./C max., it was felt that using the two methods would be sufficient to examine the five assumptions. In addition, the ‘two methods appeared to provide sufficient differences relative to C min. C max. values to warrant their use. Figures 35-40 illustrate the plotted results of the seven pairs of values for the two methods across the three tests. 101 Following these figures, Tables 31 and 32 present the resulting cutting scores yielded by the seven F min./F max. pairs. 100 8818813818888 45 40 30 20 15 10 FAILURE RATE FIELD TEST 102 GRADE 9 UV'II“'VI'V'U'T"UWUIUUVTIVVWIUUVV 'V'I' ANGOFF COMPARISONS ACROSS VARIOUS F MlN./F MAX. VALUES READING FIGURE 35 1 103 FIELD TEST GRADE 9 FAILIBE RATE 1 \\ I \\ f "i E' \\ \ ' \. "I \ x / // c: (:1 ES in 25 £31 23 151 is £35 29 151 53 iii 23 £31 23 151 83 251 53 "' I E , ‘\ I I \ " I : f \ : I '- \ I g Q D ’ P F \, \- . L t I i I '- I a ,t \ ' I r I \\ I - I I \ I P I 'I I I - ,I \ I ‘\ I n ’I I 0“ ” ELALll‘ LLAJILLILILLLLILLILLLALAIIILLI+IL LLLL cuwI cm 5 10 15 20 25 3O 35 40 45 CUTTING SCORE ANGOFF COMPARISONS ACROSS VARIOUS F MIN./F MAX. VALUES MATH FIGURE 36 100 888 OO 70 81881 50 45 40 {388 20 15 10 10h FIELD TEST GRADE 9 FAILIE RATE :rU—FTFTU Irrrv TTIU I rrl’i' 1" I II 'VlTrUVFV— rt l—TrrTIT’F G I _. I : I i- I L" I I: I '- I . I : I .- I : I :~ I h I I I L' I : I T. I 5 1 E z i L I 1 I \ I 9.1% 1:- \ I, E = A :' 1:. :- \ V. s I I : I b = \,< p I L'. I P ’ e I I . I E I, ‘ \ i I] \\ I . I I l’ 9. :- '9 E LLILL‘L __’"L ALELA LL 111 LIL-l llLLllLli [CHIN [cm 5 10 15 20 25 30 35 40 45 50 CUTTING SCORE ANGOFF COMPARISONS ACROSS VARIOUS F MIN./F MAX. VALUES REFERENCE SKILLS FIGURE 37 105 FIELD TEST GRADE 9 FAILURE RATE 100 ,, CONTRASTING GROUPS COMPARISON ACROSS VARIOUS F MlN./F MAX. VALUES READING FIGURE 38 106 FIELD TEST GRADE 9 FAILURE RATE 100 I'V'V‘V‘U'VVV'T‘V' vvv \ ["1 "I f S \ {SI 23 I5! is £55 29 til 23 iii 23 E31 23 i§lifig £55 ”,0 0-!- n.) (J1 O VIIIIVIUII 10E- CH“ 35 40 45 50 CUTTING SCORE CONTRASTING GROUPS COMPARISONS ACROSS VARIOUS F MlN./F MAX. VALUES MATH FIGURE 39 107 FIELD TEST GRADE 9 FAILUIE RATE 00 rrrf rt I r 1‘! r r r r 1" v If! r rt I A V‘Frv 1'! I'F fir! f I _ r r I r r r [ rt: 95: P D D ’ .4 b d 90 ' - F 1 F q ’ ‘ ' d 35 " ~ . ‘ D ‘ . ‘ b ‘ 80 " - b q . ‘ . ‘ - q 75 " - . . h - . ‘ . q 70 ' - . ‘ E 'l E «4 55 ' ‘ - - . ‘ ' ‘ p . 60 '- 4 b 1 : . 55 ’ 1 - - . . . ‘ I ‘ . - 50 ’ - D ‘ F q - ‘ . ‘ 45 .- 1 . ‘ p ‘ ' q 40 - - p ‘ - 1 ' 4 . ‘ 35 ' «- ' q . d D ’ ‘ 30 '- ' - ’ I C I I I 25 = , a . p I v, I I _‘ ' I : \( \ : b I ‘ 15 E- I’ \ .5 p I 1 10 L ’ \ -= I I’ " P I °\ 5:. ’ -. r- ’—’a’ ‘ 0': ..._ 1 1 1 L1 11 1 1 [LLLLI 11 1 1 L1 1 l1 1 1 111 1 1 1 FM»: [pm 0’5101520253035404550 wmnsscone CONTRASTING GROUPS COMPARISONS ACROSS VARIOUS F MlN./F MAX. VALUES REFERENCE SKILLS FIGURE 40 108 TABLE 31 CUTTING SCORES GIVEN VARIOUS F MlN./F MAX. VALUES ANGOFF METHOD FOR SETTING C MIN./C MAX. Field Test 9th Grade Reading (C min./C max. - 24/28) 0 - ID 18* 16* 0 - 100 25 25 5 - 7U 25 24 IO - 90 25 25 IS ' 35 22* 21* 25 - 40 25 2h ho - 6o 25 2h Math (C min./C max. = RO/hh) o - 10 28* 27* o - 100 41 A] 5 - 70 40* h0* 10 - 90 AI 41 Is - us 38* 38* 25 - ha 36* 36* 40 - 60 38* 38* Reference SkiIIs (C min./C max. - 37/41) 10 31* . 29* 0.. 0 - 100 39 39 5 - 70 38 38 10 - 90 39 39 15 - 35 36 35* 25 - no 36* 35* ho - 60 39 37 *Based on extention of (P1, P2) line 109 TABLE 32 CUTTING SCORES GIVEN VARIOUS F MIN./F MAX. VALUES CONTRASTING GROUPS METHOD FOR SETTING C MlN./C MAX. Field Test 9th Grade Reading (C min./C max. - 18/25) 0 - 10 15* IA* 0 - 100 22 22 5 - 70 22 21 IO - 90 22 22 15 - 35 20 18 25 - 40 21 19 #0 - 60 23 22 Math (C min./C max. - 22/37) 0 - 10 20* 18* 0 - 100 31 30 5 - 70 29 29 10 - 90 31 3o 15 - A9 28 27 25 - no 28 27 no - 60 32 31 Reference Skills (C min./C max. - 30/39) 10 26* ' 24* 0- o - 100 36 35 S - 70 35 34 10 - 90 36 35 15 - 35 32 31 25 - A0 34 32 ha - 60 37 36 *Based on extention of (P1: P2) ITO In those cases where the line (P1, P2) did not intersect fc and the lines were extended, changes in F values appeared to result in considerably lower cutting scores particularly at lower F values. Therefore, subsequent examinations of the influence of F values will not include values based on the 0-10 F min./F max. values. After ignoring F min. - D and F max. - 10, the greatest difference between slopes is found with the slope of line (F min. - 0 and F max. - 100) vs. line (F min. - 25 and F max. - #0). Table 33 illustrates the cutting scores for the Angoff and contrasting groups methods for the field test group. All three of the tests present special problems using the Angoff approach because the line (P1, P2) did not intersect fc. Consequently, they have not been included in Table 33. TABLE 33 CUTTING SCORES LINE 0/100 VS. LINE 25/40 FIELD TEST Angoff Contrasting Group Reading 0/100 - 22 25/50 - 21 Difference - 1 Math . 0/100 - 31 25/40 - 28 Difference - 3 Reference Skills 0/100 - 36 25/A0 - 3A Difference - 2 111 .An examination of Tables 31 and 32 for the field test group shows the following differences across methods for the range of scores excluding any values derived from extending lines. These differences are presented in Table 3A. TABLE 34 DIFFERENCES 0F CUTTING SCORES FIELD TEST Angoff Contrasting Group Reading (C min./C max. - 25/28) (C min./C max. - 18/25) high 25 23 low 25 20 difference 0 3 Math (C min./C max. - 50/54) (C min./C max. - 22/37) high 51* 32 low 41 28 difference 0 A Reference Skills (C min./C max. - 37/41) (C min./C max. - 30/39) high 39 37 low 36 32 difference 3 5 *Caution should be taken in the interpretation of the Angoff Math scores as all but two values were based on extended scores and they are both #1. 11.2 Comparison of Hethods 'Table 35 presents the cutting scores, Po, k, SEM based on Po and the corresponding failure‘rates for all four methods for each test. As the table illustrates, there are a large number of cutting scores for each test. Scores for Reading, for example, range from a low of 19 to a high of 26 for the field test group, and a low of 17 to a high 10f 25 for the ninth grade group. This represents a difference in the failure rate between .28 and .66 for the field test group and between .28 and .68 for the ninth grade group. Clearly there is a substantial difference in terms of the percent of students who would pass the tests given the variety of cutting scores presented in Table 35. It should be noted that these cutting scores are all based on fixed values for F min. and F max. (Reading and Reference Skills: F min. - 15, F max. - 35; Math: F min. - 15, F max. - 59). CUTTING SCORES, Po, k, SEM AND FAILURE RATES ACROSS METHODS 113 TABLE 35 Field Test 9th Grade C Po k SEM Rate C Po k SEM Rate Angoff Reading Extention of line 22 .81 .62 2.92 .50 21 .82 .63 2.71 .57 Accept C min. 25 .81 .60 2.92 .53 25 .81 .60 2.79 .63 P1 to vertical axis 20 .83 .65 2.76 .31 18 .85 .63 2.58 .35 Change F max. (75) 26 .82 .55 2.85 .66 25 .82 .59 2.71 .68 Math Extention of line 38 .88 .70 3.57 .73 38 .88 .70 3.58 .72 Accept C min. 50 .90 .72 3.26 .78 50 .90 .72 3.18 .77 P1 to vertical axis 32 .86 .72 3.86 .59 31 .86 .71 3.76 .58 Change F max. (80) 51 .90 .71 3.26 .80 51 .90 .71 3.18 .80 Reference Extention of line 36 .85 .70 3.22 .55 35 .85 .70 3.37 .57 Accept C min. 37 .85 .68 3.23 .58 37 .85 .68 3.58 .56 P1 to vertical axis 35 .86 .71 3.11 .35 31 .88 .71 3.02 .35 Change F max. (75) 39 .85 .65 3.23 .59 38 .85 .67 3.58 .61 Contrasting Groups Reading Means 20 .83 .65 2.76 .31 18 .85 .63 2.58 .35 13150 21 .82 .63 2.85 .35 19 .85 .65 2.56 .38 1250 19 .85 .65 2.78 .28 18 .85 .63 2.58 .35 1350 19 .85 .65 2.78 .28 17 .86 .61 2.50 .28 Math 28 .87 .72 3.72 .35 27 .87 .71 3.62 .36 Means 28 .87 .72 3.72 .35 28 .87 .72 3.62 .39 $150 27 .87 .71 3.72 .32 26 .88 .70 3.58 .31 £250 27 .87 .71 3.72 .32 26 .88 .70 3.58 .31 2350 . Reference Skills Means 32 .87 .71 3.00 .28 31 .88 .71 3.02 .35 1150 33 .87 .72 3.00 .28 31 .88 .71 3.02 .35 1250 32 .87 .71 3.00 .28 30 .88 .70 3.02 .31 1350 32 .87 .71 3.00 .28 29 .89 .71 2.89 .29 115 TABLE 35 continued CUTTING SCORES, Po, k, SEM AND FAILURE RATES ACROSS METHODS Field Test 9th Grade C Po k SEM Rate C Po k SEM Rate Borderline Group Reading +150 21 .82 .63 2.85 .35 19 .85 .65 2.56 .38 +250 21 .82 .63 2.85 .35 19 .85 .65 2.56 .38 +350 21 .82 .63 2.85 .35 19 .85 .65 2.56 .38 Math +150 31 .86 .71 3.86 .55 30 .87 .73 3.62 .53 +250 32 .86 .72 3.86 .59 31 .86 .71 3.76 .58 +350 32 .86 .72 3.86 .59 31 .86 .71 3.76 .58 Reference Skills +150 35 .86 .71 3.11 .35 32 .87 .71 3.15 .37 +250 35 .86 .71 3.11 .35 32 .87 .71 3.15 .37 +350 35 .86 .71 3.11 .35 32 .87 .71 3.15 .37 Direct Estimation Reading 19 .85 .65 2.78 C28 17 7.86 .61 2.39 .28 Math 33 .86 .72 3.86 .55 33 .86 .72 3.76 .55 Reference Skills 30 .88 .70 2.88 .23 28 .89 .68 2.89 .26 In order to more efficiently examine the relative differences between methods, one procedure for each method was selected. The cutting scores, Po, k, SEM and corresponding failure rates for each of these procedures are presented in Table 36. Within the Angoff method, the extention of the line (P1, P2) was used because it represented a compromise between the extremes of accepting C min. (thereby ignoring F max.) and extending P1 to the vertical axis (thereby ignoring _C 115 min.). The score based on the means was used for the contrasting groups method as it again appeared to represent a compromise falling towards the middle of scores yielded by the contrasting group procedures. The +250 score was used with the borderline group since it represented the vast majority of scores yielded by this method (only the +150 Math score differed by one score point). Only one procedure was represented by the direct method and as such was used in the analysis. 116 TABLE 36 CUTTING SCORES, Po, k, SEM AND FAILURE RATES BASED ON SELECTED PROCEDURES Angoff Contrasting Borderline Direct Groups Group Estimation (Extention of line) (Means) (+250) Field Test Group Reading C ' 22 20 21 19 P0 .81 .83 .82 .85 k .62 .65 .63 .65 SEM 2.92 2.76 2.85 2.78 Failure Rate .50 .31 .35 .28 Math C 38 28 32 33 po .88 .87 .86 .86 k .70 .72 .71 .72 SEM 3.57 3.72 3.86 3.86 Failure Rate .73 .35 .59 .55 Reference SkiIls c 36 32 35 30 po .85 .87 .86 .88 k .70 .71 .71 .70 SEM 3.22 3.00 3.11 2.88 Failure Rate .55 .28 .35 .23 117 TABLE 36 continued CUTTING SCORES, Po, k, SEM AND FAILURE RATES BASED ON SELECTED PROCEDURES Angoff Contrasting Borderline Direct Groups Group Estimation (Extention of line) (Means) (+250) Ninth Grade Group Reading C 21 18 19 17 P0 .82 .85 .85 .86 k .63 .63 .65 .61 SEM 2.71 2.58 2.56 2.31 Failure Rate .57 .35 .38 .28 Math C 38 27 31 33 po .88 .87 .86 .86 k .70 .71 .71 .72 SEM 3.58 3.62 3.76 3.76 Failure Rate .72 .36 .58 .55 Reference Skills C 35 31 32 28 P0 .85 .88 .87 .89 k .70 .71 1 .71 .68 SEM 3.37 3.02 3.15 2.89 Failure Rate .57 .35 .37 .26 118 In addition to cutting scores derived from the Hofstee model, Table 37 presents the score for each method based upon the traditional method for estimation: corrected rating for the Angoff, Quadratic Discriminant Function for the contrasting groups and the means of the distrllMJtions for the borderline group method. In addition, Table 37 presents Po, k, SEM and the failure rate for each test. The SEM's and failure rates are based on both groups: field test/9th grade. TABLE 37 CUTTING SCORES, Po, k, SEM AND CORRESPONDING FAILURE RATES BASED ON TRADITIONAL PROCEDURES Angoff Contrasting Borderline Groups Groups Reading c 25 18 20 P0 .81 .85 .83 k .60 .63 .65 SEM 2.92/2.79 2.59/2.58 2.76/2.65 Failure Rate .53/.63 .25/.35 .31/.52 Math ‘ C 55 26 29 P0 .92 .88 .87 k .65 .70 .73 SEM 2.92/2.85 3.57/3.58 3.72/3.62 Failure Rate .90/.90 .27/.31 .38/.51 Reference Skills C 37 27 33 P0 .85 .90 .87 k .68 .67 .72 SEM 3.23/3.58 2.63/2.75 3.00/3.15 Failure Rate .58/.56 .15/.22 .32/.51 119 A common occurrence in the establishment of cutting scores is that they are developed with a field test sample and then generalized to the target population. In the current study, the target population was the ninth grade group. Within the Hofstee model, one can generalize either the C min./C max. values developed with one group or the resulting cutting scores. Tables 38 and 39 represent both approaches. l’irst the C min./C max. values developed with the field test sample were plotted with the fc for the ninth grade group. Table 38 presents the differences between groups within methods for the selected cutting scores presented in Table 36 based on the generalization of C mln./C max. values. TABLE 38 COMPARISON BETWEEN FIELD TEST AND 9TH GRADE GROUPS CUTTING SCORE/FAILURE RATE BASED ON GENERALIZATION OF C MlN./C MAX. VALUES Angoff Contrasting Borderline Direct Groups 1 Group” Estimation Reading Field Test 22/.50 20/.31 21/.35 l9/.28 9th Grade 21/.57 18/.35 19/.38 17/.28 Difference 1/.07 2/.03 3/.03 2/.00 Math Field Test 38/.73 28/.35 32/.59 33/.55 9th Grade 38/.72 27/.36 31/.58 33/.55 Difference 1/.01 1/.01 1/.01 0/.01 Reference Skills Field Test 36/.55 32/.28 35/.35 30/.23 9th Grade 35/.57 31/.35 32/.37 28/.26 Difference 1/.03 1/.07 2/.02 2/.03 120 Second, the cutting scores developed with the field test sample were applied to the ninth grade group. Table 39 Presents the differences in failure rates between the field test and ninth grade groups using the same cutting scores. In addition, Table 50 then presents the difference in failure rates for the ninth grade group between the generalization of C min./C max. values and of cutting scores . TABLE 39 COMPARISON BETWEEN FIELD TEST AND 9TH GRADE FAILURE RATE BASED ON GENERALIZATION 0F CUTTING SCORES Contrasting Borderline Direct Angoff Groups Group Estimation Reading C 22 20 21 19 Field Test .50 .31 .35 .28 9th Grade .52 .52 . .57 .38 Difference .12 .11 .12 ' .10 Math c 38 28 32 33 Field Test .73 .35 .59 .55 9th Grade .72 .39 .51 .55 Difference .01 .05 .02 .01 Reference Skills , C 36 32 35 30 Field Test .55 .28 .35 .23 9th Grade- .52 .37 .55 .31 Difference .08 .09 .10 . .08, 121 TABLE 50 COMPARISON OF 9TH GRADE FAILURE RATES BETWEEN GENERALIZATION OF C MIN./C MAX. AND CUTTING SCORES Contrasting Borderline Direct Angoff Groups Group Estimation Reading c .52 .52 .57 .38 C min./C max. .57 .35 .38 .28 Difference .05 .08 .09 .10 Math C .72 .39 .51 .55 C min./C max. .72 .36 .58 .55 Difference 0 .03 .03 0 Reference Skills C .52 .37 .55 .31 C min./C max. .57 .35 .37 .26 Difference .05 .02 .08 .05 122 A means of examing the cutting scores based on the various methods is to compare the proportion of persons consistently classified beyond that expected by chance (k). Cohen (1960) provides an approximation to the standard error of k as: Table 51 presents the cutting scores, k,ck and 95% confidence interval of k for the selected procedures found in Table 36 for the field test group. The k's across methods within tests indicates no significant difference between methods. It should be noted that, vfl1ile appearing to represent high values of k, Table 51 only presents the relative values of k and their standard errors for purposes of comparing the reliability of classification across scores. 123 TABLE 51 COMPARISONS OF k ACROSS METHODS FOR SELECTED SCORES Angoff Contrasting Borderline Direct Groups Group Estimation Reading C 22 20 21 19 k .62 .65 .63 .65 ok .02 .02 .02 .02 Confidence Interval t.05 $.05 3.05 +.05 Math C 38 28 32 33 k .70 .72 .71 .72 ok .02 .02 .02 .02 Confidence Interval $.05 11.05 1.05 .3.05 Reference Skills C 36 32 35 30 k .70 .71 .71 .70 ok .02 .02 .02 .02 Confidence Interval $.05 .1.05 .:.05 1.05 125 An additional means of examining the relative ability of the methods to classify students into master and non-master categories would be to examine the false positives and false negatives associated with categorizing students from known master/non-master groups. Essentially this involves examining the percent of students who teaches classified as masters who failed and conversly those students who were classified as non-masters who passed the test. Table 52 presents the false positive and false negative percents for the field test group for both the selected and traditional scores. Based on the classification of mastery/non-mastery by classroom teachers, the results indicate a number of scores that appear to produce high false positive and false negative percentages. 125 TABLE 52 PERCENT OF FALSE POSITIVES AND FALSE NEGATIVES BASED ON MASTER/NON-MASTER GROUPS Percent Percent False Positives False Negatives Reading 18 I15 11 19 55 15 20 . 37 17 21 35 20 22 27 23 25 18 36 Math 26 25 13 28 17 17 29 15 19 32 6 28 33 5 35 38 0 55 55 o 78 Reference Skills 27 58 8 30 50 11 32 51 20 33 50 22 35 38 ' 25 36 28 30 37 17 35 126 Table 53 presents the cutting scores and failure rates found in both Table 36 and 37 for both the field test and 9th grade groups. Ninth grade Hofstee cutting scores and corresponding failure rates are based on the generalization of C min./C max. values. As a summary, Table 53 affords a quick comparison of the selected Hofstee methods with traditional methods. TABLE 53 CUTTING SCORES AND FAILURE RATES BASED ON SELECTED PROCEDURES AND TRADITIONAL METHODS Hofstee Traditional Con- Con- trast- Border- Direct trast- Border- ing line Estima- ing line Angoff Groups Group tion Angoff Groups Group Field Test Group Reading C 22 20 21 I9 25 18 20 Failure Rate .50 .31 .35 .28 .53 .25 .31 Math C 38 28 32 33 55 26 29 Failure Rate .73 .35 .59 .55 .90 .27 .38 Reference Skills C 36 32 35 30 37 27 33 Failure Rate .55 .28 .35 .23 .58 .15 .32 ° 9th Grade Group 1 Reading C 21 18 19 17 25 18 20 Failure Rate .57 .35 .38 .28 .63 .35 .52 Math C 38 27 31 33 55 26 29 Failure Rate .72 .36 .53 .55 .90 .31 .51 Reference Skills C 35 31 32 28 37 27 33 Failure Rate .57 .35 .37 .26 .56 .22 .51 127 These selected scores represent the results of each of the standard setting methods and, as such, cannot be statistically be compared in a direct sense. The question in effect is not "Do the scores differ by change?" because they were defined by different criteria and from different perspectives. They can be compared indirectly, however, by examining the standard error of each score in relation to the others as they perform with a sample of students. As such, the question becomes "What is the probability that any two scores yielded by different methods will differ by chance alone and that an examinee with a real score at one cutting score could have obtained a different score by chance?". A traditional method for examining differences in scores, as described by Hagnusson (1967) and Allen and Yen (1979), is to establish confidence intervals around the two scores in question. If the intervals do not overlap, then they can be said to differ at some specified level of confidence. Two commonly used intervals of confidence for comparing scores are to set bands of +/SEH (or + 1.96 SEM) around the observed scores and seeing if the bands overlap. Feldt (1967), however, cautions that broad confidence intervals such as these may lead to "underinterpreting" actual differences between scores. If the confidence interval is too broad, then true differences will often not be correctly identified. As such, he argues for narrower intervals such as :E .67 SEM. The decision of whether to use broad or narrow intervals in analogous to decisions regarding Type I and Type 11 errors. One type of error is usually reduced at the expense of the other. In his discussion of this Issue, Feldt presents a method for determining the probability of overlapping confidence intervals, given actual 128 differences between scores. Based on Feldt's method, with the assumption that the selected cutting scores by different methods represent "real" or actual differences, Tables h3-h6 present the probability of overlap between all selected scores within each test. In order to examine the scores relative to minimizing Type I and Type II errors, three confidence intervals about each cutting score were used: .67, 1.00 and 1.96. With the exception of some comparisons to the math score yielded by the traditional Angoff method (C - Ah), all comparisons at the 1.96 interval evidenced high probabilities of overlap (.611 to .99). Host comparisons at the 1.00 interval also should moderate to high probabilities (.50 to .80). The .67 interval, however, evidenced a majority of the comparisons below a .50 probability of overlap. PROBABILITY OF OVERLAPPING CONFIDENCE INTERVALS AT P - .50, .68 AND .96 129 TABLE AA READING 19 20 21 22 29 18 .67 .64 .59 .52 .AA .27 1.00 .83 .78 .71 .64 .47 1.96 .99 .99 .97 .96 .89 19 .67 .65 .60 .54 .37 1.00 .83 .80 .71 .56 1.96 .99 .99 .98 .99 20 .67 .65 .60 .45 1.00 .83 .79 .65 1.96 .99 .99 .96 21 .67 .64 .53 1.00 .83 .73 1.96 .99 .98 22 .67 .60 1.00 .79 1.96 .99 PROBABILITY OF OVERLAPPING CONFIDENCE INTERVALS 130 TABLE 45 AT P - .50., .68, .95 MATH 28 29 32 33 38 Ah 26 .67 .63 .58 .41 .34 .08 .00 1.00 .81 .78 .06 .53 .16 .01 1.96 .99 .99 .95 .92 .60 .07 28 .67 .61 .53 .07 .16 .01 1.00 .89 .73 .67 .30 .03 1.96 .99 .98 .97 .80 .27 29 .67 .58 .52 .22 .01 1.00 .79 .73 .38 .Oh 1.96 .99 .98 .85 .35 32 .67 .65 .91 .06 1.00 .83 .60 .14 1'96 e99 092 .60 33 - .67 .A6 .09 1.00 .66 .19 1.96 .96 .68 38 .67 .35 1.00 .55 1.96 .93 131 TABLE 46 PROBABILITY 0F OVERLAPPING CONFIDENCE INTERVALS AT P - .50, .68, AND .95 REFERENCE SKILLS 30 32 33 34 36 37 27 .67 .53 .37 .28 .22 .11 .08 1.00 .72 .56 .47 .38 .26 .16 1.96 .98 .96 .90 .85 .71 .64 30 .67 .60 .54 .47 .32 .25 1.00 .80 .74 .67 .65 .43 1.96 .99 .98 .97 .92 .88 32 .67 .64 .61 .49 .41 1.00 .83 .80 .68 .60 1'96 099 099 097 095 33 .67 .64 .56 .49 1.00 .83 .75 .68 1.96 .99 .98 .97 34 .67 .61 .55 1.00 .80 .75 1.96 .99 .98 36 .67 .63 1.00 .83 1.96 .99 CHAPTER V DISCUSSION The purpose of the study was to examine four approaches to setting the values of C min. and C max. in Hofstee's compromise standard setting model. The approaches were the modified Angoff, direct estimate, and Zieky and Livingston's contrasting groups and borderline methods. Each approach was applied to Reading, Math, and Reference Skills tests used for high school graduation in a midwestern urban school district. The results of the study are first presented by approach and then contrasted across approaches in terms of cutting scores yielded and relationships to varying F min. and F max. by each method for each test. The results of these approaches were also compared totthe results obtained from traditional'Angoff, contrasting groups and borderline methods. Anggff The judges ratings within the Angoff method appeared to evidence acceptable levels of reliability in terms of both interiudge and intrajudge consistency. It is interesting to note that the intrajudge consistency index was somewhat higher for ratings based on 12th grade skills than for 9th grade. Perhaps this difference stems from 12th 132 1.33 grade students representing a more homogeneous group than 9th graders in that possibly students who do not conform to specific standards do not become 12th graders. Additionally, social norms may have shaped students by the 12th grade. Alternatively, the judges may have been more familiar with 12th grade students in general. In either case, it presents an interesting problem for future study. The range of acceptable cutting scores appeared to be small for the Angoff method. The closeness of C min. and C max. scores may reflect a true lack of significant difference in the levels of the skills that were measured between 9th and 12th grade. These may be skills that are not taught after 9th grade. If true, this has substantial implications for anchoring ratings in grade level terms for setting C min. and C max. The data comparing the groups comprising the field test sample (10th, 11th, and 12th grades) and the comparison between scores of this group and the 9th grade group support the conclusion there may not be substantial skill level differences between 9th and 12th grades on the measured skills. Care should be exercised when anchoring C min. and C max. in known populations to be sure that the anchors represent actual and measurable skill differences. It should be noted, however, that choosing different anchor groups than were used in this study would very likely produce different results particularly if the skills tested were taught to one group and not the other. A significant problem with the Hofstee model was brought to light by the Angoff method: non-intersection of the line (P1, P2) with fc. Though this may reflect an intended/desired standard relative to a lower functioning population and thereby be intended as a means of l 134 motivating students and schools to improve, a cutting score must still be set. llithin the Hofstee model, setting the cutting score implies the intersection of (P1, P2) with fc- The four solutions that were proposed appear to offer only partial solutions to the problem. Accepting C min. as the cutting score appears to violate F max. while maintaining C min. In cases where the range is higher than the measured skill level (as represented by fc) accepting C min.1~ill always result in a higher F max. value than originally specified. Politically, this may be unacceptable. Artibrarlly raising F max. to a score point above fc has the same problem. On the other hand, maintaining F max. by drawing a line from P1 towards the vertical axis until it crossed fc violates the meaning of C min. by often substantially lowering it. The best solution may be to simply extend the line (P1, P2) until it intersects fc effecting a compromise between the two extremes. Of course, it could be argued that either the parameters should be re-established in light of the new information or the model should not‘be used “in such cases and alternate models adOpted. Relative to the standards set by extending the line (P1, P2), the traditional Angoff method set high standards. In fact, they were the highest of any model across both Hofstee and traditional methods.~ It is interesting, however, that the two committees appeared to differ in that the Math committee set much higher scores relative to the other cutting score procedures than did the Language Arts committee. 135 Contrasting Groups One of the more interesting findings relative to the contrasting groups method is that despite C min./C max. values substantially varying depending upon where in the distribution they are selected, there were not correspondingly large differences in the cutting scores. The largest difference in fact was limited to 2 points. It may be that with relatively normal populations the lines represented by corresponding points in the populations represent similar relationships to the distributions of the populations. To the degree that this is true, using the means of the two populations may not only be the most straight forward but the most representative of all possible lines drawn from points in the distribution. It should be noted that the points in the distributions representing C min. and C max. were chosen from a wide range of possible choices. Other points are certainly possible to use and could lead to different results. Another interesting finding relative to the contrasting groups method is that it tended to yield lower cutting scores with the traditional method than with the Hofstee method. It must be cautioned, however, that conclusions drawn from this study relative to the contrasting groups approach should be viewed carefully due to the low number of students assigned to the non-master category. This imbalance in group size, however, may be representative of what happens when using this method. In addition, neither the contrasting groups nor the borderline group approaches would be appropriate for setting standards for licensure examinations because in such cases mastery/non-mastery groups are usually not readily defined. 136 Borderline Group The various cutting scores yielded by the borderline group method for setting C min. and C max. were remarkably similar. Perhaps this is the result of the normal appearances to the borderline group scores. The mean and +250, as representatives of C min. and C max, are probably as good as any other choice within the method though in the long run with a normal population all positive points in the distribution appear to produce relatively equal results. The one caveat to this statement is the relationship between the size of the difference between C min. and C max. and changes in F min./F max. Greater differences will be affected by changes in F minu/F'max. values. It is interesting to note that the use of the mean alone set only slightly lower scores than those set using the full Hofstee model. In hindsight, it may have been of value to use points balanced around the mean rather than anchoring C min. always at the mean itself. Direct Estimation Of the four methods used to establish C min./C max. values the direct estimation method is the simplest and least involved. Like the Angoff method, however, it produced one set of C min./C max. values that failed to intersect with fc with the same problems and hnplications noted above. It is interesting to note that this occurred with the Math committee which also set a higher Angoff score. 137 F aim/F lax. Value Chang In general, the Hofstee model appears to be fairly robust relative to moderate changes in F min./F max. values. Changes to the extremes (for example the values 0/10), can, however, have significant effects. There does appear to be a relationship between the line difference between C min./C max. and the effects of F min./F max. Greater C min./C max. differences result in a greater impact of changes in the F values in terms of their resultant cutting score. The issue of the size of the C min./C max. differences vs. F min./F max. change effects appears to deserve futher study. CC-par 1 sons Between Hethods The most striking thing as one looks across the various cutting scores produced by the different methods and their corresponding failure rates, is that while there are" reasonably small differences between scores based on their standard errors of measurement there are substantial differences in the rates of failure. Relatively small difference in cutting scores can have a profound effect on the rates at which students pass or fall. In statistical terms, all of the methods yielded similar results, evidenced acceptable reliability as determined by Po and k, and generalized with similar results between groups of students. Essentially, there were differences, but few that fell outside of plausable confidence intervals. Based on the most reasonable approach with each method, Reading ranged from scores of 19 to 22 with failure rates from .28 to .110; Math ranged from 28 to 38 138 with failure rates from .35 to .73 and Reference Skills ranged from 30 to 36 with failure rates from .23 to .44 for the field test group. The ninth grade group evidenced a similar pattern. Summary and Implications of Findings In summary, there were eleven primary findings in the study each of which presented implications for the setting of cutting scores using the Hofstee compromise method. 1. Finding: All of the methods used in the study tended to produce reliable results that generalized well between groups. Implication: On the basis of producing reliable results, the use of any of the four means for setting C min. and C max. or the use of the three traditional methods-would be. defensible. Reliability, however, is not and should not be the only criteria for selecting a cutting score method. 2, Finding: There were relatively non-significant differences between the cutting scores yielded ; by the different methods. Essentially, the cutting scores psychometrically appeared to be quite similar. 139 Implication: Psychometrically, it would not greatly matter which method was used to determine cutting scores. They all produced reliable, and in the end, not dissimilar scores. Again, based on these criteria, the use of any of the methods would be defensible. Finding: The percent of students who failed the tests changed greatly with relatively small changes in cutting scores. Though not significantly different, different scores yielded by the different methods produced marked differences in the failure rates on the Implication: The change in cutting scores resulting from the choice of one method over another could have significant political, educational, and personal impact. lvlhile the use of any of the methods is equally defensible based on reliability and similar score results, the selection of a method to use is not trivial in terms of the impact on people. It was hoped that the use of a compromise method with a range of acceptable scores would have helped to reduce this impact by taking acceptable failure rates into ll-IO account before setting cutting scores. ‘This, unfortunately, does not appear to have been the case. The practical implications for practitioners who desire the cutting score to fall Inithin some specified paramater are: (1) one could use any of the methods and then change the resulting cutting score to fall within acceptable failure ranges or (2) to select the method that produces the desired result. In either case, what some would see as a capricious element has been added to the process. Aside from the initial political advantage of appeasing all parties, those interested in absolute standards and those interested in normative standards, one wonders if there are any advantages of using the compromise method to set Cutting sCOres on minimal competency tests. It should be noted, however, that the failure rates found in this study do not reflect the overall failure rate of the same group of students by the end of the twelfth grade. With remedial instruction, one would anticipate that the final failure rate would be much lower than the rate found at the ninth grade. 1.41 Finding: The Angoff method, compromise and traditional approaches, consistently produced the highest cutting scores. Implication: If one wished to set stringent standards to select a smaller portion of examinees, the Angoff method would most likely produce the highest cutting scores of those methods examined. The converse is of course also true, if one wished to avoid high failure rates, the Angoff would not be the method of choice. Finding: The direct estimation method tended to be methodologically the least complex and to produce the lowest cutting scores of the methods emined. Implication: If one wished to have a fairly simple procedure that tended to set relatively low cutting scores, the direct estimation would be the method of choice. Because of its simplicity both in terms of the procedures used and the few number of persons needed to make the estimates (theoretically they could be made by’one person), the direct estimation method would appear to have applicability for 142 setting passing scores on a wide range of tests to include teacher made tests for use in single classrooms. Finding: The contrasting groups method tended to set lower cutting scores with the traditional method (QDF) than with the coqrromise procedures. Implication: Again, this has an impact on the selection of a method relative to producing an acceptable failure rate if one chose to select a method on that basis. Use of the QDF would tend to yield lower cutting scores and therefore lower failure rates. Finding: Large differences in the-selected points in the distribution used to determine C min. and C max. with the contrasting groups method yielded relatively small differences between the corresponding cutting scores. Implication: Though these points were arbitrarily chosen, the findings seem to indicate that it may make little difference as to which set of points one chooses to use. This would particularly appear to be true when 143 the two distributions are fairly normal and the points are chosen symmetrically. One may expect, however, to see greater resulting score differences with non-normal distributions and/or points chosen non- symmetrically (i.e., the mean of one distribution and one standard deviation in the other). Finding: Using the mean as C min. in the borderline group coqromlse method will always yield a higher cutting score for the group on whom the C min./C max. were coquted than the traditional approach of using the mean alone as the cutting score. Implication: Though not a surprising-finding, it does impact on method selection when there is concern for maintaining or avoiding some failure rate level. Alternatively, one may choose to not use the mean as C min. but instead to select some point in the negative direction of the distribution. 10. 144 Finding: Except at extreme ranges, the IIofstee model is fairly robust relative to changes in F min. and F max. values. Implication: Changes in F min. and F max. reflect varying political and normative notions of acceptable failure rates. It is interesting that these notions can vary quite widely with there being only a modest effect on cutting scores. On one hand it is heartening that such differences have such little effect given the non-systematic way in which they are derived. On the other hand this may reflect a lack of sensitivity to considerations of failure rates within the model. This may help to explain why the model does not appear to lessen the effect- of small changes in cutting scores producing large differences in failure rates. Finding: The difference In C min. and C max. values resulting from the Angoff method were Implication: The lack of difference between the C min. and C max. values most likely resulted from a lack of skill differences 11. 145 between ninth and twelfth grade students in the skills measured by the tests. This has implications both for the use of the Angoff method for setting C min. and C max. values and educationally for the remediation of those students who failed the tests. Relative to selecting C min. and C max. values, the use of anchor grades at the secondary level to set the values on tests measuring basic Language Arts and Math skills would appear to not be well advised. In fact, the lack of differences may have been in part responsible for the failure of the line (p1, p2) to intersect fc. Educationaliy, the lack of difference would appear to indicate a lack of academic growth in those specific skills between ninth and twelfth grades. This lack in growth in turn may indicate a lack in instruction in the skills between ninth and twelfth grades. For students failing the tests in ninth grade, there may be little hope in passing them by the end of twelfth grade unless that instruction takes place. Finding: In several instances, the line (P1, .146 Implication: Regardless of the solution used to solve this problem, one or more assumptions of the model will be violated. When it occurs, this is a serious limitation of the model. One is forced to ignore either the absolute or normative assumptions (or both in the case of extending the line). These violations question whether the model should be used when the problem occurs. In other words, do the solutions and corresponding violations lessen the validity of the model to the point that it no longer represents its original intentions? Unfortunately, this is the case. To the degree that one extreme or the other is violated, the model strays from its proported intention of representing a compromise between knowh normative and absolute values. Limitations and Suggestions for Further Research The primary'limltation of the study was that the committees were small in terms of the number of persons on each committee. This was particularly true regarding the number of language arts members involved in the direct estimation procedures. Though this procedure can be performed with a single person, it would have been interesting to see how the C min./C max. estimates would be different using more 147 people to make the estimation. Future research might examine procedures that involved larger numbers of persons making the estimates. Such research, may attempt to assertaln if there is some minimum rater sample for optimal estimations. An additional limitation was that the reading and reference skills competencies were combined for rating students into master, non-master and borderline groups. Separate estimates for each test may have resulted in somewhat different results. Future research may wish to address the single rating problem first by repeating the ratings as separate ratings and second by examining the issue directly. In other words, it would be interesting to see if in fact different results would be obtained by separate vs. combined ratings. A third limitation was that points in the negative direction of the borderline groups distribution were not used. Their use may have resulted in different estimates with the borderline group results and should be addressed in future research. The problem noted in the study of‘ the failure of the line (P1, P2) to intersect fc should be further examined both from the prespective of how to avoid the problem and viable solutions once it occurs. In addition, the problem of large changes in failure rates resulting from small changes in cutting scores should be extensively examined. Both the personal impact and potential means of avoiding the problem need to be studied. Such reseach may look' for and evaluate not only new methods for setting cutting scores but possible new conceptualizations of standard setting itself. If we are going to continue to make dichotomus decisions (which we are) we must find better ways to do so. 148 Concluding Remarks The most troubling finding in the study was that seemingly small changes in cutting scores from one method to the next resulted in large changes in failure rates. This would seem to support Glass' notion that setting cutting scores is unsound and arbitrary. However, as Mehrens points out, categorical decisions will still be made and in doing so people will base their decisions on some type or form of a standard, whether stated or implied. One cannot avoid standards nor can one avoid applying those standards to real life decisions. This does not, however, dissipate the uneasy feeling derived from large changes in failure rates with small cutting score changes. One way out of the dilemma might be to conceptualize a potential standard as lying on some hierarchical scale made up of real life referents so that the mastery of one skill implies mastery of all skills below it in the hierarchy. Essentially, the standard would fall somewhere in an absolute scale made up of real life referents. The identification of the standard itself then becomes one of arriving at a concensous as to the location of the real standard on the scale. Tests could then be constructed to predict the placement of a student at that identified point on the scale. Validation would be expensive because it would involve the indepth assessment of students on an individual basis in order to establish a criterion group relative to the scale but the pay-off may be worth the effort. Scores could be reported in terms of the probability of a person being at or above the standard on the scale and cutting scores could be some minimum probability rather than a fixed test score. Such a model would also 149 allow for other measures/factors in addition to the test to be taken into account when determining probabilities. By fixing the standard on some external (to the test) scale of lobservable referents and examining the probability of a person being at or above that point, the arbitrary nature of setting cutting scores may be avoided. The problem becomes one of the accuracy of prediction rather than the judgmental setting of standards. Though some may argue that the problem of prediction is indeed great, it may be preferable to the problems inherent in setting cutting scores on tests as is currently being done. APPENDICES APPENDIX A 151 152 APPENDIX A Tables l-A, 2-A, and 3-A represent the mean attainment rates for each item for the field test and 9th grade groups for the Reading, Math, and Reference Skills Tests, respectively. During field testing, item 9 on the Math Test was found to be an inappropriate item due to a printing error which deleted the correct answer choice. Consequently, the item was deleted from the analysis. 153 TABLE l-A ATTAINMENT RATE BY ITEM READING Item Field Test 9th Grade 1. .84 .76 2. '1‘] 035 3. .74 .66 4. .73 ~72 5- -77 .72 6. .85 .77 7. .71 .58 8. .63 .61 9. .80 .72 10. .80 .77 11. .50 .51 12. .46 .43 13. .66 .64 14. .78 .70 15. .40 .28 16. .37 .33 17. .42 .37 18. .56 .52 19. .25 .26 20. .57 .43 21. .73 .69 22. .68 .59 23. .86 .74 24. .69 » .- .63 25- .77 .79 26. .80 .79 27. .73 .72 28. .79 .73 29. .65 .68 30. .75 .76 31. .87 .89 32. .79 .81 33. .51 .50 34. .60 .60 154 TABLE Z-A ATTAINMENT RATE BY ITEM MATH Item Field Test 9th Grade ‘0 .92 09“ 2. .62 .60 3. .64 .62 4. .66 .50 5. .90 .91 6. .53 .49 7. .77 .73 8. .73 76 9. - - 10. .82 .82 11. .71 .70 12. .71 .71 13. .75 .79 14. .65 .72 15. .63 .59 16. .75 .72 17. .85 .82 18. .65 .68 19. .87 .80 20. .86 .86 21. .81 .70 22. .69 .67 23o 077 '7‘. 24. .73 .59 25. .89 .82 26. .74 .70 27. .33 .34 28. .45 .44 29. .60 .51 30. .61 .58 31. .71 .67 32. .44 .43 33. .66 .59 34. .64 .49 35. .67 .59 36. .51 .47 37. .64 .55 38. .68 .59 39. .66 .61 40. .55 .55 41. .63 .57 42. .46 .43 43. .65 .58 44. .47 .44 45. .48 .43 46. .62 .61 47. .64 .63 48. .56 .50 49. .59 .53 so. .49 .41 51. .58 _ .61 52. .58 .61 l 155 TABLE 3'A ATTAINHENT RATE BY ITEH REFERENCE SKILLS Item Field Test 9th Grade 1. .82 .84 2. .76 .76 3. .60 .53 4. .58 .50 5. .64 .61 6. .88 .86 7. .95 .93 8. .95 .93 9. .78 .68 10. .88 .88 11. .92 .87 12. .82 .76 13. .62 .63 14. .88 .84 15. .49 .48 16. .58 .47 17. .71 .68 18. .83 .78 19. .78 .67 20. .71 .65 21. .81 .70 22. .83 .75 23. .77 .72 2“. 079 '75 25. .86 .82 26. .70 .64 27. .78 .77 28. .74 .71 29. .86 .85 30. 057 057 31. .94 .95 32. .93 .92 33. .49 .47 34. .70 .77 350 ’57 'ha 36. .77 .76 37. .71 .64 38. .61 .58 39. .84 .81 40. 057 '52 41. .79 .69 “2. 072 '58 43. .65 .65 44. .39 .30 45. .64 .76 46. .74 .82 47. .59 .51 48. .76 .73 49. .30 .36 50. .64 .60 APPENDIX B 156 157 APPENDIX 8 Tables 4-8, S-B, and 6-B present the mean ratings and standard deviation for each item across judges for the ninth and twelfth grades for Reading, Math, and Reference Skills Test, respectively. 158 TABLE 4-B RATER MEANS AND STANDARD OEVIATIONS BY ITEM READING 9th Grade 12th Grade Item Mean SD Mean SD 1. .807 .106 .700 .173 2. .800 .104 .621 .191 3. .750 .141 .579 .208 4. .814 .114 .671 .204 5. .856 .108 .693 .190 6. .849 .089 .757 .146 7. .829 .064 .764 .152 8. .814 .157 .650 .233 9. .870 .097 .814 .135 10. .814 .075 .671 .168 11. .844 .124 .686 .223 12. .816 .125 .771 .170 13. .820 .101 .671 .173 14. .807 .098 .629 .135 15. .824 .084 .650 .104 16. .800 .082 .671 .163 17. .814 .099 .671 .187 18. .841 .086 .770 .171 19. .821 .070 .729 .122 20. .813 .104 .700 .191 21. .829 .076 .771 .129 22. .814 .075 .757 .154 23. .843 .053 ‘.779 ~.115 24. .844 .093 .693 .169 25. .816 .111 .729 .141 26. .841 .107 .758 .143 27. .828 .097 .720 .180 28. .863 .088 .800 .161 29. .856 .086 .784 .166 30. .841 .081 .693 .169 31. .884 .088 .861 .172 32. .878 .084 .827 .166 33. .827 .097 .669 .219 34. .864 .063 .721 .180 159 TABLE 5-8 RATER MEANS AND STANDARD DEVIATIONS BY ITEM MATH 9th Grade 12th Grade Item Mean SD Mean SD 1. .977 .022 .976 .020 2. .827 .082 .871 .061 3. .825 .082 .854 .073 4. .900 .069 .933 .067 5. .932 .065 .956 .051 6. .752 .170 .829 .087 7. .837 .128 .891 .081 8. .888 .099 .930 .062 9. - - - - 10. .863 . .112 .910 .061 11. .835 .124 .904 .061 12. .822 .111 .879 .070 13. .767 .151 .863 .077 14. .775 .133 .860 .076 15. .792 .128 .851 .073 16. .775 .129 .857 .074 17. .863 .121 .883 .103 18. .800 .138 .883 .098 19. .775 .175 .857 .101 20. .805 .144 .874 .093 21. .745 .151 .866 .082 22. .787 .173 .874 .092 23. .867 .093 .903 .083 24. .687 .196 .839 .068 25. .822 .150 .907 .084 26. .675 .216 .859 .072 27. .725 .212 .860 .091 28. .718 .200 .834 .081 29. .783 .144 .906 .098 30. .777 .152 .836 .090 31. .658 .189 .818 .143 32. .650 .187 .826 .110 33. .845 .072 .903 .078 34. .807 .149 .861 .090 35. .798 .154 .847 .091 36. .792 .172 .859 .092 37. .792 .172 .854 .088 38. .797 .213 .871 .081 39. .798 .138 .863 .087 40. .720 .194 .826 .089 41. .725 .221 .820 .063 42. .717 .181 .833 .102 43. .715 .247 .827 .093 44. .683 .238 .799 .072 45. .723 .161 .793 .093 46. .685 .239 .807 .073 47. .793 .128 .843 .093 48. .730 .223 .847 .099 49. .693 . .181 .611 .097 50. ' .783 .179 .861 .107 51. .757 .182 .851 .093 52. .787 .147 .837 .097 160 TABLE 6-B* RATER MEANS AND STANDARD DEVIATION BY ITEM REFERENCE SKILLS ‘ 9th Grade 12th Grade Item Mean SD Mean SD 1. .677 .217 .833 .127 2. .677 .217 .840 .133 3. .677 .217 .826 .121 4. .670 .213 .826 .121 5. .793 .130 .834 .145 6. .806 .137 .841 .141 7. .847 .135 .876 .126 8. .791 .152 .841 .115 9. .783 .142 .827 .105 10. .817 .177 .827 .142 11. .799 .188 .841 .135 12. .779 .147 .857 .089 13. .714 .144 .814 .075 14. .707 .164 .829 .070 15. .579 .264 .764 .131 16. .677 .257 .793 .137 17. .786 .170 .870 .106 18. .777 .153 .857 .053 19. .729 .141 .843 .067 20. .614 .204 .786 .128 21. .693 .199 .850 .082 22. .699 .223 .836 .095 23. .806 .114 .893 .053 24. .791 .118 .871 .049 25. .791 .118 .871 .049 26. . .784 .119 .886 .056 27. .806 .114 .886 .048 28. .784 .122 .877 .067 29. .776 .183 .877 .097 30. .777 .155 .863 .092 31. .819 .154 .870 .093 32. .840 .126 .877 .084 33. .650 .198 .806 .125 34. .693 .192 .813 .112 35. .727 .221 .820 .087 36. .720 .197 .813 .104 37. .679 .208 .791 .121 38. .671 .204 .786 .099 39. .754 .219 .856 .104 40. .833 .181 .856 .115 41. .756 .171 .813 .108 42. .700 .150 .841 .090 43. .677 .179 .827 .112 44. .861 .137 .903 .078 45. .849 .102 .849 .113 46. .743 .137 .806 .111 47. .627 .260 .813 .115 48. ' .682 .193 .827 .120 49. .693 .195 .820 .112 50. .686 .193 .834 .112 LIST OF REFERENCES LIST OF REFERENCES Allen, M. and Yen, W. (1979). Introduction to Measurement Theory, Monterey: Brooks/Cole Publishing Co. Anderson, L. (1979). Considerations for setting performance standards on entrance examinations. Paper presented as part of a symposium entitled ”Setting Standards: Theory and Practice", at the annual meeting of the American Educational Research Association, San Francisco. Angoff, W. (1971). Scales, norms, and equivalent scores. In R.L. Thorndike (Ed.), Educational Measurement (2nd ed.). Washington, D.C: American Council on Education, 508-600. Bergan, J. Cancelli, A., Luiten, J. (1980). Mastery assessment with latent class and quasi-independence models representing homogeneous item domains. Journal of Educational Statistics, 5, 65-81. Berk, R. (1980). Item analysis. In R.A. Berk (Ed.), Criterion- Referenced Measurement: The State of the Art. Baltimore, MD: Johns Hopkins University Press, 49-79. Berk, R. A. (1986). A consumer's guide to setting performance standards on criterion-referenced tests. Review of Educational Research, 56, 137-172. Berk, R. A. (1980a). A consumer's guide to criterion-referenced test reliability. Journal of Educational Measurement, 17, 323-349. Berk, R. A. (1976). Determination of optimal cutting scores in criterion-referenced measurement. Journal of Experimental Education, 45, 4-9. Beuk, C (1984). A method for reaching a compromise between absolute and relative standards in examinations. Journal of Educational Measurement, 21, 147-152. 161 162 Block, J. (1972). Student learning and the setting of mastery performance standards. Educational Horizons, 50, 183-190. Block, J. (1978). Standards and criteria: a response. Journal of Educational Measurement, 15, 291-295. Brennan, R. L. (1980). Applications of generalizability theory. In R. A. Berk (Ed.), gliterion-Referenced Measurement: The State of the Art. Baltimore, MD.: John Hopkins University Press, 186-232. Brennan, R. L., and Kane, M. T. (1977). An index of dependability for mastery tests. Journal of Educational Measurement, 14: 277-89. Brennan, R., Lockwood, R. (1980). A comparison of the Nedelsky and Angoff cutting score procedures using generalizability theory. Applied Psychological Measurement, 4, 219-240. Burton, N. (1978). Societal standards. Journal of Educational Measurement, 15, 263-271. Cangelosi, J. S. (1984). Another answer to the cut-off score question. Educational Measurement: Issues and Practice, 3 (R), 23’25- '6 Cohen, J. (1960). A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20: 37-46. Cross, L., Impara, J., Frary, R., Jaeger, R. (1984). A comparison of three methods for establishing minimum standards on the national teacher examinations. Journal of Educational Measurement, 21, 113-129. De Gruijter, D. (1985). Compromise models for establishing examination standards. Journal of Educational Measugement, 22 , 263-269 a Ebel, R. (1979). Essentials of Educational Measurement (3rd Ed.). Englewood Cliffs, NJ: Prentice-Hall. Feldt, L. (1967). A note on the use of confidence bands to evaluate the reliability of a difference between two scores. American Educational Research Journal, 4, 139-145. 163 Fitzpatrick, A. R. (1984). Social influences in standard setting: the effect of group interaction on individuals' judgements. Paper presented at the annual meeting of the American Educational Research Association, New Orleans. Garcia-Quintana, R. A., and Mappus, L. L. (1980). Using norm- ‘ referenced data to set standards for a minimum competency program in the state of South Carolina: a feasitility study. Educational Evaluation and Policy Analysis, 2, 17-52. Glass, G. (1978). Standards and criteria. Journal of Educational Measurement, 15, 237-261. Hambelton, R. (1978). On the use of cutoff scores with criterion- referenced tests in instructional settings. Journal of Educational Measurement, 15, 277-290. Hambleton, R. K., and Novick, M. R. (1973). Toward an integration of theory and method for criterion-referenced tests. Journal of Educational Measurement, 10, 159-70. Hofstee, W. (1973). Een alternatief voor normhandhoving bij toetsen. Netherlands Tijdschrift von de Psychologie, 28, 215-227. Hofstee, W. (1983). The case for compromise in educational selection and grading. In 5.8. Anderson and J.S. Helmick (Eds.), 92 Educational Testing, San Francisco: Jossey-Bass. Hsu, L. (1980). Determination of the number of items and passing score in a mastery test. Educational and Psychological Measurement, 40, 709-714. Huynh, H. (1976). On the reliability of decisions in domain- referenced testing. Journal of Educational Measurement, 13’ 253-61.. Huynh, H. (1979). Budgetary considerations in setting mastery scores. Paper presented as a part of a symposium entitled "Setting Standards: Theory and Practice", at the annual meeting of the American Educational Research Association, San Francisco. 164 Huynh, H. (1979a). Bayesion and empirical Bayes approaches to setting passing scores on mastery tests. Publication Series in Mastery Testing. Columbia, SC: University of South Carolina. Jaeger, R. (1978). A proposal for setting a standard on the North Carolina High School Competency Test. Paper presented at the spring meeting of the North Carolina Association for Research in Education. Koffler, S. (1980). A comparison of approaches for setting proficiency standards. Journal of Education Measurement, 17, '67-‘78a Livingston, S. A. (1972). Criterion-referenced application of classical test theory. Journal of Educational Measurement, 9, 13-260 Livingston, S. A. (1982). Estimation of the conditional standard error of measurement for stratified tests. Journal of Educational Measurement, 19, 135-38. Livingston, S. A., and Zieky, M. J. (1982). Passing scores: A manual for settingystandards of performance on educationgl and occupational tests. Princeton, NJ.: Educational Testing Service. Lockwood, R., Halpin, G., McLean, J. (1986). Theoretical assumptions and situational constraints in the standard-setting process. Paper presented at the annual meeting of the National Council on Measurement in Education, San Francisco. Lord, F. M. (1984). Standard errors of measurement at different ability levels. Journal of Educational Measurement, 211 239-43. Macready, G., Dayton, C. (1977). The use of probabilistic models in the assessment of mastery. Journal of Educational Statistics, 2, 99-120. Magnusson, D. (1967). Test Theory. Reading, MA: Addison Wesley. Majestic, R. (1979). The use of experts' judgements in standard setting. Paper presented at the annual meeting of the National Council on Measurement in Education, San Francisco. 165 Mehrens, W. A. (1978). The technology of competency measurement. In R. B. Ingle, M.R. Carroll and W. J. Gephart (Eds.), 133 Assessment of Student Competence in the Publis Schools. Bloomington, IN, Phi Delta Kappa, 39-55. Mills, C., Barr, J. (1983). A comparison of standard setting methods: Do the same judges establish the same standards with different methods? Paper presented at the annual meeting of the American Educational Research Association, Montreal. Mills, C., Melican, G. (1986). A preliminary investigation of three compromise methods for establishing cut-off scores. Paper presented at the annual meeting of the National Council on Measurement in Education, San Francisco. Nedelsky, L. (1954). Absolute grading standards for objective tests. Educational and Psychological Measurement, 14, 3-19. Peng, C-Y. J., Subkoviak, ..J. (1980). A note on Huynh's normal approximation procedure for estimating criterion- referenced reliability. Journal of Educational Measure- ment, 17, 359-368. Papham, W. (1978). As always provocative. Journal of Educational Measurement, 15, 297-300. Popham, W. (1986). Preparing policy makers for standard setting on high-stakes tests. Paper presented at the annual meeting of the American Educational Research Association, San Francisco. Scriven, M. (1978). How to anchor standards. Journal of Educational Measurements, 15, 273-275. Shephard, L. (1984). Setting performance standards. In R.A. Berk (Ed.), A Guide to Criterion-Referenced Test Construction. Baltimore, MD: John Hopkins University Press, 169-198. Skakum, E. N., Kling, S. (1980). Comparability of methods fon setting standards. Journal of Educational Measurement. 17, 229-35. 166 Subkoviak, M. J. (1980). Decision-consistency approaches. In R. A. Berk (Ed.), Criterion-Referenced Measurement: The State of the Art. Baltimore, MD.: John Hopkins University Press, '29'35- Subkoviak, M., Muff, K. (1986). Intrajudge inconsistency in the Angoff and Nedelsky methods of standard setting. Paper presented at the annual meeting of the National Council on Measurement in Education, San Francisco. Swaminathan, H.; Hambleton, R. K.; and Algina, J. (1974). Reliability of criterion-referenced tests: a decision-theoretic formulation. Journal of Educational Measurement, 11, 263-67. Tuckman, B., Nadler, F. (1979). Local competency standards vs. state standards and their relation to district socioeconomic status. Paper presented at the annual meeting of the National Council on Measurement in Education, San Francisco. van der Linden, W. (1982). Passing score and length of a mastery test. Evaluation in Education, 5, 149-164. van der Linden, W. (1982a). A latent trait method for determining intra-judge inconsistency in the Angoff and Nedelsky techniques of standard setting. Journal of Educational Measurement, 19, 295-308. Wilcox, R. (1976). A note on the length and passing score of a mastery test. Journal of Educational Statistics, 1, 359-364. Wilcox, R. (1979). On false-positive and false-negative decisions with a mastery test. Journal of Eduational Statistics, 4, 59'73. Wilcox, R., Harris, C. (1977). On Enrick's "An Evaluation Model for Mastery Testing." Journal of Educational Measurement, 14, 215-218. Yalow, E. S., and Popham, W. J. (1983). Appraising the pre- professional skills test for the state of Texas (Report No. 5): Standards advisors performance standards recommendations. Culver City, CA.: IOX Assessment Associates. 167 Zieky, M., Livingston, S. (1977). Manual for SettingStandards on the Basic Skills Assessment Tests. Princeton, NJ: Educational Testing Service. I 441141011111