I D" “'2'." " "t. (2". U8,“ T3'M "'W i u 333,1133 33 . 12,33.“ 3,,9 331,33 3'§I~3'33P33333""""33" “3,3,, 3|,“ ,, {‘3 “1533,? Eff,” :3 .133 EDW'! $11,, :Zi‘m 3:33:33 33,.“ 3.: "Z 333‘: "3 ""' ‘ , 11'3": 'l,,,"3133; 1‘:3"'.'E?""' 3,3'1'3'3'3333 1,. :l l , 5mg], ,1,“ (3:23“, I :Ffl‘i‘ 1:? $3,, ",3 3:3 " 3. ' I, . Lki'j' lg“! !‘I,, i.‘ y, “:5, ,EW 1 'A'HK' ,v 'h' ,,,§ l,, '3‘] fl'liJ‘f .3». . ”LE:- ... fq‘iz." - t ‘ '3',“ ,‘Wx ,_ to .1. 5. g' ufifig§vs3%Z3érfinfipfi3"33‘w,3, op 35:: 3‘45 :- ’3": A :‘l . qr i. ‘3. ' 3' ; 3,1333% , 3. ‘Z'3 532‘ l '3'3‘ '13 I??? .- 3?" §!":' “Fifls'gn ":pb’ Egigo .z;i.:s ..., . . 'zz‘m'f HEW", 3,3,, I. :3! “9‘“ 13,1333}: 3333'?k‘I'E""-3:zfil~3, . -.J 7 33, . . i 3 <‘ '3 ‘3 33~’33332:43: aq . 3,: :b"33333 ;;3.33 333”" 3 1.33 333131 .333 "3.33.3333 3333333. 3:..33’13323‘233' 233333.3’3‘33333 3 3; 3-7. 3:. .. 3-3323 Z3 3313.3 323 3 x3333, .333 3:333:32. 1333 3., 3'33 '3'"“333' '39'b3hwt m33333'~'33ZRW333 -‘; ,xL=,313.‘,, 33'5“”: 3‘5135,"'3:;'; 3333932333,:gt 333533 Hg, £133: Z 3‘ .., '331'35'33. r 3 333 f';"'?“33313‘ "3 333' ' . 5 3-3 33333333233 3333;333:3333: 4.3" 33,33 3331333233 ,.. “A ',.x~ 4:,“ “'58, "":gi,"" c,, :. "',",,, p, ,1; .d.’ 2...“; f 1' ~ , ...-.. ... ‘ ..- :7." -'?-‘-‘-v- . r -.mu- 3 i, 3 w ”...- 9 "Y‘ *‘d’t‘l ...- ....- oo - . —~ -o« .a i“— -— . J o .- < . '7.- -:d‘ "I. .. ”k“: .v-- .... av row-or - ”:1... 2...: "THE. -. w"! 1 .nm- ..h. o ...- .... 3'”. dc w' . _. _ . 4.3.2. . .an ',,, , 3.33335'333333333 ,v- v. . .01... a I. v ' “4.77 J.." . $1.": ,.._ .... j'r.‘ ..." 2.... a. ..- r ...—3v 49'..- «:3 M. .I“ ”r.-.- J I a "OWIIO .3 n ~ «00-- .. 3‘ .z' -. 111$” ——.— ...-3:)" our . . xv.» -..-emu '~ ., .... ‘ . “it 4:. n. ,5 0., '3: i, g' "‘2" -' ib't'd H I “3:6" ,:,P 'I} 'l‘, I “3' "3:33. ‘ L'l' 'i” :2“ ii“ 3.,“ '33:“ 31333333335;.33333'3‘33'33‘3313333 '3" 33,3531“ ' 'E'M ["3 fig," "— 2' " I , - A ,_ ..., , - g... ...':H.MI . ’3’...-—.~ a." ”O “‘0’ o .— F- n ... , n . goo-w r -.-a a or. > ...-4 arm—«v u n— n u .3 grade 10 test, where each row represents a content strand, each column a process strand, and the number in each cell represents the number of items testing each pair of content/process skills. Concept- Rental Bsti- Coupu- Appli- ualiza- Arith- nation tation cations tion letic whole Nunbers 2 1 1 0 0 Fractions 8 2 7 2 12 Geo-etry 9 o 2 o I Measurement 6 0 2 0 6 Statistics 5 1 I 1 3 Algebraic Ideas I 0 3 5 7 Problel Solving 18 Calculators 6 Figure 1.1 Framework for Grade 10 MEAP Mathematics Test Most items fall within both content and process strands. Some items, however, were not separated into content and process 4 (namely, those items measuring either Problem Solving or Calculators), and are classified only as content. In May, 1991, a group of mathematics teachers and teacher educators met with members of the MEAP staff to develop a common understanding of a ”marginally capable candidate'" (MCC). These educators made predictions about how this hypothetical group of students would perform on each individual test item (cf. Appendix A for specific details about the modified Angoff standard-setting activity). Prior to receiving the test results, these same educators, together with other teachers trained by them, made judgments about the performance of their students who had just taken the tests. The analyses described in this study compared actual student performance with mastery-state judgments made by teachers. This was viewed from two different perspectives. 
First, the test performance of a large sample of students whose scores were near the "passing" score (and therefore represented marginal performance) was compared with the predicted performance of marginal candidates as hypothesized by the content area specialists. Then, the teachers' estimates of how each student from their own classroom(s) would perform were compared with the students' individual test results. (A Marginally Capable Candidate is defined as a just-barely-qualified examinee who demonstrates the minimum acceptable competence in the domain being tested.)

Research by Jones (1987) and Saliba (1990) suggests that judges modify their predictions after being shown actual item difficulty data. Their predictions of student performance on test items are generally modified to conform more closely to the empirical data presented. As a result, the between-judge variance of the ratings is reduced after reconsideration. Busch and Jaeger (1990) have shown that there is no consistent pattern of increasing or decreasing the panel-established cut scores after empirical data are presented.

There are other factors which create variability in the standard-setting process besides inconsistencies between the predictions of judges and the actual performance of examinees. These include disagreement among judges (inter-judge variability) and inconsistencies within the predictions made by each judge (intra-judge variability). The work of Hunter and Schmidt (1990) suggests that one source of discrepancies between ratings of student performance and ratings of item difficulty is the tendency of judges to assess student performance based upon characteristics not pertinent to the rated skill, such as dress and behavior. In addition to the above factors, drift affects judges, and fatigue affects examinees on long tests. Also, when a test covers a diverse set of skills such as those represented by the eight content strands of the MEAP mathematics tests, an additional consideration is that judges' ratings of item difficulties may be affected by the type of halo effect which involves misconceptions among the judges about the relative difficulties of the various skills.

With these concerns and considerations in mind, the following questions were posited. How internally consistent are the predictions of individual judges? How consistent were the judges' ratings with the actual performance of "marginally capable" students (i.e., students whose scores were within a standard error of measurement of the panel-established cut score)? How do judges' predictions of item performance compare with those made by other judges? How do judges' predictions of the test performance of their own students compare with the students' actual performance? How do judges' predictions about the relative difficulties of the test strands compare with actual test strand difficulties?

A second set of questions concerned the examinees themselves. Does student performance on the various test strands differ among groups which have received different instruction on the strands? Do students show fatigue, as indicated by decreased performance on the last half of the test items? Similarly, do judges demonstrate drift by over- or under-predicting examinee success on items which appear later in the test compared with those appearing earlier?

By considering these questions, patterns have been observed which should help to explain the mechanisms which are in effect during the standard-setting process.
By isolating and describing these mechanisms, some recommendations were identified for consideration when judges are faced with the task of setting a passing score on a test.

Chapter 2. Review of the Literature

Norm-Referenced and Criterion-Referenced Interpretations

Two methods are commonly used to interpret scores from educational tests: norm-referenced and criterion-referenced (Jacobs & Chase, 1992, pp. 10f; Mehrens & Lehmann, 1987, p. 14). Norm-referenced interpretations are based upon "adding meaning to a score by comparing it to the scores of people in a reference (or norm) group" (Mehrens & Lehmann, 1987, p. 15). Criterion-referenced interpretations, on the other hand, are based upon comparisons of individual performance against an absolute standard which has been established for "some specified behavioral domain or criterion of proficiency" (p. 15). The process of making norm-referenced interpretations is relatively straightforward: the test developer selects an appropriate norm group, administers the tests, and then uses the results to compare examinees with the norm group. There seems to be a broad consensus on how to interpret norm-referenced test scores, particularly since there are standard methods for computing such derived scores as percentile rankings and grade equivalents, which have become widely accepted among test users, with notable exceptions such as Cannell (1987).

Millman and Greene (1989) prefer the term domain-referenced, since interpretations will "represent information within a specific domain about the absolute level and particular strengths and weaknesses of examinees' performance" (p. 341). One form of a domain-referenced test which is not addressed in this study is objective-referenced.

Criterion-referenced interpretations, on the other hand, are complicated by the fact that usually only two results are of interest: Did the examinee demonstrate mastery of the domain, or not? (On some occasions there has been interest in multiple cutoffs, as has been the case with the National Assessment of Educational Progress (Forsyth, 1991), but in most cases the primary interest has been dichotomous.) The process of establishing this "absolute standard" causes much concern among those who use the test results for making decisions about examinee performance. Much of the dissension among psychometricians and test users results from differing views and confusion related to the purpose(s) of testing. For example, is the function of the test to document growth, or to make mastery decisions? If it is the former, then there is no need to establish arbitrary passing scores, since the scores themselves provide a greater amount of information. If it is the latter, then how does one differentiate between a master and a nonmaster on the basis of a single test score? Rowley (1982) observed that, "If 300 whiskers makes a beard, then do 299 not make one?" It would certainly be impossible to differentiate between the two cases, since they are part of a continuum which precludes clear separation into "bearded" and "non-bearded." The same can be said about performance on a test.

There are several critical steps to be considered when developing tests which will be used for making norm-referenced or criterion-referenced interpretations about examinee performance. The first is to establish the objectives to be measured by the test. Next, a table of specifications is developed to be followed by item developers in constructing the test.
The third step is to try out the items with a sample of pupils for whom the test was developed and verify the psychometric properties of the items to ensure that they validly and reliably measure what they purport to measure. The procedures for accomplishing these steps are well established and generally agreed upon by the measurement community (Crocker & Algina, 1986, Chapter 4; Ebel, 1972, Chapter 5; Gronlund & Linn, 1990; Mehrens & Lehmann, 1987, pp. 226f; Thorndike, 1982, Chapters 2-4).

An additional critical step is to establish an acceptable passing score in those situations where a test is to be used for making decisions about mastery. This is perhaps the most tenuous part of the entire test development process, since there is no single correct way to do it, and its outcome determines how many examinees will achieve "satisfactory performance." The situation is further complicated by a lack of consensus about what is meant by a "valid passing score" (Mehrens & Lehmann, 1987, pp. 126f). Nitko is critical of the idea of setting a passing score on a criterion-referenced test, stating that "One confusion about criterion-referencing is the misconception that the term means using a cut off score or a passing score" (Nitko, 1984, p. 21; emphasis in original).

Beyond the purely technical issues surrounding the establishment of performance standards, there remains the question of whether a single score on any test instrument can be relied upon for making important decisions. Shepard (1980) observed,

With a good test, valid distinctions can be made between those who are well above or well below the standard, but pass-fail decisions near the cutoff will have poor validity because a continuum of performance has been "arbitrarily" dichotomized (p. 448).

The fundamental problem is highlighted by Johnson and Zieky (1988), who noted,

A score ... [represents] a probable range or band of scores, rarely provided to the test taker. We know that the size of the band is a function of the reliability of the test, which we can and should estimate. But there is still the question of validity --- is the score an appropriate, meaningful measure of the construct for the person tested and the purpose of the testing (p. 3).

Clearly, it is not necessary to select a "passing score" to dichotomize the results of a test when the test is only being used to obtain diagnostic information about an individual's knowledge, skills and abilities, or when the results are not intended to be used for making mastery decisions about individual examinees. These activities are part of formative evaluation, where the decisions to be made relate more to progress toward a goal than to final judgments of mastery. The issues become more important when the stakes are higher: for example, after many years of training, a candidate is being tested for a license to practice the profession for which the training was taken (Shimberg, 1981). The expanding use of state-administered tests for making high school graduation decisions has attracted attention to the debate over the importance and relevance of testing (Rudman, 1985, p. 28). Because of legal, technical and political issues, many critical factors are involved in standard-setting under various conditions. Complex checklists have been developed to be used as aids in the standard-setting process (Arrasmith & Hambleton, 1987; Hambleton & Powell, 1983).
Even those who concede that tests are generally valid predictors of job performance have criticized the sole use of test results for making selection decisions, since members of dominant social groups tend to perform better on typical employment examinations than members of minority groups (Hartigan & Wigdor, 1989). This is particularly acute when a cutoff score is used. As a result, tests have been criticized as being biased and/or unfair to minorities. This is primarily a political issue for which there is no simple technical resolution. Many approaches for settling the controversy have been proposed (Cleary, 1968; Cole, 1973; Thorndike, 1971), but all have been criticized on practical as well as theoretical grounds (Novick & Ellis, 1977).

Issues related to standard-setting have gained additional interest because of the current movement toward using "performance assessment" as a substitute for paper-and-pencil tests (Mueller, 1991). Compared with traditional objectively scored tests, performance assessment is especially sensitive to problems of between-rater and within-rater variability. The validity and generalizability of the results have also been called into question (Fremer, 1991; Linn, Baker and Dunbar, 1991; Mehrens, 1992), and their use on state-mandated tests has been challenged on numerous grounds (Beck, 1991; Phillips, 1992).

Establishing standards for performance assessments involves several complicating factors in addition to the usual psychometric elements entering into an objectively scored paper-and-pencil test. One of these factors is task complexity, where the response to a prompt can take on many different forms which express multiple levels of information processing and use. Although this is considered to be one of the advantages of performance assessment, it does contribute to the difficulties in developing reliable and valid scoring. Another is cognitive demand, where multiple facets of a problem must be held simultaneously in the mind of the examinee during the performance. Observer bias, leniency, and rater drift also play a part in performance assessment. Performance assessment is uniquely subject to the "halo effect," in which the rater tends to appraise diverse characteristics as if they were common attributes. Supervisor ratings of employee productivity may, for instance, be affected by irrelevant characteristics of the examinee such as dress and manner. De Meuse (1987) stated that "the effects of three classes of non-verbal variables (demographic cues, physical appearance, non-verbal behaviours) on performance appraisal ... are significant and varied" (p. 207). Hunter & Schmidt (1990) noted that the "idiosyncrasy of that supervisor's perceptions is a part of the error of measurement in the observed ratings. Extraneous factors that may influence human judgment include friendship, physical appearance, moral and/or life-style conventionality, and more" (p. 65). Finally, Trevisan (1991) observed that performance assessment is affected by differing standards which can vary over time. Even if rigorous scoring guides are developed, there is no assurance that they will be carried out reliably by different raters. Developers of performance assessment have attempted to minimize extraneous variance, and thereby increase reliability, by producing detailed scoring protocols. However, the more the scoring criteria are tightened, the more the assessment becomes restricted and subject to the same criticisms as its multiple-choice counterpart.
O’Leary & Hansen (1983) report that experience in the field of employee performance assessment, which has been standard practice for decades, does not provide much hope for objectivity in the judgment of mastery. E l] . S! 3 3'5 !!° As the use of test scores has become increasingly common in screening and selection, the acceptability of using test scores as indicators of success or failure has come under much debate (McAllister, 1991). Even when highly trained professionals are involved in establishing the passing scores 15 for these tests, it is difficult to obtain agreement.as to how this should be done.“ Although standard-setting methods have received considerable attention, as yet no consensus has emerged as to which method or approach is most appropriate. In his paper on standards, Glass (1978) criticized the "common notion ... that a minimal acceptable level of performance on a task can be specified" (page 237) . He supported this criticism by stating that ”the language of performance standards is a pseudo- quantification, a meaningless application of numbers to a question not prepared for quantitative methods” (page 238), and added further that the result is "to ask for greater precision than the circumstances permit" (page 258). Since scores and performance exist on a continuum, there can be no uniquely defined "scientific" cut point above which everyone is a master and below which everyone is a nonmaster. If a dichotomy is demanded, it must be established arbitrarily. In the introduction to his report on standard-setting procedures, Cramer (1990) observed that "In spite of much 'The debate over defining unique criterion-referenced proficiency levels in the National Assessment of Educational Progress (NAEP) is a case in point, where experts and the public experienced great difficulty in understanding or agreeing upon what was meant by terms such as "basic," "proficient,” and "advanced" (FairTest, 1991). Forsyth (1991) is especially critical of the NAEP science scale, which he describes as having "purported" criterion-referenced characteristics since the test "mixes dimensions" in "an ill- defined domain" which "can lead reporters, legislators and even professional educators to draw very questionable conclusions from the NAEP results" (pages 5f). In other words, if a test domain is not well defined, it is not appropriate to use the test to assess proficiency. 16 research in the area, we are still far from agreement on the ’right’ way to perform this increasingly important task" (p. l). Halpin & Halpin (1987) summed up the situation well: Given that different standards result when different methods are used, standard setters are left with a fundamental, unsolved problem: They must decide how best to set standards. Unfortunately, at present, there are little or no scientific grounds for choosing among different procedures. Needed is research indicating how well the different standard-setting methods serve their purpose which, ideally, is to separate the masters from the nonmasters, to pass those who are qualified and fail those who are not . . .. (p. 977) Plake and Melican (1989) isolated the essence of why there is such divergence in establishing standards for tests. "Judgmental standard setting methods are, by definition, subjective evaluations by .... specialists about the test (or item) performance of minimally competent candidates (MCCs)" (p. 45). 
When the test developer has been asked to "draw a line," its placement will always be open to criticism because extremely small differences in scores around the cut-off will cause one examinee to pass while the other will fail. A logical question might be, if a student who answers 70% of the items correctly on a test is considered to be a "master" of the subject, is it reasonable to say that one who receives a score of 69% correct is a "nonmaster"? (Rowley, 1982). Shepard (1980) suggested establishing three zones of mastery: those who are clearly masters, those who are clearly not masters, and those whose mastery state cannot be established by the test score. Shepard's suggestion was explored in this research.

Glass (1978) found flaws in six common methods used for determining the criterion in a criterion-referenced test. Logical perfection aside, passing scores are still demanded by many educators and policy makers who wish to make decisions about individual or group performance in a variety of settings. In such situations, there are many issues which enter into establishing acceptable passing scores. Some of these include: (1) What percentage of examinees "should" be expected to pass the test? (2) What are the relative costs of failing someone who "should" pass, as opposed to passing an examinee who "should" fail? (3) How important is it to establish a high (or low) standard?

Rudman (1985) criticized the tendency of standard setters to identify minimums which eventually become maximums as educators orient their teaching toward the narrow content of the tests, and argued for setting multiple standards which would recognize excellence as well as the achievement of minimal competencies. When faced with a situation in which standards are demanded, there remain numerous dilemmas which frame the standard-setting situation. If a high percentage of examinees achieve the standard, it might be argued that the test is too easy. If a large proportion of examinees fail the test, some will charge that the test is unfair or arbitrary. In determining the relative costs of "false positives" or "false negatives," how does one determine the utility of someone being denied entry into a profession because of a spuriously low test score, or the cost of having someone admitted into college who does not possess the skills necessary for success in the program of study? Numerous studies have demonstrated that judges can set reasonably consistent passing scores using one of several commonly used methods, when provided with consistent instructions. Across methods, however, there is typically a wide range of results (as will be explored later).

State and Continuum Models

Although it is difficult to conceive of a measurement of an important competency which does not result in a range of scores rather than a simple dichotomy, standard-setting methods have been classified as "state" or "continuum." State models view competence as all or nothing: the examinee either has the skill or does not have it, as in the determination of whether or not a normal child can take a step, speak a word or ride a bicycle. Macready and Dayton (1980, pp. 494f) discussed the relative merits and limitations of state models, noting that their application has been made difficult because of errors attributed to guessing, forgetting, cheating, and differential cognitive processing brought about by variations in learning, such as rote memorization and nonhomogeneous domain specifications.
Continuum models are based on the observation that most skills vary over a wide range. This makes the determination of a "passing score" logically impossible, since there is virtually no difference between an examinee at any given score and an individual who correctly answers one item more (or less). To be perfectly defensible, the establishment of a standard on a continuum-type competency requires that the items of the test can be ranked in the same order of difficulty for every individual tested, as in a Guttman Scale (Thurstone, 1928; Gardner, 1962), so that "mastery" can be defined as the score which demonstrates that the skill has been attained. Although the continuum model leads logically to the conclusion that "mastery" cannot be uniquely determined, there are social and political reasons for pursuing the task of setting standards for acceptable performance on tests, and this has led psychometricians to attempt to find reasonable and practical methods for establishing such standards.

Berk (1986) identified at least 38 different approaches which have been used to set passing scores, which he categorized into three levels according to: (1) the acquisition of the underlying trait or ability (state or continuum), (2) the methods (judgmental, empirical, or mixed), and (3) the major aim (e.g., to set a new standard or to adjust an existing one). He found that 30 of the 38 approaches were of the continuum type; 17 of the continuum models were variants of the approaches developed by Angoff (1984) and Ebel (1972); and 10 of those relied strictly on expert judgment.

Cross, Impara, Frary and Jaeger (1984) observed that "all methods that have been proposed for setting standards can be classified ... into two groups, those that are based upon judgments of test content, and those that are based upon the test performance of known groups of examinees" (p. 114). Hofstee (1983) argued that ultimately all standards are relative and are based on a real or fictitious norm group. One could set an absolute standard (one established based upon judgments of test content, independent of considerations of actual examinee performance) which could be so high that none would succeed or so low that all would pass. Conversely, a relative standard (one established based upon the test performance of known groups of examinees, independent of considerations of test content) could be established based on the percentage of passes desirable. Methods which use a combination of these approaches have been developed by various psychometricians (Beuk, 1984; De Gruijter, 1980, 1985; Hofstee, 1983). Hofstee's method used minimum and maximum acceptable cutoff scores and failure rates to determine a line which (ideally) intersects the curve established by the measured cumulative score frequency of the examinee group. Beuk's method introduced the variability of acceptable cutoff scores and failure rates as a means for determining the degree of judges' preference for absolute ratings. De Gruijter introduced Bayesian statistics by having each judge establish estimates of the uncertainty of the cutoff scores and failure rates proposed by that judge. All three of these methods represent compromises between a relative standard based upon empirical data and an absolute standard based upon judgments about test content.
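To make the mechanics of Hofstee's compromise concrete, the following sketch (Python; the function name, variable names, and data are hypothetical illustrations, not the procedures used in this study) intersects the judges' line with an empirical failure-rate curve. The line is taken, as the method is commonly described, as running from (k_min, f_max) to (k_max, f_min).

```python
import numpy as np

def hofstee_cut_score(scores, k_min, k_max, f_min, f_max):
    """Hofstee-style compromise cut score (illustrative sketch only).

    scores : percent-correct scores for the examinee group
    k_min, k_max : minimum / maximum acceptable cut scores (percent correct)
    f_min, f_max : minimum / maximum acceptable failure rates (proportions)
    """
    # Empirical failure-rate curve: f(k) = proportion scoring below a cut of k.
    ks = np.arange(0, 101)
    f_of_k = np.array([(scores < k).mean() for k in ks])

    # Judges' line runs from (k_min, f_max) down to (k_max, f_min).
    slope = (f_min - f_max) / (k_max - k_min)
    line = f_max + slope * (ks - k_min)

    # Take the cut score where the empirical curve comes closest to the line,
    # restricted to the range of cut scores the judges found acceptable.
    mask = (ks >= k_min) & (ks <= k_max)
    idx = np.argmin(np.abs(f_of_k[mask] - line[mask]))
    return ks[mask][idx]

# Hypothetical example: 1000 examinees; judges tolerate cut scores of 50-70
# percent correct and failure rates between 5 and 30 percent.
rng = np.random.default_rng(0)
scores = rng.normal(65, 12, size=1000).clip(0, 100)
print(hofstee_cut_score(scores, k_min=50, k_max=70, f_min=0.05, f_max=0.30))
```

Restricting the search to the interval [k_min, k_max] keeps the compromise within the range of cut scores the judges declared acceptable.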
Standard-Setting Methods

Various methods have been developed for determining appropriate cutoff scores for criterion-referenced tests (Angoff, 1971; Berk, 1986; Beuk, 1984; Hofstee, 1983; Jaeger, 1982b; Nedelsky, 1954; Livingston & Zieky, 1982). In a review of standard-setting methods, Hambleton, Powell, and Eignor (1979) identified approximately 30 different methods for setting cut scores. All of these approaches require experts to make judgments about (1) an absolute standard based on the expected performance of a hypothetical group of examinees on certain test items, or (2) a relative standard based on declaring as "masters" the highest "n" percent of the examinees (where "n" may typically be between 50 and 90 percent). Judgmental standard-setting methods require the deliberation of experts who are knowledgeable about both content and examinee behavior, and who have a clear understanding of the competency expected for the circumstances. When considering various judgmental standard-setting methods, it is necessary first to understand how various approaches work. Three major classifications will be described and discussed here.

Methods based on judgments about test items. Although a more complex method was developed by Ebel (1972) and used for several years, the Nedelsky (1954) and Angoff (1971) methods are currently the most commonly used for setting standards based upon hypothesized performance of "competent" examinees on test items (Jaeger, 1989). Between these two methods, Jaeger (1990) reported that "limited empirical evidence ... suggests that Angoff's method, more often than not, yields standards that are more reliable than those produced by Nedelsky's method" (p. 17).

In the Ebel method, judges are asked to specify two pieces of information about each item: its perceived difficulty (easy, medium and hard) and its relevance (essential, important, acceptable and questionable). The items are sorted "into cells using a two-way classification grid where the relevance and difficulty of the item are the two dimensions considered. Once all test items have been sorted, the items in each cell are considered by the individual judge or group of judges, and the proportion of items within each cell that should be answered correctly by an examinee who has achieved a minimum acceptable level of proficiency is specified. The product of this proportion and the number of items in each cell is calculated. The examination standard or passing score is then derived by summing these products across cells" (Andrew & Hecht, 1976, pages 46f).

The Nedelsky method is limited to multiple choice items, and requires subject matter experts to identify the distractors that a minimally competent candidate would be expected to eliminate. The reciprocal of the number of remaining options is the minimum passing level of the item. These estimates are then added to determine the passing score.

In their purest forms, the Angoff and Nedelsky methods require judgments about predicted examinee performance on test items and therefore lack a link between actual examinee performance and the levels of competence anticipated by the judges.

Methods based on judgments about examinees. Livingston and Zieky's (1982) Contrasting Groups method and Borderline Group method are frequently used for setting standards based on the performance of groups of specific individuals (Arrasmith, 1986). Because these methods are predicated upon judges' perceptions of the competence of actual examinees, they possess a kind of "face validity" and are, therefore, intuitively appealing.
"People in our society are accustomed to judging other people’s skills as adequatetor inadequate for some purpose ... . Therefore, making this type of judgment is likely to be a familiar and meaningful task" (Livingston & Zeiky, 1982, p. 31). Although there may be broad agreement that this task is familiar, there are many who would argue that it is not meaningful. These methods are no less judgmental than, e.g., the.Angoff or Nedelsky methods since they require the judgment of examinee competence by teachers or other expert observers. The experience of many years of developing performance appraisals for business and industry would indicate that such judgments are usually unreliable and easily confounded by characteristics of the judge and the examinee which are not 24 germane to the subject of the rating (Hunter & Schmidt, 1990: O’Leary 8 Hansen, 1983). one on - 9- (“g ‘11... ._ 0.”. 1‘0 ‘ ' ;. ‘ ”-1.: . In an effort to bridge the gap between judgments about test items and judgments about examinees, Hofstee’3 (1983), Beuk“ (1984), and De Gruijter“ (1985) developed newer and slightly more complex methods for setting passing scores. These are ”compromise” approaches, since judges must decide on ”In the Hofstee method, judges are required to specify the minimum and maximum percentages of failing examinees (fun and f..., respectively) along with the minimum and maximum acceptable jpercentages of items that. minimally' competent candidates should be able to answer correctly (km, and k_,, respectively). Upon a graph which shows the percentage of candidates who would fail at each given score, f(k), is superimposed a line which connects (k.,,,, f...) and (kw, fun). The intersection of this line and the graph of f(k) is the passing score. “In the Beuk method, judges are required to specify the knowledge level that a "minimally competent” candidate should possess, expressed as a minimum percentage of items answered correctly on the test, and the expected pass rate for that score, expressed as a percentage of the examinees passing. The means and standard deviations for both ratings are computed across all judges. These are denoted as km, and SR for the passing score and v“, and s, for the passing rate. Upon a graph which shows the percentage of candidates who would pass at each given score, v(k), is superimposed a line which passes through the point (vw, k...) with a slope sv/sk. The intersection of this line and the graph of v(k) is the passing score. ”In the De Gruijter method, judges are required to specify an ”ideal” passing score k. and a corresponding failure rate f0, and to estimate the uncertainty of each (uk and u,, respectively). Using the ratio r = u,/u,,, all combinations of k and f on the ellipse r’(k - k0)2 + (f - f,,)2 = d2 are considered equally plausible. All that is necessary is to determine which combination (f, k) on the empirical curve f(k) produces the smallest value of d, thereby ascertaining the value of k which is the "best" compromise passing score. 25 acceptable ranges of cutoff scores and then use actual test performance data to arrive at a "best” compromise. As a group, judges tend to over-estimate examinee performance in the absence of data from actual test administrations. As a consequence, when theoretical judgments about item content are combined with empirical results from real examinees, passing scores generally result which are lower than would have been obtained from purely judgmental methods, therefore producing slightly higher passing rates. 
For a variety of reasons, none of these compromise methods has achieved widespread use. The present study is an attempt to combine the strengths of both judgmental approaches (i.e., judgments about items and judgments about examinees), but to do so in a manner which is not as complicated as the compromise methods just described. The aforementioned approaches may be viewed, accordingly, as empirical (or examinee-oriented), theoretical (or test-oriented), and policy-based (or compromise) approaches to the standard-setting process. Since there is very little agreement among practitioners as to which approach is most acceptable, the research undertaken in this study weighed the strengths of each approach and assessed how they could be combined to optimize the setting of passing scores in the final outcome.

Although there are many methods for determining passing scores on tests, no single "best" method has emerged. The subjective nature of standard-setting procedures, along with the possibility that multiple standard-setting methods invoke different cognitive processes, can lead to unacceptable variations in outcomes. Comparative studies have shown that the failure rates resulting from cutoff scores derived using different methods can vary greatly. For example, on the Louisiana Grade 2 Basic Skills Test, the failure percentages ranged from 0% to 29.75% using the results from different standard-setting procedures (Mills, 1983). Because of such variability, Jaeger (1989, p. 500) recommends that standard-setters use different methods, and then set a standard which is a compromise among them.

Factors Which Affect the Establishment of Standards

Characteristics of the test. Any well-developed test is generally optimized for use with students who possess a relatively broad range of abilities. This ensures that the test will provide a maximum amount of information about the population for which it was designed. The distribution of test scores, relative to the region of maximum information for the test, can have an impact on test results. A test for which most of the items are too easy (or too difficult) for the examinee group will exhibit greater error of measurement than one which is optimized for the ability of the group. In a study of the effect of item difficulty distribution on the precision of measurement near the cutoff score, Julian (1985) found that the number of false passes and false failures was related to whether the test's area of maximum precision was above or below the cutoff score. "The easier test passed fewer persons who should have failed and the more difficult test failed fewer students who should have passed even when the tests had similar total error rates" (page 108). Depending upon the relative cost of false positives and false negatives, then, the test developer could reduce the more critical type of error by designing an easier or more difficult test.

Another test characteristic which affects standard-setting is its reliability. Since reliability sets an upper limit on the validity of decisions made using test scores, this is particularly important. Tucker (1946) showed that the maximum validity for a 100-item test occurs when the point-biserial correlation between items is less than 0.3 (p. 11). He demonstrated similar results for tests of various lengths.
Nunnally (1970) argued that "when items have low correlations with one another and each correlates positively with the criterion, each item adds information to that provided by the other items, and when scores are summed over items, a relatively high correlation with the criterion will be found" (p. 204).

Characteristics of the standard-setting method. Brennan and Lockwood (1979) stated that the Angoff method established a less variable standard than the Nedelsky method, indicating that judges using the Angoff procedure were in greater agreement than judges employing the Nedelsky procedure. Reaching a similar conclusion, Behuniak, Archambault and Gable (1982) reported that "judges using the Angoff procedure were in greater agreement than judges employing the Nedelsky procedure" (p. 254). They found this to be true for tests of both reading and mathematics. As noted by Harasym (1981), one of the weaknesses of the Nedelsky method is that, depending upon the number of options used, the p-values are limited: for example, on a multiple-choice test item with four alternatives the possible p-values are 1.00, .50, .33, .25 and 0. The Angoff method, by contrast, yields a continuum of p-values ranging from 1.00 to 0. This means that, in the Nedelsky procedure, all very easy items will be assigned a p-value of 1.00, and all other items will have p-values in a restricted range represented by numbers which are only moderately different from one another.

One consideration was whether mathematical learning develops along a single dimension, with the strands acquired in a presumed sequence (... -> algebra -> geometry). This could be verified by comparing the developmental scales across the grades tested (i.e., Grades 4, 7 and 10). A second consideration was how the performance predicted by the judges on the strands compared with the actual student achievement measured by the tests. If these differ, then the judges' ratings are at odds with the unidimensional assumption. One explanation for such an occurrence might be that an instructional effect has altered the presumed unidimensional nature of student development. If, on the other hand, mathematical learning proceeds independently along many strands simultaneously, it would be necessary to examine development along each strand independently. In order to explore the existence of as many as 8 unique strands of mathematical learning, Phase III of the analysis included a factor analysis of the student test results. In the earlier pilot testing of the MEAP instruments, factor analysis determined that there was only one major strand of mathematical development (Rigney, 1990). Since the data used for this study came from a larger, more diverse population than was included in the pilot testing, and since "weak" or ambiguously worded items from the pilot were omitted from the final forms of the tests, it seemed possible that a more complex factor structure might emerge than had been apparent during pilot testing.

Reliability and validity of the judges' ratings. Pearson product-moment correlations between judges' individual predictions of item difficulties and the consensus predictions provide an assessment of the reliability of the standard-setting process. Also, if the ratings established by one judge have a much higher correlation with the consensus ratings than those of the other judges, it might indicate that this judge wielded a higher level of influence in the consensus process than the other judges. Correlations between the consensus ratings and actual p-values obtained by marginally capable candidates, along with correlations among individual judges' ratings, represent a measure of the validity of the ratings.
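A minimal sketch of these reliability- and validity-type correlations follows (Python). The arrays are hypothetical stand-ins for the judges' predicted p-values, the empirical p-values of the marginal group, and the consensus ratings; here the consensus is approximated as the mean of the individual predictions, which is an assumption made only for illustration (in the study the consensus was set by the panel itself).

```python
import numpy as np

# Hypothetical data: rows are judges, columns are test items.
judge_preds = np.array([
    [0.70, 0.55, 0.40, 0.85, 0.60],   # judge 1's predicted p-values
    [0.65, 0.50, 0.45, 0.80, 0.55],   # judge 2
    [0.75, 0.60, 0.35, 0.90, 0.65],   # judge 3
])
consensus = judge_preds.mean(axis=0)                    # stand-in for panel consensus
empirical = np.array([0.68, 0.47, 0.52, 0.83, 0.58])    # marginal-group p-values

for j, preds in enumerate(judge_preds, start=1):
    r_consensus = np.corrcoef(preds, consensus)[0, 1]   # reliability-type evidence
    r_empirical = np.corrcoef(preds, empirical)[0, 1]   # validity-type evidence
    print(f"Judge {j}: r with consensus = {r_consensus:.2f}, "
          f"r with marginal-group p-values = {r_empirical:.2f}")
```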
Angoff (1991) noted that, in the absence of a clear criterion, differences between standards set by judges and performance by actual "marginally capable candidates" do not necessarily indicate invalidity. The judges had been instructed to rate the items based on the assumption that the students had received several years of instruction consistent with the objectives which underlie the tests. If instruction has any value in promoting learning, then it must be assumed that instructional differences can affect the order in which cognitive skills develop in students. The extent to which judges use their own students as models for the "marginal" examinee is not known, but it would be reasonable to assume that experience with "real" students would have a strong impact on the judges' ratings. When the predictions and actual performance do not agree, then, the differences could be a result of erroneous judgments about (a) how hypothetical "marginally capable" individuals will perform on the items, (b) how individual "real" students will perform on the test, or both. In the absence of a clear criterion, it is difficult to determine if the source of the invalidity is in the judges' perceptions of the students, or the students' perceptions of the items. The judges' predictions of the performance of students in their mathematics classes served as the criterion used in this study, because these predictions were based upon the actual performance of real students, while the predicted p-values were based on hypothetical marginally capable students. This decision strongly affects the conclusions reached in this study since, having established a clear criterion, it is now possible to examine possible sources of invalidity.

Comparisons were made using data collected from judges. In cases where a judge was not teaching a grade in which the tests had been administered, the judge was asked to select for inclusion in the study one or more teachers (of Grade 4, 7 or 10) who had been trained to understand and use the instructional model upon which the tests were based. The teachers made mastery/nonmastery classifications of their students prior to receiving test results. The tests also made mastery/nonmastery classifications of the same students. These sets of results were compared using the chi-square test. These same data enabled a comparison between the modified Angoff method employed in the MEAP standard-setting process (cf. Appendix A for a description of the approach followed) and Livingston and Zieky's (1982) Borderline Group and Contrasting Groups methods.

For this analysis, modifications of the customary Borderline Group and Contrasting Groups methods were used. Instead of Borderline Group and Contrasting Group frequency plots, cumulative frequency plots were used, as follows:

(A) It is common to select the mode of the Borderline Group when this method is used to establish a cut score. The mode is difficult to justify if the distribution of the scores is not smooth (Mills, 1983). This can be especially problematical if the data are multi-modal. For the modification of the Borderline Group method, the median was selected as the cut score.

(B) The intersection of the frequency plots for the "master" and "nonmaster" groups is selected as the cut score in the traditional Contrasting Groups method. This equalizes the false positive and false negative errors at the cut point only.
Selecting the appropriate intersection is complicated when multiple intersections of the frequency plots occur (Mills, 1983). Livingston and Zieky (1982) suggested calculating conditional probabilities of mastery for each raw score and smoothing the resultant probability function by hand. The cut score is then taken at the point where the probability of mastery is 50 percent, thereby equalizing the false positive and false negative errors. It seemed desirable to eliminate the subjectivity of this smoothing process. Therefore, for the modification of the Contrasting Groups method, the cumulative frequency of the non-mastery group was plotted beginning at the highest score and proceeding down to the lowest, whereas the cumulative frequency for the mastery group was plotted beginning at the lowest score and proceeding up to the highest. The point at which these curves intersected was selected as the cut score, thereby equalizing the false positive and false negative errors.

Evaluation of the test items. In a classroom testing situation, the teacher should not be interested merely in finding out how well the student performed on the particular test which was used. The more interesting (and important) question is, "How well can the results of this narrow test be generalized to the broader domain of interest?" For instance, if a student can correctly answer 10 out of 10 items on a computational test, does this mean that the student has mastered 100% of the skill which is called "math computations"? When the Michigan Educational Assessment Program developed the items for the mathematics tests which were used as the basis for this study, a sincere effort was made to represent a wide spectrum of mathematics content and ability. The item-writers were not merely interested in how well the students could perform on a particular sample of items, but rather in how well the students could apply their knowledge about mathematics to a wide range of situations.

A well-developed test can be used to make inferences to the broader domain of interest. Such a test must contain items which are psychometrically sound. For example, when the scored responses (1 = "correct" or 0 = "incorrect") to an item correlate positively and significantly with the total test score (i.e., when the point-biserial correlations are high), the item is apparently measuring the same trait as the rest of the test. Items with very low (or negative) point-biserial correlations are often viewed as problematical, since their primary effect is to contribute "noise" (measurement error) to the test results. Nunnally (1970) and Tucker (1946) both alluded to the notion that the optimal point-biserial correlation should be approximately 0.3; higher values indicate that the item does not contribute anything new to the rest of the test, whereas lower values indicate that the item is measuring a different trait from the rest of the test.

In the analysis of the items used in this study, statistics characteristic of both classical test theory and Item Response Theory were used. From the perspective of classical test theory, it seems reasonable to ask whether judges should be expected to make valid predictions about examinee performance on items which have extremely low empirical p-values and/or very low (or negative) point-biserial correlations. If an IRT model is being used in the analysis, extremely poor fit statistics may indicate that an item is not functioning properly and therefore should be excluded from the test, exempted from the standard-setting process, or weighted low.
(Poor fit may also be an indication that the assumption of unidimensionality is being violated by the item; this could mean that instruction has altered the trait acquisition, or that the item itself is not part of the trait represented by the rest of the test.) If these are the best items available to represent important content in the test, it may be necessary to keep them in the item pool. In the standard-setting process, however, when the statistics for an item are poor, the following procedure is suggested (Mehrens, 1993): (1) make a rank-order list of the items, based upon their empirical p-values; (2) note the items which, on the list, are just above and below the item with poor fit statistics; and (3) compute the average predicted p-value for these two adjacent items. This average should be used instead of the judges' predicted item difficulty in determining the passing score using the modified Angoff method.

In his paper on the internal consistency of judge performance, van der Linden (1982) noted that, aside from latent trait theory, there is no method for measuring the consistency of a judge's prediction of marginally capable candidate performance on test items, because item difficulty is a nonlinear function of examinee ability. Examinee responses to a Guttman scale, for example, are extremely nonlinear when plotted against examinee ability. Even within the sphere of latent trait analysis, if responses to an item do not fit the measurement model, there is no way to specify the relationship between the probability of a correct response and the ability of the examinee. In the tests used in this study, several of the items had IRT fit statistics which were extremely poor, indicating that the responses to the items were inconsistent with what would have been expected given the overall performance by examinees on the test. There were items, for instance, where the judges predicted that 60% of the examinees should get the correct answer, whereas only 10% actually did so. Since 10% is considerably below the chance level (25% for an item with four response choices), it is clear that the examinees viewed the items quite differently from what the judges had anticipated. On the other hand, there were items which the judges thought would be missed by 25% of the examinees, but which were missed by only 5% in the actual test administration. In many cases, very poor fit statistics indicated that the student responses were essentially unpredictable.

There are those who reason that "tests should lead curriculum and instruction," and therefore that "standards cannot be based on current performance" (Rigney, 1992). On the other hand, standards which are too remote from current performance may be subject to the criticism that they are excessively arbitrary and judgmental. In a discussion of the development of scales in psychology, Thurstone (1928) observed that the concept of measurement requires the existence of an appropriate scale of measurement, where the elements of the scale form an ordinal set of attributes to be assessed. He went on to note,

If the scale is to be regarded as valid, the scale values of the statements should not be affected by the opinions of the people who help to construct it. This may turn out to be a severe test in practice, but the scaling method must stand such a test before it can be accepted as being more than a description of the people who construct the scale.
At any rate, to the extent that the present method of scale construction is affected by the opinions of the readers who help sort out the original statements into a scale, to that extent the validity or universality of the scale may be challenged (pp. 547f).

Since the purpose of a test is to place the examinees on a scale, the words "items" and "judges" can be substituted into the above quotation in place of "statements" and "readers." In practice, the judges' ratings of the perceived difficulties of the items could be used to sort the items into an order of ascending (or descending) difficulty. These orderings could be compared both within and across judges to determine if the items appear to be rated consistently. Residuals between estimated p-values which exceed, say, one or two standard errors of measurement could be used to detect "misfitting" items, since their difficulties are affected too much by individual judges' opinions. Eliminating (or substituting empirical data for) judges' predictions for severely misfitting items could increase the probability that judges can set more consistent (and, therefore, potentially more valid) standards for tests.

Correlations Among Judges' Ratings

There are two major purposes for correlating the ratings established by a panel of expert judges. One is to investigate possible halo effects, i.e., to determine if the judges were influenced by factors other than the perceived difficulties of the items being rated. Halo is observed in "what seems to be inflated correlations between dimensions of ratings ..., i.e., correlations higher than warranted by actual ratee behavior" (Borman, 1983, p. 128). Another reason for correlating ratings is to assess the impact of systematic distortions in the judges' ratings, caused by "the absence of relevant cues for a rater" which leads to "nonrandom distortion in the direction of semantic, 'what goes with what' relationships between dimensions" (Borman, 1983, pp. 133f). For instance, judges may assume that students who do well on algebra-related items will also do well on those related to geometry, since students tend to take these courses as a sequence. Although the skills required to perform well in algebra and geometry may prove to be highly correlated, these subjects are not necessarily part of the same unidimensional trait, since one involves relationships among numerical quantities and the other involves relationships among spatial figures. Depending upon the extent to which a judge's views of reality are systematically distorted, there may be situations in which the responses of examinees to the test items will not be in agreement with expert predictions.

Inter-rater reliability is considered to be essential in assessing rater performance. Brennan and Lockwood (1980), in discussing the relative merits of the Nedelsky and Angoff methods, observed that "the validity and practical utility of these approaches, and similar approaches, for practical decision making may rest heavily upon the extent to which raters agree in their judgments" (p. 220). Several methods for comparing ratings between and among judges have been described in the literature. The most common of these involves computing the Pearson product-moment correlation (Ebel, 1972, p. 411) and/or the rank-order correlation of the ratings (Guilford, 1951, p. 395).
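A brief sketch of these two agreement measures is given below (Python; the judges' ratings are hypothetical). The same pairwise correlations, once averaged, also feed the Spearman-Brown reliability estimate discussed later in this section.

```python
import numpy as np
from scipy import stats

# Hypothetical predicted p-values from three judges for the same five items.
ratings = np.array([
    [0.70, 0.55, 0.40, 0.85, 0.60],
    [0.65, 0.50, 0.45, 0.80, 0.55],
    [0.60, 0.58, 0.35, 0.90, 0.62],
])

n_judges = ratings.shape[0]
for a in range(n_judges):
    for b in range(a + 1, n_judges):
        pearson = np.corrcoef(ratings[a], ratings[b])[0, 1]      # linear agreement
        rho, _ = stats.spearmanr(ratings[a], ratings[b])         # rank-order agreement
        print(f"judges {a + 1} & {b + 1}: Pearson = {pearson:.2f}, Spearman rho = {rho:.2f}")
```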
High correlations among judge ratings could indicate either that the judges agreed substantially about the relative difficulties of the items, or that their ratings were affected by some factor(s) other than perceived item difficulty. Judges who have access to actual test results, for example, may be influenced by the empirical data, thereby resulting in higher correlations between judges (Saliba, 1990, pp. 70f; Jones, 1987, pp. 51-55). "Judges presented with item difficulty indices moved their item reevaluations in the direction of the p-values. This trend was uniformly observed across groups of judges and across the majority of the test items" (Saliba, 1990, p. 97). Similarly, judges who collaborate in rating the item difficulties prior to participating in the consensus-rating exercise may influence each other's judgments, thereby producing spuriously high inter-rater correlations. The standard-setting process used in this study was structured in a manner which prevented judges from working together until the final consensus session, and during the standard-setting meeting the judges were not provided with p-values from previous administrations of the tests.

Some experts in the standard-setting field have argued, however, that judges should be provided with the "prior knowledge" of real data, and should be encouraged to work together to reduce idiosyncratic differences. For instance, in a discussion of research involving the Ebel method, investigators concluded that large error variances obtained for the Ebel standards "were due to disagreements among judges regarding the expected success probabilities assigned to groups of items." Their solution to this problem was to provide "reasonable probabilities" for the raters (Jones, 1987, p. 10). There are studies which demonstrate that, given empirical item statistics, judges will generally "move their item reevaluations in the direction of the normative feedback" (Saliba, 1990, p. iv). This "team" effect is worthy of further study.

The correlation between a predictor and its associated criterion is often used as a validity coefficient, while correlations between (among) predictors may be used as reliability coefficients (Pedhazur & Schmelkin, 1991, p. 37). When the relationship between two sets of ratings is linear, the product-moment correlation is used (Ebel, 1972, p. 411). When the relationship is non-linear, other methods (such as curvilinear regression) may be indicated (Pedhazur & Schmelkin, 1991, p. 37). The reliability of the judgments of a panel of judges can be computed from the average inter-judge correlation through the Spearman-Brown prophecy formula

r_kk = (k · r̄) / [(k − 1) · r̄ + 1],

where r̄ is the average of the off-diagonal elements of the correlation matrix for all k judges. Ebel (1951, pp. 411f) computed the reliability of the average ratings of k judges using the formula r_kk = 1 − v_e/v_i, where v_e and v_i are the error and item variances, respectively. Guilford (1954, pp. 396f) provided formulas to use for partitioning the variance into the components due to items, raters and error. All of these measures of reliability were used in this study.

Internal Consistency of the Judges' Ratings

Three different analytical approaches were used in this study to measure the internal consistency of the judges' ratings. These are described below.

Correlation with empirical and consensus p-values: Each judge's p-values were correlated with those obtained by the Marginally Capable Candidates group and also with the consensus ratings established by the panel of judges.
Since the probability of a correct response (p-value) is not a linear function of ability (the relationship between p-value and ability is assumed to approximate an ogive), use of linear correlation may yield inaccurate results.

Consistency index based on latent trait theory: An index of consistency is needed to provide a nondimensional measure of how well each judge predicted the actual test performance of the hypothetical marginally capable candidate. Latent trait analysis, also known as Item Response Theory (IRT), provides a functional relationship between empirically-derived psychometric characteristics of each test item and the ability of the examinee. It is possible, therefore, to calculate an expected probability that a person with a given ability will succeed on a given test item. By performing these calculations for each of the items comprising an examination, and summing the probabilities, one can determine the most likely raw score which would be obtained by this hypothetical person. IRT was used to estimate how closely each judge's p-values compared with a postulated latent trait using an analysis of the residuals, i.e., by computing the absolute differences between each judge's ratings and the IRT model.

In a standard-setting situation, each judge envisions the hypothetical ability of a typical marginally capable candidate. Since the abilities envisioned by different judges are rarely identical, different raw scores will result when the predicted probabilities of success for the items are summed. These differences represent the inter-judge inconsistency in the setting of the standard, analogous to the between-groups variance in ANOVA. For the within-judge variance, van der Linden postulates that (for items which fit the latent trait model) the intra-judge variance can be computed from the residuals between the latent trait model and the judge's predictions.

Once a test has been administered, there are computer programs (such as BICAL, LOGIST, or BIGSTEPS) which can be used to compute a "raw score to ability" conversion table for each test. The item calibrations (difficulty ratings), obtained from actual test administration(s), can be used to compute the modelled probabilities that the marginal candidate (envisioned by any given judge) will select the correct answer for a given item. These are then compared with the predictions established by the judge. The steps in this process are:

1. The judge estimates, for each test item which has been found to fit the IRT model reasonably well, the probability that a marginally capable student will obtain the correct answer.
2. These values are summed to produce the expected raw score for the marginal examinee.
3. The score thus determined is looked up in the "raw score to ability" conversion table to determine the ability value, θc, for the marginal candidate.
4. This value, θc, is substituted into the latent trait model for each test item to determine the probability that a person with that raw score would obtain the correct answer to the item.
5. These probabilities are then compared with the judge's predictions to see how closely the latent trait model was approximated, and the residual between the latent-trait modelled p-value and the judge's prediction is computed.
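The five steps can be sketched as follows, assuming a Rasch (one-parameter logistic) model with hypothetical item difficulties and hypothetical Angoff ratings. The raw-score-to-ability lookup table produced by programs such as BICAL or BIGSTEPS is approximated here by solving for θc numerically; this substitution is an assumption made only for the sake of a self-contained example.

```python
# Sketch of steps 1-5 above under a Rasch (one-parameter logistic) model.
# Item difficulties b[] and the judge's Angoff ratings are hypothetical; in
# practice the difficulties come from program calibrations and theta would be
# read from the program's raw-score-to-ability table.
import numpy as np
from scipy.optimize import brentq

b = np.array([-1.2, -0.5, 0.0, 0.4, 0.9, 1.5])             # item difficulties (logits)
judge_p = np.array([0.80, 0.70, 0.55, 0.50, 0.40, 0.30])   # step 1: judge's predicted p-values

def rasch_p(theta, b):
    """Modelled probability of success on each item at ability theta."""
    return 1.0 / (1.0 + np.exp(-(theta - b)))

expected_raw = judge_p.sum()                                # step 2: expected raw score

# Step 3: find theta_c whose modelled expected score equals the judge's raw score.
theta_c = brentq(lambda t: rasch_p(t, b).sum() - expected_raw, -6.0, 6.0)

model_p = rasch_p(theta_c, b)                               # step 4: modelled p-values
residuals = np.abs(judge_p - model_p)                       # step 5: judge versus model

print(f"theta_c = {theta_c:.3f}")
print("absolute residuals:", np.round(residuals, 3))
```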
In an effort to create a nondimensional consistency index, van der Linden normalized the mean residuals by dividing the sum of the residuals by the sum of the maximum possible residuals. The maximum possible residuals are those which would be obtained if the judge predicted that the examinees would all get a difficult item correct, or that all examinees would get an easy item wrong. The steps for computing this index are as follows:

1. Compute the absolute difference between the judge's predicted p-value and the IRT-modelled p-value.
2. Find the average of the differences obtained in Step 1; this is called E_j (the mean error for judge j).
3. Compute the maximum value of the difference between the IRT-modelled p-value and the value of 1 or 0.
4. Find the average of the differences obtained in Step 3; this is called M_j (the mean maximum possible error for judge j).
5. Compute the difference between E_j and M_j and divide by M_j; this is the consistency index C_1j (the subscript 1 indicates that the Angoff method is being used).

C_1j can also be written as

C_1j = (M_j − E_j) / M_j

There is some bias in this method of determining consistency, however. Because the logistic curve is non-linear, small differences in ability can lead to relatively large variations in the probability of success in the region where the Item Characteristic Curve (ICC) has the greatest slope, i.e., where the ability of the examinee and the difficulty of the item are matched (p = 0.5). (An Item, or Test, Characteristic Curve is a plot which represents the probability of success, as a function of ability, on an item or test.) This also happens to be the region of greatest reliability (and maximum information content) of the item. In cases where the items are either too easy or too difficult, however, the slope of the ICC is small, so the reverse situation applies (large differences in ability lead to small variations in the probability of success).

In the standard-setting process, the difficulty of the test is fixed. The judges are not modifying the items, but are merely trying to predict how many items will be correctly answered by a marginally capable candidate. If Judge A believes the test to be very easy (therefore envisioning the marginal examinee to be very high-scoring), then the magnitude of the residuals between predicted and actual performance is likely to be small (because of the low slope of the logistic curve at high values of ability relative to the item difficulties). A similar result would be obtained by Judge B, who believes that the test is very difficult (leading to a low-scoring marginal candidate). If, however, Judge C believes the test is matched to the ability of the marginal candidate (who will thereby obtain a score near 50 percent), the residuals will tend to be larger (because of the high slope of the logistic curve near p = 0.50). This does not necessarily mean that Judge C is not as consistent as Judges A and B; the difference could merely be an artifact of the shape of the logistic curve and the difficulty of the test relative to the abilities of the hypothetical marginal students as envisioned by the three judges.

For this study, the consistency index was used with the caveat that the results may be biased against a judge who places the passing score near 50 percent. Note further that many facets of the content of the MEAP tests are, for many teachers, relatively new and therefore perceived as "hard." As a result, many judges may have been tempted to assign p-values close to 0.5 for a large number of the items. This would tend to depress their consistency indices. Despite this possibility, it will be shown that most of the judges displayed excellent internal consistency in their ratings.
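Continuing the same hypothetical example, the consistency index of Steps 1 through 5 reduces to a few lines; the values of judge_p and model_p below are illustrative.

```python
# Sketch: consistency index from Steps 1-5 above.
# judge_p and model_p correspond to the quantities in the previous sketch.
import numpy as np

judge_p = np.array([0.80, 0.70, 0.55, 0.50, 0.40, 0.30])   # judge's predictions
model_p = np.array([0.82, 0.68, 0.57, 0.47, 0.36, 0.27])   # IRT-modelled p-values

E_j = np.abs(judge_p - model_p).mean()          # Step 2: mean error for judge j
M_j = np.maximum(model_p, 1 - model_p).mean()   # Step 4: mean maximum possible error
C_1j = (M_j - E_j) / M_j                        # Step 5: consistency index

print(f"E_j = {E_j:.3f}, M_j = {M_j:.3f}, C_1j = {C_1j:.3f}")
```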
Comparison of Test Classifications with Mastery Ratings

The primary purpose of this study was to determine the nature and extent of the discrepancies among multiple standard-setting methods. Statisticians have developed various measures of association to quantify the relationship between variables. The Pearson chi-square test is one of the most widely used measures of association, and is used primarily to test the hypothesis that two dichotomous variables are independent (Norusis, 1990). Once several students, for whom mastery/nonmastery states had been established, had taken the tests and received their scores, the effect (on the classification of students by the test) of setting hypothetical values of the cut score over a wide range of raw scores was explored. It was only necessary to vary the cut score and observe the effect it had on the value of Chi-square (or the Phi coefficient) and the error rate obtained for each hypothetical value of the cut score. The "optimal" cut score, relative to the judges' ratings, was defined as the value of the cut score which maximized Chi-square (or the Phi coefficient), or which minimized the error rate. For a given test, it seems reasonable to assume that these will occur at the same cut score, since the maximum Chi-square (or Phi coefficient) implies the best relationship between judge ratings and test scores. Logically, then, this would seem to imply that the cut score which provides the best match between judge ratings and test results would produce the minimum number of classification errors as well. By plotting Chi-square, the Phi coefficient, and the error rate against the values of the cut score, this hypothetical relationship was explored.

Even when the "optimal" cut score (or, possibly, the optimal range of cut scores) has been established, the task of standard-setting is still incomplete. Experts are very reluctant to rely upon a standard which has been determined "blindly" through statistics. This is why psychometricians have been unable to arrive at a single "best" method for establishing cut scores. Consider, for example, the case where the judges' mastery ratings of examinees are taken to be relatively accurate representations of the "truth," as would be the case with the Contrasting Groups or Borderline Group methods. In these cases, statistics based on these "known" mastery states are used. Despite this, most standard-setting is done using a variant of the Angoff method, where data from examinees play at most a minor role (Jaeger, 1989). If the judges merely wish to use the test to separate probable masters from probable nonmasters, they may be willing to settle for a low cut score which simultaneously provides a reasonably significant Chi-square or Phi to those obtained using the average correlations with the Spearman-Brown prophecy formula, both for the individual rater judgments and the panel judgments. This demonstrates that simple correlations can be used to identify those judges who contribute most positively to the standard set by the panel, and to determine how much confidence can be placed in the final outcome of the standard-setting process. For instance, in high-stakes situations it may be desirable to use panels of experts whose judgments yield reliabilities higher than the values of 0.75 to 0.77 obtained in this situation.
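The cut-score sweep described above might look like the following sketch. The scores, teacher ratings, and the 2×2 table layout are simulated assumptions used only to show the mechanics of computing Chi-square, Phi, and the error rate at each hypothetical cut score; the optimal cut can then be read off as the value that maximizes Chi-square or minimizes the error rate.

```python
# Sketch: sweeping hypothetical cut scores and computing chi-square, phi, and
# the misclassification rate for a 2x2 table of teacher-rated mastery versus
# test classification.  Scores and ratings are simulated, not actual MEAP data.
import numpy as np
from scipy.stats import chi2_contingency

rng = np.random.default_rng(0)
scores = rng.integers(30, 90, size=200)             # simulated raw scores
is_master = (scores + rng.normal(0, 8, 200)) > 60   # simulated teacher ratings

for cut in range(50, 71, 5):
    passed = scores >= cut
    table = np.array([[np.sum(passed & is_master),  np.sum(passed & ~is_master)],
                      [np.sum(~passed & is_master), np.sum(~passed & ~is_master)]])
    chi2, p, _, _ = chi2_contingency(table, correction=False)
    phi = np.sqrt(chi2 / table.sum())
    error_rate = (table[0, 1] + table[1, 0]) / table.sum()   # false pos + false neg
    print(f"cut={cut}: chi2={chi2:.1f}, phi={phi:.2f}, error={error_rate:.1%}")
```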
Contrasting Groups and Borderline Group cut scores: For the Grade 4 students, mastery/nonmastery decision data were obtained from members of the standard-setting panels whose students had taken the MEAP mathematics tests for which standards were being established. In cases where a judge was not teaching a grade in which the tests had been administered, the judge was asked to select one or more teachers who had been trained to understand and use the instructional model upon which the tests were based. These teachers, in turn, rated their own students as "masters," "nonmasters," or "marginal." All of the students were rated prior to the receipt of test results by their teachers.

The cumulative frequency distributions of scores obtained by the three rated groups (master, nonmaster, and marginal) are displayed in Figure 4.2 below. For the Contrasting Groups method, the cumulative frequency of the nonmastery group was plotted beginning at the highest score and proceeding down to the lowest, whereas the cumulative frequency for the mastery group was plotted beginning at the lowest score and proceeding up to the highest. The point at which these curves intersected was selected as the Contrasting Groups cut score, thereby equalizing the overall false positive and false negative errors. For the Borderline Group method, the cumulative frequency for the marginal group was plotted beginning at the lowest score and proceeding up to the highest. The point at which this curve reached 50 percent was selected as the Borderline Group cut score, thereby splitting this group in half.

Figure 4.2 Contrasting Groups and Borderline Group Method (cumulative frequency versus raw score for the master, nonmaster, and marginal groups, Grade 4)

A "false positive" error is defined as an event in which an examinee who had been rated as a "clear nonmaster" by the teacher/judge is classified by the test as a "master." A "false negative" error occurs when an examinee who has been rated as a "clear master" is classified by the test as a "nonmaster." Most of the error at the panel-established cut score of 69 is of the "false negative" type, where students who had been rated as "clear masters" by their teachers were classified by the test as "nonmasters." Had the cut score been set below the lowest score obtained by a "clear master" (a score of 43 on this test), there would be no false negatives, but the number of false positives would be large (69%). Had the cut score been set above the highest score obtained by a "clear nonmaster" (a score of 83 in this case), the false positives would be eliminated, but the false negatives would be large (over 70%). The false positive and false negative error rates are equal at a score of 63, as shown by the intersection of the frequency plots at that score in Figure 4.2. By definition, then, the Contrasting Groups method places the cut score at 63, the point where the frequency curves intersect. The Borderline Group method sets the cut score at the median of the marginal group (a score of 61). The tetrachoric correlation for this data set is 0.92 (Davidoff & Goheen, 1957). This is considerably larger than the maximum phi coefficient (0.80).
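A sketch of how the two cut scores can be located from rated groups of students follows; the score distributions are simulated, and the curve intersection is approximated by equating the two error rates over a grid of candidate cut scores.

```python
# Sketch: locating the Contrasting Groups and Borderline Group cut scores from
# teacher ratings and raw scores.  The data are simulated for illustration.
import numpy as np

rng = np.random.default_rng(1)
master_scores = rng.normal(70, 8, 150)     # students rated "clear master"
nonmaster_scores = rng.normal(55, 8, 100)  # students rated "clear nonmaster"
marginal_scores = rng.normal(62, 6, 60)    # students rated "marginal"

# Contrasting Groups: find the score where the percent of masters failed by the
# test equals the percent of nonmasters passed by the test (equal error rates).
cuts = np.arange(30, 91)
false_neg = np.array([np.mean(master_scores < c) for c in cuts])     # masters failed
false_pos = np.array([np.mean(nonmaster_scores >= c) for c in cuts]) # nonmasters passed
cg_cut = cuts[np.argmin(np.abs(false_neg - false_pos))]

# Borderline Group: the median score of the marginal group splits it in half.
bg_cut = np.median(marginal_scores)

print(f"Contrasting Groups cut = {cg_cut}, Borderline Group cut = {bg_cut:.0f}")
```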
Figure 4.3 shows the results of varying the cut score from 30 to 80 and computing Chi-square, the Phi coefficient, and the error rate for the comparison between test-derived mastery states and teacher-assigned mastery states for the Grade 4 students. These results indicate a very strong relationship between mastery ratings and test results over a wide range of cut scores (p < 0.01 from a score of 34 through 80). The optimal cut score, based upon the highest chi-square and lowest error rate, is 60 (out of 92). At this score, the error rate is less than 10%, the chi-square is 88 (p < 0.0001), and the phi coefficient is 0.80. The actual cut score set by the panel of judges was 69, which produces an error rate twice as large as that produced at the optimal score of 60.

Figure 4.3 Chi-square, Phi (×100), and Error Rate (%) versus Cut Score for the Grade 4 Test

Figure 4.2 also shows that the panel-established cut score of 69 misclassifies 36% of the "clear masters" and 6% of the "clear nonmasters." The standard error of measurement (SEM) on this test is 7 raw score points. By reducing the panel's cut score by 0.9 SEM, the false positive and false negative errors are equalized; reducing it by 1.3 SEM minimizes the error rate and maximizes Chi-square. These results for all three standard-setting methods are summarized in Table 4.5 below.

Table 4.5 Cut Scores and Error Rates for Three Standard-Setting Methods, Grade 4 Test

Method                  Cut Score   False Positives   False Negatives
Angoff (consensus)         69             6%               30%
Contrasting Groups         63            12%               12%
Borderline Group           61            16%                8%

Instructional alignment: Was there any evidence of differences in test performance based upon the emphasis placed on various strands in the mathematics instruction reported by teachers? A correlation of the student performance (strand scores) with the instructional alignment was performed. Strands which were reported as having the most emphasis were coded "1." Those with the least emphasis were coded "-1." Others were coded "0." If student test performance is improved because of the reported instructional alignment, these correlations should be positive. The results for the Grade 4 test are shown in Table 4.6 below.

Table 4.6 Correlations Between Reported Instructional Emphasis and Strand Performance, Grade 4 Test

Content Strand                 Correlation
Whole Numbers & Numeration        0.04
Fractions                         0.10
Measurement                      -0.23**
Geometry                          0.14*
Statistics & Probability         -0.14*
Algebra                          -0.11
Problem Solving                   0.32**
Calculators                      -0.13

Process Strand                 Correlation
Conceptualization                 0.23**
Mental Arithmetic                -0.06
Estimation                       -0.32**
Computations                     -0.12
Applications                      0.09

NOTE: ** Significant at p < 0.001; * Significant at p < 0.01

There are six significant correlations: for Measurement, Geometry, Statistics, Problem Solving, Conceptualization, and Estimation. Half of the significant correlations (Geometry, Problem Solving, and Conceptualization) are positive and half (Measurement, Statistics, and Estimation) are negative. The mean correlation is -0.01. As a whole, then, it can be concluded that instructional emphasis, as reported by the Grade 4 teachers, had no consistent effect upon student test performance.
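Correlations of the kind reported in Table 4.6 have the general form sketched below; the scores and emphasis codes are simulated, and the coding (+1, 0, −1) follows the scheme described above.

```python
# Sketch: correlating students' strand scores with the reported instructional
# emphasis of their classroom (+1 most emphasis, 0 neutral, -1 least emphasis).
# The data are simulated for illustration only.
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(2)
n_students = 300
emphasis = rng.choice([-1, 0, 1], size=n_students)                   # teacher-reported emphasis code
strand_score = 60 + 2.0 * emphasis + rng.normal(0, 10, n_students)   # strand score (percent correct)

r, p = pearsonr(emphasis, strand_score)
print(f"alignment correlation r = {r:.2f} (p = {p:.4f})")
```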
There are several ways such a result could be interpreted. Assuming that the teacher self-reported instructional alignment was accurate, one could surmise that instruction had no effect on test results. Another interpretation is that student learning was more closely related to curriculum than to instruction, and that a better indicator would be curriculum (or textbook) alignment rather than instructional alignment. A third logical explanation is that the teacher self-report was not an accurate reflection of what was actually taught in the classroom. Independent of the choice of an explanation, it can be concluded that the data collected showed no consistent relationship between reported instructional alignment and student test performance.

Rater drift: Was there any evidence of drift among the Grade 4 judges? It is conventional in test development to arrange the items from the easiest at the beginning of the test to the most difficult at the end. The items on the MEAP mathematics tests were grouped around themes with little or no regard for item