This is to certify that the dissertation entitled

The Comparative Reliability and Validity of Alternate-Choice and Multiple-Choice Tests

presented by

Timothy J. Van Susteren
has been accepted towards fulfillment of the requirements for the Ph.D. degree in Measurement, Evaluation, and Research Design.

Major Professor

Date: December 2, 1985

THE COMPARATIVE RELIABILITY AND VALIDITY OF ALTERNATE-CHOICE AND MULTIPLE-CHOICE TESTS

By

Timothy J. Van Susteren

A DISSERTATION

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of

DOCTOR OF PHILOSOPHY

Department of Counseling, Educational Psychology, and Special Education

1986

ABSTRACT

THE COMPARATIVE RELIABILITY AND VALIDITY OF ALTERNATE-CHOICE AND MULTIPLE-CHOICE TESTS

By

Timothy J. Van Susteren

Ebel (1980) proposed a new item format, termed alternate-choice, which he suggested would compare quite favorably with other conventional test item formats with regard to difficulty, discrimination, reliability, and validity, yet have the advantage of being easier to write. While this unique item form proposed by Ebel appeared to show potential as an important addition to the repertoire of test item formats item writers have at their disposal, little empirical research existed to substantiate Ebel's claim.

The purpose of this study was to compare the reliability and validity of alternate-choice and multiple-choice tests that were written to measure understanding of concepts and relationships in educational psychology. The difficulty and discrimination of the two formats were also investigated, and examinees' perceptions of the items were explored.

In this study a series of examinations composed of alternate-choice and multiple-choice subtests were administered to a group of students enrolled in an introductory course in educational psychology. Students were timed to identify the number of alternate-choice and multiple-choice items to which they were able to respond in a given time period, and a questionnaire was administered to the students to explore their perceptions of the items.

The results of the study indicated that the alternate-choice items compared favorably with the multiple-choice items. While the alternate-choice items were easier than the multiple-choice items, they discriminated as well and were as reliable. Also, the alternate-choice items were more efficient, since students were able to respond to three alternate-choice items for every two multiple-choice items. The concurrent validity of the alternate-choice tests did not equal that of the multiple-choice tests, but the validity of both forms was quite acceptable. In addition, students viewed both forms quite positively and did not express a preference for one form over the other. The use of alternate-choice items to measure educational achievement is recommended.

DEDICATION

This dissertation is dedicated to the memory of Dr. Robert L. Ebel, whose pursuit of more precise methods of measuring "useful verbal knowledge" led to his conception of the alternate-choice item format. I will always remember Dr. Ebel's charm, wit, and intellect, and be grateful for his advice, counsel, and friendship throughout my doctoral program.

ACKNOWLEDGMENTS

Sincere thanks are extended to Dr. Irvin Lehmann, Chairman of the dissertation committee, for the excellent advice and editorial assistance he provided.
I am also grateful to each of the dissertation committee members--Dr. William Mehrens, Dr. Leroy Olson, and Dr. Joseph Levine--for their contributions. Dr. J. Bruce Burke deserves special thanks for his constant encouragement and moral support throughout the research project, as well as for his assistance in writing items and gathering data. I will never be able to adequately express my gratitude to my wife, Lynn, for the support she gave me during my graduate studies. Her love and understanding represent an important contribution to my successful completion of a doctoral program.

TABLE OF CONTENTS

LIST OF TABLES
LIST OF APPENDICES

CHAPTER
I. THE PROBLEM
   Alternate Forms of Objective Test Items
   Alternate-Choice Items
   Need for this Study
   Purpose of this Study
   Hypotheses
   Overview
II. REVIEW OF THE LITERATURE
   Studies Comparing Multiple-Choice with Constructed-Response Type Items
   General Studies Comparing Various Objective Test Item Formats
   Studies Comparing Amount of Testing Time
   Studies Considering Examinees' Preference for an Item Type
   Summary
III. DESIGN AND PROCEDURES
   Introduction
   Hypotheses
   Subjects Participating in the Study
   Instrumentation
   Alternate-Choice Item Generation Procedure
   Research Design
   Analysis
   Summary
IV. RESULTS OF THE STUDY
   Introduction
   Results Concerning Difficulty and Discrimination
   Results Concerning Efficiency of Alternate-Choice and Multiple-Choice Items
   Results Concerning the Reliability of the Alternate-Choice and Multiple-Choice Tests
   Results Concerning Concurrent Validity
   Results Concerning the Students' Preference for Alternate-Choice and Multiple-Choice Items
   Summary
V. SUMMARY AND CONCLUSIONS
   Summary
   Conclusions
   Discussion
   Limitations of the Study
   Suggestions for Further Research
APPENDICES
BIBLIOGRAPHY

LIST OF TABLES
1. Psychometric Characteristics of the Items Employed in the Pilot Study
2. Descriptive Statistics and Multivariate Analysis of Variance for Difficulty and Discrimination
3. Mean Number of Alternate-Choice and Multiple-Choice Items Attempted in the Time Trials
4. Significance Tests of the Reliability of the Alternate-Choice and Multiple-Choice Tests
5. Significance Tests of the Correlation Coefficients for Multiple-Choice and Alternate-Choice Scores on the Test

LIST OF APPENDICES
A. Tables of Specifications
B. Questionnaire Summary
CHAPTER I
THE PROBLEM

Alternate Forms of Objective Test Items

Educators and measurement specialists are constantly seeking more versatile and efficient methods of measuring knowledge. With the objective of achieving more precise and valid measurements, researchers have investigated the psychometric properties of essay, multiple-choice, true-false, matching, completion, and various novel combinations of these forms in a variety of testing situations and subject matter disciplines (Charles, 1926; Ebel, 1975; Grosse & Wright, 1985; Meihoff & Mehrens, 1985). While reliability and validity are generally considered the ultimate criteria for judging the overall merits of an item form, researchers have also considered the ease or difficulty of writing or producing the item, the adaptability of an item form to a variety of measurement goals and objectives, and examinees' preferences and perceptions of an item type as important criteria in evaluating the practical usefulness of the form.

Of the various item forms mentioned above, multiple-choice items are by far the most popular (Wesman, 1971). It has been demonstrated that tests composed of multiple-choice items can be reliable and valid, and that multiple-choice items can be readily employed to test almost any subject matter (Ebel, 1980). It has also been demonstrated that multiple-choice items can be conceived to measure complex mental processes (Bloom, 1958).

While there are many advantages associated with the use of multiple-choice items, there are also difficulties involved. One of the problems most often cited is the difficulty of producing a sufficient number of plausible distractors. Mehrens and Lehmann (1984) point out that:

Although any test item is good or bad depending on the clarity of expression, the multiple-choice item must have, in addition to a stem that is clear and free from ambiguity, a correct answer and a set of plausible responses. The value of a multiple-choice item depends to a large degree on the skill with which the various distractors are written. (p. 279)

It is apparent that a large portion of the difficulty involved in writing good multiple-choice items is associated with the item writer being able to conceive of a sufficient number of alternatives that are plausible enough to attract the less knowledgeable examinee, yet not so close to the correct answer as to make the item ambiguous and confusing to even the most knowledgeable examinee (Blake & Huntley, 1984).

The related issue of the optimal number of alternatives that should be provided to maximize efficiency (efficiency is commonly defined as the number of items that examinees are able to answer in a fixed time period) and reliability has been debated from the time multiple-choice items were first introduced to the present (Ruch & Stoddard, 1925; Budescu & Nevo, 1985). Most of the recent empirical studies seem to provide evidence favoring the use of three-choice items over items offering four or five alternatives (Wilson, 1982), with some studies suggesting that tests composed of two-choice items can provide satisfactory reliability (Williams & Ebel, 1957).

Grier (1975) summarizes the issue of the optimal number of alternatives and concludes that items with shorter stems and fewer alternatives are frequently found to be more reliable than longer items with more alternatives (four- and five-choice items). He also notes that there are at least two other reasons that shorter items with fewer distractors are preferable. First,
it is usually easier to produce one or two plausible distractors than three or four. Second, there may be a ". . . gain in efficiency, since students might not get lost reading many alternatives and have to return and re-read the question and early alternatives" (1975, p. 112).

Employing the same logic that Grier (1975) provides in favor of three-choice items, Ebel (1980) has recently proposed a two-choice item form, termed alternate-choice items.

Alternate-Choice Items

Ebel (1980) explains that alternate-choice items are based on a single proposition rather than complex situations, and that they offer only two alternatives instead of the conventional three, four, or five. Alternate-choice items also differ from multiple-choice items, including the conventional format for two-choice items, in that they include the responses as segments of a continuous sentence rather than listing them in a column under the stem. For example:

The items teachers write for their classroom tests are likely to be too *a) variable b) uniform in difficulty.

Indices of item difficulty tend to vary *a) less b) more from one group of students to another than do indices of discrimination.

Adrian has finally learned to take turns with classroom toys. In order to maintain this appropriate behavior, his teacher should praise him a) often *b) occasionally.

The concepts of overlearning and satiation are a) very similar *b) distinctly different.

Ebel (1980) notes that in many situations there are good reasons for favoring the use of alternate-choice items. He states that:

Alternate-choice items have some important advantages over the more familiar multiple-choice item form that offers four answer choices. They are more efficient in that they yield more scorable responses per unit of testing time. They are easier to write, because they only require two alternate answers. Often the important questions an item writer would like to ask have only two plausible alternative answers. A problem is either major or minor, simple or complex. Action in response to it is reasonable or unreasonable. The President either supports or opposes a restriction on the import of foreign automobiles. The birth rate in Russia is higher or lower than that in the United States. Increasing the homogeneity of the items in a test either increases or decreases the reliability of the test scores. (p. 115)

According to Ebel, polar alternatives, such as those mentioned above, are very common in real life. The unique format of alternate-choice items allows the item writer to pose these realistic alternative questions in test items free of the constraints involved when he/she must conceive of two or three additional plausible distractors.

In a preliminary study, Ebel (1982) found that alternate-choice items compare quite favorably with true-false items. The results of that study indicated that tests composed of alternate-choice items tend to be (a) easier, (b) more highly discriminating, and (c) more reliable than true-false tests. He also notes that students seemed to prefer alternate-choice items and perceive them to be less ambiguous than true-false items.

Need for this Study

As mentioned in the beginning of this chapter, there has been a continuing debate among measurement specialists on the psychometric advantages of various test item formats. In that debate, true-false and multiple-choice items are among the most commonly discussed (Wesman, 1971).
Proponents of true-false items note that this form is comparatively easy to write and is quite efficient, but concede that true-false items tend to be viewed as ambiguous and best suited to testing factual recall. Multiple-choice items have the advantage of versatility and can be written to measure almost any cognitive objective. They are, however, less efficient than some other objective forms, such as true-false, and are among the most difficult to write in that it is often difficult to provide a sufficient number of plausible distractors (Mehrens & Lehmann, 1984).

Ebel (1980) has recently proposed a unique test item form termed alternate-choice items. He contends that alternate-choice items reflect the advantages of both true-false and multiple-choice items in that they are very efficient, quite versatile, and can be written to measure the important elements of any subject matter. Ebel also suggests that, as a result of the unique format of the item form, alternate-choice items are easier to write than either multiple-choice or true-false items.

This unique item form proposed by Ebel appears to show potential as an important addition to the repertoire of test item forms that teachers and other item writers have at their disposal, but since alternate-choice items are new, there has been little research conducted with them. As previously noted, researchers have investigated the psychometric properties of multiple-choice, true-false, and various other test item forms in a variety of testing situations and subject matter disciplines. The accumulated knowledge gained from these studies has helped item writers to provide more precise and valid measurements using these item formats. A need exists for more psychometric investigation of alternate-choice items. The empirical data that studies of this type can provide are vital for educators and other test item writers to evaluate the usefulness of alternate-choice items and to determine the potential of this new item form for measuring knowledge in various settings.

Purpose of this Study

The purpose of this study was to evaluate the performance of alternate-choice items relative to multiple-choice items in a large-scale college course testing program. The study compared the reliability and concurrent validity of alternate-choice and multiple-choice test items written to measure understanding of concepts and relationships in the same content area. The difficulty and discrimination of the two item forms were also explored, and information on the efficiency of alternate-choice and multiple-choice items was collected. This information was used to determine the optimal length of the alternate-choice tests. The reliabilities of these lengthened tests were then compared to the reliabilities of the multiple-choice tests with testing time held constant.

Because examinees' preferences and perceptions have been considered important evaluative criteria (Frisbee & Sweeney, 1982; Ward, 1982), data were collected in this study to investigate examinees' preference for alternate-choice and multiple-choice items.

The information provided by the results of this study will be useful in the evaluation of the psychometric merits of alternate-choice items. The study is, however, not without limitations. In discussing the reasons for the relatively small number of studies researching test items, Wesman (1971) notes:

That research studies have contributed little to item writing is not very surprising.
The inherent difficulties in conducting penetrating and generalizable studies may not be insurmountable, but they are far from easily resolved. It is the sophisticated recognition of these difficulties that is largely responsible for the paucity of attempts at basic research. (1971, p. 84)

In addition to the limitations common to many research studies associated with random sampling and sample size and composition, this study has at least two other limitations that Wesman (1971) cites as often associated with basic research on test items. They are the inability (1) to control for the fact that some concepts or subjects may lend themselves more readily to one test item type or format than they do to others, and (2) to account for the fact that the skill of the item writer may account for a large portion of any differences detected. These limitations are discussed more fully in the final section of this paper, but are mentioned here in order to provide an additional dimension to the problem.

Hypotheses

The research hypotheses of the study were:

1. There is no difference in the difficulty indices of alternate-choice and multiple-choice tests.
2. There is no difference in the discrimination indices of alternate-choice and multiple-choice tests.
3. Examinees will attempt the same number of alternate-choice and multiple-choice items in a fixed time period.
4. There is no difference in the internal consistency reliabilities of alternate-choice and multiple-choice tests.
5. The Pearson product-moment correlation between individuals' alternate-choice and multiple-choice test scores is 1.00 when corrected for attenuation.
6. There is no difference in examinees' preference for alternate-choice and multiple-choice tests.

Overview

In Chapter II the literature relevant to the general problem and to specific hypotheses is reviewed. The design of the study, the sample, the instrumentation, and the method of analysis are presented in Chapter III. In Chapter IV the results of the study are discussed. This is followed by a final chapter that contains a summary of the study, a discussion of the findings, the limitations of the study, and suggestions for future research.

CHAPTER II
REVIEW OF THE LITERATURE

The relative merits of various test item formats are considered an important subject and are the topic of a great deal of debate among educators and measurement specialists. This is evidenced by the fact that research pertaining to this debate can be found in the literature from the 1920's, when objective test items first became popular, to the present. Chapter II of this study presents a general survey of the research comparing different test item formats.

One of the most persistent and consistently raised questions in achievement testing concerns the concurrent validity of various test item formats and asks whether the item format employed affects the attribute measured by the test. Researchers have attempted to investigate this question by comparing item types that require examinees to select a response with items that require examinees to produce a response. The first section of Chapter II, therefore, is devoted to reviewing the literature comparing test item formats, such as multiple-choice, that require examinees to select a response from a list of options with items requiring examinees to supply a correct response from memory, commonly referred to as free-response or constructed-response items.
Researchers have also been greatly concerned with comparing the reliability that may be achieved using various objective test item formats. Since alternate-choice items are a new form, no specific mention is made of them in the literature. Accordingly, the second section of this chapter includes studies comparing the reliabilities of other popular test item formats. The third section of this chapter contains a review of research dealing with the testing time required by examinees to respond to different test item forms. In the final section, a number of research studies are cited pertaining to examinees' preference when they are presented with two or more types of items. A general summary at the end of the chapter provides a review of Chapter II and an introduction to Chapter III.

Studies Comparing Multiple-Choice Test Items with Constructed-Response Type Items

Teachers and other test users often report an intuitive belief that different types of test items measure different types or levels of knowledge. They believe that test items requiring examinees to produce an answer to a test question comprise not only an inherently more difficult task, but one requiring an entirely different mental process (cognitive task) than test items requiring examinees to recognize and select the correct answer from a list of options. Several researchers have sought to gather evidence relevant to this hypothesis.

For example, Heim and Watts (1967) and Cook (1955) compared multiple-choice and completion vocabulary items and found that the principal difference between the two forms is that the items requiring examinees to supply the answer tended to be somewhat more difficult. These results correspond to those of a similar study by Andrews and Bird (1938) in psychology terminology.

Rowley (1974) compared student responses to multiple-choice and free-response tests of vocabulary and mathematics. The results of his study also suggest that the free-response items were generally more difficult. In addition, Rowley discovered that the use of multiple-choice items to test vocabulary may favor examinees high on testwiseness and/or risktaking. In measuring mathematical achievement, no evidence was found to suggest that the multiple-choice scores differed in any systematic way from the free-response scores. Rowley speculates that the risktaker may have been able to benefit from informed guesses on the multiple-choice vocabulary test to a greater extent than on the multiple-choice mathematics test, where his/her guesses were more truly random. He points out that testwise students who do not know the correct answer to a vocabulary item can often eliminate one or more of the options on the basis of partial knowledge and guess among those remaining. This is seldom the case with mathematics items, especially when the student is required to perform some mathematical operation to arrive at the answer.

The results of a study by Rocklin and Thompson (1985) supported Rowley's conclusions and also detected an interaction between test anxiety, test difficulty, and item format. As might be expected, students high on test anxiety found free-response items more difficult and anxiety producing than multiple-choice items.

The format of items testing mathematics was also the subject of a study performed by Oosterhof and Coats (1984) in which they investigated the difficulty and internal consistency of free-response and multiple-choice items used in an undergraduate course in business finance.
The results of their study indicated that the free-response mathematics items were more difficult and reliable than the multiple-choice items employed. The authors note that the comparatively better performance of the completion items used in their study may have been at least partially due to the fact that the probability of an ignorant examinee guessing the correct answer to a completion-type mathematics item is extremely low.

Ward (1982) compared the reliability (coefficient alpha) and concurrent validity of free-response and multiple-choice verbal aptitude test items. His study confirmed earlier findings that free-response items tend to be somewhat more difficult than multiple-choice items and concludes:

This study has shown that it is possible to develop open-ended forms of verbal aptitude item types that are approximately as good, in terms of score reliability, as multiple-choice items and that require only slightly greater time limits than do the conventional items. These open-ended items, however, provide little new information. There was no evidence whatsoever for a general factor associated with the use of a free-response format. (1982, p. 9)

Ward emphasizes that the results provide no evidence of a verbal production factor and also notes that the results of the concurrent validity study indicated that both multiple-choice and free-response items can be valid measures of verbal aptitude. Traub and Fischer (1977) also found high correlations across multiple-choice and free-response formats. Their results were similar to Ward's (1982) in that they concluded that the item format employed does not affect the attribute measured by the test. Similarly, Choppin and Purvis (1969) found multiple-choice and open-ended items equally valid in the measurement of their students' knowledge of literature.

The studies cited above provide a strong indication that multiple-choice and free-response test items can be written to measure the same thing. While the studies cited generally agree that free-response items tend to be slightly more difficult, there is no evidence to indicate that requiring students to supply a response to a question comprises a different or higher-order mental task than requiring them to select a response from a list of options.

General Studies Comparing the Reliability of Various Objective Test Item Formats

Researchers have expressed interest in comparing various objective test item formats from the early part of the century to the present. In 1921 Toops conducted what may have been the first study of this type. He compared the reliabilities of several general information tests composed of fifty items, each cast into free-response, multiple-choice, and true-false forms. Each of six groups took each test with the order of administration randomly assigned. The split-half reliabilities reported for the tests were very similar, ranging from .507 to .556.

In another study, Charles (1926) compared the split-half reliabilities of five-, three-, and two-response multiple-choice tests and a true-false test with the reliability of a free-response test. He administered 50 factual information items of introductory psychology to each subject in completion form followed by 50 items in one of the other forms. Charles did not perform any statistical significance tests, but did conclude that there existed little difference of practical significance between the reliabilities of the item formats.
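The split-half coefficients reported in these early studies are obtained by correlating scores on two halves of a test and stepping the result up to the length of the full test. The formula is not given in the studies cited, but it may be stated here for reference; it is the standard Spearman-Brown result for a test split into two halves:

    r_full = 2 r_half / (1 + r_half)

where r_half is the correlation between scores on the two half-tests. The general form of the same formula underlies the testing-time adjustments that recur throughout this chapter and in Chapter III.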
Ruch and Stoddard (1925) employed a design identical to Charles' but used items intended to measure knowledge of history and social science. They discovered that the split-half reliabilities for the 100-item tests composed of five-, three-, and two-choice multiple-choice items and true-false items were .886, .748, .849, and .714, respectively. The researchers found that the number of items that students were able to answer varied with the number of options posed in the item. Students were able to answer approximately 1.5 true-false items for every single five-choice multiple-choice item. Therefore, they elected to recalculate the reliabilities, equating them for testing time using the familiar Spearman-Brown Prophecy Formula. The corrected reliabilities were estimated to be .901, .806, .902, and .820 for the five-, three-, and two-choice and the true-false items, respectively. They offered no explanation for the especially good performance of the two-choice items.

Watson and Crawford (1930), Copeland and Gilliland (1943), and Eurich (1931) compared the reliability of multiple-choice and true-false tests and reported conflicting results. Watson and Crawford (1930) compared the formats on high school physics unit tests and reported higher reliability estimates (split-half) with multiple-choice items. Copeland and Gilliland (1943) found higher Kuder-Richardson 20 reliability estimates for a true-false test than for a multiple-choice test of child psychology when they equated the reliability coefficients for testing time using the Spearman-Brown Formula. Eurich (1931) performed two experiments comparing true-false and multiple-choice items on educational psychology. He reported that the internal consistency reliability estimate of the multiple-choice test was substantially higher in one trial and approximately equal to the true-false test reliability in the other trial. He concluded that the reliabilities of multiple-choice tests are consistently as high as, and usually higher than, the reliabilities of true-false tests.

Burmeister and Olson (1966) sought to determine whether a test composed of college-level natural science true-false items had the same desirable psychometric characteristics as the multiple-choice form. They concluded that true-false items could be constructed that discriminate almost as well as multiple-choice items. They also noted that true-false items tended to be somewhat less difficult than multiple-choice items. The difference in difficulty may have been due, at least partially, to the guessing effect. Grosse and Wright (1985) note that the effect of guessing is, to some degree, dependent on the number of alternatives offered by the items employed. However, as the number of items increases, the probability of attaining a passing or acceptable score by guessing alone significantly decreases, regardless of item format.

Ebel (1971) also studied the comparative reliability and validity of true-false and multiple-choice tests. He constructed two forms of a natural science test, each form composed of 44 true-false items and 44 multiple-choice items. Ebel notes that the mean discrimination indices tended to be higher for the multiple-choice tests. The Kuder-Richardson 20 reliabilities for the multiple-choice and true-false subtests were .81 and .84, respectively, for form one and .86 and .71 for form two.
The true-false reliabilities were estimated by the Spearman-Brown Prophecy Formula for a double-length test under the assumption that two true-false items can be attempted for every multiple-choice item attempted. In the investigation of concurrent validity, Ebel correlated students' scores on the two forms under the assumption that if both item types are measuring the same psychological dimension, the correlation between them would be unity. In order to compensate for the unreliability of the tests, he applied the correction for attenuation and reported that the corrected correlations between the multiple-choice and true-false subtests on the two forms were 1.20 and .80. Ebel concluded that the results of his study support the hypotheses that (a) true-false and multiple-choice tests are equally reliable when testing time is equated, and (b) there is no difference between the concurrent validity of multiple-choice and true-false tests.

Frisbee (1974) performed a study very similar to Ebel's but with a considerably larger sample. The Spearman-Brown Prophecy Formula was used in both investigations to adjust the reliabilities of the true-false tests in order to equate them with the multiple-choice tests on the basis of testing time rather than number of items. Ebel (1971) arbitrarily used a ratio of 2:1 (true-false items to multiple-choice items), as mentioned above, and Frisbee used an experimentally derived ratio of 3:2 in adjusting the Kuder-Richardson reliability estimates in his study. Perhaps as a function of the ratio used in equating the tests, the results of Frisbee's study did not confirm those of Ebel's (1971) study. Rather, the reliabilities of the multiple-choice tests in Frisbee's study were consistently higher than those of the true-false tests.

In the interest of identifying the optimal number of alternatives for increasing reliability, Williams and Ebel (1957) compared the reliability of tests composed of items having four, three, and two alternatives via the Kuder-Richardson 20 method. They concluded:

For tests of equal working time . . . three choice vocabulary test items gave a test of equal reliability, and two choice items a test of higher reliability, in comparison with the standard four-choice item. (p. 59)

Costin (1970) reported somewhat different results. His study indicated that three-choice items tend to be the most reliable. Ramos and Stern (1973) and Hogben (1975) investigated only four- and five-choice items and found the five-alternative items superior in reliability. It should be noted that Ramos and Stern did not equate the tests for testing time, as many other researchers did, but rather compared tests with an equal number of items. More recently, Straton and Catts (1980)
students should be able to complete more items in a fixed period of time thus ensuring greater coverage of subject matter. As a result. test reliability for three-choice items should be at least as high as that achieved with four choice items. Grier (1975) reviews the results reported surrounding the debate of the optimal number of alternatives and concludes that shorter items with fewer alternatives are frequently found to be more reliable than longer items with more alternatives such as five-choice items. He also notes that in addition to this advantage of increased reliability that there are at 23 least two other advantages associated with shorter items. The first is that it is easier to write one or two plausible distractors than it is to write three or four. The second is that shorter items tend to be more efficient in that students spend less testing time reading and interpreting them than with longer. more complex items. The results of Budesue and Nevo's (1985) research on item efficiency tended to support Grier's (1975) conclusions. They found a strong and consistently negative relationship between examinees' rate of performance and the number of options for the items. However. their research did not support the findings of Straton and Catts (1980) that three-choice items yield higher reliability estimates than items with four or five options. The research cited in this section spans the period from the 1920's when objective test items first became popular to the present. While the research surrounding any of the questions discussed provides no definitive results. it does appear to support a few general conclusions. First. the research seems to support the thesis that any item format can be successfully employed to test any subject matter. Second. it seems apparent that while free-response items tend to be somewhat more difficult than multiple-choice items. there is little 24 evidence to indicate the existence of a verbal production factor. Finally. it appears that shorter. less complex items with fewer options produce reliability estimates that are often as good or better than longer. more complex items with more options. The results of this research review provide support for the study of alternate-choice items as they clearly fall into the category of shorter. less-complex test items with fewer alternatives. Studies Comparing Amount of Testing Tips Most researchers comparing different forms have considered the efficiency of the forms an important variable. Efficiency is defined as the number of items to which examinees are generally able to respond in a given unit of testing time. Efficiency is important because when comparing the reliability of two or more test item forms. researchers have generally considered it appropriate to equate the tests for testing time (using the familar Spearman-Brown Prophecy Formula) and to compare the reliability of these theoretically lengthened tests. In two studies previously cited. Charles (1926) and Ruch and Stoddard (1925) reported differing results when they compared time required by examinees to respond to 25 test items of various formats. Charles (1926) reports that his subjects were able to respond to 1.4 true—false items for every five choice multiple-choice item and 1.2 true-false items for every three-choice item. The corresponding ratios from the Ruch and Stoddard study are 1.6 and 1.3. Williams and Ebel (1957) stated that subjects finished faster as the nubmer of response alternatives diminished. 
but they did not indicate how much faster. In another study, Ebel (1971) reported that subjects typically attempted two true-false items for every one multiple-choice item attempted. However, there appears to be general agreement in the results of other studies (Watson & Crawford, 1930; Copeland & Gilliland, 1943; Frisbee, 1974) that three true-false items can usually be attempted for every two multiple-choice items attempted.

In a more recent study, Ward (1982) compared the speed of examinees' performance on verbal aptitude items set in an open-ended (free-response) format and a multiple-choice format. He explains that all items were pretested to ensure that adequate time limits would be permitted during data collection to avoid problems associated with test speededness. On the basis of pretesting, Ward determined that 75% of the examinees were able to complete 20 multiple-choice items in 12 minutes and 20 open-ended items in 15 minutes (36 seconds and 45 seconds per item, respectively).
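To illustrate how such efficiency ratios enter the reliability comparisons described in the previous section, consider a brief worked example; the numbers are hypothetical and are not taken from any of the studies cited. If examinees can attempt three items of one format for every two items of another, the more efficient format can be lengthened by a factor of k = 1.5 within the same testing time, and its projected reliability follows from the general Spearman-Brown formula:

    r_k = k r / (1 + (k - 1) r)

For instance, a subtest with an observed reliability of r = .60 would project, at k = 1.5, to r_k = (1.5)(.60) / (1 + (.5)(.60)) = .90 / 1.30, or approximately .69. This is the kind of adjustment made by Ebel (1971) and Frisbee (1974) with their respective 2:1 and 3:2 ratios.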
highlight the fact that researchers (1) have considered the efficiency of an item type as an important consideration and (2) have been able to gauge item efficiency with some consistency. This information has been useful to researchers in estimating the reliability of tests of theoretical length. Recent empirical 28 evidence of the type described in this section is lacking for alternate-choice items. Therefore. provision was made in this study to collect data on the ratio of the number of multiple-choice to alternate-choice items to which examinees are able to respond in a given time period. Studies Considerins_§saminees' Preference for sn Item Type In addition to the factors mentioned above. researchers have also considered examinees' preference as a worthwhile element in the overall evaluation of an item format. It is hoped that such information may provide insight into examinees' ability to interpret and understand the question posed in the item. and to gauge students' acceptance of the item format. In one such case. Ebel (1982) conducted an informal survey of students enrolled in an introductory measurement course who had encountered both true-false and alternate—choice items. He reports that students generally expressed belief that the alternate—choice items were less ambiguous and equal in difficulty to the true-false test items used in the course. In addition. more than half of the students surveyed expressed a preference for alternate-choice items. Ward (1982) also considered examinee preference in his study comparing different 29 forms of verbal aptitude test items. He noted that students' perceptions that multiple—choice items were less difficult was confirmed by the data. Perhaps as a result of the perception that multiple-choice items were easier. students also expressed a preference for them. Students' perceptions were also considered an important variable in a study of the effects of incorporating humor in test items (McMorris. Urbach. 6 Connor. 1985). The researchers reported that the inclusion of humorous items did not affect the students' scores. yet students favored the inclusion of humorous items and perceived them to be easier. Item statistics did not support students' perceptions of the difficulty of these items. Rosenfeld and Anderson (1985) also studied the effects of including humor in test items. Their results were similar to those of McMorris. Urbach. and Connor (1985) in that all students viewed humor positively. However. Rosenfeld and Anderson did detect a significant sex difference in perception of the items and performance on them. The college males viewed the experimental items as much more humorous than their female counterparts did. but also scored lower. The researchers speculated that the males who perceived the items as extremely funny may have been more distracted by them than the female participants were. 30 Frisbee and Sweeney (1982). in a study previously cited comparing multiple-true-false and multiple-choice test items. asked the examinees for their perceptions of the relative difficulty of the two forms and for their preference. Results indicated that students' perceptions closely matched empirically derived tabulations of item difficulty. Also. nearly half (47.8‘) indicated a preference for multiple—true-false over multiple-choice items. About one-third (32.0‘) preferred multiple-choice and the remainder of the examinees expressed no preference of one type or the other. Campbell (1961) and Benson and Crocker (1979). 
and Green (1984) concentrated their investigations on students' perceptions of item difficulty. All of these studies concluded that item format and students' reading ability were significantly related to students' perceptions of item quality. In the same vein. Ebel (1980) was interested in students' perception of test item ambiguity. He suggests that. in many cases. students' complaints that a test item is ambiguous are the result of the students' own lack of knowledge or incomplete knowledge of the subject matter. Ebel refers to this condition as extrinsic ambiguity. Conversely. he explains that intrinsically ambiguous items are ambiguous due to the fact that the item is imprecisly worded or has 31 some other defect inherent in the item. Ebel (1980) emphasizes this distinction between intrinsic and extrinsic ambiguity in order to highlight the effect students' knowledge of the subject matter has on their perception of item quality. An ill-prepared student views many items as very difficult and/or of poor quality. While the research cited seems to indicate that examinees' perceptions (especially of item difficulty) may be significantly influenced by reading ability and subject matter knowledge. it also indicates that. as a rule. examinees are able to judge the relative difficulty of item types fairly accurately. It seems clear that examinees' perceptions should be considered an important element in the overall evaluation of a novel item type. If an item type does not have face validity and at least nominal familiarity to and acceptance of the examinees. use of the item type may serve to disturb or disrupt the "psychological set" of the examinees as they prepare for and take examinations (Frisbee 8 Sweeney. 1982). Summa I.” 2 In Chapter II a general survey of research surrounding the debate of the relative merits of various test item formats was provided. The review affirms that alternate—choice items constitute a potentially important 32 addition to the repertoire of test items employed by educators and other test users. As such. they merit further study. In Chapter III the methodology is presented: including a description of the students who participated in the study. and the procedures employed to test the hypotheses. CHAPTER III DESIGN AND PROCEDURES Introdpction This research study was designed to examine the reliability and concurrent validity of alternate-choice and multiple-choice tests of educational psychology. The difficulty and discrimination of the two item forms was compared. and students' preference for alternate-choice or multiple—choice test items was explored. In this chapter the sample is described and the instrumentation is discussed. Some examples of the items employed in the study are also provided. as well as sections outlining the methodology and the analysis. The chapter concludes with a brief summary and an introduction to Chapter IV. Hypotheses The hypotheses of the study were: 1. There is no difference in the difficulty indices of alternate-choice and multiple-choice tests. 2. There is no difference in the discrimination indices of alternate—choice and multiple—choice tests. 3. Examinees will attempt the same number of alternate—choice and multiple—choice items in a fixed time period. 33 34 4. There is no difference in the internal consistency reliabilities of alternate—choice and multiple-choice tests. 5. The Pearson product-moment correlation between individuals' alternate-choice and multiple-choice test scores is 1.00 when corrected for attenuation. 6. 
There is no difference in examinees' preference for alternate-choice and multiple-choice tests.

Subjects Participating in the Study

The subjects who participated in this study were undergraduate students at Michigan State University enrolled in the Spring Term of 1983 in a course titled Teacher Education 200, The Individual and the School. This one-term, four-credit course is required of all elementary and secondary education majors at Michigan State and serves primarily as an introduction to educational psychology. On the average, the course has 100 to 200 students enrolled in five to ten sections. Course sections have common content, textbooks, tests, and term papers and are taught by supervised teaching interns. In a typical term the majority of the students in the course are sophomores and juniors. About half of the students are elementary education majors and the other half are secondary education majors.

Subjects participating in the study were not randomly selected. Rather, all students enrolled in the course were included in the study. While it is generally agreed that complete randomization is necessary to guarantee the external validity of a study (Campbell & Stanley, 1963), it was not possible in the present case. Nevertheless, the subjects included in the study did appear quite representative of college students enrolled in other courses in the College of Education at Michigan State University.

Two groups of subjects were actually used in the study, since two phases of testing were required for instrument development and data collection. The first group was used to try out newly created alternate-choice test items. This group of subjects was composed of all students enrolled in Teacher Education 200 in the Winter Term of 1983. The second group, used in the actual data collection, was composed of all students enrolled in the course in the Spring Term of 1983. The final group of participants in the study was composed of 112 students enrolled in six sections.

Instrumentation

The multiple-choice items that were used in the study were drawn from a pool of currently existing items used to construct unit and final examinations for the course. Items in this pool had been prepared with great care and used in various tests for the previous six terms. These items had been analyzed, revised, and improved on the basis of expert judgment (judgment of the course instructors and educational psychology faculty) and item analysis. The items had also been classified according to topics and keyed to the objectives of the course. They appeared to provide adequate sampling of the curriculum. As a result of the care taken in their preparation, analysis, and revision, these multiple-choice items were judged to be technically sound, having appropriate difficulty, discrimination, and content validity. A few examples of these items are given below.

Bruner would place a child who makes use of visual images to organize his thoughts at which stage of cognitive development?
A. Concrete
B. Enactive
*C. Iconic
D. Symbolic

Bill West, a high school social studies teacher, wants his students to "become good citizens." In order to make this goal a reality he must first:
a. provide opportunities for citizenship to occur.
*b. specify how good citizenship is to be demonstrated.
c. contact good citizens in the community to serve as models.
d. identify appropriate rewards for achieving good citizenship.

Cognitivists take which of the following positions on the role of errors in learning?
a. Errors are threats to learning.
*b. All responses provide some feedback.
c. Errors are events that must be overcome.
d. Teachers should eliminate errors that they can anticipate.

Alternate-Choice Item Generation Procedure

Since a similar large pool of high-quality alternate-choice items pertaining to the course content was not available, it was necessary to create one. Alternate-choice test items to meet that need were written with the assistance of the course coordinator and course instructors. In addition to the multiple-choice items, the following resources were used in writing the alternate-choice items:

a) the course text and teachers' manual,
b) the course study guide, manual, and term projects,
c) common student problems and questions,
d) ideas from the course instructors and educational psychology staff, and
e) the multiple-choice test items.

As noted in Chapter I, the alternate-choice item format as proposed by Ebel (1982) is a unique item form. There is an important difference between the conventional two-choice multiple-choice form mentioned in the literature (Straton & Catts, 1980; and others) and the unique form proposed by Ebel. The pool of multiple-choice items described above was useful in highlighting important concepts and ideas, and served as an aid in identifying topics for writing alternate-choice items. However, the alternate-choice items were conceived and written independently of the multiple-choice items. They were not fashioned simply by eliminating two or three of the least attractive distractors from four- or five-choice items, as conventional two-choice items often are.

The alternate-choice items were written using a procedure suggested by Ebel (1980). In that procedure, he lists the first step in writing alternate-choice items as the identification of an important concept or idea which can be stated simply and definitively in a declarative sentence or proposition. This step is common to the creation of all items. With the concept statement in mind, Ebel instructs the item writer to analyze the statement for a central or critical element which could be conceived as a semantic differential and presented as polar alternatives to test the students' knowledge of the concept. According to Ebel, the response alternatives should:

1. Involve a critical element of the proposition.
2. Be distinctly different, opposite in meaning, or mutually exclusive.
The course coordinator and instructors who assisted With the item writing were surprised and delighted at the 40 ease with which they were able to generate items which. often after only minor revision. were judged to be quite acceptable items. As items were written. they were reviewed for flaws by measurement specialists. These items were also edited by the course instructors and educational psychology faculty consultants for subject matter relevance and accuracy. Based on these reviews. items were either retained. revised. or discarded. Using this method. a bank of approximately 300 alternate-choice test items was created. A few examples of these alternate-choice test items are given below. Piaget believed that learning is most likely to occur in the presence of cognitive *a) conflict. b) harmony. Most teachers agree that rote learning and drill are *a) efficient b) inefficient activities for enhancing the learning of young children. Cognitivists believe that a child who is given candy for doing his/her homework is a) more *b) less likely to learn as much over the long run as a child given no reward. In order to ensure that all items used in the final data collection were technically sound and also to provide an opportunity to "iron out" the details for the final data collection. a pilot study was conducted in the Winter Term of 1983. While the newly created alternate-choice items had been subjectively analyzed by subject matter and measurement experts. a more complete analysis required that the items be subjected to actual 41 try-out and item analysis. as had the multiple—choice items used in the course. Since each unit test and the final examination of the pilot study was to be composed of two content-parallel equivalent subtests. one subtest composed of alternate-choice items and the other composed of multiple-choice items. it was necessary to take 5 special care in test construction. A table of test specifications containing unit objectives. cognitive level to be tested. and the number of items allocated to measure each objective was designed for each test (see Appendix A). These tables of specifications were created with the aid of the instructors and the educational psychology faculty consultants. The test specifications were used in the process of item generation to ensure that the alternate—choice and the multiple-choice items were measuring the same level of cognitive complexity and that the tests composed of the two item formats were as content parallel as possible. Analysis of the pilot study data (see Table 1) revealed that. as a rule. the mean item difficulty. discrimination and the reliability of alternate—choice tests were only slightly lower than the multiple-choice tests. In the pilot study it was possible to try-out the newly written alternate-choice test items and on the 42 basis of item analysis and student reaction solicited informally. to refine them further. Foils that were not attractive were replaced. and items that were identified as either very difficult or very easy or that displayed low discrimination were marked for reevaluation. Accordingly. many of the original items were revised or rewritten. It was necessary to discard others. Table l Psychometric Characteristics of the Items Employed in the Pilot Study Mean Mean K-R 20 aTest Diff. Disc. Rel. 
Test 1 AC 64 26 .479 Test 1 MC 70 26 .509 Test 2 AC 83 24 .629 Test 2 MC 74 29 .650 Test 3 AC 81 18 .522 Test 3 MC 72 28 .552 Test 4 AC 75 24 .486 Test 4 MC 76 25 .595 Final AC 75 21 .605 Final MC 66 28 .705 aN of items = 30 for Tests 1—4. N=50 for the Final Exam Research Design In the Spring Term of 1983 the data for the study were collected. A counter-balanced design was employed 43 in which three unit examinations and a final examination was administered to all students enrolled in the course. Each unit examination was composed of two content-parallel subtests of thirty items: one composed of alternate-choice items and the other composed of multiple-choice items. The final examination was composed of 50 alternate-choice items and 50 multiple-choice items. In addition to comparing the psychometric properties of alternate-choice and multiple-choice items. the study was also designed to investigate the efficiency of alternate-choice items. Specifically. the study sought to identify the ratio of the number of alternate—choice items to multiple-choice items to which students were able to respond in a given time period. Such information was necessary to compare the reliabilities of the two forms equated for testing time. In order to gain this information. a time study was conducted in the Spring Term of 1983. Students in all sections were timed and asked to mark the item they had just completed at the end of five minutes and at the end of ten minutes. The students were then informed that they would not be interrupted further and were instructed to proceed with the examination. 44 Since students' perceptions of the tests and preference for alternate-choice or multiple-choice items was also deemed an important consideration. a survey was conducted at the end of the term to explore the students' perceptions and preferences. Students were asked to respond frankly and to write additional comments on the form. Care was taken to assure the students of anonymity in their feedback. On the basis of remarks made by students under the comments section of the survey and verbally to the instructors of the course it was concluded that the students reported honestly and took the task seriously. Analysis In the interest of gaining information necessary to test the hypotheses. each of the examinations was scored and analyzed. The item analysis programs employed provided information on examinees' scores. item and test difficulty and discrimination. and estimates of internal consistency reliability (Kuder-Richardson 20) necessary to evaluate the first three hypotheses. The definitions and formulas for these indices and coefficients may be found in an introductory measurement text. Multivariate analysis of variance (MANOVA) with alpha set at .05 was employed in the study to determine 45 whether the multiple—choice items differed significantly from the alternate-choice items in difficulty and discrimination. Norusis (1985) explains that MANOVA allows a researcher to test the differences between two or more groups on two or more dependent variables. In this study. item format was the independent variable and item difficulty and item discrimination were the dependent variables. Norusis (1985) also notes that the MANOVA procedure yields a main effect P for each of the dependent variables and for the interaction effects. 
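For readers who wish to reproduce this kind of analysis, the same two-group, two-dependent-variable design can be expressed compactly in modern software. The sketch below is a minimal illustration and is not the SPSS MANOVA run used in the study; the long-format layout (one row per item, with its format, difficulty, and discrimination) and the example values are assumptions.

```python
# Minimal sketch of a one-way MANOVA with item format as the independent
# variable and item difficulty and discrimination as the dependent variables.
# Illustrative only; column names and values are invented for the example.
import pandas as pd
from statsmodels.multivariate.manova import MANOVA

# One row per item: its format plus its difficulty and discrimination indices.
items = pd.DataFrame({
    "item_format":    ["AC", "AC", "AC", "MC", "MC", "MC"],
    "difficulty":     [0.72, 0.81, 0.68, 0.66, 0.70, 0.63],
    "discrimination": [0.28, 0.22, 0.31, 0.34, 0.30, 0.27],
})

manova = MANOVA.from_formula(
    "difficulty + discrimination ~ item_format", data=items
)
print(manova.mv_test())  # Wilks' lambda, Pillai's trace, etc., for the format effect
```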
If a significant MANOVA F is found, the researcher may elect to perform follow-up univariate analyses in an attempt to determine which of the dependent variables are contributing to the significant MANOVA F. Follow-up analyses were not performed in this study, since the statistical significance of the differences among the four alternate-choice examinations, and among the four multiple-choice examinations, in difficulty and discrimination was not of interest.

The third hypothesis pertained to the efficiency of the alternate-choice and multiple-choice items. To test that hypothesis, frequency distributions indicating the number of items to which subjects responded in the allotted time were constructed. The ratio of the means of the two distributions was tested for significance using the chi square test for goodness of fit. Shavelson (1981) explains that the purpose of the chi square test is to determine whether the observed distribution differs systematically from the theoretically expected distribution, or whether the differences may be attributable to chance. In the present case, the observed ratio of the number of alternate-choice to multiple-choice items to which students were able to respond was tested against the hypothesized ratio of 1:1, indicating no difference.

In the interest of testing the fourth hypothesis, the Kuder-Richardson 20 reliability estimates for the alternate-choice and multiple-choice tests were equated using the familiar Spearman-Brown correction formula. The equating procedure was based on the information gathered in the time trials previously mentioned, and the alternate-choice tests were theoretically lengthened so as to be equated in testing time with the multiple-choice tests. A procedure proposed by Feldt (1980) was employed to test these equated coefficients. Feldt explains that the statistic for conducting the test of $H_0\colon \rho_1 = \rho_2$ when the coefficients are obtained from the same sample is

    $t_{N-2} = \dfrac{(W - 1)\sqrt{N - 2}}{\sqrt{4W\left(1 - r_{x_1 x_2}^{2}\right)}}$,  where  $W = \dfrac{1 - r_{x_1}}{1 - r_{x_2}}$,

$N$ is the number of examinees, and $r_{x_1 x_2}$ is the correlation between the two tests. The test is based on the usual assumptions associated with the two-factor random model analysis of variance and is generally considered to provide a conservative test of the hypothesis that coefficient alpha (or Kuder-Richardson 20 if the items are scored dichotomously) is the same for two tests or measurement procedures. The procedure proposed by Feldt was used to test each of the four sets of reliability coefficients of the study.

It is important to recognize that many statisticians look with disfavor on the practice of performing consecutive tests that cannot be regarded as strictly independent, because this practice increases the possibility that at least one of the tests might reach significance by chance (Norusis, 1985). Accordingly, in the interest of maintaining a conservative test and minimizing Type I errors, the .01 level of significance was selected as the decision point for rejecting the fourth hypothesis.

Hypothesis 5 pertains to the concurrent validity of the alternate-choice and multiple-choice formats. To test this hypothesis, a Pearson product-moment correlation coefficient was calculated between examinees' scores on the multiple-choice and alternate-choice subtests on each of the four examinations. The correlation coefficients were adjusted for unreliability in the measurement of the two variables by the correction for attenuation formula given by Ghiselli (1964, p. 268):

    $r_{tt} = \dfrac{r_{xy}}{\sqrt{r_{xx}\, r_{yy}}}$,

where $r_{tt}$ is the estimated true correlation between scores on x and y, $r_{xy}$ is the correlation between the observed scores on x and y, and $r_{xx}$ and $r_{yy}$ are the reliability coefficients for the x and y scores, respectively.
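Taken together, the Spearman-Brown adjustment, Feldt's statistic, and the correction for attenuation form a short computational pipeline. The sketch below is an illustrative implementation of the formulas quoted above rather than the software actually used in the study; the input values follow Test I of this study, and the ordering convention inside Feldt's W is an assumption, so the printed statistic may differ in sign or slightly in magnitude from the tabled value.

```python
# Illustrative sketch of the three reliability computations described above:
# (1) Spearman-Brown lengthening of the alternate-choice KR-20 by the 1.58
#     rate-of-work factor, (2) Feldt's (1980) test for two coefficients from
#     the same sample, and (3) the correction for attenuation.
import math

def spearman_brown(rel, k):
    """Estimated reliability of a test lengthened by a factor of k."""
    return k * rel / (1 + (k - 1) * rel)

def feldt_t(rel_1, rel_2, r_12, n):
    """Feldt's t (df = n - 2) for H0: the two reliability coefficients are equal."""
    w = (1 - rel_1) / (1 - rel_2)   # ordering of the coefficients is an assumption
    return (w - 1) * math.sqrt(n - 2) / math.sqrt(4 * w * (1 - r_12 ** 2))

def disattenuate(r_xy, rel_x, rel_y):
    """Correlation corrected for unreliability in both measures."""
    return r_xy / math.sqrt(rel_x * rel_y)

ac_kr20, mc_kr20 = 0.711, 0.778   # Test I KR-20 estimates
r_between, n = 0.689, 103         # correlation between subtests, number of examinees

ac_corrected = spearman_brown(ac_kr20, 1.58)             # about .795, as reported for Test I
t_w = feldt_t(mc_kr20, ac_corrected, r_between, n)       # Feldt statistic for the two coefficients
r_true = disattenuate(r_between, mc_kr20, ac_corrected)  # close to the .877 reported for Test I
print(round(ac_corrected, 3), round(t_w, 3), round(r_true, 3))
```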
Theoretically, if the two item formats are measuring the same construct, the correlation between them, when corrected for attenuation, will equal one. The corrected correlation coefficients were tested to determine whether their values differed from unity using a procedure suggested by Lord (1957). Lord notes that the appropriate test statistic in situations of this type is chi square (χ²) distributed with one degree of freedom. In order to provide a conservative test, alpha was preset at the .01 level of significance.

The last hypothesis (Hypothesis 6) pertains to examinee preference for tests composed of alternate-choice and multiple-choice items. In order to gain the information necessary to evaluate examinees' preference, a questionnaire was constructed and administered to the examinees at a class period near the end of the term. Examinees' responses were tabulated and analyzed to identify examinee preference. The chi square goodness of fit test previously described was used to test examinees' preference for tests composed of multiple-choice or alternate-choice items.

Summary

The 112 subjects who participated in the final testing phase of this study were described as representative of undergraduate college students enrolled in the College of Education at Michigan State University. In a counterbalanced design, each subject was administered three unit examinations and a final examination composed of content-parallel alternate-choice and multiple-choice subtests. Examinees' responses were analyzed, and difficulty indices, discrimination indices, and reliability coefficients were calculated. MANOVA was employed to test the differences in the difficulty and discrimination indices of the alternate-choice and multiple-choice subtests. The ratio of the number of alternate-choice and multiple-choice items to which subjects were able to respond in a given unit of testing time was calculated, and the procedure proposed by Feldt (1980) was employed to test the differences in the reliability coefficients of the alternate-choice and multiple-choice tests for significance. Also, the correlation between individuals' alternate-choice and multiple-choice subtest scores was calculated and corrected for attenuation for each testing form to provide information on the concurrent validity of the alternate-choice test items. Finally, a questionnaire was administered and the responses analyzed to evaluate examinees' preference for alternate-choice or multiple-choice items.

CHAPTER IV

RESULTS OF THE STUDY

Introduction

The results of the study are presented in Chapter IV. The chapter is divided into major sections corresponding to the research hypotheses. The first section deals with the comparisons of the difficulty and discrimination of the two item forms. Section two pertains to the findings regarding the number of multiple-choice and alternate-choice items to which examinees were able to respond in the time trials. The third section contains the results relevant to the reliabilities of the multiple-choice and alternate-choice subtests. Results that reflect on the concurrent validity of the subtests composed of the two item formats are reported in the fourth section, and results pertaining to students' perceptions of alternate-choice items are presented in the fifth. The chapter concludes with a brief summary.
Results Concerning Difficplty,and Discriminstion The data were analyzed using MANOVA of the Statistical Package for the Social Sciences (Nie. Hull. Jenkins. Steinbrenner. 6 Brent. 1985) to determine 51 52 whether significant differences existed in the difficulty and discrimination indices of the alternate—choice and multiple-choice items. The results of the analysis are presented in Table 2. An inspection of Table 2 reveals that the tests composed of alternate—choice items tended to be easier than the multiple-choice tests. This is evidenced by the fact that the means of the item difficulty indices for the alternate-choice tests were generally higher than those of the corresponding multiple-choice tests. indicating a greater proportion of the examinees answered the alternate-choice items correctly. These differences were found to produce a significant (F > .05) main effect for item difficulty in the MANOVA analysis. It was. therefore. necessary to reject the first hypothesis of no difference in the difficulty of the two formats. The results of the analysis of the discrimination of the tests were similar to those of the difficulty analyses. The data presented in Table 2 shows that the multiple-choice items tended to be somewhat more discriminating than the alternate-choice items. However. the results of the MANOVA analysis indicated that the difference in the mean discrimination indices of the tests composed of the two item forms was not significant at the .05 level. On the basis of the analysis of the Descriptive Statistics and Multivariate Analysis of Variance for Difficulty and Discrimination 53 Table 2 Test Mean Diff. S.D. MANOVA P DIFFICULTY Test I Alternate—Choice 68 15.52 7.372 .017 Multiple-Choice 71 16.74 Test II Alternate—Choice 72 14.59 Multiple-Choice 69 15.21 Test III Alternate—Choice 72 14.66 Multiple-Choice 68 16.95 Final Exam Alternate-Choice 75 17.71 Multiple-Choice 65 22.04 DISCRIMINATION Test I Alternate—Choice 32 10.28 0.246 .620 Multiple-Choice 36 11.15 Test II Alternate-Choice 26 12.16 Multiple—Choice 34 13.08 Test III Alternate—Choice 32 12.49 Multiple-Choice 26 11.03 Final Exam Alternate-Choice 24 13.78 Multiple—Choice 26 12.44 54 discrimination indices of the tests. the second hypothesis was not rejected. Results Concerning Efficiency of Alternste-Choicessps gpltiple-Choice Items The third hypothesis of the study stated that examinees can respond to the same number of alternate—choice and multiple-choice items in a fixed time period. In order to examine the data pertinent to this hypothesis. frequency distributions were constructed of the number of items students attempted in the time trials. The mean number of alternate—choice and multiple-choice items which students attempted and the ratio of the number of each form attempted in each trial are presented in Table 3. The results of the time study indicated that the examinees were able to respond to a greater number of alternate-choice items than multiple choice items in the time allowed. In both Test I and Test II students were timed and instructed to mark the item on which they were currently working at five minutes and again at ten minutes. The ratio of the average number of alternate-choice to multiple-choice items attempted by the students in the two tests was then calculated. These ratios. which serve as an index of the relative rates of work by subjects on the two item forms. 
were quite stable 55 Table 3 Mean Number of Alternate-Choice and Multiple-Choice Items Attempted in the Time Trials Testa 5 Minutes 10 Minutes Test I Alternate-Choice 15.3 26.6 Multiple—Choice 10.4 17.4 RATIO (AC:MC) 1.47 1.53 Test II Alternate-Choice 14.4 26.7 Multiple-Choice 8.6 16.2 RATIO 1.67 1 65 aN=103 for Test 1. n=108 for Test II. 56 across trials. ranging from 1.47 (one multiple—choice item to 1.47 alternate—choice items) to 1.67. with an average of 1.58. The proportion of 1:1.58 was tested using the chi square goodness of fit test to determine the probability of this proportion if the population proportion is actually 1:1 as hypothesized. The analysis g revealed a chi square of 10.24 (1. N=98) which was found to be significant at the .01 level. The conclusion drawn from these results was that. in general. students attempted approximately three alternate-choice items for I every two multiple-choice items they attempted. Results Concerning the Relisbility of the Alternste-Choice and Msltiple-Choice Tests The fourth hypothesis of the study proposed no differences in the reliability of tests composed of either alternate-choice or multiple—choice items. The study of the reliability of the two forms involved calculating the Kuder—Richardson 20 estimate for each of the unit examinations and the final examination. These reliability coefficients are reported in Table 4. The alternate-choice test reliability estimates were adjusted with the Spearman-Brown formula to estimate the reliabilities of tests 1.58 times as long as the original tests. The adjusted reliabilities also appear in Table 4. 57 Table 4 Significance Tests of the Reliability of the Alternate-Choice and Multiple—Choice Tests Test KR—20 aCorrected tw P Test I alternate—choice .711 .795 .655 .514 multiple-choice .778 Test II alternate-choice .547 .666 1.130 .261 multiple-choice .601 Test III alternate-choice .588 .693 .201 .842 multiple—choice .683 Final Examination alternate-choice .682 .772 1.578 .117 multiple-choice .713 aReliabilities of the alternate-choice tests were adjusted by the Spearman-Brown formula to estimate the reliability of tests 1.58 times as long. 58 The difference between the corrected alternate-choice reliability coefficients and the multiple—choice test reliability coefficients were tested using the technique (tw) proposed by Feldt (1980) to determine whether any of them were statistically significant. The tw statistic is provided in Table 4 with the corresponding probability level. None of the differences in the reliability coefficients were found to be significant at either the .01 or .05 level. Therefore. the fourth hypothesis of no difference in the i reliability coefficients of the two item forms was not rejected. Results Concerning Concpgrent Validity The fifth hypothesis of interest pertained to the concurrent validity of the alternate-choice and multiple-choice item formats. In order to test this hypothesis. examinees' responses on each examination were scored to derive a separate alternate—choice and multiple—choice subtest score. A Pearson product moment correlation coefficient was calculated between students' scores on the multiple—choice (x) and the alternate-choice (y) tests. These correlation coefficients are presented in Table 5. The coefficients were then corrected for attenuation and the chi square 59 technique proposed by Lord (1957) was used to test whether the dis-attenuated correlation coefficients differed significantly from unity. 
The results of that analysis are also presented in Table 5.

Table 5
Significance Tests of the Correlation Coefficients for Multiple-Choice and Alternate-Choice Scores on the Tests

Test Form       N     r_xy    r_tt(a)   chi square (df = 1)
Test I         103    .689     .877        8.222*
Test II        108    .583     .921        1.825
Test III       112    .530     .770       19.425*
Final Exam     106    .666     .898       17.162*

(a) r_tt designates r_xy corrected for attenuation.
*p < .01

The results revealed that three of the four disattenuated coefficients were significantly different from unity at the .01 level. The conclusion drawn from the analysis was that the corrected correlations between individuals' multiple-choice and alternate-choice scores were not perfect (equal to one), and that the concurrent validity of the two forms was not equal.

Results Concerning Students' Preference for Alternate-Choice and Multiple-Choice Items

Prior to the administration of the final examination and the conclusion of the course, students were asked to complete a brief questionnaire designed to measure their perceptions of the course tests. Whether the students would prefer alternate-choice items over multiple-choice items, as Ebel (1982) suggested, was of special interest. The results of that questionnaire are presented in Appendix B.

The results of the questionnaire indicated that the students tended to view the course testing, as well as the alternate-choice items employed, quite positively. For example, over 70% of the students indicated that the tests provided a strong motivation to learn the principles taught, and that the alternate-choice items tested important concepts in the curriculum. It is interesting to note that although slightly more than half of the students perceived alternate-choice items to be more difficult and ambiguous than multiple-choice items, these perceptions were not verified by the item statistics. Question nine, however, indicated that a majority (74%) of the students responding thought that future exams should be composed of both multiple-choice and alternate-choice items. Only 25% of the students favored either the exclusive use of multiple-choice (21%) or alternate-choice (5%) items on future exams.

The question most pertinent to the sixth hypothesis of the study was item number six, which posed the statement, "I would prefer taking a test composed of alternate-choice items to a test composed of multiple-choice items." Only 34% of the students responded affirmatively to this statement, while 66% of them disagreed. This proportion of 66:34 was tested using the chi square goodness of fit test. The analysis revealed a chi square of 12.47 (df = 1, N = 98), which was found to be significant at the .01 level, indicating that a significant majority (66%) of the students responding did not prefer taking a test composed of alternate-choice items. Because these results did not provide evidence of a preference for alternate-choice items over multiple-choice items, the sixth hypothesis was not rejected.

Summary

The results of the data analysis for this study were presented in this chapter. The findings concerning the major research hypotheses were:

1. While the multiple-choice form was significantly more difficult than the alternate-choice form, the difference in the discrimination of the two forms was not found to be significant.

2. Students attempted approximately three alternate-choice items for every pair of multiple-choice items attempted.
The Kuder-Richardson 20 reliability estimates for the multiple-choice tests were not significantly different from those of the corrected for length alternate-choice test reliability estimates. The correlations. corrected for attenuation. between the multiple-choice and alternate-choice test scores were significantly different from unity for three of the four tests. indicating a difference in the concurrent validity of the two forms. Students did not express a preference for alternate-choice items over multiple-choice items. The majority. however. indicated that both alternate—choice and multiple-choice items should be included on future tests. CHAPTER V SUMMARY AND CONCLUSIONS Summa I! The purpose of this study was to compare the reliability of multiple-choice and alternate-choice tests and to explore the concurrent validity of alternate—choice tests that were written to measure understandings of concepts and relationships in an introductory Educational Psychology course. The major questions of the study that were formulated as research hypotheses were: 1. Are alternate-choice and multiple—choice achievement tests that were designed to measure the same objectives equally difficult and discriminating? 2. What is the ratio of the number of alternate—choice to multiple-choice items to which examinees are able to respond in a given time period? 3. Are alternate—choice and multiple-choice achievement tests that were designed to measure the same objectives equally reliable? 4. Will the corrected for attenuation correlations between examinees' scores on the multiple-choice and alternate—choice tests equal unity? 63 64 5. Will examinees exhibit a strong preference for the newer alternate-choice format over the familiar multiple-choice item format? Ebel (1980) proposed a novel item format termed alternate-choice items. He suggested that this new item type would compare quite favorably with other conventional test item formats with regard to difficulty. discrimination. reliability and validity. He also suggested that students tended to prefer them over more conventional item formats. A review of the literature revealed a large number of studies comparing the reliability. validity. and other psychometric properties of various test item forms. However. since the alternate-choice item format was new. very little research had been conducted on it. While this unique item form proposed by Ebel appeared to show potential as an important addition to the repertoire of test item formats that teachers and other item writers have at their disposal. little empirical research existed to substantiate Ebel's claim. In this study a group of approximately 112 undergraduate students enrolled in an introductory educational psychology course at Michigan State University each responded to three unit tests and a final examination composed of content—parallel alternate—choice 65 and multiple-choice subtests. Students' responses were scored and analyzed to identify the difficulty and discrimination indices of the items. Kuder—Richardson 20 reliability estimates for the alternate-choice and the multiple-choice items were also calculated. Students were timed on two occasions to identify the number of alternate-choice and multiple-choice items to which they were able to respond in a given time period. The correlation between individuals' scores was calculated and corrected for attenuation for each unit examination and the final examination. In addition. 
a questionnaire was developed and administered to the subjects to explore their perceptions of alternate-choice items and their preference for either alternate-choice or multiple-choice items. Statistical tests were performed to determined if the difficulty. discrimination and reliability of the two item types were significantly different and to determine if the value of the corrected correlation coefficients departed significantly from unity. Statistical tests were also performed to determine the probability of the observed results of the time study and of students' responses on the questionnaire. 66 Conclpsions The conclusions associated with the research hypotheses listed in Chapter I were: 1. The multiple-choice items were significantly more difficult than the alternate—choice items. 2. The multiple-choice items were not significantly more discriminating than the alternate—choice items. 3. Students were able to respond to approximately three alternate-choice item to every two multiple-choice items they attempted. 4. There was not a significant difference between the reliabilities of the alternate—choice and the multiple-choice tests. 5. The corrected for attenuation correlations between the multiple-choice and alternate-choice test scores were significantly different from unity for three of the four examinations. 6. Students did not express a strong preference for either item format used in the study. but generally viewed both formats positively. Discussion The results of the analysis of difficulty and discrimination of the alternate—choice and 67 multiple-choice items revealed significant differences in the difficulty but not the discrimination of the two forms. These results may be interpreted as evidence that alternate-choice items discriminate between knowledgeable and less-knowledgeable examinees approximately as well as multiple-choice items. even when they are less difficult. Because it is probably easier for item F writers to produce test items that are less difficult than it is to produce items that are of medium to high difficulty. this may constitute an important advantage. I especially since alternate-choice items may be comparatively easier to write than multiple-choice items in the first place. Statistical significance aside. the practical importance of these results is that they support Ebel's (1982) claim that a person who is content knowledgeable and a fairly skilled item writer can produce alternate-choice items which compare quite favorably psychometrically with other item formats. As expected. students' "rate of work" varied with item form. Students were able to respond to substantially more alternate—choice items in a given unit of testing time than multiple—choice items. Since it has previously been demonstrated (Ebel. 1971: Frisbee. 1974) that students are usually able to respond to approximately three true-false items to every two 68 multiple—choice items. it was expected that students' rate of work on alternate-choice items would approximate that of true-false items. It was. therefore. not surprising to find that students answered an average of 1.58 alternate—choice items to every multiple-choice item they attempted for a ratio of about 3:2. While the primary purpose of the identification of this ratio of alternate-choice to multiple choice items was to equate the alternate—choice tests with the multiple-choice tests for testing time. it does have additional practical importance. 
A practical consideration of this finding is that a longer test may be administered in a given period of testing time if the items are in alternate-choice form as opposed to multiple-choice form. This would be especially important if the examiner is concerned about the adequacy with which the sample of items which comprise the test represent the universe of content. Also. a longer test can provide for a more thorough sampling of the universe and will usually constitute a more reliable measure. The results of the reliability analysis generally followed those of the difficulty and discrimination analyses. The reliability estimates of the alternate-choice tests were slightly lower than those of the corresponding multiple—choice tests. However. when 69 the reliabilities of the alternate-choice tests were adjusted by the Spearman—Brown formula to equate them with the multiple-choice tests for testing time. in each case they were slightly higher than those of the corresponding multiple-choice test. None of these differences were statistically significant. yet the reliabilities of all of the tests were quite respectable for classroom tests of this type. The results of the reliability analysis provides further support for Ebel's claims regarding the usefulness of alternate—choice items. It is important to recognize that these results are somewhat confounded because test reliability is related to the difficulty and discrimination of the items involved. The results of the validity study are more difficult to interpret. Three of the four disattenuated concurrent validity coefficients were found to be significantly different from one. indicating a difference in the concurrent validity of the two forms. The explanation for this difference is not readily apparent. While a concerted effort was made to create alternate-choice tests that were content-parallel and that tested the same level in the cognitive hierarchy. it is possible that these results are due to systematic differences in the items of the subtests. It may also be 70 true that the format of an item makes it more conducive to a particular type of question than to another. or some concepts may lend themselves more readily to being tested with one item format than another. For example. alternate-choice items seem particularly well suited for posing comparative statements or propositions to the examinee. The alternate-choice items written for and used in the study may be generally directed toward concepts that are most easily adapted to the item form. These concepts may have substantive differences or vary in cognitive complexity from those most readily adapted to the multiple-choice format. Another possible explanation for the difference in concurrent validity may be in the cognitive task of responding to the different item forms. The cognitive task of responding to the alternate-choice items may be somewhat different from that required of the multiple-choice items. Since alternate-choice items are new. the examinees had not had nearly the exposure to them as they have had to multiple—choice items. and it may be that the cognitive tasks of reading the alternate-choice item. identifying the question. evaluating the alternatives. and selecting a response may be somewhat different from the series of cognitive tasks involved in responding to a multiple-choice item. 71 From a practical standpoint. the correlations found between students' scores on the alternate—choice and multiple-choice formats were quite high. 
Students who scored high on one form also tended to score high on the subtest composed of the other item format and vice versa. Further. both item formats appeared to do a good job of testing students' achievement and to be quite valid. The results of the questionnaire administered to the students indicated that they did not have a strong preference for alternate—choice items as Ebel (1982) predicted. Nevertheless. the examinees viewed the tests in general and the alternate-choice tests in particular. quite positively. Students indicated that the alternate—choice items were challenging and did a good job of testing their knowledge of the course content. and that most of the alternate—choice items were relevant and tested important points in the curriculum. It is interesting to note that more than one-half (60%) of the students perceived that the alternate-choice items were more difficult than the multiple-choice items. These results were not verified by the item statistics. As mentioned previously. the alternate-choice items were clearly not more difficult than the multiple—choice items. In fact. the analysis of the indices of 72 difficulty for the alternate-choice items showed the examinees responded correctly to them more often than they did to the multiple-choice items. In the same vein. it is interesting to observe that 59‘ of those responding to the questionnaire perceived the alternate-choice items to be ambiguous. Yet 45‘ of those responding held the contrary view that it was easier to understand and interpret the question in alternate-choice items than in multiple-choice items. and 32‘ indicated that the multiple-choice items were more ambiguous than the alternate-choice items. As with students' perceptions of item difficulty. students' percepitons of item ambiguity was not confirmed by the item statistics. Items are considered ambiguous when there is not a single correct answer on which experts would agree. Ambiguous items tend to confuse examinees and. therefore. to produce lower or negative indices of discrimination. Item statistics for ambiguous items usually show the higher achieving examinees unable to select a single response as correct and often divided between two or more of the options. The item analysis statistics for the tests used in this study did not show such a pattern. While the item discrimination tended to be lower than ideal. this was more attributable to the relatively low difficulty of the items. 73 A possible explanation for the students' perception of ambiguity may lie in the students' understanding of the concept of ambiguity. Ebel (1980) suggested that many students do not possess a clear understanding of the term and often regard all items to which they do not know the correct answer as ambiguous. Their perceptions of ambiguity. therefore. may be at least partially attributable to low achievement. Incomplete knowledge which does not equip the student to detect subtle differences or make fine distinctions may lead the student to view many items as difficult and ambiguous. In the present case. the most likely explanation of the results may be that the lower achieving students tended to view many items. both alternate—choice and multiple-choice as difficult and ambiguous. The fact that some examinees tended to view the alternate-choice items as more difficult and ambiguous than the multiple-choice items may also be at least partially attributable to the novelty of the form. 
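The kind of item-analysis evidence referred to above is easy to illustrate. The sketch below is a hypothetical example, not the item-analysis program used in the study; the upper-lower "D" index is one common choice of discrimination index (the study does not name the index its program computed), and the simulated response matrix is invented for the demonstration.

```python
# Illustrative item-analysis sketch showing how difficulty and discrimination
# indices flag an "ambiguous" item: high and low scorers answer it at about
# the same rate, so its discrimination index sits near zero.
import numpy as np

def item_analysis(scores, frac=0.27):
    """scores: (examinees x items) 0/1 matrix. Returns difficulty and D per item."""
    totals = scores.sum(axis=1)
    order = np.argsort(totals)
    n_group = max(1, int(round(frac * scores.shape[0])))
    lower, upper = order[:n_group], order[-n_group:]
    difficulty = scores.mean(axis=0)                              # proportion answering correctly
    d_index = scores[upper].mean(axis=0) - scores[lower].mean(axis=0)
    return difficulty, d_index

rng = np.random.default_rng(0)
n_examinees, n_items = 200, 20
ability = rng.normal(size=n_examinees)
# Nineteen sound items keyed to ability, plus one "ambiguous" item answered at random.
sound = (ability[:, None] + rng.normal(size=(n_examinees, n_items - 1)) > 0).astype(int)
ambiguous = rng.integers(0, 2, size=(n_examinees, 1))
scores = np.hstack([sound, ambiguous])

diff, disc = item_analysis(scores)
print(round(diff[-1], 2), round(disc[-1], 2))  # ambiguous item: moderate difficulty, D near zero
```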
This study was designed to compare selected psychometric indices between alternate-choice and multiple-choice items. The purpose of the study was to provide evidence as to the usefulness of the alternate-choice format in testing achievement. To that end. the results of the statistical analyses are probably 74 not as important as the practical importance of the study. Wesman (1971) points out that the generalizability of most research studies of item form effectiveness and item comparison studies are quite limited. He notes that this is principally due to the fact that situational variables such as the content of the items and the skill of the item writer cannot be controlled. For example. in one study. well written true-false items on natural science may appear quite superior to lower quality sentence completion items. In another study. well conceived completion type items written to test students' ability to formulate hypotheses in the social sciences may be much more reliable than a set of multiple—choice items written to test the same subject. Even though the generalizability of the present study may be quite limited. it does constitute a demonstration of the usefulness of alternate-choice items. Also. the statistical significance tests are probably not as important as the fact that a relatively large pool of alternate—choice items were written and. after revision. were found to perform quite well and to yield quite acceptable psychometric indices of quality. It is also of importance to note that the examinees viewed the alternate-choice items positively and believed 75 that the items presented to them in this novel format did a good job of testing how much they knew. Limitations of the Study The results of this study should be interpreted with caution. for certain limitations existed in the study. Major limitations of the study fall into two major categories: 1) those associated with sampling. and 2) uncontrolled situational variables. As described in (naapter III. the participants consisted of all students (n=112) enrolled in an introductory educational psychology course at Michigan State University. The fact tniat the participants were not randomly selected from a defined population limits the generalizability of the study. The generalizability is further limited by the fact that the participants were a fairly homogeneous group of students majoring in Education. It is uncertain tC> what extent these students are representative of cO‘llege students of other major fields of study or at OtJJer universities. or to what extent these results can be generalized to other age groups. The study is further limited by uncontrolled Situational variables. As previously noted. Wesman (1971) cites the fact that researchers are unable to Control situational variables as the principle cause of 76 the failure of item effectiveness or item comparison studies to significantly advance our knowledge or art of item writing. Wesman identifies the skill of the item writer and the subject matter as two important situational variables that commonly limit studies of item comparison or effectiveness. Unfortunately. both of these variables remain uncontrolled in this study further limiting the generalizability of the results. Suggestionsifor Fprther Research The following suggestions are offered for further investigation into the comparative effectiveness of the alternate-choice item format. 1. 
The generalizability of the present study was limited by the effects of situational variables such as the subject matter tested and the skill of the item writers. It would be appropriate for future research to study the effectiveness of alternate—choice items in other situations. testing different subject matters. and written by other item writers. 2. Examinees' perception of item difficulty and conception of item ambiguity merits further investigation. possibly in relation to the examinees' knowledge of the subject matter or level of achievement. 77 The number of alternate-choice items to which examinees are able to respond in a given time period could be investigated using different subject matter with item difficulty as a control variable. The results of this study suggested differences in the concurrent validity of the item formats. Research might be designed to investigate these differences further. The cognitive task of responding to various item formats may be quite different and may constitute a significant source of item difficulty. Additional research into the cognitive aspects of responding to test items could be fruitful. APPENDICES APPENDIX A TABLES OF SPECIFICATIONS gsble of Specificstions INTRODUCTION AND UNIT I: COGNITION Cbmpre— hension Appli- (Tnmm- canon laiion, (Abstract Analysis lnterpre— to Ccn- (Elements, Know- iaiion, crate and Relations, ledge Extrapo— vice Organizing Unit Content (Recall) lation) versa) Principles) TOTAL 1. Teacher as a decision maker 1 1 2. Theory into practice, e.g.. 1 1 Physical Development 3. Piaget's theory of learn- 1 l 1 3 ing 4. Piaget's four Develop- 3 1 l 5 mental periods 5. Relation between language 1 1 2 and thought 6. Bruner's perspective on 1 l l 3.5 learning: Discovery 7. Ausubel: Didactic 1 1 1 3.5 Instruction 8. Klausmeier's Model of l 1 2 Concept Attainment 9. Attention to Stimuli 3 1 1 S 10. Acquisition and Retention 3 l l S 78 INTRODUCTION AND UNIT I: 79 COGNITION (cont.) Canne— lwmskm App”- (Trans- cation Iation, (Abstract Analysis lnterpre- to Ccn- (Elements, Know- tation, crate and Relations. leQFr Ethpr-iflce Oqunhflng Unit Content (Recall) lation) versa) Principles) TOTAL 11. Improving Performance 4 1 l 6 12. Retrieval and Transfer 2 1 l 4 13. Memory: Short-Term and 1 1 1 3 Long-Term 14. Social-Economic Status 1 l 2 15. 1.0. Testing and its uses 1 l l 3 16. Sex Roles and Sexism 1 l l 3 1?. Cognitive styles 2 1 l 4 18. Differentiation l 1 2 l9. Cuilford's Model of the 1 l Intellect 20. Creativity and its en— 1 l 2 hancement TOTAL 30 17 13 l 51 80 Table of Specificstions UNIT II: BEHAVIORAL THEORY AND SOCIAL—EMOTIONAL DEVELOPMENT Cempre- hension Appli— (Trans- cation lation, (Abstract Analysis lnterpre- to Con- (Elements, Kncw— tation, crate and Relations, ledge Extrapo— vice Organizing Unit Content (Recall) lation) versa) Principles) TOTAL 1. Conditions of Learning 2 1 3 2. Simplistic Theories of 1 1 Learning 3. Early Behavioral Research 1 1 4. Quantitative Behavioralism 1 l S. Skinner: Theory of Rein— 2 1 l 4 forcement 6. Bandura: Social Learning 2 1 1 4 Theory 7. Modeling: Characteristics 2 1 1 4 8. Reinforcement: Char— 2 3 1 6 acteristics 9. Using reinforcement 2 3 l 6 10. Reducing Undesirable 4 1 l 1 7 ll. 
Social-Emotional 1 1 2 Development 81 Isble of specificstions UNIT II: BEHAVIORAL THEORY AND SOCIAL-EMOTIONAL DEVELOPMENT Canne— inmflon Anni- (Trans— cation lathwu UUstnxfl lhnhwfls lnterpre— to Con- (E Ianents, Know— tation, crate and Relations, ledge Extrapo— vice Organizing Unit Content (Recall) lation) verse) Principles) TOTAL 1. Conditions of Learning 2 l 2. Simplistic Theories of 1 Learning 3. Early Behavioral Research 1 4. Quantitative Behavioralism l 5. Skinner: Theory of Rein- 2 1 l forcement 6. Bandura: Social Learning 2 1 1 Theory 7. Modeling: Characteristics 2 l l 8. Reinforcement: Char- 2 3 1 acteristics 9. Using reinforcement 2 3 1 10. Reducing Undesirable 4 l 1 1 11. Social—Emotional 1 1 Development 82 Isble of Specifications UNIT III: INSTRUCTION Comra- lnmflon AnMi- (Trans— cation laHon, Oflxtnxfl Amahnfls lnterpre— to Con- (Elcnents, Know— tation, crate and Relations. ledge Extrapo- vice Organizing Unit Content (Recall) lation) verse) Principles) TOTAL 1. Educational Goals and 3 1 4 Learning Objectives 2. Taxonomies of Educational 2 l l 4 Objectives 3. Types of Learning Outcomes 4 4 4. Information Processing 1 1 2 5. Designing Instruction 1 1 2 and adapting needs 6. Self—fulfilling prophecy 1 l 2 7. Effects of Inappropriate 3 2 1 6 Teacher Expectations 8. Characteristics of 2 1 1 1 5 Appropriate Teacher Expectations 9. Student Self-Concept 2 l 1 4 10. Teaching for personal 1 l 2 growth UNIT III: 83 INSTRUCTION (cont.) Ccmpre- hension Appli- (Trans- cation lation, (Abstract Analysis lnterpre— to Con- (Elanents, Know- tation, crate and Relations, ledge Extrapo— vice Organizing Unit Content (Recall) lation) versa) Principles) TOTAL 11. Organizing instruction, 1 1 1 3 classroom variables 12. Successful Classrooms 1 l 13. Characteristics of 3 l l 5 Teaching Effectiveness 14. Classroom Climate and 2 l l 4 Effective Teaching 15. Professional Teacher 1 2 3 Development 16. Sources for Teacher 1 l l 3 Improvement 17. Feedback and its uses 1 1 1 3 18. Creating. Maintaining and 1 l Restoring Appropriate classroom behaviors TOTAL 31 15 9 2 57 84 Table of specificstions UNIT IV: MOTIVATION AND MANAGEMENT Canne— hension Appli- (Trans- cation lath!» 0&stnufl' Nmuysis lnterpre— to Con- (Elements, Know- tation, crate and Relations, ledge Extrapo— vice Organizing Unit Centent (Recall) lation) verse) Principles) TOTAL 1. Definition of Behavior 1 1 Motivation 2. Need Theory: Murray 2 l 1 4 and Maslow 3. Cognitive Theory: Weiner 1 1 4. Intrinsic Theory: Hunt 1 l 2 S. Achievement Motivation 1 l 1 3 6. Attribution Theory 1 1 2 7. Task involvement and 1 l 2 achievement 8. Motivational Tasks 3 2 2 1 8 9. Cooperation and Competition 1 1 2 10. Creating desirable stu- 2 1 3 dent behavior 11. Maintaining Students on 1 l 2 Task 85 UNIT IV: MOTIVATION AND MANAGEMENT (cont.) Carpre— lumskm App”- (Trans— cation latkm, cuntnmfl Amahnfls lnterpre— to Con- (E lunents , Knme- taHon. «mmmeamd Rehflfiomm ledge Extrapo— vice Organizing Unit Content (Recall) lation) versa) Principles) TOTAL 12. Restoring students to 2 l 3 desirable behavior 13. Ideal Teacher Attitudes 3 2 l 6 and Behavior 14. Characteristics of a l 1 2 Favorable Teaching Environment 15. Leadership styles 2 1 3 16. Management Techniques 2 1 l 4 17. Classroom physical char- 1 1 acteristics and tasks 18. Public Law 94-142 1 1 l 3 19. Mainstreaming and special 2 1 3 problems 20. 
Individual Educational 1 1 Programs TOTAL 3O 17 8 l 56 86 gsble of spscificstions UNIT V: EVALUATION (Note: Unit V topics will be texted on the Final Examination along with a survey of the other Units.) Compre— hension Appli- (Trans- cation lation, (Abstract Analysis lnterpne— to Con- (Elements, Know- tation, crate and Relations, ledge Extrapo— vice Organizing Unit Content (Recall) lation) versa) Principles) TOTAL 1. Errors of Measurement 1 l 2. Test Reliability 1 l 2 3. Test Validity l 1 2 4. Norm referenced tests 1 l 2 5. Criterion referenced 1 l 6. Interpreting test results 1 1 2 7. Descriptive Statistics 3 l 4 8. Distribution Statistics 2 1 3 9. Normal Curve 3 1 1 5 10. Test Length and Content 1 l 2 Coverage 11. Instructional Objectives 1 1 2 and Test Construction UNIT V: 87 EVALUATION (cont.) Canne— lnmflon Amfli- (Trans- cation lation, (Abstract Analysis lnterpre— to Con— (Elanents, Know- tation, crate and Relations, lean» Exhwmo— vice (hgpnhflng Unit Content (Recall) lation) versa) Principles) TOTAL 12. Essay tests: construc— l 1 2 tion and scoring 13. Objective test items 1 2 3 14. Norm vs. criterion 1 1 2 measures 15. Rating scores 1 1 2 16. Function of grades 1 l 17. Types of grade schedules 1 1 l 3 18. Grades and Educational 1 l 2 Philosophy TOTAL 23 13 S 41 APPENDIX B QUESTIONNAIRE SUMMARY Test and Item Questionnaire Summary Percent Agree/ Strongly Agree Percent Disagree! Strongly Disagree The tests given in this 71% course provide a strong motivation for me to study. 29‘ Alternate-choice items 60% tend to be more diffi— cult than multiple— choice items. 40% Most of the alternate— 77% choice items test important points in the curriculum. 23% Many of the alternate- 59% choice items are ambiguous. 41% Alternate choice items 59% challenging and do a good job testing how much I know. 41‘ I would prefer taking 34% a test composed of alternate-choice items to a test composed of multiple choice items. 66‘ It is usually easier 45% to interpret and under- stand the question posed in alternate-choice items than in multiple-choice items. 55% 88 89 c. part alternate-choice 74% items and part multiple-choice items. Percent Percent Agree/ Disagree! Strongly Strongly Agree Disagree 8. Multiple-choice items 32% 68% tend to be more am- biguous than alternate— choice items. 9. In the future. unit exams should be composed of: a. all alternate-choice 5% items b. all multiple—choice 21% items B I BL IOGRAPHY BIBLIOGRAPHY Andrews. D. M. 6 Bird. C. (1938). Comparison of two new-type questions: Recall and recognition. Journal of Educationsl Psychology, gs. 175-193. Bensen. J. 6 Crocker. L. (1979). The effects of item format and reading ability on objective test performance: A question of validity. Educational snd Psychological Measurement. ss. 225-233. Budescu. D. U. 6 Nevo. B. (1985). Optimal number of options: An investigation on the assumption of proportionality. Journal of Espcationsl Measurement. pg (3). 183-196. Burmeister. M. A. 6 Olson. L. A. (1966). Comparison of item statistics for items in multiple-choice and in alternative-response forms. spience Education. 12. 457—470. Campbell. A. C. (1961). Some determinants of the difficulty of nonverbal classification items. Journal of Educationsl Psycholpgy. gs. 899-913. Campbell. D. T. & Stanley. J. C. (1963). Esperipental and quasi—esperimentsl design for research. Chicago: Rand McNally. Charles. J. W. (1926). A comparison of five typss of objective tests in elementsry psycholsgy. Unpublished doctoral disseration. University of Iowa. Iowa City. 
Choppin. B. H. 8 Purvis. A. C. (1969). Comparison of open-ended and multiple—choice items dealing with literary understanding. Research in the Tesching of English. s. 15-24. Cook. D. L. (1955). An investigation of three aspects of free—response and choice-type tests at the college level. Dissertation Apstracts Internstionsl. ss. (9-10A) 4310. Copeland. J. S. S Gilliland. A. R. (1943). Comparison of the validity and reliability of three types of objective examinations. Journal of Egpcationsl Psychology. ss. 242-246. 90 91 Costen. F.. (1970). The number of alternatives in multiple-choice achievement tests: Some empirical evidence for a mathematical proof. Educational and Psychological Measurspent. sp. 353-358. Ebel. R. L. (1971). The comparative effectivenessgof pspe-false and multiple-choice achievement test items. Paper presented at the American Educational Research Association Annual Meeting. New York City. Ebel. R. L. (1975). Can teacher write good true-false items? Journal of Edpsstionsl Measurement. gg. 31-35. Ebel. R. L. (1980). Some advantages of sgternste—choice items. Unpublished manuscript. College of Education. Michigan State University. East Lansing. MI. Ebel. R. L. (1982). Proposed solutions to two problems of test construction. Journal of Educational Measurement. gs (4). 267-277. Eurich. A. C. (1931). Four types of examinations compared. Journal of Educational Psychology. gg. 268-278. Feldt. L. S. (1980). A test of the hypothesis that Cronbach's alpha reliability coefficient is the same for two tests administered to the same sample. gsychopstrika. ss (1). 99-105. Frisbee. D. A. (1974). The effects of item format on reliability and validity: A study of multiple-choice and true-false tests. Educational and Psychological Measgrement. gs. 885—892. Frisbee. D. A. S Sweeney. D. D. (1982). The relative merits of multiple— true-false achievement tests. Journal of Egpcational Messurement. pg. 29-35. Ghiselli. E. C. (1964). Theory of psychological measurement. New York: McGraw-Hill. Green. K. (1984). Effects of item characteristics on multiple—choice item difficulty. Educational ans Psychological Measurement. gs. 551—562. Grier. J. B. (1975). The number of alternatives for optimum test reliability. Journal of Educationag Measurement. gg. 109—113. 92 Grosse. M. W. 8 Wright. B. D. (1985). Validity and reliability of true—false tests. Educational and Psychological Measurement. gs. 1—6. Hays. W. L. (1973). Statistiss for the social sciences. New York: Holt. Rinehart. and Winston. Heim. A. W. & Watts. K. P. (1967). Experiment on multiple—choice versus open—ended answering in a vocabulary test. British Journal of Educational Psychology. s1. 339-346. Hogben. D. (1975). The reliability. discrimination. and difficulty of work knowledge tests employing multiple-choice items containing three. four. or five alternatives. Asstralisn Journal of Education. gg. 63—68. Hughes. R. 8 Trimble. W. (1965). The use of complex alternatives in multiple-choice items. Educational snd psychological Measurement. gg. 117—126. Huck. S. W. (1978). Test performance under conditions of known item difficulty. Journal of Educational Measurement. gs. 117—126. Lord. F. M. (1957). A significance test for the hypothesis that two variables measure the same trait except for errors of measurement. Psychometrika. gg (3). 207-220. Lord. F. M. (1977). Optimal number of choices per item: A comparison of four approaches. Journal of Educational Messurepent. gg. 33-38. Maihoff. N. A. 8 Mehrens. W. A. (1985). 
A comparison of siternste-choice and true-false iteps forms used in classroom examinstions. Paper presented at the National Council on Measurement in education Annual Meeting. Chicago. McMorris. R. F.. Urbach. S. L.. 6 Connor. M. C. (1985). Effect of incorporating humour in test items. Journsl of Educationsl Measurement. gg (4). 127—134. Mehrens. W. A. S Lehmann. I. J. (1984). Measurement and evalgation in education snd psycholqu, New York: Holt. Rinehart S Winston. 93 Mendelson. M. A.. Hardin. J. H.. S Canady. S. D. (1980). The effects of format on the difficulty of multiple—completion test items. Harvard Educational Review. gg. 73-80. Millman. J. (1978). Determinants of itep difficplty: A preliminary investigatigp. CSE Reports No. 114 Ed 163 071. Nie. N. H.. Hull. C. H.. Jenkins. J. G.. Steinbrenner. K.. 8 Brent. D. (1975). Statisticsl package for the social sciences. New York: McGraw-Hill. Norusis. M. J. (1985). SPSS-x: Advances statistics guide. New York: McGraw-Hill. Oosterhof. A. C. S Coats. P. K. (1984). Comparison of difficulties and reliabilities of quantitative word problems in completion and multiple-choice item formats. Applied Psychological Measurspents. g (3). 287-294. Plake. B. S. & Huntley. R. M. (1984). Can grammatical clues result in invalid test items? Educational and gsychological Measurement. gg. 687-692. Ramos. R. A. 6 Stern. J. (1973). Item behavior associated with changes in the number of alternatives in multiple-choice items. gpprnal of Educational Measgrement. pp. 305-310. Rocklin. T. 8 Thompson. J. M. (1985). Interactive effects of test anxiety. test difficulty. and feedback. Journal of Educationsl Psychology. 11 (3). 368—372. Rosenfeld. P. 8 Anderson. D. (1985). The effects of humorous multiple—choice alternatives on test performance. Journal of Instgpctionsl Psycholggy. gg (1). 3-5. Rowley. G. L. (1974). Which examinees are most favored by the use of multiple choice tests? Journal of Educationsl Measurement. g; (1). 15-21. Ruch. G. M. S Stoddard. G. D. (1925). The comparative reliabilities of five types of objective examinations. Journal of Educational Psychology. gg. 89—103. 94 Shavelson. R. J. (1981). Statistical ressoningsfor the pehsvioral sciences. Boston: Allyn and Bacon. Inc. Straton. R. G. 8 Catts. R. M. (1980). A comparison of two. three. and four choice item tests given on total number of choices. Educational snd Psychological Measurepsnt. gs. 357-364. Toops. H. A. (1921). Trade tests in education. Teachers College Contribgtion to Education. New York: Teachers College. Columbia University. No. 115. Traub. R. E. 6 Fisher. C. W. (1977). On the equivalence of constructed-response and multiple-choice test. Applied Psychological Measurements. s. 355-369. Ward. W. W. (1982). A comparison of free-response and multiple—choice forms of verbal aptitude tests. Applied Psychological Measurements. g. 1-11. Watson. D. R. 8 Crawford. C. C. (1930). Four type of tests. High SchoolyTesgher. 1. 282—283. Wesman. A. G. (1971). Writing the test item. In R. L. Thorndike (Ed.) ggpcational measurement (PP. 81-129). Washington. DC: American Council on Education. Williams. B. J. & Ebel. R. L. (1957). The effects of varying the number of alternatives per item on multiple-choice vocabulary items. The Foprteenth Yearbook (pp. 63-65). Washington. DC: National Council on Measurements in Education. Wilson. V. L. (1982). Maximizing reliability in multiple-choice questions. Educational and Psychological Measurement. gg. 69-72. Winne. P. H. & Belfry. J. M. (1982). 
Interpretive problems when correcting for attenuation. Journal of Educationsl Measurement. gs (2). 125-133. general References Bloom. B. S. (1956). gsxonopy of educational objectives; handbook I: Cognitive domain. New York: Longmans. Green and Co. 95 Campbell. D. T. 8 Stanley. J. C. (1963). ggperipental and quasi—experimental designs for research. Chicago: Rand McNally. Ebel. R. L. (1979). Essentials of edpcational measpgement. Englewood Cliffs. NJ: Prentice-Hall. Hopkins. K. D. & Stanley. J. C. (1981). Educational and psychological messurepent and evaluation (6th ed.). Englewood Cliffs. NJ: Prentice-Hall. Lee. W. (1975). gyperipentsl design and analysis. San Francisco: Freeman and Company. Lord. F. M. 6 Novick. M. R. (1968). statisticsl theories of mental test scores. Reading. MA: Addisson Wesley. Thorndike. R. L. (1971). Educational measgrement (2nd ed.). Washington. DC: American Council on Education. Winer. B. J. (1971). Statisticsl principles in experiemental desig . New York: McGraw—Hill.