LIBRARY
Michigan State University

This is to certify that the thesis entitled

COMPARATIVE RELIABILITIES AND VALIDITIES OF
TRUE-FALSE AND MULTIPLE CHOICE TESTS

presented by

David A. Frisbie

has been accepted towards fulfillment of the requirements for the Ph.D. degree in Education.

Major professor

Date: July 20, 1971

ABSTRACT

COMPARATIVE RELIABILITIES AND VALIDITIES OF
TRUE-FALSE AND MULTIPLE CHOICE TESTS

BY

David A. Frisbie

This study was designed to compare the reliabilities of true-false and multiple choice tests and to determine the concurrent validities of true-false tests. Four major questions were formulated as research hypotheses:

1. Are true-false and multiple choice achievement tests that were designed to measure the same objectives equally reliable?

2. Are true-false tests constructed by the judgmental method as reliable as those developed by the discrimination method?

3. What is the ratio of the number of true-false items attempted to the number of multiple choice items attempted by a group of examinees in a given period of testing?

4. Is there a perfect correlation (+1.00) between true-false and multiple choice test scores when the correlation coefficients are corrected for attenuation?

Two methods were devised for systematically changing multiple choice items to true-false form. The judgmental method involved the use of teachers to choose the multiple choice distractor that would result in the most plausible false statement. The discrimination method relied on item analysis data from a multiple choice testing to identify the distractor that best discriminated between high and low scorers on the test. The first of three phases of testing was needed to gather the item analysis data.

The true-false items generated by the two conversion methods were tried out in the second phase of testing. The revised true-false items were incorporated in the eight test forms used in phase three, the final testing. Each of the 70-item final forms consisted of 35 multiple choice and 35 true-false items. A sample of 1018 non-urban high school students each responded to one of eight test forms. The three factors that differentiated the forms were:

1. Subject matter (natural science or social studies)
2. Method of conversion (judgmental or discrimination)

3. Subtest order (true-false first or last)

Kuder-Richardson Formula 20 reliability coefficients were calculated for the eight multiple choice and eight true-false subtests. The ratio of the number of true-false to multiple choice items that subjects attempted in the first eight minutes of testing was also computed. The correlation between multiple choice and true-false subtest scores was calculated and corrected for attenuation for each final test form. Statistical tests were performed to determine if the 16 reliability coefficients were homogeneous and to ascertain if the corrected correlation coefficients significantly departed from unity.

The results associated with each of the major questions of interest were:

1. The reliabilities of the multiple choice tests were significantly greater than the reliabilities of the true-false tests.

2. There was no significant difference between the reliabilities of the true-false tests constructed by the judgmental or discrimination methods.

3. Examinees responded to three true-false items for every pair of multiple choice items attempted.

4. The corrected correlation coefficients for six of the eight final forms were significantly less than unity.

COMPARATIVE RELIABILITIES AND VALIDITIES OF
TRUE-FALSE AND MULTIPLE CHOICE TESTS

BY

David A. Frisbie

A THESIS

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of

DOCTOR OF PHILOSOPHY

Department of Counseling, Personnel Services, and Educational Psychology

1971

ACKNOWLEDGMENTS

Sincere thanks are extended to Dr. Robert L. Ebel, chairman of my Guidance Committee, for his advice, counsel, and friendship throughout my doctoral program. The contributions of each of my committee members--Dr. Robert C. Craig, Dr. Laurine E. Fitzgerald, Dr. William A. Mehrens, and Dr. Willard G. Warrington--are gratefully acknowledged.

Several individuals made worthy contributions to this research effort. I wish to thank

--the Office of Evaluation Services staff for providing prompt scoring and item analysis services.

--the high school principals and teachers who permitted me to use their classes and who gave me such excellent cooperation in the data collection phase.

--my friends and colleagues who gave their insightful suggestions for improving the study in its early stages.

There is no way I can adequately express my gratitude to my wife, Janet, and my children, Marcy and Scott, for the support they have given me during my graduate studies. Their efforts represent the most important contribution to my successful completion of a doctoral program.

The financial support of the U.S. Office of Education through a Research Director Training Program fellowship enabled the author to complete his doctoral studies at Michigan State University.

TABLE OF CONTENTS

LIST OF TABLES
LIST OF APPENDICES

Chapter

I. THE PROBLEM
   Alternative Forms of Test Items
   Advantages and Limitations of True-False Items
   Need for This Study
   Purpose of This Study
   Hypotheses
   Definition of Terms
   Overview

II. REVIEW OF THE LITERATURE
   Introduction
   General Studies Comparing Item Forms
   Validity and Reliability Studies
   Studies Using Item Conversion Procedures
   Studies Comparing Amount of Testing Time

III. DESIGN AND PROCEDURES
   Introduction
   Sample
   Instrumentation
   Item Conversion Procedures
      Judgmental Method
      Discrimination Method
      True-False Try-Out
   Design
   Hypotheses
   Analysis
   Summary

IV. RESULTS
   Introduction
   Results Concerning Amount of Testing Time
   Results Concerning Test Reliability
      Hypothesis One
      Hypothesis Two
   Results Concerning Concurrent Validity
   Summary

V. SUMMARY AND CONCLUSIONS
   Summary
   Conclusions
   Discussion
   Limitations of the Study
   Suggestions for Future Research

APPENDICES

BIBLIOGRAPHY

LIST OF TABLES

2.1. Summary of reliabilities and validities from Charles' study
2.2. Amount of time required to respond to an equal number of true-false and multiple choice items
3.1. Description of sample used in phase III
3.2. Description of the sample used in phase I
3.3. Item analysis data used in the discrimination method
3.4. Description of subjects used in phase II
3.5. Arrangement of test forms used in phase III
4.1. Median number of items attempted in the first eight minutes of testing for each final test form
4.2. Kuder-Richardson Formula 20 reliability coefficients for final subtest forms
4.3. Computations for testing hypothesis one
4.4. Correlation coefficients for multiple choice and true-false subtest scores on each final form
4.5. Computations and results of tests for hypothesis four
5.1. Reliability and concurrent validity coefficients for final subtest forms

LIST OF APPENDICES

A. Description of Schools Participating in This Study
B. Distractor Judgment Task
C. Computations for Testing Hypothesis Two
D. Number of Items Attempted in the First Eight Minutes of Testing
E. Means and Variances for the Subtests of the Final Test Forms

CHAPTER I

THE PROBLEM

Alternative Forms of Test Items

Test construction specialists and teachers who develop instruments to measure educational achievement have a wide variety of item forms at their disposal for accomplishing their specific objectives. Essay, multiple choice, true-false, matching, completion, simple recall or short answer, and novel combinations of these forms are usually mentioned in measurement textbooks as appropriate item forms for teachers to use. Presently the multiple choice item is "the most highly regarded and widely used form of objective test item" (Ebel, 1965, p. 149).

Multiple choice items have been recommended for achievement testing because of their apparent versatility and adaptability to the measurement of outcomes requiring mental processes beyond mere recall (Durost and Prescott, 1962; Brown, 1970; Ahmann and Glock, 1967). Thorndike and Hagen (1969, p. 102) wrote:

   [The multiple choice item] can be used to appraise the achievement of any of the educational objectives that can be measured by a paper-and-pencil test except those relating to skill in written expression and originality.

Advantages and Limitations of True-False Items

Many authors who discuss the use of various item forms suggest that the advantages of true-false items are outweighed by their limitations (Ahmann and Glock, 1967; Gronlund, 1965; Brown, 1970). One of the major shortcomings attributed to true-false items is the difficulty encountered in preparing good statements that measure more than simple factual information. Wesman (1971) noted that many of the important objectives of instruction require generalizations, explanations, evaluations, and inferences. He concluded that since these outcomes often cannot be expressed in statements that are precisely or universally true, they cannot be tested with true-false items.

There is substantial agreement among the authors cited above that true-false items are appropriate only for measuring knowledge of unambiguous factual material, and that multiple choice items can be used to measure a variety of outcomes. However, empirical evidence has not been presented to support this viewpoint. A search of the literature failed to turn up any data that would support or challenge the statements referred to above.

Though true-false items do have some limitations, they also have unique advantages in measuring educational achievement. True-false items are more efficient than multiple choice items of comparable quality: examinees can respond to more true-false items than multiple choice items in a given time period. This greater efficiency may lead to a more reliable test. A true-false test can probably provide a broader sampling of the examinee's knowledge than can a multiple choice test intended to measure the same subject matter. Since more true-false items can be used in a given time period, these items should sample the universe of content more thoroughly than a multiple choice test.

Need for This Study

The arguments for and against either multiple choice or true-false items for measuring achievement can be found in many textbooks and journal articles, but little empirical evidence is available to substantiate the viewpoints expressed. The most equitable means of comparing the two item forms is through empirical study.

Purpose of This Study

The purpose of this study was to compare the reliabilities and validities of true-false and multiple choice tests that were written to measure understanding of concepts and relationships in the same content areas.

Two systematic procedures for changing multiple choice items to true-false form were used so that corresponding items of the two types would be as equivalent as possible in content. These procedures also made the conversion processes more objective and reproducible. The two conversion methods, judgmental and discrimination, were compared to determine if one yielded a more reliable or valid test than the other. (These conversion methods are defined later in this chapter and are described in detail in Chapter III.)
Data were collected to determine the number of items of each type that examinees attempted in a fixed period of time. This information was used to determine the theoretical length of the true-false tests. The reliabilities of the lengthened tests were compared to the reliabilities of the multiple choice tests with testing time held constant. It was necessary to collect these data because the data in the literature were either not recent or not empirically based.

Finally, this study was devised to compare items that were written to measure achievement in natural science and in social studies. Two content areas were used so that effects due to particular subject matter could be examined and so that the findings could be generalized somewhat.

Hypotheses

The major hypotheses in this study were:

1. When multiple choice items are converted to true-false form, the reliabilities of the two test forms are not different.

2. When multiple choice items are converted to true-false form using the judgmental and discrimination methods, the reliabilities of the true-false tests are not different.

3. More true-false than multiple choice items are attempted by the same group of examinees in a fixed time period.

4. The simple correlation between individuals' true-false and multiple choice test scores is 1.00 when corrected for attenuation.

Definition of Terms

Test reliability is defined as a measure of internal consistency estimated by the Kuder-Richardson Formula 20.

The two methods of converting multiple choice items to true-false format are defined as judgmental and discrimination. The judgmental method employs subject matter experts to choose the multiple choice incorrect response that would result in the most plausible false statement when used with the stem. The discrimination method relies on item analysis data from a multiple choice testing to identify the multiple choice incorrect response that best discriminates between low and high scorers on the test.

Simple correlation is defined by the Pearson product moment correlation formula found in most elementary statistics books (see Chapter III).

Correction for attenuation is defined as the procedure used to estimate the true correlation between two variables represented by unreliable measures. (The formula used for this procedure appears in Chapter III.)

Overview

In Chapter II the literature relevant to the general problem and each of the specific hypotheses is reviewed. The design of the study, the sample, the instrumentation, and the methods of analysis are discussed in Chapter III. Chapter IV, the results of the study, is followed by a final chapter that contains a summary of the study, a discussion of the findings, the limitations of the study, and suggestions for future research.

CHAPTER II

REVIEW OF THE LITERATURE

Introduction

An abundance of research concerned with comparing various item forms has been reported in the literature. Many studies were conducted during the late 1920's, when objective test items made their initial surge in popularity. The first section of this chapter includes investigations that deal with the comparison of either multiple choice or true-false items and some other item type. Subsequent sections each contain reviews of studies that are related to each of the hypotheses enumerated in the previous chapter.

Section two provides an examination of the research that has focused on the comparative reliability and validity of both true-false and multiple choice items. The third section of this chapter deals with studies that have employed methods of converting items from one form to another. The final section contains a review of research reports that provide information concerning the difference in testing time required for tests composed of different item forms. Brief summaries have been provided at the close of each section of this chapter in lieu of a general chapter summary.

General Studies Comparing Item Forms

Several studies were reported in which selected test characteristics were compared. Heim and Watts (1967) compared multiple choice and completion vocabulary items and found that the open-ended items were significantly more difficult. A similar study by Andrews and Bird (1938) in psychology yielded the same results. In addition, they found higher odd-even reliabilities for completion items. Choppin and Purves (1969) concluded in their validity study that multiple choice and open-ended literature items measured the same thing.

Cronbach (1941) used multiple multiple choice items (more than one correct alternative per item) and multiple true-false items (each multiple choice alternative was marked true or false) based on introductory psychology content. No differences were found between test forms with regard to testing time, difficulty, reliability, and validity.

Validity and Reliability Studies

In what could be considered the pioneer study of this general problem, Toops (1921) compared the reliabilities of 50-item general information tests, each cast into recall, multiple choice, and true-false form. Each subject took each test, but the order of administration was varied so that six groups ranging in size from 39 to 10 were used. The corrected split-halves reliabilities reported were .556 and .507 for multiple choice and true-false tests, respectively. When testing time was held constant, these reliabilities were estimated to be .607 and .664.

Completion, multiple choice, and true-false test reliabilities and validities were compared by Rutledge (1926) in his dissertation. Three forms of an elementary psychology midterm exam were constructed so that each form contained 40 test items of each of the three types. The examination formats were arranged so that the first 40 items of form A were true-false, the first 40 items of form B were multiple choice, and the first 40 items of form C were completion. The following items illustrate a typical test item in the three formats:

   A beginning was made in the study of scientific psychology in the _____ century.

   The study of scientific psychology began in the 18th century.

   A beginning was made in the study of scientific psychology in the (1) 17th, (2) 18th, (3) 19th, (4) 20th, century.

Corrected split-halves reliability coefficients averaged .70 and .89 for multiple choice and true-false items, respectively, across test forms. Nine minutes of testing time were allotted for each of the two forms; therefore, no adjustment of reliability coefficients was made. The average correlation between multiple choice and true-false subtest scores across test forms was .64, corrected for attenuation.

In a similar study by Charles (1926), the reliabilities of five-, three-, and two-response multiple choice tests and a true-false test were compared with a completion test reliability. Fifty factual information items from introductory psychology were administered to each subject in completion form, followed by 50 items of one of the other item forms. The results of Charles' study are summarized in Table 2.1. No statistical tests were made, but one could conclude that there is little practical difference between the reliabilities of true-false and five-response or three-response multiple choice tests. (Charles offered no good explanation for the unusual performance of the two-response multiple choice tests.)

Table 2.1. Summary of reliabilities and validities from Charles' study.

Item Form     rtt     rnn(a)   rcf(b)
Completion    .603    .752     .603
5-R           .680    .809     .714
3-R           .624    .768     .703
2-R           .477    .646     .639
True-false    .602    .751     .680

(a) Corrected split-halves reliabilities.
(b) Average correlation between the completion test score and the score from each of the other item forms.

A study carried out by Ruch and Stoddard (1925) employed a design identical to Charles' and items intended to measure history and social science general information. The reliabilities for the five-, three-, and two-response multiple choice tests and the true-false test were .886, .748, .849, and .714, respectively, for 100-item tests. The reliabilities were recalculated to equate testing time, and the new values were estimated to be .901, .806, .902, and .820.

Reliability studies reported by Watson and Crawford (1930), Copeland and Gilliland (1943), and Eurich (1931) yielded conflicting results. Watson and Crawford estimated reliabilities favoring multiple choice items on high school physics unit tests. Copeland and Gilliland corrected their reliability coefficients to hold testing time constant and found a higher reliability for the 20-item true-false child psychology test. Two experiments were reported by Eurich in which educational psychology test items were used. The multiple choice test reliability was higher in one trial and the same as the true-false test reliability in the other trial.

An item conversion method was employed by Burmeister and Olson (1966) to aid in determining whether college-level natural science true-false items could be written that had the same desirable characteristics as the multiple choice form. The authors concluded that true-false items could be constructed that discriminate "almost as well as" multiple choice items, and that true-false items were less difficult because of the guessing effect.

Ebel (1971) used a 90-item natural science multiple choice test as a basis for studying the validity and reliability of true-false items. Two forms, each containing 44 items of each item type, were administered to groups of 53 and 50 students in an education course. The mean discrimination indices tended to be higher for the multiple choice tests. The Kuder-Richardson Formula 20 reliabilities for the multiple choice and true-false subtests were .81 and .84, respectively, for form one and .86 and .71 for form two. (The true-false reliabilities were estimated by the Spearman-Brown formula for a double-length test, under the assumption that two true-false items can be attempted for every multiple choice item attempted.) The correlations
The findings of the research reviewed in this sec— tion are far from conclusive. There is no overwhelming evidence to suggest that multiple choice and true-false tests are equally reliable or that one form is superior on this count. Only two of the nine studies reported on the comparative validities of the two item forms. The find— ings of Charles (1926) and Ebel (1971) lend support to the hypothesis that multiple choice and true—false tests measure the same thing. Except for the studies by Ebel (1971) and Bur— meister and Olson (1966) the research cited here is not recent. There is reason to believe that the nature of objective tests has changed since many of these studies were conducted. Multiple choice tests, in general, prob- ably consist of fewer factual information items than tests constructed in the 1920's. There appears to be a trend 14 today toward measuring individuals' understandings of concepts and relationships and ability to apply or gen- eralize from learned propositions. This study was devised in an attempt to answer some of the same questions that were asked when objective items first became widely used. Studies Using Item Conversion Procedures Only a small number of the studies designed to compare item forms have specified the methods used for constructing items on the same content. Some writers, such as Eurich (1931), gave incomplete details of their procedures. He wrote essay items to cover midterm course material and then developed an acceptable response for each item. The statements in these responses were used to generate stems for completion items. There is no de— scription provided, however, for methods used to formulate true-false or multiple choice items. The item conversion methods used in this study (as defined in Chapter I) were used by Owens, Hanna, and COppedge (1970) in a study devised to compare completion and multiple choice geometry items. The judgmental (J), frequency (F), and discrimination (D) methods were em— ployed to select multiple choice distractors based on responses to completion items. Form J was constructed by using 13 secondary mathematics teachers who chose the 15 three most plausible distractors that appeared as errors in responses to the completion form. Distractors for the final form were selected from those most frequently chosen by the 13 judges. Form F was constructed using the most frequently occurring errors from the completion tryout. Examinee errors from the completion form were used to select the distractors that best discriminated between high and low scorers on the test to build form D. The three 17-item final tests utilized 51 unique distractors of which 13 were common to forms J and F, two overlapped in forms J and D, and 21 were identical in forms F and D. The authors concluded that the three methods were equally valid and reliable for choosing multiple choice distractors. They suggested that the study be replicated with test con— tent as an independent variable. Loree (1948) used the judgmental and frequency methods for selecting distractors in his study of the characteristics of multiple choice items. The validities and reliabilities of the two multiple choice forms were not significantly different. The frequency method was used by Burmeister and Olson (1966) in a study cited previously. Multiple choice items were converted to true statements if the incorrect options were equally attractive in the tryout. If one distractor was most frequently chosen the item was changed to a false statement. 
16 Ebel (1971) and Williams and Ebel (1957) employed a discrimination procedure in their item conversion proc- esses. Ebel changed each multiple choice item to a pair of true-false items (one true and one false item) and compiled two true-false test forms. These forms were ad- ministered to a group of subjects and the most discrimin- ating item of the original pair was retained for inclusion in the final true—false and multiple choice composite test. Williams and Ebel (1957) studied the effect on internal consistency reliability of varying the number of response alternatives in multiple choice tests. Item analysis data on 150 four-choice items were utilized to form three—choice and two-choice items. The least dis- criminating distractors were eliminated from the original test items. The findings revealed no significant differ- ences in reliability on the three forms. The judgmental and discrimination methods have been used as procedures for selecting multiple Choice distractors. The two methods have not been used for con— verting from multiple choice to true-false form in the studies cited in this section. Although Ebel (1971) used a discrimination procedure for selecting true-false items, he did not employ a systematic and replicable procedure for converting multiple choice items to true—false form. 17 Studies Comparing Amount of Testing Time Few recent studies dealing with the comparison of item forms have focused on the amount of testing time re- quired for each form. Two of the studies reviewed here do not supply empirical evidence to support their conclusions. Williams and Ebel (1957) stated that subjects finished faster as the number of response alternatives diminished, but they did not indicate how much faster. In another study it was assumed that subjects typically attempt two true-false items for every multiple choice item tried (Ebel, 1971). More dated studies by Toops (1921), Watson and Crawford (1930), and Copeland and Gilliland (1943) demon- strated agreement in their findings that three true-false items can be tried for every two multiple choice items attempted. Two other studies (Charles, 1926; Ruch and Stoddard, 1925) reported on testing time for true-false items and multiple choice items that varied in the number of re- sponse alternatives available. The ratio of testing time for an equal number of true-false and multiple choice items can be calculated from the data presented in Table 2.2. The ratios in Charles' study for the five—response and three-response forms are 1.4 and 1.2, respectively. The corresponding ratios from the Ruch and Stoddard study 18 are 1.6 and 1.3. These results coincide with the findings of the previously cited studies. Table 2.2. Amount of time required to respond to an equal number of true-false and multiple choice items. Time in Minutes Item Form Study 1a Study 2b 5-R 25.5 8.0 3—R 21.5 6.8 2—R 19.6 5.7 True-false 18.3 5.1 aFrom Charles (1926). bFrom Ruch and Stoddard (1925). There is some agreement in the research reported here that 1.5 true—false items can be attempted in the time required to respond to one multiple choice item. Provision was made in this study to collect data on the number of items attempted by examinees in a fixed period of time because recent empirical evidence was lacking. 
CHAPTER III DESIGN AND PROCEDURES Introduction This research study was designed to examine the reliabilities, concurrent validity (correlation between true—false and multiple choice subtest scores), and the amount of testing time required for subjects to respond to true-false and multiple choice social studies and natural science achievement tests. Two methods of converting multiple choice items to true-false form, judgmental and discrimination, were compared to determine if one yielded more reliable true—false test scores than the other. Sample The subjects that participated in this study were selected from classrooms in six public high schools lo- cated in South-Central Michigan. Schools and classrooms were selected on a voluntary basis; there was no random sampling procedure employed for determining the study sample. The goal of the sample selection scheme used in this study was to identify schools in four types of com— munities and to choose at least one school from each 19 20 strata for inclusion in the study. The four community types were defined by the Michigan Department of Education (1970) as city, town, urban fringe, and rural. The high school students that took part in this study probably represent a crossection of non-urban high school students in science and social studies achievement levels. The schools from which they were drawn are de- scribed briefly in Appendix A. Three phases of testing were required for instru— ment development and data collection. Phase I involved gathering item analysis data for an item conversion-method and phase II was used to try-out the true—false items. The subjects that participated in phase III, the final testing, are described in Table 3.1. A total of 509 stu- dents responded to the social studies tests and 509 stu- dents responded to the natural science tests. A minimum of 125 students attempted each of the eight test forms that were administered in the final phase of testing. Instrumentation The multiple choice items that were employed in this study appeared in a widely used battery of achievement tests.1 The social studies items were written to measure 1Permission to use these items for this research was obtained from the publisher. The publisher requested that the source of the items not be identified. The test items used for illustrative purposes in this thesis are copyrighted and may not be reproduced. 21 Table 3.1. Description of sample used in phase III. Form Totals' School Grade Social Studies Natural Science A 9 25 10 76 11 46 19 12 45 16 B 9 36 10 34 67 ll 26 12 69 C 9 23 10 23 74 ll 32 24 12 45 D 9 23 10 12 11 19 12 43 E 9 10 ll 67 12 43 F 9 73 10 11 12 58 22 knowledge and understanding of contemporary social insti- tutions and practices. The following items are typical of those used in the test: 1. When was the United States Constitution written? (1) Immediately after the French and Indian War. (2) During the early years of the Revolutionary War. *(3) Shortly after the Revolutionary War. (4) During the Reconstruction period which fol— lowed the Civil War. 2. In the absence of government controls, what or- dinarily happens to the price of goods if the supply increases and the demand remains unchanged? (1) The price increases. *(2) The price decreases. (3) The price remains about the same. (4) The price changes rapidly. The natural science items were intended to measure general knowledge and understanding of scientific terms' and principles. The following items are representative of those used in the test: 3. 
What is the chief use of the cyclotron? (1) To change lead into gold. (2) To generate electricity from steam. *(3) To get high speed particles for atomic research. (4) To mix the essential ingredients of the atomic bomb. 23 Is it more dangerous to prick oneself with a pin than with a needle? Why?' (1) Yes. Because pins are usually made of brass, which is poisonous to human flesh. (2) Yes. Because bacteria are more likely to be present on a pin than on a needle. *(3) No. The two are about equally dangerous. (4) No. The needle is much more dangerous for it is likely to have traces of rust on it. The items from the achievement battery were se- lected for use in this study and were deemed apprOpriate for the study sample because: 1. 2. The items were expertly written and were tried—out and revised with extreme care by the authors. A classification of the items by subject matter suggested that the items covered objectives re- flecting the current high school science and social studies curricula. (This notion was confirmed by the secondary teachers that reviewed the test con— tent during the three phases of testing.) The reported reliabilities of the social studies and natural science tests were in excess of .90. The tests had demonstrated high reliability in the past. The test items were intended by their authors to be used for measuring achievement in grades 9-13. The tests were suited for a broad range of 24 achievement and concern about a low ceiling effect could be reduced. Item Conversion Procedures The judgmental and discrimination methods defined in Chapter I were used to convert multiple choice items in each of the 70-item achievement tests to true-false state- ments. The two methods will be described below. Judgmental Method Five secondary science and social studies teachers were asked to judge the quality of the multiple choice distractors from the test items in their respective areas of expertise. They were directed to select the distractor for each item that appeared to be most plausible for making a false statement with the original stem. The specific directions given to the judges appear in Appendix B. The reSponses of the judges were tabulated and a decision was made to use the correct response or one dis- tractor to make a true or false statement. If at least four of the five judges agreed on a best distractor, it was used to make a false statement. If the judges failed to agree on one best distractor, the correct response was used to make a true statement. The use of this method resulted in 41 false state— ments and 29 true statements in social studies. There was 25 consensus among the judges on their choices for 12 false statements and four of the five agreed on their choices for the other 29 false statements. Item one from the examples listed previously in this chapter was converted to: The United States Constitution was written during the early years of the Revolutionary War. The judges unanimously agreed that response alternative two was the most plausible. There were 45 false statements and 25 true state- ments written in natural science. Four of the five judges agreed on their choices for 40 false statements and all were in accord on only five false statements. Item three from the examples listed in this chapter was changed to: The chief use of the cyclotron is to mix the essential ingredients of the atomic bomb. Four of five judges thought choice four was the most plausible. 
The two true-false tests developed by the judg- mental method were labeled form SJ (social studies) and form NJ (natural science). Discrimination Method The original 70—item multiple choice tests were labeled form SM (social studies) and form NM (natural science). Forms SM and NM were each administered to a 26 minimum of 100 subjects in classrooms from schools that appeared in the final sample. Table 3.2 describes-the 103 students that reSponded to form SM and the 101 students that took form NM in phase I of testing. Answer sheets were scored and responses were.put on magnetic tape using the OpScan system of the Office of Evaluation Services at. Michigan State University. A computer program developed by the Office of Evaluation Services was used to generate item analysis data from phase I of testing. Table 3.2. Description of the sample used in phase I. Form School Grade SM NM A 9 30 10 ll 25 12 30 B 9 19 10 11 48 18 12 7 C 9 10 27 11 12 27 Kuder-Richardson Formula 20 reliabilities for forms SM and NM were .905 and .918, respectively. A decision was made to change each item to a true or false statement depending on the value of the discrimi— nation index of each distractor. A form of the Upper— Lower Index, known frequently by D, was used as a dis- crimination index. In this case, the proportion in the upper group that responded to each distractor was sub- tracted from the proportion in the lower group that re- sponded to each distractor. The foil with the largest lower-upper difference was used to make a false statement. If the indices for an item did not differ by more than .09 or if the largest index was less than .20, the item was converted to a true statement. The 70-item true—false social studies test, labeled form SD, contained 33 true statements and 37 false state- ments. Item two from the examples listed preViously was changed to this true statement: In the absence of government controls, if the supply of goods increases and the demand remains unchanged, the price of the goods decreases. The item analysis data that were used to convert this item are given in Table 3.3. The lower—upper indices for distractors l, 3, and 4 were .18, .14, and .21, re- spectively. Though distractor four had an index in excess of .19, it did not satisfy the second criterion and was, therefore, not used to make a false statement. 28 Table 3.3. Item analysis data used in the discrimination method. Response Alternatives l *2 3 4 Omit Total upper 0 28 0‘ 0 0 28 27% 0% 100% 0% 0% 0% 100% middle 1 41 4 2 0 48 46% 2% 85% 8% 4% 0% 99% lower 5 ll 4 6 2 28 27% 18% 39% 14% 21% 7% 99% The 70-item true-false natural science test, lab— eled form ND, consisted of 33 true statements and 37 false statements. Item four from the examples was recast as this false statement: It is more dangerous to prick oneself with a pin than with a needle because bacteria are more likely to be present on a pin than on a needle. Four true-false tests (forms SJ, SD, NJ, and ND) were generated with the two item conversion methods. There were 30 common items to the 70—item forms, SJ and SD, and 21 items common to the 70-item forms, NJ and ND. True—False Try-Out The four true-false test forms were each adminis- tered to a group of 50 students in phase II of testing. The purpose of phase II was to attempt to identify poor or ambiguous items. 
The rationale for including this step in 29 the instrument development sequence was that the original multiple choice items sustained extensive study and revi- sion before they were incorporated into the final form. _Participants in phase II are described in Table 3.4. The four schools included in this phase of testing were also involved in the final phase of testing. Test scoring and item analysis services were fur- nished by the Office of Evaluation Services. Kuder— Richardson Formula 20 reliabilities for forms SJ, SD, NJ, and ND were .764, .799, .798, and .764, respectively. Item difficulty and discrimination indices were examined for all items in the four true-false tests. One item common to forms NJ and ND was reworded because it was a negative discriminator.. wa items from form SJ and three items from form SD were reworded for the same reason. These revised true-false forms were used to compile the final forms for phase III of testing. Design This study was designed with five major principles in mind for controlling extraneous factors that had the potential for introducing error. 1. No student responded to more than one test in social studies or natural science across the three phases of testing. 2. The four test forms in each subject matter area were randomly distributed to subjects within Table 3.4. Description of subjects used in phase II. 30 School Grade Form SJ SD NJ ND A 10 11 12 10 ll 12 10 11 12 10 11 12 21 21 21 21 12 16 16 12 16 19 31 classrooms in an attempt to control for differen- tial abilities and achievement levels of class- rooms. 3. The final test forms were arranged in two different orders to control effects that could occur due to one item form continuously preceding the other. This arrangement was also conducive to gathering data on the number of items attempted. 4. In the final phase each subject received both a multiple choice and a true—false subtest score so that individual scores could be correlated. 5. One individual, experienced in test administration procedures, administered all tests in the three phases of testing with standardized directions designed for the separate phases. The final test forms were arranged so that each subject responded to both multiple choice items and true- false items converted by one of the two methods. The eight 70-item final forms are depicted in Table 3.5. Different orderings of the item subtests within forms are designated by A and B, S refers to social studies, N refers to natural science, and J and D represent judgmental and discrimination, the two item conversion methods. Form SJA, for example, consisted of items l—35_of the original multiple choice form, SM, and items 36-70 of form SJ, social studies items converted by the judgmental method. Form SJB was comprised of items 1—35 of form SJ and items 36—70 of form SM. The other six forms were arranged in a similar manner. The final test forms were administered in high school social studies and science classrooms. The four 32 forms within each subject matter area were randomly dis- tributed to subjects within classrooms so that approxi- mately the same number of each form was used in each classroom. Standardized directions and procedures were used by the same test administrator in all phases of test- ing in this study. Table 3.5. Arrangement of test forms used in phase III. 
Test Form Subtest Order SJA MC TF SJB TF MC SDA MC TF SDB TF MC NJA MC TF NJB TF MC NDA MC TF NDB TF MC Subjects were timed with a stopwatch to supply information regarding the number of items of each item type that were attempted in a fixed period of time. Sub— jects were asked to stop working after ten minutes and were then asked to circle in their test booklet the number of the item that they were currently working on. A pre— liminary examination of this data showed that ten minutes enabled many students to respond to more than 35 items. The time period was subsequently reduced to eight minutes and data were collected from 967 subjects. 33 Hypotheses The research hypotheses that were examined in this study were: 1. When multiple choice items in social studies and natural science are converted to true-false form, the Kuder-Richardson Formula 20 reliabilities of the two test forms are not different, regardless of the subject matter. 2. When multiple choice items in social studies and natural science are converted to true-false form using the judgmental and discrimination methods, the Kuder-Richardson Formula 20 reliabilities of the true-false tests are not different, regardless of the subject matter. - - i 3. A group of examinees can attempt more true—false than multiple choice items in eight minutes of testing time. 4. The simple correlation between individual's true- false and multiple choice test scores is 1.00, corrected for attenuation. Analysis The Kuder-Richardson Formula 20 reliability co- efficient was computed for each of the two subtests in each of the eight final forms. The true—false subtest 34 reliabilities were adjusted for a lengthened test using the data gathered regarding the number of items of each form that subjects responded to in an eight-minute period. The ratio of the number of true-false items attempted to the number of multiple choice items attempted was substi— tuted in the Spearman-Brown Prophecy Formula for this purpose. The formulas for the Kuder-Richardson Formula 20 (Equation 3.1) and the Spearman—Brown formula (Equation 3.2) are given by Ebel (1965) _ k z r — fl [1 — Egg] (3.1) where r is reliability, k is the number of items in the test, qu is the sum of the item variances, and 02 is the test variance. nr 8 r (3.2) n l + (n-l)rS where rn is the reliability of the lengthened test, n is the number of times the original test was lengthened, and r8 is the reliability of the shorter test. A test statistic known as the paired t test was employed to test the difference between multiple choice and true-false test reliabilities. The statistic, t, was defined as: (3.3) \m El 35 (3.4) and rim and ri are the reliability coefficients for the t ith pair of subtests from the eight final test forms; and Sd’ the standard error of the differences, is i _ (3.5) and di is defined as the difference between the multiple choice and true-false reliabilities for each i pair. d. = r. - r. (3.6) An hypothesis that the multiple choice and true- false test reliabilities are not different is tested against the alternative hypothesis that the null hypothesis is false. The test statistic is referred to Student's t-distribution with n—l degrees of freedom. Values of the test statistic large in absolute value cause the null hypothesis to be rejected. The alpha level used in all statistical tests in this study was 0.05. The use of the test statistic, t, depends on the assumption that the sample statistic being tested is 36 normally-distributed. 
Though this assumption was not strictly met by the data in this study, the large sample sizes probably overcome that limitation. Frequency distributions indicating the number of items subjects responded to in eight minutes were con- structed based on the A and B test forms. The ratio of the medians of the two distributions was used as evidence for supporting or rejecting the third research hypothesis. A Pearson product-moment correlation coefficient was computed between individuals' multiple choice and true-false subtest scores on each of the eight forms. The correlation coefficients were adjusted for unrelia- bility in the measurement of the two variables by the correction for attenuation formula given by Ghiselli (1964, p. 268): r r = XY (3.7) r Vr xx yy where roooo is the true correlation between scores on X and Y, rxy is the correlation between observed scores on X and Y, rxx is the reliability coefficient for the X scores, and ryy is the reliability coefficient for the Y scores. 37 The corrected correlation coefficients were tested to determine if their values were different from unity. The test statistic used was given by Lord (1957) as; (14512) (1—534> 2- x1 — 2.3026 (N l) loglO [(1 + 512) (l + 534) ‘ 4 513] (3.8) where N is the sample size ~ p12 = p34 = p13 = 1/3 (912 + 2913) 012 = 034 is the reliability estimate for the two measurements is the correlation between observed scores for the two measures Lord's derivation of the above formula depended on the use of the correlation between parallel test forms as a reliability coefficient and it assumed equivalent esti— mates of reliability for the two measures. The mean Kuder- Richardson Formula 20 reliability coefficient for the true-false and multiple choice subtests in each test form was used as an approximation of 612. The calculated value of xi was referred to the chi-square distribution with one degree of freedom and alpha was preset at the 5% level of significance. The 38 decision rule for each test was to reject the null hypo- . _.2 thes1s (ny—l) if x1 > 3.84. Summary The 1018 subjects that participated in the final testing phase of this study were described as representa- tive of non-urban high school students. Each subject responded to one of eight test forms in either social studies or natural science. The 70-item test forms were composed of half multiple choice and half true-false items. The true-false items were converted from multiple choice form by the judgmental and discrimination methods. Kuder-Richardson Formula 20 reliability coeffi- cients were calculated for all true-false and multiple choice subtests. The ratio of the number of true-false to multiple choice items that subjects attempted in the first eight minutes of testing was computed. Finally, the cor- relation between individual true-false and multiple choice subtest scores was calculated and corrected for attenuation for each final test form. CHAPTER IV RESULTS Introduction This chapter is divided into four major sections. The first section deals with the findings regarding the number of multiple choice and true—false items that sub— jects responded to in eight minutes of testing time. The second section contains the results relevant to the reliabilities of the multiple choice and true-false subtests. The outcomes associated with the first two research hypotheses are reported separately in section two. Results that reflect on the concurrent validity of the true-false and multiple choice tests are reported in the third section. 
A final section, the chapter summary, follows. Results Concernigg Amount of Testing Time Frequency distributions indicating the number of items subjects responded to in eight minutes were con- structed for each of the eight test forms. The medians and number of examinees for each distribution are shown in 39 Table 4.1. 40 Students, in general, worked more rapidly on the social studies tests than on the natural science tests. Also, students responded to more true—false than multiple choice items in the eight-minute period. Table 4.1. Median number of items attempted in the first eight minutes of testing for each final test form. True-False Items Test Form SJB SDB NJB NDB Number of examinees 121 120 122 122 Items attempted (Md) 27.44 26.83 23.00 24.83 Multiple Choice Items Test Form SJA SDA NJA NDA Number of examinees 118 123 120 119 Items attempted (Md) 17.42 17.34 16.42 17.05 The data from the above distributions were combined to form two frequency distributions, one for form A tests and one for form B tests. (See Table A2 in Appendix D for the complete distributions.) The typical performance of subjects on these forms was represented by medians calcu- lated for the two distributions. The median for the 41 true-false tests, form B tests, was 25.59 and the median for the multiple choice tests, form A tests, was 17.04. The ratio of these medians that serves as an index of the relative rates of work by subjects on the true-false and multiple choice tests was 1.50. The conclusion drawn from these data was that, in general, students attempted three true-false items for every pair of multiple choice items attempted. Results Concerning Test Reliability Kuder-Richardson Formula 20 reliability coeffi- cients were computed for each subtest of the eight final test forms. The 16 coefficients are reported in Table 4.2. The true-false subtest reliabilities were adjusted by the Spearman-Brown formula to estimate the reliabilities of tests 1.5 times as long as the original tests. The ad- justed reliabilities also appear in Table 4.2. Hypothesis One The first hypothesis of interest that was stated in Chapter III was: H1: When multiple choice items in social studies and 0 natural science are converted to true-false form, the Kuder—Richardson Formula 20 reliabilities of the two tests are not different, regardless of the subject matter. A visual inspection of the data reported in Table 4.2 indicated that the multiple choice and true-false 42 reliabilities were different, and in each case the multiple choice reliabilities were higher. The differences noted were tested statistically to determine if these were sig- nificant differences. Table 4.2. Kuder-Richardson Formula 20 reliability coefficients for final subtest forms. Subtest Multiple True-False a Test Form Choice Original Adjusted SJA .796 .708 .785 SJB .827 .654 .739 SDA .805 .498 .598 SDB .851 .641 .728 NJA .835 .759 .825 NJB .852 .612 .703 NDA .854 .704 .781 NDB .862 .645 .732 aThe adjusted true-false test reliabilities were estimated by the Spearman-Brown formula for a test 1.5 times the length of the original test. The test statistic, t, reported as Equation 3.3, was used to test the hypothesis Hi: pm = pt against the alternative hypothesis that H: is false. The pairs of reliability coefficients and their corresponding di's are included with the computational data in Table 4.3. The computed value of t was 5.520. Since the decision rule 43 for this test was to reject H: if -2.365 1 t g 2.365 (t( 05)7 = 2-365), a decision was made to reject Hi. 
Hypothesis One

The first hypothesis of interest that was stated in Chapter III was:

H₀¹: When multiple choice items in social studies and natural science are converted to true-false form, the Kuder-Richardson Formula 20 reliabilities of the two tests are not different, regardless of the subject matter.

A visual inspection of the data reported in Table 4.2 indicated that the multiple choice and true-false reliabilities were different, and in each case the multiple choice reliabilities were higher. The differences noted were tested statistically to determine if they were significant.

The test statistic, t, reported as Equation 3.3, was used to test the hypothesis H₀¹: ρm = ρt against the alternative hypothesis that H₀¹ is false. The pairs of reliability coefficients and their corresponding di's are included with the computational data in Table 4.3. The computed value of t was 5.520. Since the decision rule for this test was to reject H₀¹ if t < -2.365 or t > 2.365 (t(.05,7) = 2.365), a decision was made to reject H₀¹.

Table 4.3. Computations for testing hypothesis one.

  Test Form(a)   r_m     r_t     d_i     (d_i - d̄)²
  SJA            .796    .785    .011    .007709
  SJB            .827    .739    .088    .000117
  SDA            .805    .598    .207    .000067
  SDB            .851    .728    .123    .000586
  NJA            .835    .825    .010    .007885
  NJB            .852    .703    .149    .002520
  NDA            .854    .781    .073    .000666
  NDB            .862    .732    .130    .000973

  d̄ = .0988    S_d = .0506    n = 8

  (a) Means and variances for the final test forms can be found in Appendix E.

The conclusion based on these data was that the reliabilities of multiple choice and true-false tests were different, and, by inspection, the multiple choice reliabilities were consistently greater.

Hypothesis Two

The second hypothesis of interest that was stated in Chapter III was:

H₀²: When multiple choice items in social studies and natural science are converted to true-false form using the judgmental and discrimination methods, the Kuder-Richardson Formula 20 reliabilities of the true-false tests are not different, regardless of the subject matter.

An examination of the data reported in Table 4.2 showed that the true-false test reliabilities were relatively homogeneous. The two most extreme values, .825 and .598, favored the judgmental forms. A paired t test was used to test the hypothesis H₀²: ρJ = ρD against the alternative hypothesis that H₀² is false. The computations for this test appear in Appendix C. Since the calculated value of the test statistic, 1.307, did not exceed the critical value, t(.05,3) = 3.182, a decision was made not to reject H₀². The conclusion was that the reliabilities of the true-false tests constructed by the judgmental and discrimination methods were not different.
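Both t tests above reduce to the same computation on the printed summary statistics. A minimal sketch, assuming only the tabled values of d̄, S_d, and n (the function name is mine, not the study's):

```python
from math import sqrt

def paired_t(d_bar, s_d, n):
    # Equation 3.3: the mean difference divided by its standard error.
    return d_bar / (s_d / sqrt(n))

# Hypothesis one (Table 4.3): eight pairs of subtest reliabilities.
print(round(paired_t(0.0988, 0.0506, 8), 2))  # ~5.52 > t(.05,7) = 2.365, reject
# Hypothesis two (Table A2, Appendix C): four pairs of true-false forms.
print(round(paired_t(0.0532, 0.0814, 4), 2))  # ~1.31 < t(.05,3) = 3.182, do not reject
```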
Results Concerning Concurrent Validity

Each subject received a score on the multiple choice and on the true-false subtests of the test form to which he responded. A Pearson product-moment correlation coefficient was calculated between subtest scores on each of the eight final forms. These are presented in Table 4.4. The correlation coefficients were adjusted for unreliability in the measurement of the two variables by the correction for attenuation formula given as Equation 3.7. These estimates of the correlation between the true scores on the two subtests are also depicted in Table 4.4.

Table 4.4. Correlation coefficients for multiple choice and true-false subtest scores on each final form.

  Test Form   r_mt    r∞∞(a)   N
  SJA         .578    .769     126
  SJB         .697    .947     127
  SDA         .564    .891     128
  SDB         .430    .582     128
  NJA         .661    .831     126
  NJB         .728    1.009    129
  NDA         .710    .916     125
  NDB         .825    1.107    129

  (a) Designates r_mt corrected for attenuation.

The fourth research hypothesis of interest was stated in Chapter III as:

H₀⁴: The simple correlation between individuals' true-false and multiple choice test scores is 1.00, corrected for attenuation.

An inspection of the correlation coefficients in Table 4.4 revealed that two of the corrected correlations exceeded one. The explanation for actual values exceeding the theoretical upper bound of one, according to Lord (1957, p. 208), is sampling fluctuation. Values greater than unity occur when the correlation between observed scores is larger than the true value or when the observed reliability coefficients are underestimates of their true values.

The test statistic, χ₁², reported as Equation 3.8, was used to test the hypothesis H₀⁴: ρ_XY = 1 against the alternative hypothesis H₁⁴: ρ_XY < 1, where ρ_XY is the disattenuated correlation coefficient. Table 4.5 provides the results of the six tests that were carried out, each at the α = .05 level. If the alpha level had been reduced initially from .05 to .0001 for each test to favor non-rejection of the null hypothesis, the results would have been the same. Thus, the usual problem of compounding the alpha level for multiple statistical tests did not affect the outcomes in this situation.

Table 4.5. Computations and results of tests for hypothesis four.

  Test Form   ρ₁₂ = ρ₃₄   ρ₁₃     ρ̄₁₂ = ρ̄₃₄ = ρ̄₁₃   r∞∞     χ₁²
  SJA         .791        .578    .649                 .769    55.26*
  SJB         .783        .697    .726                 .947    14.27*
  SDA         .702        .564    .610                 .891    18.86*
  SDB         .789        .430    .550                 .582    100.42*
  NJA         .830        .661    .717                 .831    54.28*
  NJB         --          --      --                   1.009   --
  NDA         .818        .710    .746                 .916    26.95*
  NDB         --          --      --                   1.107   --

  * Significant at the .0001 level.

The conclusion drawn from the data represented by Table 4.5 was that corrected correlations between individuals' multiple choice and true-false subtest scores were not perfect (equal to 1.00).

Summary

The results of the data analysis for this study were presented in this chapter. The findings concerning the four major research hypotheses were:

1. Students responded to three true-false items for every pair of multiple choice items attempted. In addition, students worked more rapidly on the social studies items than on the natural science items.

2. The Kuder-Richardson Formula 20 reliability coefficients for the multiple choice subtests were greater than those of the true-false subtests.

3. There was no significant difference between the reliabilities of the true-false tests constructed by either the judgmental or discrimination methods.

4. The correlations, corrected for attenuation, between true-false and multiple choice subtest scores were significantly different from unity for six of the eight final test forms.

CHAPTER V

SUMMARY AND CONCLUSIONS

Summary

The purpose of this study was to compare the reliabilities of multiple choice and true-false tests and to determine the concurrent validities of true-false tests that were written to measure understanding of concepts and relationships. The four major questions that were formulated as research hypotheses were:

1. Are multiple choice and true-false achievement tests that were designed to measure the same objectives equally reliable?

2. Are true-false tests that are converted from multiple choice form by the judgmental method as reliable as those converted by the discrimination method?

3. What is the ratio of the number of true-false items attempted to the number of multiple choice items attempted by a group of examinees in a fixed period of time?

4. Is the correlation between individuals' true-false and multiple choice subtest scores perfect (+1.00) when the correlation is corrected for attenuation?

A search of the literature revealed that there were few recent studies concerning the comparison of true-false and multiple choice test reliabilities or validities. The findings of studies completed in the 1920's were incongruous and were based primarily on items that measured factual information. No studies were noted that reported an objective and reproducible procedure for changing items from one form to another. There was agreement in the research cited that 1.5 true-false items could be attempted in the time required to respond to one multiple choice item. No recent empirical evidence was located to substantiate this earlier claim.

A sample of 1018 non-urban high school students in Central Michigan each responded to one of eight test forms constructed to measure social studies or natural science achievement.
The original multiple choice items used in this study were selected from a widely used battery of standardized achievement tests.

Two methods were devised for systematically changing multiple choice items to true-false form. The judgmental method involved the use of secondary school teachers to choose the multiple choice distractor that would result in the most plausible false statement. The discrimination method relied on item analysis data from a multiple choice testing to identify the distractor that best discriminated between high and low scorers on the test. The first of three phases of testing was needed to gather the item analysis data.

The true-false items generated by the two conversion methods were tried out in the second phase of testing. The revised true-false items were incorporated in the eight test forms used in phase III, the final testing. Each of the 70-item final forms consisted of 35 multiple choice and 35 true-false items.

Kuder-Richardson Formula 20 reliability coefficients were calculated for the 16 multiple choice and true-false subtests. The ratio of the number of true-false to multiple choice items that subjects attempted in the first eight minutes of testing was also computed. The correlation between individuals' true-false and multiple choice subtest scores was calculated and corrected for attenuation for each final test form. Statistical tests were performed to determine if the subtest reliabilities were different and to ascertain if the values of the eight corrected correlation coefficients departed significantly from unity.

Conclusions

The reliability and concurrent validity coefficients from the final form subtests are summarized in Table 5.1. The conclusions associated with the four major research hypotheses were:

Table 5.1. Reliability and concurrent validity coefficients for final subtest forms.

  Test Form   r₂₀(a)   r_mt    r∞∞     N
  SJAm        .796     .578    .769    126
  SJAt        .785
  SJBt        .739     .697    .947    127
  SJBm        .827
  SDAm        .805     .564    .891    128
  SDAt        .598
  SDBt        .728     .430    .582    128
  SDBm        .851
  NJAm        .835     .661    .831    126
  NJAt        .825
  NJBt        .703     .728    1.009   129
  NJBm        .852
  NDAm        .854     .710    .916    125
  NDAt        .781
  NDBt        .732     .825    1.107   129
  NDBm        .862

  (a) The true-false test reliabilities were adjusted for a lengthened test by the Spearman-Brown formula.

1. The Kuder-Richardson Formula 20 reliability coefficients were greater for the multiple choice than for the true-false subtests.

2. There was no significant difference between the reliabilities of the true-false tests constructed by either the judgmental or discrimination method.

3. Examinees responded to three true-false items for every pair of multiple choice items attempted. In addition, students worked more rapidly on the social studies items than on the natural science items.

4. The correlations, corrected for attenuation, between true-false and multiple choice subtest scores were significantly different from unity for six of the eight final test forms.

Discussion

The findings of this study are somewhat in agreement with the conclusions drawn by other researchers in recent work. None of the studies previously cited, however, used subject matter as a variable of interest. The reliability coefficients obtained in this study were found to differ depending on item form, and an inspection of the data in Table 5.1 demonstrates that higher reliabilities were observed for the natural science subtests than for the social studies subtests. The corrected concurrent validity coefficients follow this same trend.
The explanation for these observed differences is not readily apparent. It may be true that the content of the high school social studies curriculum is less tightly organized than the subject matter in natural science. Hierarchically-arranged concepts and principles are probably more conducive to measurement with a set of relatively homogeneous items than are more loosely knit units of knowledge. The more heterogeneous social studies test items are likely to produce a lower coefficient of internal consistency than are the natural science items.

The results of the statistical tests employed in the data analysis of this study are probably not of paramount importance. The same conclusions could be drawn from an examination of the data in Table 5.1. The multiple choice and true-false subtest reliabilities within each of the test forms are not extremely different for practical group achievement testing purposes. Only two of the corrected correlation coefficients can be interpreted as perfect correlations. Five of the remaining six are probably too low to consider them near perfect. Sampling fluctuation may be an explanation for the low corrected coefficients. Another possibility, however, is the conjecture that true-false and multiple choice test items do not measure the same thing.

The abilities required of examinees to respond to true-false items are probably different from those needed to obtain the same score on a comparable four-response multiple choice test. For example, an individual may mark a statement true because he could not think of a counterexample, a situation or occurrence that would make the proposition false. His search for a counterexample may have been bounded by time limits, the length to which he could stretch his mind, or the depth of his retrieval system that he could penetrate. The multiple choice item, however, limits the universe of comparisons that the individual must make. He can decide which alternative makes a true statement with the item stem and then review the remaining alternatives to determine if any of them is a counterexample for the true statement. Though individuals probably differ in the responding schemes they use, the manners of responding to true-false and multiple choice items probably depend on somewhat different abilities. These differences in abilities required may be reflected in the test scores and, therefore, in the correlation coefficient.

The data from this study indicated that examinees' rates of work varied with item form and content. Students worked more rapidly on the true-false tests than on the multiple choice tests, and they responded more rapidly to social studies items than natural science items. One practical application of these findings is that teachers could maximize their classroom testing time by conducting small experiments to find out the rates of work of their classes based on the item forms the teacher typically uses. Those individuals that construct achievement tests should be aware that rate of work may vary with item form, content, and difficulty.

The findings of this study suggested that test quality may be somewhat sacrificed by using true-false rather than multiple choice items. A practical consequence of this finding is that a longer, though perhaps a bit less reliable, test may be used for a given period of testing if the items are in true-false rather than multiple choice form.
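The arithmetic behind this length-for-reliability tradeoff can be sketched briefly. The fragment below is illustrative only: the helper names are mine, the 3:2 rate and the SJB coefficients are taken from the tables above, and it reproduces the lengthening figures discussed in the next paragraph.

```python
def spearman_brown(r, k):
    # Projected reliability of a test k times as long (Equation 3.2).
    return (k * r) / (1 + (k - 1) * r)

def length_factor(r_now, r_target):
    # Lengthening factor needed to raise reliability from r_now to
    # r_target, obtained by solving the Spearman-Brown formula for k.
    return (r_target * (1 - r_now)) / (r_now * (1 - r_target))

# In the time 35 multiple choice items require, examinees attempt about
# 1.5 times as many true-false items (the 3:2 rate found in Chapter IV):
tf_items = 35 * 1.5                           # ~52 items
print(round(spearman_brown(0.654, 1.5), 3))   # ~.739, the projected reliability
                                              # of the lengthened SJB true-false form

# Items the .739-reliability true-false test would need in order to match
# the .827 reliability of the 35-item multiple choice form:
print(round(tf_items * length_factor(0.739, 0.827)))  # ~89 items, as discussed below
```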
A true-false test may be a feasible alternative if the examiner is primarily concerned about the adequacy with which his sample of items represents the universe of content. A longer test can probably effect a more thorough sampling of the universe. If, however, the examiner was not willing to sacrifice reliability, a 52-item true-false test with a reliability of .739 would theoretically need to be lengthened to 89 items to obtain the reliability coefficient of .827 that was achieved with a 35-item multiple choice test. (The assumptions required by the Spearman-Brown formula (Equation 3.2) would have to be considered in judging how much confidence to place in such an inference.)

Some teachers express the notion that unambiguous true-false items are more difficult to prepare than multiple choice items. Good items of both types are not actually easy to construct. The individual that finds more difficulty writing true-false items might utilize one of the two item conversion methods employed in this study to make true-false statements. The judgmental method might be an attractive procedure when examinations are prepared as a departmental effort. Only one good distractor is necessary to write a false statement; at least one good distractor is required for an adequate two-response multiple choice item. Also, good four-response multiple choice items can be converted to one true and three false statements. The converted items can be used to build a sizeable item bank.

The results of this study depart from the findings of Storey (1966, p. 285), who concluded his article by writing that "only a trifler and the uninformed pretend to measure anything with the relatively invalid, unreliable, and subject-to-set true-false item." Storey's remark is unwarranted, however, because he did not compare true-false items with any other item form in his study.

Limitations of the Study

The schools and classrooms that participated in this study did so on a voluntary basis. The generalizability of the findings to other classrooms is left to the reader's discretion since a population was not clearly specified. Also, natural science and social studies tests were arbitrarily selected for this study. The findings should not be indiscriminately generalized to tests covering vastly different subject matter.

The motivation of the subjects to reflect their true achievement levels on the tests is questionable. Students were directed not to guess blindly on any test item, but there is no way to determine the extent to which those directions were followed. Fewer than five examinees were observed randomly recording their responses throughout the three phases of testing. Less obvious cases of blind guessing are not easily detected.

The sample sizes for each of the final test forms were probably too small to control the sampling fluctuation that plagues the interpretation of correlation coefficients. Samples approaching 300 subjects would undoubtedly have been preferable to groups of 125.

Suggestions for Future Research

The following suggestions are offered for further investigation into the comparative effectiveness of item forms.

1. The range of true-false subtest scores was restricted compared to the range for multiple choice scores because the chance scores were different for the two. It would be appropriate for future studies to use a lengthened true-false test to estimate reliability instead of estimating the reliability for a hypothetically-lengthened test.
This procedure should increase the variability of the true-false scores and, perhaps, produce a better estimate of the relationships between multiple choice and true-false scores.

2. The use of conversion methods to construct content-equivalent items could be extended to include a frequency method and a random method. The frequency method would entail using the multiple choice distractor most frequently chosen by examinees to make a false statement. The random method would involve selecting a distractor at random to make a false statement.

3. The amount of testing time required for a given number of items could be investigated for various item forms in several subject matter areas with item difficulty used as a control variable.

4. The results of this study suggested differences in reliability and concurrent validity associated with the subject matter content of the test. Future investigations might profitably be designed to determine the causes of these differences.

5. Research is needed to investigate the sampling fluctuation that sometimes interferes with the interpretation of the disattenuated correlation coefficient. Perhaps a Monte Carlo study in which sample size is a variable would shed some light on this matter.

APPENDICES

APPENDIX A

DESCRIPTION OF SCHOOLS PARTICIPATING IN THIS STUDY

Table A1. Description of schools participating in this study.(a)

  School   Grade Levels   Number of Students   Number of Teachers   Number of Students
                          in the School        in the School        in the School District
  1        9-12           960                  40                   3489
  2        9-12           452                  24                   1621
  3        7-12           885                  40                   1914
  4                       1120                 62                   3320
  5                       1251                 54                   4112
  6                       1300                 67                   3763

  (a) This data was compiled from the Michigan Education Directory and Buyer's Guide (1970).

APPENDIX B

DISTRACTOR JUDGMENT TASK

Distractor Judgment Task

Directions: The following pages contain 70 multiple choice items with the correct response or best answer circled for each item. For each item you are to choose from the remaining alternatives that response which, in your judgment, would appear most attractive to a student that does not possess sufficient knowledge to respond to the item correctly. You might go about this task by thinking: "If I were going to make a false statement out of this item, which of the incorrect responses would provide me with the most plausible statement? Or, which statement would an uninformed student be most willing to accept as true?"

Please respond to each of the 70 items in this fashion. If two or more incorrect responses seem to be equally plausible, select one of them for your final choice on a random basis. Mark your choice on the enclosed answer sheet as if you were taking this test. (Of course, none of your responses will be correct when you have finished.) You need not fill in any of the identification blanks on the answer sheet.

Thank you for your cooperation. Your promptness in completing this task will be appreciated.

APPENDIX C

COMPUTATIONS FOR TESTING HYPOTHESIS TWO

Table A2. Computations for testing hypothesis two.

  Test Form   r_J     r_D     d_i      (d_i - d̄)²
  SA          .785    .598    .187     .017902
  SB          .739    .728    .011     .001781
  NA          .825    .781    .044     .000085
  NB          .703    .732    -.029    .006757

  d̄ = .0532    S_d = .0814    n = 4

APPENDIX D

NUMBER OF ITEMS ATTEMPTED IN THE FIRST EIGHT MINUTES OF TESTING

Table A3. Number of items attempted in the first eight minutes of testing.
  Number of Items   Multiple Choice   True-False
  Attempted         (N = 480)         (N = 487)
  35 or more        0                 56
  34                0                 8
  33                1                 16
  32                0                 16
  31                1                 11
  30                1                 25
  29                1                 22
  28                3                 29
  27                3                 21
  26                8                 38
  25                7                 19
  24                8                 20
  23                16                26
  22                17                24
  21                32                34
  20                45                25
  19                33                15
  18                36                21
  17                57                19
  16                51                8
  15                47                13
  14                35                2
  13                20                3
  12                15                2
  11                21                0
  10                8                 0
  9 or less         10                4

APPENDIX E

MEANS AND VARIANCES FOR THE SUBTESTS OF THE FINAL TEST FORMS

Table A4. Means and variances for the subtests of the final test forms.

  Test Form   Mean    Variance
  SJAm        24.31   30.67
  SJAt        19.55   25.86
  SJBt        24.48   18.09
  SJBm        16.92   43.02
  SDAm        24.21   31.65
  SDAt        19.61   15.42
  SDBt        22.57   18.17
  SDBm        17.44   48.83
  NJAm        21.62   40.88
  NJAt        19.71   30.61
  NJBt        20.78   18.05
  NJBm        17.83   47.32
  NDAm        22.19   44.35
  NDAt        21.00   24.69
  NDBt        19.36   20.97
  NDBm        16.79   50.63

BIBLIOGRAPHY

Ahmann, J. S. and Glock, M. D. Evaluating Pupil Growth. 3d ed. revised. Boston: Allyn and Bacon, Inc., 1967.

Andrews, D. M. and Bird, C. "Comparison of Two New-Type Questions: Recall and Recognition," Journal of Educational Psychology, XXIX (March, 1938), pp. 175-193.

Brown, F. G. Principles of Educational and Psychological Testing. Hinsdale: The Dryden Press, Inc., 1970.

Burmeister, M. A. and Olson, L. A. "Comparison of Item Statistics for Items in Multiple Choice and in Alternate-Response Form," Science Education, L (December, 1966), pp. 467-470.

Charles, J. W. "A Comparison of Five Types of Objective Tests in Elementary Psychology." Ph.D. Thesis, State University of Iowa, 1926.

Choppin, B. H. and Purves, A. C. "Comparison of Open-Ended and Multiple Choice Items Dealing With Literary Understanding," Research in the Teaching of English, III (Spring, 1969), pp. 15-24.

Copeland, J. S. and Gilliland, A. R. "Comparison of the Validity and Reliability of Three Types of Objective Examinations," Journal of Educational Psychology, XXXIV (April, 1943), pp. 242-246.

Cronbach, L. J. "Experimental Comparison of the Multiple True-False and Multiple Multiple Choice Tests," Journal of Educational Psychology, XXXII (October, 1941), pp. 533-543.

Durost, W. N. and Prescott, G. A. Essentials of Measurement for Teachers. New York: Harcourt, Brace and World, Inc., 1962.

Ebel, R. L. Measuring Educational Achievement. Englewood Cliffs: Prentice-Hall, Inc., 1965.

________. "Case For True-False Test Items," School Review, LXXVIII (May, 1970), pp. 373-389.

________. "The Comparative Effectiveness of True-False and Multiple Choice Achievement Test Items." Paper presented at the American Educational Research Association Annual Meeting, New York City, February, 1971.

Eurich, A. C. "Four Types of Examinations Compared," Journal of Educational Psychology, XXII (1931), pp. 268-278.

Feldt, L. S. "A Test of the Hypothesis That Cronbach's Alpha or Kuder-Richardson Coefficient Twenty Is the Same for Two Tests," Psychometrika, XXXIV (September, 1969), pp. 363-373.

Ghiselli, E. C. Theory of Psychological Measurement. New York: McGraw-Hill Book Company, 1964.

Gronlund, N. E. Measurement and Evaluation in Teaching. New York: The MacMillan Company, 1965.

Heim, A. W. and Watts, K. P. "Experiment on Multiple Choice Versus Open-Ended Answering in a Vocabulary Test," British Journal of Educational Psychology, XXXVII (November, 1967), pp. 339-346.

Lord, F. M. "A Significance Test for the Hypothesis That Two Variables Measure the Same Thing Except for Errors of Measurement," Psychometrika, XXII (September, 1957), pp. 207-220.

Loree, M. R. "A Study of a Technique for Improving Tests." Ph.D. Thesis, University of Chicago, 1948.

Marascuilo, L. A. "Large Sample Multiple Comparisons," Psychological Bulletin, LXV (1966), pp. 280-290.
"Large Sample Multiple Comparisons," Psychological Bulletin, LXV (1966), pp. 280-290. Michigan Department of Education. Levels of Educational Performance and Related Factors in Michigan. Lansing: Assessment Report No. 4, 1970, pp. 19-26. Michigan Education Directory and Buyer's Guide. Lansing: Michigan Education Directory, 1970. Mood, A. M. and Graybill, F. A. of Statistics. 2d ed. Book Company, 1963. Introduction to the Theory New York: McGraw-Hillfi 66 Owens, R. E.; Hanna, G. S.; and Coppedge, F. L. "Compar- ison of Multiple Choice Tests Using Different Types of Distractor Selection Techniques," Journal of Educational Measurement, VII (Summer, 1970), pp. 87-90. Ruch, G. M. and Stoddard, G. D. "The Comparative Relia- bilities of Five Types of Objective Examina- tions," Journal of Educational Psychology, XVI (1925), PP. 89-103. Rutledge, R. E. "The True-False Examination in Elementary Psychology With Suggestions for Its Improvement." Ph.D. Thesis, University of California, 1926. Storey, A. G. "Review of Evidence or the Case Against the True-False Item," Journal of Educational Re- search," LIX (February, 1966), pp. 282-285. Thorndike, R. L. and Hagen, E. Measurement and Evaluation in Psychology and Education. 3d ed. revised. New York: John Wiley and Sons, Inc., 1969. Toops, H. A. "Trade Tests in Education," Teachers Collgge Contribution to Education. New York: Teachers College, Columbia University, No. 115, 1921. Watson, D. R. and Crawford, C. C. "Four Types of Tests," High School Teacher, VI (September, 1930), pp. 282-283. Wesman, A. G. "Writing the Test Item." Chapter 4 in Thorndike, R. L. (ed.). Educational Measurement. 2d ed. Washington: American Council on Educa- tion, 1971. Williams, B. J. and Ebel, R. L. "The Effect of Varying the Number of Alternatives Per Item on Multiple Choice Vocabulary Test Items," Fourteenth Year- book of the National Council on Measurements Used In Education, Princeton, 1957, pp. 122-125. General References Allen, D. W. "Quick Scoring, Less Guessing on True—False Tests," Clearinghouse, XXXIII (October, 1958), pp. 74-76. 67 Bayless, E. E. and Bedell, R. C. "A Study of Comparative Validity as Shown By a Group of Objective Tests," Journal of Educational Research, XXIII (1931), pp. 8-16. Burkheimer, G. J.; Zimmerman, D. W.; and Williams, R. H. "Maximum Reliability of a Multiple Choice Test as a Function of Number of Items, Number of Choices, and Group Heterogeneity," Journal of Experimental Education, XXXV (Summer, 1967), pp. 89-94. Carter, H. D. and Crone, A. P. "The Reliability of New— Type or Objective Tests in a Normal Classroom Situation," Journal of Applied Psychology, XXIV (1940), pp. 353-368. Feldt, L. S. "The Approximate Sampling Distribution of Kuder-Richardson Reliability Coefficient Twenty," Psychometrika, XXX (September, 1965), pp. 357— 370. Hurd, A. W. "Comparison of Short Answer and Multiple Choice Tests Covering Identical Subject Content," Journal of Educational Research, XXVI (September, 1932), pp. 28—30. Karraker, R. J. "Knowledge of Results and Incorrect Recall of Plausible Multiple Choice Alternatives," Journal of Educational Psychology, LVIII (Feb— ruary, 1967), pp. 11—14. Kinney, L. B. and Eurich, A. C. "Summary of Investigations Comparing Different Types of Tests," School and Society, XXXVI (October 22, 1932), pp. 540-544. Magill, W. "The Influence of the Form of Item on the Validity of Achievement Tests," Journal of Edu— cational Psychology, XXV (1934), pp. 21-28. Miklich, D. R. and Gordon, G. P. "Test-Taking Carefulness vs. 
Millman, J. and Setijadi. "Comparison of the Performance of American and Indonesian Students on Three Types of Test Items," Journal of Educational Research, LIX (February, 1966), pp. 273-275.

Payne, W. H. and Anderson, D. E. "Significance Levels for the K-R20: An Automated Sampling Experiment Approach," Educational and Psychological Measurement, XXVIII (Spring, 1968), pp. 23-39.

Preston, R. C. "Multiple Choice Test as an Instrument in Perpetuating False Concepts," Educational and Psychological Measurement, XXV (Spring, 1965), pp. 111-116.

Remmers, H. H. and Remmers, E. M. "The Negative Suggestion Effect on True-False Examination Questions," Journal of Educational Psychology, XVII (1926), pp. 52-56.

Ruch, G. M. The Objective or New-Type Examination. Scott, Foresman and Company, 1929.

Shulson, V. and Crawford, C. C. "An Experimental Comparison of True-False and Completion Items," Journal of Educational Psychology, XIX (1928), pp. 580-583.

Storey, A. G. "Versatile Multiple Choice Item," Journal of Educational Research, LXII (December, 1968), pp. 169-172.

Wood, B. D. "Studies in Achievement Tests," Journal of Educational Psychology, XVII (1926), pp. 1-22.