This is to certify that the dissertation entitled

THE COMPARISON OF ALTERNATE-CHOICE AND TRUE-FALSE FORMS USED IN CLASSROOM EXAMINATIONS

presented by

NANCY ANN MAIHOFF

has been accepted towards fulfillment of the requirements for the Ph.D. degree in [illegible], EVALUATION, AND RESEARCH DESIGN

Major professor

Date: MARCH 13, 1986

MSU is an Affirmative Action/Equal Opportunity Institution
A COMPARISON OF ALTERNATE-CHOICE AND TRUE-FALSE ITEM FORMS USED IN CLASSROOM EXAMINATIONS

By

Nancy Ann Maihoff

A DISSERTATION

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of

DOCTOR OF PHILOSOPHY

Department of Counseling, Educational Psychology, and Special Education

1986

ABSTRACT

THE COMPARISON OF ALTERNATE-CHOICE AND TRUE-FALSE ITEM FORMS USED IN CLASSROOM EXAMINATIONS

By

Nancy Ann Maihoff

The purpose of this study was to compare two-choice alternate-choice and true-false items, and correct answer first (ACci) and incorrect answer first (ACic) versions of alternate-choice items, on difficulty, discrimination, reliability, and criterion-related validity; and to investigate the practicability of judging the better versions of the alternate-choice and the true-false items.

Three tests were administered to students in a freshman-level natural science course. Form A and Form B of Tests I and II contained identical sets of multiple-choice items, 10 alternate-choice items, and 10 true-false items. The alternate-choice and true-false items on Form B were the content equivalent of the true-false and alternate-choice items on Form A. The same 247 students took both Tests I and II.

All alternate-choice items were converted to ACci and ACic versions; true-false items were converted to true form (TFt) and false form (TFf). Two experienced course instructors were asked to judge which alternate-choice and which true-false versions would best maximize the chances of an informed student correctly answering the item and an uninformed student incorrectly answering the item.

Form A and Form B of Test III consisted of alternate-choice and identical sets of multiple-choice items.
Form A contained 10 ACci and 10 ACic items; Form B contained the respective content equivalent of these ACic and ACci items. There were 102 students who took Test III.

The alternate-choice items were found to be less difficult than the true-false items. Both item forms were equally discriminating and reliable, and both were equally related to final course grade. The agreement of the judges on the better item version of the alternate-choice and true-false items was only 5 percent greater than that expected by chance. No differences were found between the two alternate-choice versions on difficulty, discrimination, reliability, or criterion-related validity.

For all three tests, significant interaction effects were found for item position and item content. Control of these two variables is strongly recommended in further research of this type.

Dedicated to the memory of Robert L. Ebel

ACKNOWLEDGMENTS

I would like to express my sincere appreciation to the many special people who helped to make this endeavor possible.

My gratitude is expressed to Dr. Wm. A. Mehrens, Dr. Lou Anna Kinsey-Simon, Dr. John E. Hunter, Dr. Alain F. Corcos, and Dr. Susan E. Phillips for their advice and assistance as members of my doctoral committee.

A special word of appreciation must go to Mr. Michael J. Valiga and Dr. Julie P. Noble of ACT, and Mr. Steven E. Pokorny for their valuable advice and assistance.

Also, I wish to thank my students at the California School of Professional Psychology who gave me the strong encouragement to begin this educational endeavor, and a very special thanks to Ms. Thelma A. Weiner who helped me realize it could be attained.

TABLE OF CONTENTS

LIST OF TABLES
LIST OF APPENDICES

Chapter

I. STATEMENT OF THE PROBLEM
    Need for the Study
    Purpose of the Study
    Hypotheses
    Definition of Terms
    Overview of Dissertation

II. REVIEW OF THE LITERATURE
    Introduction
    Studies Comparing Two-choice/Alternate-choice and True-false Item Forms
    Item Conversion Procedures
    Item Sequence
    Chapter Summary

III. DESIGN AND PROCEDURE
    Part I
        Sample
        Materials
        Item Conversion Procedures
        Test Form Development
        Procedure
    Part II
        Participants
        Materials
        Procedure
    Part III
        Sample
        Materials
        Test Form Development
        Procedure
    Hypotheses
        Part I
        Part II
        Part III
    Design and Analysis
        Latin Square Design
        Feldt Test for Equality of Two KR-20 Reliabilities

IV. RESULTS
    Part I
    Part II
    Part III

V. SUMMARY AND CONCLUSIONS
    Part I
    Part II
    Part III
    Limitations of the Study
    Suggestions for Further Research

APPENDICES
REFERENCES

LIST OF TABLES

Table
1   Summary of Results from Ruch and Stoddard Study
2   Summary of Results from Charles Objective Item Study
3   Summary of Ebel's Alternate-choice and True-false Item Study
4   Summary of Results from Loree Item Conversion Study
5   Summary of Burmester and Olson Alternate-choice Study
6   Summary of Williams and Ebel Item Conversion Study
7   Results of Brenner Item Arrangement Study
8   Results of the Monk and Stallings Item Arrangement Study
9   Results of the Klosner and Gellman Item Order Study
10  Number of Students Who Took Form A and Form B of Tests I and II
11  Positions of Items on Test I and Test II
12  Strata, Number in Each Stratum, and Results of Random Sample of Items
13  Position of Items on Test III
14  Position of Alternate-choice Versions on Test III
15  Latin Square Research Design Used in Part I and Part III of this Study
16  Analysis of Total Sum of Squares for the 2 x 2 Latin Square Design
17  Results of Repeated Measures Analysis of Tests I and II
18  Latin Square Design and Item Statistics for Tests I and II
19  Latin Square Analysis of Test I (Midterm)
20  Latin Square Analysis of Test II (Final)
21  Latin Square Design and Item Statistics for Part III
22  Latin Square Analysis of Part III
23  Reliabilities Adjusted to 100 Items by the Spearman-Brown Formula
24  Point-biserial Correlations of Experimental Items - Tests I, II, and III

LIST OF APPENDICES

Appendix
A   Multiple-Choice, Alternate-choice, and True-false Forms of Experimental Items
B   University Committee on Research Involving Human Subjects (UCRIHS) Approval Letter
C   Exam Instruction Sheet
D   Item Judgment Instruction and Recording Sheets
E   Point-biserial Correlations of Experimental Items - Tests I, II, and III

CHAPTER I

STATEMENT OF THE PROBLEM

College instructors have a variety of item forms available for construction of classroom achievement tests. The essay and the objective-type items such as the short-answer, matching, multiple-choice, and true-false item forms have been most commonly used to sample students' knowledge of subject matter taught in the classroom. Of these item forms, the use of true-false items on educational achievement tests has been controversial.

On one side of the controversy are those who contend that the problem of guessing, the low reliability, and the difficulty in preparing good true-false items should be evidence enough that true-false items should be abandoned in favor of three-, four-, or five-response multiple-choice items (Ahmann & Glock, 1967; Gronlund, 1965; Brown, 1970; Frisbie, 1971).

On the other side of the controversy are those (Ebel, 1979; Smith, 1958; Burmester & Olson, 1966) who point out that true-false items not only have respectably high reliabilities, but are also more efficient than multiple-choice items. This high efficiency means that more true-false questions than multiple-choice questions can be asked within a specific time period, and as a result, true-false tests can provide a much broader sampling of students' knowledge of subject matter.
There are those instructors who are willing to accept a lower reliability to obtain a more content-valid examination.

Recently, Ebel (1982) proposed the use of what he termed the 'alternate-choice' item as a replacement for the true-false item. This alternate-choice item form described by Ebel is a modified two-choice multiple-choice item in which the two responses are included within the stem of the question. The following are examples of this item form (Ebel, 1982, pp. 272, 274):

    An eclipse of the sun can only occur when the moon is 1) full 2) new*.

    The density of ice is 1) greater 2) less* than that of water.

The placement of the two responses is not restricted to the end of the stem; they may be placed wherever they fit best in the structure of the stem.

Ebel's proposal to replace the true-false item with the alternate-choice item sprang from the results of research conducted in one of his college courses in educational measurement. The findings showed that tests composed of alternate-choice items were less difficult, more discriminating, and more reliable than tests composed of true-false items. Also, because of its brevity, the alternate-choice item form provided an efficiency similar to that of the true-false item in measuring examinees' knowledge.

Earlier, Smith (1958) examined what he called the 'double-choice' item form. The double-choice item is written in a form similar to the alternate-choice item constructed by Ebel, but with more distinct punctuation. Several examples follow (Smith, 1958, pp. 387-388):

    The two divisions of the autonomic nervous system are the: (a) central and skeletal (b) sympathetic and parasympathetic*.

    An animal is trained to choose a 2-inch square in preference to a 1-inch square. When later confronted with a 2-inch square and a 4-inch square, he will probably choose the: (a) 2-inch (b) 4-inch*: square.
These double-choice items produced reliabilities ranging from .82 to .90 in tests containing more than 200 items. Smith found that the items were easier to write and that more could be written within a given time period than three-, four-, and five-choice multiple-choice items. In addition, student reaction to the use of the double-choice items was extremely positive.

This initial evidence indicates that the alternate-choice item form may be a superior substitute for the much disparaged true-false item form without losing the positive quality of item efficiency characteristic of the true-false item.

Need for the Study

Few empirical studies have been conducted that compare the alternate-choice type item form to the true-false item. Those that have compared these items (Ruch & Stoddard, 1925; Charles, 1926) have found results that differ from those of Ebel (1982). Clearly, further empirical study is needed not only to help clarify the status of the alternate-choice item relative to the true-false item, but also to investigate in greater depth the item characteristics of the alternate-choice item form.

Purpose of the Study

The purpose of this study was threefold: 1) to compare the difficulty level, discrimination level, reliability, and criterion-related validity of the alternate-choice item form with the content equivalent true-false item form; 2) to investigate the practicability of judging whether the alternate-choice item version with the correct answer listed first (ACci) or the version with the incorrect answer first (ACic) is the better version of the item, and whether the true form of a true-false item (TFt) or the false form (TFf) is the better version of that item; and 3) to examine the effects of placing the correct answer first (ACci version) or the incorrect answer first (ACic version) on item difficulty, item discrimination, reliability, and criterion-related validity of the item.
The inter-comparisons of item characteristics of the alternate-choice and true-false items are presented in Part I of this study, the item version judgment in Part II, and the intra-comparison of the item characteristics of versions ACci and ACic of the alternate-choice item in Part III.

Hypotheses

Part I. The major hypotheses in Part I of this study were:

    H1: The alternate-choice items will be less difficult than the content equivalent true-false items.
    H2: The alternate-choice items will show higher discrimination than the content equivalent true-false items.
    H3: The reliability of the alternate-choice items will be greater than the reliability of the true-false items.
    H4: The criterion-related validity of the alternate-choice items will be greater than the criterion-related validity of the true-false items.

Part II. The major hypotheses in Part II of this study were:

    H1: Agreement between two departmental colleagues as to the better version of an alternate-choice item will be no better than chance (50%).
    H2: Agreement between two departmental colleagues as to the better version of a true-false item will be no better than chance (50%).

Part III. The major hypotheses in Part III of this study were:

    H1: Version ACci and version ACic of the alternate-choice item will not differ in difficulty level.
    H2: Version ACci and version ACic of the alternate-choice item will not differ in discrimination level.
    H3: Reliabilities of version ACci and version ACic of the alternate-choice item will not differ.
    H4: Criterion-related validities of version ACci and version ACic of the alternate-choice item will not differ.

Definition of Terms

Criterion-related validity. In Part I of this study, criterion-related validity is defined as the product moment correlation between the respective total score of the alternate-choice items and the true-false items and the criterion of total weighted score in the course.
In Part III of this study, criterion-related validity is defined as the product moment correlation between the respective total score of the ACci items and the ACic items and the criterion of total weighted score in the course. In Part I, the scores of the true-false and alternate-choice items were removed from the weighted course total scores prior to calculating the correlation coefficient, and in Part III, the weighted scores of both versions of the alternate-choice items were removed from the course total scores prior to calculating the correlation coefficient.

Item difficulty, p. The difficulty of an item is defined as the proportion of students answering the item correctly. This proportion is represented by the letter p.

Test difficulty. The difficulty of a test is defined as the mean score of the test.

Item discrimination, D. The discrimination level of an item, D, is defined as the proportion of the 27 percent highest scoring students who answered the item correctly minus the proportion of the 27 percent lowest scoring students who answered the item correctly. The index D ranges from -1.00 to 1.00.

Item discrimination, r_pbis. The discrimination level of an item, r_pbis, is defined as the product moment correlation coefficient of the continuous variable of total score and the dichotomous variable of correct answer (1) or incorrect answer (0) for the item.

Test reliability. The reliability, r_tt, of a test in this study is defined as the degree of internal consistency of the test as estimated by the Kuder-Richardson 20 (KR-20) formula.

Overview of Dissertation

In Chapter II, the literature relevant to the general research of multiple-choice item forms and to specific comparative research covering true-false and two-choice/alternate-choice type item forms is reviewed.
In addition, literature related to research on item conversion methods and item sequence effects is reviewed, in that this literature is relevant to the item conversion methods and item sequencing procedures used in this study. In Chapter III, the sample, the instrumentation, the procedures used, the operational form of the hypotheses, and the analyses used are presented. The results of the study for each part by each hypothesis are presented in Chapter IV. The final chapter, Chapter V, contains a summary of the study, a discussion of the findings, the limitations of the study, and suggestions for future research.

CHAPTER II

REVIEW OF THE LITERATURE

Introduction

The review of the literature is organized and presented in the three areas relevant to this research. The first section of this chapter concerns the review of applicable comparative studies of objective item forms on item characteristics of difficulty, reliability, and/or validity and the more specific studies that compared the two-choice multiple-choice item and true-false item forms on these item characteristics. The second part of the review of literature concerns the methods used to convert items from one form to another item form. The third part of the review of literature concerns the effects of item sequence on the item characteristics of power tests. A chapter summary follows at the end of this chapter.

Studies Comparing Two-choice/Alternate-choice and True-false Item Forms

One of the earliest studies to examine the two-choice item was carried out by Ruch and Stoddard (1925). These investigators constructed five versions of a 100-item examination to test general information knowledge in history and social sciences. Each 100-item examination was composed of two forms, Form A and Form B. Each form contained fifty parallel items. Split-half reliabilities for each examination were calculated using Form A versus Form B.
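Split-half coefficients of this kind describe a test of half length, and the studies reviewed in this chapter step them up to full length with the Spearman-Brown formula. The following is a minimal sketch of that adjustment, not a procedure taken from any of the studies; the function name `spearman_brown` and the parameter `k` (the lengthening factor) are illustrative assumptions. The two checks use coefficients reported later in this chapter: 0.603 for a 50-item form stepped up to 100 items (Table 2), and a 25-item KR-20 of 0.67 stepped up to 100 items (Table 3).

```python
def spearman_brown(r: float, k: float) -> float:
    """Projected reliability when a test is lengthened by a factor k.

    r is the reliability of the shorter test (assumed here to lie in [0, 1]);
    k = 2 doubles the test length, k = 4 quadruples it.
    """
    if not 0.0 <= r <= 1.0:
        raise ValueError("r is assumed to lie between 0 and 1")
    if k <= 0:
        raise ValueError("k must be positive")
    return k * r / (1.0 + (k - 1.0) * r)


# Illustrative checks against coefficients quoted in the review.
half_to_full = spearman_brown(0.603, k=2)  # 50 items -> 100 items
unit_to_100 = spearman_brown(0.67, k=4)    # 25 items -> 100 items
print(round(half_to_full, 3), round(unit_to_100, 2))
```

Under these assumptions the stepped-up values agree with the 100-item coefficients given in the summary tables (0.752 and 0.89).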
Version I of the 100-item examination was composed entirely of recall items; Version II of five-choice multiple-choice items; Version III of three-choice multiple-choice items; Version IV of two-choice multiple-choice items; and Version V of true-false items. Examples of the item forms found in these versions of the tests follow (Ruch & Stoddard, 1925, pp. 90-91):

    Recall       1. The American Revolution began in the year                       1775
    Five-choice  1. The American Revolution began in 1762 1775 1783 1789 1812       1775
    Two-choice   1. The American Revolution began in 1762 1775 1789                 1775
    True-False   1. The American Revolution began in 1775                           True  False

The examinee was to write the correct answer in the space to the right of the recall, five-, three-, or two-choice item. The correct answer was to be underlined or circled for the true-false items.

The senior classes of 24 Iowa high schools were divided alphabetically by surnames into four groups, each containing approximately 135 students. The recall examination was administered to all students on the first day. The following day the five-choice multiple-choice test was administered to Group I; the three-choice multiple-choice test, to Group II; the two-choice multiple-choice test, to Group III; and the true-false test, to Group IV.

Of particular interest to the researchers in this study were the comparative merits of the two-choice multiple-choice item form and the true-false item form, given that their pure guessing probabilities are 50:50. The results of this study are summarized in Table 1.

Table 1

Summary of Results from Ruch and Stoddard Study

              50-item   100-item   Mean (SD)       Mean (SD)                      No. per      Adjusted
Item form     r_tt(a)   r_tt(b)    Form A          Form B          n     r_xy(c)  100 recall   r_tt(d)
Recall        --        --         --    (6.37)    --    (7.86)    --    --       --           --
5-choice      0.796     0.886      27.20 (5.73)    22.80 (7.68)    137   0.861    177          0.901
3-choice      --        --         --    (5.73)    --    (6.06)    --    --       --           --
2-choice      0.737     0.749      35.64 (5.73)    31.98 (5.88)    135   0.713    164          0.902
True-False    0.55      0.714      30.06 (5.98)    27.67 (6.84)    133   0.480    183          0.820

Note: Statistics reported are for scores uncorrected for chance.
(a) Form A vs Form B.
(b) Reliability using Spearman-Brown formula.
(c) Concurrent validity with recall items as criterion.
(d) Reliability adjusted for number of items administered per time period.

Although there were no statistical tests performed on the differences between reliabilities of the forms, Ruch and Stoddard concluded that there were no practical differences in reliability, particularly between the three-choice, two-choice, and true-false item tests. Upon examination of the mean scores, the true-false version appears to have been more difficult than the two-choice version of the examination. Also, the authors noted that considerably more two-choice items and true-false items could be administered within a given time period than could other item types.

The researchers noted that the correlation of the two-choice and true-false items with the recall items seemed to indicate that there were differences in the kinds of mental processes brought into play when answering these items:

    It should always be remembered that we cannot assure that restatement of recall items in multiple-response or true-false forms does not alter their relative difficulties. However, such a lack of perfect agreement as our coefficients show probably cannot be accounted for by the matter of changed difficulties but must be in large part indicative of differences in the pedagogical and psychological characteristics of the several types. (Ruch & Stoddard, 1925, pp. 92-93)

In a subsequent study, Charles (1926) administered a 100-item general psychology examination to 747 college students (this study was later published with Ruch: Ruch & Charles, 1928). Five versions of the examination were constructed; each version contained 50 items in Form A and 50 parallel items in Form B. Version I of the examination contained only recall items; Version II, only five-choice multiple-choice items; Version III, only three-choice multiple-choice items; Version IV, only two-choice multiple-choice items; and Version V, only true-false items.

The following are examples of the item forms used (Charles, 1926, p. 399):

    Recall        1. A synapse is a junction between two                                                                  neurones
    Five-choice   1. A synapse is a junction between two (1) dendrites, (2) axones, (3) muscles, (4) neurones, (5) bones.  4
    Three-choice  1. A synapse is a junction between two (1) neurones, (2) dendrites, (3) axones.                          1
    Two-choice    1. A synapse is a junction between two (1) neurones, (2) dendrites.                                      1
    True-False    1. A synapse is a junction between two dendrites.                                                        True  False

In the recall version of the test, the examinees were instructed to write the correct answer in the space provided to the right of the question. In the five-, three-, and two-choice versions of the test, the examinees were instructed to write the number of the correct answer in the space provided. In the true-false version, 'True' or 'False' was to be underlined.

The recall version of the examination was administered on Day 1 to all students. On Day 2, the examinations were sequentially arranged and distributed so that every fourth student received the same version of the test. A summary of the results is shown in Table 2.

The results of this study showed the two-choice item form to be considerably less reliable than the true-false item form (see Table 2).
When difficulty levels were compared, the true-false item form was shown to be more difficult than the two-choice item type, a result similar to that found by Ruch and Stoddard (1925). Again, when compared to other item types, substantially more two-choice and true-false items could be administered within a given time period.

Table 2

Summary of Results from Charles Objective Item Study

              50-item   100-item   Mean (SD)       Mean (SD)                      No. per      Adjusted
Item form     r_tt(a)   r_tt(b)    Form A          Form B          n     r_xy(c)  55 minutes   r_tt(d)
Recall        0.603     0.752      11.33 (2.04)    15.60 (2.92)    747   0.311    154          0.824
5-choice      0.680     0.809      27.64 (3.10)    31.31 (2.94)    182   0.213    220          0.903
3-choice      0.624     0.768      32.82 (2.73)    37.90 (2.64)    188   0.313    258.5        0.896
2-choice      0.477     0.646      37.10 (2.07)    40.39 (2.09)    188   0.319    285          0.839
True-False    0.602     0.751      32.90 (2.27)    34.47 (2.33)    189   0.226    302.5        0.901

Note: Statistics reported are for scores uncorrected for chance.
(a) Form A vs Form B.
(b) Reliability using Spearman-Brown formula.
(c) Predictive validity using final grades as criterion.
(d) Reliability adjusted for number of items administered per 55 minutes.

Predictive validity of the item subtest scores with final grades was quite low for all subtest scores, with the highest correlation (r_xy = 0.319) between the two-choice subtest and final grades, and the two lowest between the five-choice (r_xy = 0.213) and the true-false (r_xy = 0.226) scores and final grades. Charles considered these low correlations the result of limited variability in final grades given by instructors.

In a more recent study, Ebel (1982) compared what he termed the "alternate-choice" test item with the true-false item. The alternate-choice item form proposed by Ebel is similar in construction to the two-choice multiple-choice item form used by Charles except that in the alternate-choice form the responses can be placed in positions other than at the end of the item.
Ebel provides several examples (pp. 272-273):

    An eclipse of the sun can only occur when the moon is 1) full 2) new.

    Most teachers believe that if tests were to be abandoned there would be 1) serious 2) very few educational losses.

In his study, Ebel administered two 25-item end-of-unit tests to 28 students enrolled in an educational measurement course. One of the tests was composed of true-false items, the second test of alternate-choice items. Each test was composed of independent samples of items that did not ask the same questions. For each of the eight units comprising the course, either the true-false or the alternate-choice version was administered first. By the end of the term, the true-false and the alternate-choice versions had each been administered first four times. The scores were summed for the eight alternate-choice form tests and the eight true-false form tests. The results of this study are shown in Table 3.

Table 3

Summary of Ebel's Alternate-choice and True-false Item Study

                    25-item      100-item
Item form           KR-20 r_tt   KR-20 r_tt(a)   Mean    SD     Diff(b)   Disc(c)
Alternate-choice    0.67         0.89            19.99   3.08   0.20      0.30
True-False          0.47         0.78            18.80   2.81   0.25      0.28

Note: All statistics reported are averages over eight unit tests, and scores are not corrected for chance.
(a) Reliability using Spearman-Brown formula.
(b) Difficulty is expressed as 1 - p.
(c) Based on D, the difference between the upper-lower 27% scores.

As in the previous studies, the true-false version of the test appears to have been more difficult than the alternate-choice version. Reliability for the alternate-choice items was considerably higher than for the true-false items, and it appears that both item forms discriminate equally.

Item Conversion Procedures

When items are converted from one form to another, for example from four-choice to alternate-choice, there has been concern that this conversion might affect item characteristics such as reliability, validity, difficulty, or discrimination.
In reviewing item conversion methods, Owens, Hanna, and Coppedge (1970) found that the three methods most commonly used were the judgmental, the frequency, and the discrimination methods. In the judgmental method, the item author supplies the distractor or distractors considered most plausible. The frequency and discrimination methods are more empirical in nature. According to the frequency method, the most frequently chosen incorrect response or responses are used as the respective distractor or distractors in multiple-choice items, or for false versions of true-false items. With the discrimination method, the incorrect responses used are those that discriminate most highly between high scorers and low scorers. A small number of studies have been conducted comparing the effects of these methods on item reliability, validity, discrimination, and difficulty.

Loree (1948) studied the relative effects of using the judgmental and the frequency methods for selecting multiple-choice item distractors for tests in three subject areas: arithmetic, health knowledge, and word meaning. Three forms of each test were developed. Form A consisted of multiple-choice items for which the examiner "conceived" of the distractors. Loree defines the term 'conceived' as any process by which the test constructor obtains distractors to multiple-choice items, other than by obtaining direct evidence of the kinds of responses students make to a specific test item when that item is presented in a free response form (p. 6). In other words, the examiner judges, without empirical method, which distractors would be the most appropriate to include in the multiple-choice item.

Form B of the examinations consisted of the same items in recall form. Form C consisted of the same items, but these were in multiple-choice form in which the distractors had been developed from the most frequent response errors to the recall items on Form B.
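The frequency method lends itself to a short illustration. The sketch below tallies the incorrect free responses to one recall item and keeps the most common ones as distractors. It is an example under stated assumptions rather than Loree's actual procedure: the function name `frequency_distractors`, the case-insensitive matching of answers, and the response data are all hypothetical.

```python
from collections import Counter


def frequency_distractors(responses, correct, k=3):
    """Return the k most frequent incorrect free responses to a recall item.

    Answers are compared after trimming whitespace and lowercasing, which is
    an assumption of this sketch; blank responses are ignored.
    """
    key = correct.strip().lower()
    cleaned = (r.strip().lower() for r in responses)
    wrong = Counter(r for r in cleaned if r and r != key)
    return [answer for answer, _count in wrong.most_common(k)]


# Hypothetical free responses to "A synapse is a junction between two ____".
recall_responses = [
    "neurones", "dendrites", "axones", "dendrites",
    "muscles", "neurones", "dendrites", "axones",
]

distractors = frequency_distractors(recall_responses, correct="neurones", k=2)
print(distractors)
```

With these invented responses, the two most frequent errors ("dendrites", then "axones") would become the distractors of the converted item.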
All forms of each test were administered to high school students in the Chicago area. Form B, containing the recall items, was administered first. Form C was then developed from the results of Form B, and both Form A and Form C were administered to the same students three weeks later.

To investigate the effects of item conversion methods on concurrent validity, the total score correct for each form was correlated with every other form. The results are shown in Table 4. There were no significant differences between the correlations of these three forms for any test.

The effects on the difficulty level of the two item conversion methods were tested by t-tests. For the Arithmetic Problems Test, the means of Form A, Form B and Form C were significantly different from each other. For the other two tests, the mean on Form A was significantly higher (easier) than on Form B (see Table 4). There were no differences in the reliability of Form A and Form C for any of the tests.

Table 4
Summary of Results from Loree Item Conversion Study

                   r_tt(a)           r_tt(b)           Mean                       Concurrent validity
Test               Form A   Form C   Form A   Form C   Form A   Form B   Form C   r_ab    r_bc    r_ac
Arithmetic         0.802    0.820    0.844    0.901    11.03     8.39     9.57    0.817   0.847   0.844
Health Knowledge   0.451    0.369    0.622    0.539    14.84     8.17    10.57    0.578   0.580   0.744
Word Meaning       0.764    0.714    0.866    0.833    28.03    14.56    19.24    0.812   0.735   0.790

Note: Statistics reported are for scores uncorrected for chance.
(a) Parallel split-half reliability.
(b) Reliability using Spearman-Brown formula.

Owens, Hanna, and Coppedge (1970) investigated the effects on reliability of multiple-choice items converted from recall form by the judgmental method, the frequency method, and the discrimination method. A 33-item geometry recall test that had been administered to 357 high school students was the source from which 17 four-choice multiple-choice questions were developed.
The same multiple-choice items in each of the three forms of the examination had exactly the same stems and the same correct responses. Only the three distractors differed from form to form. For the judgmental form of the exam, each of 13 secondary mathematics teachers supplied the three distractors they judged most appropriate for each item. The three most frequently mentioned distractors for each item were retained. The item analysis for the recall test was used to develop the other two forms of the examination. For the frequency form of the exam, the three most frequently produced incorrect answers were selected as the distractors. For the discrimination form of the exam, the three distractors selected were those that discriminated the highest.

First a parallel recall test and then the three forms of the multiple-choice examination were administered to 1875 students enrolled in high school geometry. The examinations were sequentially ordered so that every third student received the same form of the test. After the administrations, the students' examinations were divided into three equivalent groups based on recall exam scores and on which form of the multiple-choice examination the student took.

The reliability coefficients for the judgmental, frequency, and discrimination forms were .556, .620, and .614, respectively, and the concurrent validity coefficients with the recall examination were .617, .646, and .647, respectively (when corrected for attenuation they increased to .982, .973, and .979). The validity coefficients were tested for homogeneity, and no significant differences were found. Although the differences in reliabilities were not tested, the investigators concluded there was little practical difference between them.
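The corrected validity coefficients quoted above can be related to the uncorrected ones through the classical correction for attenuation, r_corrected = r_xy / sqrt(r_xx * r_yy). The reliability of the recall test is not given in this review, so the sketch below back-solves the value it would need to have if that standard formula was the one applied; this is an assumption, and the function and variable names are illustrative.

```python
import math


def disattenuate(r_xy, r_xx, r_yy):
    """Classical correction for attenuation."""
    return r_xy / math.sqrt(r_xx * r_yy)


def implied_criterion_reliability(r_xy, r_xx, r_corrected):
    """Criterion reliability that would turn r_xy into r_corrected."""
    return (r_xy / r_corrected) ** 2 / r_xx


# (form reliability, validity with recall test, corrected validity),
# as reported for the Owens, Hanna, and Coppedge (1970) study.
forms = {
    "judgmental": (0.556, 0.617, 0.982),
    "frequency": (0.620, 0.646, 0.973),
    "discrimination": (0.614, 0.647, 0.979),
}

# Back-solved recall-test reliability for each form (an inference, not a
# value reported in the text).
implied = {
    name: implied_criterion_reliability(validity, reliability, corrected)
    for name, (reliability, validity, corrected) in forms.items()
}

for name, value in implied.items():
    print(name, round(value, 3))
```

All three forms imply nearly the same recall-test reliability (about .71), which is what one would expect if a single recall reliability estimate had been used in the correction.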
Burmester and Olson (1966) used the frequency method in a study to determine if college-level natural science alternate-response items could show the same desirable item characteristics as previously administered five-choice multiple-choice items. It must be noted that the two responses available in these alternate-response items were 'Acceptable' and 'Unacceptable', and thus were actually a variation of the true-false item form.

A total of 85 multiple-choice items was selected on the basis of performance on previously administered final exams. The average difficulty, p, of these items was .57, and the average discrimination index (Flanagan) was .45. Based on the item analysis for each item, a multiple-choice item was converted to true form if the distractors were approximately evenly selected. The correct answer was simply added to the stem to make the item a true statement. If a distractor was shown to be particularly attractive, the item was converted to a false statement by adding this incorrect answer to the stem.

The resulting 37 'acceptable' (true) items and 48 'unacceptable' (false) items were administered to 110 students in the same natural science course. The KR-20 reliability for the alternate-choice items was .86. The distributions of the difficulty levels and the discrimination levels of the multiple-choice and alternate-choice items are shown in Table 5. The results show that alternate-response items are less difficult than the five-choice items, that they discriminate as well, and that they show high reliability (r_tt = .86). The investigators also found that more alternate-choice than five-choice items could be administered within a given time, thus providing for greater sampling of content areas or educational objectives.

More recently, Frisbie (1971) conducted a three-phase study to compare the reliabilities of true-false and multiple-choice high school general knowledge science and social studies tests.
Table 5
Summary of Burmester and Olson Alternate-choice Study

Item statistic                       Multiple-choice   Alternate-choice
Difficulty index p
  81-100                                   4                 27
  61-80                                   39                 42
  41-60                                   27                 12
  21-40                                   14                  4
  00-20                                    1                  0
Discrimination index (Flanagan)
  61-80                                   13                 11
  41-60                                   38                 33
  21-40                                   34                 28
  00-20                                    0                 13
Discrimination index D(a)
  61-80                                    -                  2
  41-60                                    -                 21
  21-40                                    -                 35
  00-20                                    -                 27

(a) Discrimination index D was not available for multiple-choice items.

In phase I, four-choice multiple-choice items were converted to true-false items using the judgmental and discrimination methods. For the judgmental method, five high school science teachers and five high school social studies teachers were used to judge the best distractors of multiple-choice items from a published standardized test in their respective fields. If four out of five judges agreed on the best distractor, it was used to make a false true-false item; if fewer than four judges agreed on a distractor, the item was made a true true-false item. This resulted in a social studies test of 41 false and 29 true statements and a natural science test of 45 false and 25 true statements.

For the discrimination conversion method of this phase, the original social studies and natural science multiple-choice tests were each administered to 100 students, and the discrimination index, D, was calculated for each distractor. If D was less than .20, or did not differ from the other indices by more than .09, the item was converted to true; otherwise, the distractor with the largest D was used to make the item false. The resulting 70-item true-false social studies test contained 33 true and 37 false statements; the 70-item science test also had 33 true and 37 false statements.

In phase II, the social studies and science true-false tests were administered to a sample of students to identify ambiguous items and improve them.
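The phase I discrimination rule can be written as a small decision function. The sketch below is one reading of the rule as summarized above, not Frisbie's own procedure: it assumes that the rule is applied to the distractor with the largest D, and that "did not differ from the other indices by more than .09" means the largest D must exceed every other distractor's D by more than .09. The function name, parameter names, and example values are hypothetical.

```python
def convert_item(distractor_d, min_d=0.20, min_gap=0.09):
    """Decide how a multiple-choice item becomes a true-false item.

    distractor_d: discrimination index D for each distractor.
    Returns ("false", i) when distractor i should be written into the stem,
    or ("true", None) when the correct answer should be used instead.
    """
    best = max(range(len(distractor_d)), key=lambda i: distractor_d[i])
    best_d = distractor_d[best]
    others = [d for i, d in enumerate(distractor_d) if i != best]
    # Assumed reading: the best distractor must stand clear of all others.
    stands_out = all(best_d - d > min_gap for d in others)
    if best_d < min_d or not stands_out:
        return ("true", None)
    return ("false", best)


# Hypothetical D values for the three distractors of a four-choice item.
clear_winner = convert_item([0.35, 0.10, 0.05])
weak_distractors = convert_item([0.15, 0.10, 0.05])
no_clear_winner = convert_item([0.30, 0.25, 0.05])

print(clear_winner, weak_distractors, no_clear_winner)
```

Only the first hypothetical item has a distractor that both reaches .20 and stands clear of the others, so only that item would be converted to false form under this reading.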
In phase III, each of the four tests, social studies-judgmental method (SJ), social studies-discrimination method (SD), science-judgmental method (NJ), and science-discrimination method (ND), was divided into two forms, and the original multiple-choice items were added to each test. Form A of the SJ test contained multiple-choice items 1-35 and the converted SJ true-false items 36-70. Form B contained the converted SJ true-false items 1-35 and the multiple-choice items 36-70. The other tests were arranged similarly.

A total of 1018 high school students was administered one of these eight forms. Eight minutes after the beginning of the exam, students were asked to stop and write the number of the item on which they were working. The median number of multiple-choice items attempted within this time period was 17.04, and the median number of true-false items attempted was 25.59. Thus, in this sample, students attempted three true-false items for every two multiple-choice items.

When the KR-20 reliabilities of the true-false and multiple-choice items for each form were tested using paired t-tests, the true-false items were shown to be consistently lower in reliability than the multiple-choice items. The KR-20 reliability of the true-false items converted by the judgmental method was then compared to the KR-20 reliability of the true-false items converted by the discrimination method. A paired t-test showed no significant difference between the reliability of items converted by these two methods.

Williams and Ebel (1957) used the discrimination method to convert four-choice multiple-choice Iowa Test of Educational Development vocabulary items to three- and two-choice items. The least discriminating distractor was dropped from the four-choice item to convert it to a three-choice form, and the two least discriminating distractors were dropped to convert it to a two-choice item.
Three forms of the test were developed, each containing exclusively either four-, three-, or two-choice items. These three forms were arranged in an alternate sequence and administered to all students in six four-year Iowa high schools. Students were given 30 minutes to complete as many items as possible. The score was based on the number correct. The results (see Table 6) show that there were no significant differences among the reliabilities of these three forms (using the first 85 items of each form). The difficulty level of the items decreased as the number of responses decreased; the discrimination of the items also decreased accordingly. In addition, considerably more two-choice than four-choice or three-choice items could be completed within the 30-minute time period.

Item Sequence

When examinations are administered under crowded classroom conditions, college instructors, to prevent cheating, often develop two or more forms of the examination in which the same items are arranged in different sequences. Several authorities in the measurement field contend that there exists a sequence effect (Cronbach, 1970) or serial error (Stanley, 1961) caused by these different item arrangements. Sequence effect is the discouragement a student feels as the result of failing to answer an item, and serial error is the failure to answer items that follow a particularly troublesome item. One common hypothesis is that the underlying cause of this problem is the test anxiety of the examinee (McKeachie, Pollie, & Speisman, 1955; Mandler & Sarason, 1952; and Alpert & Haber, 1960).
Table 6
Summary of Williams and Ebel Item Conversion Study

                Items      KR-20    Mean             Index    Index
Item form       finished   r_tt     (SD)             Diff     Disc(a)
Four-choice        85      .945     42.14 (16.70)    .496     .487
Three-choice       94      .941     49.29 (15.78)    .585     .471
Two-choice        129      .929     58.22 (14.01)    .684     .412

Note: Test statistics are based on the first 85 items of each test.
(a) Discrimination Index D.

If the student encounters a particularly troublesome item, the student's anxiety level increases, and this increase adversely affects subsequent performance. Minimizing test anxiety, then, helps the student to maximize test performance, and measurement texts often suggest that items be arranged from least difficult to most difficult. When different item forms are used, it is suggested that the item forms be ordered from the most simple to the most complex (e.g., true-false, matching, short answer, multiple choice, interpretative, and essay questions), and that within each item form, the items be arranged from least difficult to most difficult (Gronlund, 1977).

A number of studies of college classroom examinations have been conducted examining the effects of arranging items by difficulty level. All examinations in these studies were power tests. The earliest study was conducted by Brenner (1964), who administered a series of examinations to students in educational psychology classes over two quarters. During the first quarter, Q1, a total of 320 multiple-choice test items were written and administered over a series of four examination periods. For each period, two forms of the examination were administered. Each form contained 40 items that had been randomly arranged using a table of random numbers. The purpose of administering these items was to obtain a difficulty value (percentage correct, uncorrected for chance) and a discrimination value (point-biserial) for each item. During the subsequent quarter, Q2, the same items, or a subset thereof, were again administered to a new group of students.
Items administered during the first testing period of Q1 were administered during the first testing period of Q2. Those items administered during the second, third, and fourth testing periods of Q1 were administered during the respective second, third, and fourth testing periods of Q2.

During the first testing period of Q1 and Q2, the same examination form of 40 randomly arranged items (Form IB) was administered to check for reliability (stability) of item difficulties. Two other forms of the same items were also administered during this first testing period. Form IA contained items arranged by difficulty from easy to hard, and Form IC contained items arranged from hard to easy.

For the second testing period of Q2, two forms of a test were developed from the pool of 80 items tested during the second testing period of Q1. From this pool, Brenner selected the 10 easiest items, the 10 hardest items, and 20 items that in combination best reflected the course content and were highest in discrimination. Form IIA contained the 10 easiest items (in order of increasing difficulty), followed by 20 randomly arranged course content items and then the 10 hardest items. Form IIB contained the 10 hardest items (in decreasing order of difficulty), followed by 20 randomly arranged course content items and then the 10 easiest items.

For the third testing period of Q2, two forms of an examination were constructed from items of only one of the examinations administered during Q1. Form IIIA contained items arranged by difficulty from easy to hard; Form IIIB contained items arranged from hard to easy.

From the 80 items available for the fourth testing period of Q2, 40 items were selected in the same manner as those for Forms IIA and IIB. However, in Form IVA, all the items were arranged from easy to hard, and in Form IVB all the items were arranged from hard to easy.
Brenner calculated the following test statistics for each form of each examination: mean score correct on the examination, the KR-8 reliability coefficient, and the mean discrimination (average point-biserial correlation between item and total test score). Significant differences between average difficulty, average discrimination, and reliability for each pair of examination forms for each testing period were determined using the t-test. The results are shown in Table 7.

Table 7
Results of Brenner Item Arrangement Study

             Mean score
Exam form    (Difficulty)   Reliability   Discrimination(a)
IA           21.18          .578          .220
IB           21.04          .553          .232
IC           20.90          .598          .218
IIA          20.04          .753          .283
IIB          23.93          .674          .250
IIIA         26.14          .778          .309
IIIB         26.33          .805          .326
IVA          23.69          .836          .284
IVB          24.17          .747          .289

(a) Average point-biserial correlation between item and total test score.
*p ≤ .02 using a paired t-test.

Clearly, difficulty-based item arrangement did not affect the mean scores of the examination nor the reliability, and for the most part, did not affect discrimination. The significant difference found in discrimination of Form IIA and IIB was not explained by Brenner. Brenner did not report if the item difficulties of Form IB were reliably replicated from Q1 to Q2; however, research by Carter (1942), Davis (1951), and Gibbons (1940) has shown that item difficulty values are highly reliable across administrations.

Monk and Stallings (1970) analyzed the data available on 11 tests administered to students in a college-level basic geography course. Each test had two forms, and each was administered at some time between 1965 and 1968. The number of items in each test varied from as few as 80 to as many as 200. The objective of producing two forms of each test was to reduce the likelihood of two students in adjacent seats working on the same question. The same items were used in both forms of each test; however, the patterns of item arrangement differed in each form.
In one form, the test items were grouped by item form (true-false, matching and multiple-choice). In the second form, the arrangement of the item-form groups was altered and the order of the individual items within each group was changed. The significance of the differences between the mean scores for the two forms of a test was assessed by using the t-test. The number of items, means, standard deviations, and KR-21 reliabilities for each form are shown in Table 8. Significant differences were found between the mean scores of only two pairs of tests. Monk and Stallings pointed out, however, that one of these significant pairs of tests (7 and 8) had been corrected with a key containing ten errors. There were only slight variations in the reliabilities of each pair of tests.

Table 8
Results of the Monk and Stallings Item Arrangement Study

        Number of                                   KR-21
Test    test items   Mean score    SD       n       r_tt
 1        100         69.66       13.44     89      .892
 2        100         72.10       11.10     77      .845
 3        100         73.11       16.56     10      .938
 4        100         75.21       11.14     90      .858
 5        100         68.46*      18.15     94      .944
 6        100         73.49*      10.74     84      .840
 7        100         69.97**     10.92    124      .826
 8        100         62.62**      9.14    132      .727
 9        100         60.28       12.46    123      .854
10        100         62.21       11.74    121      .838
11         80         58.75        8.67     68      .802
12         80         58.61        8.21     71      .707
13         80         52.36       10.12     66      .834
14         80         50.55        9.68     69      .811
15        200        134.43       25.07     63      .935
16        200        129.13       23.21     63      .920
17         80         49.70        8.19    138      .728
18         80         48.53        8.29    118      .731
19         80         48.74        9.86    118      .814
20         80         48.29        9.86    121      .813
21        200        130.02       23.36    120      .921
22        200        129.93       22.26    118      .916

*p ≤ .01
**p ≤ .001

Huck and Bowers (1972) conducted two studies on the effects of item sequence on p values (proportion of examinees who answered an item correctly). In the first study, 10 forms of a 60-item final examination were administered to 120 college students enrolled in an introductory psychology course. The items were the same on each form; the only difference was in the item order. The item order on each form was such that six balanced Latin Squares were formed. A special ANOVA procedure (Cochran & Cox, 1957, pp. 133-139) was used to test for residual effects of item order on the p values. None of the F values for the six Latin Square designs was significant.

In their second study, Huck and Bowers administered six forms of a 50-item midterm to 162 students in the same introductory psychology course. The items were arranged to form six Latin Square designs. The special ANOVA procedure again showed no significant sequence effects for any of the Latin Square designs.

A recent study was conducted by Plake (1980). Three forms of a 96-item multiple-choice examination were administered to students in a course in psychiatric nursing. The items for this examination were selected from a pool of items from a test that had already been administered and for which item difficulty and reliability statistics were available. The item difficulties ranged from .20 to .96, and the KR-20 reliability of the originally administered test was .85. On the first form of the examination, the 96 items were placed in difficulty order from easy to hard. Item order for the second form was random, and the order for the third form was spiral cyclical. In the spiral cyclical form, every four items were arranged from easy to hard; thus, the cycle of easy to hard was repeated every fifth item.
One-half of the examinations contained directions that explained the respective item ordering and that gave test-taking strategies for that item order. The other one-half did not contain these directions. A set of questions was added to the end of the examination which asked students to rate, on a 1-5 scale, the fairness of the test and the perceived difficulty of the test, and to estimate their performance on the test.

A 3 X 2 multivariate ANOVA was performed on the dependent measures of total score on the examination, rated fairness, perceived difficulty, and performance estimates. There were no significant interaction effects for item order and directions, nor were there significant main effects for item order or for directions.

Klosner and Gellman (1973) administered three forms of a 75-item multiple-choice final examination to students enrolled in an educational measurement course. Each form contained the same items. Form S contained items arranged in the order the subject matter was presented in the course. Form S X D contained items grouped by subject matter, but within each subject-matter topic the items were arranged by difficulty. Form D contained items ordered only by difficulty level, from easy to hard. To match for ability, students were ranked according to their midterm grade and divided into triads. Students within each triad were randomly assigned to take one of the three forms of the examination. The means, standard deviations and reliability of each of the three test forms are shown in Table 9.

Using the median midterm examination grade, students were then split into a high achieving group and a low achieving group. A 2 X 3 ANOVA of the test scores showed only achievement grouping to be significantly related to total test score. Neither the main effect of item order, nor the interaction of item order with achievement level was significant.
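The KR-21 reliabilities reported for the Monk and Stallings study (Table 8) and the Klosner and Gellman study (Table 9) depend only on the number of items, the mean, and the standard deviation, so they can be recomputed directly. The sketch below uses the standard KR-21 formula, r = (k / (k - 1)) * (1 - M(k - M) / (k * SD^2)), on a few rows as read from those tables; the function and variable names are illustrative.

```python
def kr21(k, mean, sd):
    """Kuder-Richardson formula 21 from test length, mean, and SD."""
    return (k / (k - 1)) * (1 - mean * (k - mean) / (k * sd ** 2))


# (label, number of items, mean, SD, reported KR-21)
rows = [
    ("Table 8, test 1", 100, 69.66, 13.44, 0.892),
    ("Table 8, test 2", 100, 72.10, 11.10, 0.845),
    ("Table 8, test 8", 100, 62.62, 9.14, 0.727),
    ("Table 9, Form S", 75, 60.06, 5.98, 0.675),
    ("Table 9, Form S X D", 75, 61.67, 5.10, 0.586),
    ("Table 9, Form D", 75, 59.72, 6.08, 0.680),
]

# Absolute difference between the recomputed and the reported coefficient.
differences = {
    label: abs(kr21(k, mean, sd) - reported)
    for label, k, mean, sd, reported in rows
}

for label, diff in differences.items():
    print(f"{label}: |recomputed - reported| = {diff:.4f}")
```

For these rows the recomputed coefficients agree with the tabled values to within about .001, which is consistent with rounding of the reported means and standard deviations.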
Table 9
Results of the Klosner and Gellman Item Order Study

                                                        KR-21
Test form                             n     Mean    SD    r_tt
Form S (Subject)                      18    60.06   5.98  .675
Form S X D (Subject and Difficulty)   18    61.67   5.10  .586
Form D (Difficulty)                   18    59.72   6.08  .680

Smouse and Munz (1968) developed three forms of a 100-item multiple-choice final examination for a course in introductory psychology. Each form contained the same items but differed in the difficulty-based order of items. In the respective forms, the item order was from easy to hard, hard to easy, and randomly mixed. These three forms were administered to two randomly assigned groups: a high test-taking anxiety group, and a normal test-taking anxiety group. For the normal test-taking anxiety group, the usual test-taking atmosphere was maintained; however, for the high test-taking anxiety group:

    The anxiety-provoking treatment consisted of informing the Ss that because of "widespread cheating" on previous examinations, should their individual performances drop significantly below previous examination scores, they would have to take a special oral examination administered by the coordinator of the introductory psychology sections. Further, the examination was administered by a professor rather than the graduate instructor and was proctored by assistants who continually circulated among the Ss. (Smouse & Munz, 1968, p. 182)

Stapled at the end of each examination was the Multiple Affect Adjective Check List (MAACL) developed by Zuckerman (1960). The purpose of administering the MAACL was to measure the amount of anxiety felt in each test situation.

A 2 X 3 ANOVA for unequal n's was performed on the total number of items answered correctly and on the MAACL scores. The item arrangements, the anxiety treatments, and the interactions between these two independent variables were not significant.
In a subsequent study, Munz and Smouse (1968) administered the same three forms of the 100-item multiple-choice final examination to another group of students enrolled in the introductory psychology course. Prior to the examination, each student had completed the Achievement Anxiety Test (AAT) developed by Alpert and Haber (1960) and was placed into one of four achievement anxiety type groups based on AAT scores: Facilitators, Debilitators, Non-affecteds, and High-affecteds. According to Alpert and Haber, a facilitator is an individual whose test performance is facilitated by the anxiety-provoking situation; the debilitator is an individual whose test performance is depressed by the anxiety-provoking situation; the non-affected is an individual whose test performance is not affected by the anxiety-provoking situation; and the high-affected is an individual who has the potential to be both a facilitator and a debilitator.

A 3 X 4 ANOVA was performed on mean score correct on the final examination. No significant differences were found for the main effects of item order and achievement-anxiety type. A significant but unclear interaction between anxiety types and the random and easy-to-hard arrangements was found.

Marso (1970) administered three forms of a 103-item multiple-choice comprehensive final exam to students enrolled in several sections of an introductory educational psychology course. One form contained the items in random order. The other two forms contained the items grouped by course content. One form contained the items grouped in the order they were presented during the course; the other form contained the items grouped in the opposite order to that in which they were presented during the course. Several days prior to the final examination, each student was given a set of test anxiety scales developed by Carrier and Jewel (1966), which was used to classify the students as to level of test-taking anxiety.
Students were divided into an upper, a middle, and a lower third based on the total anxiety scale score for the sample. A 3 X 3 ANOVA for unequal n's was performed on the final examination scores and on the total time in minutes taken to complete the examination. Only the main effect of test anxiety was significant for final examination scores, showing students with high anxiety to perform more poorly than middle or low anxiety students. Neither main effects nor interactions between the main effects were significant for the number of minutes taken to complete the exam.

Chapter Summary

The first series of studies reviewed in this chapter concerned the effects of item form on the item characteristics of difficulty, reliability, and/or validity. When the two-response type multiple-choice item form has been compared to the true-false item form, the two-response type item has consistently been found to be less difficult than the true-false item. When the reliabilities of these two item forms are compared, the results are less consistent: of the three studies, one study showed the two-response item type to be more reliable than the true-false item (Ruch & Stoddard, 1925), one study showed no practical difference (Ebel, 1982), and one study showed the two-response item type to be less reliable than the true-false item (Charles, 1926). From the results of concurrent validity studies of the two-choice and the true-false items with recall items, Ruch and Stoddard (1925) suggest that answering the two-choice, and particularly the true-false item, may require different mental processes.

The next series of studies reviewed concerned the methods of converting items from one item form to another item form, and the possible effect of the conversion method used on the item characteristics of difficulty, discrimination, reliability, and/or validity.
When items were converted from recall to multiple-choice form by use of the frequency and judgmental methods (Loree, 1948), or by these two methods and the discrimination method (Owens, Hanna, and Coppedge, 1970), no significant differences were found in the reliability of the multiple-choice items, or in the concurrent validity of the recall or multiple-choice items. The results were different for the difficulty of the multiple-choice items, however. Items converted by the judgmental method were found to be significantly easier than items converted by the frequency method (Loree, 1948).

When items were converted from multiple-choice form to true-false type item form by use of the frequency method (Burmester and Olson, 1966), the true-false form was found to be less difficult, equal in discrimination to its original five-choice multiple-choice form, and highly reliable. When the judgmental and discrimination methods were used to convert multiple-choice items to true-false form (Frisbie, 1971), no differences were found in the reliability of the true-false items converted by these methods. It should be noted, however, that the true-false items were significantly lower in reliability than the original four-choice items.

When four-choice multiple-choice items were converted to three-choice and to two-choice form by use of the discrimination method (Williams & Ebel, 1957), no significant differences were found in the reliability of the converted items. It was found, however, that the difficulty and discrimination of the items decreased as their number of responses decreased.

The last series of studies reviewed concerned the effects of item arrangement, or item sequence, on the test characteristics of power tests.
It has been hypothesized that, because of anxiety caused by the failure to answer an item (sequence effect) and/or the failure to answer items that follow a troublesome item (serial error), students perform differently on tests in which the items are arranged differently. Several studies have examined the effects of item order on difficulty, discrimination, and/or reliability without controlling for anxiety of students. Monk and Stallings (1970) compared two forms of each of eleven tests on difficulty and reliability. The two forms of each test differed only in the arrangement of items. No differences in reliability were found, and only two of the eleven pairs of tests differed significantly in difficulty. The effects of arranging items randomly or by various degrees of difficulty were examined by several researchers (Brenner, 1964; Huck & Bowers, 1972; Klosner & Gellman, 1973; and Plake, 1980). No significant differences were found in difficulty of the tests administered, no matter what item arrangement was used. In addition, Brenner found no differences in the reliability of nine test forms, each of which contained a different item order.

The remaining studies reviewed concerned the effects of item order on test difficulty and also included some measure of student anxiety in the research design. In one study (Smouse & Munz, 1968) no significant differences were found for student scores on tests containing items ordered differently by difficulty, nor in the scores of high and low anxiety students. In a subsequent study (Munz & Smouse, 1968), a significant but unclear interaction was found between anxiety types and item order for total test scores. In a later study (Marso, 1970), significant differences were found in test scores of the anxiety groups but not for item order. Thus, although anxiety may play a part in student performance on a test, there is little evidence that this anxiety is related in any way to item order.
CHAPTER III

DESIGN AND PROCEDURE

Introduction

This research study was conducted in three parts. Part I was designed to examine the difficulty and discrimination levels of alternate-choice and true-false item forms; the reliabilities of the true-false subtest scores and the alternate-choice subtest scores; and the criterion-related validities of true-false and alternate-choice items as measured by the correlation between each respective subtest score and final grades. Part II was designed to investigate the practicability of judging the best version (TFt or TFf) of the true-false item and of determining whether the alternate-choice version with the correct answer first (ACci) or with the distractor first (ACic) is the best form of the item. Part III was designed to examine the effects of placing the correct answer first (ACci) or placing the incorrect answer first (ACic) in the alternate-choice responses on difficulty, discrimination, reliability, and criterion-related validity.

Part I

Sample. The students who participated in Part I of this research were from seven sections of a natural science course offered at Michigan State University in the fall quarter of the 1983-84 academic year. The lectures of the seven sections were team-taught by the same two professors and, although the lab sessions were not team-taught, the same material was covered during the quarter. The sample of students who participated did so because their instructors agreed to include the alternate-choice and true-false items on their midterm and final examinations. These students cannot be considered a random sample, since they were not chosen in such a way that each student taking this natural science course in the fall quarter had an equal and independent probability of being selected.
It can be argued, however, that these seven sections can be considered representative of the population of students taking this course in the fall quarter, particularly in regard to the cognitive skills required to master the material taught and to take the course examinations. This argument is based on the manner in which students select the sections of this course. Briefly, all lower division students are required to enroll for a specific number of general education credits in the area of biological and mathematical sciences. The majority of incoming freshman students enroll in this natural science course, Natural Science 115, in the fall quarter to fulfill part of their minimum requirements in this area (according to personal communication with the Assistant Provost for Undergraduate Education, July 1, 1984). As a result, there are approximately 36 sections of this course offered each fall quarter. Most often the section chosen by the student is based either on the time it is offered or on the space available in a given section. It is rare that these incoming freshmen choose a course based on the instructor teaching it. There are many other criteria or combinations of criteria used by students. Thus the self-selection process in this course tends to be multidimensional and non-systematic in nature. Therefore, although students in the seven sections chosen for this study were not randomly sampled, the self-selection process used by students is not likely to result in systematic differences among sections in relation to the variables relevant to this study.

Of the 255 college freshman students enrolled in these sections, 247 took both the midterm (Test I) and the final examination (Test II). Both tests contained the alternate-choice and true-false questions. Table 10 shows the number of students who took Form A and Form B of each test.
To achieve a balanced research design, one student was randomly eliminated from the group who took Form B of Test I and Form A of Test II.

Table 10

Number of Students Who Took Form A and Form B of Tests I and II

                      Test II
Test I          Form A      Form B
Form A            55          68
Form B            68a         55

aOne student was randomly eliminated from this group to achieve a balanced design.

Materials. The examination items used in this study were drawn from an item pool of approximately 1600 questions. All items were written to test the knowledge, understanding, and application of scientific principles and methodology of biological science and to test philosophy, historical perspectives, and social implications of the biological sciences. Approximately 400 of these items were applicable to the genetics and human reproduction emphasis of the course.

All items in the item pool were developed by faculty teaching natural science courses at Michigan State University. Some items were written 25 years ago; others were written the term preceding this study. (Some of the items have been used innumerable times, some only a few times.) All items in the item pool had been administered to students at least once and had shown some measure of difficulty and a positive discrimination. Item statistics were not available, as they were neither stored with the items nor retained in files. For security reasons, none of the items in the natural science item pool are keyed with the correct answer. Some of the items that are applicable to genetics and human reproduction are in matching form or in a key-type multiple-choice form (the student matches items from a four- or five-item key to subsequently listed statements). However, the majority of the 400 potential items are in four- or five-choice form.
The following are typical of the items found in the item pool applicable to the area of genetics and human reproduction:

EXAMPLE ONE
The reason(s) why it required 150 years to develop and clarify the cell principle (theory) was
a) poor equipment for studying the cell.
b) poor communication between scientists.
c) "getting the idea" of a unified structure and function of the cell.
d) a and b above.
e) a, b, and c above.*

EXAMPLE TWO
Amino-acids are carried to ribosomes by
a) messenger RNA.
b) transfer RNA.*
c) proteins.
d) cytoplasmic DNA.
e) nuclear DNA.

EXAMPLE THREE
The most effective way of making human chromosome counts is
a) by examining the egg and sperm.
b) by utilizing cultured and treated red blood cells.
c) by utilizing cultured and treated white blood cells.*
d) by examining liver tissue.
e) none of these are effective.

EXAMPLE FOUR
In a DNA molecule, one strand contains the following sequence of bases: A-G-A-T-C. Which of the following represents the complementary sequence on the other strand?
a) C-C-T-A-G
b) A-G-A-T-C
c) T-C-T-A-G*
d) U-C-U-A-G
e) none of these

Item Conversion Procedures. The process of converting midterm examination items from multiple-choice to alternate-choice and true-false forms began shortly after the start of fall quarter. The item conversion process for the final examination began shortly after the administration of the midterm examination. The process used for both examinations was identical.

The senior instructor initially selected 65 items that had the potential of being included on Test I and 100 items that had the potential of being included on Test II. For each item, the senior instructor indicated the correct answer and the distractor judged to be the most reasonable answer for an uninformed student to give. Only an item in which the correct response included a single answer or element was considered for conversion to alternate-choice and true-false form.
An item that required selection of the answer from a key (key-type multiple-choice item) or an item in which the correct response contained more than one answer or element (e.g., all the above, a and b above) was considered for use only in its original multiple-choice form. This was because the extensive revision required to convert such a complex item to alternate-choice or true-false form might change the content of the item. EXAMPLE ONE is such a complex question. A total of 26 items for the midterm and 27 items for the final examination met the criteria for conversion to alternate-choice and true-false form.

In the alternate-choice form suggested by Ebel (1982), the two responses can be placed at the very end or at any other location within the stem. This freedom of placement permitted the multiple-choice items to be converted to alternate-choice form in one of three ways. First, if the stem of an item was a statement, then the stem was kept intact and only the correct answer and the distractor indicated by the senior instructor were joined to the end of the stem. EXAMPLE TWO is such a question, and in alternate-choice form it reads:

Amino-acids are carried to ribosomes by a) messenger RNA b) transfer RNA*.

Second, if the item was a statement and contained duplicate wordings in both the distractor and the correct answer, the duplicate wordings were made part of the stem and only the word or words that made the statement correct or incorrect became the responses. EXAMPLE THREE contains such duplicate wordings ("by utilizing cultured and treated" and "blood cells"); in alternate-choice form it reads:

The most effective way of making human chromosome counts is by utilizing cultured and treated a) red b) white* blood cells.

Third, when the stem of the original item was in question form, some rewriting of the stem was necessary.
EXAMPLE FOUR was in question form, and to convert it to alternate-choice form, the stem was rewritten as follows:

In a DNA molecule, one strand contains the following sequence of bases: A-G-A-T-C. The complementary sequence on the other strand is a) U-C-U-A-G b) T-C-T-A-G*

A table of random numbers was used to determine whether the correct answer or the distractor would be listed first.

Items were converted from alternate-choice to true-false form by randomly eliminating either the correct response or the distractor from the alternate-choice item. Research (Frisbie, 1971; Oosterhof & Glasnapp, 1974) has shown that true-false items in false form tend to have better discriminating ability than items in true form. Ebel (1979) suggests including more than one-half, and in some cases up to 67 percent, false-form items in a true-false test. Of the 10 true-false items included in each form of each examination, it was decided to make 60 percent (n = 6) of the items false. Within this parameter, a table of random numbers was used to determine whether an item was to be true or false. The items in EXAMPLES TWO, THREE, and FOUR in true-false form read as follows:

EXAMPLE TWO
Amino-acids are carried to ribosomes by messenger RNA. (F)

EXAMPLE THREE
The most effective way of making human chromosome counts is by utilizing cultured and treated white blood cells. (T)

EXAMPLE FOUR
In a DNA molecule, one strand contains the following sequence of bases: A-G-A-T-C. The complementary sequence on the other strand is T-C-T-A-G. (T)

Test Form Development. Two forms of the examination were developed for both the midterm and the final examinations. For each item on an examination, the alternate-choice version was included on one form and its content-equivalent true-false version on the other form. Before this could be done, however, it was necessary to ensure that the items were, in fact, equivalent in content.
Each item in its original multiple-choice form, alternate-choice form, and true-false form was submitted to two measurement experts to be judged for equivalence of content. One of the experts was a professor in Educational Measurement at Michigan State University and a nationally recognized expert in his field. The second measurement expert was a recent Ph.D. graduate in Educational Psychology who had extensive experience in test item development. The judges were asked to compare the alternate-choice, the true-false, and the original multiple-choice item forms to each other to determine whether the questions measured the same item content. They also were asked to write any comments they had on the sheet containing the item.

All items were judged equivalent in content. There were several items, however, that were identified by the judges as having construction flaws. It was decided to eliminate these flawed items from further consideration for use on the examination. From the remaining items, the 20 clearest, non-redundant items were selected for inclusion on the examination. Appendix A contains the 20 items for each test. Each item is in multiple-choice, alternate-choice, and true-false form.

It was anticipated that few students, if any, had encountered the alternate-choice item in any examination prior to the Natural Science 115 midterm examination. Therefore, it was decided to alternate the unfamiliar alternate-choice items with the more familiar true-false items. To accomplish this, the 20 alternate-choice items were randomly assigned to groups of five items each. It must be noted that on Test I it was necessary to keep questions 1, 2, and 3 together because of their relation to the descriptive paragraph preceding the questions; within this grouping, the items were randomly assigned a sequence. It was also necessary to keep questions 7 and 8 in order, as question 8 referred to question 7.
These two items were randomly assigned as a pair in the Test I sequence. The other items were randomly assigned a sequence within each group. The content-equivalent true-false items were put in the same sequence as their alternate-choice counterparts. Two groups of alternate-choice items were then randomly assigned to Form A and their true-false equivalents to Form B, and vice versa. These groups of alternate-choice and true-false items were arranged in two ways: on Form A, the arrangement was alternate-choice, true-false, alternate-choice, true-false; on Form B, the arrangement was true-false, alternate-choice, true-false, alternate-choice. Thus, alternate-choice items on one form had their content-equivalent true-false items in the same respective position and sequence on the other form. The arrangement of the items on each form for Test I and Test II is shown in Table 11.

The sequenced items were returned to the senior instructor, who added to them a subset of 22 multiple-choice items for Test I and 65 multiple-choice items for Test II. The multiple-choice items were arranged in two different sequences by the senior instructor; one sequence was randomly assigned to Form A and the other to Form B (see Table 11). The four- and five-choice multiple-choice items were placed last on each form to reduce the advantage of guessing for students who might have felt rushed near the end of the examination, even though the examination was a power test.

Procedure. Prior to the administration of the midterm (Test I), the University Committee on Research Involving Human Subjects (UCRIHS) gave permission for the conduct of this research project (see Appendix B). For both Test I and Test II, Forms A and B were arranged in alternating sequence and administered to students assembled in the large lecture hall regularly used for lectures and examinations.
The purpose of sequencing the forms was to obtain randomly equivalent groups and to discourage students from copying the answers of those sitting nearby.

Table 11

Positions of Items on Test I and Test II

[Table body not recoverable from the source. Columns: Test I Form A, Test I Form B, Test II Form A, Test II Form B. Rows: the item occupying each position on the form.]

Where: ACi is the alternate-choice version of item i.
TFi is the true-false version of item i.
MCi is the same four- or five-choice multiple-choice item on Form A and Form B.

Page one of each test contained instructions for taking the examination and was the same for both forms of Tests I and II (see Appendix C). Students were given one hour to complete the 42-item Test I and two hours to complete the 85-item Test II. Because of this generous time allotment, both Tests I and II were considered power tests.

Students marked their responses to each question on machine-scorable answer sheets. The examinations and the machine-scorable answer sheets were collected by the two instructors, separated as to Form A and Form B, and machine scored by the Michigan State University Scoring Office. The midterm and the final examination scores were weighted and merged with other weighted test and quiz scores to form a total course score for each student. This total course score was used for grade assignment to students. All information identifying the student was removed by the Director of the Scoring Office before allowing this researcher access to the data.
The data were entered into the CDC 6000 version of SPSS (Statistical Package for the Social Sciences) to rearrange the items in Form B to the same sequence as on Form A, to recode responses as correct answer = 1 and incorrect answer = 0, and to perform the necessary data analysis.

Part II

Participants. The senior instructor of the Natural Science 115 sections used in this study and a departmental collaborator who was well versed in measurement methodology were asked to judge the better version of each alternate-choice and each true-false item. The senior instructor has been teaching this course for approximately 20 years; the collaborator, now retired, had taught the course for more than 30 years.

Materials. Only the alternate-choice and true-false items administered in Tests I and II were used in this part of the study. Each of the alternate-choice and true-false items on Form A and Form B of these tests was converted to two versions. The alternate-choice items were converted to correct-answer-first (ACci) and incorrect-answer-first (ACic) forms, and the true-false items to true form (TFt) and false form (TFf).

Procedure. The senior instructor and departmental collaborator were given a packet for each form of each test that contained both versions of each item listed on a separate page, an instruction sheet, and a recording sheet (see Appendix D). The judges were asked to choose, from the two versions of each item, the one version that in their estimation would simultaneously best maximize the chances of a correct answer being given by a student who knows the material and an incorrect answer being given by an uninformed student. The judges independently recorded their choices on the recording sheets and returned the packets to this investigator. The investigator then tallied the percent of agreement.

Part III

Sample.
The students who participated in Part III of this study were from three sections of Natural Science 115, all of which were team-taught by the same two professors in the spring quarter of the 1983-84 academic year. Only the senior instructor had been involved in Part I of this study. The students who participated did so because the instructors agreed to include the two versions of the alternate-choice items on their final examination. The participating students constituted the population of all students taking Natural Science 115 in the spring quarter of the 1983-84 academic year. One student was repeating the course and had participated in Part I. A total of 102 students took the final examination (Test III): 51 students took Form A and 51 students took Form B.

Materials. The items used in Part III consisted of a stratified random sample of 20 of the 38 alternate-choice items administered in Tests I and II. Two of the 40 items administered in these tests were excluded from Part III because the material they tested had not been taught in the spring quarter. The discrimination index was used as the stratifying variable, and the 38 alternate-choice examination items were arranged from lowest to highest discriminating ability, then grouped into strata. Each stratum contained a spread of .10 of these indices, with the exception of the lowest stratum, which contained a spread of .13. A 20/38 or .53 proportional sample of items was selected from each stratum. The distribution of these strata, the number of items in each stratum, and the number selected are shown in Table 12.

Test Form Development. Two forms of the final examination (Test III) were developed. Of the 20 items selected for inclusion, 10 were randomly assigned to Form A in their ACci version, with their ACic versions assigned to Form B; the remaining 10 were assigned to Form A in their ACic version, with their ACci versions assigned to Form B.
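The proportional allocation and the random assignment of item versions to forms described above can be sketched in a few lines of Python. This is only an illustration, not the procedure actually used (the study reports a stratified random sample without describing the mechanics): the stratum sizes are taken from Table 12, while the rounding rule, the function names, the item numbers 31 to 50, and the random seed are all assumptions made for the example.

```python
import random

# Stratum sizes as reported in Table 12, lowest to highest discrimination stratum.
STRATUM_SIZES = [7, 12, 6, 6, 4, 3]
SAMPLE_SIZE = 20


def proportional_allocation(sizes, sample_size):
    """Number of items to draw from each stratum.

    Assumption: each proportional share is simply rounded to the nearest
    integer; the study does not state which rounding rule was applied.
    """
    total = sum(sizes)
    return [round(n * sample_size / total) for n in sizes]


def assign_versions(item_ids, rng):
    """Randomly give half of the items the ACci version on Form A and the
    other half the ACic version; Form B always gets the opposite version."""
    shuffled = list(item_ids)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    form_a = {item: "ci" for item in shuffled[:half]}
    form_a.update({item: "ic" for item in shuffled[half:]})
    form_b = {item: ("ic" if v == "ci" else "ci") for item, v in form_a.items()}
    return form_a, form_b


allocation = proportional_allocation(STRATUM_SIZES, SAMPLE_SIZE)
print(allocation, sum(allocation))  # -> [4, 6, 3, 3, 2, 2] 20

# Hypothetical item numbers and seed, used only to make the sketch repeatable.
form_a, form_b = assign_versions(range(31, 51), random.Random(0))
```

With simple rounding, the 20/38 proportion happens to reproduce the "Number Selected" column of Table 12 exactly and sums to 20, so no further adjustment is needed in this case.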
The respective versions of the items were arranged in the same sequence on each form of the examination and returned to the senior instructor for the inclusion of 46 complementary four- and five-choice multiple-choice items selected from the item pool. These 46 items were arranged in two different sequences; one sequence was assigned to Form A, the other to Form B.

Table 12

Strata, Number in Each Stratum, and Results of Random Sample of Items

Discrimination Index     Number in Stratum     Number Selected
  -.03 - .09                     7                    4
   .10 - .19                    12                    6
   .20 - .29                     6                    3
   .30 - .39                     6                    3
   .40 - .49                     4                    2
   .50 - .59                     3                    2
  Total                         38                   20

Note: Proportion of items selected from each stratum was 20/38 or .53.

To prevent students sitting next to each other from working on the same alternate-choice item at the same time, the 20 alternate-choice items were embedded in the examination and were assigned as items 31 to 50 on each form of the examination. Thus, item 31 on Form A read:

In scientific methodology, prediction means nearly the same as a) expectancy* b) interpretation of data.

On Form B, item 31 read:

In scientific methodology, prediction means nearly the same as a) interpretation of data b) expectancy*.

The arrangement of all items on Forms A and B of the final examination (Test III) is presented in Table 13. The specific arrangement of the alternate-choice item versions is shown in Table 14.

Procedure. Forms A and B of Test III were arranged in a regular sequence and administered to the students assembled in the large lecture hall regularly used for lectures and examinations. The forms were alternately ordered to obtain randomly equivalent groups and to discourage the copying of answers from those sitting on either side of the student. Oral instructions were given regarding the taking of the examination. Students were allowed two hours to complete the 66 items. Students marked their responses to each question on machine-scorable answer sheets.
The examinations and the answer sheets were collected by the instructors, separated as to Form A and Form B, and machine scored by the Michigan State University Scoring Office. The Test III score was weighted and merged with the other weighted quiz scores to produce a total weighted course score for each student. These total weighted scores were used for course grades. All information identifying a student was removed by the Director of the Scoring Office prior to allowing this researcher access to the data. The data were entered into the CDC 6000 version of SPSS to rearrange the items in Form B to the same sequence as on Form A, to recode responses as correct answer = 1 and incorrect answer = 0, and for analysis of the data.

Table 13

Position of Items on Test III

          Test III
Form A          Form B
MC1             [illegible]
 .               .
MC16            MC54
MC17            MC29
MC18            MC30
MC19            MC55
 .               .
MC30            MC66
AC31            AC31
 .               .
AC50            AC50
MC51            MC14
MC52            MC15
MC53            MC1
 .               .
MC65            MC13
MC66            MC16

Where: ACi is one of the versions of alternate-choice item i.
MCi is the same four- or five-choice multiple-choice item on Form A and Form B.

Table 14

Position of Alternate-Choice Versions on Test III

          Test III
Form A          Form B
AC31ci          AC31ic
AC32ci          AC32ic
AC33ic          AC33ci
AC34ci          AC34ic
AC35ic          AC35ci
AC36ic          AC36ci
AC37ci          AC37ic
AC38ic          AC38ci
AC39ci          AC39ic
AC40ci          AC40ic
AC41ci          AC41ic
AC42ci          AC42ic
AC43ci          AC43ic
AC44ic          AC44ci
AC45ci          AC45ic
AC46ic          AC46ci
AC47ic          AC47ci
AC48ic          AC48ci
AC49ic          AC49ci
AC50ic          AC50ci

Where: ACci has the correct answer listed first in alternate-choice item i.
ACic has the incorrect answer listed first in alternate-choice item i.

Hypotheses

The major hypotheses stated in Chapter I are restated in operational terms in this chapter:

Part I

H1: The mean score of the alternate-choice items will be significantly greater than the mean score of the content-equivalent true-false items when tested by the t test. Alpha was preset at .05.
H2: The item-total point-biserial (r_pbis) correlations of the alternate-choice items will be significantly higher than the item-total point-biserial (r_pbis) correlations of the content-equivalent true-false items when tested by the sign test. Alpha was preset at .05.

H3: The KR-20 reliability coefficient of the alternate-choice items will be greater than the KR-20 reliability coefficient of the true-false items. The Feldt Test for Equality of Two KR-20 Reliabilities was used to test this hypothesis. Alpha was preset at .05.

H4: Criterion-related validity of the alternate-choice items, as defined by the product-moment correlation between the alternate-choice total scores and the criterion total weighted scores (with the alternate-choice scores removed from the criterion), will be greater than the criterion-related validity of the true-false items, as defined by the product-moment correlation between true-false total scores and the criterion total weighted scores (with the true-false scores removed from the criterion). The correlations were transformed to z_r scores, and the z-test statistic for two independent correlations (Glass & Stanley, 1970, pp. 313-314) was used to test the differences. Alpha was preset at .05.

Part II

H1: Agreement between two departmental colleagues' judgments of the better version of an alternate-choice item will be no better than chance (50 percent). In this study, the best form was defined as the form that could best maximize both the choice of a correct answer by an informed student and the choice of an incorrect answer by an uninformed student.

H2: Agreement between two departmental colleagues' judgments of the better version of a true-false item will be no better than chance (50 percent). Again, the best form was defined as the form that could best maximize both the choice of a correct answer by an informed student and the choice of an incorrect answer by an uninformed student.
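Two of the tests named in the Part I hypotheses, the sign test (H2) and the z test for two independent correlations (H4), are simple enough to sketch directly. The Python below is an illustration only; the study's analyses were run in SPSS. The function names and the input values in the usage lines are hypothetical, the sign test is written as one-sided because H2 is directional, and ties between paired coefficients are assumed not to occur.

```python
import math


def sign_test_upper_p(successes, n):
    """One-sided sign test: P(X >= successes) for X ~ Binomial(n, 0.5).

    `successes` is the number of item pairs in which the first form's
    coefficient exceeds the second's; ties are assumed absent.
    """
    tail = sum(math.comb(n, k) for k in range(successes, n + 1))
    return tail / 2 ** n


def fisher_z_independent(r1, n1, r2, n2):
    """z statistic for the difference between two independent correlations,
    using Fisher's r-to-z transformation (atanh)."""
    z1, z2 = math.atanh(r1), math.atanh(r2)
    standard_error = math.sqrt(1.0 / (n1 - 3) + 1.0 / (n2 - 3))
    return (z1 - z2) / standard_error


# Hypothetical inputs, chosen only to show the calls.
print(sign_test_upper_p(15, 20))
print(fisher_z_independent(0.60, 123, 0.45, 123))
```

The resulting z would be compared with the standard normal critical value for the chosen alpha, and the sign-test probability with alpha directly.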
Part III

H1: The mean score of the ACci items will not differ from the mean score of the ACic items when tested by the paired t test. Alpha was preset at .05.

H2: The item-total point-biserial (r_pbis) correlations of the ACci items will not differ from the item-total point-biserial (r_pbis) correlations of the ACic items when tested by the sign test. Alpha was preset at .05.

H3: The KR-20 reliability coefficient of the ACci items will not differ from the KR-20 reliability coefficient of the ACic items. The Feldt Test for Equality of Two KR-20 Reliabilities was used to test this hypothesis.

H4: The criterion-related validity of the ACci items, as defined by the product-moment correlation between the ACci total scores and the criterion total weighted scores (with the ACci scores removed from the criterion), will not differ from the criterion-related validity of the ACic items, as defined by the product-moment correlation between ACic total scores and the criterion total weighted scores (with the ACic scores removed from the criterion). The correlations were transformed to z_r scores, and the z-test statistic for two independent correlations was used to test this hypothesis. Alpha was preset at .05.

Design and Analysis

The research design for Part I and Part III of this study was the same. Each part was a comparative study in which differences in difficulty, discrimination, reliability, and criterion-related validity were tested for two item forms.

Latin Square Design. The administration of Form A and Form B of Tests I, II, and III was designed to obtain two randomly divided groups, Group I and Group II. Two levels of two treatments, Item Form and Item Position, were administered to students. Group I received alternate-choice items in positions 1-5, 11-15 and true-false items in positions 6-10, 16-20; Group II received true-false items in positions 1-5, 11-15 and alternate-choice items in positions 6-10, 16-20.
In this 2 x 2 Latin square design, no student received more than one treatment combination of Item Form and Item Position. The design and level of treatment for each group are shown in Table 15.

When groups of subjects are used in each cell of the Latin square rather than single subjects (Lindquist, 1956, termed this a Type II Latin square design), the sums of squares are separated into two Between-Subjects and four Within-Subjects components. These components and degrees of freedom are shown in Table 16.

Feldt Test for Equality of Two KR-20 Reliabilities. This test is an approximate statistical test derived by Feldt (1969) to test the hypothesis that two KR-20 reliabilities are equal. The assumptions made about a test, the examinees, and the scores of Test 1 (these assumptions are the same for Test 2) are:

(i) The $N_1$ examinees are assumed to be a random sample from the examinee population.
(ii) The $k_1$ units are assumed to be a random sample from the population of units in the domain represented by Test 1.
(iii) Over the entire population of examinees, the quantity $t_{1i}$ is assumed normally distributed.
(iv) Over the entire examinees-by-units matrix for Test 1, the $e_{1ij}$ are assumed homogeneous in variance and are normally distributed, independently of each other and of $t_{1i}$ (Feldt, 1969, p. 365).

Where:
$N_1$ = the number of examinees
$k_1$ = the number of items on Test 1
$t_{1i}$ = the true score, in deviation form, of examinee $i$, where $E(t_{1i}) = 0$
$e_{1ij}$ = the measurement error for item $j$ for person $i$ on Test 1, where $E(e_{1ij}) = 0$

The statistic $W$ is obtained using the following formula:

$$W = \frac{1 - r_2}{1 - r_1}$$

where, in Part I, $r_1$ = reliability of the alternate-choice items and $r_2$ = reliability of the true-false items; and, in Part III, $r_1$ = reliability of the ACci items and $r_2$ = reliability of the ACic items.

The statistic $W$ is approximately distributed as a central $F$ with $N_1 - 1$ and $N_2 - 1$ degrees of freedom only when $N_1$ and $N_2$ are greater than 100.
When $N_1$ or $N_2$ is less than 100, the degrees of freedom must be adjusted by use of the following formulas:

$$\nu_1 = \frac{2A^2}{2B - AB - A^2}, \qquad \nu_2 = \frac{2A}{A - 1}$$

Where:

$$A = \frac{df_4}{df_4 - 2} \cdot \frac{df_2}{df_2 - 2}$$

$$B = \frac{(df_1 + 2)(df_4)^2}{(df_4 - 2)(df_4 - 4)(df_1)} \cdot \frac{(df_3 + 2)(df_2)^2}{(df_2 - 2)(df_2 - 4)(df_3)}$$

and

$$df_1 = N_1 - 1, \quad df_2 = (N_1 - 1)(k_1 - 1), \quad df_3 = (N_2 - 1)(k_2 - 1), \quad df_4 = N_2 - 1$$

Table 15

Latin Square Research Design Used in Part I and Part III of this Study

                      Treatment A
                     A1        A2
Treatment B   B1     GI        GII
              B2     GII       GI

Where in Part I:                     Where in Part III:
A1 = AC item form                    A1 = ACci item form
A2 = TF item form                    A2 = ACic item form
B1 = Item position 1-5, 11-15        B1 = Item position 1-5, 11-15
B2 = Item position 6-10, 16-20       B2 = Item position 6-10, 16-20
GI = Group I of Part I               GI = Group I of Part III
GII = Group II of Part I             GII = Group II of Part III

Table 16

Analysis of Total Sum of Squares for the 2 x 2 Latin Square Design

Source                  df                    Sum of Squares
Between-Subjects        an - 1                SS_S
  error (b)             a(n - 1)              SS_error(b) = SS_S - SS_G
Within-Subjects         an(a - 1)             SS_wS = SS_T - SS_S
  A                     a - 1                 SS_A
  B                     a - 1                 SS_B
  AB (w)a               (a - 1)(a - 2)        SS_AB(w) = SS_AB - SS_G
  error (w)             a(a - 1)(n - 1)       SS_error(w) = SS_wS - SS_A - SS_B - SS_AB(w)
Total                   a^2 n - 1             SS_T

Note: From Design and Analysis of Experiments (p. 278) by E. F. Lindquist, 1953, Boston: Houghton Mifflin.
aIn a 2 x 2 design this term vanishes as a source of sums of squares because df = (2 - 1)(2 - 2) = 0.

CHAPTER IV

RESULTS

This chapter is divided into three major sections. The results of Part I of this study are presented in the first section. In Part I, the alternate-choice and true-false items are compared on difficulty, discrimination, reliability, and criterion-related validity. The results of Part II are presented in the second section. In this part of the study, the practicability of judging the best form of the alternate-choice item (ACci or ACic) and the best form of the true-false item (TFt or TFf) is explored. The results of Part III are presented in the third section.
In this part of the study, the alternate-choice items with the correct answer listed first (ACci) and with the incorrect answer listed first (ACic) are compared on difficulty, discrimination, reliability, and criterion-related validity.

Part I

During the initial exploration of the data, a repeated measures analysis of students' scores across Test I (midterm exam) and Test II (final exam) showed that the students performed differently on the two tests (see Table 17). As a result, it was decided to treat Test I and Test II as independent substudies within Part I. This posed no problem because there was no overlap between the material tested up to the midterm and the material tested after the midterm.

Table 17

Results of Repeated Measures Analysis of Tests I and II

Source                  df        MS         F        p
Between-Subjects
  Constant               1
  error                245
Within-Subjects
  Test (I & II)          1      13.21       5.71     .05
  error                245       2.31
  Form (A & B)           1     370.75     154.54     .001
  error                245       2.40
  Test x Form            1      16.13       6.29     .05
  error                245       2.56

The treatment of each test as a substudy required twice as many statistical analyses to test the hypotheses as originally planned. To adjust for Type I error, the alpha level stated in each hypothesis in Chapter III was lowered from .05 to .025.

The first operational hypothesis to be tested stated:

H1: The mean score of the alternate-choice items will be significantly greater than the mean score of their content-equivalent true-false items when tested by the t test.

The means (M), standard deviations (SD), and other item statistics for each Item Form for each Item Position within the Latin square design are shown in Table 18. The results of the Latin square analysis for Test I are shown in Table 19; the results for Test II are shown in Table 20. For both Test I and Test II, the mean score of the alternate-choice items was found to be significantly greater than the mean score of the true-false items. Given these results, Hypothesis 1 was supported.
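As a worked check on the figures behind this conclusion, the adjusted alpha level and the Item Form F ratio for Test I can be written out. The mean squares are those reported in Table 19; treating the alpha adjustment as an equal split of .05 across the two substudies is an assumption, since the text gives only the resulting value of .025.

```latex
\alpha_{\text{per test}} = \frac{.05}{2} = .025
\qquad
F_{A} = \frac{MS_{A}}{MS_{\text{error}(w)}} = \frac{270.783}{2.249} \approx 120.40
```

The same division applied to Item Position in Test I gives 71.075 / 2.249, or about 31.60, which matches the F reported for that effect.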
Although item position was not of primary interest in this study, the results deserve discussion. In Test I, the mean score of items in Item Position 1-5,11-15 was significantly greater than the mean score of items in Item Position 6-10,16-20. The reverse was true for Test II, where the mean score of items in Item Position 6-10,16-20 was significantly greater than the mean of items in Item Position 1-5,11-15. These results suggest that there is an interaction effect between item position and item content on student performance on these two examinations.

Table 18
Latin Square Design and Item Statistics for Tests I and II

TEST I
Item Position     AC Items                TF Items
1-5, 11-15        Group I                 Group II
                  M = 7.19                M = 6.03
                  SD = 1.71               SD = 1.70
                  mean r_pbis = .392      mean r_pbis = .367
                  r_tt = .413             r_tt = .271
                  r = .592                r = .443
                  r_tta = .876            r_tta = .788
6-10, 16-20       Group II                Group I
                  M = 6.76                M = 4.94
                  SD = 1.42               SD = 1.57
                  mean r_pbis = .325      mean r_pbis = .326
                  r_tt = .036             r_tt = .111
                  r = .425                r = .278
                  r_tta = .272            r_tta = .555

TEST II
Item Position     AC Items                TF Items
1-5, 11-15        Group I                 Group II
                  M = 6.52                M = 5.48
                  SD = 1.68               SD = 1.42
                  mean r_pbis = .380      mean r_pbis = .313
                  r_tt = .333             r_tt = .000
                  r = .545                r = .272
                  r_tta = .859            r_tta = .000
6-10, 16-20       Group II                Group I
                  M = 7.37                M = 5.48
                  SD = 1.59               SD = 1.91
                  mean r_pbis = .365      mean r_pbis = .411
                  r_tt = .329             r_tt = .462
                  r = .388                r = .628
                  r_tta = .831            r_tta = .896

Note: N = 123 for each cell
r_tt = KR-20 reliability
r = correlation between each Item Form and the course grade criterion
r_tta = reliability adjusted to 100 items by the Spearman-Brown formula

Table 19
Latin Square Analysis of Test I (Midterm)

Source                   df      MS         F        p
Between-Subjects        245      2.95       1.01
  AB (b)                  1     13.34       4.58    n.s.
  error (b)             244      2.91
Within-Subjects         246      3.62       1.61    n.s.
  A (Item Form)^a         1    270.783    120.40    .001
  B (Item Position)^b     1     71.075     31.60    .001
  AB (w)                  1      0.00       0.00    n.s.
  error (w)             244      2.249
Total                   491

^a Item Form AC vs TF
^b Item Position 1-5,11-15 vs 6-10,16-20

Table 20
Latin Square Analysis of Test II (Final)

Source                   df      MS         F        p
Between-Subjects        245      3.46        .99
  AB (b)                  1      0.59        .17    n.s.
  error (b)             244      3.47
Within-Subjects         246      2.89       1.44
  A (Item Form)^a         1    116.09      57.62    .001
  B (Item Position)^b     1    104.73      51.98    .001
  AB (w)                  1      0.00       0.00    n.s.
  error (w)             244      2.02
Total                   491

^a Item Form AC vs TF
^b Item Position 1-5,11-15 vs 6-10,16-20

The second operational hypothesis to be tested stated:

H2: The item-total point-biserial (r_pbis) correlations of the alternate-choice items will be significantly higher than the item-total point-biserial (r_pbis) correlations of the content equivalent true-false items when tested by the sign test.

An r_pbis coefficient was computed for each alternate-choice item and the total alternate-choice score of its respective test. Similarly, an r_pbis coefficient was calculated for each true-false item and the total true-false score of its respective test. These values are found in Appendix E. The r_pbis values of each of the 20 content-equivalent alternate-choice and true-false items were placed side by side and a sign test was used to test for differences in the discrimination ability of these two item forms. For Test I, T = 11, p > .05, and for Test II, T = 13, p > .05, where T is the number of times the r_pbis of the alternate-choice item was greater than the r_pbis of the content-equivalent true-false item. Given these results, Hypothesis 2 was rejected. Note that the average r_pbis (mean r_pbis) for each cell in the Latin Square design is shown in Table 18.

The third operational hypothesis to be tested stated:

H3: The KR-20 reliability coefficient of the alternate-choice items will be greater than the KR-20 reliability coefficient of the true-false items.

The Feldt Test for Equality of Two KR-20 Reliabilities was used to test this hypothesis. The KR-20 reliabilities (r_tt) of the 5 alternate-choice items and the 5 true-false items were computed for each cell in the Latin square design. These KR-20 reliability coefficients are shown in Table 18.
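The two reliability computations behind Table 18 and this hypothesis can be sketched as follows. This is a minimal illustration, not the author's procedure: the function names are hypothetical, the lengthening factor of 10 assumes that each cell score rests on 10 items (inferred from the item positions 1-5 and 11-15), and the ratio form of the Feldt statistic is the one that reproduces the W values reported for Test I.

```python
def spearman_brown(r_tt: float, factor: float) -> float:
    """Project a reliability to a test `factor` times as long."""
    return factor * r_tt / (1 + (factor - 1) * r_tt)


def feldt_w(r_tt_1: float, r_tt_2: float) -> float:
    """Feldt-style ratio for comparing two KR-20 coefficients.

    Values near 1 indicate similar reliabilities; the ratio is referred
    to an F distribution (with adjusted df when samples are small).
    """
    return (1 - r_tt_2) / (1 - r_tt_1)


# KR-20 values from Table 18, Test I.
ac_pos1, tf_pos1 = 0.413, 0.271   # Item Position 1-5,11-15
ac_pos2, tf_pos2 = 0.036, 0.111   # Item Position 6-10,16-20

# Assumed lengthening factor: 10 items projected to 100 items.
adjusted_ac = spearman_brown(ac_pos1, 10)
adjusted_tf = spearman_brown(tf_pos1, 10)

w_pos1 = feldt_w(ac_pos1, tf_pos1)
w_pos2 = feldt_w(ac_pos2, tf_pos2)

print(f"adjusted r_tt: AC = {adjusted_ac:.3f}, TF = {adjusted_tf:.3f}")
print(f"W: position 1 = {w_pos1:.3f}, position 2 = {w_pos2:.3f}")
```

Under these assumptions the adjusted reliabilities match the r_tta entries for Test I in Table 18, and the two W ratios come out close to the values reported for Test I.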
The reliabilities of these content equivalent item forms were then tested for equality by use of the Feldt Test for Equality of Two KR-20 Reliabilities (Feldt, 1969). In Test I, for Item Position 1-5,11-15, W(123,123) = 1.24, p > .05; and for Item Position 6-10,16-20, W(123,123) = .923, p > .05. Thus, the reliabilities for Test I were found to be equal in magnitude.

For Test II, the reliability of the alternate-choice items in Item Position 1-5,11-15 was shown to be significantly greater, W(123,123) = 1.499, p < .025, than the reliability of the content-equivalent true-false items. The reliabilities of the content-equivalent alternate-choice and true-false items in Item Position 6-10,16-20 were found to be equal in magnitude, W(123,123) = .802, p > .05. Given these results for Test I and Test II, Hypothesis 3 is only partially supported.

The last operational hypothesis to be tested stated:

H4: The criterion-related validity of the alternate-choice items, as defined by the product moment correlation between the alternate-choice total scores and the criterion related weighted score, will be greater than the criterion related validity of the true-false items as defined by the product moment correlation between true-false total scores and the criterion total weighted score. The correlations were transformed to z_r scores and the z-test statistic for two independent correlations was used to test for differences.

The course grade for each student was based on a weighted accumulated score of all quizzes and tests. This accumulated score was adjusted by removing the weighted scores of all alternate-choice and true-false items. Correlations were then computed between the total alternate-choice score for each Item Position and the adjusted score, and the total true-false score for each Item Position and the adjusted score. These correlation coefficients (r) are shown in Table 18.
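The comparison of two independent correlations described for this hypothesis can be sketched as below. This is a hedged illustration using the standard Fisher transformation with a standard error of sqrt(1/(N1 - 3) + 1/(N2 - 3)); the function names are hypothetical, and the inputs are the rounded correlations and cell sizes from Table 18, so small differences from the reported z values are to be expected.

```python
import math


def fisher_z(r: float) -> float:
    """Fisher's r-to-z transformation."""
    return math.atanh(r)


def z_independent_correlations(r1: float, n1: int, r2: float, n2: int) -> float:
    """z statistic for the difference between two independent correlations."""
    standard_error = math.sqrt(1 / (n1 - 3) + 1 / (n2 - 3))
    return (fisher_z(r1) - fisher_z(r2)) / standard_error


def upper_tail_p(z: float) -> float:
    """One-tailed p value from the standard normal distribution."""
    return 0.5 * math.erfc(z / math.sqrt(2))


N = 123  # cell size given in the note to Table 18

# Item Position 1-5,11-15: AC vs TF correlation with the adjusted course score.
z_test1 = z_independent_correlations(0.592, N, 0.443, N)   # Test I
z_test2 = z_independent_correlations(0.545, N, 0.272, N)   # Test II

print(f"Test I:  z = {z_test1:.2f}, one-tailed p = {upper_tail_p(z_test1):.3f}")
print(f"Test II: z = {z_test2:.2f}, one-tailed p = {upper_tail_p(z_test2):.3f}")
```

A one-tailed p value is used here because the hypothesis is directional; that choice is an assumption about the original analysis.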
Each correlation was transformed to a z_r score and a z-test for two independent samples was performed for content-equivalent items in each Item Position. For Test I, for items in Item Position 1-5,11-15, z = 1.55, p < .067; and for items in Item Position 6-10,16-20, z = 1.318, p < .097. In both cases, the alternate-choice and the true-false item correlations with the course grade criterion were not significantly different at the .05 level.

For Test II, the criterion-related correlation of the alternate-choice items in Item Position 1-5,11-15 was significantly greater than that of the true-false items, z = 2.56, p < .006. The criterion-related correlation of the true-false items in Item Position 6-10,16-20 was significantly greater than that of the alternate-choice items, z = -2.56, p < .006. Given the mixed results for Test I and Test II, Hypothesis 4 was only partially supported.

Part II

The operational hypotheses in Part II were both stated in null form:

H1: Agreement between two departmental colleagues' judgements as to the better version of an alternate-choice item will be no better than chance (50 percent).

H2: Agreement between two departmental colleagues' judgements as to the better version of a true-false item will be no better than chance (50 percent).

The two instructors who judged the better version of each alternate-choice item (ACci and ACic) and the better version of the true-false item (TFt and TFf) expressed a great deal of frustration concerning the completion of the task. Both found judging the alternate-choice item versions more exasperating than judging the true-false versions. One judge noted on the recording sheet that: "I have completed this task but I have no confidence that, confronted with the same task again, I would make the same choice".
The other judge stated verbally that the task was a piece of "nonsense" and that he was certain that he would not be able to produce the same judgments if he were to redo the task, which he stated that he would not do.

The percent of agreement between the judges in their choice of the best alternate-choice items was 55 percent, only 5 percent greater than that expected by chance. There was also a 55 percent agreement for the true-false items. It can be concluded from these results that the agreement of two departmental colleagues' judgements as to the better item form of each alternate-choice and each true-false item is no better than chance. Hypothesis 1 and Hypothesis 2 were accepted.

Part III

The same statistical tests used in Part I of this study were used in Part III. The alpha level for all hypotheses was set at .05. All hypotheses in Part III were stated in null form. The first operational hypothesis to be tested stated:

H1: The mean score of the ACci items will not differ from the mean score of the ACic items when tested by the Latin square F-test.

The means, standard deviations, and other item statistics for each alternate-choice Item Form for each Item Position within the Latin Square design are shown in Table 21; the results of the F-tests are shown in Table 22. The mean difficulties of the ACci and ACic items were found to be equal, F(1,202) = 3.03, p > .05. Given these results, Hypothesis 1 could not be rejected.

Although item position was not of primary interest in this part of the study, the results deserve discussion. The mean score of items in Item Position 6-10,16-20 was found to be significantly higher than the mean of items in Item Position 1-5,11-15. These results suggest, as they did in Part I, that there is an interaction effect between item position and item content on student performance on this examination.
The second operational hypothesis to be tested stated:

H2: The item-total point-biserial (r_pbis) correlations of the ACci items will not differ from the item-total point-biserial (r_pbis) correlations of the ACic items when tested by the sign test.

A point-biserial correlation (r_pbis) coefficient was computed for the total alternate-choice score and each ACci and each ACic item on the test. These values are found in Appendix E. The r_pbis values of each of these 20 content-equivalent ACci and ACic items were placed side by side and a sign test was used to test for differences in discrimination.

Table 21
Latin Square Design and Item Statistics for Part III

Item Position     ACci                    ACic
A                 Group I                 Group II
                  M = 6.43                M = 6.94
                  SD = 1.75               SD = 1.67
                  mean r_pbis = .333      mean r_pbis = .337
                  r_tt = .375             r_tt = .271
                  r = .692                r = .542
B                 Group II                Group I
                  M = 7.45                M = 7.43
                  SD = 1.62               SD = 1.66
                  mean r_pbis = .328      mean r_pbis = .357
                  r_tt = .402             r_tt = .453
                  r = .649                r = .548

Note: N = 204 for each cell
r_tt = KR-20 reliability
r = correlation between each Item Form and course grade criterion

Table 22
Latin Square Analysis of Part III

Source                   df      MS        F        p
Between-Subjects        101      3.59      0.00
  AB (b)                  1      3.57       .99    n.s.
  error (b)             100      3.59
Within-Subjects         102      2.31      2.28
  A (Item Form)^a         1      3.06      3.03    n.s.
  B (Item Position)       1     29.06     28.77    .001
  AB (w)                  1      0.00      0.00    n.s.
  error (w)             202      1.01
Total                   203

^a Item Form ACci vs ACic

The result of the sign test was T = 12, p > .05, where T is the number of times the r_pbis of the ACic item was greater than the r_pbis of the content-equivalent ACci item. Given these results, it can be concluded that there is no significant difference between the item discriminations of the ACci and ACic items. Hypothesis 2 could not be rejected. Note that the average r_pbis (mean r_pbis) for each cell in the Latin square design is shown in Table 21.
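The sign test reported above can be sketched as an exact binomial calculation. This is a minimal illustration that assumes 20 untied item pairs and a chance probability of .5 for each pair; the function name and the one-sided option are hypothetical and are not taken from the study.

```python
from math import comb


def sign_test_p(successes: int, n_pairs: int, two_sided: bool = True) -> float:
    """Exact sign-test p value under a binomial(n_pairs, 0.5) null."""
    total = 2 ** n_pairs
    upper = sum(comb(n_pairs, k) for k in range(successes, n_pairs + 1)) / total
    lower = sum(comb(n_pairs, k) for k in range(0, successes + 1)) / total
    if not two_sided:
        return upper  # probability of at least `successes` wins
    return min(1.0, 2 * min(upper, lower))


# Part III: T = 12 of 20 pairs, null hypothesis of no difference (two-sided).
p_part3 = sign_test_p(12, 20)

# Part I: directional hypotheses, T = 11 (Test I) and T = 13 (Test II).
p_test1 = sign_test_p(11, 20, two_sided=False)
p_test2 = sign_test_p(13, 20, two_sided=False)

print(f"Part III two-sided p = {p_part3:.3f}")
print(f"Part I one-sided p: Test I = {p_test1:.3f}, Test II = {p_test2:.3f}")
```

All three p values exceed .05 under these assumptions, in line with the reported conclusions.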
The third operational hypothesis to be tested stated:

H3: The KR-20 reliability coefficient of the ACci items will not differ from the KR-20 reliability coefficient of the ACic items.

The Feldt Test for Equality of Two KR-20 Reliabilities was used to test this hypothesis. The KR-20 reliability coefficients (r_tt) for the 5 ACci items and the 5 ACic items were computed for each cell in the Latin Square design. These KR-20 reliability coefficients are shown in Table 21.

The reliabilities of the content equivalent items for each Item Position were tested for equality by use of the Feldt Test for Equality of Two KR-20 Reliabilities (Feldt, 1969). For Item Position 1-5,11-15, W(44,45) = 1.06, p > .05; and for Item Position 6-10,16-20, W(44,45) = 1.09, p > .05. Given these results, it can be concluded that there is no significant difference between the reliabilities of these two item forms. Hypothesis 3 could not be rejected.

The last operational hypothesis to be tested stated:

H4: The criterion related validity of the ACci items, as defined by the product moment correlation between the ACci total scores and the criterion total weighted scores (with the scores of all ACci and ACic items removed), will not differ from the criterion related validity of the ACic items, as defined by the product moment correlation between ACic total scores and the criterion total weighted scores (with the scores of all ACci and ACic items removed). The correlations were transformed to z_r scores and the z-test statistic for two independent correlations was used to test this hypothesis.

The weighted accumulated score for each student was adjusted by removing the weighted scores of all ACci and ACic items. For each Item Position, correlations were computed between the total ACci score and the adjusted score, and between the total ACic score and the adjusted score. These correlation coefficients (r) are shown in Table 21.
Differences in the correlations of ACci and ACic items were tested by transforming them to z_r scores and, for each Item Position, performing a z-test for two independent samples. For items in Item Position 1-5,11-15, z = 1.17, p > .05, and for items in Item Position 6-10,16-20, z = .78, p > .05. Given these results it can be concluded that there are no significant differences between the criterion related validities. Hypothesis 4 could not be rejected.

CHAPTER V

SUMMARY AND CONCLUSIONS

Ebel (1982) proposed the use of the "alternate-choice" item as a replacement for the true-false item. The alternate-choice item he described was a modified two-choice multiple-choice item in which the two responses were included within the stem of the item. A search of the literature showed, other than Ebel's study, a dearth of recent research on the comparison of two-choice or alternate-choice and true-false items. The results of two studies conducted in the 1920s showed conflicting outcomes for the reliabilities of the two-choice multiple-choice and true-false items, but results that were consistent with Ebel's concerning the less difficult nature and greater predictive validity of the two-choice multiple-choice/alternate-choice item.

The purpose of the present study was three-fold: 1) to compare the difficulty level, discrimination level, reliability, and criterion-related validity of the alternate-choice item form and the content-equivalent true-false form; 2) to investigate the practicability of judging whether the alternate-choice version ACci or ACic is the better form of the item, and whether the true-false version TFt or TFf is the better form of this item; and 3) to examine the effects of the alternate-choice item with the correct answer given first (ACci) and the distractor given first (ACic) on difficulty, discrimination, reliability, and criterion-related validity.

This study was conducted in three parts.
Each part corresponded to one purpose of this study. In this chapter, a summary, a discussion of the findings, and conclusions are presented for each part. Limitations of the study and suggestions for future research are presented last.

Part I

The instruments used in Part I of this study were a midterm (Test I) and a final examination (Test II). Both tests were administered to lower division college students in a natural science course that emphasized genetics and reproduction. Each test consisted of Form A and Form B. Each form of Test I and each form of Test II contained 10 alternate-choice items, 10 true-false items, and, respectively, 22 and 65 four- and five-choice multiple-choice items or key-type multiple-choice items. The alternate-choice and true-false items on Form B were the content equivalent of the true-false and alternate-choice items on Form A.

From a pool of 400 items, the senior instructor of the course selected 65 items for conversion to alternate-choice and true-false items for Test I and 100 items for conversion for Test II. For each item he indicated the correct answer and the distractor he judged the most reasonable answer given by an uninformed student. Items were converted from alternate-choice to true-false form by randomly eliminating either the correct response or the distractor, within the constraint that 60 percent of the items would be false. Each item in its multiple-choice, its alternate-choice, and its true-false form was submitted to two measurement experts to be judged for equivalence of content. All items were judged equivalent.

Forms A and B were distributed to students in a regular alternating sequence to discourage the copying of answers and to obtain randomly equivalent groups. There were 247 students who took both Test I and Test II. One student was randomly eliminated to produce a balanced Latin square design. Both tests were considered power tests.
During the initial exploration of the data, a repeated measures analysis indicated that students performed differently on Test I and Test II. As a result it was decided to treat Test I and Test II as independent substudies within Part I.

The results of the Latin square analysis F-tests indicated that for both Test I and Test II, the alternate-choice items were significantly easier to answer than the content-equivalent true-false items. These results are consistent with those found by Ruch and Stoddard (1925), Charles (1926), and Ebel (1982). The consistent finding that the alternate-choice item is less difficult than the true-false item can most likely be attributed to the additional piece of information present in the alternate-choice item that is not present in the true-false item. This information probably acts as a focusing mechanism to assist the student in determining more precisely what information the item writer is seeking.

There was an additional finding of an interaction effect of item position and item content on item difficulty for both Test I and Test II. This finding suggests a need for control of these variables in future research of this type.

The results of the sign test comparing the r_pbis correlations of the content-equivalent alternate-choice items with those of the true-false items for both Test I and Test II indicated that there were no differences in the discrimination ability of the two item forms. In his study, Ebel (1982) reported that the alternate-choice items had higher discrimination ability than the true-false items. One reason for these differing results may be that Ebel did not control for content-equivalence of items in his study; therefore the alternate-choice items in his study may have been easier in content than his true-false items. It also must be noted that Ebel did not conduct statistical tests on his discrimination data.
It is possible that the difference between the discrimination values of .28 for the true-false items and .30 for the alternate-choice items was not statistically significant. There is also a possibility that, in this study, the error variance introduced by guessing or item ambiguity may have obscured any real differences in discrimination of these item forms. The examination of the means of the true-false items (see Table 18) shows two of the four means to be just slightly above the guessing level (5.48 and 5.48), and one mean to be below the guessing level (4.94).

When the reliabilities of the alternate-choice and true-false items in each item position in each test were tested for equality by the Feldt Test for Equality of Two KR-20 Reliabilities, only the alternate-choice and true-false items in Item Position 1-5,11-15 in Test II were found to differ significantly. The reliability for the alternate-choice items in this item position was .333; the reliability for the content equivalent true-false items was .000.

To more directly compare the magnitude of the reliabilities of this study with those found in previous studies, the reliabilities adjusted to 100 items by the Spearman-Brown formula reported in Tables 1, 2, 3, and 18 are shown in Table 23. The average reliability of the alternate-choice items in this study compares favorably to those found by Ruch and Stoddard (1925), Charles (1926), and Ebel (1982). However, this is much less the case for the true-false items. If the low reliabilities of these items are due to ambiguity and/or guessing, then there should be an increase in error variance but not in the true variance of the test, and it should follow that reliability would be reduced. Given the means that reflect near guessing levels for these items, this appears to be the case in this study.

The last comparison made was of the criterion-related validity of the alternate-choice and true-false items.
The criterion-related validity was defined as the Pearson product moment correlation between the alternate-choice total score and the total weighted score for the course. The student's final grade was based on this total weighted score. For Test I, both the alternate-choice and the true-false scores were found to be equally correlated with the course grade criterion. For Test II, the results were mixed. The alternate-choice items in Item Position 1-5,11-15 were more highly correlated with the criterion than the true-false items. This result is not surprising given the zero reliability of the true-false items in this item position. For items in Item Position 6-10,16-20, the true-false items were more highly correlated with the criterion than were the alternate-choice items.

Table 23
Reliabilities adjusted to 100 items by the Spearman-Brown formula

Test                               AC 100-Item r_tt    TF 100-Item r_tt
Test I
  Item Position 1-5,11-15               .876                .788
  Item Position 6-10,16-20              .272                .555
Test II
  Item Position 1-5,11-15               .859                .000
  Item Position 6-10,16-20              .831                .896
Ebel (1982)                             .890                .780
Charles (1926)                          .646                .751
Ruch and Stoddard (1925)                .749                .714

The conclusions drawn from the results in Part I of this study were:

1) The alternate-choice item form was less difficult than the true-false item form.

2) There was evidence of an interaction effect of item position and item content on item difficulty.

3) The alternate-choice item form and the true-false item form do not differ in discrimination ability, and do not consistently differ in reliability or in their relationship to the final score for the course upon which grades are based.

Part II

Part II of this study was concerned with the practicability of judging the better item version of the alternate-choice and true-false item, and with the amount of agreement between judges on choosing the better version.
The senior instructor of the natural science course used in this study and a departmental collaborator, both of whom had taught the natural science course for many years, were asked to be the judges.

Each alternate-choice item used in Test I and Test II was converted to two versions, one with the correct answer presented first (ACci), and the other with the incorrect answer presented first (ACic). Each true-false item used on these two tests was also converted to two versions: one was the true version (TFt) and the other was the false version (TFf) of the item. Each judge was asked to choose the better version of each item; that is, the version that would, in his estimation, simultaneously maximize the chances of a correct answer being made by a student who knows the material, and an incorrect answer being made by an uninformed student. The percent of agreement between the judges as to the better version of each alternate-choice item was 55 percent, and the agreement as to the better version of each true-false item was also 55 percent.

From the frustrations expressed by the judges, it is unlikely that a test writer would seriously attempt a task similar to this one. The results in Part III of this study indicated that the two alternate-choice versions functioned very much the same on a test. The resultant lack of agreement between the judges as to the better form suggests that the use of a table of random numbers to choose the better version could produce a timely written test while keeping the test writer more even-tempered throughout the test development process. It should be noted, however, that because the false version of the true-false item is a better discriminating item than the true version, it is important for the practitioner to consider including more false than true versions of items on a true-false test.
Part III

The instrument used in this part of the study was a final examination (Test III) that was administered to lower division college students in a natural science course that emphasized genetics and reproduction. This course was taught by the same senior instructor who taught the natural science course in Part I. Test III contained 20 alternate-choice items randomly selected from those in Tests I and II, plus 22 multiple-choice or key-type items that were selected from the item pool. There were 10 ACci items randomly assigned to Form A, and 10 of their ACic content-equivalent versions assigned to Form B. There were 10 ACic items randomly assigned to Form A and their content-equivalent ACci versions to Form B.

These two forms were administered to 102 students using the same procedure as that used in Part I. Students had two hours to complete Test III (66 items); therefore Test III was also considered a power test. A Latin square design was used for data analysis.

The results of each statistical test indicated that no differences existed between these two item versions on difficulty, reliability, criterion-related validity, or discrimination. Thus, unlike the false version of the true-false item form, which is more discriminating than the true version, the ACic item version does not discriminate better than the ACci item version. As in Part I of this study, a significant interaction effect of item position and item content on item difficulty was found for these items.

The conclusions drawn from the results in Part III of this study were:

1) The alternate-choice item with the correct answer presented first and the alternate-choice item with the incorrect answer presented first did not differ in difficulty, discrimination, reliability, or criterion-related validity.

2) There was evidence that an interaction effect on item difficulty exists between item position and item content.
Limitations of the Study

The judgmental method was not used to convert items from alternate-choice form to true-false form. Instead, a table of random numbers was used to decide whether the correct answer or distractor was to be eliminated from the alternate-choice item. Given the results of the item form judgment task, it is retrospectively doubted that the performance of the true-false item forms would have been different if the judgmental method had been used. However, it is possible that this random conversion method may have affected the true-false item performance in this study. Further research which would include a random conversion procedure for developing true-false items from multiple-choice type item forms would be helpful in clarifying this issue.

Suggestions for Further Research

The following recommendations for further research on item forms are made:

1. That there be a replication of Part I using items that are less difficult in nature.

2. That there be a replication of Part III of this study to test for systematic differences in the ACci and ACic item versions.

3. That future investigations involving the conversion of item forms from a multiple-choice type form to true-false form should include a random selection method to determine the true or false form of the item.

4. That future investigations take into account the element of guessing, which was not examined in this study. Guessing may be less on the alternate-choice item forms than on the true-false item form because of the additional piece of information supplied in the alternate-choice item. The extent of guessing on these two item forms needs to be studied using a three parameter latent trait analysis model.

5. That an investigation be made into the efficiency of Ebel's alternate-choice item form.
Because of its compactness, this form may have greater efficiency than the two-choice multiple-

APPENDICES

APPENDIX A

MULTIPLE-CHOICE, ALTERNATE-CHOICE, AND TRUE-FALSE FORMS OF EXPERIMENTAL ITEMS

MIDTERM EXAM
TEST I

The next three items are based on the following information: In snapdragons tallness (T) is dominant over dwarfness (t), while red flower color is due to a gene (R) and white to its allele (r). The heterozygous condition results in pink flower color. A dwarf homozygous red snapdragon is crossed with a plant homozygous for tallness and white flowers.

1. MC form
What is the genotype and phenotype of the F1's?
A) ttRr, dwarf and pink
B) ttrr, dwarf and white
C) TtRr, tall and red
* D) TtRr, tall and pink
E) None of these

1. AC form
The genotype and phenotype of the F1's are a) TtRr, tall and pink b) TtRr, tall and red.

1. TF form
The genotype and phenotype of the F1's are TtRr, tall and pink.

2. MC form
If two plants of the genotypes ttRr and TtRR are crossed and no mutations occur, what are the chances that they will produce a dwarf white plant?
A) 1/2
B) 1/4
C) 3/16
D) 1/16
* E) 0

2. AC form
If two plants of the genotypes ttRr and TtRR are crossed and no mutations occur, the chances are a) 0 b) 1/16 that they will produce a dwarf white plant.

2. TF form
If two plants of the genotypes ttRr and TtRR are crossed and no mutations occur, the chances are 1/16 that they will produce a dwarf white plant.

3. MC form
A plant which is heterozygous for tallness and red flowers is self-pollinated. What is the probability that the offspring will be short and white?
A) 9/16
B) 3/16
C) 3/9
D) 1/16
E) 0

3. AC form
A plant which is heterozygous for tallness and red flowers is self-pollinated. The probability is a) 0 b) 1/16 that the offspring will be short and white.

3. TF form
A plant which is heterozygous for tallness and red flowers is self-pollinated.
The probability is 0 that the offspring will be short and white.

4. MC form
A son is born whose father is normal but whose grandfather, on his mother's side, was hemophilic. What are the chances that he, too, would bear this trait?
A) 75%
B) 50%
C) 10%
D) 1/16
E) 0%

4. AC form
A son is born whose father is normal but whose grandfather, on his mother's side, was a hemophilic. The chances that he, too, would bear this trait is a) 50% b) 75%

4. TF form
A son is born whose father is normal but whose grandfather, on his mother's side, was hemophilic. The chances that he, too, would bear this trait is 75%.

5. MC form
In a cross between individuals heterozygous for two traits, the expected number of homozygous recessive individuals is
A) 9/16
B) 1/2
C) 1/4
D) 3/16
* E) 1/16

5. AC form
In a cross between individuals heterozygous for two traits, the expected number of homozygous recessive individuals is a) 1/16 b) 1/4

5. TF form
In a cross between individuals heterozygous for two traits, the expected number of homozygous recessive individuals is 1/16.

6. MC form
Dr. Corcos works with a little plant, Arabidopsis thaliana, whose chromosome number is 2n=10. In such a plant the number of possible combinations of paternal and maternal chromosomes is
A) 64
B) 32
C) 16
D) 8
E) 4

6. AC form
Dr. Corcos works with a little plant, Arabidopsis thaliana, whose chromosome number is 2n=10. In such a plant the number of possible combinations of paternal and maternal chromosomes is a) 8 b) 32

6. TF form
Dr. Corcos works with a little plant, Arabidopsis thaliana, whose chromosome number is 2n=10. In such a plant the number of possible combinations of paternal and maternal chromosomes is 32.

7. MC form
A streak of white in an otherwise colored head of hair is known as white forelock. It is due to a dominant gene.
If a woman with a white forelock marries a normal man and their first child is normal, her genotype is
A) AA
B) Aa
C) aa

7. AC form
A streak of white in an otherwise colored head of hair is known as white forelock. It is due to a dominant gene. If a woman with a white forelock marries a normal man and their first child is normal, her genotype is a) aa b) Aa.

7. TF form
A streak of white in an otherwise colored head of hair is known as white forelock. It is due to a dominant gene. If a woman with a white forelock marries a normal man and their first child is normal, her genotype is aa.

8. MC form
What are the chances that the second child of the marriage above has a white forelock?
A) 3/4
B) 1/2
C) 1/4
D) 0

8. AC form
The chances that the second child of the marriage above has a white forelock is a) 1/2 b) 1/4.

8. TF form
The chances that the second child of the marriage above has a white forelock is 1/2.

9. MC form
Some organisms have sex chromosomes of the XO, XX type, in which males have one X chromosome, females two; in other organisms the female has two X chromosomes, the male an X and a Y; in still other organisms such as birds, the female has XY and the male XX. Some animals and plants can even be male in one situation and female in another, when conditions are favorable; others are hermaphroditic. All this would seem to indicate that sex determination is
A) ultimately a hereditary decision prescribed by the male.
B) by the sex chromosomes only.
C) wholly random and unpredictable in any case.
D) not always entirely determined by the karyotype.

9. AC form
Some organisms have sex chromosomes of the XO, XX type, in which males have one X chromosome, females two; in other organisms the female has two X chromosomes, the male an X and a Y; in still other organisms such as birds, the female has XY and the male XX. Some animals and plants can even be male in one situation and female in another, when conditions are favorable; others are hermaphroditic.
All this would seem to indicate that sex determination is a) not always entirely determined by the karyotype b) by the sex chromosomes only.

9. TF form
Some organisms have sex chromosomes of the XO, XX type, in which males have one X chromosome, females two; in other organisms the female has two X chromosomes, the male an X and a Y; in still other organisms such as birds, the female has XY and the male XX. Some animals and plants can even be male in one situation and female in another, when conditions are favorable; others are hermaphroditic. All this would seem to indicate that sex determination is by the sex chromosomes only.

10. MC form
Few genes are Y linked. The reason for this is probably that
A) the Y chromosome is largely homologous with the X.
B) both sexes possess the Y chromosomes.
C) both sexes possess two X chromosomes.
D) the Y chromosome occurs only in one sex and is small.

10. AC form
Few genes are Y linked. The reason for this is probably that the Y chromosome a) is largely homologous with the X b) occurs only in one sex and is small.

10. TF form
Few genes are Y linked. The reason for this is probably that the Y chromosome is largely homologous with the X.

11. MC form
The "drumstick" chromosome often found in the female nuclei of white blood cells
A) indicates the sex of the person involved.
B) represents an inactivated Y-chromosome.
C) could occur in a person with the XYY syndrome.
D) is characteristic of the cri-du-chat disorder.
E) none of the above

11. AC form
The "drumstick" chromosome often found in the female nuclei of white blood cells a) represents an inactivated Y-chromosome b) indicates the sex of the person involved.

11. TF form
The "drumstick" chromosome often found in the female nuclei of white blood cells indicates the sex of the person involved.

12. MC form
In guinea pigs black is dominant over white.
A cross between a heterozygous black and a white guinea pig would give a ratio of
A) about 3 blacks to 1 white
B) all black
C) about 1 black to 1 white
D) about 3 whites to 1 black
E) about 1 black to 2 grey to 1 white

12. AC form
In guinea pigs black is dominant over white. A cross between a heterozygous black and a white guinea pig would give a ratio of a) all black b) about 1 black to 1 white.

12. TF form
In guinea pigs black is dominant over white. A cross between a heterozygous black and a white guinea pig would give a ratio of about 1 black to 1 white.

13. MC form
A form of Vitamin D resistant rickets, known as hypophosphatemia, is inherited as a sex-linked dominant trait. If a male with hypophosphatemia marries a normal female, which of the following predictions concerning the potential progeny would be true?
A) All their sons would inherit the disease.
B) All their daughters would inherit the disease.
C) None of their sons would inherit the disease.
D) None of their daughters would inherit the disease.
E) Both b and c are true.

13. AC form
A form of Vitamin D resistant rickets, known as hypophosphatemia, is inherited as a sex-linked dominant trait. If a male with hypophosphatemia marries a normal female, all their a) sons b) daughters would inherit the disease.

13. TF form
A form of Vitamin D resistant rickets, known as hypophosphatemia, is inherited as a sex-linked dominant trait. If a male with hypophosphatemia marries a normal female, all their sons would inherit the disease.

14. MC form
When children do not express a trait unless at least one parent expresses it, it is an indication that the gene involved is
A) x-linked dominant
B) autosomal dominant
C) autosomal recessive
D) polygenetically inherited
E) skipping a generation

14. AC form
When children do not express a trait unless at least one parent expresses it, it is an indication that the gene involved is a) skipping a generation b) autosomal dominant.
14. TF form
When children do not express a trait unless at least one parent expresses it, it is an indication that the gene involved is skipping a generation.

15. MC form
If, in testing a genetic hypothesis you found a Chi-square of zero, you should
A) reject the hypothesis.
B) accept the hypothesis.
C) redo the experiment.
D) discard Mendelian genetics
E) discard the Chi-square method

15. AC form
If, in testing a genetic hypothesis you found a Chi-square of zero, you should a) accept the hypothesis b) redo the experiment.

15. TF form
If, in testing a genetic hypothesis you found a Chi-square of zero, you should redo the experiment.

16. MC form
If the somatic cells of a male were found to contain a Barr body in each of their nuclei, what would be the most likely genetic constitution of the individual?
A) XO
B) XX
C) XYY
D) XXY
E) XXX

16. AC form
If the somatic cells of a male were found to contain a Barr body in each of their nuclei, the most likely genetic constitution of the individual is a) XYY b) XXY.

16. TF form
If the somatic cells of a male were found to contain a Barr body in each of their nuclei, the most likely genetic constitution of the individual is XYY.

17. MC form
The theory of inheritance during Mendel's time was known as "blending". If this theory were correct, the outcome of a cross between a black animal and a white animal would produce offspring of what color?
A) Black
B) White
C) Spotted
D) Gray
E) Impossible to tell

17. AC form
The theory of inheritance during Mendel's time was known as "blending". If this theory were correct, the outcome of a cross between a black animal and a white animal would produce a) Spotted b) Gray offspring.

17. TF form
The current theory of inheritance during Mendel's time was known as "blending". If this theory were correct, the outcome of a cross between a black animal and a white animal would produce Spotted offspring.
18. MC form
The relative distance between linked genes may be determined by
A) cell fusion experiments
B) crossing over frequencies
C) epistasis
D) pleiotropism
E) a and b above

18. AC form
The relative distance between linked genes may be determined by a) crossing over frequencies b) cell fusion experiments.

18. TF form
The relative distance between linked genes may be determined by cell fusion experiments.

19. MC form
How many possible combinations of gametes could be produced by one individual in a trihybrid cross?
A) 3
B) 6
C) 8
D) 16
E) 64

19. AC form
The possible combinations of gametes that could be produced by one individual in a trihybrid cross is a) 6 b) 8.

19. TF form
The possible combinations of gametes that could be produced by one individual in a trihybrid is 8.

20. MC form
If you flip 3 coins, the probability of getting 2 heads and 1 tail is
A) 1/6
B) 1/8
C) 3/9
D) 3/8
E) none of these

20. AC form
If you flip 3 coins, the probability of getting 2 heads and 1 tail is a) 1/8 b) 3/8

20. TF form
If you flip 3 coins, the probability of getting 2 heads and 1 tail is 3/8.

FINAL EXAM QUESTIONS
TEST II

1. MC form
The main activity of science is to
A. observe nature.
B. make and test theories.
C. debate issues with organized religion.
D. create machines which will improve human society.
E. support political systems.

1. AC form
The main activity of science is to a) observe nature b) make and test theories.

1. TF form
The main activity of science is to make and test theories.

2. MC form
In scientific methodology, prediction means nearly the same as
A. interpretation of data.
B. generalization from empirical observation.
C. expectancy.
D. experimentation.
E. none of the above.

2. AC form
In scientific methodology, prediction means nearly the same as a) expectancy b) interpretation of data.

2. TF form
In scientific methodology, prediction means nearly the same as interpretation of data.

3. MC form
The lowest level of explanation is
A. a theory.
B. an hypothesis.
C. a fact.
D. an assumption.
3. AC form
The lowest level of explanation is a) an hypothesis b) an assumption.

3. TF form
The lowest level of explanation is an assumption.

4. MC form
Which of Mendel's procedures differed from those of his predecessors and contributed most to his success?
A. He kept breeding records.
B. He observed distinct inherited traits.
C. He observed many characteristics for each trait.
D. He quantitatively (statistically) analyzed his data.
E. He used one of the few organisms which can be grown in a laboratory.

4. AC form
Mendel's procedure that differed from those of his predecessors and contributed most to his success was that a) he observed distinct inherited traits b) he quantitatively (statistically) analyzed his data.

4. TF form
Mendel's procedure that differed from those of his predecessors and contributed most to his success was that he observed distinct inherited traits.

5. MC form
In radishes, long and round are alleles, as are red and white. In a cross between a long, red variety, and a round, white variety the F1 is oval and purple. How many different phenotypes would you expect to find in the F2?
A. 16
B. 9
C. 4
D. 3
E. 2

5. AC form
In radishes, long and round are alleles, as are red and white. In a cross between a long, red variety, and a round, white variety the F1 is oval and purple. The different phenotypes you would expect to find in the F2 is a) 4 b) 9.

5. TF form
In radishes, long and round are alleles, as are red and white. In a cross between a long, red variety, and a round, white variety the F1 is oval and purple. The different phenotypes you would expect to find in the F2 is 4.

6. MC form
A characteristic of a dominant trait is that
A. the trait never skips a generation.
B. the genotype can be determined directly from the phenotype.
C. the phenotype cannot be read from the genotype.
D. the homozygote for the trait can be distinguished from the heterozygote.
E. more than one above.
6. AC form
A characteristic of a dominant trait is that a) the trait never skips a generation b) the genotype can be determined directly from the phenotype.

6. TF form
A characteristic of a dominant trait is that the genotype can be determined directly from the phenotype.

7. MC form
Two black female mice are mated to a brown male. In several litters, Female I produced 9 blacks and 7 browns, Female II produced 57 blacks. Assuming black to be dominant over brown, what are the respective genotypes of the Female I, Female II, and the male?
A. Bb, BB, bb
B. BB, Bb, bb
C. Bb, bb, BB
D. bb, Bb, BB
E. BB, BB, bb

7. AC form
Two black female mice are mated to a brown male. In several litters, Female I produced 9 blacks and 7 browns, Female II produced 57 blacks. Assuming black to be dominant over brown, the respective genotypes of the Female I, Female II, and the male are a) BB, Bb, bb b) Bb, BB, bb.

7. TF form
Two black female mice are mated to a brown male. In several litters, Female I produced 9 blacks and 7 browns, Female II produced 57 blacks. Assuming black to be dominant over brown, the respective genotypes of the Female I, Female II, and the male are Bb, BB, bb.

8. MC form
A woman who has Turner syndrome is found to have hemophilia; yet neither of her parents have the disease. She
A. got the defective gene from her father.
B. got the defective gene from her mother.
C. could have gotten the defective gene from either parent.
D. could not have gotten the defective gene from either parent.
E. must be adopted since hemophilia is due to a dominant gene.

8. AC form
A woman who has Turner syndrome is found to have hemophilia; yet neither of her parents have the disease. She a) got the defective gene from her mother b) could have gotten the defective gene from either parent.

8. TF form
A woman who has Turner syndrome is found to have hemophilia; yet neither of her parents have the disease. She got the defective gene from her mother.
9. MC form
The extra Y chromosome of the XYY male was thought for some time to cause
A. stunted and stocky build in affected males.
B. above average intelligence.
C. above average strength.
D. sterility in such males.
E. aggressive and antisocial behavior.

9. AC form
The extra Y chromosome of the XYY male was thought for some time to cause a) sterility in such males b) aggressive and antisocial behavior.

9. TF form
The extra Y chromosome of the XYY male was thought for some time to cause sterility in such males.

10. MC form
What process probably occurs during meiosis to produce an XXY individual?
A. Segregation
B. Crossing over
C. Nondisjunction
D. Random assortment
E. None of these

10. AC form
The process that probably occurs during meiosis to produce an XXY individual is a) crossing over b) nondisjunction.

10. TF form
The process that probably occurs during meiosis to produce an XXY individual is crossing over.

11. MC form
Sex chromatin found in body cells and called Barr bodies have a relationship with the number of X chromosomes present in a given individual's body cells. If a given male had a sex chromosome composition of XXXXY, the number of Barr bodies observable in somatic tissue cells would be
A. five
B. four
C. three
D. two
E. none of these

11. AC form
Sex chromatin found in body cells and called Barr bodies have a relationship with the number of X chromosomes present in a given individual's body cells. If a given male had sex chromosome composition of XXXXY, a) three b) four Barr bodies would be observable in somatic tissue cells.

11. TF form
Sex chromatin found in body cells and called Barr bodies have a relationship with the number of X chromosomes present in a given individual's body cells. If a given male had sex chromosome composition of XXXXY, three Barr bodies would be observable in somatic tissue cells.

12. MC form
More than one man was responsible for proposing the one-gene, one-enzyme hypothesis. Those responsible were
A. Watson and Crick.
B. Mendel and Morgan.
C. Lysenko and Lamarck.
D. Beadle and Tatum.
E. None of these.

12. AC form
The men responsible for proposing the one-gene, one-enzyme hypothesis were a) Beadle and Tatum b) Watson and Crick.

12. TF form
The men responsible for proposing the one-gene, one-enzyme hypothesis were Watson and Crick.

13. MC form
When cells of certain bacteria are grown on glucose they do not produce beta-galactosidase (an enzyme which is important in breaking down lactose). However, when the same cells are placed in lactose they begin to make beta-galactosidase almost immediately. The results of this experiment support the hypothesis that
A. DNA is a genetic material.
B. genes are influenced by the environment
C. not all the genes are operative all the time.
D. B and C
E. A, B, and C

13. AC form
When cells of certain bacteria are grown on glucose they do not produce beta-galactosidase (an enzyme which is important in breaking down lactose). However, when the same cells are placed in lactose they begin to make beta-galactosidase almost immediately. The results of this experiment support the hypothesis that a) genes are influenced by the environment b) DNA is the genetic material.

13. TF form
When cells of certain bacteria are grown on glucose they do not produce beta-galactosidase (an enzyme which is important in breaking down lactose). However, when the same cells are placed in lactose they begin to make beta-galactosidase almost immediately. The results of this experiment support the hypothesis that genes are influenced by the environment.

14. MC form
Watson-Crick base pairing requires that the adenine content of one strand of DNA equals the
A. thymine content of the complementary strand
B. thymine content of the same strand
C. adenine content of the complementary strand
D. uracil content of the complementary strand
E. guanine content of the complementary strand

14. AC form
Watson-Crick base pairing requires that the adenine content of one strand of DNA equals the thymine content of the a) complementary b) same strand.

14. TF form
Watson-Crick base pairing requires that the adenine content of one strand of DNA equals the thymine content of the same strand.

15. MC form
A nucleotide consists of either a purine or pyrimidine, a five-carbon sugar and a
A. carbohydrate.
B. amino acid.
C. peptide.
D. phosphate group.
E. sulfate group.

15. AC form
A nucleotide consists of either a purine or pyrimidine, a five-carbon sugar and a) a phosphate group b) an amino acid.

15. TF form
A nucleotide consists of either a purine or pyrimidine, a five-carbon sugar and a phosphate group.

16. MC form
Amino-acids are carried to ribosomes by
A. messenger RNA.
B. transfer RNA.
C. proteins.
D. cytoplasmic DNA.
E. nuclear DNA.

16. AC form
Amino-acids are carried to ribosomes by a) messenger RNA b) transfer RNA.

16. TF form
Amino-acids are carried to ribosomes by messenger RNA.

17. MC form
According to the genetic code, a gene responsible for the formation of a protein of 200 amino-acid subunits should have
A. 200 nucleotides.
B. 400 nucleotides.
C. 600 nucleotides.
D. 800 nucleotides.
E. Dr Corcos and Dr Marinez, do you think we are math majors?

17. AC form
According to the genetic code, a gene responsible for the formation of a protein of 200 amino-acid subunits should have a) 600 b) 200 nucleotides.

17. TF form
According to the genetic code, a gene responsible for the formation of a protein of 200 amino-acid subunits should have 200 nucleotides.

18. MC form
In a DNA molecule, one strand contains the following sequence of bases A-G-A-T-C. Which of the following represents the complementary sequence on the other strand?
A. C-C-T-A-G
B. A-G-A-T-C
C. T-C-T-A-G
D. U-C-U-A-G
E. None of these

18. AC form
In a DNA molecule, one strand contains the following sequence of bases A-G-A-T-C.
The complementary sequence on the other strand is a) U-C-U-A-G b) T-C-T-A-G.

18. TF form
In a DNA molecule, one strand contains the following sequence of bases A-G-A-T-C. The complementary sequence on the other strand is T-C-T-A-G.

19. MC form
Mongoloid idiocy or Down's Syndrome is a consequence of abnormalities in:
A. sex-linked heredity
B. sex-influenced heredity
C. the number of sex chromosomes.
D. the number of autosomal chromosomes.
E. None of these

19. AC form
Mongoloid idiocy or Down's Syndrome is a consequence of abnormalities in the number of a) sex b) autosomal chromosomes.

19. TF form
Mongoloid idiocy or Down's Syndrome is a consequence of abnormalities in the number of sex chromosomes.

20. MC form
In recombinant DNA research, an alien gene is
A. treated with tRNA.
B. incorporated into a bacterial plasmid.
C. combined with a repressor substance.
D. mixed with histone proteins.
E. injected into the host cell with a sex pilus.

20. AC form
In recombinant DNA research, an alien gene is a) incorporated into a bacterial plasmid b) injected into the host cell with a sex pilus.

20. TF form
In recombinant DNA research, an alien gene is incorporated into a bacterial plasmid.

APPENDIX B

UNIVERSITY COMMITTEE ON RESEARCH INVOLVING HUMAN SUBJECTS (UCRIHS) APPROVAL LETTER

MICHIGAN STATE UNIVERSITY
UNIVERSITY COMMITTEE ON RESEARCH INVOLVING HUMAN SUBJECTS (UCRIHS)
233 ADMINISTRATION BUILDING
EAST LANSING, MICHIGAN 48824
(517) 355-2180

December 12, 1983

Ms. Nancy A. Maihoff
424 Administration Building

Dear Ms. Maihoff:

Subject: Proposal Entitled, "A Comparison of Alternate-Choice and True-False Item Form Used in Classroom Examinations"

UCRIHS review of the above referenced project has now been completed. I am pleased to advise that the rights and welfare of the human subjects appear to be adequately protected and the Committee, therefore, approved this project at its meeting on December 5, 1983.

You are reminded that UCRIHS approval is valid for one calendar year.
If you plan to continue this project beyond one year, please make provisions for obtaining appropriate UCRIHS approval prior to December 5, 1984.

Any changes in procedures involving human subjects must be reviewed by the UCRIHS prior to initiation of the change. UCRIHS must also be notified promptly of any problems (unexpected side effects, complaints, etc.) involving human subjects during the course of the work.

Thank you for bringing this project to our attention. If we can be of any future help, please do not hesitate to let us know.

Sincerely,

Henry E. Bredeck
Chairman, UCRIHS

HEB/jms

cc: Mehrens

MSU is an Affirmative Action/Equal Opportunity Institution

APPENDIX C

EXAM INSTRUCTION SHEET

A. CORCOS        NATURAL SCIENCE 115        FALL 1983
D. MARINEZ       FINAL EXAM                 FORM B

THERE ARE 85 ITEMS ON 15 PAGES OF THIS EXAMINATION. BE SURE YOU HAVE ALL OF THEM.

1. Check which form of the exam you have. The answer sheet should be BLUE if you have FORM A, and BROWN if you have FORM B. Raise your hand if this is not the case.
2. Print your last name and first initial, and your student number on the answer sheet, then darken the corresponding circles.
3. Each item is worth one point. Select the one best answer for each item.
4. Note that there are four- and five-choice multiple-choice items, alternate-choice (two-choice) items, and true-false items. Be sure to mark the appropriate space on the answer sheet by darkening the circle corresponding to the answer you select.
5. Do any calculations or scribbling on the last page of this exam booklet. If you mark answers on the test be sure you transfer the answers to the answer sheet.
6. If you make stray marks on the answer sheet or fail to erase completely the answer you wish to change, your response will not be counted.
7. Keep the marked part of your answer sheet covered at all times.
8. Your score on this examination will be the number of answers you marked correctly.
Try to answer each item, but do not spend too much time on any one item.
9. If you have any questions, ask your instructor now, before starting the examination.

Good luck.

Happy Holiday Season
Feliz Navidad y Prospero Ano Nuevo
Joyeux Noel et Bonne Annee

APPENDIX D

ITEM JUDGMENT INSTRUCTION AND RECORDING SHEETS

ALTERNATE-CHOICE ITEM FORM JUDGMENT TASK
SECOND MIDTERM EXAM

Directions: The following pages contain 20 alternate-choice items asked on a Natural Science 115 Midterm Examination. Each item is written in two forms: with the correct answer placed first, and with the incorrect answer placed first. Your task is to choose the form for each item that will, in your judgment, simultaneously maximize the chance of a correct answer from a student knowing the material and the chance of an incorrect answer from an uninformed student. The process suggested is to review all 20 items and select the form that you judge to be the best form of the item. Transfer your choice of item form to the enclosed RECORDING SHEET, by writing the number of each item in either the CORRECT ANSWER FIRST column or in the INCORRECT ANSWER FIRST column. Thank you very much.

RECORDING SHEET
ALTERNATE-CHOICE BEST ITEM FORM

CORRECT ANSWER FIRST        INCORRECT ANSWER FIRST

Note: Please add more lines to a column if you need them.

ALTERNATE-CHOICE ITEM FORM JUDGMENT TASK
FINAL EXAM

Directions: The following pages contain 20 alternate-choice items asked on a Natural Science 115 Final Examination. Each item is written in two forms: with the correct answer placed first, and with the incorrect answer placed first. Your task is to choose the form for each item that will, in your judgment, simultaneously maximize the chance of a correct answer from a student knowing the material and the chance of an incorrect answer from an uninformed student.
The process suggested is to review all 20 items and select the form that you judge to be the best form of the item. Transfer your choice of item form to the enclosed RECORDING SHEET, by writing the number of each item in either the CORRECT ANSWER FIRST column or in the INCORRECT ANSWER FIRST column. Thank you very much.

RECORDING SHEET
ALTERNATE-CHOICE BEST ITEM FORM

CORRECT ANSWER FIRST        INCORRECT ANSWER FIRST

Note: Please add more lines to a column if you need them.

TRUE OR FALSE ITEM FORM JUDGMENT TASK
SECOND MIDTERM EXAM

Directions: The following pages contain 20 items asked on a Natural Science 115 Midterm Examination. Each item is written in two forms: as a true statement, and as a false statement. Your task is to choose the form for each item that will, in your judgment, simultaneously maximize the chance of a correct answer from a student knowing the material and the chance of an incorrect answer from an uninformed student. An additional constraint within which you must work is to ultimately identify only 8 true items and 12 false items from the total 20 items. The process suggested is to review all 20 items and select those that you judge to be best as a true or as a false item. Then determine how many items you put in each category. If the number of items selected exceeds 8 true forms, then review again your choices so that only 8 are identified. Transfer your choice of item form to the enclosed RECORDING SHEET, by writing the number of each item in either the TRUE FORM BEST column or in the FALSE FORM BEST column. Thank you very much.

RECORDING SHEET
TRUE/FALSE BEST ITEM FORM

TRUE FORM BEST        FALSE FORM BEST

TRUE OR FALSE ITEM FORM JUDGMENT TASK
FINAL EXAM

Directions: The following pages contain 20 items asked on a Natural Science 115 Final Examination. Each item is written in two forms: as a true statement, and as a false statement.
Your task is to choose the form for each item that will, in your judgment, simultaneously maximize the chance of a correct answer from a student knowing the material and the chance of an incorrect answer from an uninformed student. An additional constraint within which you must work is to ultimately identify only 8 true items and 12 false items from the total 20 items. The process suggested is to review all 20 items and select those that you judge to be best as a true or as a false item. Then determine how many items you put in each category. If the number of items selected exceeds 8 true forms, then review again your choices so that only 8 are identified. Transfer your choice of item form to the enclosed RECORDING SHEET, by writing the number of each item in either the TRUE FORM BEST column or in the FALSE FORM BEST column. Thank you very much.

RECORDING SHEET
TRUE/FALSE BEST ITEM FORM

TRUE FORM BEST        FALSE FORM BEST

APPENDIX E

POINT-BISERIAL CORRELATIONS OF EXPERIMENTAL ITEMS - TESTS I, II, AND III

Table 24
Point-biserial Correlations of Experimental Items - Tests I, II, and III

TEST I

Item      r pbis      Item      r pbis
Group I               Group II
AC1       .1778       TF1       .4148
AC2       .6591       TF2       .5495
AC3       .3655       TF3       .1234
AC4       .3579       TF4       .3569
AC11      .5232       TF11      .3846
AC12      .3955       TF12      .4033
AC13      .4336       TF13      .3446
AC14      .3534       TF14      .3728
AC15      .1940       TF15      .3491
Group II              Group I
TF6       .1223       AC6       .2918
TF7       .3803       AC7       .2585
TF8       .2184       AC8       .3494
TF9       .2760       AC9       .3058
TF10      .4005       AC10      .2905
TF16      .2729       AC16      .3857
TF17      .2424       AC17      .4188
TF18      .3373       AC18      .2817
TF19      .4846       AC19      .3692
TF20      .5242       AC20      .3016

TEST II

Item      r pbis      Item      r pbis
Group I               Group II
AC1       .2896       TF1       .3030
AC2       .4132       TF2       .4147
AC3       .4232       TF3       .3665
AC4       .4726       TF4       .3715
AC5       .4027       TF5       .2299
AC11      .4039       TF11      .3474
AC12      .4256       TF12      .3521
AC13      .3615       TF13      .1574
AC14      .2906       TF14      .3732
AC15      .3171       TF15      .2102
Group II              Group I
TF6       .5012       AC6       .4474
TF7       .3163       AC7       .1915
TF8       .4329       AC8       .4661
TF9       .4264       AC9       .4226
TF10      .4921       AC10      .2986
TF16      .4423       AC16      .4791
TF17      .2701       AC17      .4745
TF18      .3121       AC18      .3484
TF19      .4085       AC19      .3402
TF20      .5071       AC20      .1820

TEST III

Item      r pbis      Item      r pbis
Group I               Group II
AC31ci    .0748       AC31ic    .1535
AC32ci    .4141       AC32ic    .0957
AC34ci    .2318       AC34ic    .3236
AC37ci    .2548       AC37ic    .4973
AC39ci    .1298       AC39ic    .2551
AC40ci    .1986       AC40ic    .2499
AC41ci    .6497       AC41ic    .4733
AC42ci    .3757       AC42ic    .6174
AC43ci    .5076       AC43ic    .1933
AC45ci    .4948       AC45ic    .3934
Group II              Group I
AC33ci    .2549       AC33ic    .4899
AC35ci    .2911       AC35ic    .4878
AC36ci    .2985       AC36ic    .4537
AC38ci    .3834       AC38ic    .3439
AC44ci    .3706       AC44ic    .4090
AC46ci    .2357       AC46ic    .2907
AC47ci    .4721       AC47ic    .2868
AC48ci    .3827       AC48ic    .2662
AC49ci    .4778       AC49ic    .3235
AC50ci    .1170       AC50ic    .2155

LIST OF REFERENCES

Ahmann, J.S., & Glock, M.D. (1967). Evaluating pupil growth (3rd ed. revised). Boston: Allyn and Bacon.
Alpert, R., & Haber, R.N. (1960). Anxiety in academic achievement situations. Journal of Abnormal and Social Psychology, 61, 207-215.
Brenner, M.H. (1964). Test difficulty, reliability, and discrimination as functions of item difficulty order. Journal of Applied Psychology, 48, 98-100.
Brown, F.G. (1970). Principles of educational and psychological testing. Hinsdale: Dryden.
Burmester, M.A., & Olson, L.A. (1966). Comparison of item statistics for items in multiple-choice and in alternative-response form. Science Education, 50, 467-470.
Carrier, N.A., & Jewell, D.O. (1966). Efficiency in measuring the effect of anxiety upon academic performance. Journal of Educational Psychology, 57, 23-26.
Carter, H. (1942). How reliable are the common measures of difficulty and validity of objective test items? Journal of Psychology, 13, 31-39.
Charles, J.W. (1926). A comparison of five types of objective tests in elementary psychology. Unpublished doctoral dissertation, State University of Iowa.
Cochran, W.G., & Cox, G.M. (1957). Experimental designs (2nd ed.). New York: Wiley.
Cronbach, L.J. (1970). Essentials of psychological testing. New York: Harper and Row.
Davis, F.B. (1951). Item selection techniques. In E.F. Lindquist (Ed.), Educational measurement (pp. 266-328). Washington, D.C.: American Council on Education.
Ebel, R.L. (1979). Essentials of educational measurement (3rd ed.). Englewood Cliffs, NJ: Prentice-Hall.
Ebel, R.L. (1982). Proposed solutions to two problems of test construction. Journal of Educational Measurement, 19, 267-278.
Feldt, L.S. (1969). A test of the hypothesis that Cronbach's alpha or Kuder-Richardson coefficient twenty is the same for two tests. Psychometrika, 34, 363-373.
Frisbie, D.A. (1971). Comparative reliabilities and validities of true-false and multiple-choice tests. Unpublished doctoral dissertation, Michigan State University.
Gibbons, C.C. (1940). The predictive value of the most valid items of an examination. Journal of Educational Psychology, 31, 616-621.
Glass, G.V., & Stanley, J.C. (1970). Statistical methods in education and psychology. Englewood Cliffs, NJ: Prentice-Hall.
Gronlund, N.E. (1965). Measurement and evaluation in teaching. New York: Macmillan.
Gronlund, N.E. (1977). Constructing achievement tests (2nd ed.). Englewood Cliffs, NJ: Prentice-Hall.
Huck, S.W., & Bowers, N.D. (1972). Item difficulty level and sequence effects in multiple-choice achievement tests. Journal of Educational Measurement, 9, 105-111.
Klosner, N.C., & Gellman, E.K. (1973). The effect of item arrangement on classroom test performance: Implications for content validity. Educational and Psychological Measurement, 33, 413-418.
Lindquist, E.F. (1956). Design and analysis of experiments. Boston: Houghton Mifflin.
Loree, M.H. (1948). A study of a technique for improving tests. Unpublished doctoral dissertation, University of Chicago.
Mandler, G., & Sarason, S.B. (1952). A study of anxiety and learning. Journal of Abnormal and Social Psychology, 47, 166-173.
Marso, R.N. (1970). Test item arrangement, testing time, and performance. Journal of Educational Measurement, 7, 113-118.
McKeachie, W.J., Pollie, D., & Speisman, J. (1955). Relieving anxiety in classroom examinations. Journal of Abnormal and Social Psychology, Monk, J.J., & Stallings, W.M. (1970). Effects of item order on test scores. Journal of Educational Research, 62, 463-465. Munz, D.C., & Smouse, A.D. (1968). Interaction effects of item- difficulty sequence and achievement-anxiety reaction on academic performance. Journal of Educational Psychology, 22’ 370-374. Oosterhoff, A.C., & Glasnapp, D.R. (1974). Comparative reliabilities and difficulties of the multiple-choice and true-false formats. Journal of Experimental Education, 42, 62-64. 122 Owens, R.E., Hanna, G.S., & Coppedge, F.L. (1970). Comparison of multiple-choice tests using different types of distractor selection techniques. Journal of Educational Measurement, 1) 87-90. Plake, 8.8. (1980). Item arrangement and knowledge of arrangement on test scores. Journal of Experimental Education, 423 56-58. Ruch, G.M., & Charles, J.W. (1928). A comparison of five types of objective tests in elementary psychology. Journal of Applied Psychology, 122 398-403. Ruch, G.M., & Stoddard, G.D. (1925). Comparative reliabilities of five types of objective examinations. Journal of Educational Psychology, 16, 89-103. Smith, K. (1958). An investigation of the use of "double-choice" items in testing achievement. Journal of Educational Research, 21’ 387-389. Smouse, A.D., & Munz, D.C. (1968). The effects of anxiety and item difficulty sequence on achievement testing scores. Journal of Psychology, 683 181-184. Stanley, J.C. (1961). Studying status versus manipulating variables. In R.O. Collier & Elam, S.M. (Eds.). Research design and analysis. Bloomington IN: Phi Delta Kappan. Williams, B.J., & Ebel, R.L. (1957). The effect of varying the number of alternatives per item on multiple-choice vocabulary test items. The 14th Yearbook of the National Council on Measurements Used in Education (pp. 63-65). East Lansing, MI: Michigan State University. 
Zuckerman, M. (1960). The development of an affect adjective check list for the measurement of anxiety. Journal of Consulting Psychologyg 2&9 457-462. GENERAL REFERENCES Mehrens, W.A., & Lehmann, I.J. (1978). Measurement and evaluation in education and psychology (2nd ed.). New York: Holt, Rinehart and Winston. Nie, N.H., Hull, C.H., Jenkins, J.G., Steinbrenner, K., Bent, D.H. (1975). Statistical package for the social sciences (2nd ed.). New York: McGraw-Hill. Raj, D. (1972). Design of sample surveys. NY: McGraw-Hill. 123 uni ll 1111111, Illlllls “4 “1 m3 “0 E N3 M9 “2 31 llllllllllllll