THESIS

This is to certify that the thesis entitled "A Comparison of the Quality of Selected Multiple-Choice Item Types within Medical School Examinations" presented by Julie G. Nyquist has been accepted towards fulfillment of the requirements for the Ph.D. degree in Measurement and Educational Psychology.

Major professor

Date: May 20, 1981

A COMPARISON OF THE QUALITY OF SELECTED MULTIPLE-CHOICE ITEM TYPES WITHIN MEDICAL SCHOOL EXAMINATIONS

By Julie G. Nyquist

A DISSERTATION

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of DOCTOR OF PHILOSOPHY, Department of Counseling and Educational Psychology, 1981

Copyright by Julie G. Nyquist, 1981

ABSTRACT

A COMPARISON OF THE QUALITY OF SELECTED MULTIPLE-CHOICE ITEM TYPES WITHIN MEDICAL SCHOOL EXAMINATIONS

by Julie G. Nyquist

The purpose of this study was to compare groups of test items, selected on the basis of format or item-writing rule violation, for psychometric quality, based on data collected from the administration of regular classroom tests within medical education. Five item types were identified for study, as were four relevant comparisons between them. All five item types are variants of the basic multiple-choice item format. The two formats chosen for comparison in this study were the traditional single-answer multiple-choice format and a multiple-answer multiple-choice format used primarily in medical education and known as Type K on the National Board of Medical Examiners' examinations. The two rule violations are: use of negatively worded stems, and use of incomplete stems, i.e., stems which do not state specifically the question to be answered. A total of 718 test items were used in the study.
All of the items came from the College of Human Medicine at Michigan State University and were taken from item pools composed of test items previously used on regular classroom tests in each subject area. In the total study, 387 pairs of items were selected: 124 in the single-answer multiple-choice vs. multiple-answer multiple-choice comparison, 51 in the single-answer multiple-choice vs. uninformative-stem multiple-choice comparison, 69 in the multiple-answer multiple-choice vs. uninformative-stem multiple-answer multiple-choice comparison, and 143 in the single-answer multiple-choice vs. negative-stem multiple-choice comparison. These 387 item pairs originated on 40 separate exams. Content was controlled by using pairs of items from the same test administration which were keyed to the same topic area. In the item selection process, items with errors not being studied were eliminated. Also, the average number of options per item was controlled for comparisons including items with varying numbers of options.

Three item statistics were chosen as estimates of item quality: p, the proportion getting the item correct; D, the upper-lower discrimination index; and rBIS, the biserial item-total correlation coefficient. Univariate repeated measures ANOVAs were used to test the three hypotheses related to each of the four comparisons. The level of significance for the F-tests was set at .05. However, since a univariate analysis was used, the Bonferroni approach for correction of a possible inflation of alpha (the Type I error rate) was applied. With this method the desired alpha level is divided by the number of comparisons (i.e., .05 is divided by 3), so the cutoff for significance was set at .017.
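The three item statistics named above can be computed directly from examinees' response data. The following is a minimal sketch, assuming dichotomous (0/1) item scoring and a 27% upper-lower split for D (a common convention; the split the study actually used is not stated here). The function name and the sample data are illustrative, not taken from the dissertation.

```python
from statistics import NormalDist
import math

def item_statistics(item_scores, total_scores, tail=0.27):
    """Estimate p, D, and r_bis for one dichotomously scored item.

    item_scores: 0/1 per examinee; total_scores: each examinee's total
    test score. Assumes 0 < p < 1 (the degenerate cases are undefined).
    """
    n = len(item_scores)
    p = sum(item_scores) / n  # proportion answering correctly

    # D: proportion correct in the top-scoring group minus the
    # proportion correct in the bottom-scoring group.
    k = max(1, round(tail * n))
    order = sorted(range(n), key=lambda i: total_scores[i])
    D = (sum(item_scores[i] for i in order[-k:]) / k
         - sum(item_scores[i] for i in order[:k]) / k)

    # Biserial item-total correlation:
    #   r_bis = (M_p - M_q) / s_t * (p * q / y),
    # where M_p, M_q are mean totals of those passing/failing the item,
    # s_t is the (population) SD of totals, and y is the standard normal
    # ordinate at the z-score cutting off proportion p.
    q = 1 - p
    mean_t = sum(total_scores) / n
    s_t = math.sqrt(sum((t - mean_t) ** 2 for t in total_scores) / n)
    m_p = (sum(t for t, s in zip(total_scores, item_scores) if s)
           / sum(item_scores))
    m_q = (sum(t for t, s in zip(total_scores, item_scores) if not s)
           / (n - sum(item_scores)))
    z = NormalDist().inv_cdf(p)
    y = math.exp(-z * z / 2) / math.sqrt(2 * math.pi)
    r_bis = (m_p - m_q) / s_t * (p * q / y)
    return p, D, r_bis
```

The Bonferroni cutoff quoted above follows the same arithmetic as in the text: .05 / 3 ≈ .0167, which rounds to the .017 used in the study.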
Results of all analyses based on item type performed to test the hypotheses of the study are summarized below:

Hypothesis I - In the comparison of SA-MC and MA-MC items, the MA-MC items were found to be significantly more difficult and less discriminating than the SA-MC items.

Hypothesis II - In the comparison of SA-MC and US-MC items, the US-MC items were found to be significantly less difficult than the SA-MC items. There were no significant differences based on discrimination.

Hypothesis III - There were no significant differences in either difficulty or discrimination found when MA-MC and US-MA-MC items were compared.

Hypothesis IV - In the comparison of SA-MC and NS-MC items, the NS-MC items were found to be significantly less discriminating than the SA-MC items. There was no significant difference in difficulty between the two item types.

It was concluded that: item format affected both average difficulty and discrimination; stem orientation, positive versus negative, affected item discrimination but not difficulty; and informativeness of the item stem had some effect on item difficulty but no statistically significant effect on discrimination.

DEDICATION

To my daughter Stephanie Kraig Olivero, who brings joy to my life and makes the struggle worthwhile.

ACKNOWLEDGEMENTS

I am glad to finally have this opportunity to acknowledge all of the people who have given me support and assistance throughout my doctoral program. First and foremost I would like to thank the entire Educational Psychology faculty for their encouragement, their excellent teaching, and their faith in my ability. Special thanks beyond this go to Drs. Robert Ebel, Larry Lezotte, and Bill Mehrens for always being there, willing to listen to me and advise when appropriate. I would like also to acknowledge those within medical education who helped provide my financial support and further helped me develop the direction for my professional career. Drs.
Tom Parmeter, Jack Jones, Steve Downing, Marty Anderson, Dan English, and Verda Scheifley all have my deepest respect and regard for their contributions to my development as a professional, especially Tom. Thank you from the bottom of my heart, Dr. J. Thomas Parmeter, for caring enough to allow me the time and space to be a mother as well as become a medical educator. I am also happy to acknowledge those who helped make the process of working toward a doctorate enjoyable. A warm thank you to my fellow doctoral candidates and friends, all of whom finished before I did but none of whom forsook me: Drs. Robert Griffore, John Molidor, and James Haf. Finally, I would like to acknowledge the assistance of Harry Davis at the Medical College of Georgia for his extremely valuable help in completing the analysis of my data. Thank you, Harry, for all of your time and persistent effort. It is my profound hope that I am developing into a professional worthy of the caring support I have received from those mentioned above and many others as well.

TABLE OF CONTENTS

Chapter I. THE PROBLEM
  Introduction
  Various Forms of Multiple-Choice Items
  Conventional Rules for Item Writing
  Some Highlights of Previous Studies
  Need for Further Research
  Context of the Study
  Specific Questions to be Answered
  Summary

Chapter II. REVIEW OF RELATED RESEARCH
  Multiple-Answer Form
  Negative Stem Form
  Uninformative Stem Form
  Number of Options
  Other Rule Violations
  Summary

Chapter III. PROCEDURES AND DESIGN
  Introduction
  Source of Items
  Item Selection and Content Control
  Measures of Item Quality
  Design
  Hypotheses and Analysis Methods
  Summary

Chapter IV. RESULTS
  Introduction
  Results Based on Item Type
  Results of the SA-MC versus MA-MC Comparison
  Results of the SA-MC versus US-MC Comparison
  Results of the MA-MC versus US-MA-MC Comparison
  Results of the SA-MC versus NS-MC Comparison
  Results Based on Content Orientation
  Summary

Chapter V. DISCUSSION AND CONCLUSIONS
  Introduction
  Discussion
  Conclusions
  Suggestions for Future Research

APPENDICES
  A. Source of Items - Breakdown by Exam of Origin
  B. Sample Item Pairs: SA-MC vs. MA-MC Comparison

BIBLIOGRAPHY

LIST OF TABLES

1.1  Frequency of Rule Violations in a Survey of Over 3,000 Items on Exams in the College of Human Medicine at Michigan State University
     Number of Items Used in Each of the Four Research Comparisons
     Source of Item Pairs by Content Area and Comparison
     List of Topics Used for Item Selection within the Focal Problem Areas and Separate Clinical Specialties
     Illustration of Steps 3, 4 and 5 of the Item Selection Procedure Using the SA-MC vs. MA-MC Comparison and the Spring 1976 Elevated BUN Exam as the Example
     Sample SA-MC vs. MA-MC Item Pairs from the Elevated BUN Spring 1976 Illustration
     Mean Values for All Relevant Groups in Comparisons Based on Item Types
     Results for the SA-MC vs. MA-MC Comparison Two-Way Repeated Measures ANOVA, n = 124
     Results for the SA-MC vs. US-MC Comparison Two-Way Repeated Measures ANOVA, n = 51
     Results for the MA-MC vs. US-MA-MC Comparison Two-Way Repeated Measures ANOVA, n = 69
     Results for the SA-MC vs. NS-MC Comparison Two-Way Repeated Measures ANOVA, n = 143
     Mean Values for All Relevant Groups in Comparisons Based on Content Area
     Sample SA-MC vs. US-MC Item Pairs
     Sample NS-MC Items with Poorly Focused Stems (PFS Items)
     SA-MC vs. NS-MC Comparison, Grouped Frequency Distribution Based on D and rBIS

CHAPTER I

THE PROBLEM

INTRODUCTION

Teacher-made achievement tests are used widely in medical education, as in other areas of education. In non-graded systems they are used to help determine whether students pass or fail; in programs using traditional letter grades, the students' scores on any particular test may constitute anywhere from a small proportion to 100% of the total grade for a course. The quality of classroom achievement tests is therefore very important and of major concern to measurement specialists. The most important consideration in achievement testing is the quality of the total test as reflected by reliability and validity. Since a test is composed of individual items, the quality of the test will be influenced by the quality of each item. Technical quality of any item, in terms of item difficulty and item discrimination, cannot be computed until after the test is administered. Therefore measurement specialists look for factors which tend to consistently affect item quality. Knowledge of these factors is especially important in classroom achievement testing because there is no test tryout to eliminate poor items. The first time the test is given is often the last time as well; student performance is scored and grades given on the basis of this initial administration. What factors consistently affect item quality? Item format can affect item quality.
For instance, the vast majority of studies comparing single-answer multiple-choice items to true-false items have shown the single-answer multiple-choice item to be generally more discriminating than true-false items, even when the statistics were adjusted for differences in testing time per item. This does not mean that excellent true-false items cannot be written by experts. They can be and are. It does, however, indicate that item format can make a difference in typical usage. The primary purpose of this study is to answer the question: "Does typical use of the multiple-answer multiple-choice (MA-MC) format result in items which differ in terms of difficulty or discrimination from items written in the single-answer multiple-choice format?"

The other major factor thought to affect item quality is the violation of item-writing principles. These principles are intended to guide item writers and, when followed, should result in high-quality items (naturally, only if combined with expert knowledge of the subject matter and understanding of the students to be tested). Violations may not always result in items of poor quality, but because rule violations are often merely the visual evidence of poor-quality thinking, poor-quality items are often the result. The second purpose of this study is to address the question: "What effect does violation of either of the two rules specified have on the quality of items found on typical examinations within medical education, measured using average item difficulties and discrimination indices?" These two violations refer to items with negatively phrased stems and items with incomplete, uninformative stems.

VARIOUS FORMS OF MULTIPLE-CHOICE TEST ITEMS

Many forms of multiple-choice test items have been used in the last fifty years. The five forms under study here are those most commonly found within medical education.
Below is an example of each form along with relevant directions:

Directions: Please select the ONE best answer for each item and fill in the number corresponding to your choice on the answer sheet provided.

Single-Answer Multiple-Choice (SA-MC) Form - Example:

The diagnosis of intestinal amebiasis depends upon
1. identification of mucosal lesions by sigmoidoscopy
2. organism in the stool or tissue
3. increased antibody titers to E. histolytica
4. characteristic hepatic abscess
5. positive response to antiamebic drugs

Uninformative Stem Multiple-Choice (US-MC) Form - Example:

Trophozoites of E. histolytica
1. survive outside of the animal host, following passage in the stool
2. can readily be found in the stools of asymptomatic carriers
3. can live in the lumen or walls of the large intestine
4. usually produce generalized inflammation of the intestinal wall

Negative Stem Multiple-Choice (NS-MC) Form - Example:

Antiviral chemotherapy is limited for all of the following reasons except
1. much of the viral activity is dependent upon cell function
2. viral disease becomes evident only after extensive multiplication of the virus within the host
3. therapeutic disruption of viral multiplication is often accompanied by host cell death
4. the virus' proteins are the same as the host cell's proteins

Directions: For each of the questions or incomplete statements below, ONE or MORE of the answers or completions given is correct.
On the answer sheet fill in the space under

1. if only A, B, and C are correct
2. if only A and C are correct
3. if only B and D are correct
4. if only D is correct
5. if all are correct

FILL IN ONLY ONE SPACE ON YOUR ANSWER SHEET FOR EACH QUESTION

Multiple-Answer Multiple-Choice (MA-MC) Form - Example:

Characteristics frequently encountered in virulent viruses include
A. ability to multiply well despite elevated host temperatures (fever)
B. poor induction of interferon
C. resistance to interferon inhibitory action
D. a DNA genome instead of RNA

Uninformative Stem Multiple-Answer Multiple-Choice (US-MA-MC) Form - Example:

Disseminated viral infections
A. involve only the target organ in viral multiplication
B. are caused by viruses which can find host cells (with appropriate receptor sites) only in the target organism
C. can be controlled by treatment with interferon at time of initial symptoms
D. may have several viremic stages

The MA-MC format was chosen for comparison because it is commonly used in medical education by classroom teachers in constructing tests used to make decisions about student performance. In a survey by the author of over 3,000 items found on exams in the College of Human Medicine at Michigan State University, the average usage for this format was 17%; however, usage varied from under 5% to over 50% of the total items. What makes this particularly important is that these tests are graded on a pass-fail basis, using the same cut-off score for all Year One and Year Two exams. It is therefore important to know if items written in this format are comparable, in difficulty and ability to discriminate, to the more commonly used SA-MC items. This is a decision-oriented reason for studying this item format: to provide information useful in making specific educational decisions. There is a second, conclusion-oriented rationale for studying this format.
The idea of using a multiple-answer objective-type format has been discussed in the literature since Cronbach's article in 1939. Some authors have favored this general type while others have opposed its use. Research results (summarized in a later section) have been equivocal, partially due to the variety of formats included within this general item type. The MA-MC format as defined here provides one consistent format, used frequently at least in medical education, that can be studied in its natural environment. This particular format has been used frequently enough that item data from many exams can be collected and used to compare this format to the single-answer format. This study should therefore be able both to provide information needed to make specific decisions and to make a general contribution toward further understanding of the performance of the multiple-response type of multiple-choice item in comparison to the single-response multiple-choice item.

CONVENTIONAL RULES FOR ITEM WRITING

Principles of Item Writing

There are three basic components of any multiple-choice test item: the item idea; the item stem (the statement posing the question); and the answer set (the possible answers). Each basic rule of item writing is directed toward one or more of these components. The rules are based on logic and common sense, and in some cases backed up by research findings. In general terms: the item idea should both precede and direct the writing of the test item. The item stem should clearly relate to the item idea, be concisely stated, and direct the examinee to the answer set. The options or alternatives within the answer set should all be plausible possible answers to the question posed by the stem. Anything which gets in the way of clear communication of the intent of the item to the examinee, or which inappropriately confuses or cues the examinee, should be avoided.
The recommended rules for item writing have remained quite stable over time and consistent between various authors. This is reflected in the close similarity between Ebel's article, "Writing the Test Item," in the 1951 first edition of Educational Measurement, and the article of the same name by Wesman in the 1971 second edition. The set of rules used in the design of the present research reflects the overall body of suggestions and further reflects past research directed toward discovering the effect of rule violations on item quality. Following is a list of relevant item-writing rules, including the rule, the part of the item the rule is directed toward, and the relationship of the rule to the present study:

1. The item idea should be a single coherent thought. (Item idea) - Violation of this rule is not tested directly in this study.

2. The item should clearly relate to an objective of instruction. It should be worth testing and at the level of the students tested. (Item idea) - All the items used in this study were screened by medical faculty both before and after the tests were given. All items with obvious content problems were eliminated from the item pools and from this study.

3. The stem should pose a specific question. (Item stem) - The basis for two research questions in this study.

4. The stem should be positively worded. (Item stem) - The basis for one research question in this study.

5. The item should be clearly and concisely worded so that the intent of the item is clear. (Item stem) - All items which were obviously confusing were eliminated during faculty review. However, items which had stems that relayed insufficient information were not eliminated and are being studied here.

6. One and only one correct answer should be included in the answer set. (Answer set) - Items with more than one correct answer were eliminated in the faculty review.

7. All responses should be grammatically consistent with the stem.
(Answer set) - Items violating this rule were eliminated from the study.

8. Complex alternatives should be avoided, as should the options "all of the above" and "none of the above." (Answer set) - Items violating this rule were eliminated from the study.

9. Avoid writing items where the correct answer is significantly longer and contains significantly more qualification than the incorrect responses. (Answer set) - Items violating this rule were eliminated from the study.

Rule Violations

Two rule violations were chosen for study in the present research: 1) items with negatively phrased stems, and 2) items with stems which are uninformative, i.e., stems which do not pose an answerable question.

Examples of negative stems:
- All of the following are characteristics of the nephrotic syndrome except:
- For treatment of all but one of the following poisons an emetic would be used. Mark the exception.
- Which of the following is not high in protein?

Examples of uninformative stems:
- Which of the following is true?
- Achalasia
- N. Meningitis

These two violations were selected primarily because of their relatively high level of incidence. In a survey of over 3,000 test items from exams given in the College of Human Medicine at Michigan State University, it was found that 40% of all single-answer M-C items had at least one violation of item-construction rules, while over 30% of the multiple-answer multiple-choice items had a violation. The two violations mentioned above were by far the most commonly occurring. Items displaying these two rule violations were found on every exam surveyed, although the prevalence varied from test to test. Table 1.1 indicates the overall frequency of occurrence of each of these violations for the single-answer and multiple-answer formats.
TABLE 1.1
FREQUENCY OF RULE VIOLATIONS IN A SURVEY OF OVER 3,000 ITEMS ON EXAMS IN THE COLLEGE OF HUMAN MEDICINE AT MICHIGAN STATE UNIVERSITY

RULE VIOLATION        % OF SA-MC   % OF MA-MC
No Violation              60%          70%
Negative Stem             20%           -
Uninformative Stem        10%          25%
All Other Errors          10%           5%
Total                    100%         100%

SOME HIGHLIGHTS OF PREVIOUS STUDIES

A review of the literature indicates that there are three basic methods of selecting the items to be used in comparisons of item forms:

1) New items - all items for each format compared are new and untried.

2) Recast items -
   a. Format A - old items are used which have been screened for appropriate levels of difficulty and high levels of discrimination. Format B - the items from Format A are recast into Format B and are untried.
   b. Format A - old items, pretried. Format B - recast items, also pretried.

3) Old items - items are used for comparison as they appeared on the original tests.

An early study by Eurich (1931) illustrates the use of all new items. Eurich compared four item types: essay, completion, SA-MC, and T-F. In constructing the tests, a traditional essay-type examination was prepared to cover the salient points of the subject matter. Specimen answers were then written out in detail for each essay question. Then, using the essay questions and detailed answers, completion, SA-MC, and T-F items were written independently. The four subtests were then administered to the student groups and reliabilities (odd-even) were computed. This was done for two separate courses, an educational psychology course and a statistical methods course.

The results in terms of reliability coefficients:

                         Essay   Completion   SA-MC    T-F
Exp I   As constructed    .69       .72        .71   (illegible)
        If made 60 min.   .79       .84        .88     .74
Exp II  As constructed    .56       .80        .75     .69
        If made 60 min.   .69       .89        .90     .87

In both experiments, the completion and SA-MC types were the most reliable. No significance data were available.
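Eurich's "if made 60 min." rows project each subtest's reliability to a common testing time. Such projections are conventionally made with the Spearman-Brown prophecy formula; the sketch below illustrates the calculation (the function name is ours, and the lengthening factors Eurich actually used are not given in the text, so the example value is purely illustrative).

```python
def spearman_brown(r, k):
    """Predicted reliability of a test lengthened by factor k,
    given observed reliability r (Spearman-Brown prophecy formula)."""
    return k * r / (1 + (k - 1) * r)

# e.g., doubling a subtest whose observed odd-even reliability is .69:
print(round(spearman_brown(0.69, 2), 3))  # prints 0.817
```

Note that the formula also works for shortening (k < 1), which is how reliabilities of subtests with different per-item times can be put on a common footing.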
This method is perfectly legitimate and results in comparisons of expert-written items of varying formats. A study of this nature answers the research question: "How do items written in Format A and Format B compare in terms of difficulty and discrimination if the items are written by experts and content is closely controlled?"

A study done by Oosterhof and Glasnapp (1972) illustrates the use of the first type of recast items. The authors chose 40 single-answer M-C items from a pool of 100 items, based on relationship to course objectives and previously shown high discrimination indices. They then created 40 new true and 40 new false items from these items. The new true-false items were created in a very legitimate manner, using the stem of the multiple-choice question plus the correct answer as the true items, and the stem plus the most discriminating distractor as the false items. They then administered these two groups of items, old proven multiple-choice and new untried T-F, to 101 undergraduates in an introductory measurement course. The results indicated that the multiple-choice items differed significantly from the true-false items in both difficulty and discrimination. When corrected for guessing, the M-C items were less difficult and more discriminating. This design tends to favor the old proven items no matter what their format, because the items are usually selected on the very basis later used to compare the two item formats. The design was therefore avoided.

Frisbie (1973) also used recast M-C items to compare the multiple-choice and true-false item types. However, he included an intermediate phase where the T-F items were tried out and changes made in accordance with the item-analysis data. Results using paired t-tests showed the eight M-C subtests to be significantly more reliable than their counterpart T-F subtests, even when time was considered.
(t = 5.405, p < .001) In this study it was determined that, per unit of time, students could answer 3 T-F items for each 2 M-C items attempted. Frisbie (1974) was basically a replication of the earlier study, except that the T-F subtests were lengthened by a 3:2 ratio in comparison to the M-C subtests. Results were similar to the earlier study; however, the T-F subtests were significantly less reliable as well as less difficult. This method of recasting items is an entirely legitimate method of selecting items for study and clearly addresses the question: "How do items in Format A versus Format B compare in terms of difficulty and reliability when content is controlled by recasting items from Format A into Format B?" The question which remains unanswered is how typical items written in Format A versus Format B would compare based on difficulty and discrimination.

A study by Mueller (1975) illustrates the approach of comparing items on the basis of data obtained from the item analyses of the original tests in which the items appeared. Mueller used the data from several administrations of a real estate licensing exam to compare the difficulty and discrimination of items with substantive alternatives versus items including the options "all of the above," "none of the above," or some complex combination of alternatives such as "1 and 2 are correct." He found that use of these options affected both difficulty and discrimination. Items containing complex alternatives were the most difficult, while items with the alternative "all of the above" keyed as correct were the least difficult. Discrimination, though less affected, showed items with only substantive alternatives to be most discriminating, while items with the alternative "none of the above" keyed as correct were least discriminating. Of the three selection methods, this last approach is most conducive to the study of item formats as they are actually used within classroom tests.
It was therefore the approach chosen for use in the selection of items to be included in this study.

NEED FOR FURTHER RESEARCH

As will be evident in the Review of Related Research, a very small number of studies have addressed the questions posed in the present research. Further, the majority of studies intended to compare item formats have used the recasting method of item selection. Studies of this type have left unanswered the general question of how items of any two formats under study would compare in terms of average difficulty and discrimination if all of the items used were written by actual classroom teachers for use in their own examinations. What is proposed is a study in which the items and formats to be compared will come from actual tests used in the College of Human Medicine at Michigan State University at some time in the past several years. There is a need for evidence from actual classroom testing situations, using items written by typical medical school teachers, comparing specified item forms to determine format effectiveness and the impact of rule violations. Evidence collected in this manner meshes well with that collected using professionally prepared or carefully recast items as the basis for comparing separate formats or item-writing principles.

CONTEXT OF THE STUDY

All of the items used in this study came from the College of Human Medicine at Michigan State University and were taken from item pools composed of test items used in regular classroom tests in each subject area. Nine subject areas were used. Six of them were Focal Problem areas used with first- and second-year medical students as a means of testing specific basic science knowledge. A Focal Problem is a major presenting complaint which is used as a focus for studying related anatomy, physiology, microbiology, pharmacology, etc.
The six Focal Problems used in this study were Altered Consciousness, Elevated BUN (blood urea nitrogen), Anemia, Chest Pain, Diarrhea, and Jaundice. The other three subject areas were the clinical science areas of Pediatrics, Internal Medicine, and Surgery. The clinical science items were all used on tests given as final exams in third- or fourth-year required clinical clerkships. Nine subject areas were used both to provide a cross-section of content areas within medical education and to provide a sufficiently large number of items of each type under study to assure reasonable stability of the experimental results. All of the items in the nine item pools had been screened by the faculty for serious content or wording problems both before and after the initial exam was given. All severely defective items were removed from the item pools and therefore not included in this study. Further, all items with item-writing errors not under study were excluded from the selection process. The items used in each of the four comparisons were matched as closely as possible for number of options, content, and exam of origin.

SPECIFIC QUESTIONS TO BE ANSWERED

All four research questions are identically stated, the only difference being the particular comparison specified.

The General Question: How do the two specified groups of items compare in terms of mean values for item discrimination and difficulty where: a) number of options, exam of origin, and content objective have been controlled, and b) the items and item statistics are taken from the original administration of actual classroom tests?

The Four Comparisons:

I. Single-Answer Multiple-Choice items without rule violations (SA-MC) vs. Multiple-Answer Multiple-Choice items without rule violations (MA-MC)

II. Single-Answer Multiple-Choice items with positive stems (SA-MC) vs. Single-Answer Multiple-Choice items with negative stems (NS-MC)

III. Single-Answer Multiple-Choice items with informative stems (SA-MC) vs.
Single-Answer Multiple-Choice items with uninformative stems (US-MC).

IV. Multiple-Answer Multiple-Choice items with informative stems (MA-MC) vs. Multiple-Answer Multiple-Choice items with uninformative stems (US-MA-MC).

SUMMARY

The purpose of this study is twofold: first, to compare two item formats commonly used in medical education, and, second, to compare items which violate specific item writing principles to items without violations. Comparisons will be made on the basis of average difficulty and discrimination indices for the groups of items under study. The two formats chosen for comparison in this study are the traditional single-answer multiple-choice format and a multiple-answer multiple-choice format used primarily in medical education and known as Type K on the National Board of Medical Examiners' examinations. The two rule violations chosen for study both deal with the manner in which the question is posed in the stem of a test item. The violations are: 1) use of negatively worded stems, and 2) use of incomplete stems, stems which do not state specifically the question to be answered. The first error results in items of the form, "All of the following diseases have killed-virus vaccines available except...," which require the examinee to choose the exception. The second type are of the form "Which of the following is true?" and require the examinee to discern the intended question for himself after reading the options.

CHAPTER II

REVIEW OF RELATED RESEARCH

Many studies have been conducted to compare item types. The earliest studies comparing item types took place in the 1920's and were intended to compare the "new type items" to essay type questions. The study by Eurich (1931), described earlier, is typical in the choice of formats used for comparison: essay, completion, single-answer multiple-choice, and true-false.
Most of the more recent studies comparing item formats have concentrated on comparing the single-answer multiple-choice (SA-MC) format to either regular true-false (T-F) items or to some format which is a variant of either the SA-MC or T-F formats. The Frisbie studies (1973, 1974), detailed earlier, are recent examples of studies comparing the SA-MC and T-F formats. In the present study single-answer multiple-choice items will be compared to multiple-answer multiple-choice (MA-MC) items. Further, SA-MC and MA-MC items without rule violations will be compared to items with specified common violations.

MULTIPLE-ANSWER FORM

Study of this type of item format began over thirty years ago. An early positive mention appeared in an article by Cronbach in the Journal of Educational Psychology (Cronbach, 1939). In this article he advocated the use of an item format he called the multiple true-false format. The items are set up like regular single-answer multiple-choice items, but each option is answered like a separate true-false test item. Cronbach (1941) also conducted a study which compared the use of the multiple true-false format (all options marked with T or F) and the multiple multiple-choice format (only the true options are marked). The study disclosed little difference between these formats in terms of testing time, reliability, or validity. Already in this early study there are two variations of the multiple-answer item format, neither of which is the MA-MC format used in the present research. Also, these item formats were compared only to each other and not to the single-answer format. In a study reported by Albanese, Kent and Whitney (1977), the MA-MC format (national board type) was compared with two other multiple-response item types using the same stems and options as the MA-MC items but answered like four separate T-F items. The MTF format was scored as 160 separate T-F items.
In the MR type only those options that were felt to be true were marked, and an additional option "none of the above" was also included. The MR items were scored as 40 individual multiple-choice items where credit was given only where the correct combination of true options was chosen. (The option "none of the above" was never a correct answer.) The findings were that the MR type was significantly more difficult than the MTF and the MA-MC even after difficulty level was corrected for chance, and that the MTF section was significantly more reliable than the MA-MC section only. The reliability results are not particularly surprising. It might be expected that a subtest of 160 T-F items would have a higher reliability than a 40-item MA-MC subtest. Unfortunately, although the authors mentioned inclusion of SA-MC items in their design, they did not include these items in their analysis. Further, neither the MR nor MTF item type is practical for use in combination with SA-MC items, the MR type because it requires a different answer sheet and scoring procedure and the MTF because of probable problems with content balance. Therefore, although the results of the study are interesting they do not relate directly to the present research. Dryden and Frisbie (1975) conducted a study designed to compare the multiple-answer multiple-choice format with the single-answer format. They took 64 multiple-answer items from a General Nursing Exam and converted these to SA-MC items. Their results showed that 25% more SA-MC items were answered per unit of time and that SA-MC items tended to be slightly more reliable; there was no significant difference in difficulty between the two types. However, most of their SA-MC items contained either complex alternatives like "a and b" or "all but c" or the option "all of the above."
In addition, it appeared that the multiple-answer format was not one set format (like the traditional national board format) but a conglomeration of multiple-answer formats. Their use of this variety of format variations within both the single-answer and multiple-answer items makes it difficult to assess the meaning of their results and hazardous to generalize from them. The piece of research that relates most directly to the study of the SA-MC item versus the MA-MC item was done in Canada by Shakun et al. (1977). Part of their study was the comparison of the traditional single-answer multiple-choice format with the multiple-answer multiple-choice format (called Type K on the National Board of Medical Examiners exams). Twenty items of each type were prepared by one of the authors to be parallel in the sense of testing the same content or basic concept. The items were then reviewed by the General Surgery Test Committee and revised where necessary. Eighteen items of each type, designated as the experimental items, were selected for inclusion in the 1976 General Surgery Certifying Exam of the Royal College. The items were then interspersed among other items of the same type in either Paper I or Paper II of the Exam (191 total SA-MC items; 111 total MA-MC items). The results for the experimental items showed the following:

                                            MA-MC (Type K)   SA-MC
p (proportion getting the item correct)          .69          .71
KR-20 (reliability of subtest)                   .52          .52

There was essentially no difference in the performance of the two item types. The major problem with this study is that 18 items is a very small number of items from which to generalize. If all items of each type in the total exam are considered, a larger difference in average difficulty is noted: SA-MC, p = .73; MA-MC, p = .65. No test of significance was reported for these data, however. Also, no item discrimination indices or subtest reliabilities were reported for the larger group of items.
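The two statistics in the table above, item difficulty p and KR-20 subtest reliability, can both be computed directly from a matrix of 0/1 item scores. The following is a minimal illustrative sketch with invented scores, not data from the Shakun study:

```python
# Illustrative sketch (hypothetical data, not from any study cited):
# computing the difficulty index p and the KR-20 subtest reliability.

def difficulty(item_scores):
    """Proportion of examinees answering the item correctly (p-value)."""
    return sum(item_scores) / len(item_scores)

def kr20(score_matrix):
    """Kuder-Richardson formula 20 for a matrix of 0/1 scores,
    rows = examinees, columns = items."""
    n_items = len(score_matrix[0])
    n_examinees = len(score_matrix)
    # per-item difficulties
    ps = [difficulty([row[j] for row in score_matrix]) for j in range(n_items)]
    # variance of examinees' total scores
    totals = [sum(row) for row in score_matrix]
    mean_t = sum(totals) / n_examinees
    var_t = sum((t - mean_t) ** 2 for t in totals) / n_examinees
    sum_pq = sum(p * (1 - p) for p in ps)
    return (n_items / (n_items - 1)) * (1 - sum_pq / var_t)

# Hypothetical data: 5 examinees x 4 items
scores = [
    [1, 1, 0, 1],
    [1, 0, 0, 1],
    [1, 1, 1, 1],
    [0, 0, 0, 1],
    [1, 1, 0, 0],
]
print(difficulty([row[0] for row in scores]))  # p for item 1 -> 0.8
print(round(kr20(scores), 3))                  # -> 0.41
```

A real KR-20 computation would of course use the full subtest of items for each format, as in the table above.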
This study is a good example of the type of study which compares two item types on the basis of professionally written new items. All of the experimental items of both types were well written and prescreened, but not pretested for difficulty or discrimination. The present research used comparisons similar to the one referred to above, using all of the items of both item types. In making the comparison groups all available items of each type from the specified item pools were used. As many item pairs as possible were matched, controlling for number of options, exam of origin and content objective. The research question here is: How do items of each type, as typically written and used, compare in terms of difficulty and discrimination?

NEGATIVE STEM FORM

The argument against the use of negatively worded stems is mainly logical. First, since examinees are used to thinking positively and are generally asked to choose correct answers, items calling for an incorrect response can be confusing (Mehrens and Lehmann, 1978). Second, since questions of this nature are rarely encountered in the real world, they lack practical relevance (Ebel, 1979). Finally, evidence collected from the study of true-false items indicates that negatively stated items can take longer to answer than positively phrased items (Wason, 1961; Zern, 1967). Research actually comparing the performance of items with negatively phrased stems to that of items with positive stems is somewhat limited. Terranova (1969) conducted a study using two-option multiple-choice items to address the issue of negative versus positive stems. Three independent variables were used: positive and negative stems, frequency of change of direction set (0, 1, and 2 changes), and grade level of the student (5th, 7th, 9th, and 11th grades). Six experimental instruments were prepared along with a common
instrument. The common instrument was given to all subjects while the experimental instruments were assigned randomly to the subjects. Two three-way analyses of covariance were used for analysis. The results indicated a significant difference in difficulty level (at the .95 level of confidence) based on stem type. Negative-stem items were more difficult than positive-stem items. There was no significant difference in the reliability of any of the experimental subtests. Dudycha and Carpenter (1973) studied three variations of the SA-MC item type. One of these variations was the positive versus negative stem. (The other two were open-ended statement-type stems versus closed question-type stems and the presence or absence of the option "none of the above.") Sixty-four items (all positive, closed, without the option "none of the above") were chosen from a group of 96 items on the basis of item analysis data and course instructor evaluation of content appropriateness. These items were then recast with great care to reflect all eight possible variations (i.e., positive, closed, without "none of the above" through negative, open, with "none of the above"). Sixteen separate experimental tests were written and administered randomly to 1,124 students as the final exam in an introductory psychology course. Each test had one half negative and one half positive items. A 2x2x2 fixed-effects, repeated measures ANOVA was performed on item difficulties. The results indicated that negative stems were more difficult than positive stems (F = 14.71, p < .001). There were no interaction effects between the three factors studied. No significant differences were found between the average discrimination indices of items with positive versus negative stems. Both of these studies provide evidence of a difference in difficulty between negative- and positive-stemmed items, when the negative items are recast from positive items of proven high quality.
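The discrimination statistic these studies compare, and one of the three indices used in the present research, is the upper-lower index D. The following sketch shows how D is typically computed; the data and the 27% grouping fraction are illustrative conventions, not values taken from the studies cited:

```python
# Illustrative sketch with hypothetical data: the upper-lower
# discrimination index D is the proportion of the top-scoring group
# answering an item correctly minus the proportion of the bottom-scoring
# group doing so. The 27% grouping fraction is a common convention,
# not a value reported by the studies discussed here.

def upper_lower_D(item_scores, total_scores, fraction=0.27):
    """D = p(upper group) - p(lower group), groups formed from the top
    and bottom `fraction` of examinees ranked by total test score."""
    n = len(total_scores)
    k = max(1, int(n * fraction))
    ranked = sorted(range(n), key=lambda i: total_scores[i])
    lower, upper = ranked[:k], ranked[-k:]
    p_upper = sum(item_scores[i] for i in upper) / k
    p_lower = sum(item_scores[i] for i in lower) / k
    return p_upper - p_lower

# Hypothetical item answered correctly mostly by high scorers,
# so D should be strongly positive.
item = [1, 1, 1, 0, 1, 0, 0, 1, 0, 0]      # 0/1 score per examinee
totals = [9, 8, 8, 7, 6, 5, 4, 3, 2, 1]    # total test scores
print(upper_lower_D(item, totals))         # -> 1.0
```

An item answered equally often by both groups yields D near zero, which is why equal average D values across two item groups are read as showing no difference in discriminating power.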
Both, however, leave unanswered the question addressed in the present study, that of the difference between positively phrased and negatively phrased items as they appear on typical classroom exams. The present study is also within a different content area and at a different level than the two previous studies, i.e., medical education, with students at a post-bachelor's degree level.

UNINFORMATIVE STEM FORM

The single-answer multiple-choice item with an uninformative stem can be looked at in several ways. It can be thought of as a conglomeration of separate true-false items or as a multiple-choice item which violates one or both of the following rules:

1. An item should test only one coherent thought.
2. The item stem should pose a direct question in a concise manner so that the meaning of the item is clear.

Ebel (1978) conducted a study which looked at this item type as a conglomerate of separate true-false items. First, a test consisting of 100 separate true-false items was prepared. Then these items were arranged to form a second test with 2 two-option items and 32 three-option items (23 with the stem "which statement is true" and 9 with the stem "which statement is false"). The two tests were then administered as Part I and Part II of a midterm examination to two classes of graduate students in education. The results of a comparison of the scores on the two parts showed: (1) the test using separate true-false items was more reliable than the "grouped" test for both groups, and (2) the individual "grouped" items were more difficult than the individual true-false items. No significance data were reported; however, this study points out two things: (1) a definite loss in information occurs if separate true-false items are combined to form MC items with uninformative stems, and (2) the resulting test would probably be less reliable.
In the present study, the M-C item with an uninformative stem is looked at as a regular single-answer M-C item which was written incorrectly. Some authors of classroom test items are under the mistaken impression that if one concept in an item is good, two is better, and four better still. One solution would be to simply reverse the process used in Ebel's 1978 study and make each of these items into separate true-false items. However, this process could easily result in a different rule violation. Rule: The content being tested should be worth testing. It should be important knowledge. In general, when medical faculty edit and rewrite test items, it is more common for the faculty to find no concepts really worth testing in items of this type than to find four concepts all worth testing. Many items of this type seem to be the result of unclear or disorganized thinking on the part of the item author. However, some items with uninformative stems do have one basic item idea. In these cases, the error is that the stem does not pose a direct question, forcing examinees to look for the question among the options as well as looking for the answer to that question. In the first case, the item idea was unclear so no specific question was being asked by the item. In the second case, a specific question was intended but the stem did not pose the question appropriately. For the purposes of the present study, these two variations are considered together as items with uninformative stems. Before going on it is important to make the distinction between items which have uninformative stems, as described above, and items which merely have open-ended stems. Open-ended stems are stems which use an incomplete declarative sentence to pose a question instead of actually using an interrogative sentence. In this case, there is no loss of content, merely a difference in form. Examples:

Question Format: Which of the following should be used for initial treatment of an anaphylactic reaction?
Incomplete Statement Format: An anaphylactic reaction should be treated initially by use of:

Uninformative Stem: An anaphylactic reaction.

In the present study, items using either the question or incomplete statement format are classified as acceptably written items. The issue in the present study is the presence or absence of relevant information in the item stem, not the superficial form of that stem. Four studies from the literature provide background for the present research, although none of them focuses specifically on the question of major interest in the present study. Dunn and Goldstein (1959) studied the effect of each of four rule violations. One of their comparisons was between items with closed stems (questions) and items with open stems (incomplete statements). The authors stated that for convenience the open stem was designated as the "rule." The procedure for developing the experimental instruments was to rewrite acceptably constructed items to reflect one or more rule violations. The items were obtained from Army Basic Military Subjects Tests. Four 100-item tests (A, B, C, and D) were constructed for their Series A experiment, which combined the study of open versus closed stems with the study of the use of cues and specific determiners. Each test had 25 items reflecting each of the following combinations: open - no cue; question - no cue; question - cue; open - cue. No significant differences in difficulty or KR-20 reliability coefficients were found between groups of items with open versus closed stems. The Dudycha and Carpenter (1973) study described earlier also investigated the open versus closed stem. In this study, the closed stem was used as the "rule." The open-ended experimental items were carefully recast from previously used well-written closed-stem items so that there was no loss of information in the stem. The results indicated that items with open stems were more difficult than those with closed stems
(F = 5.61, p < .05). There was no difference found in average item discrimination between the two groups. Both of these studies were focused primarily on the form of the stem, not the completeness of the content. Both ask the question, "Does a change in the form of the stem alone affect the average difficulty or discrimination indices for multiple-choice items?" Since these studies produced differing results the question is not answered definitively. Why the difference? One possible explanation might be in the type of stem found in the original items used for study. In the Dunn and Goldstein study, it appears that the original high quality items had open stems (incomplete statements) which were altered to question format for comparison, while in the Dudycha and Carpenter study the opposite was true. This is a significant difference and could explain the results. Logically, it would seem that if the basis for comparison of two groups of items is going to be difficulty and item discrimination, then either both sets of items or neither set should be selected on the basis of previous difficulty and discrimination values. This would involve either selection of original items on some other basis (i.e., using all available items, random selection, content appropriateness, etc.) or using a tryout to obtain item statistics for the recast items. This was not done in the above studies. These two studies serve as a stimulus, as an indication of interest in studying the general topic of the effect of variations in the wording of positively phrased item stems. However, because their focus was on the form of the stem alone, they do not relate directly to the present research. The last two studies reviewed did address the issue of informative versus uninformative stems and were both done by the same authors, Cynthia Board Schmeiser and Douglas Whitney.
The first study (Board and Whitney, 1972) was designed to compare a group of well-written test items to groups of items containing one of four violations of accepted item writing principles. Thirty items were chosen from a sixty-item midterm exam in an Introduction to American Politics course at the University of Iowa. The items were chosen on the basis of their difficulty (40% - 70%) and discrimination indices (.3 or better) and their adaptability to being recast using the rule violations selected. Three 30-item experimental tests were written: Test I had the 30 well-written midterm items. Test II had 13 items with "window dressing" and 17 items with incomplete stems, all recast from the original 30 items. Test III was composed of 15 items with distractors made systematically longer or shorter than the correct answer and 15 items where the only option grammatically consistent with the stem was the correct answer. The authors did not state whether the original items were in the form of a question or an incomplete statement. The focus of the study was not the form, as in the two previous studies, but the content. When the items were recast, the stems were severely truncated to make the stems "grossly incomplete." All three of the experimental tests were administered along with a sixty-item final exam. The results indicated that the items with incomplete stems were more difficult than the original items (F = 4.42, p < .05). The analysis also indicated that compared to the subtest of original items the subtest of items with incomplete stems had a lower reliability, using Feldt's approximation of the F test for comparing KR-20s (W = 1.52, p < .05). The most recent study used a slightly different method.
Schmeiser and Whitney (1975) first made an extensive search through teacher-constructed exams at the University of Iowa's Evaluation and Examination Service, where they found a number of examinations in which at least one-fourth of the items contained incomplete stems. (This indicates that this item fault is a common error outside of the medical area as well as within medical education.) They chose for study purposes a sociology exam where 22 of the original 61 items had incomplete stems. They then rephrased these stems to make 22 comparison items. Following is an illustration of their method:

Incomplete stem: Free will
Rewritten stem: Which of the following statements characterize free will?

The items with incomplete stems were found to be significantly more difficult (F = 23.88, p < .001) but did not differ in average discrimination (using D, the upper-lower index). The first study used the traditional method, criticized above, of taking proven good items and recasting them to produce poor items. The second study goes several steps further, first by locating items that already contained an error, second by using all available items in the recasting process. This study provides definite evidence that items with uninformative stems tend to be more difficult than items with complete, informative stems. The present study contributes a different type of evidence about this item writing error than the past studies, which all used recast items. A comparison will be made between approximately 50 items of each type (uninformative versus complete-informative stem) taken from actual classroom exams in medical education. The item pairs are matched for content objective and exam of origin.

NUMBER OF OPTIONS

The number of options in an item can affect the difficulty of the item, with items with fewer options being easier than those with more options. Burmester and Olson (1966) recast 85 SA-MC items from a college natural science course into 85 true-false items.
Items where student selection of incorrect answers was spread evenly across all distractors were made true (using the stem and correct answer). Items where one distractor was chosen most frequently were recast as false (using that distractor with the stem). The original SA-MC items had a mean difficulty (p value) of .57 and discrimination of .45 (Flanagan index). For the recast items the figures were .71 and .41. Their conclusion was that the true-false items were easier but similar in discrimination. No significance data were given. Several other studies described earlier (Oosterhof and Glasnapp, 1972; Frisbie, 1973; Frisbie, 1974) further substantiate the claim that items with fewer options can differ significantly in difficulty. Further, two of these studies (Oosterhof and Glasnapp, 1972; Frisbie, 1974) also showed a significant difference in reliabilities between groups of items with two versus four options. These studies represent the extreme comparison based on differing numbers of options, i.e., true-false with two options versus multiple choice with four or more options. Several other studies have compared groups of items more similar in number of options. Williams and Ebel (1957), using the vocabulary section of an Iowa Test of Educational Development, recast 150 four-option SA-MC items into three-option and two-option items by dropping the least discriminating distractors. Following are their results:

                                             4-choice   3-choice   2-choice
Index of difficulty
(p - % getting item correct)                    .50        .58        .68
Index of discrimination (upper-lower)           .49        .47        .41

A fairly large decrease in difficulty with each option dropped was accompanied by a much smaller decrease in discrimination. Several earlier studies conducted by Ruch & Stoddard (1925) and Ruch, De Graff, et al. (1926) also compared items with differing numbers of options, with results similar to those of Williams and Ebel.
In the Ruch and Stoddard study, two forms of a fifty-item history and social science test were used, each item in each form having been prepared as a five-option, three-option, and two-option multiple-choice item. Three groups of high school seniors, 135 students in each, were tested, one group being given both forms of the five-response test, one both forms of the three-response test, and one both forms of the two-response test. Reliabilities were computed using the scores on the two forms. Both p-values and reliability coefficients increased with the increase in number of options. The Ruch, De Graff, et al. study was very similar with the exception of the addition of seven-option items. The results of this study showed a similar increase in p-values and reliabilities from the 2-option to 3-option to 5-option items. However, a decrement in reliability resulted when 7-option items were used. All of the studies cited indicated that items differing in number of options also differ in resulting item statistics. For this reason, the average number of options per item has been controlled for all comparisons to be made in the present research.

OTHER RULE VIOLATIONS

Use of Complex Alternatives: Items including the options "all of the above," "none of the above," or variations of the complex alternative "Both 1 and 2 are correct" were all eliminated from the present research. It is necessary to control for possible effects because of evidence which suggests that the presence of any of these options can affect the difficulty of the test item. Further, since items with these options appeared only sporadically it would be very difficult to control for their effect by any means other than elimination. All seven of the relevant studies in the literature found some increase in difficulty with use of these options (Mueller, 1975; Dudycha & Carpenter, 1973; Williamson & Hopkins, 1967; Hughes & Trimble, 1965; Rimland, 1960; Boynton, 1950; and Wesman & Bennett, 1946).
Five of these studies concentrated on the option "none of the above," while the other two looked at all three kinds of complex alternatives. Both the Mueller (1975) study and the Hughes and Trimble (1965) study compared items with each type of complex alternative to regular SA-MC items. The Mueller study compared large groups of items from actual exams used in a Real Estate Salesmen's Course. The results seem to indicate increased difficulty for items with combination alternatives, e.g., "1 and 2 are correct," as compared to items with only substantive alternatives. There were small differences in average discrimination indices for the varied groups of items, with items containing only substantive options showing the highest average value. Since there was no report of significance tests, a summary of the results is provided below:

                                               ITEM TYPE
                           Substantive         None of     All of      Combination
                           alternatives only   the above   the above   alternatives
Number of items (k)               91               94          79           45
Average difficulty (p)           .79              .74         .78          .64
Average discrimination
(r Pt. Bis.)                     .30              .27         .27          .26

Hughes and Trimble (1965) used four-option items originally written by the author and previously tried out in an Introductory Psychology course as control items. The items were recast to make comparison items by adding a fifth option: either "none of the above," "all of the above," or "Both 1 and 2 are correct." Three experimental tests were made up:

Test 1 - 50 regular 4-option items (15 control, 35 used in comparisons)
Test 2 - 15 regular control items; 35 items with "Both 1 and 2 above are correct"
Test 3 - 15 regular control items; 17 items including "none of the above;" 18 items including "all of the above."

The three tests were distributed to randomly selected subgroups of 26 students each. Results indicated an increased difficulty for the comparison items in both Test 2 and Test 3.
Test 1 (control)        Mean = 27.19
Test 2 (Both 1 and 2)   Mean = 22.05
Test 3 (All or None)    Mean = 23.76

F = 7.05, p < .01
Dunn's test comparing means: 1 with 2, C = 3.60, p < .05; 1 with 3, C = 2.40, p < .05

There was no significant difference in the reliabilities of the three experimental tests. One problem with interpreting the results of this study relates to the method of recasting items. The authors recast these items by increasing the number of options from 4 to 5. Therefore, the increase in difficulty could have been due to the content of the fifth option or to the mere presence of an extra option. The authors themselves admitted this was a problem in their procedures section but gave no explanation of why it was done this way. Despite the design problems in the second study, both studies do provide evidence that the use of any of these kinds of alternatives can affect mean item difficulty. Further evidence is provided by additional studies which included only the option "none of the above" in their design. In the Dudycha and Carpenter study described earlier, sixteen experimental tests (64 items each) were written and administered randomly to 1,124 students as the final exam in an introductory psychology course. One half of the items on each test included the option "none of the above"; the other half had substantive alternatives only. The results indicated that the items which included "none of the above" were more difficult (F = 81.54, p < .001). There were no significant interactions between the three factors tested (open vs. closed and negative vs. positive stems were the other factors). In reference to item discrimination, inclusion of the alternative "none of the above" decreased the discrimination ability of an item (F = 17.19, p < .001). Four earlier studies gave similar results in terms of difficulty but provided no evidence of a decrease in reliability or validity associated with the use of this option.
Williamson & Hopkins (1967) used four standardized arithmetic tests with 345 fourth-grade students, recasting items in two of the tests to include "none of these" and recasting items in the other two tests to exclude that option. The results indicated a significantly higher mean difficulty for items including "none of these" on two of the four tests but no difference in the other two. Rimland (1960) used the Navy Arithmetic Test with 3,600 Navy recruits, recasting items to include the "right answer not given" option. Results showed a small but significant difference in difficulty, with items including the "right answer not given" option being more difficult. Boynton (1950), studying the effect of the "none of these" option with spelling items, found an increase in difficulty. Finally, Wesman and Bennett (1946) conducted an exploratory study using only 20 vocabulary and 20 arithmetic items with 590 applicants as part of a test of admission to nursing schools. They found that 17 out of 20 of the vocabulary items increased in difficulty with the use of the "none of these" option while there was no difference in difficulty on the arithmetic section. All seven of the studies mentioned substantiate at least to some degree the decision to eliminate items with complex alternatives, to avoid the possibility of a variable confounding research results through an uncontrolled effect on either difficulty or discrimination.

Grammar and Length Faults: Grammar and length faults both refer to errors in the item options. An item has a grammar fault if the options are not grammatically consistent with the stem and a length fault if the correct answer is significantly longer than the incorrect options. The relevant research studies (Dunn and Goldstein, 1959; McMorris et al., 1972) both investigated the effects of each of these errors on average item difficulty and discrimination.
The earlier study used Army enlistees as subjects and the Army Basic Military Subjects Test as the basis for recasting items to include faults, while the later study used high school students and an American History Test. In both studies, faults were found to make the items easier; however, validity and reliability coefficients remained unchanged. The Dunn and Goldstein (1959) study was discussed earlier. In the portion of the study which relates to these item writing errors, the authors used items which were recast to contain the fault in preparing four tests, E, F, G and H, to form Series E in the experiment. Each test included 25 items belonging to each of the following groups: equal choices - good grammar; unequal choices - good grammar; equal choices - poor grammar; unequal choices - poor grammar. An analysis was done comparing average difficulty and discrimination for the 4 groups of items. The results indicated that the errors resulted in easier items but had no effect on reliability or validity indices. On all four experimental tests, E, F, G and H, the four groups of items ranked in the same order based on difficulty: 1) Items which included both errors - easiest, 2) Items containing a length error only, 3) Items containing a grammar error only, and 4) Items with no error - most difficult. The McMorris, et al. (1972) study investigated three of the four errors looked at by Dunn and Goldstein. These were: length errors, grammar errors, and cue errors, where a cue to the correct answer is included in the stem. The authors first wrote, then pretested a set of well-written test items covering the New York Board of Regents objectives for American History. They then chose 42 items with difficulty indices between .25 and .85, which were recast to include one of the three errors. The resulting experimental instruments were two 42-item tests which were as identical as possible except for the faults.
On each test, there were 21 items without faults and 7 items with each of the three errors. The tests were administered to 494 students, with alternating students getting forms A & B. Within the analysis, comparisons for each error were based on the 14 fault-free items and their counterpart 14 faulted items. Differences in difficulty were tested by means of confidence intervals around the mean difference score (average number of students getting the faulted items correct minus average number of students getting the fault-free items correct). The results indicated that both the length and grammar faults decreased the difficulty of the items at the 95% level of confidence. The analysis indicated no differences in the subgroups based on comparison of reliability and validity indices. Both of the studies reviewed provided evidence to indicate that inclusion of items with grammar or option length errors can affect overall difficulty of groups of items. These items were eliminated from consideration in item selection in an effort to avoid as many potentially confounding factors as possible.

Fault within NS-MC and MA-MC Types Relating to Content of Options

Elimination of items with this final error was based, not on past research, but on logic as described below. This error is particular to the negative-stemmed and MA-MC item types. These are items where the correct answer and its direct opposite are both included in the set of alternatives, thus automatically eliminating all other options. Recall that with single answer items with negative stems the examinee is looking for the exception. If two of the options contain statements that could never both be true at the same time (e.g., alkalosis and acidosis or hypothermia and hyperthermia), one must be the exception asked for. With the MA-MC item type an error of this kind could eliminate from one to three alternatives depending on which options were involved.
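The confidence-interval procedure used by McMorris et al. can be sketched numerically. The difficulty values below are hypothetical, not data from their study; the t critical value 2.160 is the two-tailed 95% value for 13 degrees of freedom.

```python
import math

# Hypothetical difficulty values (proportion correct) for 14 fault-free
# items and their 14 faulted counterparts -- not the McMorris et al.
# data, only an illustration of the confidence-interval method.
fault_free = [0.52, 0.61, 0.48, 0.70, 0.55, 0.63, 0.44,
              0.58, 0.66, 0.51, 0.47, 0.73, 0.60, 0.49]
faulted    = [0.60, 0.65, 0.55, 0.74, 0.62, 0.70, 0.50,
              0.61, 0.72, 0.58, 0.53, 0.80, 0.64, 0.57]

# Paired difference scores: faulted minus fault-free.
diffs = [f - g for f, g in zip(faulted, fault_free)]
n = len(diffs)
mean_d = sum(diffs) / n
sd_d = math.sqrt(sum((d - mean_d) ** 2 for d in diffs) / (n - 1))
se_d = sd_d / math.sqrt(n)

# 95% confidence interval; 2.160 is the t critical value for 13 df.
t_crit = 2.160
ci = (mean_d - t_crit * se_d, mean_d + t_crit * se_d)
print(f"mean difference = {mean_d:.3f}, 95% CI = ({ci[0]:.3f}, {ci[1]:.3f})")
# An interval excluding zero means the fault changed difficulty at the
# 95% level of confidence.
```

Here every faulted item is easier than its counterpart, so the interval lies entirely above zero, mirroring the direction of the McMorris result.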
The only error of this type actually found was where the item author had included 2 correct options (1 & 3 or 2 & 4) and 2 incorrect options (the other of 1 & 3 or 2 & 4) in such a manner that all of the options are eliminated except option B (1 & 3 correct) and C (2 & 4 correct). With a positively worded SA-MC item this error cannot occur because a statement and its opposite are legitimate options. Both could be irrelevant to the question being asked or one of them could be the correct answer. The point is the examinee does not get an irrelevant cue to the correct answer. The effect of any error of this type on difficulty and discrimination is unknown, so items with this error were eliminated.

SUMMARY

In this chapter the literature relevant to the general research questions of this study was reviewed. This included literature related to study of the multiple-answer format, and to the effect of selected item writing rule violations and the number of options per item. It is evident from this review that there are few past studies relating directly to the specific research questions of the current study. In Chapter III the research procedures and design of the present study are detailed. Also, the specific research hypotheses are stated.

CHAPTER III
PROCEDURES AND DESIGN

INTRODUCTION

The purpose of this research is to compare groups of test items, selected on the basis of format or item writing rule violation, for psychometric quality based on data collected from administration of regular classroom exams within medical education. Five item types have been identified for study, as have four relevant comparisons between them. All five item types are variants of the basic multiple-choice item format. The measures used as a basis for the comparison of designated item groups are one estimate of item difficulty, p, proportion getting the item correct, and two estimates of item discrimination, the upper-lower index, D, and the biserial correlation coefficient, rBis.
This chapter includes a description of: the source of the test items used, the item selection procedures, including a discussion of the method for control of content, the statistics used as measures of item quality, the design of the study, the hypotheses to be tested, and the statistical procedures to be used to test these hypotheses.

SOURCE OF ITEMS

A total of 718 test items were used in the present study. Table 3.1 provides a breakdown showing the number of items used in each of the four research comparisons. Since the pairings for each of the four comparisons were done independently, some of the individual SA-MC and MA-MC items are included in more than one comparison. All of the items used come from the College of Human Medicine at Michigan State University and were taken from item pools composed of test items previously used on regular classroom tests in each subject area.

TABLE 3.1
NUMBER OF ITEMS USED IN EACH OF THE FOUR RESEARCH COMPARISONS

Comparison            SA-MC(a)  MA-MC(a)  NS-MC  US-MC  US-MA-MC
SA-MC vs. MA-MC         124       124
SA-MC vs. NS-MC         143                143
SA-MC vs. US-MC          51                        51
MA-MC vs. US-MA-MC                 69                       69
TOTAL NUMBER            284       171      143     51      69

(a) The total number of SA-MC and MA-MC items is less than the number of each type used in all comparisons because some of the SA-MC and MA-MC items are used in more than one comparison.

Of the nine subject areas used, six were focal problem areas: Altered Consciousness, Diarrhea, Anemia, Jaundice, Chest Pain, and Elevated BUN. Each focal problem is a major presenting complaint or symptom, used at Michigan State as the focus for a three to five week course of study of related basic, clinical and behavioral sciences. At the end of each of these mini-courses a content exam is administered. The focal problem items used in this study originated on these tests. The final three subject areas were the clinical science areas of Pediatrics, Internal Medicine and Surgery.
The clinical science items used in this study all originated on tests given as final exams in these required clinical clerkships. Table 3.2 shows the source of all item pairs by content area and comparison. In the total study 387 pairs of items were selected: 124 in the SA-MC vs MA-MC comparison, 51 in the SA-MC vs US-MC comparison, 69 in the MA-MC vs US-MA-MC comparison and 143 in the SA-MC vs NS-MC comparison. This table further indicates the total number of item pairs used in the study from the focal problem areas versus the clinical science areas. These 387 item pairs originated on 40 separate exams. Appendix A provides a complete listing by content area of the 40 exams used, along with the number of students tested and the number of item pairs selected from each exam for each of the four comparisons. The Pediatrics and Internal Medicine exams were administered and analyzed as combined exams throughout the entire period of this study, so each exam appears under both content areas. However, although the exam items were combined into one test, the preparation was completely separate, with the content being submitted to and reviewed by separate committees within the respective departments.

TABLE 3.2
SOURCE OF ITEM PAIRS BY CONTENT AREA AND COMPARISON

                              SA-MC vs  SA-MC vs  MA-MC vs   SA-MC vs
Content Area                   MA-MC     US-MC    US-MA-MC    NS-MC
Focal Problems
 1. Altered Consciousness        12        3         11          8
 2. Diarrhea                     13        5         14         15
 3. Anemia                        7        7          7         19
 4. Jaundice                     13        6         10         10
 5. Chest Pain                    3        3          4          1
 6. Elevated BUN                 17        9          8         18
Subtotal Focal Problems          65       33         54         71
Clinical Science Areas
 7. Surgery                      20        6          3         14
 8. Pediatrics                   17        9          5         26
 9. Medicine                     22        3          7         32
Subtotal Clinical Science        59       18         15         72
TOTAL                           124       51         69        143

All of the items within the nine item pools had been screened by the faculty for serious content or wording problems, both before and after the initial administration of each exam.
All severely defective items were removed from the item pools and therefore excluded before the item selection procedure for this study began. Further, all of the items included in each item pool were keyed to a topic area within the overall content area. This keying was later checked by another faculty member, group of faculty members, or a specially trained medical student. Table 3.3 displays the entire list of topics used by faculty in categorizing items for each of the content areas. The topic list was the same for all focal problems. It can also be noted that most of the topics are either basic science areas relating to the focal problems or subspecialties within the general clinical science areas.

ITEM SELECTION AND CONTENT CONTROL

Specific Selection Procedures - Six Focal Problem Areas

The College of Medicine at Michigan State University has a special Focal Problem Track for first and second year medical students. Approximately 40 medical students in each class of 100 students participate. Over the two year period the students learn the same overall basic science material as students in the more traditional program; however, they learn it within the context of focal problems. A focal problem is a common general medical finding or symptom. Chest pain, abdominal pain, elevated BUN, anemia and jaundice are all examples. The students study these problems intensely, one at a time, for a three to five week period. Study is done independently and is based on a list of basic concepts and suggested references.
At the end of the designated time the students take an achievement test (approximately 150 items) where every student is allowed sufficient time to answer every item. No correction-for-guessing formula is applied. Each test is very homogeneous in content because it is designed to measure the knowledge and understanding of concepts which all relate directly to one central medical finding. The items selected for use in this study had been screened by the faculty for obvious content or wording problems, but were not pretested or screened on the basis of item statistics. Each of the items was keyed to a specific topic area. Content was controlled by using pairs of items from the same test administration, which were further keyed to the same topic area. The initial keying was done by the physicians, basic scientists and behavioral scientists who wrote the items, but was also rechecked by the author. Item pairs were only made when the author's judgement and the item writer's judgement coincided.

TABLE 3.3
LIST OF TOPICS USED FOR ITEM SELECTION WITHIN THE FOCAL PROBLEM AREAS AND SEPARATE CLINICAL SPECIALITIES

List for Focal Problem Areas: Anatomy, Behavioral Science, Biochemistry, Clinical Science, Histology, Microbiology, Neurology, Pathology, Pharmacology, Physiology, Gastrointestinal, Genitourinary, Hematology, Immunology/Skin, Infectious Diseases, Nerves & Muscles, Respiratory & ENT

List for Pediatrics: Behavioral, Cardiology, Emergency Conditions, Endocrine/Metabolic, Genetics/Birth Defects, Growth & Development, Neurology, Oncology, Newborn

List for Medicine: Cardiology, Endocrine/Metabolic, Gastrointestinal, Hematology, Immunology, Infectious Diseases, Nephrology, Pulmonary

List for Surgery: Lumps, Hernia, Abdominal Pain, Abdominal Mass, GI Bleeding, Peripheral Vascular, Injuries & Burns, Neck Masses, General Procedures
The final result is a set of item pairs, each comprised of two items from the same test, one of each type to be compared, which were written independently and keyed to the same basic topic area. Selection of the pairs of items was based solely on item type, number of options and content similarity. To illustrate this item selection and content control process, the Focal Problem Elevated BUN and, where appropriate, the single answer multiple-choice versus multiple answer multiple-choice comparison are used as an example. The item selection and content control process is outlined below:

1) All available items in the Elevated BUN item pool were used. These items had been screened by the faculty for gross errors in content or wording, and were all keyed to one of the topic areas listed in Table 3.3.

2) Every item was then screened by the author for item writing errors not included in the present study and either eliminated or classified into one of the five item types used in this study.

3) Each item was then listed according to test of origin, item type, keyed topic area and number of options. The item writers' keying of items to topic areas was checked by the author at this point, and where disagreement existed the items were eliminated.

4) Every screened MA-MC item was matched as closely as possible to a five-option SA-MC item from the same test administration, in accordance with the keyed topic areas. In the rare cases where the MA-MC item could be matched to any of several SA-MC items, the specific item was chosen using first, similarity of content, then if necessary, a random number generator.

5) A total of four item pairs resulted for the SA-MC versus MA-MC comparison, all matched in terms of content designation, number of options and test of origin.

Table 3.4 illustrates steps 3, 4 and 5 in this process using the Spring 1976 exam and the SA-MC vs MA-MC comparison. Table 3.5 shows the four item pairs resulting from this example of the selection process.
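The matching in steps 4 and 5 above can be sketched in code. The item records below only loosely mirror the Spring 1976 illustration, and a seeded random tie-break stands in for the content-similarity judgement plus random-number step; all names and values here are our own.

```python
import random

# Hypothetical screened item records as (item_number, topic_area),
# loosely mirroring the Spring 1976 Elevated BUN illustration.
sa_items = [(2, 10), (4, 9), (5, 9), (6, 3), (7, 9), (18, 9),
            (21, 6), (23, 9), (24, 3), (10, 6)]
ma_items = [(46, 9), (47, 6), (48, 3), (51, 8), (57, 1), (100, 9)]

rng = random.Random(1976)  # fixed seed so the pairing is reproducible
pairs = []
used_sa = set()

# Match each MA-MC item to an unused SA-MC item keyed to the same topic;
# random tie-breaking stands in for the content-similarity judgement
# plus random-number step described in the text.
for ma_num, topic in ma_items:
    candidates = [num for num, t in sa_items
                  if t == topic and num not in used_sa]
    if not candidates:
        continue  # topics with no SA-MC counterpart yield no pair
    sa_num = rng.choice(candidates)
    used_sa.add(sa_num)
    pairs.append((ma_num, sa_num, topic))

for ma_num, sa_num, topic in pairs:
    print(f"topic {topic:>2}: MA-MC item {ma_num} paired with SA-MC item {sa_num}")
```

As in the dissertation's example, the MA-MC items keyed to topics 1 and 8 find no same-topic SA-MC counterpart, so exactly four pairs result.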
Specific Selection Procedures - Three Clinical Science Areas

All medical students at Michigan State University are required to participate in clinical clerkship experiences which range from six to twelve weeks in duration. As part of the experience in Pediatrics, Internal Medicine and Surgery the students are provided with a set of basic content objectives and a list of suggested reading materials. At the end of each clerkship the students take an examination (75-125 items) based on the relevant content materials. During the period of this study, exams in Internal Medicine, Pediatrics and Surgery were administered three times a year to groups of approximately thirty students. Each item was keyed to a sub-specialty or problem area as listed in Table 3.3.

TABLE 3.4
ILLUSTRATION OF STEPS 3, 4 AND 5 OF THE ITEM SELECTION PROCEDURE USING THE SA-MC vs. MA-MC COMPARISON AND THE SPRING 1976 ELEVATED BUN EXAM AS THE EXAMPLE

STEP 3: List each item according to test of origin, item type, keyed topic area and number of options.
EXAM OF ORIGIN: Elevated BUN - Spring 1976. NUMBER OF OPTIONS: All items have five options.

SA-MC items (item number: topic number): 2: 10; 3: 12; 4: 9; 5: 9; 6: 3; 7: 9; 8: 5; 10: 6; 12: 8; 14: 4; 18: 9; 21: 6; 23: 9; 24: 3; 25: 10; 27: 16; 34: 3; 38: 6
MA-MC items (item number: topic number): 46: 9; 47: 6; 48: 3; 50: 16; 51: 8; 57: 1; 100: 9

STEP 4: Since there were more SA-MC items, the MA-MC items were listed first. Then each was matched as closely as possible to a SA-MC item.

Topic 1: MA-MC item 57 - none
Topic 3: MA-MC item 48 (amino acids) - SA-MC items 6 (amino acids), 24 (amino acids)
Topic 6: MA-MC item 47 - SA-MC item 21
Topic 8: MA-MC item 51 - none
Topic 9: MA-MC items 46 (antibiotic-absorption), 100 (antibiotic-sensitivity) - SA-MC items 4 (antibiotic-aminoglycosides), 5 (antibiotic-side effects), 7 (antibiotic-toxicity), 18 (antibiotic-sensitivity), 23 (antibiotic-absorption), 38 (general-toxicity)

STEP 5: Resultant Item Pairs.

Topic    MA-MC Item    SA-MC Item    How Selected
 3          48              6         random number
 6          47             21         only possible pair
 9          46             23         content similarity
 9         100             18         content similarity

TABLE 3.5
SAMPLE SA-MC vs MA-MC ITEM PAIRS FROM THE ELEVATED BUN SPRING 1976 ILLUSTRATION

(1) Which of the following serum amino acids transports ammonia to the liver
1. aspartate
2. glutamine
3. histidine
4. serine
5. tyrosine

Which of the following amino acids can be directly synthesized via intermediates from the glycolytic pathway or pentose phosphate pathway
A. aspartate
B. glutamine
C. phenylalanine
D. serine

(2) Which of the following is the most common causative organism in acute bladder and kidney infections in patients in whom no obstruction exists and neither antimicrobial agents nor instrumentation have been used
1. Enterobacter
2. Escherichia coli
3. Klebsiella
4. Proteus mirabilis
5. Pseudomonas aeruginosa

Which of the following tests would be helpful in differentiating typical E. coli from E. aerogenes
A. indole test
B. methyl red test
C. Voges-Proskauer reaction
D. citrate test

(3) To prolong the absorption time, repository penicillin is administered
1. intramuscularly
2. intravenously
3. intrathecally
4. orally
5. subcutaneously

Which of the following are stable in gastric acid and undergo good absorption after oral administration
A. penicillin G
B. oxacillin
C. methicillin
D. ampicillin

(4) The incidence of hypersensitivity reactions to cephalosporins is higher in patients who have shown allergic manifestations following the administration of
1. gentamicin
2. penicillin
3. polymyxin
4. sulfonamide derivatives
5. tetracycline

A patient with a urinary tract infection and known sensitivity to penicillin could be treated with
A. ampicillin
B. methicillin
C. cephalexin
D. lincomycin

As with the Focal Problem Exams, the items were initially classified by the item writers, then later checked by a different faculty member or group of faculty during the test review process, as well as being rechecked by the author during the process of item selection. The total selection process was identical to that described in the Elevated BUN example. The result was also the same: pairs of items from the same test, one of each format, which were written independently and keyed to the same basic sub-content area within that medical speciality.

Content Control

Since the primary purpose of the item selection procedure outlined above was to control content, this vital topic has already been discussed at some length. However, a few additional remarks are necessary. The present study makes use of independently written items from regular classroom tests. The purpose of content control, therefore, was to assure that each item within a pair had been designed to test similar content. The important factor in each case was the keying of the items to the topic areas used to guide item writers (Table 3.3). The procedure used to assure as much accuracy as possible in item keying was outlined earlier. Summarized, there were three steps: 1) Initial keying, 2) Check by other faculty, and 3) Recheck by the author. This double check should be sufficient to assure content similarity. However, one final checking procedure was used.
The procedure was followed as outlined below:

1) Since the primary comparison in the study was the SA-MC vs MA-MC comparison, this was the one chosen for use in this final item pairing check.

2) Three topic areas were sought, two focal problem topics and one clinical subspecialty. These were chosen on the basis of having the largest number of item pairs in the SA-MC vs MA-MC comparison. Selected were the focal problem topics, Pharmacology and Clinical Science, and the clinical subspecialty, Cardiology.

3) All of these SA-MC vs MA-MC item pairs were collected and listed. Appendix B contains the entire listing: Clinical Science, 12 pairs, Pharmacology, 11 pairs and Cardiology, 8 pairs.

4) The completed lists were given to two physicians who were asked to do two things: first, to place a check beside all items which were in his opinion correctly keyed to each topic area; second, to rate each item pair in terms of content similarity: very different, different, similar, or very similar.

The result of this final check showed that both reviewers felt that all of the item pairs were keyed to the correct area. Further, both reviewers rated all sample pairs as being either similar or very similar in content. This result increases the confidence in the appropriateness of the item pairs used throughout the study.

MEASURES OF ITEM QUALITY

The reliability and validity of test results depends on the properties of the individual items which make up the test. The total test has no properties which cannot be derived from those of the single items or the relationships between them. (Magnusson, 1967)

The present study concentrates on the quality of the single test items, using difficulty and discrimination indices as measures of individual item quality. In his 1939 article, Flanagan outlined the primary and
secondary bases for judging test items, stating that the primary considerations are item difficulty (percentage of persons getting the item correct) and item validity (the extent to which an item will predict the criterion, i.e., predict total test score). The specific test statistic chosen to estimate item difficulty was simply p, the percentage of examinees who marked the item correctly. Two statistics were chosen to estimate item discrimination: D, the upper-lower index, computed using the upper and lower 27% of the examinees based on total test score; and the biserial correlation coefficient. The rationale for using two discrimination indices is based on their relationship to item difficulty. The biserial correlation coefficient was chosen because it is independent of the difficulty of the item. Pyrczak (1973) published the results of a study intended to measure the validity of the Discrimination Index (Biserial) as a measure of item quality. Biserial correlation coefficients were computed for each of 27 items on Form A and Form B (parallel form) of a nonspeeded arithmetic-reasoning test administered to 364 teacher education students. The resulting discrimination indices were compared to the average of the ratings of three judges to determine validity. The judges based their ratings on nine criteria for item quality, rating each item on both the presence of each fault and their opinion concerning how seriously this fault should affect validity. The validity coefficients for the discrimination indices were .544 and .558 for Forms A and B respectively. Both values were significant at the .01 level. Pyrczak's conclusion was that discrimination indices that are relatively free of the influence of item difficulty are at least moderately valid indicators of item quality. Using p along with the biserial provides two relatively independent measures of item quality.
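The three indices can be computed from raw item responses as sketched below. This is our own illustrative implementation, not the scoring program used in the study; the normal-ordinate formula for the biserial and all names are our choices.

```python
import math
from statistics import NormalDist

def item_statistics(scores, totals, tail=0.27):
    """Illustrative computation of p, the upper-lower index D (27%
    tails), and a biserial r clamped to +/-.99.  `scores` holds 0/1
    per examinee; `totals` holds each examinee's total test score."""
    n = len(scores)
    p = sum(scores) / n  # proportion getting the item correct

    # D: proportion correct in the top 27% minus the bottom 27%,
    # with groups formed by ranking on total test score.
    order = sorted(range(n), key=lambda i: totals[i])
    k = max(1, round(tail * n))
    lower = sum(scores[i] for i in order[:k]) / k
    upper = sum(scores[i] for i in order[-k:]) / k
    d_index = upper - lower

    # Biserial r via the normal-ordinate formula:
    # r = (M_correct - M_all) / SD_all * (p / y), where y is the
    # normal density at the p-th quantile (requires 0 < p < 1).
    mean_all = sum(totals) / n
    sd_all = math.sqrt(sum((t - mean_all) ** 2 for t in totals) / n)
    mean_correct = sum(t for s, t in zip(scores, totals) if s) / sum(scores)
    y = NormalDist().pdf(NormalDist().inv_cdf(p))
    r_bis = (mean_correct - mean_all) / sd_all * p / y
    # Clamp to the +/-.99 limits imposed when |r| would exceed 1.00.
    return p, d_index, max(-0.99, min(0.99, r_bis))

# A sharply discriminating 10-examinee item: every high scorer answers
# correctly, so the raw biserial exceeds 1.00 and is clamped.
scores = [1, 1, 1, 0, 1, 0, 1, 0, 0, 0]
totals = [95, 90, 88, 70, 85, 60, 80, 55, 65, 50]
p, d, r = item_statistics(scores, totals)
print(p, d, r)  # 0.5 1.0 0.99
```

The example also shows why the .99 ceiling matters: with perfect separation of high and low scorers the biserial formula can exceed unity.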
The Upper-Lower Index provides information slightly different from either of the other two indices. The difficulty level of an item influences the value of D, with the highest values only possible in the intermediate range of difficulties. Since the two indices function differently when item difficulties are low, and since a high percentage of the test items to be used in the study had low difficulty levels, the decision was made to use both indices as estimators of item quality. One final note concerning the use of the biserial coefficient is necessary. If an item has a very low or very high difficulty level, the value for the biserial will occasionally exceed 1.00. Since in reality it is impossible to achieve a correlation of more than 100%, the upward limit for the biserial was set at .99 and the lower limit at -.99.

DESIGN

To avoid sources of internal invalidity in any experimental design, it is necessary to attempt to make sure that the two groups of subjects (in this case items) do not differ in any way other than on the experimental treatment (in this case item format) (Campbell & Stanley, 1963). Other possible sources of difference should either be included within the design of the study or controlled for, either during the study or in the statistical analysis. With this in mind, the variable of content area (focal problem vs. clinical science) was included in the statistical analysis of each comparison. The reasoning behind the decision to include this factor in the analysis was based on the differing orientation of the two overall groups of test constructors. The focal problem exams were used as criterion-referenced exams with an absolute cut-off for passing of 70 percent correct, whereas the clinical science exams were used as norm-referenced exams with the passing score dependent on the performance of the specific group being tested.
Since it is not certain what effect this difference in orientation may have on difficulty or discrimination values for items of any particular item format, this factor was included in the analysis. Several other factors were also identified as being possible extraneous sources of difference. Therefore, these were controlled to the extent feasible and in the manner described below:

1. Violations of item writing principles not under study: All items containing any violations of item writing principles not under study were excluded from the comparison groups. These violations included: a) items that included any complex alternatives or the alternatives "all" or "none of the above"; b) items lacking grammatical consistency between the stem and all options; c) items where the correct answer was significantly longer and contained significantly more qualification than the distractors; and d) negative-stemmed or multiple-answer items where the correct answer and its direct opposite were both included in the answer set.

2. Number of options: Since all multiple-answer multiple-choice items (National Board Type K) have five options, these items were compared only to one another or to five-option SA-MC items. For the comparisons involving only single-answer items, the average number of options was controlled so that this figure would be very similar for both item types in each comparison made.

3. Item content: The same basic procedure was used to control content for all comparisons made in the study. In general, the procedure involved selection of pairs of items from the same exam which shared the same content objective. The procedure used was described in detail in an earlier section.

HYPOTHESES AND ANALYSIS METHODS

IA Ho: There is no difference in the mean difficulty of test items based on item format (single-answer multiple-choice (SA-MC) versus multiple-answer multiple-choice (MA-MC)).
Ha: There is a difference in the mean difficulty of test items based on item format (single-answer multiple-choice (SA-MC) versus multiple-answer multiple-choice (MA-MC)).

IB Ho: There is no difference in the mean discrimination (using D, the upper-lower index) of test items based on item format (single-answer multiple-choice (SA-MC) versus multiple-answer multiple-choice (MA-MC)).

Ha: There is a difference in the mean discrimination (using D, the upper-lower index) of test items based on item format (single-answer multiple-choice (SA-MC) versus multiple-answer multiple-choice (MA-MC)).

IC Ho: There is no difference in the mean discrimination (based on rBis, the biserial correlation coefficient) of test items based on item format (single-answer multiple-choice (SA-MC) versus multiple-answer multiple-choice (MA-MC)).

Ha: There is a difference in the mean discrimination (based on rBis, the biserial correlation coefficient) of test items based on item format (single-answer multiple-choice (SA-MC) versus multiple-answer multiple-choice (MA-MC)).

IIA Ho: There is no difference in the mean difficulty of regular one-correct-answer multiple-choice test items based on completeness of the stem (single-answer multiple-choice items without rule violation (SA-MC) versus single-answer multiple-choice items with uninformative stems (US-MC)).

Ha: There is a difference in the mean difficulty of regular one-correct-answer multiple-choice test items based on completeness of the stem (single-answer multiple-choice items without rule violation (SA-MC) versus single-answer multiple-choice items with uninformative stems (US-MC)).

IIB Ho: There is no difference in the mean discrimination (using D, the upper-lower index) in regular one-correct-answer multiple-choice test items based on completeness of the stem (single-answer multiple-choice items without rule violation (SA-MC) versus single-answer multiple-choice items with uninformative stems (US-MC)).
Ha: There is a difference in the mean discrimination (using D, the upper-lower index) in regular one-correct-answer multiple-choice test items based on completeness of the stem (single-answer multiple-choice items without rule violation (SA-MC) versus single-answer multiple-choice items with uninformative stems (US-MC)).

IIC Ho: There is no difference in the mean discrimination (using rBis, the biserial correlation coefficient) in regular one-correct-answer multiple-choice test items based on completeness of the stem (single-answer multiple-choice items without rule violation (SA-MC) versus single-answer multiple-choice items with uninformative stems (US-MC)).

Ha: There is a difference in the mean discrimination (using rBis, the biserial correlation coefficient) in regular one-correct-answer multiple-choice test items based on completeness of the stem (single-answer multiple-choice items without rule violation (SA-MC) versus single-answer multiple-choice items with uninformative stems (US-MC)).

IIIA Ho: There is no difference in the mean difficulty of type k multiple-choice items based on completeness of the stem (multiple-answer multiple-choice (MA-MC) versus multiple-answer multiple-choice items with uninformative stems (US-MA-MC)).

Ha: There is a difference in the mean difficulty of type k multiple-choice items based on completeness of the stem (multiple-answer multiple-choice (MA-MC) versus multiple-answer multiple-choice items with uninformative stems (US-MA-MC)).

IIIB Ho: There is no difference in the mean discrimination (using D, the upper-lower index) of type k multiple-choice items based on completeness of the stem (multiple-answer multiple-choice (MA-MC) versus multiple-answer multiple-choice items with uninformative stems (US-MA-MC)).
Ha: There is a difference in the mean discrimination (using D, the upper-lower index) of type k multiple-choice items based on completeness of the stem (multiple-answer multiple-choice (MA-MC) versus multiple-answer multiple-choice items with uninformative stems (US-MA-MC)).

IIIC Ho: There is no difference in the mean discrimination (using rBis, the biserial correlation coefficient) of type k multiple-choice items based on completeness of the stem (multiple-answer multiple-choice (MA-MC) versus multiple-answer multiple-choice items with uninformative stems (US-MA-MC)).

Ha: There is a difference in the mean discrimination (using rBis, the biserial correlation coefficient) of type k multiple-choice items based on completeness of the stem (multiple-answer multiple-choice (MA-MC) versus multiple-answer multiple-choice items with uninformative stems (US-MA-MC)).

IVA Ho: There is no difference in the mean difficulty of regular one-correct-answer multiple-choice test items based on stem orientation (single-answer multiple-choice items with positive stems (SA-MC) versus single-answer multiple-choice items with negative stems (NS-MC)).

Ha: There is a difference in the mean difficulty of regular one-correct-answer multiple-choice test items based on stem orientation (single-answer multiple-choice items with positive stems (SA-MC) versus single-answer multiple-choice items with negative stems (NS-MC)).

IVB Ho: There is no difference in the mean discrimination (using D, the upper-lower index) of regular one-correct-answer multiple-choice test items based on stem orientation (single-answer multiple-choice items with positive stems (SA-MC) versus single-answer multiple-choice items with negative stems (NS-MC)).
IVB H1: There is a difference in the mean discrimination (using D, the upper-lower index) of regular one-correct-answer multiple-choice test items based on stem orientation (single-answer multiple-choice items with positive stems (SA-MC) versus single-answer multiple-choice items with negative stems (NS-MC)).

IVC Ho: There is no difference in the mean discrimination (using rBis, the biserial correlation coefficient) of regular one-correct-answer multiple-choice test items based on stem orientation (single-answer multiple-choice items with positive stems (SA-MC) versus single-answer multiple-choice items with negative stems (NS-MC)).

Univariate repeated measures ANOVAs will be used to test the three hypotheses related to each comparison. The level of significance for the F-tests is set at .05. However, since a univariate approach will be used, the Bonferroni approach (Harris, 1975) for correction of the possibility of an inflated alpha will also be used. With this method the desired alpha level is divided by the number of comparisons (i.e., .05 divided by 3), so the cut-off for significance is set at .017. This approach is conservative because it assumes the highest possible inflation of alpha, which would occur only in cases where the dependent variables are completely independent and unrelated. Since a strong positive correlation would be expected between the three measures used, especially between the two discrimination measures, the Bonferroni approach should more than control for any inflation of alpha, the type one error, resulting from making multiple comparisons. Although this model is robust with respect to normality, the assumption was tested and the findings indicated that the distributions for the three statistics fell well within the acceptable range. The r to z transformation of the biserial was also tested but was found not to fit the model closely at all. Therefore consideration of its use was dropped.
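The two computations just described, the Bonferroni cut-off and the r to z transformation that was considered and dropped, can be sketched as follows (a minimal illustration, not the study's original code; the function names are mine):

```python
# Illustrative sketch (not from the study): the Bonferroni-adjusted
# significance cut-off and Fisher's r-to-z transformation.
import math

def bonferroni_cutoff(alpha, n_tests):
    # Divide the desired alpha by the number of tests per comparison.
    return alpha / n_tests

def fisher_r_to_z(r):
    # Fisher's r-to-z transformation: z = arctanh(r).
    return math.atanh(r)

# Three F-tests per comparison at an overall alpha of .05:
print(round(bonferroni_cutoff(0.05, 3), 3))  # 0.017, the cut-off used here
```

The division assumes the worst case of three completely independent tests, which is why the text above describes the correction as conservative for correlated measures.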
SUMMARY

This study was designed to test the effect of differing item type on average item difficulty and discrimination when the groups of items being compared are composed of item pairs matched for content and test of origin. Five item types were identified within exams in the College of Human Medicine at Michigan State University: single-answer multiple-choice (SA-MC), national board type k multiple-answer multiple-choice (MA-MC), single-answer multiple-choice with uninformative stems (US-MC), type k with uninformative stems (US-MA-MC), and single-answer multiple-choice with negative stems (NS-MC). Four comparisons between item types were selected as being most relevant: SA-MC vs. MA-MC, SA-MC vs. US-MC, MA-MC vs. US-MA-MC, and SA-MC vs. NS-MC. Groups of item pairs for each comparison were selected independently and on the basis of content similarity. Three item statistics were chosen as estimates of item quality: p, the proportion getting the item correct; D, the upper-lower discrimination index; and rBis, the biserial item-total correlation coefficient. Within each comparison a repeated measures ANOVA will be used to test the specific hypotheses relating to each of the three measures of item quality. Chapter IV presents the results and data analyses performed for this study.

CHAPTER IV

RESULTS

INTRODUCTION

This chapter presents the results of the statistical analyses performed to test the hypotheses of this study. Results are presented and compared concerning the mean values for difficulty and discrimination for each item type within the four comparisons being studied. Overall results will be presented, followed by the results for each individual hypothesis. Finally, additional results relating to the effect of content orientation (focal problem versus clinical science) on item difficulty and discrimination will be reported.

RESULTS BASED ON ITEM TYPE

All of the research hypotheses in this study relate to comparisons based on item type.
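The three item statistics can be computed directly from a matrix of scored responses. The sketch below is mine, not the study's code: numpy is assumed, the 27% upper and lower groups for D are a common convention rather than a value stated in this study, and rBis uses the standard normal-ordinate form of the biserial.

```python
# Illustrative sketch of the three item-quality statistics used in this
# study: p (difficulty), D (upper-lower index), and rBis (biserial
# item-total correlation). Assumes a 0/1 examinee-by-item score matrix.
import numpy as np
from math import exp, pi, sqrt
from statistics import NormalDist

def item_statistics(responses, item, group_frac=0.27):
    totals = responses.sum(axis=1)           # total test score per examinee
    x = responses[:, item]
    p = x.mean()                             # proportion answering correctly
    q = 1.0 - p

    # D: proportion correct in the top group minus the bottom group,
    # groups formed from the extremes of the total-score distribution.
    n_group = max(1, int(round(group_frac * len(totals))))
    order = np.argsort(totals)
    d = x[order[-n_group:]].mean() - x[order[:n_group]].mean()

    # rBis: point-biserial rescaled by sqrt(pq)/y, where y is the normal
    # ordinate at the cut splitting the distribution into p and q.
    m1 = totals[x == 1].mean()
    m0 = totals[x == 0].mean()
    s = totals.std()
    z = NormalDist().inv_cdf(q)
    y = exp(-z * z / 2.0) / sqrt(2.0 * pi)
    r_bis = (m1 - m0) / s * (p * q) / y
    return p, d, r_bis
```

An item answered correctly by 74% of examinees, with the top group outscoring the bottom group on it by .26, would thus be reported as p = .74 and D = .26, the kind of values seen in the tables that follow.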
Table 4.1 displays the means and standard deviations for all three measures of item quality for both groups in each of the four comparisons under study. This table also displays the average number of options per item type for the two relevant comparisons. To control for the possible effect on item statistics of a differing average number of options, an attempt was made to assure that the average number of options was very similar for each group of items in the SA-MC vs. US-MC and SA-MC vs. NS-MC comparisons. As noted on the table, these average values were very close. No averages were reported for the other two comparisons because all of the items included in both comparisons had five options.

TABLE 4.1
MEAN VALUES FOR ALL RELEVANT GROUPS IN COMPARISONS BASED ON ITEM TYPE
(standard deviations in parentheses)

COMPARISON                          ITEM TYPE          p VALUE       UPPER-LOWER D   rBis
I.   Based on item format:          SA-MC (n=124)      .740 (.217)   .263 (.226)     .378 (.279)
     SA-MC versus MA-MC             MA-MC (n=124)      .632 (.237)   .156 (.298)     .185 (.329)
II.  Based on stem quality in       SA-MC (n=51)       .709 (.246)   .262 (.245)     .361 (.284)
     single-answer items:           US-MC (n=51)       .788 (.191)   .181 (.232)     .289 (.314)
     informative vs. uninformative
     (mean options: SA-MC 4.57 (.567); US-MC 4.51 (.538))
III. Based on stem quality in       MA-MC (n=69)       .670 (.208)   .199 (.265)     .257 (.327)
     multiple-answer items:         US-MA-MC (n=69)    .694 (.229)   .159 (.264)     .268 (.322)
     informative vs. uninformative
IV.  Based on stem orientation:     SA-MC (n=143)      .729 (.203)   .256 (.230)     .371 (.282)
     positive versus negative       NS-MC (n=143)      .697 (.235)   .159 (.219)     .251 (.300)
     (mean options: SA-MC 4.78 (.414); NS-MC 4.83 (.381))

RESULTS OF THE SA-MC VERSUS MA-MC COMPARISON

Five-option single-answer multiple-choice items were matched on the basis of content similarity with multiple-answer items from the same test administration.
The resulting 124 item pairs were compared for item quality on the basis of difficulty (p) and discrimination (Upper-Lower Index and rBis). Analysis was performed using a two-way repeated measures analysis of variance with item type as one independent variable and content orientation (focal problem versus clinical science) as the second. Alpha was set at .05. However, since three univariate analyses were performed there was a risk of an inflated type one error. Therefore, the Bonferroni approach for correction was used. In this approach the alpha level is simply divided by the number of comparisons (.05 / 3), making the cutoff for significance p <= .017.

The results of the three ANOVAs relating to Hypotheses IA, IB, and IC are displayed in Table 4.2.

TABLE 4.2
RESULTS FOR THE SA-MC VS. MA-MC COMPARISON
TWO-WAY REPEATED MEASURES ANOVA, n=124

RESULTS FOR DIFFICULTY (p = proportion getting the item correct)
Source          Sum of Squares   df    Mean Square   F         p of F
Content         .578             1     .578          10.173*   .002
Error           6.935            122   .057
SAMC vs. MAMC   .774             1     .774          18.496*   .000
Interaction     .001             1     .001          .020      .887
Error           5.103            122   .042

RESULTS FOR DISCRIMINATION (D = Upper-Lower Index)
Source          Sum of Squares   df    Mean Square   F         p of F
Content         .040             1     .040          .448      .504
Error           10.765           122   .088
SAMC vs. MAMC   .584             1     .584          11.090*   .001
Interaction     .129             1     .129          2.460     .119
Error           6.412            122   .053

RESULTS FOR DISCRIMINATION (rBis = biserial correlation coefficient)
Source          Sum of Squares   df    Mean Square   F         p of F
Content         .081             1     .081          .636      .427
Error           15.567           122   .128
SAMC vs. MAMC   2.134            1     2.134         35.250*   .000
Interaction     .014             1     .014          .226      .635
Error           7.385            122   .061

* Significant at the .05 level, using the Bonferroni approach to set the cut-off for p of F at .017.

Hypothesis IA asks: Are the mean difficulties for the SA-MC and MA-MC item types the same or different? The results indicate that the MA-MC item type (Mean p = .632) was significantly more difficult at the .001 level than the SA-MC type (Mean p = .740), thus resulting in a rejection of the null hypothesis (IAHo) in favor of the alternative hypothesis (IAH1).

Hypothesis IB asks: Are the mean discrimination values (based on the Upper-Lower Index) for the SA-MC and MA-MC item types the same or different? The results indicate that the SA-MC item type (Mean D = .263) had a significantly higher mean value for discrimination when measured using the Upper-Lower Index than the MA-MC item type (Mean D = .156). This brings a rejection of the null hypothesis (IBHo) in favor of the alternative hypothesis (IBH1).

Hypothesis IC asks: Are the mean discrimination values based on rBis the same or different for the SA-MC and MA-MC item types? The results indicate that the SA-MC item type (Mean rBis = .378) had a significantly higher mean value for discrimination when measured using the biserial correlation coefficient than the MA-MC type (Mean rBis = .185). This results in a rejection of the null hypothesis (ICHo) in favor of the alternative hypothesis (ICH1).

Reporting of all results relating to comparisons based on content orientation, or any possible interactions between content orientation and item type, will be deferred until a later section of this chapter.

RESULTS OF THE SA-MC VERSUS US-MC COMPARISON

Single-answer multiple-choice items without rule violations were matched on the basis of content similarity with single-answer items with uninformative stems. The average number of options per item was controlled as closely as possible, resulting in a mean of 4.57 options per SA-MC item and 4.51 options per US-MC item. Fifty-one item pairs resulted. These were compared in the same manner as the SA-MC vs. MA-MC pairs. The results of the three ANOVAs relating to Hypotheses IIA, IIB, and IIC are displayed in Table 4.3.
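For matched item pairs the item-type effect in these analyses is a within-pairs comparison; with only two levels, its repeated-measures F equals the square of a paired t statistic when the content factor is set aside. A minimal sketch on simulated difficulties (my illustration with invented numbers, not the study's data; numpy is assumed):

```python
# Illustrative only (invented numbers, not the study's data): for a
# two-level within factor, the repeated-measures F on matched item
# pairs equals the square of a paired t statistic.
import numpy as np

rng = np.random.default_rng(42)
sa_mc = rng.normal(0.74, 0.22, size=124)           # simulated SA-MC difficulties
ma_mc = sa_mc - rng.normal(0.11, 0.15, size=124)   # matched MA-MC items, harder on average

diffs = sa_mc - ma_mc
t = diffs.mean() / (diffs.std(ddof=1) / np.sqrt(len(diffs)))
f_item_type = t ** 2   # the ANOVA F for the item-type effect, df = (1, n-1)
```

Partitioning the pairs by the content factor, as the two-way analyses above do, changes the error term but not the logic of this within-pairs test.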
Hypothesis IIA asks the question: Are the mean difficulties for the SA-MC and US-MC item types the same or different? The results indicate that the SA-MC item group (Mean p = .709) was significantly more difficult than the US-MC group (Mean p = .788), with a p of F of .016. This results in a rejection of the null hypothesis of no difference (IIAHo) in favor of the alternative hypothesis (IIAH1).

TABLE 4.3
RESULTS FOR THE SA-MC VS. US-MC COMPARISON
TWO-WAY REPEATED MEASURES ANOVA, n=51

RESULTS FOR DIFFICULTY (p = proportion getting the item correct)
Source          Sum of Squares   df   Mean Square   F        p of F
Content         .123             1    .123          2.489    .121
Error           2.415            49   .049
SAMC vs. USMC   .268             1    .268          6.228*   .016
Interaction     .205             1    .205          4.776    .034
Error           2.107            49   .043

RESULTS FOR DISCRIMINATION (D = Upper-Lower Index)
Source          Sum of Squares   df   Mean Square   F        p of F
Content         .0001            1    .0001         .002     .965
Error           2.750            49   .056
SAMC vs. USMC   .215             1    .215          3.665    .061
Interaction     .058             1    .058          .995     .324
Error           2.875            49   .059

RESULTS FOR DISCRIMINATION (rBis = biserial correlation coefficient)
Source          Sum of Squares   df   Mean Square   F        p of F
Content         .003             1    .003          .033     .856
Error           4.500            49   .092
SAMC vs. USMC   .224             1    .224          2.560    .116
Interaction     .176             1    .176          2.009    .163
Error           4.291            49   .088

* Significant at the .05 level, using the Bonferroni approach to set the cut-off for p of F at .017.

Hypothesis IIB asks the question: Are the mean values for discrimination (based on the Upper-Lower Index) for the SA-MC and US-MC item types the same or different? The results showed that the mean value for the SA-MC item type (Mean D = .262) was higher than the mean value for the US-MC item type (Mean D = .181), but that the difference was not significant at the .05 level. Therefore, the null hypothesis (IIBHo) could not be rejected.
Hypothesis IIC asks the question: Are the mean values for discrimination (based on the biserial correlation coefficient) for the SA-MC and US-MC item types the same or different? The results showed that although the mean discrimination value for the SA-MC type (Mean rBis = .361) was higher than that for the US-MC type (Mean rBis = .289), this difference was not significant at the .05 level. Therefore, the null hypothesis (IICHo) was not rejected.

RESULTS OF THE MA-MC VERSUS US-MA-MC COMPARISON

Multiple-answer multiple-choice items without rule violations were matched on the basis of content similarity to multiple-answer items with uninformative stems. The sixty-nine item pairs which resulted were compared in the same manner as the SA-MC vs. MA-MC pairs. The results of the three ANOVAs relating to Hypotheses IIIA, IIIB, and IIIC are displayed in Table 4.4.

TABLE 4.4
RESULTS* FOR THE MA-MC VS. US-MA-MC COMPARISON
TWO-WAY REPEATED MEASURES ANOVA, n=69

RESULTS FOR DIFFICULTY (p = proportion getting the item correct)
Source             Sum of Squares   df   Mean Square   F       p of F
Content            .216             1    .216          3.514   .065
Error              4.127            67   .062
MAMC vs. US-MAMC   .014             1    .014          .445    .507
Interaction        .000             1    .000          .0002   .989
Error              2.173            67   .032

RESULTS FOR DISCRIMINATION (D = Upper-Lower Index)
Source             Sum of Squares   df   Mean Square   F       p of F
Content            .029             1    .029          .470    .495
Error              4.133            67   .062
MAMC vs. US-MAMC   .047             1    .047          .590    .445
Interaction        .001             1    .001          .018    .893
Error              5.345            67   .080

RESULTS FOR DISCRIMINATION (rBis = biserial correlation coefficient)
Source             Sum of Squares   df   Mean Square   F       p of F
Content            .105             1    .105          1.052   .309
Error              6.663            67   .099
MAMC vs. US-MAMC   .005             1    .005          .043    .836
Interaction        .001             1    .001          .010    .921
Error              7.575            67   .113

* None of the results were significant at the .05 level, using a cut-off of .017 for p of F determined using the Bonferroni approach.

Hypothesis IIIA concerned the relationship between the two item types based on p, the proportion getting each item correct (difficulty), asking whether the mean difficulty levels are the same or different. The results showed the two means to be very similar, with the mean difficulty equal to .670 for the MA-MC type and .694 for the US-MA-MC type. Since there was no significant difference at the .05 level, the null hypothesis (IIIAHo) was not rejected.

Hypothesis IIIB concerned the relationship between the two item types based on D, the Upper-Lower Index of item discrimination, asking whether the mean values for discrimination are the same or different. The results showed the two means to be quite similar, with an average D of .199 for the MA-MC type and .159 for the US-MA-MC type. Since there was no significant difference at the .05 level, the null hypothesis (IIIBHo) was not rejected.

Hypothesis IIIC concerned the relationship between the two item types based on rBis, the biserial item-total correlation coefficient, asking whether the mean values for discrimination are the same or different.
The results showed the two means to be almost identical, with an average rBis of .257 for the MA-MC item type and .268 for the US-MA-MC type. Since there was no significant difference at the .05 level, the null hypothesis (IIICHo) was not rejected.

RESULTS FOR THE SA-MC VERSUS NS-MC COMPARISON

Single-answer multiple-choice items with positive stem orientation were matched on the basis of content similarity with single-answer items with negatively stated stems. The average number of options per item was controlled as closely as possible, resulting in a mean of 4.78 options per SA-MC item and 4.83 options per NS-MC item. One hundred forty-three item pairs resulted. These were compared in the same manner as the SA-MC vs. MA-MC pairs.

The results of the three ANOVAs relating to Hypotheses IVA, IVB, and IVC are displayed in Table 4.5. Hypothesis IVA asks the question: Are the mean difficulties for the SA-MC and NS-MC item types the same or different? The results indicate that the NS-MC item type (Mean p = .697) was slightly more difficult than the SA-MC item type (Mean p = .729). However, since this difference was not significant at the .05 level, the null hypothesis (IVAHo) was not rejected.

Hypothesis IVB asks the question: Are the mean discrimination values (based on the Upper-Lower Index) for the SA-MC and NS-MC item types the same or different? The results indicate that the SA-MC item type (Mean D = .256) had a significantly higher mean value for discrimination when measured using the Upper-Lower Index than the NS-MC item type (Mean D = .159). This results in a rejection of the null hypothesis (IVBHo) in favor of the alternative hypothesis (IVBH1).

Hypothesis IVC asks the question: Are the mean discrimination values based on rBis the same or different for the SA-MC and NS-MC item types?
The results indicate that the SA-MC item type (Mean rBis = .371) had a significantly higher average value for discrimination when measured using the biserial correlation coefficient than the NS-MC type (Mean rBis = .251). This results in a rejection of the null hypothesis (IVCHo) in favor of the alternative hypothesis (IVCH1).

TABLE 4.5
RESULTS FOR THE SA-MC VS. NS-MC COMPARISON
TWO-WAY REPEATED MEASURES ANOVA, n=143

RESULTS FOR DIFFICULTY (p = proportion getting the item correct)
Source          Sum of Squares   df    Mean Square   F         p of F
Content         1.366            1     1.366         26.196*   .000
Error           7.353            141   .052
SAMC vs. NSMC   .076             1     .076          2.161     .144
Interaction     .038             1     .038          1.077     .301
Error           4.947            141   .035

RESULTS FOR DISCRIMINATION (D = Upper-Lower Index)
Source          Sum of Squares   df    Mean Square   F         p of F
Content         .142             1     .142          2.558     .112
Error           7.830            141   .056
SAMC vs. NSMC   .671             1     .671          15.005*   .000
Interaction     .053             1     .053          1.179     .279
Error           6.305            141   .045

RESULTS FOR DISCRIMINATION (rBis = biserial correlation coefficient)
Source          Sum of Squares   df    Mean Square   F         p of F
Content         .899             1     .899          9.361*    .003
Error           13.537           141   .096
SAMC vs. NSMC   1.028            1     1.028         15.188*   .000
Interaction     .038             1     .038          .566      .453
Error           9.543            141   .068

* Significant at the .05 level, using the Bonferroni approach to set the cut-off for p of F at .017.

RESULTS BASED ON CONTENT ORIENTATION

All of the research hypotheses in this study relate to comparisons based on differences in item type. As stated earlier, the analysis for each hypothesis was performed using a two-way repeated measures ANOVA with one of the independent variables being content orientation. This variable was included in the analysis primarily to determine if content orientation interacts with item type in any of the relevant comparisons.
This type of interaction might occur if performance on any specific item type varied systematically, either with the type of content or with increased student experience with that item type. (Focal Problems are studied by year 1 and 2 students, while Clinical Sciences are studied by year 3 and 4 students.) In either case an interaction effect would be demonstrated by a statistically significant F test for the interaction term. Three analyses were performed for each of the major research hypotheses (I - IV). There were no significant interactions found (at the .05 level) in any of the analyses. Therefore, it is assumed that the two variables, item type and content orientation, act independently in relation to item difficulty and discrimination.

Table 4.6 displays the means and standard deviations for all three measures of item quality, for both content areas in each of the four comparisons under study. Results related to average item difficulty showed that in all cases the Clinical Science items had a higher average level of difficulty, with this difference significant at the .05 level for both the SA-MC vs. MA-MC and SA-MC vs. NS-MC comparisons. Results related to average item discrimination showed that the mean values were generally quite similar for the two content orientations, especially those for the Upper-Lower Index. However, one significant difference was found within comparison IV (SA-MC and NS-MC items). The average value for rBis was higher for Focal Problem items (Mean rBis = .368) than for Clinical Science items (Mean rBis = .256).

TABLE 4.6
MEAN VALUES FOR ALL RELEVANT GROUPS IN COMPARISONS BASED ON CONTENT AREA
(standard deviations in parentheses)

COMPARISON                 CONTENT AREA               p VALUE        UPPER-LOWER D   rBis
I.   SA-MC versus MA-MC    Focal Problem (n=130)      .732a (.232)   .198 (.240)     .299 (.314)
                           Clinical Science (n=118)   .635a (.224)   .223 (.300)     .262 (.326)
II.  SA-MC versus US-MC    Focal Problem (n=66)       .774 (.202)    .222 (.236)     .329 (.300)
                           Clinical Science (n=36)    .702 (.254)    .220 (.252)     .318 (.305)
III. MA-MC versus          Focal Problem (n=108)      .703 (.224)    .171 (.254)     .277 (.340)
     US-MA-MC              Clinical Science (n=30)    .607 (.181)    .206 (.301)     .210 (.255)
IV.  SA-MC versus NS-MC    Focal Problem (n=142)      .783b (.197)   .230 (.223)     .368c (.314)
                           Clinical Science (n=144)   .644b (.220)   .185 (.234)     .256c (.268)

a,b,c The differences between these mean values were significant at the .05 level, using the Bonferroni approach to adjust the cutoff for p of F to .017.

SUMMARY

Results of all analyses performed based on item type and used to test the hypotheses of the study have been presented in this chapter and are summarized below:

Hypothesis I - In the comparison of SA-MC and MA-MC items, the MA-MC items were found to be significantly more difficult and less discriminating than the SA-MC items.

Hypothesis II - In the comparison of SA-MC and US-MC items, the US-MC items were found to be significantly less difficult than the SA-MC items. There were no significant differences in mean item discrimination.

Hypothesis III - There were no significant differences in either difficulty or discrimination found when MA-MC and US-MA-MC items were compared.

Hypothesis IV - In the comparison of SA-MC and NS-MC items, the NS-MC items were found to be significantly less discriminating than the SA-MC items. There was no significant difference in difficulty between the two item types.

Results based on content orientation were also reported in this chapter and indicated that items on the clinical science exams tended to be more difficult than items on the focal problem exams. This difference reached the .05 level of significance within the SA-MC vs. MA-MC and SA-MC vs. NS-MC comparisons. There did not appear to be any consistent difference in the mean discrimination values between the two content orientations, although there was one significant difference based on rBis within the SA-MC vs.
NS-MC comparison. This difference showed the focal problem items to be more discriminating. In Chapter V these findings will be discussed, conclusions will be drawn, and suggestions for future research will be made.

CHAPTER V

DISCUSSION AND CONCLUSIONS

INTRODUCTION

In this chapter the results reported in Chapter IV will be discussed and related, where relevant, to past research. Each comparison will be discussed separately, followed by a general discussion and summary. Conclusions will then be drawn and suggestions for future research made.

DISCUSSION

SA-MC versus MA-MC Comparison

The purpose of Hypotheses IA-IC was to help answer the research question: How do the two item types (SA-MC and MA-MC), as typically written and used within medical education, compare on the basis of average item difficulty and discrimination? One reason for asking this question was a concern that the MA-MC (national board type k) item type might be more complex without being any more effective than the more straightforward single-answer multiple-choice type. The results of testing Hypotheses IA-IC provide evidence to support this concern. The MA-MC type was found to be significantly more difficult (based on p) while having a significantly lower average value for discrimination (based on both the Upper-Lower Index and rBis), as displayed in Table 4.2.

Past research comparing a variety of multiple-answer formats to one another and to the single-answer format showed few significant differences. The study having the most relevance to the present study (Skakun et al., 1977) showed no significant differences between the SA-MC and MA-MC formats based on 18 matched pairs of surgery items written by the authors for a certification examination. Two differences between that study and the present study may account for the differing results. First, the items in the Skakun study were written by professional item writers while those in this study were written by regular classroom teachers.
Second, 18 items is a very small sample, providing little power to detect differences.

Careful examination of the results of the present study and of all items used in the SA-MC vs. MA-MC comparison disclosed several hypotheses which might explain the results. The first is the straightforward hypothesis that the results are due to an inherent difference in complexity between the two formats. This difference in complexity is easily demonstrated, since the SA-MC format always calls for one and only one correct alternative, while the MA-MC format can have anywhere from one to four correct alternatives arranged to form five different combinations. (Recall that the choices are: option 1 if A, B, and C are correct; 2 if A and C are correct; 3 if B and D are correct; 4 if only D is correct; and 5 if A, B, C, and D are correct.) It is reasonable to hypothesize that this inherent complexity contributed to the increased difficulty of the format and may have affected discrimination.

The second hypothesis relates to the content of the keyed correct responses of the two formats. Well-written SA-MC items generally have the best possible answer as the keyed correct response; the students need only identify this response. In comparison, MA-MC items sometimes include in the correct response the best possible answer plus one or more other responses which are correct less frequently or only under certain circumstances, so students are forced to make finer distinctions between what is and what is not correct. An example may help to clarify.

Example: The clinical symptoms of hepatitis are in some ways similar to and often confused with:
*A. infectious mononucleosis
*B. herpes simplex infections
*C. toxoplasmosis
*D. idiosyncratic drug reactions
(p = .33, D = .20, rBis = .30)

In this item only 33% of the students chose the correct response (option 5, all are correct), while 55% of the students chose option 4 (D only), and all students included D as part of their chosen response.
Clearly the students were most familiar with the similarity of symptoms between hepatitis and idiosyncratic drug reactions, while most did not judge the connection between the symptoms of hepatitis and the other problems significant enough to mark them as correct responses.

Items calling for this type of decision making can be either appropriate or inappropriate. This would depend primarily on whether the total item, including all options, was at the correct level for the students being tested. Logically, items of increased difficulty but with appropriate content should show higher item discrimination, while those at an inappropriate level should discriminate poorly. The hypothesis is that many items contain options at an inappropriate level, so that both difficulty and discrimination are affected.

The inappropriate level of some items is due at least partially to the fact that teachers sometimes include alternatives that contain information unfamiliar to most or all students being tested. This can happen intentionally, based on the questionable notion that including some very difficult or unfamiliar material "challenges" students. It can also happen unintentionally. An instructor needing just one more option may "settle for" something less than optimal. Further, many instructors, especially physicians, tend to write items using their own experiences instead of referring directly to the materials available to students. This practice of including unfamiliar material in test items is questionable at best, but does not pose much problem to the student whose task is to choose the one best response. He will probably simply ignore the unfamiliar option. However, when the task is to choose the appropriate combination of correct alternatives, an option with unfamiliar material poses a greater problem. If the student has no rational basis for judging the correctness of the option he is forced to guess.
This type of guessing interferes with the item's ability to measure student learning. Further, it is contended that this problem is exacerbated by the use of confusing wording in some stems. An example will help illustrate this:

In a patient with major motor seizures, status epilepticus may be precipitated by:
*A. abrupt withdrawal of anticonvulsant drugs
*B. brain tumor
*C. brain injury
D. hypoglycemia
(p = .51, D = .00, rBis = -.13)

An examination of the item analysis for this item showed that while 51% of the students chose the correct option (option 1; A, B, and C correct), the other 49% chose option 5 (all correct). This indicates either that many students were not familiar with the relationship (or lack of one) between hypoglycemia and the onset of status epilepticus, or that they interpreted the phrase "may be precipitated" differently than the item writer intended. Careful reading of the stems shows that both this item and the earlier example contain phrases which could cause confusion. The key phrase in each item ("often confused with" and "may be precipitated by") requires a value judgement: how often is often enough, and is the opposite of "may be" never or seldom? Use of nebulous wording like this within a multiple-correct-answer format can detract from the accurate measurement of student knowledge. My hypothesis is that the use of this type of terminology, combined with the problem of varying degrees of correctness among options, makes the students' task more difficult, resulting in increased item difficulty and in some cases negatively affecting item discrimination.

To summarize, the results of testing Hypotheses IA-IC indicated that the MA-MC items are more difficult and less discriminating.
Two hypotheses were offered as possible explanations: 1) the format itself is more complex, which results in items of higher difficulty but lower discrimination; and 2) the format is used poorly by classroom teachers, resulting in items which are overly difficult and less discriminating due to (a) confusing wording in the stem and/or (b) inclusion of options with unfamiliar information or distinctions too fine for students to discern. A final possibility is that the difference in difficulty and discrimination between the SA-MC and MA-MC formats is due to a combination of all of the factors listed above.

SA-MC versus US-MC Comparison

Hypotheses IIA - IIC addressed the question: What effect does completeness (informativeness) of the stem of multiple-choice items have on item difficulty and discrimination? In these hypotheses single-answer items with complete, informative stems were compared with those with uninformative stems. The results of the analysis, displayed in Table 4.3, indicated that while the US-MC items were less difficult than the SA-MC items, there was no significant difference in the average value of either discrimination index.

How does this result compare to the results of past research? Of the studies discussed earlier in the review of the literature, only one shares enough similarity with the present study to merit discussion here. In this study (Schmeiser and Whitney, 1975) a teacher-made sociology exam was identified in which 22 out of 61 items had incomplete stems. The items containing this error were then rewritten by the authors to make the stems appropriately complete. Two groups of items resulted, which were compared on the basis of item statistics. The results of this analysis (detailed earlier) indicated that the items with less complete stems were more difficult but equally discriminating. On the surface this result seems to contradict the results of the present study.
However, the two studies are not really addressing the same research question. In the Schmeiser and Whitney study the comparison items were identical except for the changes, made by the authors, in the wording of the stems. In the present study, while the items were intended to test the same content or objective, the comparison items were written independently and were not identical. Comparison of the conclusions which could be drawn on the basis of the results of the respective studies will make this difference clear. An appropriate conclusion which could be drawn from the Schmeiser and Whitney research would be: when measurement specialists rewrite items that have uninformative stems, so that the stems are as informative as possible, the resulting items are easier than the original items but no more discriminating. This was a logical result; when an item which poses no identifiable question in the stem is rewritten so that the question asked is clearer, the rewritten item ought to be less difficult. Since the present study has a different focus, the results should not be expected to be identical. A conclusion based on the results of the present study would be: when items with uninformative stems are matched with independently written informative-stemmed items that were intended to test the same content or objective, the items with uninformative stems are easier but not significantly less discriminating.

What possible explanations are there for the present result? Careful examination of the items included in the SA-MC vs. US-MC comparison reinforced the earlier statement that many US-MC items lack a clear focus, i.e., many of these items do not appear to be based on a definite single item idea. Table 5.1 shows two sample SA-MC vs. US-MC item pairs. In the first pair both items were intended to test student knowledge about seizures in childhood.
It can be seen that the US-MC item is very general in focus (grand mal seizures) and that the options are quite heterogeneous, while the focus of the SA-MC item is specific (pharmacologic treatment of petit mal seizures under specified circumstances) and the options very homogeneous. A similar comparison can be made between the two items in the second pair.

TABLE 5.1

SAMPLE SA-MC VS. US-MC ITEM PAIRS

US-MC Item
A grand mal seizure (major motor seizure)
  1. seldom causes unconsciousness
  2. never has an aura preceding it
 *3. may follow a temporal lobe seizure
  4. is usually construed to mean movement of the arm and leg on one side

SA-MC Item
When a child with petit mal seizures shows no response to phenobarbital, the drug of choice is now
 *1. ethosuximide (Zarontin)
  2. trimethadione (Tridione)
  3. acetazolamide (Diamox)
  4. paramethadione (Paradione)
  5. phensuximide (Milontin)

US-MC Item
The space of Disse in the liver
  1. contains both the plasma and cellular elements of blood
  2. receives bilirubin glucuronide on its way from the hepatocyte to the bile ducts
  3. connects directly with the central veins
 *4. receives proteins formed in the hepatocytes on their way to the intravascular compartment

SA-MC Item
The liver arises as a diverticulum of
  1. esophagus
  2. midgut
 *3. foregut
  4. hindgut

Both items were testing student knowledge of the anatomy of the liver. However, while the SA-MC item asked a specific question with only one item idea behind it, the US-MC item was less specific, with the options being a collection of facts (or falsehoods) related in some way to a particular part of the liver. These examples are shown here to raise the possibility that since many of the US-MC items have stems which are quite general and options which are quite heterogeneous, average item difficulty may have been affected.
One of the precepts of item writing (Wesman, 1971; Ebel, 1979) states that items with heterogeneous options (options with meanings that are widely divergent) tend to be less difficult than items with homogeneous options (options with meanings that are very similar). Since a high percentage of the US-MC items used heterogeneous options, while most SA-MC items had more homogeneous options, this is suggested as a possible explanation for the results showing US-MC items to be significantly less difficult than the SA-MC items.

A second possible explanation for these results relates to the statement made in Chapter II that very often the type of item classified as having an uninformative stem was found, on careful examination by medical faculty, to have been testing several concepts, none of which were important concepts. In other words, many of these items are basically trivial in nature. This is offered here as a possible explanation of why this item type had a lower average difficulty than the SA-MC group. However, this logic would also lead to the expectation that the US-MC items would also be less discriminating. While it was the case that the average value for both D and rBis was lower for the US-MC type, neither difference was significant. It is possible that future research using more refined techniques for identification of US-MC items and for pairing these with appropriate SA-MC items might be able to detect significant differences.

MA-MC versus US-MA-MC Comparison

Hypotheses IIIA - IIIC addressed the same basic question asked in Hypothesis II: What effect does completeness (informativeness) of the item stem have on item difficulty and discrimination? However, the focus of Hypothesis III was on the MA format, comparing multiple-answer items with complete, informative stems to those judged to have uninformative stems.
The results, shown in Table 4.4, indicated that there were no significant differences between the two item types, based on either difficulty or discrimination.

There are two logical alternative explanations for this lack of significant results. First is the possibility that the completeness of the stem has, in fact, no effect on either the difficulty or discrimination of multiple-answer items. An alternative explanation relates to the general problems of the MA format outlined earlier in the discussion of the SA-MC vs. MA-MC comparison. In that discussion several previously unlooked-for item writing errors were proposed as problems in the multiple-answer format. Since these problems, i.e., specific types of confusing wording in the item stem and use of inappropriate alternatives, were not considered in the item selection process for the MA-MC vs. US-MA-MC comparison, it is doubtful that items classified as "well-written" MA-MC items were actually error free. Therefore, the results of the present comparison could have been contaminated by items of both types, MA-MC and US-MA-MC, containing errors other than the one under study. Future research will be better able to take these errors into consideration when studying differences between variations of the multiple-answer format. Under those more controlled conditions the most common errors within the multiple-answer format can be identified and studied.

SA-MC versus NS-MC Comparison

Hypotheses IVA - IVC were directed toward the comparison of items based on stem orientation, positive versus negative. The results, shown in Table 4.5, indicated that while there was no significant difference in difficulty between the SA-MC and NS-MC item types, the SA-MC type had a significantly higher average value for discrimination, based on both the upper-lower and biserial indexes. This result differs somewhat from those of the earlier studies described in the review of the literature.
Recall that two studies were described (Terannova, 1969; Dudycha and Carpenter, 1973), both of which used professionally recast items to test for differences. Each study found items with negative stems to be more difficult than those with positive stems, but found no significant differences in discrimination. In the present study, although the difference in difficulty was not statistically significant, the negative-stemmed items were found to be somewhat more difficult. The more notable difference was that the present study showed a significant difference in average discrimination while the earlier studies did not.

As discussed in the section relating to the SA-MC vs. US-MC comparison, the difference in research approach between the use of recast items in past studies and the use of independently written items in this study goes far to explain the differing results. Further, there were other differences between these past studies and the present study. The Terannova study looked only at two-option multiple-choice items, while the average number of options per item in the present study was 4.8, which could contribute to differences. In the Dudycha and Carpenter study the initial items used in the recasting process were chosen on the basis of middle difficulty and high values for discrimination, the same bases later used to compare item groups. Use of the same basis for both selection and comparison casts some doubt on the validity of this result and adds to the differences between their study and this one.

The present study provides evidence that the NS-MC item type is less effective than the positive-stem (SA-MC) item type. What factors might contribute to this decreased effectiveness? The first factor which might logically have an effect on item discrimination is the negative orientation itself and the resultant fact that the keyed correct response is actually a wrong answer.
Not only could this present problems for examinees, but it is proposed here that it presents even more difficulties for instructor item writers and can affect item quality because of the relationship between the "keyed" correct response and item quality. With the positive orientation, where the item calls for the best correct response, the quality of the item depends on: 1) the item idea and its expression in the stem, 2) the correct response, and 3) the appropriateness of the incorrect or less correct alternatives. However, with the negative orientation, where the item calls for the incorrect or least correct response, the order of importance changes to become: 1) the item idea and its expression in the stem, 2) the wrong response, and 3) the appropriateness of the correct alternatives. It is possible that it is easier to write high quality items (based on item discrimination) when the quality is based more on the item writer's knowledge of the most correct answer than when the quality depends more on the writer's choosing the best wrong answer. In other words, typical instructors in medical education may be less able to write appropriate NS-MC items, which results in lower average discrimination values.

A second possible factor contributing to this difference in discrimination is the type of questions asked within the two item types. On careful examination of the NS-MC items used in this comparison, the negative orientation was found to be a rather stereotypic approach. Approximately 50% of the NS-MC items had stems which were very general and often poorly focused, i.e., the specific item idea is unclear. In this approach, for simplicity referred to as the PFS (poorly focused stem) approach, the stems took one of the following three general forms:

1) All of the following statements concerning ..... are true Except:

2) Each (All) of the following may be (are) associated with ..... Except:

3) All of the following are characteristics of .....
Except:

Table 5.2 provides a specific sample item of each form. The PFS approach appears to be unique to the NS-MC item type, and its prevalence results in a very large number of items which are rather superficial, since they ask only for recognition of facts. Looking again at the examples, it can also be seen that instead of asking a direct question with a specific correct answer, this approach uses a general stem which provides the student with less direction and requires the student to choose the incorrect response from among a list of statements. It is not known whether this approach results in items with difficulties or discrimination values that differ from other NS-MC items, but because of its prevalence and uniqueness within this item type it is offered as a possible factor.

TABLE 5.2

SAMPLE NS-MC ITEMS WITH POORLY FOCUSED STEMS (PFS ITEMS)

1) All of the following statements concerning echoencephalograms are true Except:
   1. it is an invasive procedure with many associated potential hazards
   2. midline displacement of 3mm is considered abnormal
   3. it is difficult to distinguish epidural from subdural hematomas with this test
   4. it is most beneficial when used in conjunction with other tests
   5. it is not useful in posterior fossa tumor detection

2) All of the following are associated with pityriasis rosea Except:
   1. acute, self-limited disease
   2. allergic origin
   3. herald (patch) lesion
   4. rare before age one year
   5. corticosteroids are effective when severe pruritus is present

3) All of the following are characteristics of penicillin G Except:
   1. rapid renal excretion
   2. instability in gastric acid
   3. rather poor penetration into CSF
   4. high incidence of adverse and toxic reactions
   5. narrow antimicrobial spectrum

To further explore the SA-MC vs. NS-MC comparison and the possible effect of the PFS approach, a grouped frequency distribution was prepared using values for both D and rBis. This distribution is displayed in Table 5.3.
Using this distribution the 26 best items (based on D) and the 53 worst items (based on rBis) were selected and scrutinized. Of the 26 best items, 16 SA-MC and 10 NS-MC, the only commonality found was that all except one of the items had well-focused stems asking a direct question. Only one of the ten NS-MC items in this group used the PFS approach criticized above. For 50 of the 53 worst items the most probable problem could be identified. These are listed below:

                                               Number of     Number of
Problem                                        SA-MC items   NS-MC items
Poorly focused stem                                 0            14
No one missed the item                              7             9
Student confusion between 2 options                 5             3
Item content appeared too difficult
  or inappropriate                                  6             3
Stem wording was confusing                          0             3
Problem could not be identified                     2             1
                                                   20            33

This exploratory analysis of the items used supports the idea that heavy use of the PFS approach may have a negative effect on discrimination. Only one of the 10 best negatively oriented items had stems of this type, while 14 of the 33 worst NS-MC items had stems of this type.

TABLE 5.3

SA-MC VS. NS-MC COMPARISON, GROUPED FREQUENCY DISTRIBUTION BASED ON D AND rBIS

                          Discrimination Index
                       D                      rBis
Range of       Number of  Number of   Number of  Number of
Values         SA-MC      NS-MC       SA-MC      NS-MC
               Items      Items       Items      Items
.00 and below     28         47          20         33
.01 - .10          5         10           8         10
.11 - .20         29         24          15         18
.21 - .30         28         33          19         24
.31 - .40         21          8          14         13
.41 - .50         16         11          16         16
.51 - 1.00        16         10          51         29
                 143        143         143        143

In summary, the results of the SA-MC vs. NS-MC comparison indicated that the SA-MC items are more discriminating than the NS-MC items but not significantly less difficult. Two factors were suggested which might contribute to the lower average discrimination of the NS-MC type: 1) the task of choosing an effective wrong answer to be the keyed correct response may be more difficult than choosing an effective correct answer, and 2) many NS-MC items use poorly focused stems (PFS).
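For readers who wish to reproduce indices of this kind, the three statistics used throughout the study (p, D, and rBis) can be computed directly from scored responses. The sketch below is illustrative only: the function name, the 27% upper-lower split for D, and the use of the normal-ordinate form of the biserial formula are assumptions supplied here, not details taken from the study.

```python
import statistics
from math import exp, sqrt, pi

def item_stats(scores, correct, frac=0.27):
    """Estimate the three item-quality indices for one item.

    scores  -- each examinee's total test score
    correct -- 1 if the examinee answered this item correctly, else 0
    frac    -- fraction used for the upper and lower groups in D
               (27% is a common convention; the study does not state
               the split it used, so this default is an assumption)
    """
    n = len(scores)
    p = sum(correct) / n                      # proportion answering correctly

    # D: proportion correct in the upper group minus the lower group
    order = sorted(range(n), key=lambda i: scores[i], reverse=True)
    k = max(1, round(frac * n))
    D = (sum(correct[i] for i in order[:k]) -
         sum(correct[i] for i in order[-k:])) / k

    # rBis: biserial item-total correlation,
    # r = (M_right - M_all) / SD * p / y, where y is the ordinate of
    # the standard normal density at the p-th quantile
    m_all = statistics.fmean(scores)
    sd = statistics.pstdev(scores)
    m_right = statistics.fmean(s for s, c in zip(scores, correct) if c)
    z = statistics.NormalDist().inv_cdf(p)
    y = exp(-z * z / 2) / sqrt(2 * pi)
    r_bis = (m_right - m_all) / sd * p / y
    return p, D, r_bis
```

Note that, unlike the point-biserial coefficient, the biserial estimate can exceed 1.0 when the score distribution departs markedly from normality, which is one reason the two discrimination indices in Table 5.3 have different distributions.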
Content Orientation

All of the comparisons in this study were analyzed using two-way repeated measures analysis of variance, with item type as one independent variable and content orientation as the other. The results of these analyses, in Tables 4.2 - 4.5, showed no significant interactions between content orientation and any item type, for either difficulty or discrimination. The results of these analyses also indicated that on the average the clinical science items were more difficult than the focal problem items, with this difference being statistically significant within two of the four comparisons, SA-MC vs. MA-MC and SA-MC vs. NS-MC.

This result is consistent with the differing grading method used, at the time of this study, within the focal problem curriculum versus that used in the three clinical clerkships. Within both content orientations a pass-fail grading system was used. However, different methods were used to determine the cutoff score for passing. The focal problem exams were criterion-referenced, using an absolute cutoff score for passing, which varied in different focal problems from 68-72 percent correct. The clinical science exams were norm-referenced; the passing score for each exam was dependent on the performance of the group tested. The cutoff point was generally set at 1.5 - 2.0 standard deviations below the mean score. Since the focal problem exams used cutoff scores of approximately 70% correct, the average difficulty value would be expected to be above .70. The actual range for the mean difficulties in the four comparisons in this study was .70 - .78. Since the clinical science exams were norm-referenced, the average difficulty would be expected to be closer to the ideal level of difficulty for maximizing discrimination. When four or five-option items are used this level would be between .625 and .667. The actual range for the mean difficulties in the four comparisons was .61 - .70.
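The two cutoff-setting methods described above can be expressed directly. The following is a hypothetical sketch of the arithmetic, not the College's actual procedure; the function names and the 1.5 SD default are assumptions (the clerkships used anywhere from 1.5 to 2.0 standard deviations below the mean).

```python
import statistics

def criterion_referenced_cutoff(max_score, pct_correct=0.70):
    """Absolute passing score, as on the focal problem exams
    (the actual cutoffs varied from 68 to 72 percent correct)."""
    return pct_correct * max_score

def norm_referenced_cutoff(scores, sd_below=1.5):
    """Passing score set relative to the group tested, as on the
    clinical science exams (mean minus 1.5 - 2.0 SD)."""
    return statistics.fmean(scores) - sd_below * statistics.pstdev(scores)
```

Because the norm-referenced cutoff floats with the group, item writers under that system are free to aim for harder items nearer the difficulty level that maximizes discrimination, which is consistent with the lower mean difficulties observed on the clinical science exams.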
The results relating to item discrimination indicated that in seven out of eight comparisons there was no significant difference between focal problem and clinical science items. The one significant difference was within the SA-MC vs. NS-MC comparison and showed that the focal problem items had a significantly higher value for discrimination, based on the biserial coefficient. This result does not seem to have any practical significance, since it was an isolated result and not even consistent with the result for the other discrimination index within the same comparison.

General Discussion and Summary

The majority of the discussion thus far has concentrated on the negative aspects of the MA-MC, US-MC, US-MA-MC and NS-MC item types. At this point, a brief statement needs to be made about the positive aspects of the SA-MC item type. The major advantage that the SA-MC item type appears to have is that it uses the most direct approach to asking questions. In this type of item the instructor (item writer) is asking only one question, for which he is requesting only one best answer. The stem and alternatives are separate parts of the item; before seeing the options the student knows exactly what is required and could (if he has sufficient knowledge) answer the item without ever seeing the options. None of the other item types studied have these characteristics. The multiple-answer types have varying numbers of correct alternatives and cannot be answered by reading the stem alone, even if the stem asks a specific question, because the student does not know how many answers he needs. Items of the NS-MC type also have more than one "right" answer. The students cannot directly answer the question posed in the stem because the keyed correct response is always a "wrong" answer. A further asset of the single-answer format is that when instructors use this format they more frequently use complete, well-focused stems.
Recall from Table 1.1 that of all single-answer items only about 10% had poorly focused stems (US-MC items), while 25% of the multiple-answer items had this problem (US-MA-MC items). Also, as discussed earlier, approximately 50% of the negatively oriented items used general stems which lacked clear focus (PFS items). Clearly, the SA-MC item type has strengths which help to account for the very high average values for discrimination associated with this item type throughout the study. Part of the difference, at least for discrimination, must be attributed to these strengths as well as to the very real weaknesses of the other item types.

To summarize the discussion section, several key points are outlined below:

1) SA-MC items were significantly less difficult and more discriminating than MA-MC items. The MA-MC format is inherently more complicated, which may have contributed to differences. Also the possibility exists that instructors are less able to use the format correctly and effectively. This results in items which have confusing wording in the stem and/or include inappropriate options, which can have a negative effect on item statistics.

2) SA-MC items were significantly more difficult than US-MC items. The two explanations offered were: a) that many items using this item form are testing trivial content so are not as difficult, and b) that many items of this type have very heterogeneous alternatives, which may result in lower average difficulties.

3) SA-MC items were significantly more discriminating than the NS-MC items. Many of the NS-MC items used very general item stems, which seemed to result in a higher number of very poor items (based on D and rBis) and a lower number of very good items. This probably contributed to the lower average discrimination values for the NS-MC item type.
4) Items on the criterion-referenced focal problem exams had significantly lower average difficulty when compared to items on the norm-referenced clinical clerkship exams.

CONCLUSIONS

Following are the conclusions of this study:

1. Item format affected both average difficulty and discrimination.
   a. Items in the single-answer multiple-choice (SA-MC) format had higher average values for discrimination than items in the multiple-answer multiple-choice (MA-MC) format.
   b. The SA-MC items were less difficult than the MA-MC items.

2. Stem orientation, positive versus negative, affected item discrimination but not difficulty. Items with positive stems were more discriminating than those with negative stems.

3. Informativeness of the item stem had some effect on item difficulty but no statistically significant effect on discrimination. Within the single-answer format, items with complete informative stems (SA-MC) were more difficult than items with incomplete stems (US-MC). Within the multiple-answer format there were no differences.

SUGGESTIONS FOR FUTURE RESEARCH

This study was to a certain extent exploratory, attempting first to compare the MA and SA formats within medical education, then to point out areas of difference which might give direction to future research. This study showed large differences in difficulty and discrimination between the SA-MC and MA-MC items and offered some possible explanations. This should lead to additional research efforts in this area in the future. It would be premature to draw major conclusions or make broad generalizations on the basis of this one study. It must be considered one link in a whole chain of related research. It is to be hoped that this study will prompt future research efforts in the study of varying item types within medical education. One direction this research might take would be replication under more thoroughly controlled conditions.
It would be helpful to refine the techniques used to select item pairs, increasing the certainty that both items in each pair really were intended to measure the same knowledge or skill. It should also be possible to refine the techniques for choosing well-written MA-MC items for future studies. Since this format has been studied much less frequently than the single-answer format, less was known concerning what kinds of item writing errors might be common within it. This study has added to that body of knowledge, so that future studies can have increased precision.

This study was exploratory in another way as well, trying to determine whether differences in item statistics similar to those found in studies using professionally written or recast items would also be found when independently written teacher-made items were used to compare item types. Some of the results found were similar, but there were enough instances of difference to help stimulate interest in further exploration of the performance of selected item types when all items are written by classroom instructors.

Throughout the results and discussion of this study, one dimension of the test item seemed to come repeatedly to the forefront: the item idea upon which the item is based. In the present study uninformative stems were chosen as the focus for two comparisons. This focus is once removed from what seems now to be the more appropriate focus on the item idea. The question would not be, was the stem informative, but, can a clear singular item idea be identified as having been the basis for both the stem and options? Research exploring this important area would not be easy to design or carry out, but may have great potential in the ongoing search for better understanding of the principles underlying the writing of valid and reliable test items.
Finally, it must be noted that although the MA-MC format had poor item statistics in comparison to the SA-MC format, this type can still have a valid place in the overall testing process in medical education. Since most physicians obtain licensure by use of the National Board of Medical Examiners exams, which contain this item format, medical students need practice with its use. What item writers can gain from this study is the caution that this format can perform poorly, along with some specific suggestions concerning possible problems with the format (e.g. confusing wording in the stems and use of options with inappropriate content) that might be avoided with special care.

APPENDICES

APPENDIX A

SOURCE OF ITEM PAIRS - BREAKDOWN BY EXAM OF ORIGIN

                               Number of Item Pairs per Comparison
Content Area -      No. of    SA-MC vs.  SA-MC vs.  MA-MC vs.   SA-MC vs.
Exam Dates          Students  MA-MC      US-MC      US-MA-MC    NS-MC

Altered Consciousness
  Spring '78           28         4          0          6           4
  Spring '77           47         3          1          1           1
  Spring '76           40         4          1          2           1
  Winter '76                      0          0          1           1
  Spring '75                      1          1          1           1
                                 12          3         11           8

Diarrhea
  Winter '78           33         2          3          3           1
  Winter '77           36         9          0          4           9
  Spring '75           44         1          2          3           4
  Fall '74             19         1          0          4           1
                                 13          5         14          15

Anemia
  Winter '78           29         3          3          2           7
  Winter '77           43         0          0          0           4
  Winter '76           40         2          2          3           1
  Winter '75           39         0          0          0           5
  Winter '74           16         2          2          2           2
                                  7          7          7          19

Jaundice
  Winter '78           21         8          3          6           3
  Winter '77           36         5          2          1           3
  Winter '76           40         0          1          3           4
                                 13          6         10          10

Chest Pain
  Fall '77             29         1          1          0           1
  Fall '76             38         0          1          0           0
  Spring '75           34         2          1          4           0
                                  3          3          4           1

Elevated BUN
  Winter '78           34         8          5          4           9
  Winter '77           39         4          4          0           8
  Spring '76           32         5          0          4           1
                                 17          9          8          18

Surgery
  Spring '78           23         8          3          3           7
  Fall '77             31         4          2          0           4
  Fall '76             13         8          1          0           3
                                 20          6          3          14
APPENDIX A (Continued)

SOURCE OF ITEM PAIRS - BREAKDOWN BY EXAM OF ORIGIN

                               Number of Item Pairs per Comparison
Content Area -      No. of    SA-MC vs.  SA-MC vs.  MA-MC vs.   SA-MC vs.
Exam Dates          Students  MA-MC      US-MC      US-MA-MC    NS-MC

Pediatrics
  Spring '78           35         1          1          1           2
  Winter '78           41         0          0          0           2
  Fall '77             34         1          0          0           1
  Winter '77           53         1          1          1           2
  Fall '76             34         3          0          0           1
  Spring '76           32         2          0          2           0
  Winter '76           40         2          0          0           2
  Fall '75             26         2          0          1           2
  Spring '75           18         3          1          0           0
  Winter '75           36         0          0          0           3
  Fall '74             30         2          1          0           4
  Spring '74           23         0          0          0           2
  Winter '74           35         0          3          0           4
  Fall '73             39         0          2          0           1
                                 17          9          5          26

Internal Medicine
  Spring '78           35         0          0          0           3
  Winter '78           41         1          0          0           0
  Fall '77             34         1          0          0           4
  Winter '77           53         3          1          2           5
  Fall '76             34         5          0          0           0
  Spring '76           32         1          0          0           6
  Winter '76           40         0          0          0           2
  Fall '75             26         2          1          0           1
  Spring '75           18         3          0          2           2
  Winter '75           36         1          0          1           1
  Fall '74             30         4          1          1           3
  Spring '74           23         0          0          0           1
  Winter '74           35         0          0          1           3
  Fall '73             39         1          0          0           1
                                 22          3          7          32

APPENDIX B

SAMPLE ITEM PAIRS

SA-MC vs. MA-MC COMPARISON

Clinical Science

1. A 32 year old female with intermittent diarrhea, constipation, and abdominal cramps of 18 months duration has a GI series of x-rays which reveal an abnormal terminal ileum and cecum with narrowing of the lumen and edema of the bowel wall. Tissue removed at surgery reveals inflammation in the mucosa, muscularis, and subserosa. The findings are most characteristic of
   1. ulcerative colitis
   2. amebic dysentery
   3. regional enteritis
   4. diverticulosis
   5. non-tropical sprue

   Diarrhea can be contracted via
   A. contaminated food
   B. contaminated water
   C. respiratory tract infection
   D. infected animals

2. The treatment of choice in functional diarrhea (irritable bowel syndrome) is
   1. cholestyramine
   2. diphenoxylate
   3. sodium carboxymethyl cellulose
   4. morphine
   5. propantheline

   Colonic motility in the irritable bowel syndrome differs from normal in that there is an increased responsiveness to
   A. cholecystokinin (endogenous or exogenous)
   B. parasympathomimetic drugs
   C. meals
   D. cyclic AMP

3. The drug of choice for treating status epilepticus is
   1. diphenylhydantoin
   2. succinylcholine
   3. oxazepam
   4. thiopental
   5. diazepam
   Types of vascular headache include
   A. migraine headache
   B. hypertensive headache
   C. cluster headache
   D. headache caused by fever

4. The most sensitive test for detection of the hepatitis B surface antigen (HBsAg) is
   1. counterimmunoelectrophoresis (CIEP)
   2. gel diffusion
   3. hemagglutination inhibition
   4. radioimmunoassay (RIA)
   5. complement fixation

   Chronic active hepatitis is more likely if bridging necrosis is found. Other cues to this complication are
   A. elevated SGOT and bilirubin at 2 months after acute symptoms
   B. persistent hepatomegaly
   C. the patient feels well
   D. the HBsAg remains positive at 4 months

5. The ability of the neonatal liver to handle bilirubin may be enhanced by treatment of either the pregnant mother or the neonate with
   1. salicylates
   2. enterovioform
   3. sulfonamides
   4. phenobarbital
   5. interferon

   If serial studies of amniotic bile pigments demonstrate a severely affected fetus, which of the following should be tried to prevent its death from erythroblastosis fetalis
   A. induce premature labor if at least 34 weeks old
   B. inject the uterus with phenobarbital
   C. attempt intrauterine transfusion if infant is too immature for delivery
   D. begin phototherapy prior to delivery

6. A disease, generally idiopathic, which is characterized by widespread crescent formation throughout all glomeruli, a poor prognosis and increased renal failure leading to uremia is
   1. acute pyelonephritis
   2. membranous glomerulonephritis
   3. rapidly progressive glomerulonephritis
   4. necrotizing papillitis
   5. proliferative glomerulonephritis

   Complications of acute pyelonephritis include
   A. renal carbuncle
   B. renal papillary necrosis
   C. perinephric abscess
   D. chronic pyelonephritis

7. Your patient is a 22-year-old who complains of a unilateral throbbing headache that has awakened him from sleep for the past 3 nights. The pain is accompanied by lacrimation, and intense rhinorrhea.
The entire episode subsides spontaneously in approximately 40 minutes. Your diagnosis is
   1. tension headache
   2. classic migraine headache
   3. depression headache
   4. cluster headache
   5. headache caused by berry aneurysm

   In a patient with major motor seizures, status epilepticus may be precipitated by
   A. abrupt withdrawal of anticonvulsant medication
   B. brain tumor
   C. brain injury
   D. hypoglycemia

8. A 36 year old man with recurrent episodes of abdominal pain due to peptic ulcer disease is admitted because of weakness progressive for 3 months. He had an episode of black stools 4 months previously during one of his bouts of active peptic ulcer symptoms. History is negative for other symptoms. Physical examination reveals a pulse rate of 70, blood pressure of 120/65, pallor of conjunctiva and mucous membranes. The remainder of the physical examination is normal.
      RBC 3.2 mil/mm3
      Hgb 6.2 gm/100 ml
      Hct 24%
      Retic Count 1.7%
   Calculation of the red cell indices suggests that the anemia is
   1. normocytic, normochromic
   2. microcytic, hypochromic
   3. macrocytic, hypochromic
   4. macrocytic, normochromic
   5. microcytic, normochromic

   A 70 year old man is admitted to the hospital because of increasing fatigue and exertional shortness of breath which has been developing over a period of several months. He is found to have macrocytosis on the peripheral blood smear and the bone marrow is megaloblastic in type. Causes of this clinical picture would include
   A. folic acid deficiency
   B. chronic liver disease
   C. vitamin B-12 deficiency
   D. iron deficiency

9. The most consistent indication of an infiltrative liver lesion such as carcinoma or granuloma may be
   1. elevated serum alkaline phosphatase activity
   2. increased LDH activity
   3. greatly increased SGOT activity
   4. hyperbilirubinemia
   5. reticulocytosis

   The clinical symptoms of hepatitis are in some ways similar to and often confused with
   A. infectious mononucleosis
   B. herpes simplex infections
   C. toxoplasmosis
   D. idiosyncratic drug reactions

10. A 30 year-old woman is admitted to the hospital with a history of diarrhea for over 10 years. This is characterized by 8-10 bulky, greasy, foul smelling stools/24 hrs. with 2-3 at night. Her appetite is good yet she had lost 15 lbs; her weight has been stable for several years. She complains of bloating, mild abdominal cramping and a good deal of flatulence. For 3 months she has had persistent mid-back pain. She considers herself "nervous". She is 5' tall and weighs 98 lbs. Her tongue is rather smooth, her abdomen moderately distended and tympanic and there is tenderness over D 2. The P.E. is otherwise normal. The Hb is 10 G/dl, the Hct 36 and RBC 4x10^6/mm3. In writing admission orders one of the earliest diagnostic goals would be to determine if she has
    1. abnormal stool flora
    2. gastric hypersecretion
    3. parasitic infestation
    4. steatorrhea
    5. blood in her stool

    Large volume (stool weight > 300 G/day) diarrhea is seen in
    A. cholera
    B. irritable bowel syndrome
    C. celiac sprue
    D. Crohn's disease of the colon

11. With anemia of chronic disease the plasma iron (Fe) and total iron binding capacity (T.I.B.C.) would most likely show which of the following
    1. normal Fe, increased T.I.B.C.
    2. normal Fe, normal T.I.B.C.
    3. normal Fe, decreased T.I.B.C.
    4. decreased Fe, increased T.I.B.C.
    5. decreased Fe, decreased T.I.B.C.

    Findings in pernicious anemia include
    A. weakness
    B. macrocytic erythrocytes
    C. sore tongue
    D. numbness and tingling in the extremities

12. Diffuse, symmetrical slowing of the EEG would most likely be associated with
    1. a hemorrhage
    2. an infarction
    3. a supratentorial tumor
    4. an abscess
    5. a metabolic disorder

    Pneumoencephalography is used for viewing the
    A. cerebral arteries
    B. posterior fossa contents and cisterns
    C. cerebral lymphatics
    D. ventricles

Pharmacology
Which of the following lists gives the correct ranking of the barbiturates in order of increasing duration of action
1. thiopental, phenobarbital, pentobarbital
2. pentobarbital, thiopental, phenobarbital
3. pentobarbital, phenobarbital, thiopental
4. thiopental, pentobarbital, phenobarbital
5. phenobarbital, pentobarbital, thiopental

The major anticonvulsant drugs are believed to work by
A. suppressing abnormally discharging foci
B. carbonic anhydrase inhibition
C. reducing excitability of circuit neurons
D. alterations in the acid-base balance

One of the important mechanisms of action of nitroglycerin which results in the relief of anginal pain is
1. a direct depressant effect on myocardial metabolism causing a reduced myocardial oxygen consumption
2. a direct effect on hemoglobin oxygen binding resulting in an increase in release of oxygen from the red blood cells in the myocardial capillary network
3. an increase in cardiac output resulting in an increase in coronary blood flow
4. a direct effect on the cardiac pacemaker resulting in slowing of the heart rate
5. a drop in systemic blood pressure resulting in a decrease in cardiac work

Pharmacologic effects of nitroglycerin include
A. peripheral venous vasoconstriction
B. smooth muscle relaxation
C. increase in myocardial oxygen consumption
D. peripheral arterial dilatation

Mrs. Johnson is a 40-year-old woman in renal failure. She has an extra-renal infection caused by organisms susceptible to all of the tetracyclines. Which of the following would be the best agent to use
1. chlortetracycline
2. oxytetracycline
3. doxycycline
4. minocycline
5. demeclocycline

Which of the following, in "fixed-dose" combination preparations, have a place in the approved management of infection
A. penicillin G - streptomycin
B. penicillin G - chlortetracycline
C. kanamycin - methicillin
D.
trimethoprim - sulfamethoxazole

An antibiotic has the following characteristics: bactericidal; resistance is infrequent; activity is sharply limited to gram negative bacteria; surface active agent leads to disorientation of the lipoprotein membrane. This antibiotic is
1. penicillin
2. gentamicin
3. tetracycline
4. polymyxin B
5. cephalosporin

Penicillin G in repository form can safely be given by which route(s)
A. sub q
B. intrathecal
C. IV
D. IM

To prolong the absorption time, repository penicillin is administered
1. intramuscularly
2. intravenously
3. intrathecally
4. orally
5. subcutaneously

A patient with a urinary tract infection and known sensitivity to penicillin could be treated with
A. ampicillin
B. methicillin
C. cephalexin
D. lincomycin

When the urine is acid, the clearance of a drug is found to be less than the rate of glomerular filtration. However, when the urine is alkaline, the drug clearance is greater than the rate of glomerular filtration. The drug is
1. a strong organic base
2. a weak organic base
3. a strong organic acid
4. a weak organic acid
5. a nonelectrolyte

The rapid renal clearance of a drug is favored if the drug
A. has low solubility in water
B. reduces renal blood flow
C. has a high degree of binding to plasma protein
D. has low solubility in lipid

Which of the following drugs finds its major usefulness in petit mal epilepsy
1. diphenylhydantoin (Dilantin)
2. phenobarbital (Luminal)
3. primidone (Mysoline)
4. trimethadione (Tridione)
5. phenacemide (Phenurone)

Which of the following would be indicated in status epilepticus
A. morphine
B. ethosuximide (Zarontin)
C. succinylcholine (Anectine)
D. diazepam (Valium)

The incidence of hypersensitivity reactions to cephalosporins is higher in patients who have shown allergic manifestations following the administration of
1. gentamicin
2. penicillin
3. polymyxin
4. sulfonamide derivatives
5.
tetracycline

Which of the following are stable in gastric acid and undergo good absorption after oral administration
A. penicillin G
B. oxacillin
C. methicillin
D. ampicillin

Which of the following parasympathetically innervated functions is most sensitive to low doses of atropine
1. salivary secretions
2. vagal effects on the heart
3. micturition
4. gastric secretion
5. accommodation of the eye

Which of the following are antimuscarinic agents
A. scopolamine
B. atropine
C. propantheline (Probanthine)
D. morphine

10. The antibiotic of choice for treating salmonella is
1. carbenicillin
2. erythromycin
3. tetracycline
4. ampicillin
5. oxacillin

11. Which of the following might be useful for treating a Pseudomonas infection
A. Polymyxin B
B. Gentamicin
C. Carbenicillin
D. Tetracycline

The most important and serious side effect of the use of gentamicin is
1. elevation of blood urea nitrogen
2. overgrowth of Candida on oral administration
3. enterocolitis
4. cardiac arrhythmias
5. ototoxicity

Gentamicin is most important in the treatment of serious infections, including those caused by
A. Pseudomonas aeruginosa
B. Enterobacter
C. Klebsiella
D. Strep pneumoniae

Cardiology - Pediatrics

1. A continuous murmur (with diastolic accentuation) is heard over the primary aortic area in a 3-year-old boy. The murmur is obliterated by compression of the neck veins on the right side and by assumption of the supine position. The most likely diagnosis is
1. venous hum
2. patent ductus arteriosus
3. ventricular septal defect with prolapsed aortic valve cusp
4. aortic aneurysm
5. aortic stenosis and insufficiency

Functional or "innocent" heart murmurs occur frequently in children. Appropriate management should include which of the following
A. routine radiographs and ECG's on all children with vibratory murmurs
B. refer for further cardiac evaluation and possible cardiac catheterization if the murmur becomes louder when the patient has a fever
C. antibiotic prophylaxis before dental work
D. emphasize to parents that no restriction of activity is needed for a child with a functional murmur

A blood pressure cuff that is too small gives
1. false low readings
2. false high readings
3. slightly lower readings than usual
4. markedly lower readings than usual
5. accurate readings

A 12-year-old boy, on routine physical examination in the office, shows a blood pressure of 140/90. Indicated steps in his work-up and management should include
A. checking the pressure in his other arm
B. checking the pressure in his leg
C. evaluating the size of his arm relative to the size of the blood pressure cuff
D. rechecking the pressure after 15 minutes of quiet rest

An infant with early cyanosis, progressive cardiac enlargement and pulmonary plethora should be suspected of having
1. patent ductus arteriosus
2. coarctation of the aorta
3. vascular ring
4. complete transposition of the great arteries
5. Ebstein's anomaly

Cardiac failure in the first few weeks of life may be due to
A. coarctation of the aorta
B. paroxysmal atrial tachycardia
C. transposition of the great vessels
D. cerebral arteriovenous fistula

Cardiology - Internal Medicine

1. A 50-year-old male arrives in the emergency department with a history of dyspnea for 4 hours. He has a history of recent onset of angina. Exam: P 130 reg., resp. 24, T 99°, BP 190/100. Lungs - diffuse inspiratory and expiratory wheezing. Cardiovascular exam - JVP greater than 10 cm., S3 gallop, and paradoxically split S2. Extremities - no edema or phlebitis.
The most likely diagnosis is
1. acute asthma attack
2. pneumococcal pneumonia
3. acute pulmonary edema
4. pulmonary emboli
5. pneumothorax

Which of the following is/are helpful in distinguishing pulmonary embolism from pulmonary infarction
A. blood gases
B. chest X-ray
C. lung scan
D. presence or absence of hemoptysis

Paracentesis abdominis is most likely to be of therapeutic benefit in a patient with peritoneal fluid due to
1. tuberculous peritonitis
2. nephrotic syndrome
3. systemic lupus erythematosus
4. hepatic cirrhosis
5. congestive cardiac failure

The therapeutic removal of large volumes of ascitic fluid by paracentesis may be complicated by
A. ptosis of the abdominal viscera
B. circulatory collapse
C. acute gastric dilatation
D. plasma albumin depletion

The aim of the initial therapy in acute pulmonary edema due to left ventricular failure is to
1. slow the heart rate
2. allay anxiety
3. improve left ventricular contractility
4. decrease pulmonary blood volume
5. remove the excess fluid from dependent parts

Myxedema is frequently associated with
A. increased cardiac size, as seen in the chest x-rays
B. hoarseness of voice
C. bradycardia
D. pretibial accumulations of subcutaneous myxedema

The most frequent mechanism of cardiac arrest in the hospitalized patient with an acute myocardial infarction is
1. ventricular fibrillation
2. asystole
3. electro-mechanical dissociation
4. cardiac rupture
5. atrial fibrillation

Morphine sulfate is often used to relieve pain in patients with acute myocardial infarction. Side effects that should be of concern in this situation are
A. bradycardia due to vagotonic action
B. hypotension primarily related to venous dilatation and pooling of blood
C. respiratory depression via direct action on the medulla
D. diarrhea

BIBLIOGRAPHY

Albanese, M.A.; Kent, T.H.; Whitney, D.R.
A comparison of the difficulty, reliability, and validity of complex multiple choice, multiple response, and multiple true-false items. Paper presented at the Annual Meeting of the American Association of Medical Colleges, Washington, D.C., November 1977.

Board, C.; Whitney, D.R. The effect of selected poor item-writing practices on test difficulty, reliability and validity. Journal of Educational Measurement, 1972, 9, 225-233.

Boynton, M. "None of these" makes spelling items more difficult. Educational and Psychological Measurement, 1950, 10, 431-432.

Burmester, M.A.; Olson, L.A. Comparison of item statistics for items in multiple-choice and in alternative-response form. Science Education, 1966, 50, 467-470.

Campbell, D.T.; Stanley, J.C. Experimental and Quasi-Experimental Designs for Research. Chicago: Rand McNally College Publishing Company, 1963.

Cronbach, L.J. An experimental comparison of the multiple true-false and multiple multiple-choice tests. Journal of Educational Psychology, 1941, 32, 533-543.

Cronbach, L.J. Note on the multiple true-false test exercise. Journal of Educational Psychology, 1939, 30, 628-631.

Dryden, R.E.; Frisbie, D.A. Comparative reliabilities and validities of multiple choice and complex multiple choice nursing education tests. Paper presented at the Annual Meeting of the National Council on Measurement in Education, Washington, D.C., April 1975.

Dudycha, A.L.; Carpenter, J.B. Effects of item format on item discrimination and difficulty. Journal of Applied Psychology, 1973, 58, 116-121.

Dunn, T.F.; Goldstein, L.G. Test difficulty, validity, and reliability as functions of selected multiple-choice item construction principles. Educational and Psychological Measurement, 1959, 19, 171-179.

Ebel, R.L. Essentials of Educational Measurement. Englewood Cliffs, N.J.: Prentice-Hall, Inc., 1979.

Ebel, R.L. The ineffectiveness of multiple true-false test items. Educational and Psychological Measurement, 1978, 38, 37-44.

Ebel, R.L. Writing the test item. In E.F. Lindquist (Ed.)
Educational Measurement. Washington, D.C.: American Council on Education, 1951, pp. 185-249.

Eurich, A.C. Four types of examinations compared. Journal of Educational Psychology, 1939, 30, 268-278.

Flanagan, J.C. General considerations in the selection of test items and a short method of estimating the product-moment coefficient from data at the tails of the distribution. Journal of Educational Psychology, 1939, 30, 674-680.

Frisbie, D.A. Multiple-choice versus true-false: A comparison of reliabilities and concurrent validities. Journal of Educational Measurement, 1973, 10, 297-304.

Frisbie, D.A. The effect of item format on reliability and validity: A study of multiple-choice and true-false achievement tests. Educational and Psychological Measurement, 1974, 34, 885-892.

Harris, R.J. A Primer of Multivariate Statistics. New York: Academic Press, 1975.

Hughes, H.H.; Trimble, W.E. The use of complex alternatives in multiple-choice items. Educational and Psychological Measurement, 1965, 25, 117-126.

Magnusson, D. Test Theory. Reading, Massachusetts: Addison-Wesley Publishing Company, 1967.

McMorris, R.F.; Brown, J.A.; Snyder, G.W.; Pruzek, R.M. Effects of violating item construction principles. Journal of Educational Measurement, 1972, 9, 287-295.

Mehrens, W.A.; Lehmann, I.J. Measurement and Evaluation in Education and Psychology, Second Edition. New York: Holt, Rinehart and Winston, Inc., 1978.

Mueller, D.J. An assessment of the effectiveness of complex alternatives in multiple-choice achievement test items. Educational and Psychological Measurement, 1975, 35, 135-141.

Oosterhof, A.C.; Glasnapp, D.R. Comparative reliabilities of the multiple-choice and true-false formats. Paper presented at the Annual Meeting of the American Educational Research Association, Chicago, Illinois, April 1972.

Pyrczak, F. Validity of the discrimination index as a measure of item quality. Journal of Educational Measurement, 1973, 10, 227-231.

Rimland, B.
The effects of varying time limits and of using "right answer not given" in experimental forms of the U.S. Navy arithmetic test. Educational and Psychological Measurement, 1960, 20, 533-538.

Ruch, G.M.; Stoddard, G.D. The comparative reliabilities of five types of objective examinations. Journal of Educational Psychology, 1925, 16, 89-103.

Schmeiser, C.B.; Whitney, D.R. Effect of two selected item-writing practices on test difficulty, discrimination and reliability. Journal of Experimental Education, 1975, 43, 30-34.

Skakun, E.N.; Nanson, E.M.; Taylor, W.C.; Kling, S. An investigation of three types of multiple choice questions. Paper presented at the Annual Meeting of the American Association of Medical Colleges, Washington, D.C., November 1977.

Terranova, C. The effects of negative stems in multiple-choice test items. (Doctoral dissertation, State University of New York at Buffalo) Ann Arbor, Michigan: University Microfilms, 1969, No. 69-20951.

Wason, P. Response to affirmative and negative binary statements. British Journal of Psychology, 1961, 52, 133-142.

Wesman, A.G. Writing the test item. In R.L. Thorndike (Ed.) Educational Measurement, Second Edition. Washington, D.C.: American Council on Education, 1971.

Wesman, A.G.; Bennett, G.K. The use of "None of these" as an option in test construction. Journal of Educational Psychology, 1946, 37, 541-549.

Williams, B.J.; Ebel, R.L. The effect of varying the number of alternatives per item on multiple-choice vocabulary test items. Fourteenth Yearbook, National Council on Measurements Used in Education. Princeton, N.J., 1957, 63-65.

Williamson, M.L.; Hopkins, K.D. The use of "None-of-these" versus homogeneous alternatives on multiple-choice tests: experimental reliability and validity comparisons. Journal of Educational Measurement, 1967, 4, 53-58.

Zern, D. Effects of variations in question phrasing on true-false answers by grade school children. Psychological Reports, 1967, 20, 527-533.