"J cgfi ‘LIBRARY Michigan State University This is to certify that the thesis entitled An Analysis of the Effects of Different Multiple- Choice Item Selection Strategies on the Reliability and Validity of Measures of Physician Competence in Specialty Certification presented by Steven M. Downing has been accepted towards fulfillment of the requirements for Ph. D. degree in Education Water f. EM Major professor Datquy 24, 1979 0-7639 OVERDUE FINES ARE 25¢ PER DAY . PER ITEM Return to book drop to remove this checkout from your record. 79 C) Copyright by Steven M. Downing 1979 AN ANALYSIS OF THE EFFECTS OF DIFFERENT MULTIPLE-CHOICE ITEM SELECTION STRATEGIES ON THE RELIABILITY AND VALIDITY OF MEASURES OF PHYSICIAN COMPETENCE IN SPECIALTY CERTIFICATION by Steven M. Downing A DISSERTATION Submitted to Michigan State University in partial fulfillment of the requirements for the degree of DOCTOR OF PHILOSOPHY Department of Counseling, Personnel Services, and Educational Psychology 1979 ABSTRACT AN ANALYSIS OF THE EFFECTS OF DIFFERENT MULTIPLE-CHOICE ITEM SELECTION STRATEGIES ON THE RELIABILITY AND VALIDITY OF MEASURES OF PHYSICIAN COMPETENCE IN SPECIALTY CERTIFICATION By Steven M. Downing This study investigated the effect of two multiple-choice item selection strategies on the discrimination of physician competence in Emergency Medicine and on several other psychometric character- istics of item subscales selected by the two different criteria. The research was carried out in the context of a field test of a library of examination materials intended for certification of special- ists in Emergency Medicine. The ninety-four subjects for this study represented four distinct groups: Residency-eligible and practice-eligible emergency physicians, second-year residents in Emergency Medicine, and fourth-year medical students. Physicians with Board eligibility represented a national stratified random sample of emergency physicians judged by their peers as very clinically skilled and, therefore, certifiable in Emer- gency Medicine. Residents represented a national stratified random sample of beginning second-year residents in Emergency Medicine with a wide range of competence. Fourth-year medical students were paid volunteers. TWO examination formats were investigated: 1) Objective--Single best-answer, four or five option multiple-choice and pictorial-stem Steven M. Downing multiple-choice items, and; 2) Simulated Clinical Encounters~-high1y structured, examiner-administered and rated patient-game simulations of typical emergency medical cases. Four 91-item subscales were selected from the 364 items of the Objective format. Two subscales were selected for an item-difficulty criterion. Two other subscales were selected for a relevance-to- clinical—medicine criterion, which was defined for this study as item point-biserial correlation with the grand mean rating on the indepen- dent criterion measure, Simulated Clinical Encounters. Hypotheses about criterion-group discrimination, criterion-re- lated validity, scale reliability, mean item difficulty, proportions of identical items selected for scales using different criteria, and differences in the distributions of pictorial-stem, clinical-situational, and factual multiple-choice items were tested. Residency-eligible (n=22), resident (n=36), and student (n=22) subject groups were used to test hypotheses for the following ninety-one item objective sub- scales: 1. Medium Difficulty: p-value .50 to .69 with a mean p-value of .63. 2. 
2. Low Difficulty: p-values .84 to .99, with a mean p-value of .90.

3. High Clinical-Relevance: item-criterion correlations of r = .33 to .68, with a median r = .38.

4. Low Clinical-Relevance: item-criterion correlations of r = -.23 to .11, with a median r = .05.

Discriminant analyses showed that high clinical-relevance was the best discriminator, approximately 6.7 times more effective than medium difficulty in statistically separating known groups, but this difference was not significant at α = .05. Both medium difficulty and high clinical-relevance were significantly more discriminating of known groups than low clinical-relevance. High clinical-relevance correctly classified 76.3 percent of subjects, while medium difficulty classified 71.2 percent correctly.

The high clinical-relevance scale had a significantly higher criterion-related validity coefficient (rxy = .90), was significantly more reliable (rxx = .95), and was significantly lower in mean item difficulty (mean p = .72) than the medium difficulty scale. There was no statistical difference in the proportion of overlap of identical items between the medium difficulty/high clinical-relevance and the medium difficulty/low clinical-relevance scales. There was also no significant difference in the distributions of pictorial-stem, clinical-situational, and factual multiple-choice items across the four scales. These results withstood a small cross-validation using the practice-eligible physician group (n=14), who were not considered in any of the subscale construction analyses.

It was concluded that the relevance of multiple-choice items to simulated clinical performance is important to the valid statistical discrimination of criterion-group performance.

DEDICATION

In memory of Florence Downing McClure, to whom I owe much for the person I am becoming.

ACKNOWLEDGMENTS

I am indebted to many people and institutions for their assistance in this project. First, I am deeply grateful to Barbara Frederickson, my significant other, for her love, support, and patient understanding of my preoccupation throughout the months of this research, and for her technical assistance in data analysis and proofreading.

I wish to express my appreciation to my dissertation committee for their assistance in this project. I am honored to have had the opportunity during my doctoral program to study with Dr. Robert L. Ebel, who taught me what I know about educational measurement and directed this dissertation. Dr. Jack L. Maatsch, who directs the American College of Emergency Physicians project in the Office of Medical Education Research and Development, "brought me along" as a medical education researcher and provided the ideal empirical climate for this study. Dr. Joe L. Byers, who taught me a great deal about educational research and computers, provided support and friendship throughout my doctoral program and offered much technical assistance for this study. Dr. Leroy A. Olson, who has assisted me many times with test scoring problems, supported this study and offered many helpful suggestions.

I also wish to thank my typist, Marlene Dodge, who showed endless patience with me and the countless drafts of this report. I also appreciate technical assistance for this study from Dr. Pamela W. Wilson, Mr. Douglas Barker, and Ms. Barbara Schachenman. Dr. Nelson H. Goud, Indiana University School of Education, encouraged my doctoral study in a very special way and supported me throughout with his friendship.
Dr. Martha R. Anderson helped me--through her special friendship, caring, and faith in me--to complete this degree program.

The American College of Emergency Physicians, the American Board of Emergency Medicine, the Office of Medical Education Research and Development at Michigan State University, and the National Center for Health Services Research (HS 02038) have all directly supported this research. I am grateful to many individuals within each of these organizations for their assistance during my four and one-half year association with the Emergency Medicine Examination.

East Lansing, Michigan
May 24, 1979
S.M.D.

TABLE OF CONTENTS

Chapter I. THE PROBLEM
  Introduction
  Historical Background of Certification Testing
  Need for the Study
  The Problem
  Research Hypotheses
  Summary
  Overview of the Dissertation

Chapter II. REVIEW OF THE LITERATURE
  Certification Examinations
  Examination Construction Models
  Summary

Chapter III. PROCEDURES AND DESIGN
  Introduction
  Sample of Subjects
  Examination Construction
  Design
  Hypotheses and Analysis Methods
  Summary

Chapter IV. RESULTS
  Introduction
  Item Selection for Four Subscales
  Statistical Analysis for Group Discrimination Hypotheses
  Results Concerning Differences in Discrimination: Medium Difficulty versus High Clinical-Relevance
  Results Concerning Differences in Discrimination: Medium Difficulty versus Low Clinical-Relevance
  Results Concerning Differences in Discrimination: High Clinical-Relevance versus Low Clinical-Relevance
  Results Concerning Criterion-Related Validity
  Results Concerning Internal-Consistency Reliability
  Results Concerning Mean Item Difficulties
  Results Concerning Overlapping Items in Subscales
  Results Concerning the Distribution of Item Types in Subscales
  Summary of Results for Tests of Hypotheses
  Results of Additional Analyses
  Summary Results of Additional Analyses

Chapter V. SUMMARY AND CONCLUSIONS
  Summary of Findings
  Conclusions
  Discussion
  Future Research

BIBLIOGRAPHY

LIST OF TABLES

3.1 Test Items Allocated to Medical Content Categories
3.2 Example Multiple-Choice Items
4.1 Medium Difficulty Items
4.2 Low Difficulty Items
4.3 High Clinical-Relevance Items
4.4 Low Clinical-Relevance Items
4.5 Raw-Score Group Discrimination: Medium Difficulty versus High Clinical-Relevance
4.6 Summary of Stepwise Discriminant Analysis: High Clinical-Relevance versus Medium Difficulty
4.7 Standardized Discriminant Function Coefficients: Medium Difficulty versus High Clinical-Relevance
4.8 Relative Discriminating Power of Medium Difficulty and High Clinical-Relevance Scales
4.9 Classification Analysis Using High Clinical-Relevance and Medium Difficulty Discriminant Functions
4.10 Summary of Stepwise Discriminant Analysis With Medium Difficulty Entered First
4.11 Raw-Score Group Discrimination: Medium Difficulty versus Low Clinical-Relevance
4.12 Summary of Stepwise Discriminant Analysis: Medium Difficulty versus Low Clinical-Relevance
4.13 Standardized Discriminant Function Coefficients: Medium Difficulty versus Low Clinical-Relevance
4.14 Relative Discriminating Power: Medium Difficulty versus Low Clinical-Relevance
4.15 Classification Analysis Using Medium Difficulty and Low Clinical-Relevance Discriminant Functions
4.16 Summary of Stepwise Discriminant Analysis with Low Clinical-Relevance Entered First
4.17 Raw-Score Group Discrimination: High Clinical-Relevance versus Low Clinical-Relevance
4.18 Summary of Stepwise Discriminant Analysis: High versus Low Clinical-Relevance
4.19 Standardized Discriminant Function Coefficients: High versus Low Clinical-Relevance
4.20 Relative Discriminating Power of High versus Low Clinical-Relevance Scales
4.21 Classification Analysis Using High and Low Clinical-Relevance Discriminant Functions
4.22 Summary of Stepwise Discriminant Analysis With Low Clinical-Relevance Entered First
4.23 Criterion-Related Validity Coefficients: Subscale Score Correlation With Mean Simulation Ratings
4.24 Internal-Consistency Reliability of Subscales
4.25 Mean Square Values for High Clinical-Relevance and Medium Difficulty Scales
4.26 Subscale Mean Item Difficulty
4.27 Repeated Measures ANOVA of Medium Difficulty, High and Low Clinical-Relevance Subscales
4.28 Overlap of Identical Items
4.29 Raw-Score Group Discrimination of Four Subscales
4.30 Standardized Discriminant Function Coefficients: Four Subscales
4.31 Classification Analysis Using Two Discriminant Functions Derived From Four Subscales
4.32 Subscale Zero-Order Correlations
4.33 Raw-Score Discrimination Using the Practice-Eligible Group: Four Subscales
4.34 Comparisons of Criterion-Related Validity Coefficients
4.35 Comparison of Internal-Consistency Reliability of Scales: Practice-Eligible and Previous Sample
4.36 Comparisons of Subscale Mean Item Difficulty: Practice-Eligible Group (n=14) versus Previous Group (n=80)
4.37 Repeated Measures ANOVA of Medium Difficulty, High and Low Clinical-Relevance Scales: Practice-Eligible Group

LIST OF FIGURES

3.1 Schematic of 91-Item Subscales to Investigate
4.1 Medium Difficulty Scale
4.2 High Clinical-Relevance Subscale
4.3 Low Clinical-Relevance Subscale
4.4 Low Difficulty Subscale
4.5 Observed Distributions of Item Types by Subscale
4.6 Confidence Intervals Around Differences in Proportions of Item Types by Subscales

CHAPTER I
THE PROBLEM

INTRODUCTION

"The competence of physicians to recognize, understand, and manage the problems of their patients is a very critical element in health care" (Senior, 1976).
Medical specialty boards, since the early years of this century, have attempted to measure and certify the competence of their candidates to practice in specialized areas of medicine (Hubbard, 1971). Yet Williamson (1976) states: "Finding evidence of the relation of certification results to actual clinical performance proved to be a difficult task." And also: "... the problem in improving validity of medical specialty certification procedures is serious."

During the past decade, great pressure has been brought on medical specialty certifying bodies to demonstrate the validity of their examination procedures to predict the competence of certified physicians to deliver health care (Williamson, 1976). This issue is so critical today that a conference was devoted to this topic by the American Board of Medical Specialties (Conference on Extending the Validity of Certification, 1976), and the American Board of Medical Specialties has a standing committee charged with studying the validity of evaluation procedures used by member boards to certify medical specialists.

While some measurement specialists may disagree that cognitive achievement examinations should predict performance (e.g., Ebel, 1961), it is clear that the medical specialty profession, governmental regulatory agencies, and medical consumer groups believe that the certification of a medical specialist should make a difference in his or her clinical performance (Williamson, 1976).

The problem of valid discriminations and predictions of clinical performance is, indeed, a thorny statistical and psychometric one. While great gains have been made in improving the psychometric quality of objective certifying examinations, few gains have been made in establishing valid criterion measures of physician clinical performance (Senior, 1976). Most specialists cannot even agree on a behavioral definition of competent medical practice, much less measure this elusive characteristic.

This study will pose some empirical questions about the relative power of objective-item subscales, selected by two different item selection methods, to validly discriminate criterion groups of subjects who were selected for their known skills in delivering health care in the specialty of Emergency Medicine. Other psychometric characteristics of these scales will be investigated, including a study of item-type contribution to scales selected by different methods.

HISTORICAL BACKGROUND OF CERTIFICATION TESTING

Historically, most specialty boards required candidates to successfully complete two to three years of post-graduate training in a specialty residency program, then to pass an essay examination over the content knowledge of the medical specialty and a bedside oral examination, in which the candidate examined a hospitalized patient and was rated by a single examiner. By 1946, one specialty board, the American Board of Internal Medicine, had introduced objective examinations to replace essay tests, and many boards introduced variations of the bedside oral examination which tended to increase the objectivity of the measurement (Hubbard, 1971).

The recent history of medical specialty certification testing has, in general, included objective examinations, patient management problems, and some type of oral or performance examination. Some boards require candidates to obtain a minimum passing score on the objective and/or patient management problem sections prior to admission to the oral or performance examination.
Other boards, and all state licensing examinations, rely solely on objective examinations to certify physician competence. Passing scores are usually determined by referencing examination scores to some norm group's performance and placing the pass/fail cutting score at some reasonable position on the scale.

The National Board of Medical Examiners in Philadelphia, Pennsylvania, has done more, perhaps, than any other single organization to improve the overall quality of specialty certification testing in the United States and Canada. In its role as consultant to many specialty boards, the National Board of Medical Examiners has moved boards from essay content examinations to high-quality objective tests, has conducted much research aimed toward the improvement of the psychometric qualities of certification examinations, and has aided many boards in the construction, administration, and scoring of their examinations (Hubbard, 1971). Largely through the influence of the National Board of Medical Examiners, state boards of physician licensing and specialty certification boards have come to place heavy emphasis on the multiple-choice examination for measuring physician competence.

It is clear that testing procedures used to certify competence in medical specialties have improved greatly through the years of this century. The greatest gains in psychometric quality of specialty certification testing have derived from a move to more objective examination methods. But, as noted above, these examination scores do not predict physician clinical performance well (Williamson, 1976).

Essay Examinations

Prior to the 1950's, when the essay examination was used almost exclusively to measure the cognitive competence of candidates, the inter-rater reliabilities of essay scores were found to be very low (r = .50 to .60) in many studies (e.g., Hubbard and Clemans, 1961). Since the introduction of objective-format examinations to replace essay examinations, the internal-consistency reliabilities of most examinations are in the r ≥ .90 range (Burg and Schumacher, 1979).

Oral Examinations

The oral examination has an even longer history than the essay examination in measuring the competence of physicians. Modern medical education springs from a very long history of treating the training of a physician as an apprenticeship. From ancient through medieval times, even to the present, physician skills are passed from master to apprentice in undergraduate clerkships and graduate residencies. Modern medical curricula usually divide a student's training into two parts: the first year or two is generally devoted to study of the basic life sciences, and the last two years to structured experiences in clinical settings. During the years of clinical training--both during undergraduate clinical clerkships and post-graduate clinical internships and specialty residency training--the oral examination is a highly valued tradition. Students and residents are required to present patient work-ups, during which they are orally examined over the basic and clinical science content appropriate to the particular case.

Another example of oral examination methods used to evaluate physicians-in-training is rounds. Rounds refers to the process of a master physician taking a group of students--residents and medical students--from patient to patient in a teaching hospital and orally questioning individual students about the diagnosis and management of these patients' medical conditions.
Oral Examinations in Certification

It is not, therefore, surprising that specialty certification boards adopted the oral examination as part of their examination procedures. Until the 1950's, specialty boards, in general, required a bedside-type oral examination, in which a candidate was orally examined over cases presented by one or more patients (Hubbard, 1971). As specialty boards became more aware of the psychometric limitations of the oral examination, they tended to modify the oral examination in ways intended to make the measurement more objective, or to simply abandon the oral in favor of patient management problems. (Patient management problems are written, more objectively scored simulations of a physician's ability to diagnose and manage a patient's problem.) Other boards, like the College of Family Physicians of Canada, have adopted a very structured oral format involving several different types of simulated patient interactions (Handbook for Certification in Family Medicine, 1976).

Construction of Specialty Board Examinations

The test construction methods used by the National Board of Medical Examiners typify the general practice currently used by most specialty boards in the construction of certification examinations. Committees of specialty content experts meet several times per year to outline the content of the examination, to write items, and to peer-review items written by their colleagues (Hubbard, 1971). Since nearly all specialty boards interpret the scores yielded by their examinations relative to some norm-group performance, items are written so that extremes of difficulty are avoided as much as possible, in order to maximally discriminate levels of achievement throughout the distribution of scores.

After an objective examination is assembled and administered, it is generally scored and item-analyzed twice. Items identified as poor discriminators--because they are too difficult or too easy, or because of some ambiguity inherent in the wording of the item--are brought to the attention of the decision-making board. Board members generally debate the merits of these questionable items and decide, as a committee, whether to score such items (Hubbard, 1971). Standards for passing are determined by these boards after inspection of the distribution of scores; passing scores are most often set such that the lower 15 to 20 percent of candidates fail the examination (Hechel and Bowles, 1979).

Objective-examination construction methods like those outlined above tend to produce very reliable and content-valid measures. High internal-consistency reliability is achieved by producing items that perform at medium difficulty and, therefore, maximize item and test variance, which tends to maximize the internal-consistency reliability coefficient (Magnusson, 1967). Since content validity is a matter of expert judgment and consensus that the examination measures what it should measure (Standards, 1974), it follows that specialty certification examinations constructed by committees of nationally prominent medical experts in the specialty are, by definition, content valid.
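The conventions just described are easy to state numerically. Below is a minimal sketch, in Python, of the item-analysis arithmetic involved: item difficulty as the proportion answering correctly, item variance p(1 - p) peaking at p = .50, point-biserial item-total discrimination, and a norm-referenced cut that fails the lowest 15 percent. The simulated response matrix is an illustrative assumption, not data from this study.

    import numpy as np

    # Hypothetical 0/1 response matrix: 200 candidates by 100 items.
    rng = np.random.default_rng(0)
    responses = (rng.random((200, 100)) < 0.65).astype(int)

    p = responses.mean(axis=0)       # item difficulty: proportion answering correctly
    item_var = p * (1 - p)           # item variance, largest for items near p = .50
    total = responses.sum(axis=1)

    # Point-biserial discrimination: correlation of each item with the total score.
    rpb = np.array([np.corrcoef(responses[:, j], total)[0, 1]
                    for j in range(responses.shape[1])])

    # Norm-referenced standard: fail the lowest 15 percent of candidates.
    cut = np.percentile(total, 15)
    print((total >= cut).mean())     # approximate pass rate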
NEED FOR THE STUDY

As noted above, current practice in constructing objective specialty certification examinations produces, for the most part, highly reliable and content-valid measurements. However, these examination scores do not correlate well with independent measures of clinical performance (Williamson, 1976). Predictive-validity studies, for example, which attempt to find a correlation between scores on a certification examination and some independent measure of the quality of medical care delivered by the examinees, have failed to show any very large validity coefficients (Burg and Schumacher, 1979; Williamson, 1976).

Concurrent with the improvements in the technology of medical specialty examinations, there has been public pressure toward assuring the quality of health care provided by medical practitioners. Consumers of health care have begun demanding that their physicians provide them the best health care possible or, at least, that they have some protection against inadequate or incompetent medical practice. The public and regulatory agencies have also begun to demand that certification examination scores predict the adequacy of physician performance, or that passing a certification examination, in fact, makes some difference in the quality of health care delivered by the certified physician.

Many factors may account for the low criterion-related validity coefficients of certification examinations: the reliability and validity of the criterion may be low; the range of scores may be restricted in one or both distributions of scores, thus attenuating the correlation coefficient (Magnusson, 1967); or the examination tasks may simply not be relevant to the tasks measured or rated by the criterion (Maatsch et al., 1978), since an objective examination may measure only one aspect of physician competence.
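The attenuation point can be made concrete with the classical correction for attenuation. The sketch below uses illustrative values, not figures from this study:

    import math

    r_xy = 0.30   # observed test-criterion correlation (hypothetical)
    r_xx = 0.90   # reliability of the objective examination (hypothetical)
    r_yy = 0.50   # reliability of the performance criterion (hypothetical)

    # Classical correction for attenuation (cf. Magnusson, 1967):
    r_true = r_xy / math.sqrt(r_xx * r_yy)
    print(round(r_true, 2))   # 0.45: criterion unreliability masks much of the validity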
The test construction methods and philosophy outlined above--writing items to maximize internal-consistency reliability--may themselves tend to attenuate a validity coefficient. For example, items written and selected to be of middle difficulty may be less familiar in content and, thus, less fundamentally necessary to know for the actual, day-to-day practice of clinical medicine than items written and selected to some other criterion. The selection of middle-difficulty items to maximize item discrimination and examination internal-consistency reliability may distort the content relevance of items to the typical practice of medicine.

In summary, current objective test construction methods tend to produce examination batteries that very sharply discriminate levels of achievement in academic medical content areas which may not be totally relevant to the actual practice of medicine. Objective certification examinations do tend to yield highly reliable scores and are judged as content valid by groups of nationally recognized content experts in the specialty. However, scores from such objective examinations fail, in general, to correlate with other independent measures of the quality of health care delivered. One possible explanation for the failure of objective scores to predict clinical performance may be an inherent lack of content relevance in objective items. If items do lack relevance to the actual practice of clinical medicine, the reason may be that the item analysis criteria currently used in selecting these items tend to select items that are less relevant to clinical medicine than they might be.

THE PROBLEM

The purpose of all educational achievement measurement is to discriminate those who know or can do more from those who know or can do less (Ebel, 1972). Medical specialty certification measurement has been viewed, essentially, as achievement measurement--the measurement of how successfully a candidate has mastered the content and skills of the specialty. But, at the same time, the public has come to believe that these measurements should also predict the adequacy of physician clinical performance.

The methods used by specialty boards to construct their objective certification examinations and to select items through item analysis are taken from the literature on classroom achievement measurement (Hubbard, 1971). These methods, while clearly appropriate for their intended use (Ebel, 1972), may be less appropriate to the measurement of, or the prediction of, a medical specialist's ability to deliver adequate, safe health care.

This study will compare certification examination subscales--selected by classical item-analysis methods and by an independent measure of item relevance to clinical practice--for item difficulty, reliability, criterion-related validity, and their discriminant validity for criterion groups with known levels of clinical competence. The proportion of overlap of identical items selected by these two different item selection methods, and the effect of pictorial-stem, factual multiple-choice, and clinical-situational test items on the clinical relevance of item content, will be examined. (Clinical-situational items present clinical data about a patient--for example, signs and symptoms, laboratory data, and so on--and then ask a question about diagnosis or management. The second example item of Table 3.2 shows a clinical-situational item.)

The Emergency Medicine Examination

The Office of Medical Education Research and Development, Colleges of Human and Osteopathic Medicine at Michigan State University, developed, under contract to the American College of Emergency Physicians, a certification examination for the emerging specialty of Emergency Medicine. This new certifying examination, which took over three years to develop, consists of the following formats:

1. Objective: multiple-choice and pictorial multiple-choice items
2. Patient Management Problems
3. Simulated Clinical Encounters: Simulated Patient Encounters and Simulated Situation Encounters

A large library of examination materials was developed and field tested by the Office of Medical Education Research and Development and the American College of Emergency Physicians. A total of ninety-four subjects, representing four criterion groups with known and different levels of training and experience in Emergency Medicine, were administered all examination materials under a National Center for Health Services Research grant (HS 02038) in October, 1977 (Maatsch et al., 1978). The four criterion groups on whom data were collected are:

1. Residency-eligible physicians
2. Practice-eligible physicians
3. Second-year residents in Emergency Medicine
4. Fourth-year medical students

Item Selection Strategies to Investigate

The relevance of objective-item scores to the adequacy of simulated health care delivered is the major subject under investigation in this study. Objective-item subscales, operationally defined as high or low on the continuum of relevance to the typical practice of clinical Emergency Medicine, will be compared to subscales selected by the classical item analysis methods (Hubbard, 1971) used by most certifying boards. These empirically defined subscales will be compared regarding their criterion-group discrimination, internal-consistency reliability, mean item difficulty, and criterion-related validity.
Definition of Clinically Relevant Knowledge

For this study, clinically relevant knowledge is defined as that knowledge which is frequently used and/or has direct utility for the accurate diagnosis and successful management of patients' medical problems as seen in typical clinical situations.

Operational Definition of Clinical Relevance

The clinical relevance of objective-item content will be operationally defined, for this investigation, as that item content which correlates most highly with the grand mean rating of the Simulated Clinical Encounters. The twelve Simulated Clinical Encounters--eight Simulated Patient Encounters and four Simulated Situation Encounters--consist of highly structured oral simulations of typical emergency patients' clinical problems. A well-trained examiner presents the realistic problem or case to the candidate, who works through the diagnosis and medical management of the patient or patients orally. The examiner then rates the candidate's performance at the conclusion of the simulation.

Objective-Item Subscales to Investigate

The objective format of the Emergency Medicine Examination consists, after the deletion of some items following an initial item analysis, of 364 four- or five-option, single-best-answer multiple-choice items. These items sample twenty-three content categories of Emergency Medicine and were intended to measure the essential knowledge or ability needed for the typical practice of clinical Emergency Medicine (Maatsch et al., 1976).

Four Subscales to Study

The following four 91-item examination subscales will be investigated:

1. Medium-Difficulty Subscale: 91 items selected for item difficulties closest to ideal for norm-referenced achievement tests (.5 ≤ p ≤ .7) and positive point-biserial item-total score discrimination indices.

2. Low-Difficulty Subscale: 91 items selected as having the lowest item difficulties and positive point-biserial item-total score discrimination indices.

3. High Clinical-Relevance Subscale: 91 items selected for their highest correlation with the grand mean rating on the Simulated Clinical Encounters.

4. Low Clinical-Relevance Subscale: 91 items selected for their lowest correlation with the grand mean rating on the Simulated Clinical Encounters.

Subscales one and two will be composed of independent items, as will scales three and four. However, subscales one and two will not necessarily be composed of a set of items that are completely different from the items found in subscales three and four.
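These four selection rules can be sketched directly from the operational definitions above. The following Python sketch assumes a 0/1 response matrix for the calibration subjects and a vector of grand mean simulation ratings; the function and array names are hypothetical, not the study's actual analysis code:

    import numpy as np

    def select_subscales(responses, criterion, k=91):
        """responses: 0/1 array, subjects x items; criterion: each subject's
        grand mean Simulated Clinical Encounter rating. Returns item indices
        for the four subscales (a sketch under the stated definitions)."""
        n_items = responses.shape[1]
        p = responses.mean(axis=0)                  # item difficulty: proportion correct
        total = responses.sum(axis=1)
        r_total = np.array([np.corrcoef(responses[:, j], total)[0, 1]
                            for j in range(n_items)])      # item-total point-biserial
        r_crit = np.array([np.corrcoef(responses[:, j], criterion)[0, 1]
                           for j in range(n_items)])       # item-criterion point-biserial
        ok = r_total > 0                            # positive discrimination required
        medium = np.where((p >= .5) & (p <= .7) & ok)[0][:k]
        low_diff = np.argsort(np.where(ok, -p, np.inf))[:k]   # easiest eligible items
        high_rel = np.argsort(-r_crit)[:k]          # highest item-criterion correlations
        low_rel = np.argsort(r_crit)[:k]            # lowest item-criterion correlations
        return medium, low_diff, high_rel, low_rel

The entire difference between the two strategies lies in which correlation drives selection: the difficulty scales lean on p and the item-total point-biserial, while the clinical-relevance scales lean on the item-criterion point-biserial.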
RESEARCH HYPOTHESES

IA. H: There is a difference in the criterion-group discrimination of the medium difficulty and the high clinical-relevance subscales. The difference favors the high clinical-relevance subscale.

IB. H: There is a difference in the criterion-group discrimination of the medium difficulty and the low clinical-relevance subscales. The difference favors the medium difficulty subscale.

IC. H: There is a difference in the criterion-group discrimination of the high clinical-relevance and the low clinical-relevance subscales. The difference favors the high clinical-relevance subscale.

II. H: There are differences in criterion-related validity between the medium difficulty and the high clinical-relevance subscales. The high clinical-relevance subscale will have a higher validity coefficient than the medium difficulty subscale.

III. H: There is a difference in internal-consistency reliability between the medium difficulty and the high clinical-relevance subscales. The difference favors the medium difficulty subscale.

IV. H: There are differences in mean item difficulty between the medium difficulty subscale and the high and the low clinical-relevance subscales. The medium difficulty subscale and the low clinical-relevance subscale will be more difficult than the high clinical-relevance scale.

V. H: The proportion of overlap of identical items between the medium difficulty and the high clinical-relevance subscales will be lower than the proportion of overlap of identical items between the medium difficulty and the low clinical-relevance subscales.

VI. H: There are differences in the distributions of pictorial-stem, clinical-situational, and factual multiple-choice items selected for the four subscales. The high clinical-relevance subscale will have a larger proportion of pictorial-stem and/or clinical-situational items than the medium difficulty or the low clinical-relevance subscale.

SUMMARY

The techniques and methods used to objectively measure the competence of medical specialists have been greatly improved during the last twenty-five years. These improvements include specialty boards' adoption of multiple-choice formats to replace essay examinations and improvement or abandonment of the oral examination. In recent years, there has been increasing pressure, both from within the medical specialty professions and from consumers of medical care, to show that specialty certification examination scores predict the quality of health care delivered by the examinee or, at least, that certified specialists perform more adequately than non-certified specialists. Research in these areas has been largely ignored, or has failed, for the most part, to demonstrate the criterion-related validity of certification examination scores.

This dissertation study will evaluate the effect of two different objective-item selection strategies on the validity of the statistical discrimination of groups of subjects with known and different levels of training and experience in the medical specialty and differing levels of competence to deliver health care in the specialty. Classical item selection methods will be empirically compared to the selection of items for clinical relevance, as defined for this study. With independent ratings of simulated clinical performance standing in for ratings of actual clinical performance, objective subscales selected for their item difficulty or their relevance to clinical medicine will be evaluated for differences in criterion-related validity, scale reliability, and mean item difficulty. The proportions of identical items selected for subscales using the two separate item selection strategies will be evaluated. Finally, the contribution to clinical relevance of two specialized item types--the pictorial-stem and the clinical-situational item--will be assessed.

The theoretical contribution of this research study will be to provide a procedural model for objective certification examination construction that will maximize the valid discrimination of clinical competence.
The procedures and methodology used to construct the Emergency Medicine Examination, the design of the field test experiment, the sampling of subjects, and the statistical methods to be used to test hypotheses will be discussed in Chapter III. In Chapter IV, the results of the data analysis will be pre- sented. Chapter V will discuss the results of the statistical analyses and present the conclusions resulting from this dissertation study. Additional research suggested by this study will also be noted in Chapter V. CHAPTER II REVIEW OF THE LITERATURE Chapter I stated the need for this study in relationship to a current critical problem facing medical specialty certifying boards. This basic problem--the general lack of criterion-related validity for certification examinations or lack of evidence that passing a certification examination makes much difference in physician per- formance--will be documented in this chapter. The reliability and validity of some medical specialty certi- fication examinations will be examined. Two examination construction models--one intended for achievement testing and the other intended for the prediction of performance-~will be reviewed. Optimum item- analysis strategies for each type of examination will be reviewed. CERTIFICATION EXAMINATIONS A search of the literature on medical specialty certification examinations yields relatively few empirical studies. And, some studies that are reported are of questionable quality, such that con— clusions may be of limited value. The Relationship of Certification Results to Performance Measures Several studies in the past thirty years have attempted to assess the relationship between physician variables--such as years and type of training, certification examination scores, and so on--and quality of health care delivered by the physician. 17 18 One of the earliest studies of correlates to physician performance was conducted for the Teamster Union (Trussell, 1962) in the 1950's. In this classic study, a team of specialists conducted a thorough chart audit of 406 hospital admissions, randomly selected from Teamster members in the New York area. Many aspects of medical care were rated by the team of specialists. This study's major conclusion was that certification status had little relation to the quality of health care delivered by the physicians in the study. The type of hospital-- whether teaching or non-teaching--was more highly related to quality of care than the certification status of physician providers of health care. This study was replicated by Morehead and others (1964) and these results were confirmed. McGuire and Williamson (1968) conducted a study for the American Heart Association in which they compared the performance of three groups of physicians--general practitioners, non-certified, and certified specialists-~on three patient management problems. Results of this study showed no statistical differences in the performance of the three groups on the written simulations of physician performance. Pawluk and others (1976) studied the relationship between scores on various formats of the Canadian Family Practice Certification Examination and physician performance on a measure of quality of care (Kessner, 1973; Sibley, 1975). The sample of subjects for this study was very small (n=15) and, thus, correlations may have been attenuated by large standard errors around r. Pawluk‘s data showed that the multiple-choice scores correlated -.36 with the measure of quality of care. 
Gonnella (1973) found low correlations between scores on multiple-choice examinations in Urology and measures of diagnostic accuracy and proper management of urinary tract infections with a sample of certified urologists.

Some of the best evidence for the lack of criterion-related validity of medical specialty certification examinations has been presented by Beverly Payne and his associates at the University of Michigan. For example, Payne and Lyons (1972), in a large study of the correlates of physician health-care delivery in Hawaii, evaluated the adequacy of physician management of twenty common health problems typically seen in hospital and office practices. The major finding in this study was that board certification, type of specialization, years of practice, and hospital size did not correlate with process-audit ratings of physician performance. However, in some specialized areas of medical practice--Pediatrics, Surgery, Internal Medicine--years of experience in the specialty did predict clinical performance; the board certification of the specialists did not. Payne found only two areas in which certified specialists performed statistically better than non-certified specialists; in both of these cases, the barely significant effect seemed confounded by a disordinal interaction effect.

Rhee (1975) reanalyzed the Payne Hawaii data for a doctoral dissertation. These findings show no statistical differences in the performance of board-eligible and board-certified physicians. However, Rhee found that board-eligible and board-certified physicians performed much better than self-claimed specialists when they were practicing in their own specialized areas.
21 Validity Hubbard (1971) shows that the National Board of Medical Examiners test construction methods assure the content validity of specialty examinations from the National Board. These objective examinations, as outlined in Chapter I, are constructed by committees of experts in the specialized content. Predictive validity studies for National Board certifying examinations have failed to show any large correlations with ratings of clinical performance (Burg and Schumacher, 1979). Burg, Guerin, and Schumacher (1977) and Levine, McGuire and Nattress (1970) show some construct validities for various National Board certifying examinations. That is, these studies demonstrate that certain National Board certifying examinations yield scores that are sensitive to years of training in some specialties. Maatsch and others (1978) have shown that the Emergepgy Medicine Examination yields scores, in both the Objective and Simulated Clinical Encounter formats, that were sensitive to years of training and experience in Emergency Medicine. This same study also showed the concurrent validity of the Emergency Medicine Examination in that the total Objective score is correlated with the grand mean rating on Simulated Clinical Encounters .83. On the other hand, correlations of objective scores and ratings on an oral examination for an American Board of Orthopaedic Surgery certifying examination were .29 overall (Levine and McGuire, 1970). And, the highest concurrent validity coefficient reported by Kelley and others (1971) for the American Board of Anesthesiology Examination was r = .54, for scores that had been corrected for attenuation due to the unreliability of both objective and oral rating scores. 22 Reliability The reliability of examination scores is defined by Ebel (1972) as "....the consistency with which a set of test scores measures what- ever it does measure." Many test specialists (e.g., Mehrens and Lehmann, 1973) suggest that test reliability is the single most important index of overall examination quality. In general, for objectively scored medical specialty certification examinations, Burg and Schumacher (1979) state that internal-consistency reliabilities are greater than I = .90. Oral examination formats, on the other hand, tend to have much lower reliability (Burg and Schumacher, 1971). Accordingly, most empirical research reported in the literature has dealt with the reliability of the Oral examination format. Since different researchers tend to use different methods of calculating oral examination reliability, the studies reported here may not be exactly comparable. It should be noted in this context that the appropriate reliability to report for oral examinations is the inter-rater reliability coefficient; inter-rater agreement is most accurately and efficiently assessed by the interclass inter-rater reliability coefficient (Ebel, 1951a). The generally low inter-rater reliability of oral examinations is well documented (e.g., Ebel, 1972; Mehrens and Lehmann, 1973). Yet despite much evidence for the errorfulness of oral examination ratings, medical specialty boards have used this examination format from the beginnings of the certification movement. It is interesting to note that in the classic study by Levine and McGuire (1970), which con- cluded that oral examinations measure something quite different than 23 objective examinations, the inter-rater reliability of the oral was only r = .50. 
The generally low inter-rater reliability of oral examinations is well documented (e.g., Ebel, 1972; Mehrens and Lehmann, 1973). Yet, despite much evidence for the errorfulness of oral examination ratings, medical specialty boards have used this examination format from the beginnings of the certification movement. It is interesting to note that in the classic study by Levine and McGuire (1970), which concluded that oral examinations measure something quite different from objective examinations, the inter-rater reliability of the oral was only r = .50. In an earlier study, McGuire (1966) reported substantial rating disagreements for oral certifying examinations, although she did not report a coefficient of rater agreement.

Other oral certifying examinations report much higher rater-agreement coefficients. For example, Carter (1962) studied the rater agreement for the oral format of the American Board of Anesthesiology examination and found an agreement coefficient of r = .80. For the Emergency Medicine Examination, Maatsch and others (1978) report high intraclass correlation coefficients for the oral Simulated Clinical Encounters. For twelve field test cases, the inter-rater reliability coefficients for individual ratings ranged from r = .63 to .89, with an average r for all problems equal to .79. Raters for this study were, however, carefully trained to an objective rating criterion.

The inter-rater reliabilities for the oral Simulated Clinical Encounters compare favorably with the internal-consistency reliability coefficients for the objective formats of the Emergency Medicine Examination. The Kuder-Richardson 20 reliability for the total pool of 103 pictorial-stem items was .89; for the 261 multiple-choice items, the reliability was .94; and, for the total library of 364 objective items, the reliability was .96.

In summary, the available empirical research shows that medical specialty certifying examinations:

1. Have high internal-consistency reliability for objective formats.
2. Have low inter-rater reliabilities for oral examinations, unless the format is highly structured and raters are well trained.
3. Have low between-format correlations, from which it has been concluded that different examination formats measure different aspects of physician competence.

EXAMINATION CONSTRUCTION MODELS

Examination construction specialists have understood for years that different test construction methods are appropriate for different intended uses of test scores. For example, the test construction techniques and item selection strategies that are most efficient for classroom achievement testing may be less efficient for building examinations to validly predict successful job performance.

Achievement Versus Aptitude Test Construction

Ebel (1951b; 1956; 1967; 1972) has carefully and completely documented the most appropriate methods to use in constructing achievement examinations. These methods may be briefly summarized by the following propositions:

1. Carefully detailed test content yields content-valid measurements.
2. Objective items that present novel questions or problems tend to test student understanding of the relevant and important concepts learned.
3. The achievement test may be the best operational definition of the subject content available.
4. Items of medium difficulty yield internally consistent measurements of student achievement.
5. The upper-lower (D) discrimination index, biased toward items of middle difficulty, tends to select the most efficient achievement test items.

Most test construction specialists would agree that these methods will yield valid, reliable, and objective measurements of student achievement. Henrysson (1971), for example, suggests that the point-biserial or the biserial item-total score correlation coefficient be used as an item discrimination index to select achievement test items that maximize item discrimination and test reliability. Cronbach (1951) shows that Coefficient Alpha--Kuder-Richardson 20 for dichotomously scored items--is the most appropriate index of internal-consistency reliability.
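A minimal sketch of that coefficient follows; with 0/1 item scores it reduces to Kuder-Richardson formula 20:

    import numpy as np

    def coefficient_alpha(scores):
        """Cronbach's Coefficient Alpha for a subjects x items score matrix;
        with dichotomous (0/1) item scores this is Kuder-Richardson 20."""
        k = scores.shape[1]
        item_vars = scores.var(axis=0, ddof=1)        # per-item variances
        total_var = scores.sum(axis=1).var(ddof=1)    # variance of total scores
        return (k / (k - 1)) * (1 - item_vars.sum() / total_var)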
Procedures recommended by Ebel (1972) and most other test construction specialists tend to produce internally consistent measures of achievement that are appropriate for norm-referenced score interpretation.

While aptitude test constructors may write items that look just like achievement test items, item-analysis selection strategies may be quite different. For example, if the purpose of an examination is to predict some future complex performance or status, it may be statistically beneficial to write heterogeneous test items (Guion, 1965). The logical extension of this test construction methodology may be found in non-cognitive measurement, especially in empirically keyed instruments (Mehrens and Lehmann, 1973), wherein items are selected for a scale solely for their empirical correlation with some behavior in some sample of subjects. This strategy--to maximize criterion-related predictive validity--may be the complete opposite of the content-validity strategy employed by achievement testers (Ebel, 1972). That is, items chosen to maximize predictive validity may have little or no content validity and low internal-consistency reliability (Guion, 1965).

The personnel testing situation is perhaps the best example of the differences between the test construction methods of achievement and "prediction" testing. In achievement testing, well-written, content-valid, discriminating items of about middle difficulty will yield scores that rank-order students according to their mastery of the content measured; the goals of the measuring process--to mark student achievement validly--are accomplished. In personnel testing situations--where the goals of measurement may be to predict some future performance or to validly sort groups of subjects according to some psychological trait--test items may have little or no content relevance, but they must have empirical predictive power for the criterion of interest. Accordingly, the personnel test constructor may use some external criterion against which to correlate item scores, or use multiple regression and/or discriminant analysis techniques (Guion, 1965), to identify the most efficient items for the final form of the examination. This final form of the personnel test may have little content validity, but will likely have a very high criterion-related validity coefficient; it may also have a rather low internal-consistency reliability coefficient.

Magnusson (1967) points out that the prediction of future complex performance may require a test which is composed of several subtests. These subtests--to be maximally efficient--would be highly internally consistent but would have low inter-correlations with the other subscales.

Ebel (1961; 1978) has pointed out that test validity is much more a characteristic of test use than of the test itself. Measurements yielded by tests must be valid--but valid for what purpose? Is it perhaps unreasonable to demand that a given test be both content valid and predictive of performance? If the lack of criterion-related validity noted above for most certifying procedures is as serious a problem as Williamson (1976) states, then certifying bodies must decide, more clearly than in the past, what their goals of measurement are. Is the purpose of certification testing the measurement of cognitive achievement in specialized medical content? If so, the literature on achievement testing noted here is relevant. If, however, specialty boards decide that their purpose is to protect the public from incompetent and dangerous medical practice, the literature on minimum-competency testing is relevant. If boards decide that the prediction of future clinical practice is the most essential goal, then the literature of personnel and aptitude testing may be of interest. This study cannot answer the philosophical questions posed, but will attempt to address the empirical questions concerning the effectiveness of two different item selection strategies for validly and reliably discriminating groups of subjects with known levels of ability to deliver health care.

SUMMARY

Few high-quality empirical studies have been reported in the area of medical specialty certification. No studies have been reported that relate directly to the exact problem being investigated here. However, a review of the available literature on certifying examinations and two models of measurement reveals the following:
If, however, specialty boards decide that their purpose is to protect the public from incompetent and dangerous medical practice, the literature on minimum competency testing is relevant. If boards decide that the prediction of future clinical practice is the most essential goal, then the literature of personnel and aptitude testing may be of interest. This study can not answer the philosophical questions posed, but will attempt to address the empirical questions concerning the effectiveness of two different item selection strategies for validly and reliably discriminating groups of subjects with known levels of ability to deliver health care. SUMMARY Few high-quality empirical studies have been reported in the area of medical specialty certification. No studies have been reported that relate directly to the exact problem being investigated here. However, a review of the available literature on certifying examinations and two models of measurement reveals the f0110wing: 28 There is little or no evidence that scores on certification examinations in medical specialties predict the quality of medical care delivered by candidates. Post-graduate residency training does predict the quality of clinical performance. Objective formats of certifying examinations tend to be highly internally consistent. Oral certification examination formats tend to have low inter-rater reliability coefficients, unless the oral is standardized and the raters are well trained. Certifying examination formats--multiple-choice and oral-- tend to have low correlations, unless multiple-choice items are written to be relevant to clinical medicine. Specialty boards have not clarified the purposes of their certification measurements. If the purpose is to grant a certificate of excellence to masters of specialty content, achievement testing methods may be appropriate. If the purpose of certification testing is the valid prediction of some future clinical performance, aptitude or personnel testing methods may be appropriate. CHAPTER III PROCEDURES AND DESIGN INTRODUCTION The purpose of this research is to compare subscales selected for item difficulty and clinical-relevance for their psychometric quality in a medical specialty certification examination. Four objective-item subscales will be identified--two subscales of items selected for an item-difficulty criterion and two subscales of items selected for an external criterion of item correlation with per- formance on realistic clinical simulations. These fbur objective sub- scales will be compared for mean item difficulty, internal-consistency reliability, criterion-related validity, and ability to discriminate three groups with known levels of training, experience, and ability to deliver health care in the medical Specialty. Further, the proportion of overlap of identical items selected by different item selection strategies will be examined. Finally, the contribution of three different objective item types-~the pictorial-stem, clinical- situational, and factual multiple-choice item--to clinical relevance, as operationally defined for this study, will be examined. This chapter includes a description of the sampling plan and rationale used to select subjects for this study, the details of examination construction for both the objective items and the clinical simulations, the design of this study, the hypotheses to be tested, and the statistical procedures to be used to test the hypotheses for this research. 
SAMPLE OF SUBJECTS

A total of ninety-four subjects participated in this study. These subjects were chosen to represent four distinct groups on the dimensions of known years of training in Emergency Medicine and years of experience in practicing Emergency Medicine. The four groups of examinees were:

1. Residency-Eligible Emergency Physicians: n=22 subjects who were eligible to take a certification examination by virtue of graduation from an approved Emergency Medicine residency program and continuous practice in Emergency Medicine for a minimum of one year.

2. Practice-Eligible Emergency Physicians: n=14 subjects who were eligible to take a certification examination by meeting the requirement of five years of continuous practice in Emergency Medicine.

3. Residents in Emergency Medicine: n=36 subjects who were beginning their second of three years of residency training in Emergency Medicine.

4. Medical Students: n=22 subjects who were beginning their fourth year of pre-doctoral clinical study.

The total number of subjects selected for this study (N=94) was restricted due to the high cost of subject acquisition. The original plan was to have approximately one-hundred subjects equally divided among three groups (physicians, residents, and students), such that large-sample (n > 30) statistics could be used for inter-group comparisons. The final sample of subjects, as detailed below, fell considerably short of the original goal due to the programmatic constraints of subject acquisition, and subject fee and travel limitations.

Different criteria were used to select groups of subjects for this study. Sampling procedures used for each group are detailed below.

Residency-Eligible and Practice-Eligible Emergency Physicians

The American Board of Emergency Medicine was constituted in March, 1976. This new medical specialty board is ultimately responsible for graduate medical education in Emergency Medicine and the certification of specialists in Emergency Medicine. In its role as a certifying body, the Board sets certain minimum prerequisites of training and/or experience in Emergency Medicine for those who wish to take the Examination. The Board, recognizing the newness of the specialty and the short history of residency training in Emergency Medicine, has set two separate prerequisite paths to qualify for its certification examination. These two paths are: residency training in Emergency Medicine in one of its approved programs, or five years of continuous practice in hospital emergency departments.

Selection of Residency-Eligible and Practice-Eligible Emergency Physicians

A combination of a peer-nomination method and a random sampling plan was used to select two groups of emergency physicians. Each group was intended to be representative of the residency-eligible and practice-eligible applicants for the examination who were clearly competent in clinical, diagnostic, and patient management skills and who were, therefore, certifiable as competent specialists in Emergency Medicine. Accordingly, the first step involved a request by the American College of Emergency Physicians to all its state affiliates for nominations of members to sit for a field test of this examination. The criteria that were to be used for peer-nominations were personal knowledge that the nominee:

1. Provides very competent health care in the emergency department setting.

2. Maintains current knowledge of clinical, diagnostic, and patient management procedures.

3.
Is eligible for the certification examination either by residency training or years of practice in Emergency Medicine.

A total of 151 emergency physicians, from throughout the United States and Canada, remained on a nomination list after a credential review by the American Board of Emergency Medicine. Thirty-two nominees had residency-eligibility and 119 had practice-eligibility. The relationship between the numbers of nominees in the two eligibility groups is roughly proportional to the percentage of membership in the American College of Emergency Physicians for each group.

The second step was to select a total of thirty-six field test subjects from these nominees. A total of twenty-two residency-eligible physicians (with ten alternates) and fourteen practice-eligible physicians (with eighteen alternates) was selected by a simple random sample of two separate nomination lists. The sample was deliberately skewed in favor of the residency-eligible physicians because of a belief that this group represented more clearly certifiable physicians than the practice-eligibles. The emergency physician participants were reimbursed for travel and per diem expenses for their participation in the field test.

Selection of Residents in Emergency Medicine

The selection of second-year residents in Emergency Medicine also employed a two-step selection plan. The first step in the selection of this group of thirty-six subjects involved the American College of Emergency Physicians' request to the residency directors of all twenty-four residency programs in the United States to submit a rank-ordered list of their second-year residents. This listing ranked every resident in each program from highest to lowest with respect to relative overall clinical competence in Emergency Medicine. This procedure was intended to ensure a final sample of residents that would be representative of the range of competence of second-year residents in Emergency Medicine.

The second step in selecting residents was the random selection of thirty-six subjects and alternates. This sampling was carried out by drawing random samples from each of the twenty-four residency programs in the following manner:

1. Random samples were drawn that were stratified on high, middle, and low ranges of competence within each residency program.

2. Random samples were drawn from each program such that the number of subjects selected was roughly proportional to the size of the program.

The thirty-six residents who participated in this study received a subject fee and were reimbursed for travel and per diem expenses.

Selection of Medical Students

Medical students beginning their final year of undergraduate medical education were from Michigan State University's Colleges of Human and Osteopathic Medicine. These paid-volunteer students were recruited to represent a novice or base-line group in training and experience in Emergency Medicine. This group was not selected to be representative of fourth-year medical students at Michigan State University. The initial procedures followed in obtaining these subjects were:

1. Random selection of subjects and alternates from each of six Michigan communities where students receive clinical training.

2. Invitations to students so selected and their alternates to participate in the field test.

This group proved to be the most difficult to obtain. The time of the field test conflicted with the clinical clerkship schedules of many students, who consequently could not serve as subjects for this study.
Random selection ultimately had to be abandoned in the interest of simply obtaining sufficient numbers of student-volunteers for the study. The final group of twenty-two student subjects who participated in this study were, then, paid volunteers whose clerkship schedules permitted their participation. Students received a subject fee, plus travel and per diem expense reimbursement. Students also received feedback on their examination performance in terms of raw and percent-correct scores and percentile ranks, based on their own group, for all Examination formats and some content categories.

EXAMINATION CONSTRUCTION

The development of the examination materials for the Emergency Medicine Examination(2) took place from January 1975 to August 1977. All items were generated by content expert members of the American College of Emergency Physicians and test construction specialists of the Office of Medical Education Research and Development, Michigan State University. The following pages will describe in detail the examination construction methods and procedures used to develop the Objective and the Simulated Clinical Encounter formats of the Emergency Medicine Examination.

(2) The data, test formats, scoring mechanisms, and all related examination development and validation procedures described in this dissertation were developed for the American College of Emergency Physicians. The American Board of Emergency Medicine, which will subsequently administer the first certification examination, reserves the right to use all or part of the test library and methodologies developed by the American College of Emergency Physicians.

Overview

The major steps involved in examination construction were:

1. Definition of the exact and proper content of Emergency Medicine by the American College of Emergency Physicians.

2. Identification of Emergency Medicine content to test, and rank-ordering of this content by its importance to test, leading to a test blueprint.

3. Development of detailed content statements on which examination materials would be based: condition sheets.

4. Assignment of item quotas to specialized task forces of American College of Emergency Physicians item writers.

5. Training of American College of Emergency Physicians item generators by the Office of Medical Education Research and Development.

6. Item writing, review, editing, and production.

These procedural steps culminated in the field testing of all Emergency Medicine Examination items on October 22-26, 1977, in Lansing, Michigan, with the sample of subjects noted above.

Definition of Content

The first stages of examination development required the identification and rank-ordering by importance of the content universe of Emergency Medicine. The American College of Emergency Physicians had worked prior to 1975 to gain the consensus of a certification task force on a six-page listing of skills needed to practice Emergency Medicine and the medical conditions about which emergency physicians needed content knowledge. This Emergency Medicine Condition/Skills List (Condition/Skills List, 1976) represented the best definition of Emergency Medicine available by defining the domain of content knowledge and psychomotor skills that the emergency physician needed to have and the medical conditions about which the emergency physician needed information.
The second step toward operationalizing the definition of Emergency Medicine in an examination required the prioritization of this Condition/Skills List by a sample of the American College of Emergency Physicians certification task force members. The prioritizing of list entries was accomplished by administering a questionnaire to approximately one-hundred task force members. Each respondent marked each entry as either essential to test in a certification examination, important to test, unimportant to test, or necessary pre-condition not to be tested. The final task of each respondent was to allocate one-hundred percentage points to twenty-two broad content categories of Emergency Medicine so that the most important category received the highest percentage allocation.

These questionnaire data yielded a consensus of Emergency Medicine specialists about the proper content to test in a certification examination and the proper balance in which to test this content. The table of specifications or test blueprint which guided the construction of the Examination followed directly out of these procedures. Summary results of the percent allocation procedure are given in Table 3.1.

TABLE 3.1
TEST ITEMS ALLOCATED TO MEDICAL CONTENT CATEGORIES

Percentage    Category
13            Cardiovascular disorders (traumatic and nontraumatic)
7             Abdominal disorders
7             Ear, nose, throat, head and neck injuries (traumatic and nontraumatic)
[illegible]   Pulmonary disorders
[illegible]   Skeletal injuries
[illegible]   Traumatic disorders
[illegible]   Urogenital disorders
[illegible]   Infancy and childhood disorders
[illegible]   Metabolic, allergic and toxicologic disorders
[illegible]   Fluid and electrolyte problems
[illegible]   Neurological disorders
[illegible]   Burn and cold exposure
[illegible]   Critical infections
[illegible]   Emergency medical services system (including disaster planning and management)
3             Eye disorders (traumatic and nontraumatic)
[illegible]   Legal-Ethical
[illegible]   Blood disorders
[illegible]   Physician/Patient skills
[illegible]   Emergency department administration
1             Dental emergencies
1             Integumental disorders
100%          Total

The next pre-examination construction procedure involved expanding the entries on the Condition/Skills List into content materials from which examination items could be generated. The Office of Medical Education Research and Development adopted a condition-sheet method whereby the American College of Emergency Physicians task force members would complete very detailed outlines for every entry on the Condition/Skills List. The analogy of a textbook on Emergency Medicine was adopted, such that the twenty-two content categories listed in Table 3.1 became the major chapter headings, and individual conditions, skills, and knowledge became major subdivisions within chapters.

Each condition sheet listed very important or essential knowledge or skills needed by a competent emergency physician for each entry on the specialty-defining list. For each medical condition, typical presenting signs and symptoms, diagnostic and medical management problems frequently encountered, common errors made in diagnosing and/or managing this condition, plus complete medical references were listed in detail. Condition-sheet writing, review, and editing took nearly one-hundred emergency medical leaders approximately one year to complete. The product resulting from this task represents an encyclopedia of medical knowledge and essential skills needed for the competent practice of Emergency Medicine.

The American College of Emergency Physicians next organized five task forces of medical experts for the purpose of examination construction.
These task forces were:

1. Cardio-Respiratory Task Force
2. Medicine Task Force
3. Surgery-Trauma Task Force
4. Physician-Patient Task Force
5. Administration-Systems Task Force

Item(3) quotas were assigned to each task force in accordance with the proportions noted in Table 3.1. Separate procedures were used to develop each examination format. The methods employed to construct the Multiple-Choice, Pictorial Multiple-Choice, and Simulated Clinical Encounter formats will be detailed below.

(3) Item is used in its widest meaning here to include not only multiple-choice questions, but also Patient Management Problems and Simulated Clinical Encounters.

Multiple-Choice Items

A total of 372 multiple-choice items were written by the American College of Emergency Physicians task force item writers for the Emergency Medicine Examination. These items required the selection of one best answer from among four or five options. Table 3.2 presents non-secure examples of the type of multiple-choice items used in this examination.

Emergency physician item writers were trained to write and review multiple-choice questions in a series of workshops. These workshops presented the basic principles of good item writing through a series of instructional materials (Downing, 1977), following closely the work of Ebel (1972); examples of well written and poorly written items were given. Task force writers then practiced writing items, and had these items reviewed by a fellow physician and by a test construction specialist.

TABLE 3.2
EXAMPLE MULTIPLE-CHOICE ITEMS

Which of the following is characteristic of a normal overnight dexamethasone suppression test?
A  24 hour urinary 17 OH falls 50%
B  24 hour urinary 17 KS rises 50%
C  plasma cortisol rises 50%
D  plasma cortisol falls 50%
E  plasma cortisol remains the same

A 35 year old female is seen in the emergency department in a comatose state. Arterial blood gases and serum electrolytes are drawn and reveal the following results: sodium 140, potassium 4.9, chloride 98, bicarbonate 10, pH 7.30, and pCO2 24 mm Hg. Which of the following would most likely be the correct diagnosis?
A  metabolic acidosis--ammonium chloride overdose
B  metabolic acidosis--renal tubular acidosis
C  metabolic alkalosis--duodenal fistula
D  respiratory acidosis--primary
E  metabolic acidosis--diabetic ketoacidosis

Which of the following procedures would give the closest index of the risk of intrauterine death for a fetus with erythroblastosis fetalis?
A  direct Coomb's test of mother's blood
B  indirect Coomb's test of mother's blood
C  spectrophotometric analysis of amniotic fluid
D  direct Coomb's test of RBC's in amniotic fluid
E  spectrophotometric analysis of mother's blood

Guidelines to item writers (Maatsch et al., 1976) for the selection of item content included:

1. Frequently used general rules or principles.
2. Absolutely essential knowledge for competent emergency department practice.
3. Specific applications of knowledge to clinical Emergency Medicine.
4. Knowledge that must be remembered at all times for competent practice.
5. Frequently encountered cases and problems.

At these workshops, physician item writers received quotas of items and condition sheets from which the item content was generated. Each item written was sent to a physician reviewer for comments and criticisms, and then returned to the item author for revisions.
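The quota assignment described above--turning the Table 3.1 percentage allocations into integer item counts for each task force--can be sketched in a few lines. The category names and the largest-remainder rounding rule below are illustrative assumptions, not the College's documented procedure.

    # Hypothetical blueprint weights (percent) for a 364-item pool; only a few
    # of the Table 3.1 categories are shown, so the weights do not sum to 100.
    blueprint = {"Cardiovascular": 13, "Abdominal": 7, "ENT/head and neck": 7, "Eye": 3}
    total_items = 364

    raw = {cat: pct / 100 * total_items for cat, pct in blueprint.items()}
    quotas = {cat: int(share) for cat, share in raw.items()}

    # Distribute the leftover items to the categories with the largest remainders,
    # so the integer quotas track the percentage allocations as closely as possible.
    shortfall = round(sum(raw.values())) - sum(quotas.values())
    for cat in sorted(raw, key=lambda c: raw[c] - quotas[c], reverse=True)[:shortfall]:
        quotas[cat] += 1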
After all items had been written, an Audit Committee of the American College of Emergency Physicians reviewed each item for content and keying, and testing specialists reviewed and edited all items for form.

Pictorial Multiple-Choice Items

The Pictorial Multiple-Choice format of this Examination consisted of 136 pictorial-stem items. This item type presented some visual stimulus--an electrocardiogram rhythm strip, a color photograph of a patient, and/or a high-quality photoreduction of an x-ray--and one or more multiple-choice items based on these visual stimuli. Like the multiple-choice items, the pictorial multiple-choice questions were of the single best answer type and had four or five options.

The procedures used to construct pictorial-stem items were essentially the same as for multiple-choice items. Separate workshops, however, were conducted to train item writers, and additional criteria for item content selection were used. Criteria for selection of visual materials and item content for this format included (Maatsch et al., 1976):

1. Visuals that test general interpretive skills
2. Visuals that typically require immediate interpretation and use in an emergency department
3. Visual materials that knowledgeable candidates can clearly see and interpret

Simulated Clinical Encounters

Twelve Simulated Clinical Encounters were developed for this Examination. Simulated Clinical Encounters are examiner-administered, highly structured simulations of realistic emergency medical problems typically seen in hospital emergency departments. The simulations developed for this Examination are of the patient-game type (Maatsch, 1974), in which a well-trained examiner presents pre-designed and standardized information about a patient, when such information is requested by an examinee, who then proceeds to diagnose and manage the patient being simulated. Realistic patient presenting signs and symptoms are described to the examinee, who may, for example, order laboratory studies, x-rays, electrocardiograms, and so on, to aid in the differential diagnosis of the patient. If laboratory studies are ordered, the examiner provides the results to the examinee at the time these data would be available during a real clinical encounter.

The Simulated Clinical Encounters are structured and standardized on a patient-game board which precisely details all data which are given, if requested by the examinee, and all examiner responses to examinee actions. The simulated case presentation follows a logical and realistic course in which oral descriptions of the simulated patient's condition are contingent on the actions of the examinee. For example, if a patient's medical condition is deteriorating and the examinee orders a certain drug to be administered, the patient's changed condition will be reflected in data provided to the examinee. All such examiner responses to examinee actions are listed on the patient-game board which directs the administration of the simulated case.

Two separate types of Simulated Clinical Encounters were developed for the Emergency Medicine Examination. The first type, the Simulated Patient Encounter, requires the examinee to manage a single simulated patient case during a fifteen-minute time period. The second type, the Simulated Situation Encounter, presents three medical cases which the examinee must manage concurrently; thirty minutes are allowed for each Simulated Situation Encounter.
The total of twelve Simulated Clinical Encounters was divided between eight Simulated Patient Encounters and four Simulated Situation Encounters. These simulations were developed by American College of Emergency Physicians task force members who were teamed with educational developers from the Office of Medical Education Research and Development, Michigan State University.

Prior to developing the Simulated Clinical Encounters, a total of sixty scenarios, story lines of emergency medical cases, were created from entries on the Condition/Skills List. These scenarios were prioritized by a task force of the American Board of Emergency Medicine to ensure the proportionate sampling of the content categories (Table 3.1), the relevance of each scenario to the typical practice of Emergency Medicine, and their suitability for production as a Simulated Patient Encounter or a Simulated Situation Encounter. Each Simulated Clinical Encounter was pre-tested with an emergency physician task force member prior to final production of the case.

DESIGN

This section will present the details of administration of the Emergency Medicine Examination materials to the ninety-four subjects on October 22-26, 1977, in the Hilton Inn, Lansing, Michigan.

Subject Groups

Two administrative sections of twenty-three subjects and two sections of twenty-four subjects were formed for purposes of movement through a master schedule of testing. Subjects were randomly assigned to testing sections in proportion to the numbers in each of the four subject groups represented in this study. Subjects were assigned random identification numbers which were used to identify all responses to examination items. The four testing sections remained together throughout the twenty-two hours of testing, taking meals and breaks together to maintain isolation from all other examinee sections in order to ensure test security. A complicated master schedule was developed to move administrative testing sections and individual subjects through all formats of this examination. Sections were randomly assigned to formats in order to avoid any testing order-effect on criterion-group scores.

Multiple-Choice and Pictorial Multiple-Choice Formats

Multiple-choice items were presented in four booklets of ninety-three items each; pictorial multiple-choice items were divided between two booklets of sixty-eight items each. Items were assigned to booklets for both formats by a random procedure that forced approximately proportional representation of items from each content category listed in Table 3.1 in each booklet.

Examinees answered all items on optically scannable answer sheets for computer scoring and item analysis. Each test booklet presented thorough instructions to subjects with example items; test administrators followed a standard set of instructions for each booklet administration. Each examination session was proctored by a test administrator and two assistants, with each session timed to allow exactly one minute per objective item.

Pictorial Multiple-Choice items were presented in special booklets that contained, on opposite pages, both the visual stimulus and the items related to the visual. Subjects, thus, had original visual materials--rather than printed reproductions--to examine for each pictorial question. These Pictorial Multiple-Choice booklets were reused after a thorough check for markings.
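The booklet-assembly constraint described above--random assignment with approximately proportional representation of each content category in each booklet--amounts to shuffling within category and dealing round-robin. A minimal sketch, with hypothetical item identifiers and category labels; the actual 1977 procedure is not documented beyond the description given in the text.

    import random

    def assemble_booklets(items_by_category, n_booklets, seed=1977):
        # Shuffle each category, then deal items round-robin so every booklet
        # receives a near-proportional share of every content category.
        rng = random.Random(seed)
        booklets = [[] for _ in range(n_booklets)]
        for category, items in items_by_category.items():
            shuffled = items[:]
            rng.shuffle(shuffled)
            for i, item in enumerate(shuffled):
                booklets[i % n_booklets].append(item)
        return booklets

    # Example: 372 multiple-choice items in two hypothetical categories, four booklets
    # of ninety-three items each.
    booklets = assemble_booklets({"cardio": [f"C{i}" for i in range(186)],
                                  "trauma": [f"T{i}" for i in range(186)]},
                                 n_booklets=4)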
Simulated Clinical Encounters

Twenty-four emergency physician examiners conducted the Simulated Patient Encounters and Simulated Situation Encounters. Examiners had spent ten hours in training immediately prior to administering the Simulated Clinical Encounters. Developers did not administer their own cases.

Simulated Clinical Encounters were administered in small booths which had been subdivided from a large hotel ballroom. The examiner and examinee were seated across a table from each other, with the Simulated Clinical Encounter game board between them. Sections of subjects moved through the twelve Simulated Clinical Encounters according to the master schedule and individual subject schedules. A time keeper signaled the beginning and end of each fifteen-minute Simulated Patient Encounter and each thirty-minute Simulated Situation Encounter.

During every two-hour Simulated Clinical Encounter block, sixteen examiners worked, with six examiners free for rest or for pairing with other administrators to verify examiner ratings. This verification was undertaken to study the inter-rater reliability of the Simulated Clinical Encounters. Second raters were randomly assigned to observe approximately twenty-five percent of Simulated Clinical Encounter administrations and independently rate examinee performance. Inter-rater reliabilities for individual ratings, computed by the intraclass formula, ranged from r = .63 to .89 for the twelve Simulated Clinical Encounters. The inter-rater reliability of the grand mean rating on Simulated Clinical Encounters was .79 (Maatsch et al., 1978).

Examiners completed a rating form on each candidate immediately after each Simulated Clinical Encounter session. Seven separate clinical skills were rated on an eight-point scale for each case presented. It should be noted that, because of the method of examinee identification used, examiners had no knowledge of the criterion-group membership of individual subjects.

Generalizability of Results

Results of this study may be generalized to the population of emergency physicians judged by their peers to be very competent and certifiable in Emergency Medicine. For the resident sample, results may be generalized to the population of second-year residents in Emergency Medicine who have a wide range of competence as judged by their residency directors. Since the sample of students is a sample of convenience, only limited generalizations should be made to the population of fourth-year medical students at Michigan State University.

The matter of generalizability of results to specific populations of subjects is, however, not of primary concern in the present study. Since the goal of this study is to determine the relationship of item content relevance to clinical performance and the relationship of item difficulty to clinical relevance, the generalizations of most interest concern inferences about item content and item types and their contribution to the valid discrimination of criterion-groups with known levels of clinical competence.

Subscale Development

As noted in Chapter I, four subscales will be identified for this study. Items for two of these subscales--the medium difficulty and the low difficulty subscales--will be identified using an item difficulty criterion. Items for two other subscales--the high clinical-relevance and the low clinical-relevance subscales--will be selected by using an item-criterion score (grand mean Simulated Clinical Encounter rating) correlation criterion.
These four subscales are schematically diagrammed in Figure 3.1.

[Figure 3.1: Schematic of the four 91-item subscales to investigate. The medium difficulty and low difficulty subscales are defined by item difficulty; the high and low clinical-relevance subscales are defined by item correlation with the Simulated Clinical Encounters.]

A major purpose of this research is to study the effect of item selection rules on the valid discrimination of clinical performance. The logic underlying the testable hypotheses for this study can be summarized by the following questions:

1. Does the selection of items for medium difficulty tend to distort the clinical-relevance of item content?

2. What is the effect of any such distortion on the valid discrimination of groups with known levels of clinical competence?

3. Do different item types vary with respect to clinical relevance?

The ninety-one items for each of the four subscales will be identified, using the criteria noted in Chapter I. Ninety-one items for the medium difficulty subscale and ninety-one items for the low difficulty subscale will be selected from item analysis data; subscale scores consisting of the sum of correct responses will be computed. Next, all 364 items will be correlated with the grand mean rating on Simulated Clinical Encounters; these items will then be rank-ordered and the high clinical-relevance and the low clinical-relevance items will be identified. The two clinical-relevance subscale scores will then be computed such that each subject's subscale score is the sum of the number of correct responses to the items in the scale.

All data analyses to be carried out for the hypotheses of this study will use only three criterion groups of subjects: the residency-eligible, resident, and student groups. Practice-eligible physicians (n=14) will not be considered in any of the analyses for subscale development or hypothesis testing, because--of the two physician groups--there is greater confidence that the residency-eligible group represents true competence in Emergency Medicine. It is felt, therefore, that for the hypotheses to be tested in this study, clearer results will be obtained by omitting the fourteen practice-eligibles from all analyses. Omission of the practice-eligible group from subscale development analyses will also allow for a small validation of the results of this study by reanalysis of some hypotheses using only the fourteen practice-eligibles or the practice-eligible group in combination with residents and students.

HYPOTHESES AND ANALYSIS METHODS

IA. H0: There is no difference in the criterion-group discrimination (residency-eligible, residents, students) of the medium difficulty and the high clinical-relevance subscales.

    H1: There is a difference in the criterion-group discrimination of the medium difficulty and the high clinical-relevance subscales. The difference favors the high clinical-relevance subscale.

IB. H0: There is no difference in the criterion-group discrimination of the medium difficulty and the low clinical-relevance subscales.

    H1: There is a difference in the criterion-group discrimination of the medium difficulty and the low clinical-relevance subscales. The difference favors the medium difficulty subscale.

IC. H0: There is no difference in the criterion-group discrimination of the high clinical-relevance and the low clinical-relevance subscales.

    H1: There is a difference in the criterion-group discrimination of the high clinical-relevance and the low clinical-relevance subscales.
The difference favors the high clinical-relevance subscale.

These hypotheses will be tested by three separate Discriminant Analyses, using the two subscale scores noted in each hypothesis as the discriminating variables and the three criterion groups (residency-eligible, residents, students) as the independent variable. These analyses will identify the most discriminating subscale score by examination of the standardized discriminant function coefficients and the Wilks' Lambda statistic. They will also provide univariate F-tests of the discriminating power of each of the separate subscales (Tatsuoka, 1971). Each hypothesis will be tested by forming an F-ratio of the two univariate F's for the scales being compared.

II. H0: There are no differences in criterion-related validity (subscale scores correlated with mean Simulated Clinical Encounter ratings) between the medium difficulty and the high clinical-relevance subscales.

    H1: There are differences in criterion-related validity between the medium difficulty and the high clinical-relevance subscales. The high clinical-relevance subscale will have a higher validity coefficient than the medium difficulty subscale.

Analysis will require computation of correlation coefficients between each of the two subscales and the criterion of Simulated Clinical Encounter mean ratings. This hypothesis will be tested by a test of the difference of two non-independent correlation coefficients (Glass and Stanley, 1970).

III. H0: There is no difference in internal-consistency reliability between the medium difficulty and the high clinical-relevance subscales.

     H1: There is a difference in internal-consistency reliability between the medium difficulty and the high clinical-relevance subscales. The difference favors the medium difficulty subscale.

The analysis will consist of computation of Kuder-Richardson 20 reliability coefficients for each subscale. Hypothesis III will be tested by forming an F-ratio of the MS_persons/MS_total values for both subscales under consideration (Wilson, 1978).

IV. H0: There are no differences in mean item difficulty between the medium difficulty and the high or the low clinical-relevance subscales.

    H1: There are differences in mean item difficulty between the medium difficulty subscale and the high and the low clinical-relevance subscales. The medium difficulty subscale and the low clinical-relevance subscale will be more difficult than the high clinical-relevance scale.

A repeated measures Analysis of Variance will be used to test Hypothesis IV. Post-hoc contrasts will test differences between the means of the medium difficulty subscale and the high clinical-relevance subscale and also between the mean of the low clinical-relevance scale and the high clinical-relevance scale.

V. H0: The proportion of overlap of identical items between the medium difficulty and the high clinical-relevance subscales will be the same as the proportion of overlap of identical items selected for the medium difficulty and the low clinical-relevance subscales.

   H1: The proportion of overlap of identical items between the medium difficulty and the high clinical-relevance subscales will be lower than the proportion of overlap of identical items between the medium difficulty and the low clinical-relevance subscales.

Analysis for Hypothesis V will require a count of the number of overlapping identical items in the scales noted and computation of these proportions.
The hypothesis will be tested by drawing a confidence interval around the difference of the two proportions (Bacon, 1976).

VI. H0: There are no differences in the distributions of pictorial-stem, clinical-situational, or factual multiple-choice items selected for the medium difficulty, the low difficulty, the high and the low clinical-relevance subscales.

    H1: There are differences in the distributions of pictorial-stem, clinical-situational, and factual multiple-choice items selected for the four subscales. The high clinical-relevance subscale will have a larger distribution of pictorial-stem and/or clinical-situational items than the medium difficulty or the low clinical-relevance subscale.

Counts of the numbers of pictorial-stem, clinical-situational, and factual multiple-choice items selected for each subscale will be performed. A chi-square statistic will be used to test Hypothesis VI.

SUMMARY

Examination materials were developed as a certification examination in Emergency Medicine. This library of test items was administered to ninety-four subjects in four groups over a two and one-half day period. The two formats of most interest in this study are the Pictorial Multiple-Choice and the Multiple-Choice formats. The study was designed to test the effect of clinically relevant item content on group-score discrimination, criterion-related validity, subscale reliability, and item difficulty, and also to examine the effect of different item types on the clinical-relevance of item subscales.

Group discrimination differences for subscales of items selected by different criteria will be tested by a Discriminant Analysis procedure and an associated F-test. A Z-test of non-independent correlation coefficients will be calculated to test differences in criterion-related validity coefficients. An F-test of the difference between two reliability coefficients will be performed. A repeated measures ANOVA will test an hypothesis of equal item difficulties for three subscales. A Z-test of differences in proportions will be used to test differences in proportions of identical items selected for subscales. And a chi-square statistic will test differences in distributions of item types selected for each subscale. Chapter IV presents the results of the data analyses performed for this study.

CHAPTER IV
RESULTS

INTRODUCTION

This chapter presents the results of the statistical analyses that were performed to test the hypotheses of this study. Results are presented concerning differences in the ability of subscales to discriminate statistically three criterion-groups of subjects--the residency-eligible physicians, residents, and medical students. Specifically, the group discrimination of the medium difficulty subscale is compared to that of the high clinical-relevance subscale and the low clinical-relevance subscale. Differences in group discrimination between the high clinical-relevance and the low clinical-relevance subscales are also reported.

The criterion-related validity coefficients of the four subscales--correlations of subscale scores with the grand mean of Simulated Clinical Encounters--are reported and compared. An hypothesis test of the difference in criterion-related validity between the medium difficulty and the high clinical-relevance subscales is presented. Another hypothesis for this study concerns differences in internal-consistency reliability between the medium difficulty and the high clinical-relevance subscales.
Reliability coefficients for each of the four subscales are presented and the results of an hypothesis test are given. Differences in mean item difficulty among the medium difficulty, the high and the low clinical-relevance subscales are reported. Two post-hoc contrasts test differences in mean item difficulty between 1) the medium difficulty and the high clinical-relevance subscales, and 2) the low clinical-relevance and the high clinical-relevance subscales.

To investigate differences in the effect of item types on the statistical properties of scores, differences in the proportion of overlap of identical items between the medium difficulty/high clinical-relevance scales and the medium difficulty/low clinical-relevance scales are reported. Then, differences in distributions of item types--the pictorial-stem, clinical-situational, and factual multiple-choice item types--across the four subscales selected for this study are investigated and the results are reported.

Finally, other results of data analyses which were suggested by the findings of this study are presented. Specifically, this section presents the results of a small validation study in which some of the hypotheses tested are reanalyzed using the practice-eligible emergency physician group (n=14) alone or in combination with the resident and the medical student groups.

ITEM SELECTION FOR FOUR SUBSCALES

Results of the data analyses performed to select items for the four subscales studied here are presented in Tables 4.1 to 4.4.

Medium Difficulty Subscale

Table 4.1 presents the item difficulty index (p-value equals proportion marking a correct answer) and item-total score discrimination index (point-biserial correlation of the item score and the total correct score) for each item in the medium difficulty subscale.

TABLE 4.1
MEDIUM DIFFICULTY ITEMS n=80

Item  p-value  Point-Biserial    Item  p-value  Point-Biserial
A12   .62      .11               D50   .54      .20
A28   .64      .30               D59   .66      .33
A30   .63      .02               D60   .65      .48
A35   .66      .38               D61   .55      .53
A36   .66      .38               D65   .56      .21
A37   .55      .29               D66   .63      .50
A40   .58      .17               D72   .66      .12
A43   .53      .40               D75   .60      .30
A45   .53      .08               D76   .53      .13
A50   .61      .30               D77   .50      .28
A57   .50      .24               D83   .64      .52
B1    .63      .31               D92   .65      .16
B2    .65      .12               E9    .61      .34
B4    .53      .51               E16   .61      .10
B6    .66      .03               E24   .55      .42
B7    .55      .24               E27   .60      .27
B21   .55      .15               E38   .63      .29
B23   .63      .10               E47   .55      .20
B33   .64      .51               E54   .54      .42
B45   .66      .25               E59   .54      .23
C7    .55      .22               E61   .59      .36
C11   .63      .19               E69   .68      .43
C14   .64      .49               E70   .55      .37
C20   .60      .59               E76   .55      .17
C24   .54      .27               E79   .63      .36
C31   .50      .35               E80   .51      .23
C33   .61      .38               E88   .66      .50
C39   .61      .10               F5    .53      .23
C59   .56      .24               F6    .68      .18
C67   .58      .33               F7    .65      .44
C70   .66      .22               F11   .64      .45
C73   .56      .36               F12   .68      .37
C74   .65      .31               F21   .60      .17
C78   .61      .26               F23   .68      .02
C84   .60      .15               F32   .68      .11
D2    .65      .46               F36   .51      .19
D7    .53      .06               F37   .61      .23
D8    .63      .13               F43   .56      .06
D12   .65      .23               F47   .66      .39
D18   .69      .06               F51   .65      .47
D23   .58      .32               F60   .61      .14
D27   .65      .37               F71   .69      .08
D28   .68      .51               F79   .61      .22
D32   .53      .20               F82   .66      .09
D37   .56      .24               F90   .60      .42
D49   .64      .43

Items for the medium difficulty subscale were selected to be near the ideal difficulty recommended by some measurement specialists (e.g., Ebel, 1972) for educational achievement examinations intended to yield scores that will be interpreted relative to some norm group's performance. This ideal difficulty is a mean score on the test that is approximately midway between the chance score and the perfect score. For a ninety-one item test, with items having four options, the ideal mean score would be approximately 57 items correct.
This ideal mean score corresponds to a mean p-value of approximately .63 (the chance score on ninety-one four-option items is 91/4, or about 22.75; the point midway between 22.75 and a perfect 91 is approximately 57, and 57/91 is approximately .63). For the medium difficulty scale in this study, ninety-one items were selected to range in p-value from .50 to .69, for the residency-eligible, the resident, and the student groups (n=80). All items selected had positive item-total score discrimination indices. Items were selected by starting the selection process at p = .50 and continuing to select less difficult items until the quota of ninety-one items had been selected. There were five tied ranks at p = .69; two items out of five were selected at random to complete this ninety-one item subscale. The mean p-value for this scale is .603. A subscale total score was computed such that each subject's score on the medium difficulty scale was the sum of the number of correct responses to these ninety-one items.

Low Difficulty Subscale

Item analysis data for the items selected for the low difficulty subscale are presented in Table 4.2. The ninety-one items selected for this subscale range in p-value from .84 to .99, with positive (or zero) item-total discrimination indices. The mean p-value for this scale is .903.

TABLE 4.2
LOW DIFFICULTY ITEMS n=80

Item  p-value  Point-Biserial    Item  p-value  Point-Biserial
A3    .90      .20               C82   .91      .10
A6    .84      .30               C83   .91      .10
A33   .96      .36               D4    .85      .19
A34   .95      .07               D5    .95      .09
A44   .98      .31               D6    .89      .16
A46   .94      .10               D19   .85      .06
A49   .88      .34               D29   .88      .26
A53   .90      .49               D30   .89      .21
A55   .88      .18               D35   .93      .30
A58   .99      .32               D36   .86      .11
A59   .95      .23               D38   .90      .19
A60   .96      .08               D41   .94      .19
A65   .91      .16               D47   .94      .41
B3    .90      .08               D53   .85      .40
B15   .86      .30               D57   .85      .39
B16   .84      .40               D80   .93      .21
B18   .85      .44               D82   .94      .24
B22   .89      .18               E7    .88      .07
B24   .89      .16               E23   .89      .19
B28   .85      .23               E33   .85      .34
B29   .91      .34               E39   .84      .20
B38   .90      .30               E40   .85      .62
B39   .89      .30               E51   .86      .02
B40   .86      .30               E66   .93      .39
B44   .88      .43               E67   .85      .38
B47   .85      .23               E73   .94      .03
B48   .88      .23               E90   .90      .45
B49   .95      .13               F1    .89      .30
B58   .93      .30               F9    .96      .18
B59   .95      .33               F18   .85      .13
B63   .85      .17               F26   .84      .27
B64   .91      .33               F30   .90      .38
B66   .98      .10               F33   .89      .43
B67   .93      .33               F46   .89      .21
B68   .94      .33               F48   .94      .14
C4    .90      .05               F63   .94      .00
C21   .96      .38               F66   .94      .26
C23   .90      .29               F69   .88      .13
C27   .89      .52               F70   .98      .18
C28   .94      .34               F75   .91      .34
C42   .91      .42               F81   .86      .33
C44   .89      .30               F89   .93      .13
C48   .96      .31               F91   .84      .06
C51   .84      .31
C57   .98      .42
C63   .93      .18
C68   .94      .41
C69   .99      .27

Items for this subscale were selected by choosing the ninety-one least difficult items from a listing of items ranked by p-value from easiest to most difficult. There were ten tied ranks at p = .84; six items were chosen randomly to complete the quota of ninety-one items. A low difficulty scale score was computed for each subject by summing the correct responses to these ninety-one items.

High Clinical-Relevance Items

Table 4.3 presents the items selected for the high clinical-relevance subscale and their point-biserial correlations with the criterion of grand mean ratings on the Simulated Clinical Encounters. These items were selected by rank-ordering all items from highest to lowest item-criterion correlation and selecting the ninety-one items with the highest correlation with the criterion. Item-criterion correlations range from r = .33 to .68, with a median r = .38 and a mode of r = .33. The mean p-value for this scale is .721. There were fifteen tied ranks at r = .33; twelve of the fifteen items were randomly selected to complete this ninety-one item subscale. Correct responses to these items were summed to form a high clinical-relevance subscale score for each subject.
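The selection rule just described for the clinical-relevance subscales--rank all 364 items by item-criterion correlation, take the top ninety-one, and break ties at the cut value at random--can be written compactly. The function below is a sketch under those stated rules; the variable names are hypothetical.

    import numpy as np

    def select_top_items(item_criterion_r, n_items=91, seed=0):
        # Rank items by item-criterion correlation (descending), keep everything
        # strictly above the cut value, and fill the remaining slots by random
        # choice among the items tied at the cut, as described in the text.
        rng = np.random.default_rng(seed)
        order = np.argsort(item_criterion_r)[::-1]
        cut_r = item_criterion_r[order[n_items - 1]]
        above = np.flatnonzero(item_criterion_r > cut_r)
        tied = np.flatnonzero(item_criterion_r == cut_r)
        fill = rng.choice(tied, size=n_items - above.size, replace=False)
        return np.concatenate([above, fill])

    # A subject's subscale score is the number of correct responses to the
    # selected items:  scores = responses[:, selected].sum(axis=1)

Selecting the low clinical-relevance items is the same call with the signs of the correlations reversed.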
Low Clinical-Relevance Items

Table 4.4 presents the items selected for the low clinical-relevance subscale and their item-grand mean Simulated Clinical Encounter correlation coefficients. These items were chosen by selecting the ninety-one items that had the lowest item-criterion correlation. Correlations with the criterion ratings ranged from -.23 to .11; there were twenty-eight negatively correlating items. The median item-criterion correlation is r = .05, with a mode of r = .11. There were no tied ranks for this subscale. The mean p-value for this scale is .667. A total correct score on these ninety-one items was computed for each subject.

TABLE 4.3
HIGH CLINICAL-RELEVANCE ITEMS n=80

      Point-Biserial                 Point-Biserial
Item  (Item-Criterion)         Item  (Item-Criterion)
A1    .58                      D53   .35
A2    .68                      D60   .45
A4    .42                      D61   .38
A5    .36                      D64   .36
A6    .33                      D66   .46
A23   .37                      D71   .37
A35   .40                      D73   .37
A36   .37                      D83   .36
A53   .33                      D90   .34
A67   .39                      E2    .38
B4    .48                      E24   .38
B12   .34                      E29   .38
B16   .37                      E40   .66
B19   .44                      E46   .40
B29   .31                      E48   .36
B30   .33                      E54   .35
B33   .40                      E55   .33
B36   .42                      E64   .39
B42   .39                      E66   .37
B44   .36                      E67   .35
B51   .38                      E69   .33
B52   .56                      E70   .33
B57   .43                      E72   .50
B58   .33                      E88   .44
B64   .37                      F7    .53
B65   .40                      F10   .38
B67   .36                      F11   .43
B68   .37                      F12   .37
C14   .45                      F13   .38
C18   .33                      F19   .36
C20   .56                      F24   .34
C21   .35                      F34   .37
C27   .50                      F51   .49
C31   .33                      F52   .49
C42   .33                      F53   .45
C47   .34                      F67   .46
C51   .45                      F68   .42
C52   .42                      F76   .41
C55   .36                      F77   .51
C57   .34                      F80   .36
C68   .38                      F84   .40
D2    .33                      F85   .34
D27   .34                      F86   .34
D28   .45                      F87   .35
D47   .46                      F90   .33
D49   .40

TABLE 4.4
LOW CLINICAL-RELEVANCE ITEMS n=80

      Point-Biserial                 Point-Biserial
Item  (Item-Criterion)         Item  (Item-Criterion)
A8    .11                      D18   .12
A12   .11                      D19   -.01
A29   .08                      D30   .08
A30   .02                      D36   .10
A34   -.08                     D58   .07
A38   .10                      D67   .00
A39   .05                      D72   .06
A42   .06                      D76   .07
A46   .01                      D81   -.02
A51   .09                      D84   .00
A52   .08                      D86   -.02
B5    -.06                     D92   .02
B6    .09                      E7    -.05
B20   .09                      E14   .02
B21   .06                      E18   .04
B22   .05                      E20   .05
B23   -.01                     E21   .09
B25   -.03                     E26   .02
B26   .02                      E32   -.03
B37   -.12                     E35   .10
B46   .09                      E36   .05
B66   .11                      E42   -.03
C4    -.01                     E43   .05
C6    .05                      E47   .07
C7    .11                      E51   -.01
C11   .08                      E56   .03
C16   .10                      E58   .06
C22   -.05                     E71   .11
C29   .01                      E75   -.06
C32   .10                      E76   .05
C35   -.07                     F4    .02
C36   .09                      F22   .11
C38   .08                      F27   .11
C41   .01                      F32   .01
C43   -.04                     F46   .07
C53   -.23                     F49   .10
C54   -.10                     F50   -.10
C62   .04                      F57   .00
C65   .10                      F63   -.11
C66   -.05                     F69   .10
C70   .11                      F71   .06
C82   -.04                     F73   .03
C83   .07                      F89   .09
C84   .11
D6    -.04
D7    .05
D8    .07
D14   -.03

STATISTICAL ANALYSIS FOR GROUP DISCRIMINATION HYPOTHESES

The first three hypotheses of this study concern the differential power of subscales selected by two different criteria to statistically separate or discriminate criterion groups with known levels of training and experience in a medical specialty. Discriminant Analysis was used to analyze data for these hypotheses. A brief description of the technique of Discriminant Analysis follows.

Discriminant Analysis is a multivariate statistical technique that weights potential discriminating variables and linearly combines these variables such that the discrimination between two or more groups of subjects is maximized. The discriminant function has the form:

    D_i = d_i1 Z_1 + d_i2 Z_2 + ... + d_ip Z_p        (1)

Where:
    D_i  = score on discriminant function i
    d_ij = weighting coefficients
    Z_j  = standardized values of the p discriminating variables

The mathematics of Discriminant Analysis restrains the number of discriminant functions derived to a maximum of the number of groups minus one or to the number of discriminating variables in the analysis.
Several statistics are used to test the importance of variables to the maximum separation of known groups. For the stepwise Discriminant Analyses used to analyze data for the hypotheses of this study, the following statistics are important:

1. Eigenvalue: an index of the relative importance of the discriminant function derived. The sum of the eigenvalues is a measure of the total variance of the discriminating variables.

2. Relative Percent of Eigenvalue: the proportion of total variance of the discriminating variables accounted for by the function derived.

3. Canonical Correlation: the correlation of the discriminant function and the set of g-1 dummy variables which define the g groups discriminated. The square of the canonical correlation coefficient defines the percentage of variance in the discriminant function explained by the criterion groups.

4. Wilks' Lambda: an inverse measure of the discriminating ability of the variables in the analysis. When Lambda is small, the discrimination is high.

5. Standardized Discriminant Function Coefficient (d_i): the coefficient which, when multiplied by z-scores for each subject, maximizes the discrimination of groups. These coefficients are interpreted like beta weights in a regression equation and, analytically, like factor loadings in a factor analysis.

6. Classification Analysis: a classification of predicted group membership based on the discriminant function(s) derived. Predicted group membership is compared to actual group membership. The percentage of correct classification is an index of the ability of the discriminating variables in the analysis to validly discriminate groups.

RESULTS CONCERNING DIFFERENCES IN DISCRIMINATION: MEDIUM DIFFICULTY VERSUS HIGH CLINICAL-RELEVANCE

Hypothesis IA stated that the high clinical-relevance subscale scores would statistically discriminate the residency-eligible physicians, the residents, and the medical students better than the medium difficulty subscale scores. This hypothesis was analyzed by computing stepwise Discriminant Analyses on these data and test statistics associated with the Discriminant Analyses (Tatsuoka, 1971). The medium difficulty and the high clinical-relevance scale scores were entered into a stepwise Discriminant Analysis(4), using a scale selection criterion that minimizes Wilks' Lambda. The F-value for inclusion of a scale in the analysis was set at α = .01.

(4) An applications computer program, the Statistical Package for the Social Sciences (Nie et al., 1970), was used for all Discriminant Analyses.

Table 4.5 shows the raw-score means and standard deviations for each criterion-group in this study.

[Table 4.5: Criterion-group means and standard deviations for the medium difficulty and high clinical-relevance subscale scores, with associated univariate F-ratios and Wilks' Lambdas; the table itself is not legible in the source scan.]

Univariate F-ratios of the scale scores indicate at p ≤ .0001 that both the medium difficulty and the high clinical-relevance scores taken separately discriminate the three groups well. Wilks' Lambda shows fairly strong discriminating power for each scale. It should be noted that both the F-ratios and Wilks' Lambda show relatively stronger discriminating power for the high clinical-relevance scale than for the medium difficulty scale. Figures 4.1 and 4.2 graphically show the raw-score separation of the three criterion-groups for the medium difficulty scale and the high clinical-relevance scale, respectively.
Comparing the curves of the raw-scores for the medium difficulty scale and the high clinical-relevance scale shows the relative power of the high clinical-relevance scale in the discrimination of the criterion-groups. There is considerably less overlap in the curves for the three groups for the high clinical-relevance scores compared to the medium difficulty scores.

[Figure 4.1: Relative percent distributions of total correct scores on the medium difficulty subscale for the residency-eligible (n=22), resident (n=36), and student (n=22) groups.]

[Figure 4.2: Relative percent distributions of total correct scores on the high clinical-relevance subscale for the residency-eligible (n=22), resident (n=36), and student (n=22) groups.]

Table 4.6 presents a summary of the stepwise Discriminant Analysis performed for this hypothesis. This table shows that the high clinical-relevance scale was entered first; this first function alone yielded a small Wilks' Lambda, indicating the relatively higher group discriminating power of the high clinical-relevance scale compared to the medium difficulty scale. When the medium difficulty scale was added in the second step, Wilks' Lambda decreased only slightly. This result shows that addition of the medium difficulty scale increased the discriminating power by only a small but statistically significant amount, given the discrimination accounted for by the high clinical-relevance scale.

TABLE 4.6
SUMMARY OF STEPWISE DISCRIMINANT ANALYSIS: HIGH CLINICAL-RELEVANCE VS MEDIUM DIFFICULTY

Step Number   Scale Name                F to Enter   Wilks' Lambda   p-value
1             High Clinical-Relevance   125.21       .235            ≤ .0001
2             Medium Difficulty         4.38         .211            ≤ .0001

TABLE 4.7
STANDARDIZED DISCRIMINANT FUNCTION COEFFICIENTS: MEDIUM DIFFICULTY VS HIGH CLINICAL-RELEVANCE

                           Function 1   Function 2
Medium Difficulty          -0.357       -2.664
High Clinical-Relevance     2.374        2.386

In Table 4.7 the standardized Discriminant Function coefficients are presented. The first function weights the high clinical-relevance scale in a ratio of 6.65:1 relative to the medium difficulty scale. The second function, the medium difficulty function, weights the medium difficulty scale only 1.12 times greater than the high clinical-relevance scale.

Table 4.8 presents other data on the relative contribution of both scales to the discrimination of groups. When both discriminant functions are used, Wilks' Lambda is small and highly significant. The canonical correlation shows that both functions together can account for 77 percent of the known variance of group membership. A large proportion (97 percent) of the total eigenvalue is explained by both functions. When the first function, the high clinical-relevance function, is removed from the analysis, relatively small, but statistically significant, discriminating power is accounted for by the medium difficulty scale alone.

In Table 4.9, the accuracy of classifications made using the two discriminant functions of this analysis is given. A total of 81.3 percent of the subjects were accurately classified by these two discriminant functions. A chi-square statistic was calculated to test the hypothesis that the observed correct classifications were due to chance alone. The chi-square of 82.66 with 4 degrees of freedom is significant at p ≤ .0001. The null hypothesis of chance accuracy is, therefore, rejected in favor of the alternative that correct classifications were not due to chance. The high clinical-relevance scale taken separately classified 76.3 percent of subjects correctly, while the medium difficulty scale classified 71.2 percent correctly.
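The statistics reported in these tables can all be derived from the between-groups and within-groups scatter matrices of the subscale scores. The following sketch of those computations assumes a score matrix X (subjects by scales, with nonsingular within-groups scatter) and a vector of group labels; it reproduces the eigenvalues, Wilks' Lambda, canonical correlations, and relative percents of eigenvalue, but not SPSS's stepwise entry logic.

    import numpy as np

    def discriminant_summary(X, groups):
        # Accumulate the within-groups (W) and between-groups (B) scatter matrices.
        grand_mean = X.mean(axis=0)
        p = X.shape[1]
        W = np.zeros((p, p))
        B = np.zeros((p, p))
        for g in np.unique(groups):
            Xg = X[groups == g]
            centered = Xg - Xg.mean(axis=0)
            W += centered.T @ centered
            d = (Xg.mean(axis=0) - grand_mean)[:, None]
            B += len(Xg) * (d @ d.T)
        # The eigenvalues of W^-1 B define one discriminant function per root.
        eigvals = np.sort(np.linalg.eigvals(np.linalg.solve(W, B)).real)[::-1]
        wilks_lambda = np.prod(1.0 / (1.0 + eigvals))     # small Lambda = strong discrimination
        canonical_r = np.sqrt(eigvals / (1.0 + eigvals))  # correlation with group membership
        percent_of_total = 100 * eigvals / eigvals.sum()
        return eigvals, wilks_lambda, canonical_r, percent_of_total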
[Table 4.8: Relative power of the medium difficulty and high clinical-relevance discriminant functions (eigenvalues, relative percents of total eigenvalue, canonical correlations, and Wilks' Lambdas); not legible in the source scan.]

[Table 4.9: Classification analysis using the medium difficulty and high clinical-relevance discriminant functions; not legible in the source scan.]

[The pages reporting the hypothesis test for Hypothesis IA and the results concerning differences in discrimination for the medium difficulty versus the low clinical-relevance scales, including the associated figures, are not legible in the source scan, apart from Tables 4.12 and 4.13 below.]

TABLE 4.12
SUMMARY OF STEPWISE DISCRIMINANT ANALYSIS: MEDIUM DIFFICULTY VS LOW CLINICAL-RELEVANCE

Step Number   Scale Name               F to Enter   Wilks' Lambda   p-value
1             Medium Difficulty        65.94        .369            ≤ .0001
2             Low Clinical-Relevance   4.24         .332            < .0001

TABLE 4.13
STANDARDIZED DISCRIMINANT FUNCTION COEFFICIENTS: MEDIUM DIFFICULTY VS LOW CLINICAL-RELEVANCE

                          Function 1   Function 2
Medium Difficulty          1.879       -0.489
Low Clinical-Relevance    -0.403        1.203

[The pages reporting the relative power, classification analysis, and hypothesis test for the medium difficulty versus low clinical-relevance comparison, and the opening of the results concerning the high versus low clinical-relevance comparison, are not legible in the source scan, apart from Tables 4.21 and 4.22 below.]

TABLE 4.21
CLASSIFICATION ANALYSIS USING HIGH AND LOW CLINICAL-RELEVANCE DISCRIMINANT FUNCTIONS

Predicted(a)          N    Residency-Eligible   Residents    Students
Residency-Eligible   22    13 (59.1)             9 (40.9)     0 (0)
Residents            36     5 (13.9)            30 (83.3)     1 (2.8)
Students             22     0 (0)                1 (4.5)     21 (95.5)

(a) Numbers in parentheses indicate the percentage of classification for that group.

TABLE 4.22
SUMMARY OF STEPWISE DISCRIMINANT ANALYSIS WITH LOW CLINICAL-RELEVANCE ENTERED FIRST

Step Number   Scale Name                F to Enter   Wilks' Lambda   p-value
1             Low Clinical-Relevance    6.07         .864            ≤ .004
2             High Clinical-Relevance   118.36       .210            ≤ .0001

The summary of the stepwise analysis with the low clinical-relevance scale entered first is presented in Table 4.22. The addition of the high clinical-relevance scale reduces Lambda to .210, showing that the high clinical-relevance scale adds very significantly to the statistical discrimination of these groups. Comparing Wilks' Lambda when the high and the low clinical-relevance scales are entered first in separate analyses, a large difference in Lambda (.629) is noted. This difference in Lambda favors the high clinical-relevance scale and indicates the relative contribution of high clinical-relevance to the statistical separation of groups, when compared to the low clinical-relevance scale.

Hypothesis Test

An F-test statistic was formed by the ratio of the univariate F's for each scale in this hypothesis, such that:

    F_calculated = F_High Clinical-Relevance / F_Low Clinical-Relevance        (4)

For the hypothesis of no difference in criterion-group discrimination between the high clinical-relevance and the low clinical-relevance scales:

    F_calculated = 125.21 / 6.07 = 20.63

For 2 and 77 degrees of freedom, the critical value for rejection of H0 at α = .001 is (conservatively) 7.76.
RESULTS CONCERNING CRITERION-RELATED VALIDITY

Alternative Hypothesis II for this study stated that the criterion-related validity (the correlation of subscale scores with grand mean ratings on the independent Simulated Clinical Encounters) of the high clinical-relevance scale would be higher than the validity coefficient for the medium difficulty scale. The method of item selection for the high clinical-relevance scale forced this subscale to have a high criterion-related validity coefficient. However, the validity of the medium difficulty scale, which was selected by a different criterion, could be equal to or lower than the validity of the high clinical-relevance scale. Table 4.23 presents the criterion-related validity coefficients of all four scales in this study.

Hypothesis II was tested by a Z-test of the difference of two non-independent correlation coefficients, as presented by Glass and Stanley (1970). The Z-test statistic calculated to test the hypothesis of no difference against the hypothesis that the high clinical-relevance scale has a larger validity coefficient than the medium difficulty scale is:

    Z_calculated = 3.60

For a one-sided (upper) hypothesis test, the critical value for rejection of the null hypothesis is 2.33 at α = .01. Since the decision rule is to reject H0 in favor of H1 if Z exceeds the critical value, the null hypothesis is rejected in favor of H1. It may be concluded that the rxy = .895 for the high clinical-relevance scale is greater than the rxy = .797 for the medium difficulty scale.

TABLE 4.23
CRITERION-RELATED VALIDITY COEFFICIENTS: SUBSCALE SCORE CORRELATION WITH MEAN SIMULATION RATINGS
n=80

Subscale                  Grand Mean Simulated Clinical Encounters
High Clinical-Relevance   .895
Medium Difficulty         .797
Low Difficulty            .774
Low Clinical-Relevance    .214

Although this hypothesis concerned differences in validity coefficients between the high clinical-relevance and the medium difficulty scales, it is interesting to note the observed differences in validity for the other scales, as presented in Table 4.23. The low clinical-relevance scale has the lowest validity coefficient of the four scales; this result was anticipated, since items for this scale were chosen for their lowest correlation with the criterion. The low difficulty scale's validity coefficient (.774) is only slightly lower than the rxy = .797 for the medium difficulty scale; this result is somewhat surprising, since it was anticipated that the low test-score variance of this scale would attenuate the scale's correlation with the criterion.
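The Glass and Stanley (1970) Z statistic is not reproduced here verbatim. A closely related and widely used statistic for two correlations that share a third variable is Hotelling's (1940) t, sketched below with the values from Tables 4.23 and 4.32; because the formulas differ, it yields roughly 4.9 rather than the reported Z = 3.60, but it supports the same decision.

```python
import math

def hotelling_t(r13, r23, r12, n):
    """Hotelling's (1940) t for two dependent correlations r13 and r23
    (two scales, each correlated with the same criterion), where r12 is
    the correlation between the two scales; df = n - 3."""
    det = 1.0 - r12**2 - r13**2 - r23**2 + 2.0 * r12 * r13 * r23
    return (r13 - r23) * math.sqrt((n - 3) * (1.0 + r12) / (2.0 * det))

# Validities .895 and .797 (Table 4.23), inter-scale correlation .922
# (Table 4.32), n = 80 subjects.
t = hotelling_t(r13=0.895, r23=0.797, r12=0.922, n=80)
print(f"t({80 - 3}) = {t:.2f}")  # about 4.9; same conclusion as the reported Z = 3.60
```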
RESULTS CONCERNING INTERNAL-CONSISTENCY RELIABILITY

Hypothesis III of this study stated that the internal-consistency reliability coefficient of the medium difficulty subscale would be higher than the reliability coefficient of the high clinical-relevance scale. Table 4.24 presents the reliability coefficients computed for each of the four subscales in this study.

Much research in educational and psychological measurement suggests that internal-consistency reliability will be maximized by selecting test items whose difficulties cluster as closely as possible to p = q = .5. The reason for this phenomenon is that when p = .5, item variance is maximized (s_i² = pq = .25) and test variance is therefore maximized, which tends to produce high internal-consistency reliability.

The hypothesis of no difference in scale reliability between the medium difficulty and the high clinical-relevance scales was tested by a ratio of the two F-values associated with each reliability coefficient (Wilson, 1978). The test statistic is given by:

    F_calculated = F(HCR) / F(MD)    (5)

where:

    F(HCR) = MS persons / MS total, for the high clinical-relevance (HCR) scale
    F(MD)  = MS persons / MS total, for the medium difficulty (MD) scale

The logic underlying this test statistic is derived from the formula for Alpha, or Kuder-Richardson 20, reliability:

    α = [K / (K - 1)] [1 - (Σ s_i²) / s_x²]    (6)

where:

    K    = number of test items
    s_i² = item variance
    s_x² = test variance

Formula 6 can be computed from an Analysis of Variance of items and subjects (Hoyt, 1941), such that:

    α = (MS_p - MS_r) / MS_p    (7)

where:

    MS_p = Mean Square for Persons
    MS_r = Mean Square Residual

A test statistic for the difference between two reliability coefficients is, therefore, given by:

    F_calculated = [MS_p / MS_t (HCR)] / [MS_p / MS_t (MD)]    (8)

where:

    MS_t = Mean Square Total

TABLE 4.24
INTERNAL-CONSISTENCY RELIABILITY OF SUBSCALES
n=80

Subscale                  Kuder-Richardson 20
High Clinical-Relevance   .954
Medium Difficulty         .878
Low Difficulty            .869
Low Clinical-Relevance    .581

TABLE 4.25
MEAN SQUARE VALUES FOR HIGH CLINICAL-RELEVANCE AND MEDIUM DIFFICULTY SCALES

Subscale                  Mean Square Persons   Mean Square Total
High Clinical-Relevance   3.160                 .201
Medium Difficulty         1.822                 .239

Table 4.25 gives the Mean Square pieces for the calculation of this F-test statistic. For this hypothesis, then:

    F_calculated = (3.160/.201) / (1.822/.239) = 15.721 / 7.623 = 2.062

Alternative Hypothesis III stated that the medium difficulty scale would be more reliable than the high clinical-relevance scale. Since the opposite direction was observed in the data, a two-sided hypothesis test is appropriate. Accordingly, for a two-sided test at α = .05 with 79 and 79 degrees of freedom, the critical value (conservative) is 1.53. Since F-calculated is larger than the critical value, the null hypothesis is rejected in favor of an alternative that states that there is a difference in reliability between the medium difficulty and the high clinical-relevance scales. The high clinical-relevance scale is statistically more reliable than the medium difficulty scale.
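The equivalence of the KR-20 formula (6) and Hoyt's ANOVA formula (7) can be checked numerically. The sketch below computes both from a simulated subjects-by-items matrix of right/wrong scores; the simulation model and all values are illustrative assumptions, not the study's data.

```python
import numpy as np

def kr20(items):
    """Formula (6): alpha = [K/(K-1)] * [1 - (sum of item variances)/(test variance)].
    For 0/1 items the (biased) item variance is p*q; ddof=1 is used throughout
    so that the result matches Hoyt's ANOVA formulation exactly."""
    n, k = items.shape
    item_var = items.var(axis=0, ddof=1).sum()
    test_var = items.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_var / test_var)

def hoyt(items):
    """Formula (7): alpha = (MS_persons - MS_residual) / MS_persons,
    from a persons-by-items ANOVA (Hoyt, 1941)."""
    n, k = items.shape
    grand = items.mean()
    ss_total = ((items - grand) ** 2).sum()
    ss_persons = k * ((items.mean(axis=1) - grand) ** 2).sum()
    ss_items = n * ((items.mean(axis=0) - grand) ** 2).sum()
    ms_persons = ss_persons / (n - 1)
    ms_residual = (ss_total - ss_persons - ss_items) / ((n - 1) * (k - 1))
    return (ms_persons - ms_residual) / ms_persons

# Illustrative data: 80 subjects, 91 items, simple latent-ability model.
rng = np.random.default_rng(7)
ability = rng.normal(0, 1, (80, 1))
difficulty = rng.normal(0, 1, (1, 91))
responses = (rng.random((80, 91)) < 1 / (1 + np.exp(difficulty - ability))).astype(float)

print(f"KR-20 = {kr20(responses):.3f}   Hoyt alpha = {hoyt(responses):.3f}")  # identical
```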
RESULTS CONCERNING MEAN ITEM DIFFICULTIES

Hypothesis IV stated that both the medium difficulty subscale and the low clinical-relevance subscale would be more difficult than the high clinical-relevance subscale. The logic underlying this research hypothesis is that information that is relevant to the everyday practice of clinical medicine is used frequently and, therefore, remembered better than less frequently used knowledge.

Since mean item difficulty is a function only of the test mean and the number of items, the subscale means can be used to test an hypothesis about differences in mean item difficulty. Table 4.26 presents means, standard deviations, and mean p-values (mean proportion correct) for the four subscales of this study. Inspection of this table shows that the three subscale means are ranked in the order predicted by this hypothesis: the medium difficulty scale is most difficult, followed by the low clinical-relevance and the high clinical-relevance scales. The low difficulty scale is not considered in this hypothesis, since the item selection criterion used for this subscale forces it to be the least difficult. The medium difficulty scale is considered here because it acts as a difficulty reference point for the two scales selected by a different, independent criterion.

TABLE 4.26
SUBSCALE MEAN ITEM DIFFICULTY
n=80

Subscale                  Mean    Standard Deviation   Mean p-value
Low Difficulty            82.14   7.51                 .903
High Clinical-Relevance   65.63   16.96                .721
Low Clinical-Relevance    60.69   6.32                 .667
Medium Difficulty         54.86   12.88                .603

A repeated measures ANOVA of the three subscales of Hypothesis IV reveals, in Table 4.27, a significant F-ratio. Tukey post-hoc analyses of the two contrasts of interest here show that the medium difficulty mean is significantly lower than the high clinical-relevance mean. Also, the low clinical-relevance mean is significantly lower than the high clinical-relevance mean. These analyses support accepting Alternative Hypothesis IV: the medium difficulty and the low clinical-relevance scales are each statistically significantly more difficult than the high clinical-relevance scale.

TABLE 4.27
REPEATED MEASURES ANOVA OF MEDIUM DIFFICULTY, HIGH AND LOW CLINICAL-RELEVANCE SUBSCALES
n=80

Source of Variation        Sum of Squares   Degrees of Freedom   Mean Square   F
Between People             28624.73         79                   362.34
Within People--
  Between Measures         4632.71          2                    2316.35       35.37*
  Residual                 10345.96         158                  65.48
TOTAL                      43603.40         239                  182.44

*p ≤ .0001

TUKEY POST-HOC ANALYSIS

Contrast                       Difference of Means   SE q(3,77)   Confidence Interval   Significance of Contrast
ψ1 = Mean(HCR) - Mean(MD)      10.75                 3.08         7.67 ≤ ψ1 ≤ 13.83     p ≤ .05
ψ2 = Mean(HCR) - Mean(LCR)     4.94                  3.08         1.86 ≤ ψ2 ≤ 8.02      p ≤ .05

RESULTS CONCERNING OVERLAPPING ITEMS IN SUBSCALES

Hypothesis V for this study stated that the proportion of overlap of identical items between the medium difficulty and the high clinical-relevance subscales would be lower than the proportion of overlap of identical items between the medium difficulty and the low clinical-relevance subscales. The logic of this research hypothesis is related to the logic of Hypothesis IV: if medium difficulty items tend to be low in relevance to clinical medicine because their information and knowledge are used less frequently, then a smaller overlap of identical items is expected between the high clinical-relevance and medium difficulty scales than between the low clinical-relevance and the medium difficulty scales.

Table 4.28 presents the number and proportion of identical items found between the scales noted in this hypothesis. The proportion of overlap of identical items is slightly higher between the medium difficulty/high clinical-relevance scales than between the medium difficulty/low clinical-relevance scales.

TABLE 4.28
OVERLAP OF IDENTICAL ITEMS

Subscales                                       Identical Items   Proportion Overlap
Medium Difficulty and High Clinical-Relevance   24                .264
Medium Difficulty and Low Clinical-Relevance    19                .209

This hypothesis was tested by drawing a 95 percent confidence interval around the difference of the two proportions (Bacon, 1976), of the form:

    (p1 - p2) ± Z(α/2) √(p1q1/n1 + p2q2/n2)    (9)

where:

    p  = proportion of overlap
    p̄  = (p1n1 + p2n2) / (n1 + n2)
    n  = number of items for each proportion
    q  = 1 - p

Testing this hypothesis, then:

    95% CI = .055 ± 1.96 √.004 = .055 ± .123

so that:

    -.068 ≤ (p1 - p2) ≤ .178

Since this confidence interval includes zero, the hypothesis of no difference in proportions cannot be rejected. There is no statistically significant difference between the proportions of overlap in identical items between the medium difficulty/high clinical-relevance scales and the medium difficulty/low clinical-relevance scales.
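Formula (9) and the interval around it can be verified directly from Table 4.28. A minimal sketch using the pooled-proportion form, which reproduces the .055 ± .123 interval:

```python
import math

def overlap_diff_ci(x1, x2, n1=91, n2=91, z=1.96):
    """95% confidence interval for p1 - p2, pooling the two proportions."""
    p1, p2 = x1 / n1, x2 / n2
    p_bar = (x1 + x2) / (n1 + n2)                      # pooled proportion
    var = p_bar * (1 - p_bar) * (1 / n1 + 1 / n2)      # about .004 here
    half_width = z * math.sqrt(var)
    return (p1 - p2) - half_width, (p1 - p2) + half_width

# 24 and 19 identical items out of 91 (Table 4.28).
lo, hi = overlap_diff_ci(24, 19)
print(f"computed interval: {lo:.3f} to {hi:.3f}")  # -.068 to .178, spanning zero
```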
RESULTS CONCERNING THE DISTRIBUTION OF ITEM TYPES IN SUBSCALES

Hypothesis VI for this study stated that there would be differences in the distributions of pictorial-stem, clinical-situational, and factual multiple-choice items selected for each of the four subscales studied here. It was expected that the high clinical-relevance scale would have a larger proportion of pictorial-stem and clinical-situational items than the medium difficulty or the low clinical-relevance scales.

Data analysis for this hypothesis required a classification and a count of the numbers of pictorial-stem, clinical-situational, and factual multiple-choice items for each of the four scales. The pictorial-stem items were easily classified; the only criterion used for this classification was the presence or absence of a visual stimulus with the item. If the item had a visual with it, it was classified as pictorial.

For the clinical-situational items, all multiple-choice items were inspected by two raters (the author and an educational specialist). All items that met the following criteria were classified as clinical-situational:

1. The item stem contained clinical data about a patient's presenting signs or symptoms and/or other clinical data from a physical examination, laboratory studies, or any other information relative to a patient's presenting complaint.

2. The stem of the item ended with a question (or statement) asking for a diagnosis, a management strategy, or the next appropriate action to take for the patient(s).

The two raters disagreed on the classification of seven items; each of these disagreements was resolved by a third rater. Figure 4.5 gives the distribution of item types by subscale.

A chi-square test of independence was performed on the data given in Figure 4.5 to test the hypothesis of no difference in frequencies of item types by subscale. The χ² test statistic computed for this hypothesis is given by:

    χ² = Σ (O - E)² / E    (10)

where:

    O = observed frequency
    E = expected frequency

For these data, χ² = 11.22 is less than the critical value of 12.59 for six degrees of freedom at α = .05. Therefore, the hypothesis of no difference in item-type distributions cannot be rejected. There is no statistically significant difference in the distributions of item types across the four subscales. However, the trend of the distribution of item types is that which was predicted by the research hypothesis.

OBSERVED DISTRIBUTIONS OF ITEM TYPES BY SUBSCALE(a)

Subscale                  Pictorial   Clinical-Situational   Factual Multiple-Choice
Medium Difficulty         20(22)      17(19)                 54(59)
Low Difficulty            35(39)      22(24)                 34(37)
High Clinical-Relevance   28(31)      21(23)                 42(46)
Low Clinical-Relevance    22(24)      19(21)                 50(55)

(a) Numbers in parentheses indicate percentage of item types in subscale.

FIGURE 4.5
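Because Figure 4.5 gives the full contingency table, this chi-square test is easy to verify; the following sketch reproduces the reported value up to rounding.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Observed item-type counts from Figure 4.5
# (rows: subscales; columns: pictorial, clinical-situational, factual).
observed = np.array([
    [20, 17, 54],   # Medium Difficulty
    [35, 22, 34],   # Low Difficulty
    [28, 21, 42],   # High Clinical-Relevance
    [22, 19, 50],   # Low Clinical-Relevance
])

chi2, p, dof, expected = chi2_contingency(observed)
print(f"chi-square = {chi2:.2f}, df = {dof}, p = {p:.3f}")
# About 11.2 with 6 df, below the .05 critical value of 12.59, as reported.
```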
Additional Results Concerning Differences in Proportions of Item Types

Figure 4.6 presents the combined percentage of pictorial-stem and clinical-situational items and the percentage of factual multiple-choice items for each of the four subscales under investigation here. It is interesting to note that the high clinical-relevance scale has a slightly higher, but not statistically significant at α = .05, proportion of pictorial and clinical-situational items than the medium difficulty scale. The low difficulty scale has the highest proportion of pictorial and clinical-situational items. The difference in percentage between the low difficulty and the high clinical-relevance scales is also not statistically significant at α = .05, using a confidence interval procedure (Bacon, 1976) for differences in proportions.

The proportion of clinical-situational items alone (Figure 4.5) is slightly higher in the high clinical-relevance scale (.23) than in the medium difficulty scale (.19). This difference in proportions is, however, not statistically significant at α = .05.

Another analysis compared differences in proportions of combined pictorial-stem and clinical-situational items to factual multiple-choice items within each subscale. The results of these analyses are presented in the right-hand columns of Figure 4.6. These confidence intervals indicate that the low difficulty scale has significantly more pictorial-stem and clinical-situational items than factual multiple-choice items, and that the medium difficulty scale has significantly more factual multiple-choice items than pictorial-stem and clinical-situational items.

[Figure 4.6: Confidence intervals around differences in proportions of item types: combined pictorial-stem and clinical-situational items versus factual multiple-choice items, with 95 percent confidence intervals for the differences, by subscale.]

SUMMARY OF RESULTS FOR TESTS OF HYPOTHESES

Statistical analyses performed to test the hypotheses for this study may be summarized as follows:

1. Residency-eligible physicians, residents, and medical students are statistically discriminated by:

   a. The high clinical-relevance scale and the medium difficulty scale. The high clinical-relevance scale is more discriminating than the medium difficulty scale, but this difference in discrimination is not statistically significant at α = .05. A total of 81.3 percent of subjects were correctly classified using both scale Discriminant Functions.

   b. The medium difficulty and the low clinical-relevance scales. The medium difficulty scale is statistically significantly (at α = .05) more powerful than the low clinical-relevance scale in discriminating groups. The low clinical-relevance scale does not discriminate residents from students. The total of correct classifications for these two scales was 76.3 percent.

   c. The high clinical-relevance and the low clinical-relevance scales. The high clinical-relevance scale is statistically significantly more powerful (at α = .05) in discrimination than the low clinical-relevance scale. Eighty percent of subjects were correctly classified by these two scales.

2. Criterion-related validity: The correlation of the medium difficulty scores with the grand mean of Simulated Clinical Encounters (rxy = .797) is statistically significantly lower than the criterion-related validity coefficient for the high clinical-relevance scale (rxy = .895).

3. Internal-consistency reliability: The high clinical-relevance scale (rxx = .954) is statistically significantly more reliable than the medium difficulty scale (rxx = .878).

4. Mean item difficulty: The medium difficulty scale is statistically significantly more difficult than the high clinical-relevance scale.
The low clinical-relevance scale is also significantly more difficult than the high clinical-relevance scale.

5. Overlap of identical items in scales: There is no statistically significant difference in the proportions of overlap of identical items between the medium difficulty and high clinical-relevance scales and the medium difficulty and low clinical-relevance scales.

6. Distribution of item types in scales: There is no statistically significant difference in the distributions of pictorial-stem, clinical-situational, or factual multiple-choice items across the four subscales of this study.

RESULTS OF ADDITIONAL ANALYSES

The results of the data analysis performed to test the hypotheses for this study suggest several additional analyses. This section presents the findings of various additional analyses carried out to further explore these data. The section is divided into three parts: 1) results concerning the criterion-group discrimination of all four subscales taken together; 2) results concerning the criterion-group discrimination of subscales using the practice-eligibles, residents, and students as the criterion groups; and 3) a validation of the findings about subscale criterion-related validity, reliability, and mean item difficulty with the n=14 practice-eligible physician group.

Results Concerning the Criterion-Group Discrimination of Four Subscales

Several stepwise Discriminant Analyses were performed using the medium and low difficulty and the high and low clinical-relevance subscales as the discriminating variables, with the residency-eligible, resident, and student criterion groups. Table 4.29 compares the raw-score discriminating power of all four subscales. These data clearly show the rank-ordering of subscales according to their ability to discriminate known groups. This finding and the standardized Discriminant Function coefficients displayed in Table 4.30 show the power of the high clinical-relevance scale (relative to the other three subscales) to discriminate groups with known levels of training and experience in Emergency Medicine. This finding is consistent with the results presented for Hypotheses IA to IC.

[Table 4.29: Relative raw-score discriminating power of the four subscales: univariate F-ratios, Wilks' Lambda, and group means and standard deviations for residency-eligibles, residents, and students.]

TABLE 4.30
STANDARDIZED DISCRIMINANT FUNCTION COEFFICIENTS: FOUR SUBSCALES

                          Function 1   Function 2
Medium Difficulty          0.209        1.689
Low Difficulty             0.305       -0.687
High Clinical-Relevance   -2.664       -1.118
Low Clinical-Relevance     0.252        0.549

TABLE 4.31
CLASSIFICATION ANALYSIS USING TWO DISCRIMINANT FUNCTIONS DERIVED FROM FOUR SUBSCALES

Predicted(a)
                      N    Residency-Eligible   Residents   Students
Residency-Eligible    22   14(63.6)             8(36.4)     0(0)
Residents             36   4(11.1)              31(86.1)    1(2.8)
Students              22   0(0)                 0(0)        22(100)

(a) Numbers in parentheses indicate percentage of group classified.

Table 4.31 gives the classification analysis results, using the two Discriminant Functions derived for all four subscales. A total of 83.7 percent of the cases were correctly classified. A chi-square test was calculated for these data. With four degrees of freedom, χ² = 91.51 is statistically significant at p ≤ .0001.

Subscale correlations are presented in Table 4.32. The highest correlation is observed to be between the medium difficulty and the high clinical-relevance scales, while the lowest correlation is between the high clinical-relevance and the low clinical-relevance scale scores.
Moderately high correlations are observed between medium difficulty and low difficulty and between high clinical-relevance and low difficulty.

TABLE 4.32
SUBSCALE ZERO-ORDER CORRELATIONS
n=80

                          Medium       Low          High Clinical-
                          Difficulty   Difficulty   Relevance
Medium Difficulty         1.000
Low Difficulty             .772        1.000
High Clinical-Relevance    .922         .879        1.000
Low Clinical-Relevance     .571         .465         .458

In order to further investigate the inter-relationships of these subscales, first and second-order partial correlations were computed. The only zero-order correlation that is seriously decreased by controlling for other scale correlations is the high clinical-relevance/low clinical-relevance correlation. When the correlation with the medium difficulty scale is controlled in a first-order partial correlation, r decreases from .458 to -.214. There is also a large decrease of the zero-order correlation of high and low clinical-relevance when the correlation with the low difficulty scale is controlled. When the correlations with both the medium and the low difficulty scales are controlled in the second-order partial correlation, r drops from .458 to -.336. This finding suggests that the observed moderate correlation between the high and low clinical-relevance scales is spurious and due to the correlation of each of these scales with the medium difficulty and the low difficulty scales.
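These partial correlations can be checked from the zero-order values of Table 4.32 alone, using the standard recursion for partial correlations. A short sketch:

```python
def partial_r(rxy, rxz, ryz):
    """First-order partial correlation of x and y, controlling for z."""
    return (rxy - rxz * ryz) / ((1 - rxz**2) ** 0.5 * (1 - ryz**2) ** 0.5)

# Zero-order correlations from Table 4.32 (HCR = high clinical-relevance,
# LCR = low clinical-relevance, MD = medium difficulty, LD = low difficulty).
r_hcr_lcr, r_hcr_md, r_lcr_md = .458, .922, .571
r_hcr_ld, r_lcr_ld, r_md_ld = .879, .465, .772

# First order: HCR-LCR controlling MD.
first = partial_r(r_hcr_lcr, r_hcr_md, r_lcr_md)

# Second order: HCR-LCR controlling MD and LD, by recursion on first-order partials.
hcr_ld_given_md = partial_r(r_hcr_ld, r_hcr_md, r_md_ld)
lcr_ld_given_md = partial_r(r_lcr_ld, r_lcr_md, r_md_ld)
second = partial_r(first, hcr_ld_given_md, lcr_ld_given_md)

print(f"first-order r = {first:.3f}   second-order r = {second:.3f}")
# About -.215 and -.337, matching the reported -.214 and -.336 up to rounding.
```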
Results Concerning the Criterion-Group Discrimination of Subscales Using Different Criterion Groups

Hypotheses IA, IB, and IC were reanalyzed using the same subscales as the discriminating variables in the analyses, but substituting the practice-eligible group for the residency-eligible group as the independent variable. These analyses were carried out to attempt a partial validation of the previous results of this study with a group of subjects who were not considered in the item analysis criteria used to select subscale items. Table 4.33 summarizes the results of these analyses. When practice-eligible physicians are substituted for residency-eligible physicians in Discriminant Analyses:

1. The four subscales are rank-ordered in discriminating power in exactly the same manner as in the earlier analyses using residency-eligibles (Table 4.29). That is, high clinical-relevance is the best discriminator, followed by medium difficulty, low difficulty, and low clinical-relevance.

2. Most of this discrimination occurs between the resident and student groups; none of these scales statistically separates the practice-eligible group from the resident group. In fact, the practice-eligible group has a slightly lower mean on the high clinical-relevance and the low difficulty scales than the resident group.

[Table 4.33: Summary of Discriminant Analyses substituting practice-eligibles for residency-eligibles: Wilks' Lambda and group means and standard deviations for practice-eligibles, residents, and students on the four subscales.]

Results Concerning Criterion-Related Validity: Practice-Eligible Group

Correlations between the subscales of this study and the grand mean of the Simulated Clinical Encounters were computed for the practice-eligible group (n=14) alone. Table 4.34 presents these criterion-related validity coefficients. The rank-ordering of these validity coefficients is different from the rank-ordering of the validities for the n=80 sample of residency-eligibles, residents, and students (Table 4.23). For the practice-eligible group, the low clinical-relevance scale has the highest validity coefficient (rxy = .49), while the high clinical-relevance scale had the highest validity for the n=80 sample (rxy = .90). The high clinical-relevance scale's validity ranked second for practice-eligible physicians, while medium difficulty ranked second for the larger group. Other reversals are also noted in Table 4.34. These inconsistencies are likely the result of the large standard errors around r when it is computed for a small sample of subjects.

TABLE 4.34
COMPARISONS OF CRITERION-RELATED VALIDITY COEFFICIENTS

Grand Mean Simulated Clinical Encounters

Subscale                  Practice-Eligible (n=14)   Others (n=80)
Low Clinical-Relevance    .487                       .214
High Clinical-Relevance   .461                       .895
Low Difficulty            .441                       .774
Medium Difficulty         .244                       .797

Results Concerning Internal-Consistency Reliability of Subscales: Practice-Eligible Group

Table 4.35 presents a comparison of the Kuder-Richardson 20 reliabilities computed for the practice-eligible group alone and for the residency-eligibles, residents, and students (n=80) for all four subscales in this study.

TABLE 4.35
COMPARISON OF INTERNAL-CONSISTENCY RELIABILITY OF SCALES: PRACTICE-ELIGIBLE AND PREVIOUS SAMPLE

Subscale                  Practice-Eligible (n=14)   Others (n=80)
High Clinical-Relevance   .868                       .954
Low Difficulty            .796                       .869
Medium Difficulty         .644                       .878
Low Clinical-Relevance    .117                       .581

Only two reversals of reliability ranks are noted in Table 4.35. That is, the reliability of the low difficulty scale is second in rank for the practice-eligible group and third for the previous (n=80) group, while medium difficulty is third in rank for the small group and second in rank for the larger group.

The F-test of the difference between the high clinical-relevance and medium difficulty reliability coefficients yielded:

    F_calculated = (1.118/.179) / (.601/.230) = 2.39

At 13 and 13 degrees of freedom, the (conservative) critical value at α = .05 is 2.69. Since F-calculated is less than the critical value, the hypothesis of no difference between the reliability of the high clinical-relevance and medium difficulty scales for this group cannot be rejected.

Results Concerning Mean Item Difficulties: Practice-Eligible Group

Table 4.36 presents a comparison of the means, standard deviations, and mean p-values for the practice-eligible group and the group of residency-eligibles, residents, and students (n=80). This table shows that the subscale means are ranked in exactly the same order for the practice-eligible group as for the larger group of eighty subjects.

[Table 4.36: Comparison of subscale means, standard deviations, and mean p-values for the practice-eligible group (n=14) and the previous sample (n=80), for the low difficulty, high clinical-relevance, low clinical-relevance, and medium difficulty scales.]

In Table 4.37, the results of a repeated measures Analysis of Variance for the medium difficulty and the high and low clinical-relevance scales for the practice-eligible group (n=14) are given. There are statistical differences in these three means, revealed by the significant F-ratio. Tukey post-hoc analysis of the differences between the high clinical-relevance and medium difficulty means and between the high clinical-relevance and low clinical-relevance means shows that these contrasts are significantly different at the α = .05 level. Thus, for the practice-eligible group, the high clinical-relevance scale is significantly easier than the medium difficulty and the low clinical-relevance scales. This finding is the same as the earlier result with the residency-eligible, resident, and student groups.

[Table 4.37: Repeated measures ANOVA and Tukey post-hoc analysis of the medium difficulty, high and low clinical-relevance subscales for the practice-eligible group (n=14).]
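A repeated measures ANOVA of this one-way, subjects-by-measures form is easy to compute directly. The sketch below uses simulated stand-ins for the fourteen practice-eligible subjects' three subscale scores, so the printed F is illustrative only; the Tukey contrasts reported in Tables 4.27 and 4.37 would then be formed from the residual mean square of this same ANOVA.

```python
import numpy as np

def repeated_measures_anova(data):
    """One-way repeated measures ANOVA for a subjects-by-conditions matrix.
    Returns the F ratio for the conditions (measures) effect."""
    n, k = data.shape
    grand = data.mean()
    ss_between_people = k * ((data.mean(axis=1) - grand) ** 2).sum()
    ss_measures = n * ((data.mean(axis=0) - grand) ** 2).sum()
    ss_total = ((data - grand) ** 2).sum()
    ss_residual = ss_total - ss_between_people - ss_measures
    ms_measures = ss_measures / (k - 1)
    ms_residual = ss_residual / ((n - 1) * (k - 1))
    return ms_measures / ms_residual

# Illustrative only: 14 subjects' percent-correct scores on three subscales
# (medium difficulty, high clinical-relevance, low clinical-relevance).
rng = np.random.default_rng(3)
subject_effect = rng.normal(0, 10, (14, 1))
scale_means = np.array([55.0, 66.0, 61.0])
scores = scale_means + subject_effect + rng.normal(0, 5, (14, 3))

print(f"F = {repeated_measures_anova(scores):.2f}")
```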
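Finally, the attribution of the rank-order reversals in Table 4.34 to sampling error can be made concrete: a Fisher z confidence interval for a correlation estimated on fourteen subjects is very wide. A sketch (the interval shown is approximate):

```python
import math

def r_confidence_interval(r, n, z=1.96):
    """Approximate 95% CI for a correlation via the Fisher z transformation."""
    zr = math.atanh(r)
    se = 1 / math.sqrt(n - 3)
    return math.tanh(zr - z * se), math.tanh(zr + z * se)

# Validity of the low clinical-relevance scale for practice-eligibles (Table 4.34).
lo, hi = r_confidence_interval(0.487, 14)
print(f"r = .487, n = 14, 95% CI: {lo:.2f} to {hi:.2f}")  # roughly -.06 to .81
```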