CALIBRATION OF MEDICAL PROBABILITIES AT DIFFERENT LEVELS OF EXPERIENCE

by

Rita Yuk-King Huang

A DISSERTATION

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of

DOCTOR OF PHILOSOPHY

Department of Administration and Curriculum in Higher Education

1982

ABSTRACT

It is unclear whether medical training makes a difference in the accuracy and calibration of subjective probability, as it pertains to the incidence and prevalence of disease conditions. In an effort to contribute to knowledge in the area, the intent of this study was two-pronged: (1) to examine whether medical training leads to more accurate estimation of the incidence of acute and prevalence of chronic disease conditions; and (2) to determine whether medical training leads to better calibration.

Forty fourth-year medical students and forty non-medical students enrolled during the 1982 fall term at Michigan State University responded to a questionnaire by indicating their estimations of the relative incidence or prevalence of a series of paired acute and chronic diseases. They were also asked about their confidence in their estimations.

The results of the study showed that: (1) Medical students were significantly more accurate in their responses than non-medical students. This difference persisted even when the results were adjusted for age. (2) Medical students were also more confident of their responses than non-medical students. Age was not significantly related to confidence. (3) Medical students were significantly better calibrated than non-medical students. When these results were adjusted for age, however, the differences became non-significant. Holding the accuracy scores constant, differences between the two groups in calibration persisted.
To my dearest husband, Raywin, our son Ritchie and our daughter, Rachelle, I dedicate this work.

ACKNOWLEDGEMENTS

I would like to express my gratitude and appreciation to many individuals who made the completion of this dissertation possible.

Special thanks to Dr. Arthur Elstein, who has unselfishly given me his time and guidance throughout the process of this dissertation. He planted the idea for this study, and his valuable critiques and suggestions on the proposal and the final research were extremely helpful. He unselfishly devoted much of his time to this study when I was working under severe time constraints. His humility, kindness and unselfishness serve as a model for me to follow.

I also wish to extend my thanks to Dr. R. Featherstone, who allowed me to expand my research interests independently. Dr. H. Teitelbaum provided valuable critiques on the methodology of the study and Dr. N. Bell gave me his constructive suggestions on the proposal for the study. I appreciate their concern and guidance.

To my husband, Raywin, whose love, understanding, and patience served as emotional support throughout the whole process of this research, I owe a lifetime of gratitude. His advice on data analysis made the completion of my thesis possible. My children, Ritchie and Rachelle, arrived at the beginning of my doctoral program and just as this study was nearing completion. They are delightful additions who made the completion of this work infinitely more challenging.

I also wish to express my love and gratitude to my parents, who made my doctoral studies possible by providing initial support for my further education overseas.

To my Lord Jesus Christ, whose encouragement and guidance helped me in times of frustration, I pledge eternal gratitude.

A final note of appreciation goes to the many friends who provided the help and kind words without which this total process of graduate study and research would have been cold, mechanical and futile.
They have provided the laughter that made academic drudgery bearable and the caring and help that made complex problems solvable.

TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES

CHAPTER

I. STATEMENT OF THE PROBLEM
    Introduction
    The Problem
    Purposes
    Research Questions
    Research Hypotheses
    Definition of Important Terms
    Overview of the Study

II. REVIEW OF LITERATURE
    Studies Related to Accuracy in Judgment
    Calibration in Judgment

III. METHODS OF THE STUDY
    Research Design and Variables
        Research Variables
        Variable Matrix
    Research Procedures
        Population and Sampling
        Instrumentation and Data Collection
        Data Analysis

IV. RESULTS, IMPLICATIONS AND RECOMMENDATIONS FOR FUTURE STUDY
    Results of the Study
    Discussion of the Findings
    Summary of the Findings
    Implications
    Recommendations for Future Study

BIBLIOGRAPHY
APPENDICES

LIST OF TABLES

Consistency (percent agreement) of the Duplicate Items in Parts I and II of this Instrument
Mean Scores, standard deviation and t-test differences in accuracy, confidence and calibration for medical and non-medical students
Ages of Medical and Non-medical Students
Correlation Between the Three Independent Measures: Accuracy, Confidence and Calibration
Mean Subjective Probability for Medical and Non-medical Students
Calibration: Analysis of Covariance Using Accuracy as Covariate

LIST OF FIGURES

Hypothetical Calibration Curve--A graph showing the percentages correct for each probability response
Mean Scores by Mean Subjective Probability for Each Subject in Medical and Non-medical Students

CHAPTER I

STATEMENT OF THE PROBLEM

Introduction

Making decisions is a daily human activity. The outcome of a decision may greatly affect the individual's welfare or the welfare of others. Some decisions are based on personal belief, such as voting decisions in a political election, policy decisions in business and deciding a trial verdict for a defendant in a courtroom. These beliefs are usually expressed in probabilistic language such as, "I think . . .," "chances are," "it is unlikely . . .," or "most probably . . .," and are usually based on intuition, knowledge of the event or subjective experience.

To improve accuracy in making decisions for a certain event, individuals have to know the actual probabilities of that event occurring and must "align" their beliefs with actuality. This process is known as the validation of subjective beliefs. One way of validation is by expressing these beliefs as estimates of subjective probability and comparing them with probability indexes mathematically derived from actual events. One example of a well-defined "actual" probability would be available reported actual frequencies of events, such as rates of causes of death in the United States reported in Vital and Health Statistics. This technique of comparing a person's subjective probability (a person's belief about the likelihood of an event) with the "actual" probability (the actual relative frequency of occurrence of the event) is called calibration. Individuals are perfectly calibrated if, over the long run, for all their given subjective probabilities, the proportion that is true is equal to the probability assigned (Fischhoff, Slovic, and Lichtenstein, 1977, p. 522).
For example, a perfectly calibrated individual who assigns events a probability of .7 will have 70 percent of those responses correct, and for all responses assigned a probability of .8 will be 80 percent correct.

However, people are not always perfectly calibrated in their probability estimations. Several research studies, reviewed by Lichtenstein, Fischhoff, and Phillips (1977), have shown that people tend to be biased in their probability estimations. In other words, people tend to overestimate or underestimate how much they know. For individuals who are underconfident, the proportion of responses that are correct is greater than the probability assigned to them (Lichtenstein and Phillips, 1977, p. 276). For example, individuals might be only 50 percent certain of their responses but have 90 percent of those responses correct. They underestimate how much they know. In cases of overconfidence, the proportion of correct responses is less than the probability assigned to them. For example, people might be 90 percent confident in their responses, but have only 75 percent of those responses correct. That is, they overestimate how much they know and believe they know more than they actually do.

A graph showing the percentage correct for each probability response is plotted in Figure 1. Curve A reflects underconfidence, curve B represents perfect calibration and curve C represents overconfidence.

From this discussion, two important underlying dimensions are noted. They are: (1) accuracy of estimation (that is, percent correct of responses); and (2) the degree of confidence placed in an estimation. What makes individuals better calibrated in their estimations? Do training and experience lead to more accurate and better calibrated estimation?

The Problem

Tremendous amounts of time, energy and resources, especially at higher levels of learning in higher education, are spent to attain high levels of expertise needed in various areas of specialization.
The expected outcome of specialization is that the so-called experts will become better decision-makers in the areas in which they have received training, than those who do not receive the same training. This presupposition acts as the impetus for this study to re-examine the notion that knowledge and training lead to better decision-makers who are more accurate and better calibrated on probability estimates in their areas of specialization.

[Figure 1. Hypothetical Calibration Curve--A graph showing the percentages correct for each probability response. Vertical axis: Percentages Correct (based on "actual" probability). Horizontal axis: Individual's Assigned Probability Response (subjective probability). Curves are labeled Underconfidence and Overconfidence.]

Whether training and knowledge will really make a difference in improving the accuracy and calibration of subjective probabilities is unclear. The inconsistent results of previous research are presented in the literature review section of this study. Such inconsistent results may be due to the short-term nature of the training tested in the studies. The training frequently had little time to take effect to improve the subjects' judgment performances. Hence, the present research was undertaken to examine a longer term of training, such as exists in higher education, where training is systematically planned and structured toward a defined career goal. The training in higher education is also more intensive, with frequent problem-solving exercises and examinations. Thus, such training should have a significant effect in improving trainees' capability for making accurate decisions and building confidence in those decisions, thereby leading to better calibration.

Purposes

This study has two purposes:

(1) To reexamine the question of whether medical training leads to more accurate estimates of the incidence of acute and prevalence of chronic disease conditions.

(2) To determine whether medical training leads to better calibration.
Research Questions

The following questions summarize the two central issues of this study: accuracy and calibration.

(1) Accuracy: Do people who have medical training know more about the incidence of acute and prevalence of chronic disease conditions than people who do not have medical training?

(2) Calibration: If they do, are people with medical training (experts) calibrated better than people without medical training (non-experts)?

Research Hypotheses

The following hypotheses were formed to examine the questions posed:

(1) Medical students are significantly more accurate than non-medical students in estimating the incidence of acute and prevalence of chronic disease conditions.

(2) Medical students are significantly better calibrated than non-medical students in estimating the incidence of acute and prevalence of chronic disease conditions.

The testing of these hypotheses will provide empirical evidence bearing on the research questions.

Definition of Important Terms

The following definitions for key terms used in the study will serve to provide a common basis for understanding.

-- Medical students. Fourth year medical students enrolled in the College of Human Medicine at Michigan State University during the Fall Term of the academic year 1981-1982.

-- Non-medical students. Michigan State University senior or graduate students who have a non-medical major (excludes those majoring in nursing, medical technology, osteopathic medicine, veterinary medicine, and other health-related areas). These students were also enrolled in the Fall Term, 1981.

-- Accuracy. Percentages of items correctly identified by the subject. The correct answers to the questions were derived from the statistics in Vital and Health Statistics, Series 10, Nos. 109 and 132.

-- Incidence of acute disease conditions. New cases of acute disease conditions occurring among people in the United States, based on statistics derived from Series 10, No. 132, of Vital and Health Statistics.
The acute conditions selected for this investigation were:

    Influenza
    Fracture and dislocation
    Pneumonia
    Sprain and strain
    Headache

-- Prevalence of chronic disease conditions. The cases (including new and old cases) of chronic disease conditions existing among people in the United States. The statistics were derived from Series 10, No. 109, of Vital and Health Statistics. The chronic disease conditions included in this study were:

    Arthritis
    Epilepsy
    Cerebrovascular disease
    Heart conditions
    Diabetes
    Tuberculosis

-- Better calibration. A person is perfectly calibrated if, over the long run, for all responses assigned the same probability, the proportion correct is equal to the probability assigned. A perfect calibration score is 0 and the worst possible score is 1.0.

-- Confidence. A person who is not perfectly calibrated can be either overconfident or underconfident. A person is overconfident when the proportion of responses that are correct is less than the probability assigned to them. Overconfidence is shown by a positive score. A person is underconfident if the proportion of correct responses is greater than the probability assigned to them. Underconfidence is shown by a negative score.

Overview of the Study

In Chapter I the problem, research questions and hypotheses have been stated. Important terms have been defined. In the chapter to follow, a review of research studies related to calibration will be presented. Chapter III will include the research design and variables and will focus on research procedures. In Chapter IV the results of the study will be presented and discussed. The findings will be summarized and their implications considered. Recommendations for future study will be outlined.

CHAPTER II

REVIEW OF LITERATURE

In this chapter a review of research studies related to calibration is presented.
The review of literature is organized under two major headings: (1) studies related to accuracy in judgment; and (2) studies related to calibration in judgment. Accuracy measures the percentage of items answered correctly by respondents. Calibration measures how closely the confidence subjects place in their judgments corresponds to their accuracy.

Accuracy in Judgment

In this study accuracy in judgment is determined by comparing the subjects' chosen answers with a criterion. Accuracy is measured by percent correct.

Studies have shown that subjects with no training or with no prior knowledge in a particular area tend to have difficulties performing a task in that area. In an early study, Lichtenstein and Fischhoff (1976) asked subjects with a limited knowledge of painting to study small sketches drawn by European and Asian children and determine whether the artist was a European child or an Asian child. Results showed that the subjects had difficulty with the task. Only 53.2 percent of their 1,104 answers were correct.

In another experiment, reported in the same study, other subjects with limited knowledge of stocks were asked to study market charts and predict whether a stock described in each chart would be up or down three weeks later. This task was even more difficult for subjects to perform accurately, and only 47.2 percent of their choices were correct. These studies concluded, therefore, that subjects without training, experience or knowledge in a particular area tend to have difficulty in performing judgment tasks in that area.

Training and knowledge, however, might be assumed to affect subjects' performance in terms of accuracy. Another experiment done by Lichtenstein and Fischhoff (1976) required subjects to identify handwriting samples, determining whether they were written by a European or an American. Two of the four groups received training for this task. Results showed that trained subjects correctly identified 71.4 percent of the specimens compared with 51.2 percent for untrained subjects.
The question of whether there will be a difference in performance in terms of accuracy for subjects with high levels of knowledge or experience in a certain task is discussed in several studies with conflicting results. Sanders (1963) found that students and instructors of meteorology performed about equally well in weather forecasting. Gustafson (1963) compared diagnoses of congenital heart disease made by a computer, pediatric cardiologists, and non-specialized physicians. The pediatric cardiologists and the computer appeared to be about equally accurate, correctly diagnosing 63-74 percent of the cases, while the non-specialized physicians were less accurate, correctly diagnosing only 36-52 percent of the cases. Another study conducted by Gustafson (1966), however, found basically no differences between surgical residents and experienced surgeons in their ability to predict patients' length of stay in the hospital.

Winkler (1967, 1971) has shown that being an expert in one's own field leads to better performance. He found that sportswriters and bookmakers were better than college students and faculty at predicting scores of NFL and Big Ten football games.

Stael Van Holstein (1971) compared four groups of subjects in forecasting the weather. The four groups of subjects with different levels of knowledge in meteorology were meteorologists, meteorology research assistants, students of meteorology and statisticians. It was found that the research assistants showed the best forecasting ability and the students the worst, while the meteorologists' and statisticians' forecasting ability fell in between, indicating a curvilinear relationship between level of expertise and accuracy of judgment.
Stael Van Holstein's (1972) experiment compared the performance of five groups of people (bankers, stock market experts, statisticians, teachers of business administration and students of business administration) in predicting the variability of the stock market. Results showed that stock market experts and statisticians performed the best, followed by business teachers and students, and bankers were last.

In summary, previous research has demonstrated that subjects with no prior training or knowledge in a specialized area tend to have difficulties making accurate judgments in the area. However, subjects with even a minimal amount of knowledge or with a minimum of prior training improve slightly in performance accuracy. There is not much evidence to determine whether or not expertise in a specialized area increases accuracy in judgments (Beach, 1975). Some studies (Winkler, 1967, 1971; Stael Van Holstein, 1972) reported that experts in a particular field tend to have better performances in that field. Other studies (Gustafson, 1963, 1966; Sanders, 1963) concluded that non-experts performed equally well in specialized areas.

Calibration in Judgment

Measuring accuracy of judgment, however, does not capture the certainty or the confidence subjects have in their judgments. A subject can achieve good calibration with little or no knowledge or experience in an area. On the other hand, a person with experience and knowledge in a particular area may not achieve good calibration; the person may believe he knows more than he actually knows (overconfidence) or less than he actually knows (underconfidence).

At least one published study found that subjects who are not experts in a particular area may achieve good calibration.
Using a full-range approach with four-alternative items, Fischhoff and Beyth (1975) asked 150 Israeli university students who were not foreign affairs experts to assess the probability of 15 then-future events, such as "President Nixon will meet Mao at least once." The resulting calibration curve is suboptimal at 0 and 1, and shows a dip at .7, but is otherwise remarkably close to the identity line (perfect calibration).

Other studies show that subjects with no training and no prior knowledge in a particular area were poorly calibrated. In fact, they showed no evidence of calibration at all. In one of their experiments discussed earlier, Lichtenstein and Fischhoff (1976) asked subjects to identify small sketches drawn by European and Asian children. Results indicated that subjects were overconfident in their judgments. In another experiment in the same study, subjects studied stock market charts and were asked to predict whether the stock prices described by each chart would be increasing or decreasing. Again, subjects demonstrated overconfidence. The above two experiments demonstrated that subjects with no prior knowledge or training in an area tend to be overconfident, in contrast to the almost perfect calibration obtained in the 1975 Fischhoff and Beyth study.

In an effort to determine whether training would improve calibration, Adams and Adams (1958) asked subjects to decide whether pairs of words were antonyms, synonyms, or unrelated. Calibration tallies and calibration curves were shown to subjects as feedback after each of five training sessions. A modest improvement in calibration was found after subjects received training.

In another study, by Pratt (1977), an expert was asked to predict attendance at 175 movies shown in local theaters over a period of more than one year. This expert was given some degree of additional training by receiving feedback throughout the experiment.
Results indicated that the only evidence of improvement in calibration over time came in the first few days; no further improvement was noted later.

Pickhardt and Wallace (1974) also reported slight improvement in calibration with five or six training sessions on estimation of uncertain quantities. However, in another study done by the same researchers, using a simulation game called PROSIM that was intended to increase knowledge on calibration, increased information did not affect calibration. There was virtually no improvement in calibration over the nineteen days of simulation.

Using only one training session on 75 items, Chou (1976) also found little improvement and no generalization in calibration.

Lichtenstein and Fischhoff (1976) asked two groups of subjects to examine ten handwriting samples to determine whether they had been written by a European or an American. The training group studied samples of handwriting labeled with country of origin; the non-training group studied samples of handwriting that were unlabeled. Results showed that trained subjects showed better calibration than untrained subjects, who showed no evidence of calibration.

Lichtenstein and Fischhoff (1980) trained people without previous experience in probability assessment using computerized feedback provided after sessions of assessment on general knowledge items. Eleven long and intensive sessions were used. Results showed that most subjects' calibration improved during the training sessions. The mean calibration score changed from .015 for the first session to .005 for the last session. All measurable improvements were found to come between the first and the second rounds of training.

The above studies demonstrated that calibration can be somewhat improved by training. All the above studies, however, were done using short-term training. No studies of prolonged training related to calibration were found.
Studies of subjects with different levels of knowledge, experience and training (experts vs. non-experts) were examined to determine whether the subjects would be calibrated differently. Oskamp (1962) divided subjects into three groups with varying levels of experience to evaluate the MMPI profile. The three groups were:

(1) 28 undergraduate psychology students representing inexperienced judges;

(2) 23 clinical psychology trainees working at a VA hospital; and

(3) 21 experienced clinical psychologists.

Their task was to determine whether VA hospital patients had been admitted for psychological reasons or medical reasons simply by reviewing their MMPI profiles. The subjects were then asked to assign a probability of correctness to their decisions in each case. Results showed that all three groups were overconfident, especially the undergraduates in their first session. When the first group was split into two groups, one with training for accuracy and the other without training, the trained group showed better calibration.

Sanders (1963) asked students of meteorology and instructors of meteorology to forecast the weather and found that students tend to overestimate the probability that an event will occur.

Hazard and Peterson (1973) asked 40 subjects at the Defense Intelligence School to respond to 50 two-alternative general knowledge items. Substantial overconfidence was also found in this study.

Lichtenstein and Fischhoff (1976) in their experiment asked 120 subjects to answer general knowledge items. Based on the accuracy of their responses, the subjects were divided into three subgroups according to their knowledge: the best subjects (40 subjects with 51 or more correct answers out of 75), the middle subjects (39 subjects with 46-50 correct answers), and the worst subjects (41 subjects with fewer than 46 correct answers). Separate analyses were performed for each group.
The results showed that subjects' calibration varied directly with their knowledge. All groups tended to be overconfident. The most knowledgeable subjects showed the least overconfidence and had a calibration curve closest to the identity line. The results strongly suggested that the more subjects know, the better their calibration is.

In another experiment, Lichtenstein and Fischhoff asked graduate students in psychology to answer 50 general knowledge items and 50 specially written items dealing with psychology. The two types of items were intermixed randomly in the stimulus package. The subjects were split for analysis into best and worst at the median (74.5%) of the distribution of percentage correct. The items were also split into easy (at least 75 percent correct) and hard (fewer than 75 percent correct) items. For these analyses, no distinction was made between general knowledge and psychology items. Results showed that the group with the greatest knowledge (best subjects in terms of percentage correct) did not have the best calibration scores. The most knowledgeable subjects, in answering the easiest items, showed substantial underconfidence, while the worst subjects, in responding to the hardest items, showed substantial overconfidence.

In another experiment Lichtenstein and Fischhoff asked subjects with different levels of knowledge to answer randomly intermixed questions with 50 general items and 50 psychology items. Here again, results showed that the group with the greatest knowledge did not have the best calibration score.

The above calibration studies demonstrate that people are prone to systematic biases in their probability judgments. The most common bias is overconfidence, because people believe that they know more than they actually know (Lichtenstein and Fischhoff, 1980, p. 2). Another conclusion is that training can sometimes improve calibration.
And finally, people who have knowledge in a specialized area (experts) sometimes demonstrated better calibration and sometimes not. Previous studies on subjects' accuracy in judgment also demonstrated that training and knowledge can improve accuracy, at least for a short period of time.

CHAPTER III

METHODS OF THE STUDY

In this chapter the research design and research procedures of the study will be discussed in separate sections.

Research Design and Variables

(1) Research Variables

The independent variable of this study was level of medical knowledge, represented by two groups, medical and non-medical students. There were three dependent variables: accuracy, level of confidence, and calibration.

(a) Accuracy - is measured by percent of correct responses in identifying the incidence of acute and prevalence of chronic conditions. It can be expressed as:

$$\text{Accuracy} = \frac{n}{N}$$

where $n$ is the number of items correctly identified and $N$ is the total number of items.

(b) Confidence (over/underconfidence) - the subject's level of confidence in making a decision. It is defined as follows:

$$\text{over/underconfidence} = \frac{1}{N} \sum_{t=1}^{T} n_t (r_t - c_t)$$

where $N$ is the total number of responses, $n_t$ is the number of times the response $r_t$ was used, $c_t$ is the proportion correct for all items assigned probability $r_t$, and $T$ is the total number of different response categories used. Overconfidence is shown by a positive difference and underconfidence by a negative difference.

(c) Calibration - this measure, derived from Murphy (1973), is:

$$\text{calibration} = \frac{1}{N} \sum_{t=1}^{T} n_t (r_t - c_t)^2$$

A perfect calibration would have a value of 0 and the worst possible score would be 1.0, which can be obtained by a subject who always responds $r_t = 1$ when wrong, and $r_t = 0.0$ when right.

(2) Variable Matrix

Given the above mentioned independent variable and dependent variables, a variable matrix can be drawn as follows:

                                      Calibration Measures
    Levels of Specialization     Calibration     Confidence     Accuracy
    Medical students
    Non-medical students

Simple independent t-tests are used to test for differences between the two groups. A critical significance level of 0.05 is used to test all hypotheses.

Research Procedures

(1) Population and Sampling

(a) Population

The subjects of this study were divided into two groups representing two different levels of training. They were:

(i) Medical students--These students were in their fourth year of medical training and are assumed to possess a certain level of medical knowledge.

(ii) Non-medical students--These students were seniors and graduate students studying in fields other than medicine or health related areas. They presumably possess a minimum level of medical knowledge.

(b) Sampling

Forty fourth-year medical students from Michigan State University voluntarily participated in this study and were used as the subjects for the first group. These 40 fourth-year medical students were enrolled in the College of Human Medicine and do not include students majoring in veterinary medicine, osteopathic medicine, nursing and other health-related areas.

Forty fourth-year or graduate students in major areas of study other than medicine or other health-related fields volunteered to participate in the study and were used as subjects for the second group (non-medical students). These 40 non-medical Michigan State University students were majoring in such fields as business, education, forestry, mathematics, theatre, human ecology, engineering, computer science, psychology, sociology, audiology and communication.

Only students who had been in the United States more than ten years were used as subjects in either group. The underlying reason was to exclude those students from other countries who might not be familiar with the subject matter of this study.
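The three dependent measures defined under Research Variables (accuracy, over/underconfidence, and calibration) can be illustrated with a short computational sketch. This is not the procedure used in the study; it is a minimal illustration that assumes each response is stored as an assigned probability paired with a 1 (correct) or 0 (incorrect) outcome. The function names and data layout are hypothetical.

```python
from collections import defaultdict


def accuracy(correct):
    """Proportion of items answered correctly: n / N."""
    return sum(correct) / len(correct)


def _group_by_response(probs, correct):
    """Collect the 0/1 outcomes under each distinct assigned probability r_t."""
    groups = defaultdict(list)
    for r, c in zip(probs, correct):
        groups[r].append(c)
    return groups


def over_underconfidence(probs, correct):
    """(1/N) * sum over t of n_t * (r_t - c_t).

    Positive values indicate overconfidence, negative values underconfidence.
    """
    groups = _group_by_response(probs, correct)
    total = sum(len(outcomes) * (r - sum(outcomes) / len(outcomes))
                for r, outcomes in groups.items())
    return total / len(probs)


def calibration(probs, correct):
    """(1/N) * sum over t of n_t * (r_t - c_t)**2; 0 is perfect, 1.0 is worst."""
    groups = _group_by_response(probs, correct)
    total = sum(len(outcomes) * (r - sum(outcomes) / len(outcomes)) ** 2
                for r, outcomes in groups.items())
    return total / len(probs)


if __name__ == "__main__":
    # Hypothetical subject: 90 percent confident on four items, three correct.
    probs = [0.9, 0.9, 0.9, 0.9]
    correct = [1, 1, 1, 0]
    print(accuracy(correct))
    print(over_underconfidence(probs, correct))
    print(calibration(probs, correct))
```

In the hypothetical example, a subject who assigns .9 to four items and gets three of them right has a positive over/underconfidence score, matching the overconfidence case described in Chapter I (90 percent confident, 75 percent correct).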
(2) Instrumentation and Data Collection

(a) Instrumentation

A questionnaire was designed as the assessment instrument for the study. The instrument consists of pairs of conditions (or diseases) presented to the subjects. Each question presents two alternative answers, one of which is true and the other false. Subjects are asked to identify which alternative is true and to indicate the probability that the chosen alternative is, in fact, true. A sample of the instrument items is presented in Appendix A. The procedure for selecting the various acute and chronic conditions used in Part I and Part II of the questionnaire is described below.

Part I: Acute Conditions

Six acute conditions were selected from Table 1, "Incidence of Acute Conditions, Percent Distribution, and Number of Acute Conditions per 100 Persons per Year, by Condition Group, According to Sex: United States, July 1977 - June 1978," appearing on pp. [illegible] of National Vital and Health Statistics, Series 10, No. 132. The procedure for selecting these six acute conditions was as follows:

(1) Three categories of conditions (respiratory conditions, injuries and other acute conditions) that had the highest frequencies of occurrence were selected from the five categories of conditions presented.

(2) Within each of the three categories selected, conditions were rank ordered by frequency of occurrence from highest to lowest. The median of the ranking was used to divide the high frequency group from the low frequency group.

(3) Then, one acute condition was randomly selected from the high frequency group and one acute condition was selected from the low frequency group in each of the three categories of conditions.

(4) From respiratory conditions, "influenza" was selected from the high frequency group, and "pneumonia" was selected from the low frequency group.
(5) From the category of injuries, the high frequency acute condition selected was "fractures and dislocations" and the low frequency acute condition selected was "sprains and strains."

(6) From the category of "other acute conditions," "diseases of the ear" was selected from the high frequency group and "headache" was selected from the low frequency group.

(7) The six selected acute conditions (influenza, pneumonia, fractures and dislocations, sprains and strains, diseases of the ear, and headache) were then arranged in alphabetical order and assigned a number from 1 to 6. Fifteen possible pairs of acute conditions can be formed from the six acute conditions chosen (See Appendix A).

(8) These 15 pairs of acute conditions comprise the 15 items in Part I of the questionnaire.

(9) Five items were randomly selected to check for reliability. By reversing the order of the disease conditions for five pairs--(2,1), (3,2), (4,3), (5,4), (6,5)--items 16 to 20 were formulated in Part I of the questionnaire.

Part II: Chronic Conditions

In the second part of the questionnaire, six chronic conditions were selected from "Table K, Number per 1,000 Persons, Prevalence, and Incidence of Selected Chronic Conditions Reported in Health Interviews: United States, 1968-1973," appearing in Vital and Health Statistics, Series 10, No. 109. The procedures for selecting these six chronic conditions are described below:

(1) Since the prevalence rates of chronic conditions in Table K were arranged from the highest frequency to the lowest, the prevalence rate of 10.3 per 1,000 persons was used as the point at which to divide the conditions into high and low frequency groups.

(2) Three chronic conditions were randomly selected from the high frequency group and three other chronic conditions were selected from the low frequency group.

(3) The three chronic conditions selected from the high frequency group were arthritis, diabetes and heart conditions.
(4) The three chronic conditions selected from the low frequency group were cerebrovascular disease, epilepsy and tuberculosis.

(5) These six chronic conditions were arranged in alphabetical order and each was assigned a number from 1 to 6. Fifteen possible pairs of chronic conditions can be formed from the six chronic conditions chosen (See Appendix B).

(6) The 15 pairs of chronic conditions comprise the 15 items in Part II of the questionnaire. Five items were randomly selected to check reliability. By reversing the order of the disease conditions, that is, (2,1), (3,2), (4,3), (5,4), items 16 to 20 were formulated for Part II of the questionnaire.

The subjects were asked to state how confident they were about their chosen answer by circling a confidence rating from 0% to 100% under each item in both Part I and Part II of the questionnaire. An example of such an item is shown below.

A. Headache          B. Influenza

How confident are you that your answer is correct?

0%  10%  20%  30%  40%  50%  60%  70%  80%  90%  100%

A glossary of terms was also attached to the instrument (Cooley, 1973).

(b) Data Collection

The instrument is presented in Appendix C. It was administered to three medical students and five non-medical students as a pilot study to determine the reliability of the instrument. As noted in the discussion of the instrument, five duplicate items with the order of the disease conditions reversed were added to both Part I and Part II to serve as a check of reliability. The five pairs of duplicate items (Nos. 16 to 20) in Part I are numbers 2 and 16, 3 and 17, 10 and 18, 13 and 19, 15 and 20. The five pairs of duplicate items (Nos. 16 to 20) in Part II are numbers 13 and 16, 10 and 17, 5 and 18, 12 and 19, 4 and 20.

Percent agreement (that is, the degree of consistency in selecting the same answer for the pairs of duplicate items) was calculated for each pair of duplicate items.
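The pairing of six conditions into 15 items, and the percent agreement calculated for a duplicate pair, can be sketched as follows. This is only an illustration: the function names and the pilot answers are hypothetical, and the code does not claim to reproduce the item order actually used in the questionnaire.

```python
from itertools import combinations

# The six chronic conditions named for Part II, in alphabetical order.
CHRONIC = [
    "arthritis", "cerebrovascular disease", "diabetes",
    "epilepsy", "heart conditions", "tuberculosis",
]


def all_pairs(conditions):
    """Return every unordered pair of conditions, alphabetically ordered."""
    return list(combinations(sorted(conditions), 2))


def percent_agreement(first_choices, repeat_choices):
    """Percent of subjects who chose the same condition on an item and on
    its order-reversed duplicate. Each list holds one choice per subject."""
    if not first_choices or len(first_choices) != len(repeat_choices):
        raise ValueError("need two non-empty lists of equal length")
    same = sum(a == b for a, b in zip(first_choices, repeat_choices))
    return 100.0 * same / len(first_choices)


pairs = all_pairs(CHRONIC)
print(len(pairs))  # 15

# Hypothetical answers from eight pilot subjects (made-up data):
# seven give the same answer both times, one switches.
first = ["arthritis"] * 8
repeat = ["arthritis"] * 7 + ["diabetes"]
print(percent_agreement(first, repeat))  # 87.5
```

An overall figure for a part of the questionnaire could then be taken as the mean of the five per-pair values; whether the study averaged in exactly this way is an assumption.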
An overall percent agreement on the duplicate items was also calculated separately for Parts I and II. The overall consistency of the pairs of duplicate items in Part I was 87.8 percent agreement. In Part II, the overall consistency of the pairs of duplicate items was 85.2 percent (See Table I). From this pilot study it can be concluded that the instrument used for assessing calibration is highly reliable, indicating that subjects were not responding randomly.

Table I. Consistency (Percent Agreement) on the Duplicate Items in Parts I and II of the Instrument

Part I                                          Part II
Pairs of duplicate items      % Agreement       Pairs of duplicate items      % Agreement
with order of conditions                        with order of conditions
reversed                                        reversed
Nos. 2 and 16                 100               Nos. 13 and 16                [illegible]
Nos. 3 and 17                 [illegible]       Nos. 10 and 17                [illegible]
Nos. 10 and 18                [illegible]       Nos. 5 and 18                 100
Nos. 13 and 19                [illegible]       Nos. 12 and 19                [illegible]
Nos. 15 and 20                [illegible]       Nos. 4 and 20                 [illegible]
Overall Mean Percent          87.8              Overall Mean Percent          85.2
Agreement                                       Agreement

Upon completion of the pilot study, the instrument was administered to 40 non-medical student volunteers from various majors, as mentioned earlier. Ninety-five questionnaires were then mailed to five medical training communities affiliated with Michigan State University in Kalamazoo, Saginaw, Lansing, Flint and Grand Rapids, where fourth-year medical students have clerkships. Administrators in charge of the medical training communities were asked to distribute the questionnaires to the students in each community. Forty completed questionnaires were returned and were used as data for the group of medical students.

(3) Data Analysis

The data from the groups of non-medical and medical students were entered into a master computer file and analyzed on the Michigan State University CDC Cyber 750 using the Statistical Package for the Social Sciences (SPSS). Independent t-tests were used in testing the difference between the groups for accuracy, over/underconfidence and calibration scores.

In the chapter to follow, the results and findings from this analysis are discussed.

CHAPTER IV

RESULTS, IMPLICATIONS AND RECOMMENDATIONS FOR FUTURE STUDY

In this chapter the findings of the study and their implications are discussed, and recommendations for future studies are presented.

Results of the Study

The derivation of the three measures (accuracy, confidence and calibration scores) and an example of their computation are presented in detail in Appendix D.
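The independent t-tests used to compare the two groups can be sketched in Python as below. This is a minimal pooled-variance version written with the standard library; it is an assumption that this corresponds to the variant run in SPSS, and the scores shown are made up rather than taken from the study.

```python
import math
from statistics import mean, variance


def independent_t(group_a, group_b):
    """Pooled-variance independent-samples t statistic.

    Returns (t, degrees_of_freedom). Assumes equal population variances;
    whether the original analysis made that assumption is not stated.
    """
    n_a, n_b = len(group_a), len(group_b)
    if n_a < 2 or n_b < 2:
        raise ValueError("each group needs at least two scores")
    df = n_a + n_b - 2
    # Weighted average of the two sample variances.
    pooled = ((n_a - 1) * variance(group_a)
              + (n_b - 1) * variance(group_b)) / df
    standard_error = math.sqrt(pooled * (1 / n_a + 1 / n_b))
    return (mean(group_a) - mean(group_b)) / standard_error, df


# Hypothetical accuracy scores (proportion correct) for two small groups.
medical = [0.80, 0.75, 0.85, 0.70, 0.90]
non_medical = [0.60, 0.65, 0.55, 0.70, 0.50]
t_value, df = independent_t(medical, non_medical)
print(round(t_value, 2), df)
```

With two groups of 40 subjects, as in this study, the same function gives 40 + 40 - 2 = 78 degrees of freedom.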
The mean scores, standard deviations and t-test differences of the three measures are summarized in the variable matrix in Table II. The difference between the two groups is more obvious when the mean scores for each individual on accuracy (percent correct) and mean subjective probability are plotted in relation to the calibration curve, as shown in Figure 2.

Table II. Mean Scores, Standard Deviation and t-test Differences in Accuracy, Confidence and Calibration for Medical and Non-Medical Student Groups

                         Accuracy                 Confidence (1)           Calibration (2)
                         M    SD   Observed t     M    SD   Observed t     M    SD   Observed t
Medical Students         [values illegible]
Non-Medical Students     [values illegible]

(1) A negative value indicates underconfidence.
(2) The best possible calibration score is 0 and the worst is 1.0.
df = 78

[Figure 2. Accuracy (percent correct) plotted against mean subjective probability for each individual, in relation to the calibration curve. Plot not recoverable from this copy.]

The results of each measure (accuracy, confidence and calibration) will be presented separately in the following paragraphs.

Accuracy. Table II gives the mean score, standard deviation and the t-test differences of accuracy for all