This is to certify that the dissertation entitled "An Alternative Way of Estimating Test Item Statistics for Test Development in Malaysia" presented by Bong-Cheang Quek has been accepted towards fulfillment of the requirements for the Ph.D. degree in Education. Date: 10/31/89.

AN ALTERNATIVE WAY OF ESTIMATING TEST ITEM STATISTICS FOR TEST DEVELOPMENT IN MALAYSIA

By

Bong-Cheang Quek

A DISSERTATION

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of

DOCTOR OF PHILOSOPHY

Department of Counseling, Educational Psychology and Special Education

1989

ABSTRACT

AN ALTERNATIVE WAY OF ESTIMATING TEST ITEM STATISTICS FOR TEST CONSTRUCTION IN MALAYSIA

By

Bong-Cheang Quek

The purpose of the study was to investigate how accurately experienced Chemistry teachers can estimate the item statistics of the Chemistry test items that will be used in the Malaysian Certificate of Education (MCE) Examination. The study also examined whether the accuracy of estimation can be improved by an intervention program for increasing the competency of the teachers' estimation skills. Also examined were the effects of (1) content area of item, (2) cognitive level of item, (3) difficulty level of item, (4) discrimination power of item, and (5) item type on the accuracy of item statistics estimation.

Thirty experienced Chemistry teachers were randomly assigned to one of two groups: the treatment and the control groups. The treatment group teachers underwent a training session which provided them with an opportunity to develop skills and strategies for estimating item statistics, whereas the control group was not trained. Equating items were embedded among the items whose item statistics were to be estimated. The teachers were each provided with 10 "anchor" items (i.e., items with known population values of item characteristics) to guide them in the estimation.

A double repeated-measures design with one between-subjects factor and two within-subjects factors was used to analyze the data. The data indicated that

1. The treatment was effective in improving the accuracy of p-value estimation but not the accuracy of point-biserial estimation.

2. The factors (a) content area, (b) cognitive level, (c) difficulty level, (d) discrimination level, and (e) item type have a significant effect on the accuracy of p-value estimation by the experienced teachers.

3. There was no significant difference between the accuracy of p-value estimation by the teachers competent in estimation skills and the accuracy of p-value estimation obtained by field-trial of the item pool.

4. The factors (a) content area, (b) cognitive level, and (c) discrimination level significantly affect the accuracy of point-biserial estimation by the experienced teachers.

5. The factors (a) difficulty level and (b) item type do not significantly affect the accuracy of point-biserial estimation by the experienced teachers.

6. The estimated and population values of the point-biserial have a Spearman rank correlation of 0.43.
ACKNOWLEDGEMENTS

Many people have contributed toward the success of this study. First, I am deeply grateful to my wife, Lee-Chin, and our children Wei-Kin and Wei-Schen for their love, emotional support, patient understanding, and many sacrifices throughout my doctoral studies.

I would especially like to thank Dr. Herbert C. Rudman, the Chairman of my Dissertation Committee, my academic advisor and my friend, for his guidance, insightful suggestions and personal interest in the study. Without his professional advice and unfailing support this project would have taken a much longer time to complete.

I wish to express my gratitude to the members of my Dissertation Committee: Dr. William Mehrens for his advice, insightful comments and valuable suggestions; Dr. Stephen W. Raudenbush for his statistical expertise and valuable suggestions in data analysis; Dr. Ralph T. Putnam for his professional expertise and contribution to the early development of the proposal for the study; and Dr. Louis Romano for his contribution as a member of the committee.

Special thanks go to the teachers who participated in the study. Without their cooperation and contributions this study would not have been completed.

I would also like to thank the Malaysian Ministry of Education for the scholarship award which enabled me to pursue the doctoral program; the Director of the Educational Planning & Development Division, Datin Asiah bt. Abu Samah, for granting the approval for the project to be carried out in Malaysia; the Director of the Examinations Syndicate, Dato' Haji Mohd. Ghazali b. Hj. Mohd. Hanafiah, for permitting the use of the facilities in the Examinations Syndicate for the study; and Puan Nik Faizah Mustapha, Datin Hajjah Rapiah Tun A. Aziz, Puan Hajjah Rosni Hamzah, and Mr. Ivan D. Filmer Jr. (senior officers in the Examinations Syndicate) for their assistance and advice which enabled the project to be carried out smoothly.

TABLE OF CONTENTS

LIST OF TABLES . . . viii
LIST OF FIGURES . . . x

Chapter
I. THE PROBLEM . . . 1
   Introduction . . . 1
   Purpose of the Study . . . 3
   Testing and Test Development in Malaysia . . . 5
   Need for the Study . . . 9
   Research Hypotheses . . . 11
   Overview . . . 13
II. REVIEW OF THE LITERATURE . . . 14
   Introduction . . . 14
   Judgment Under Uncertainty . . . 14
   Determinants of Item Difficulty and Discrimination . . . 23
   Empirical Evidence of Accuracy of Estimate . . . 38
   Discussion and Summary . . . 49
III. PROCEDURES AND DESIGN . . . 57
   Sampling Procedure . . . 58
   Sample Item . . . 61
   Design . . . 63
   Hypotheses . . . 89
   Statistical Analysis . . . 90
   Summary . . . 97
IV. RESULTS . . . 99
   Introduction . . . 99
   Accuracy of P-Value Estimation . . . 100
   Accuracy of Point-Biserial Estimation . . . 122
   Summary . . . 132
V. SUMMARY AND CONCLUSIONS . . . 137
   Summary . . . 137
   Conclusions . . . 142
   Discussion . . . 147
   Implications for Further Research . . . 161

Appendix
   A. Teacher Data . . . 164
   B. Explanation (Technical Terms) . . . 166
   C. Determinants of Item Difficulty . . . 168
   D. Bayesian Approach . . . 169
   E. Sample of Scatterplots . . . 172
   F. Tables of Means of Accuracy . . . 178
   G. ANOVA Tables . . . 188

Bibliography . . . 198
LIST OF TABLES

Summary of Findings Related to the Study . . . 54
Examples of Types of Items . . . 62
A Summary of Research Findings on Item Difficulty . . . 67
Estimated and Population P-Values of the Equating Items . . . 80
Estimated and Population Delta-Values of the Equating Items . . . 81
A Double Repeated Measures Design with One Between-Subjects and Two Within-Subjects Factors . . . 92
Competency Indices of Teachers . . . 95
Multiple-Comparisons among the Means of the Different Content Areas (Form A) . . . 103
Multiple-Comparisons among the Means of the Different Content Areas (Form B) . . . 103
Multiple-Comparisons among the Means of the Different Cognitive Levels (Form A) . . . 106
Multiple-Comparisons among the Means of the Different Cognitive Levels (Form B) . . . 106
Multiple-Comparisons among the Means of the Different Difficulty Levels (Form A) . . . 110
Multiple-Comparisons among the Means of the Different Difficulty Levels (Form B) . . . 110
Multiple-Comparisons among the Means of the Different Discrimination Levels (Form A) . . . 112
4.8 Multiple-Comparisons among the Means of the Different Discrimination Levels (Form B) . . . 112
4.9 ANOVA Table for Evaluating the Difference between the Accuracy of Estimation by Teachers and the Accuracy of Field-Trial . . . 116
4.10 Means and Standard Deviations of Accuracy of Estimation for Different Methods of Estimation . . . 121
4.11 Multiple-Comparisons among the Means of the Different Content Areas (Form A) . . . 127
4.12 Multiple-Comparisons among the Means of the Different Content Areas (Form B) . . . 127
4.13 Multiple-Comparisons among the Means of the Different Cognitive Levels . . . 129
4.14 Multiple-Comparisons among the Means of the Different Discrimination Levels (Form A) . . . 131
4.15 Multiple-Comparisons among the Means of the Different Discrimination Levels (Form B) . . . 131
4.16 Summary of Tests of Significance for Item Difficulty Estimation . . . 133
4.17 Summary of Tests of Significance for Point-Biserial Estimation . . . 136

LIST OF FIGURES

Relationship between P-Value and Normal Deviate . . . 80
Frequency Distribution of the Original Accuracy of P-Value Estimates . . . 84
Frequency Distribution of the Original Accuracy of Point-Biserial Estimates . . . 85
Frequency Distribution of the Transformed Accuracy of P-Value Estimates . . . 86
Frequency Distribution of the Transformed Accuracy of Point-Biserial Estimates . . . 87
Interaction of Form and Content on Accuracy of P-Value Estimation . . . 102
Interaction of Form and Cognitive Level on Accuracy of P-Value Estimation . . . 105
Interaction of Form and Difficulty Level on Accuracy of P-Value Estimation . . . 108
Interaction of Form and Discrimination Level on Accuracy of P-Value Estimation . . . 111
Interaction of Form and Item Type on Accuracy of P-Value Estimation . . . 114
Frequency Distribution of the Original Accuracy of P-Value Estimates by Competent Teachers and by Field-Trial . . . 117
Frequency Distribution of the Transformed Accuracy of P-Value Estimates by Competent Teachers and by Field-Trial . . . 118
Interaction of Method of Estimation and Content on Accuracy of P-Value Estimation . . . 119
4.9 Scatterplot between Estimated and Population Point-Biserials . . . 124
4.10 Scatterplot between Point-Biserials from Item Analysis and Population Point-Biserials . . . 125
4.11 Interaction of Form and Content on Accuracy of Point-Biserial Estimation . . . 126
4.12 Interaction of Form and Discrimination Level on Accuracy of Point-Biserial Estimation . . . 130

CHAPTER I

THE PROBLEM

Introduction

The quality of a test is determined, in part, by the extent to which the scores produced by the test are reliable and valid. Mehrens and Lehmann (1987) defined reliability as "the degree of consistency between two measures of the same thing" (p. 54) and validity as "the extent to which certain inferences can be made from test scores or other measurement" (p. 74). Even though reliability and validity are both important indicators of a test's quality, a valid interpretation of the test scores is possible only when there is consistency in the scores. Thus reliability of test scores is a necessary, although not sufficient, condition for valid test score interpretations. Nitko (1983) argued that "we cannot realistically expect...that any test will yield perfectly consistent or reliable scores. Nevertheless, as the degree of reliability of test scores diminishes, so does their degree of validity" (p. 388). This statement underscores the importance of constructing tests that will yield consistent scores so that accurate inferences can be made. While it is not always easy to construct a test that enables valid interpretations of test scores, it is not as difficult to develop tests that yield consistent scores, provided the test developers adhere to the principles of test construction.

Psychometricians have developed theories that relate test characteristics such as reliability, mean, standard deviation, and standard error of test scores to item statistics such as item difficulty level and item discrimination index. For example, the Kuder-Richardson reliability coefficient (KR20) can be computed when the difficulty level and the discrimination index (in this case the point-biserial correlation coefficient) of each item in the test are known. The mean of the test scores is the simple sum of the difficulty levels of all the items in the test. Apart from affecting the test statistics, knowing item statistics also allows the test developer to control the test score distribution to serve certain specific purposes such as awarding scholarships or diagnosing learning difficulties. In all these cases, item difficulty level is defined as the proportion of a defined group of examinees who answer the item correctly, while the point-biserial correlation coefficient is the correlation between the score on the item and the total test score for a defined group of examinees.

A logical conclusion is that to construct a test with desirable qualities, that is, a test with high reliability, suitable difficulty level and suitable test score distribution, the test developer needs to know the difficulty levels and the discrimination indices of the items. These item statistics can be obtained through item analysis. The usual approach to obtain the required information on the items is to try out the items on a sample of the population similar to the population for which the test is intended and analyze the items. This approach of obtaining item statistics is currently practiced by the Examinations Syndicate, Ministry of Education of Malaysia.

Purpose of the Study

This study investigated how accurately experienced chemistry teachers can estimate the item statistics of the Chemistry test items that will be used in the Malaysian Certificate of Education Examination (MCE).
Experienced chemistry teachers are defined as current teachers who have taught examination class chemistry for at least 3 years and have some experience in grading the essay or practical components of the Chemistry examination. Examination classes are classes in which students are taught curricula that will be examined in the MCE Examinations at the end of the academic year, and the teachers teaching these classes are expected to prepare students so that as high a percentage of students as possible will pass the examination in the subjects they teach.

This study will investigate the accuracy with which these experienced examination class teachers estimate the item characteristics. The item characteristics to be investigated are the item difficulty and the item discrimination index. The item difficulty is defined, for purposes of this study, as the percentage of the test population answering an item correctly (p-value). The item discrimination will be defined as the point-biserial correlation coefficient, r_pbis, between the dichotomous item score and the total test score.

In addition to investigating how accurately experienced chemistry teachers can estimate item characteristics, this study will examine whether the accuracy of estimation can be improved by an intervention program aimed at increasing the competency of teachers' estimation. It should be noted that the main focus of the study is not to generalize the findings to the population of experienced chemistry teachers but to investigate the degree of accuracy with which the selected group of teachers are able to estimate item characteristics and to what extent the intervention program can improve accuracy of estimation. If it is found that the intervention program can improve the accuracy of estimation substantially, it suggests that further research should be conducted to investigate whether a more extensive intervention program can improve the accuracy further.

This study, then, will be concerned with the following broad questions:

(a) With what degree of accuracy do experienced chemistry teachers not trained in estimation skills estimate the difficulty levels and discrimination indices of the chemistry items in the MCE Examination?

(b) With what degree of accuracy do experienced chemistry teachers trained in estimation skills estimate the difficulty levels and the discrimination indices of the chemistry items in the MCE Examination?

(c) Will the training program designed to improve the competency of estimation of the experienced chemistry teachers result in an increase in the accuracy of estimation?

(d) Will the accuracy of estimation depend on the item type, the cognitive levels at which the items were written, the subject matter contents of the item, item difficulty, and item discrimination?

(e) Are the discrepancies between the item statistics estimated by the trained teachers and the corresponding item parameters (obtained from post-test analysis) of the same magnitude as the discrepancies between the item statistics obtained in the field-trial and the corresponding item parameters?

Testing and Test Development in Malaysia

A brief description of testing and test development in the Malaysian Ministry of Education will provide some background information that will enhance the understanding of not only the purpose and the need of, but also the method of analysis used in, this study.

Background. The educational system in Malaysia is highly centralized.
The syllabi of school subjects are designed by the Curriculum Development Center of the Ministry of Education, and schools are required to adopt curricula that provide instructional activities adhering closely to the syllabi. Evaluations of pupils' learning, in the form of examinations, are carried out centrally by another body called the Examinations Syndicate, which is a division in the Ministry of Education itself.

Public examinations at the end of primary education, and at the third year, the fifth year and the seventh year of secondary education, are conducted at the end of each academic year. All the students in the respective examination classes are required to take the designated examinations, and the results of the examinations at the secondary level are used for promoting the students to the next level of education. All primary school students are allowed to enter the secondary schools regardless of the results of the examinations which they took at the end of the primary education.

The tests constructed for a specific examination are based on the same curricular contents each year, except when there is a change of syllabus. The test papers are 'open', that is, the papers are made available to the public at the end of the examination period. This means that fresh tests have to be prepared each year for the public examinations. It was reported in 1982 that the Examinations Syndicate constructed annually 34 multiple-choice objective tests and 288 essay tests (Report on the second national seminar on test management system, 1982).

How an Objective Test is Constructed. A typical objective test developed¹ in the Examinations Syndicate goes through the following fourteen steps (Report on the second national seminar on test management system, 1982):

(a) Selection of panel members
(b) Commissioning prospective panel members to write items
(c) Panel meeting to review and to write items
(d) Review of accepted items by test developer
(e) Assembly of accepted items into test booklets for field-trial
(f) Field-trial of item pool
(g) Item analysis of item pool
(h) Processing data
(i) Selecting items
(j) Computing predicted test statistics
(k) Preparing first draft of the test
(l) Review of draft by internal committee
(m) Assembling the final draft
(n) Proofreading

Among the fourteen steps listed above, the steps of particular relevance to the present study are: field-trial of item pool, item analysis of item pool, selection of items, and computing of predicted test statistics from item statistics.

¹ Test development in Malaysia is greatly influenced by the work done at Educational Testing Service (ETS), Princeton. In fact, the test development procedure in the Examinations Syndicate is modified from that practiced at ETS.

The item analysis of the item pool is handled by a mainframe IBM computer. The item analysis reports the difficulty level (the p-value), the point-biserial correlation coefficient (r_pbis), and the mean criterion score (MCS) for each option in the items. The mean criterion score for a particular option is the mean of the criterion scores of all the examinees who chose the option as the answer for the item, while the criterion score is the total score of the examinee on the test converted to a standard score with a mean of 13 and a standard deviation of 4. The p-value of each item is further converted to a Delta-scale which also has a mean of 13 and a standard deviation of 4.
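As a concrete illustration of these item statistics, the sketch below computes a p-value, an ETS-style Delta-value (the conventional normal-deviate transform of the p-value, rescaled to a mean of 13 and a standard deviation of 4), and a point-biserial correlation for a single dichotomously scored item. The function names and toy data are assumptions made for illustration only; this is not the Examinations Syndicate's mainframe item-analysis routine.

```python
# Minimal sketch (illustrative, not the Syndicate's actual item-analysis program).
from statistics import NormalDist, mean, pstdev

def p_value(item_scores):
    """Proportion of examinees answering the item correctly (scores coded 0/1)."""
    return sum(item_scores) / len(item_scores)

def delta_value(p):
    """Delta-scale difficulty: the normal deviate corresponding to p,
    rescaled to mean 13 and standard deviation 4 (hard items get high Deltas)."""
    return 13 + 4 * NormalDist().inv_cdf(1 - p)

def point_biserial(item_scores, criterion_scores):
    """Correlation between the 0/1 item score and the criterion (total) score."""
    p = p_value(item_scores)
    mean_right = mean(c for x, c in zip(item_scores, criterion_scores) if x == 1)
    return (mean_right - mean(criterion_scores)) / pstdev(criterion_scores) * (p / (1 - p)) ** 0.5

# Toy data: five examinees' scores on one item and their hypothetical criterion scores.
item = [1, 0, 1, 1, 0]
criterion = [16, 9, 14, 12, 10]
p = p_value(item)
print(p, delta_value(p), point_biserial(item, criterion))
```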
The Delta-scale is preferred to a p-value for describing the difficulty level of items because a Delta-scale is linear while a p-value scale is not, and it is more meaningful to compute mean difficulty on a linear scale. Items are selected based on whether they fit the table of specifications, and whether they have suitable Delta-values and acceptable point-biserial correlation coefficients. Since the tests constructed by the Malaysian Examinations Syndicate are achievement tests that are scored dichotomously, the Kuder-Richardson formula 20 (KR20) is used to compute the internal-consistency reliability coefficient of the tests.

Post-analysis. After the tests have been administered to the actual student population at the end of the academic year, item analysis is performed on the items in the test. This set of item analysis is referred to as the post-analysis. The post-analysis reports two sets of statistics: one for the test and the other for the items in the test. The test statistics portion of the post-analysis reports the population size on which the analysis is based, the number of items in the test, the raw mean of the test scores, the standard deviation of the raw scores, the mean score on the Delta-scale, the mean of the point-biserial correlation coefficients, the internal-consistency reliability coefficient (KR20), and the standard error of measurement. The item statistics portion of the analysis reports, for each item in the test, the proportion of examinees choosing each option, the mean criterion score of the examinees choosing each option, and the point-biserial for each option.

Need for the Study

The test development process in Malaysia and the examination policy in which test papers are "opened" after each public examination require field-trials of item pools to be carried out annually. The volume of multiple-choice objective tests constructed annually -- 34 as reported in 1982 (Report on the second national seminar on test management system, 1982) -- makes the annual field-trial exercise a formidable task. The same report estimated that about 1,900 'working' items, that is, items that satisfy the tables of specifications and have suitable item difficulty and item discrimination, are required annually to construct the 34 objective tests. According to the report, "To come up with this number of working items will require at least 3,800 draft items. The [Test Development and Research] Unit [of the Malaysian Examinations Syndicate] carries out four trial studies a year covering a total sample of about 20,000 students" (1982, p. 47).

The need to conduct annual field-trials of item pools has created administrative as well as practical problems, not to mention the large cost. The administrative problems have become more acute in recent years due, in part, to the increasing volume of items to be item analyzed and, in part, to the recent trend of schools conducting their regional trial examinations prior to the public examinations. The regional trial examinations are conducted by groups of schools in particular regions. The nature of the tests is similar to that of the public examinations, and the tests are written, in a cooperative effort, by the teachers in the regions. All the schools participating in a particular regional trial examination adopt the same schedule for the examinations.
This means that the field-trial exercise conducted by the Examinations Syndicate has to compete with the school regional trial examinations for the limited time available in the schools for testing. This gives rise to problems in scheduling field-trials.

The cycle of test development requires that item statistics of the item pools be available before the construction of the final drafts of the tests. This requirement creates a "bottle-neck" situation in that tests cannot be constructed before trial studies of item pools, and field-trial exercises can only be carried out at a specific time of the year. Furthermore, the field-trial exercises are plagued with many scheduling problems. Taking all these problems and constraints into consideration, there is, therefore, a need to investigate an alternative approach to estimating the item statistics, namely the item difficulty level and the item discrimination index, of the item pool, with the intention that this alternative approach will be less expensive, will not be constrained by the specific time at which the estimation can be conducted, and will not require representative samples of students.

Research Hypotheses

The research hypotheses of the study are as follows:

1. The accuracy of item difficulty estimation by the experienced teachers trained in estimation skills is better than the accuracy of item difficulty estimation by the teachers not trained in estimation skills.

1a. There is a difference between items of different content areas in the accuracy of p-value estimation by the experienced teachers.

1b. There is a difference between item cognitive levels in the accuracy of p-value estimation by the experienced teachers.

1c. There is a difference between items of different difficulty levels in the accuracy of p-value estimation by the experienced teachers.

1d. There is a difference between items of different discrimination power in the accuracy of p-value estimation by the experienced teachers.

1e. There is a difference between items of different types in the accuracy of p-value estimation by the experienced teachers.

2. The accuracy of item difficulty estimation by the experienced teachers trained in estimation skills is the same as the accuracy of item difficulty estimation obtained from field-trial of the item pool.

3. The accuracy of discrimination index estimation by the experienced teachers trained in estimation skills is better than the accuracy of discrimination index estimation by the teachers not trained in estimation skills.

3a. There is a difference between items of different content areas in the accuracy of point-biserial correlation coefficient estimation by the experienced teachers.

3b. There is a difference between items of different cognitive levels in the accuracy of point-biserial correlation coefficient estimation by the experienced teachers.

3c. There is a difference between items of different difficulty levels in the accuracy of point-biserial correlation coefficient estimation by the experienced teachers.

3d. There is a difference between items of different discrimination power in the accuracy of point-biserial correlation coefficient estimation by the experienced teachers.

3e. There is a difference between items of different types in the accuracy of point-biserial correlation coefficient estimation by the experienced teachers.

Overview

This chapter has presented the problem, purpose, background, need and research hypotheses of the study. In Chapter II a review of the literature related to the study will be presented.
Chapter III describes the procedures and design of the study. The dependent variables and the procedure for transforming the skewed distribution of the original dependent variables to a more nearly normal distribution are also described. Chapter IV presents the analysis of the data obtained. The analysis of the data for item difficulty estimation is presented first, followed by the presentation of the analysis of the discrimination index estimation. Chapter V contains a summary of the study and the findings. The conclusions, a discussion of the findings and the implications of the study are also included.

CHAPTER II

REVIEW OF THE LITERATURE

Introduction

The purpose of this study is to investigate the accuracy with which teachers estimate item statistics and whether teachers trained in estimation skills are able to estimate item statistics more accurately than teachers not trained in estimation skills. Other studies were reviewed to find out the factors that affect the values of item statistics and estimation accuracy, and to develop a rationale for designing the training program.

Research related to the present study was reviewed and grouped into three broad categories. One category concerns judgment under uncertainty. The second category deals with the determinants of item difficulty and item discrimination. The third category describes empirical studies of the accuracy with which subjective judgments of item characteristics were made under different situations.

Judgment under Uncertainty

Extensive research on judgment under uncertainty has been reported in the literature. This body of research has focused on two areas. One area deals with the psychology of prediction and the other with expert judgment under uncertainty.

Psychology of Prediction

Tversky and Kahneman (1974), in a discussion of judgment under uncertainty, suggested that people rely on heuristic principles to reduce the complex task of assessing probabilities and predicting values to simpler judgmental operations, and cautioned that while these heuristic approaches are generally useful they often lead to severe and systematic errors. The writers demonstrated that three heuristics -- representativeness, availability, and adjustment and anchoring -- were employed by people to assess probabilities and to predict values.

Representativeness. The representativeness heuristic is an approach whereby people predict the outcome that appears most consistent with the evidence presented to them. Kahneman and Tversky (1973) illustrated the occurrence of judgment by the representativeness heuristic through the following event: A group was given the description of a person X as, "Mr. X is very shy and withdrawn, invariably helpful, but with little interest in people, or in the world of reality. A meek and tidy soul, he has a need for order and structure, and a passion for detail." They were then asked to assess the probability that Mr. X was engaged in a particular occupation from a number of occupations: farmer, salesman, airline pilot, librarian and physician. In spite of the fact that there are more farmers than librarians, the representativeness heuristic led people to believe that Mr. X was a librarian because he was similar to the stereotype of a librarian. In a series of studies, Kahneman and Tversky (1973) have shown that both naive and sophisticated subjects predict by representativeness.
People (1) are insensitive to prior probability of outcomes, sample size and probability, (2) have misconceptions of chance and regression, and (3) have unwarranted confidence produced by a good fit between the predicted outcome and the input information. These three outcomes were offered as evidence that intuitive predictions follow a representativeness heuristic (Tversky & Kahneman, 1974).

Several other studies supported Kahneman and Tversky's conclusion that judgmental heuristics are quite useful but sometimes lead to systematic errors. For example, Hackman (1982) stated that one of the maxims for institutional researchers is that 'heuristics are not always helpful'. Evidence of representativeness, even though highly unreliable and worthless, will lead people to ignore base-rate knowledge or prior probability while making estimations. Nisbett and Borgida (1975) investigated the influence of a representativeness heuristic on people's reasoning about social behavior. They showed that base rate information about the behavior of most people in a given situation often has little effect on a subject's attributions about the causes of a particular target individual's behavior.

Availability. Tversky and Kahneman (1974) stated that the "availability heuristic is employed when people assess the frequency of a class or the probability of an event by the ease with which instances can be brought to mind" (p. 1127). In a study to demonstrate this effect, different lists of names of both sexes were distributed to different subjects. Each list contained the same number of names of men and women. However, in some of the lists the men were relatively more famous than the women, and in others the women were relatively more famous. When asked to judge whether the list contained more names of men than of women, the subjects erroneously judged that the class (sex) that had the more famous personalities was the more numerous. In general, instances of large classes are recalled better and faster than instances of less frequent classes. Furthermore, in addition to frequency and probability, other factors such as relevance, similarity, salience, familiarity, drama and recency also affect availability (Tversky & Kahneman, 1973, 1974).

Following Tversky and Kahneman's work, a number of investigators have carried out field tests of the availability heuristic. Billings and Schaalman (1980) provided evidence to support the hypothesis that the variables number, relative frequency, relevance, familiarity, drama and recency are related to subjective probability estimates. In a study to assess item-induced availability biases in ratings of leader behavior, Binning and Fernandez (1986) found that the availability of behavior description items, i.e., items that described more specific, imaginable, dramatic, familiar or retrievable behaviors, significantly correlated with the actual ratings of leader behavior. Levi and Pryor (1985) extended the list of variables affecting availability to include reasons for the outcome. They found that the prediction of the outcome of the presidential debate between Reagan and Mondale was affected by the availability of reasons but not by imagery of the outcome.

Hence, according to Tversky and Kahneman (1974), reliance on availability leads to prediction biases such as (a) biases due to the retrievability of instances, (b) biases due to the effectiveness of a search set, (c) biases due to imaginability, and (d) biases due to illusory correlation.

Adjustment and anchoring.
In this judgmental heuristic, estimates are made by assuming an initial value, and adjustments are then made to yield the final answer. Whether the initial values are suggested by the formulation of the problem or obtained from partial knowledge about the problem, the adjustments made are on most occasions insufficient. Tversky and Kahneman (1974) called this phenomenon, in which adjusting estimates from different starting points results in judgments biased toward the starting values, anchoring. The occurrence of an anchoring phenomenon was demonstrated by a study in which two groups of high school students estimated, within 5 seconds, the numerical values of two expressions: 8 x 7 x 6 x 5 x 4 x 3 x 2 x 1 and 1 x 2 x 3 x 4 x 5 x 6 x 7 x 8. The first group estimated the descending sequence while the second estimated the ascending sequence. The median estimate for the descending sequence was 2,250 while that for the ascending sequence was 512. The correct answer was 40,320. Tversky and Kahneman (1974) explained that both groups obtained a rough estimate by computing the first few multiplications and made adjustments to obtain the final values. The starting point for the first group (assuming 3 quick multiplications were made) was 330 and that for the second group was 6. Because subjects anchored at vastly different starting points, they produced vastly different estimates. Tversky and Kahneman (1974) demonstrated with supporting evidence that this judgmental heuristic led to systematic and predictable errors.

An example of estimation based on an anchoring and adjustment heuristic was provided by Hackman (1982), who wrote that in higher education incremental budgeting and in building estimates, both for cost and time required, there were anchors from past-year figures or from initial estimates. "If a decision must be made about a greatly changed department's budget, the past budget amount will inexorably affect the new allocation" (p. 15).

An insight gained from this literature regarding the biases to which these judgmental heuristics lead has a direct bearing on the present study, which concerns subjective estimation of item statistics. For example, better estimates of item statistics may be achieved if judges are aware of the systematic errors due to a representativeness heuristic and are advised to make their judgment based on the characteristics of the population that will be taking the test rather than just on the characteristics of the people or students they are familiar with. Recognizing that it might not be easy for the judges to free themselves from the influence of a representativeness heuristic when making an estimation, a procedure may be developed to rectify the errors caused by this heuristic.

Expert Judgment Under Uncertainty

In addition to research on the psychology of judgment under uncertainty, a number of studies have focused on issues of expert judgment under uncertainty. Several investigators have studied the problem of individual versus consensus expert judgment. Winkler (1971), in a study in which college students and faculty assessed the probabilities of the outcomes of collegiate and NFL football games, found that consensus judgments are more accurate than the average of the individuals' judgments. Winkler (1968) suggested two general methods for arriving at a consensus subjective judgment.
One method is to allow each expert to revise his or her estimate after seeing the estimates of the remaining experts involved in the judging exercise, without actually meeting the experts themselves. The other method is to allow the experts to discuss the issues in a panel in order to arrive at a final judgment. One problem with the second method is, according to Fitzpatrick (1983), the normative effects of opinion exposure. Beach (1975) felt that with this method "the experts may falsify their opinions in the hope of swaying other experts toward their point of view" (p. 13). Berk (1986) suggested that this problem may be overcome by using independent expert estimates, instead of the consensus value, at the final stage of the process. These concerns regarding the issue of individual vs. consensus judgment and Berk's suggestion to overcome the problem seem logical. However, they should be regarded as tentative until they are verified empirically.

Another problem related to consensus judgment, as pointed out by Beach (1975), is that as a group people may make more extreme judgments than anyone in the group would make as an individual. However, this conclusion did not find support in a study by Goodman (1972), who found that of six groups involved in a study to compare the estimation of likelihood ratios by individuals and groups, four made more conservative estimates than the average of the individuals within the groups while the other two groups were less conservative.

The question of conservative versus extreme judgment is related to the error of central tendency in judgment. The error of central tendency, which was identified by Guilford as a source of error as early as 1954, refers to the tendency for people to avoid making extreme judgments (Guilford, 1954). In a study which predicted the difficulty of test items, Tinkelman (1947) found that there was a tendency for the judges to overestimate the difficulty level of easy items and to underestimate the difficulty level of difficult items -- a phenomenon of regression toward the mean of item difficulty.

The question of whether training will improve the accuracy of subjective estimation of probabilities of outcomes is an important one. In a review of the literature on expert judgment under uncertainty, Beach (1975) reported that research has shown that "feedback about their performance relative to the actual state of affairs, whether it is in terms of evaluation scores, point probabilities, or probability distributions, can lead to improvement in performance" (p. 17).

The literature on expert judgment under uncertainty has provided valuable information for the present study. For example, it is helpful to know that judgment based on consensus may give rise to a normative effect in which individual judges abandon their own judgment in favor of the group judgment. Berk's (1986) advice that judges be allowed to form their own independent estimates after group discussion would be appropriate for the present study. Better estimates may also be achieved if judges are advised to guard against the tendency for people to adjust their judgment toward the central value.

Determinants of Item Difficulty and Discrimination

While research on judgment under uncertainty provides theoretical insights into how systematic errors and biases can occur when people make predictions under uncertainty, a number of other studies have focused on the practical issues of what factors influence the difficulty level and the discrimination of test items.
Intrinsic and Extrinsic Determinants

Campbell (1961), in an attempt to isolate the major factors determining the difficulty of nonverbal items involving classification of geometrical figures, divided the factors into two main groups, intrinsic and extrinsic. Intrinsic determinants pertain to the mental processes that the item is intended to measure, and they include complexity, abstractness, and novelty of item contents, while extrinsic determinants are factors that affect the percentage passing the item but are unrelated to the mental process or processes measured by the item. These factors include unfamiliarity of item content, context of the item, and personality variables.

Item Difficulty Model

Scheuneman and Steinhaus (1987) suggested that intrinsic item difficulty be defined in terms of "the item content, context, characteristics or properties and the task demands set by the item which must be met by an examinee with an assortment of skills and abilities in order to produce a correct answer" (p. 2). It is clear that Scheuneman's idea of intrinsic item difficulty encompasses Campbell's (1961) intrinsic and extrinsic determinants of item difficulty. Scheuneman and Steinhaus (1987) embodied the issues of intrinsic and extrinsic determinants of item difficulty and item discrimination, and a host of other variables that affect observed item difficulty and discrimination, in a theoretical framework. This theoretical framework is summarized by an item difficulty model and an item discrimination model. The item difficulty model (the item discrimination model takes a parallel form) is:

    D_ig = θ_g + π_g + β_i + π_gβ_i + E_gi

where

    D_ig = observed difficulty of item i for group g,
    θ_g = the true ability of examinees in group g on the trait the test is intended to measure,
    π_g = other abilities and attributes that may be used by individual examinees in group g in meeting the task demand,
    β_i = the demand of the item on these different abilities and attributes,
    π_gβ_i = the interaction between the abilities not intended to be measured and the level of ability demanded by the task set by the item, and
    E_gi = error.

According to this model, the intrinsic difficulty of an item is a function not only of the ability demanded by the item task but also of all abilities and attributes which are not intended to be measured by the item but may be used by the examinee in meeting the task demand. More specifically, the intrinsic difficulty is represented by the components β_i and π_g in the model. The observed item difficulty, then, is affected by the true ability of the examinees θ_g, the intrinsic item difficulty, and the interaction π_gβ_i between the ability demanded by the item and the unintended abilities.

The value of the model is that it provides a framework for systematic investigations and/or discussions of variables affecting the intrinsic or the observed item difficulty. For example, the model reminds investigators that, apart from intrinsic item difficulty, examinees' ability is an important determinant of observed item difficulty. Various factors that affect item difficulty and item discrimination, such as item content, item context, item format and item complexity, can be examined under the category "Components of Item Task Demand". In discussing the effect of item content (a component of item task demand) on item difficulty, Scheuneman and Steinhaus emphasized the need to consider incidental demands on knowledge as a source of variation in item difficulty.
These incidental demands, in my opinion, have often been overlooked by both novice and professional item writers.

Intrinsic Determinants

It seems helpful to use Campbell's (1961) terminology to group the literature related to intrinsic item difficulty into (a) studies that deal with intrinsic determinants and (b) studies that investigate extrinsic determinants of item difficulty.

Item complexity. Campbell studied the effect on item difficulty of increasing or decreasing the complexity of geometrical classification items by varying the number of figural properties incorporated into the item. The items were administered to a total of 693 children, from 11 to 12.5 years of age, at ten different schools. The results did not support the a priori prediction that an increase in the number of figural properties will increase the item difficulty. Instead they showed that it was the nature of the properties that were used to classify the geometrical figures, rather than the complexity (as reflected by the number of figural properties), that influenced the difficulty of an item. It should be pointed out here that Campbell's finding is restricted to nonverbal items and it might not generalize to verbal items. Furthermore, the complexity of an item is likely to mean different things in other content areas.

In contrast to Campbell's finding, a number of studies have shown a significant positive association between item complexity and item difficulty. For example, Green (1983) found that item complexity was significantly associated with empirical item difficulty level. Pollitt, Entwistle, Hutchinson, and De Luca (1985) reported that the difficulty of an item was dependent on the complexity of the reasoning processes required to answer chemistry test items. However, a study by Crawford (1968) showed that when complexity of an item was defined by the level of intellectual process at which the item was written, there was no direct relationship between the complexity and the difficulty of items.

The conflicting findings indicate that there is disagreement as to what constitutes item complexity; Campbell (1961) defined complexity of a geometrical classification item as the varying number of figural properties incorporated into the item, whereas in Green's (1984) study, complexity of an item referred to the number of steps and amount of information required to answer the item. Crawford (1968) viewed complexity of an item as the level of cognitive ability required to process the task of the item. Pollitt et al. (1985) adopted yet another definition of complexity. Thus interpretation of the findings should be based on the definition of complexity of items perceived by the investigator.

Cognitive processes. In the literature, the results of studies investigating whether there is a relationship between the cognitive processes required to answer an item correctly and the item difficulty are quite mixed. For example, Malpas and Brown (1974) requested two judges to classify 720 General Certificate of Education ordinary-level mathematics items into one of two levels of cognitive demand, i.e., "concrete" and "formal", using criteria derived from Piaget's theory of cognitive development. They found that the classification category correlated significantly with the difficulty index of the item.
Simpson and Cohen (1985) found that items that could be answered by recalling course information, i.e., "knowledge" items, were significantly easier than items requiring reformulation of course information, i.e., "thinking" items. When difficulty was held constant, item discrimination was significantly greater for knowledge items. In contrast, Crawford (1968) reported that there was no significant relationship between the taxonomy of intellectual processes and item difficulty for the items in the Comprehensive Interdepartmental Examinations of the College of Medicine, University of Illinois. In this case, the taxonomy of intellectual processes was: knowledge, generalization, problem-solving of a familiar type, problem-solving of an unfamiliar type, and evaluation.

While the above two studies allowed cognitive processes to be confounded with content, Blumberg et al. (1982) studied the relationship between cognitive processes and item difficulty holding the content constant. In this case only three taxonomy levels were used, i.e., recognition of information, interpretation of data, and application of knowledge. The result showed that there was no significant relationship between taxonomic levels and item difficulty. In reviewing the literature on Bloom's taxonomy, Scheuneman and Steinhaus (1987) reported that research "has failed to demonstrate a clear link between these [cognitive] process variables and item difficulty" (p. 17).

More recently, with the revival of cognitive psychology, researchers began to focus on the cognitive components required to process item tasks as a source of variation in item difficulty. Cognitive components are more specific than the broad taxonomy schemes used by Bloom and others to classify mental processes. Research in this area has concentrated mainly on individual differences in the cognitive components employed for solving verbal analogy items. For example, Whitely (1980) found that, in the process of solving verbal analogy items, there were individual differences in (a) image construction, (b) response evaluation, and (c) event recovery. However, only image construction and response evaluation are related to item difficulty. In a study by Mitchell (1983), the items in the Word Knowledge and Paragraph Comprehension subtests of the Armed Services Vocational Aptitude Battery were rated on cognitive components: (a) perceptual processing, (b) executive processing, (c) short-term storage, (d) long-term storage of information structures, and (e) selection and execution of the response. These cognitive processes were found to be related to Rasch model item difficulty. Research in this area focused mainly on verbal items and individual differences. Hence it is less relevant to the present study.

Extrinsic Determinants

Campbell (1961) considered factors affecting item difficulty, but not related to the mental processes intended by the items, to be the extrinsic determinants of item difficulty. Numerous studies on the extrinsic determinants of item difficulty have been reported in the literature.

Item language. The findings of research on the effects of item language on item difficulty have been quite inconsistent. Millman (1978) used a computer program to generate items in elementary statistics that vary in linguistic presentation but test the same concept. The items were administered to one class of students who took the course. He found that linguistic variations were related to item difficulty.
However, because of the small sample size, he cautioned against generalizing the results. In an investigation using undergraduate students as subjects, Green (1983) found a low association between language difficulty and empirical item difficulty for 10 items selected from a 40-item test in Introductory Astronomy. In a subsequent study, Green (1984) varied the difficulty of item language by increasing sentence length and syntactic complexity, and by replacing more familiar terms with unfamiliar ones in the stem. The subjects were 990 students in 19 separate undergraduate classes at the University of Washington. The results showed that language difficulty of the stem did not affect item difficulty. The author suspected that the manipulation of language difficulty in the study was not effective. In reviewing several studies on item language, Green (1984) reported that variations in item language difficulty have been found to affect item difficulty with samples of young children, but with high school and college samples the results have not been consistent. She suggested that difficulty of language would have no effect on item difficulty once individuals have reached some criterion of verbal proficiency.

Content familiarity. Pollitt, Entwistle, Hutchinson, and De Luca (1985) attempted to determine the factors consistently associated with item difficulty beyond simple content considerations. They analyzed the top and bottom groups of 550 answer scripts of candidates taking the 'O' level Chemistry Examination in Scotland. Pollitt et al. found that if the answer to a chemistry test item was based on a knowledge of the properties of a chemical that was either unfamiliar to the candidates or could not be deduced from a set of more basic knowledge, the difficulty of the item would be high. For example, an item required the candidates to differentiate between the properties of sodium and the properties of barium. If the candidates did not know the properties of barium and these properties could not be deduced from a knowledge of the properties of other chemicals, the difficulty of the item would be very high. The conclusion of the study is consistent with the idea of "incidental demands of an item" introduced by Scheuneman (1987). In this case the item intended to test the knowledge of the properties of sodium. However, candidates could not answer the question unless they also knew the properties of barium. The findings of this study have direct applications for the present study, which requires judges to estimate difficulty levels of chemistry test items. Judges could be advised to focus on the possibility that answers to some of the items may require content knowledge unfamiliar to the examinees, resulting in items becoming excessively difficult.

Item format. Several studies have attempted to ascertain the effects of item format on item characteristics. Dudycha and Carpenter (1973) manipulated item orientation (positive vs. negative stem), structure (closed stem vs. open stem format) and option (presence or absence of an inclusive alternative). The items were administered to 1,124 students taking a regularly scheduled Introductory Psychology test. All three variables were found to significantly affect item difficulty -- negative stems were more difficult than positive stems, the open stem format more difficult than the closed format, and inclusive alternatives more difficult than specific alternatives. No interactions were found between these three independent variables.
However, only the "presence or absence of an inclusive alternative" factor significantly affected item discrimination. Items with specific alternatives discriminate better than items with inclusive alternatives. The results also showed an interaction between the "closed or open" factor and the "positive or negative" factor. The closed-positive and the open-negative formats were slightly more discriminating than the closed-negative and the open-positive formats.

Hughes and Trimble (1965) studied the effect of a complex distractor, "Both 1 and 2 above are correct", on item difficulty and item discrimination. The result showed that this type of complex alternative can increase item difficulty. However, its effect on item discrimination was not clear. More recently, in a series of studies comparing item characteristics for parallel multiple-choice items in three different content areas -- statistical terminology, measurement concepts and synonyms -- Tollefson and Chen (1986) concluded that items using "none of the above" as a correct answer have a higher difficulty level but are not more discriminating than items with a one-correct-answer format. Forsyth and Spratt (1980) found that multiple-choice items with "Not given" as an alternative are more difficult than items not using this alternative. Considering the studies together, the evidence seems to point to the conclusion that item format affects item difficulty and item discrimination.

Multiple-choice options. The effects of the characteristics of item options on item characteristics have been the subject of investigation of numerous researchers (Chase, 1964; Dudycha & Carpenter, 1973; Dunn & Goldstein, 1959; Millman, 1978). Homogeneity of options has been suggested by several authors of measurement textbooks as an important factor affecting an item's difficulty and discrimination index (Chase, 1974; Nitko, 1983). Ebel (1979) stated that "multiple-choice items can be made easier by making the stem more general and the responses more diverse; items can be made harder by making the stems more specific and the responses more similar" (p. 162). Green (1984), studying the effects of item characteristics on multiple-choice item difficulty, reported that when the options of an item became more convergent the item became more difficult. Pollitt et al. (1985) found similar results.

Research has verified the suggestions made in major measurement textbooks that length of options, use of technical options, and the extent of grammatical inconsistencies across stem and options influence the selection of correct answers (Ebel, 1979; Mehrens & Lehmann, 1984; Thorndike, 1971). For example, Strang (1977) found that nontechnical options were more often chosen as correct answers than were technical options, and long options were more often chosen as correct answers than were short options, regardless of whether the options were technical or nontechnical. Long nontechnical options were most often chosen as correct answers while short technical options were least often chosen. The result regarding the length of options is consistent with the findings of Dunn and Goldstein (1959), which showed that items containing extra-long correct alternatives are less difficult. The results of these two studies imply that items with an incorrect option which is longer than the others and is nontechnical will be more difficult, because examinees have a higher tendency to choose this type of option as the key. This finding provides an important clue for judging item characteristics.
Even though the majority of measurement textbooks make "stem-options grammatically consistent" a rule in item writing, the results of studies on the effect of grammatical inconsistencies between stem and distractors on item difficulty are mixed (Board & Whitney, 1972; Dunn & Goldstein, 1959). In a study on grammatical consistencies, Plake and Huntley (1984) reported that there was some evidence of differential sensitivity between males and females toward subtle cues in items which achieved vowel-consonant consistency between stems and the correct alternatives by adding "an" parenthetically to the article "a".

Item context. The literature of educational measurement is replete with research which attempted to ascertain the presence or absence of effects of item context on item statistics. In a review article of some 40 studies, Leary and Dorans (1985) grouped research on the effects of item context into three main categories: (a) single-factor, item order effects, (b) item order interaction effects, and (c) section placement effects. They concluded that there was evidence of item context effects. When tests were speeded, items arranged in easy-to-hard sequence resulted in better test performance than when items were arranged in hard-to-easy sequence. Under power conditions, random rearrangement of items or of sections of items measuring the same contents did not affect item difficulty. Another conclusion drawn by Leary and Dorans was that aptitude test items are more sensitive to item rearrangement than achievement test items. These findings suggest that when judges estimate the item statistics of an item, an important factor to consider is whether the test will be administered under speeded or power conditions and whether the item is likely to appear at the beginning or at the end of the test.

Thus research on what makes test items difficult has not resulted in a clear set of factors affecting item difficulty. A comparison of the methodologies used in this research reveals that from one study to another there is a great deal of difference in the sample sizes and the test instruments used, and in the way variables are defined and manipulated. Under such circumstances, inconsistent research findings are inevitable. In spite of this variability, valuable information can be gained from this research. The fact that various researchers are studying the effects of complexity, cognitive processes, item language, item format, item options and item context on item characteristics indicates that these variables are perceived by measurement researchers as having an influence on item statistics. In areas where the findings showed conclusive relationships between a particular factor and the item statistics, that variable could be incorporated either into the training program (to improve estimation skills) or into the design of the present study. In areas where the findings did not show a conclusive relationship between the variables and the item statistics, judges could still be sensitized to the possible influence of these variables on item statistics.

Empirical Evidence of Accuracy of Estimates

A number of studies have been carried out to investigate how consistently and accurately judges could estimate item statistics. Since this group of studies is similar to the present study, a more detailed review was made with the view of incorporating their findings and methodology into the design of the present study.
Estimation by Professional Examiners

Tinkelman (1947) investigated the degree to which item difficulty could be predicted prior to actual test administration. Thirty experienced and competent professional examiners of public personnel agencies estimated the percentage of candidates likely to answer correctly each of the 100 items in a multiple-choice test for selecting patrolmen. The empirical item difficulties of the items were determined from 1,000 answer papers selected at random from the 30,000 candidates to whom the test was administered. To find out how well each judge could estimate the difficulties of the 100 items, Tinkelman correlated the item difficulties estimated by each judge with the empirically determined item difficulties. He found that the judges could estimate the relative difficulties reasonably well (median correlation = 0.53; range from 0.23 to 0.77). The correlation between the pooled estimates and the empirical item difficulties was 0.76. To investigate the consistency of estimation, Tinkelman divided the test into two halves (one consisting of the 50 odd-numbered items and the other of the 50 even-numbered items) and compared, for each judge, the correlation coefficient computed for the estimates of the 50 odd items with that of the 50 even items. The result showed that judges were consistent in their judgments about relative item difficulties in the two sets of items. The investigator was aware that the two halves (i.e., the two sets of 50 items) might not be equivalent and that there was no time interval between the two estimations. However, these two factors affect the correlation coefficients of the two halves in opposite directions: non-equivalent halves increase the difference between the correlation coefficients, while no time interval reduces the difference. Thus the comparison should be treated with caution. Although judges could estimate the relative difficulties of items well, the investigator discovered that they tended to estimate difficulty toward the center of the scale; that is, difficult items were judged easier than their empirical difficulty and easy items were judged more difficult. This implied that judges were not able to judge the absolute item difficulties well. Tinkelman also investigated the group size of judges that would give optimum accuracy in estimating item difficulties. He selected judgment groups of varying sizes on the basis of the judges' ability to predict the relative difficulties of the set of 50 odd-numbered items. By comparing the accuracy with which these different judgment groups predicted the relative difficulties of the set of 50 even-numbered items, the investigator found that pooling the judgments from only the top three of the most competent judges provided predictions of relative difficulty as accurate as those provided by the entire group of judges. This finding has an important application for the present study: it is the competency and not the group size of the judges that determines the accuracy of estimation. If incompetent judges can be identified by a preliminary trial, improvement of judgment accuracy can be achieved by pooling only the judgments of the competent judges. The same study found no relation between item content and the relative accuracy of estimation. Different item content areas, however, were found to have different constant estimation errors (defined as the difference between the estimated and the empirical item difficulties). In
a similarly motivated study, Bejar (1981) investigated the accuracy with which four experienced professional test development staff at Educational Testing Service estimated item statistics of the items in the Test of Standard Written English (TSWE). In an attempt to increase the accuracy of estimation, a training component was included in the study. The results of the study showed that while the interrater reliabilities for item difficulties and item discrimination indices were very high (about 0.90), the correlation coefficients between the subjectively estimated item statistics and the empirically determined item statistics did not approach the level that would be required to substitute subjective estimation for field trial. This low level of accuracy of estimation could probably be attributed to the fact that the items used in the study attempted to measure writing skills. While in a subject matter like mathematics the difficulty of an item is largely determined by the mathematical operations required to solve the problem, the difficulty of an item measuring writing skills depends on a greater variety of factors. According to Bejar (1981), "it is probably not sufficient to determine what error is present in an ...item [measuring writing skills], for the semantic and syntactic context in which that error is presented may influence item statistics significantly" (p. 303). The question of what type of errors in what semantic and syntactic context will result in higher or lower item difficulty is extremely difficult to judge. The investigator suggested that more raters may be required to achieve a high level of correlation.

Improvement by Anchor Items

The result of the study by Tinkelman (1947), that judges could estimate the relative but not the absolute difficulties of items well, generated interest among other researchers in investigating this problem further. In an attempt to improve estimates of both relative and absolute item difficulty, Lorge and Kruglov (1952) hypothesized that estimates of item difficulty would improve if judges had knowledge of the difficulty of some similar items, and that additional information would make for greater consistency of the judges' estimates. Eight doctoral students rated the difficulties of 150 arithmetic items. Four of the judges were given the 150 items, with 30 items whose actual difficulties were known. The other four judges were given the same 150 items without information about the difficulty of any of the items. The judges were required to rank the items according to difficulty and then estimate the percentage of eighth-grade students passing each item. The product-moment correlation between estimated and actual item difficulties was computed for each group. No significant difference was found. Both groups were also found to overestimate the average difficulty of the items to the same extent. This means that the absolute item difficulties were not estimated more accurately by the group of judges who were given information about the difficulty of similar items than by the group who did not receive that information. The failure of the additional information (about some item difficulties) to improve the accuracy of estimation (both relative and absolute) could be because the judges were all doctoral students and were not especially oriented in the teaching of arithmetic or familiar with the ability of the population taking the test.
In a subsequent study, Lorge and Diamond (1954) defined competent judges as judges whose mean and standard deviation of difficulty estimates for a set of items approximated the empirical mean and standard deviation of the item difficulties. They found that providing "anchor items", i.e., items with known empirical item statistics, improved the accuracy of the estimates of the less competent judges to a greater extent than it improved the accuracy of the competent judges.

Improvement Using Experienced Teachers

In another study, Lorge and Kruglov (1953) used as judges persons experienced as teachers of mathematics at the high school level to estimate the absolute difficulty of mathematics items under two conditions: one in which the difficulties of a subset of items were given and the other with no information about item difficulties. The difference between the average estimated difficulty of the items and the average empirical difficulty was computed for each of the two conditions. The results showed that judges under both conditions underestimated the average difficulty of the items. However, the degree of underestimation was substantially smaller under the condition in which information about the difficulty of a subset of items was given. The result suggested that experienced teachers were able to make use of the additional information to improve the accuracy of their estimation. This finding offers a guideline for selecting judges for the present study: judges should be selected from experienced subject-matter teachers rather than just from people who have some knowledge of the subject matter.

Improvement by Rank Order Prediction

In an extension of the previous study, Lorge and Diamond (1954) assumed that (a) absolute item difficulties were normally distributed, (b) the correlation between the rank order for difficulty and the absolute difficulty in percent is 1.0, and (c) the mean and the standard deviation of the difficulties of the judged group of items were known, or could be estimated with very little error. They then demonstrated that a better estimate of the absolute difficulties of test items could be obtained by predicting from the average rank order assigned by judges than by averaging judges' estimates of the percentage likely to pass each item. The investigators also demonstrated a technique for estimating the mean and standard deviation of the difficulties by including a set of anchor items of known difficulties among the items to be judged. The technique involved extrapolation of the quartiles of the distribution of item difficulties and computation of the mean and standard deviation using the relationship between the median and the mean, and that between the semi-quartile range and the standard deviation. The study also found that providing judges with information about the item difficulties of a subset of the items, to anchor their judgment, improved the accuracy of estimation. This result was consistent with the findings of previous research. The technique used by Lorge and Diamond can be criticized in that the second assumption, that the correlation between the rank order for estimated difficulties and the empirical difficulties is 1.0, is too stringent. In situations where the correlation is not 1.0 and where the rank orders for the estimated difficulties of some of the items do not vary strictly with the empirical difficulties, the mean and the standard deviation of the distribution of the difficulties cannot be estimated. When this happens the technique cannot be applied.
Improvement by Elaborate Written Report

Quereshi and Fisher (1977), in an attempt to gain insight into the question of logical estimation of item difficulty, went beyond the Lorge-Kruglov approach by asking the judges to develop a written report elaborating the processes and the criteria they used to arrive at their estimates of item difficulties. Five judges who had completed two years of a graduate program in psychology and had experience in the administration and interpretation of psychological tests ranked 44 letter-series items and then rated the items on a 1 to 10 point scale (1 being the easiest). Pooled estimates of rank order and rating for each item were computed. The empirical rank order and rating of each item were obtained by administering the test to 186 undergraduates. Spearman rank correlations were computed between the subjectively estimated rankings and the empirical rankings of the items for each judge separately as well as for the pooled estimates of the judges. The same correlations were computed for the ratings of items as well. Interjudge consistency was then studied by computing the Pearson product-moment correlations among the ratings of the five judges and the Spearman rank correlations among the ranks. From the intercorrelation indices and the reports of judges describing the criteria on which they based their estimates, the investigators concluded that the accuracy of estimates depended on how elaborately a judge analyzed the structure and organization of the items. The finding of this study offers a strategy for improving judgment: judges should analyze the structure and organization of the items before making their estimates of item difficulties. The same strategy was also employed by Bejar (1981) in his study of subject matter experts' assessment of item statistics.

Studies Involving Subjective Estimation

A number of studies were related to the ability of judges to subjectively estimate the item difficulties for minimally competent examinees, especially in standard-setting research (Berk, 1986). Melican and Thomas (1984) used Angoff's (1971) Standard Setting Method to identify items whose difficulty levels are hard to estimate accurately. The results of the study suggested that the difficulties of items involving calculation as well as of items with negatively phrased stems were harder to estimate. In both cases judges tended to underestimate the difficulties. In a study using a method based on Nedelsky's (1954) and Angoff's (1971) models to determine the cut-off score for a certification examination, Bernknopf (1979) instructed the judges "to draw upon their experience to construct a hypothetical group of persons, each of whom, in their judgment, has the minimum amount of academic knowledge to perform effectively in the schools, and then to estimate the percentage of the candidates who would know the answer" (p. 8). The group's estimates of item difficulties were found to correlate highly with the empirical item difficulties. The method used by Bernknopf indicated that "appropriate experience" was a crucial element for the success of the approach. A study that attempted to partial out the effect of the content relevance of items from the accuracy of estimation of item difficulties was carried out by Ryan (1968). Fifty-nine secondary-level mathematics teachers estimated the difficulties of 50 multiple-choice mathematics items.
It was found that the ability of the teachers to estimate item difficulties was higher when the content of the items was covered in the instruction. When content relevance was partialed out, only in one of the four subtests was there a substantial decrease in the proportion of teachers having significant correlations (between estimated and empirical item difficulties). This showed that content relevance was not the only criterion on which judgment of item difficulty was based.

Poor Estimation by Unqualified Judges

That unqualified judges make poor estimates of item difficulties is evident in a study conducted by Willoughby (1980). Eight non-physicians independently rated 30 items from a medical examination for format, relevance, difficulty, discrimination, and overall quality on a scale of 1 to 5. Group estimates, for each dimension and for each item, were obtained by computing means across judges. The empirical item statistics were obtained from 345 medical students. No significant correlation between estimated and empirical item difficulties was found. However, there was a significant correlation between estimated and empirical item discrimination. In this study, there was no evidence that the judges had medical knowledge. Neither did they have knowledge about the characteristics of the population taking the test. Since item difficulty depends on both the intrinsic difficulty of the item and the characteristics of the population taking the test, it is not surprising that the estimated item difficulties did not correlate significantly with the empirical item difficulties.

Discussion and Summary

The literature related to the present study has suggested that three judgmental heuristics may be in operation when people make judgments under uncertainty. Judgments under uncertainty are influenced by the representativeness of events, the availability of instances, and the tendency to make adjustments from an intuitive starting point. Tversky and Kahneman (1974) stated that "these heuristics are highly economical and usually effective, but they lead to systematic and predictable errors" (p. 113). Research on expert judgment under uncertainty seems to suggest that consensus judgment may lead to normative effects in which individual members, under the influence of other group members, abandon their own estimates and accept group estimates. A suggestion to overcome this problem is to allow individual members to make independent estimates at the final stage of group discussions. The advantage of this suggestion is that group discussions provide the opportunity for the members to debate and to study the criteria for making judgments, while independent estimation allows individual judgments to contribute to the overall estimation. The tendency for people to avoid making extreme judgments was identified as a source of error. Research has shown that training, in terms of providing feedback about judges' performance relative to the actual values, will improve their performance. An understanding of the psychology of judgment under uncertainty is important for the present study, which requires judges to make subjective estimates of item statistics. Before the judges estimate the item statistics, it would be a good strategy to brief them about the systematic errors and biases that these judgmental heuristics can lead to, so that better estimates of item difficulty can be made. The findings of research on expert judgment also provide valuable information.
An implication of these findings is that group discussions to better understand or to identify the specific determinants of item statistics, followed by independent individual judgments of the estimates, would be a suitable procedure to adopt for the present study. The usefulness of providing feedback of the empirical item statistics during the training sessions (designed to improve estimation skills) has also been indicated by these research findings. Several determinants of item difficulty and item discrimination have been studied by various investigators. The determinants were broadly divided into (a) intrinsic determinants and (b) extrinsic determinants (Campbell, 1961). Intrinsic determinants include item complexity and the cognitive processes/components required to process item tasks, whereas extrinsic determinants include item language, content familiarity, item format, option homogeneity, grammatical inconsistency, option characteristics, and item context. Although the results of these studies have not been conclusive as to what factors affect item statistics, there seems to be evidence that complexity, the cognitive components required to process item tasks, content familiarity, similarity of item options, item format, and item context are closely related to item difficulty. However, the complexity of items was defined differently in different studies. Thus the definition of complexity should be examined carefully before the findings of the studies are applied to other situations. The literature has provided insights into the question of subjective judgment of item statistics in terms of possible factors affecting the accuracy of estimation. This information would be utilized in the present study to design and develop the intervention program to improve the accuracy of estimation. Forming small groups to discuss the possible impact of these factors on item statistics might sensitize the judges to the need to focus on these factors while making estimates. This may lead to more accurate estimates. A number of studies have shown that relative, but not absolute, item difficulties can be estimated well. Accuracy of estimation generally improved when the estimates of judges were pooled; pooling only the estimates of the competent judges provides a more accurate estimation than pooling the estimates of the entire group of judges. It was found that estimates tended to regress toward the mean and that the accuracy of estimation would improve if a subset of items with known item difficulties was provided to enable the judges to anchor their judgments. Lorge and Diamond demonstrated that prediction from the average rank order assigned by judges produced more accurate estimates than estimates obtained by averaging the judges' estimates. However, as mentioned in an earlier section of this review, this procedure requires a stringent assumption which is difficult to satisfy. Hence this method has not been adopted for the present study. Nevertheless, the design of the present study could take advantage of the fact that accuracy of estimation will improve if judges are given examples of similar items with known empirical item difficulties, i.e., the idea of anchor items has been incorporated into the design. Pooling the estimates of only the competent judges rather than the estimates of the entire group of judges is another step that can be taken to improve the accuracy of estimation. In the studies
of subjective estimation of item statistics reviewed here, with the exception of the study by Ryan (1968), there was no evidence that the judges had sufficient knowledge about the ability of the subjects who took the test. Psychometric theory has shown that item difficulty depends both on the intrinsic difficulty of the item and on the characteristics of the population taking the test. Thus, ignorance of the characteristics of the population taking the test could be a reason why the judges were able to estimate only the relative, but not the absolute, item difficulties well. In this connection, it seemed appropriate for the present study to use judges who were experienced subject teachers and who had been involved either in rating students' examination papers or in constructing test papers for public examinations to be taken by their own students. The findings that relate to the present study are summarized in Table 2.

Table 2.--Summary of Findings Related to the Study.

I. Judgment Under Uncertainty

1. Guilford (1954): There is a tendency for people to avoid making extreme judgments.
2. Winkler (1971): Consensus judgments are more accurate than the average of the individuals' judgments.
3. Tversky & Kahneman (1973, 1974): Demonstrated that the representativeness, availability, and anchoring-and-adjustment heuristics are employed by people to predict values; these heuristics may lead to systematic errors.
4. Beach (1975): Experts may falsify their opinions in the hope of swaying other experts toward their points of view; group judgments were more extreme than would be made by any member of the group as an individual.
5. Fitzpatrick (1983): Normative effects of opinion exposure may occur, which result in individuals abandoning their own judgments in favor of consensus.
6. Berk (1986): Suggested that a way to avoid the normative effect is to allow experts to make their own independent judgments after group discussions.

II. Determinants of Item Statistics

7. Campbell (1961): Classified determinants of item difficulty into intrinsic and extrinsic factors; intrinsic determinants pertain to mental processes; extrinsic determinants are factors affecting the difficulty of items but are unrelated to mental processes.
8. Scheuneman & Steinhaus (1987): Proposed an Item Difficulty Model which provides a framework for systematic investigations and discussions of determinants of item statistics.
9. Green (1983): The number of steps and the amount of information required to answer an item affect the difficulty of the item.
10. Pollitt et al. (1985): The difficulty of a chemistry test item depends on the complexity of the reasoning processes required to answer the item; examinees' familiarity with the content affects item difficulty.
11. Hughes & Trimble (1965): Complex distractors such as "Both 1 and 2 above are correct" increase item difficulty.
12. Crawford (1968), Malpas & Brown (1974), Simpson & Cohen (1985), Blumberg et al. (1982): Results regarding the relationships between cognitive processes and item difficulty were mixed.
13. Green (1984): Variations in item language have no effect on item difficulty once individuals have reached some criterion of verbal proficiency; when options become more convergent, the item becomes more difficult.
14. Dudycha & Carpenter (1973): Items with negative stems are more difficult than items with positive stems; the open-stem format is more difficult than the closed format; inclusive alternatives are more difficult than specific alternatives.
15. Strang (1977): Nontechnical options and long options are more often chosen as the correct answers.
16. Dunn & Goldstein (1959): Items with extra-long options as the correct answers are less difficult.
17. Leary & Dorans (1985): Concluded from a literature review that item context has a greater effect on items in a speeded test than on items in a power test; aptitude test items are more sensitive to item rearrangements than achievement test items.

III. Empirical Evidence of Accuracy of Estimates

18. Tinkelman (1947): Judges could estimate the relative, but not the absolute, difficulty of items well; judges tend to regress their estimates toward the center of the scale; it is the competency and not the group size of the judges that determines the accuracy of estimation.
19. Bejar (1981): Interrater consistency for item difficulties and item discrimination indices was high (about 0.9); the correlation coefficients between subjectively estimated item statistics and empirically estimated item statistics were low.
20. Lorge & Kruglov (1952): Judges without experience in teaching arithmetic were not able to make use of the information in the anchor items to improve the accuracy of estimation of arithmetic items.
21. Lorge & Kruglov (1953): Experienced high school mathematics teachers improved their accuracy of estimation for mathematics items when provided with anchor items.
22. Lorge & Diamond (1954a): Providing anchor items improved the accuracy of the estimates of the less competent judges more than it did those of the more competent judges.
23. Lorge & Diamond (1954b): With certain assumptions, better estimates of item difficulties could be obtained by predicting from the average rank order of items assigned by individual judges.
24. Quereshi & Fisher (1977): Accuracy of item difficulty estimation depends on how elaborately a judge analyzes the structure and organization of the items.
25. Melican & Thomas (1984): Difficulties of items involving calculation and of items with negatively phrased stems were harder to estimate.
26. Ryan (1968): Even though the ability of teachers to estimate item difficulties was higher when the content of the items was covered in instruction, content relevance was not the only criterion on which judgment was based.
27. Willoughby (1980): The item difficulties of 30 medical examination items estimated by 8 non-physicians did not correlate significantly with empirically determined item difficulties.

CHAPTER III

PROCEDURES AND DESIGN

The purpose of this study was to investigate how accurately experienced chemistry teachers could estimate the item statistics of the Chemistry test used in the Malaysian Certificate of Education Examination. The accuracy of estimation of the experienced teachers trained in estimation skills was compared with that of the experienced teachers not trained in estimation skills. Further, the questions of whether the accuracy of estimation is dependent on the content areas, the difficulty levels, the discrimination power, the cognitive levels, and the format of the items were examined. Finally, the accuracy of the teachers' estimates was compared with the accuracy of estimation obtained in a field-trial of the item pool.
This chapter includes a description of the sampling procedure, the subjects involved, the test materials used, the design of the study, the hypotheses to be tested, and the statistical procedures for testing the hypotheses of this study.

Sampling Procedure

Sampling of Teachers

A total of 30 teachers participated in this study. These teachers were chosen based on information which indicated that they had a Bachelor's Degree in Chemistry and were currently teaching, or had recently taught, the Chemistry examination classes. In addition, teachers with additional experience in examination work (serving on item-writing panels, as raters of the essay or practical components of the Chemistry Examination, or in administering the Chemistry Practical Examinations) were chosen in preference to teachers who did not have these experiences. Because of the high cost of the attendance allowance, subsistence allowance, and traveling expenses, the total number of teachers was restricted to an affordable number of 30 and to an area within the Federal Territory of Kuala Lumpur, where subjects could commute between their homes and the place of meeting. No single sampling frame was readily available. As a result, a list of 30 teachers who satisfied the conditions mentioned above was compiled from different sources, such as various lists of panel members, examiners, and teachers who had administered the Chemistry Practical Examination. The resulting list contained teachers who held in common the necessary academic qualifications but who differed in the number of years of experience teaching the subject, in the type of experience they had in examination work, and in gender and ethnicity. Only two had experience as item-writing panel members. This research required one treatment group and one control group. Dividing this final list of teachers into two groups using a simple random assignment procedure might not have resulted in two equivalent groups, especially with so small a sample. Thus a procedure similar to stratified random sampling was used. The two teachers who had considerable experience in item writing were of the same gender and ethnicity, and they were grouped as one stratum. The rest of the teachers were grouped into strata of the same gender and ethnicity. Teachers from each stratum were then randomly assigned to one of the two groups. In the first stratum, where there were only two teachers, a coin was tossed to decide their group memberships. In each of the other strata, slips of paper, each bearing the identification number of an individual, were placed in a container and mixed thoroughly. The required numbers of teachers were then drawn one at a time from the container and assigned to either the treatment group or the control group. The sampling process produced two lists of 15 teachers each. Official letters inviting the selected teachers to participate in an Item Statistics Estimation workshop were sent through the Principals of the schools where the teachers worked. The teachers in the treatment group were invited to attend a three-day training/workshop session, while the teachers in the control group were invited to a one-day workshop. The two workshops were scheduled in two separate weeks. The letters specified the nature of the workshop and stated that subsistence and traveling allowances would be paid accordingly. The teachers were requested to return a reply slip confirming their consent to participate in the workshop two weeks prior to the workshop.
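The within-stratum random assignment described above can be summarized in a few lines of code. The sketch below (in Python) is purely illustrative: the roster, stratum labels, and seed are hypothetical, and the random module stands in for the slips of paper drawn from a container; it is not a record of the assignment actually carried out. With an odd-sized stratum the split below favors the control group by one teacher, so in practice the draws would be balanced across strata to reach 15 teachers per group.

import random

def assign_within_strata(teachers, seed=None):
    """Randomly split teachers into treatment and control within each
    (gender, ethnicity) stratum, mirroring the slips-of-paper draw."""
    rng = random.Random(seed)
    strata = {}
    for name, gender, ethnicity in teachers:
        strata.setdefault((gender, ethnicity), []).append(name)

    treatment, control = [], []
    for members in strata.values():
        rng.shuffle(members)             # analogous to mixing the container
        half = len(members) // 2
        treatment.extend(members[:half]) # first half drawn goes to treatment
        control.extend(members[half:])   # the remainder goes to control
    return treatment, control

# Hypothetical roster: six teachers in two gender-by-ethnicity strata.
roster = [("T01", "F", "A"), ("T02", "F", "A"), ("T03", "M", "A"),
          ("T04", "M", "A"), ("T05", "F", "B"), ("T06", "F", "B")]
treatment, control = assign_within_strata(roster, seed=7)
print(treatment, control)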
The relevant data concerning the teachers in the treatment and the control groups are shown in Appendix A.

Sampling of Students

The part of the study concerning discrimination index estimation required an item analysis with a small sample of about 100 students. (A distinction should be made between the sample of students used in the regular item analysis research carried out by the Examinations Syndicate in the process of constructing the tests and the small sample of students used specifically for this study; the regular item analysis research has a much larger sample size.) In this study, two forms of test items were used. Each form was to be tried out with about 100 current Chemistry examination class students. Two schools with medium performance in the subject of Chemistry in the Malaysian Certificate of Education (MCE) Examination were chosen. Medium performance schools were defined as schools whose percentage of passes in MCE Chemistry is similar to the national average. Both schools were located on the outskirts of the city of Kuala Lumpur. One of the schools provided 101 students and the other provided 118 students.

Sample Items

Two "Chemistry Paper 1's" of the Malaysian Certificate of Education (MCE) Examination were used for this research. The criteria used for choosing a particular paper were the availability of the parameter values of the item characteristics and of the item-pool estimated values of the same. The test paper is a 75-minute test consisting of 40 multiple-choice items, each with 5 options. Of the 40 items, about 25 are of the single-answer multiple-choice type and 15 are of the multiple-answer multiple-choice type. An example of each type is shown in Table 3.1. The two test papers used in this study measure the same content areas of chemistry. The internal consistency reliabilities (KR20) of the tests were 0.917 and 0.909, and their standard deviations were 9.324 and 8.769. The population values of the item characteristics were routinely computed in the post-analysis of the tests by the Examinations Syndicate. These population values were available for the present research. Since the focus of this study was on the estimation of item statistics rather than on students' performance, it is more appropriate to use the term "forms" rather than "tests" to refer to the groups of items used. Thus the two tests are referred to as Form A and Form B.

Table 3.1.--Examples of Types of Items.

SINGLE-ANSWER MULTIPLE-CHOICE ITEM

The chloride of metal M has the formula MCl and potassium phosphate has the formula K3PO4. What is the formula of metal M phosphate?

A  MPO4    B  M(PO4)3    C  M3PO4    D  M2PO4    E  M2(PO4)3

MULTIPLE-ANSWER MULTIPLE-CHOICE ITEM

Direction:
A: I, II, III only    B: I, III only    C: II, IV only    D: IV only    E: I, II, III, IV (all four)

H2(g) + I2(g) ⇌ 2HI(g)    Heat change = negative

Which of the following changes will increase the yield of hydrogen iodide in the above equilibrium system?

I   Add more hydrogen into the system.
II  Reduce the temperature of the system.
III Remove hydrogen iodide from the system.
IV  Increase the pressure of the system.

Design

This research utilized an experimental design. Two equivalent groups, each consisting of 15 volunteer experienced chemistry teachers, were formed by a procedure similar to stratified random sampling. One group was assigned to the treatment condition and the other served as a control. The treatment group received training in estimation skills for two days and the control group was not trained.
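The item statistics at the center of this study -- the p-value (proportion of examinees answering an item correctly) and the point-biserial correlation between the item score and the total test score -- together with the KR20 reliabilities quoted above, are all routine computations on a scored 0/1 response matrix. The following sketch is only meant to fix the definitions in computational form; the data are invented, and the code is not the Examinations Syndicate's post-analysis program.

import math

def item_statistics(responses):
    """Compute per-item p-values and point-biserials, plus KR-20.
    responses: list of examinee score vectors (1 = correct, 0 = wrong)."""
    n = len(responses)                 # number of examinees
    k = len(responses[0])              # number of items
    totals = [sum(r) for r in responses]
    mean_t = sum(totals) / n
    sd_t = math.sqrt(sum((t - mean_t) ** 2 for t in totals) / n)

    p_values, point_biserials = [], []
    for j in range(k):
        item = [r[j] for r in responses]
        n_correct = sum(item)
        p = n_correct / n              # item difficulty (p-value)
        if 0 < n_correct < n and sd_t > 0:
            mean_correct = sum(t for t, x in zip(totals, item) if x) / n_correct
            r_pb = (mean_correct - mean_t) / sd_t * math.sqrt(p / (1 - p))
        else:
            r_pb = 0.0                 # undefined when an item has no variance
        p_values.append(p)
        point_biserials.append(r_pb)

    kr20 = (k / (k - 1)) * (1 - sum(p * (1 - p) for p in p_values) / sd_t ** 2)
    return p_values, point_biserials, kr20

# Tiny made-up response matrix: 5 examinees by 4 items.
data = [[1, 1, 0, 1],
        [1, 0, 0, 1],
        [1, 1, 1, 1],
        [0, 0, 0, 1],
        [1, 1, 0, 0]]
p, r_pb, kr20 = item_statistics(data)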
Both groups estimated the item statistics -- the p-value, and the point-biserial correlation coefficient between the item score and the total test score -- for the two forms of Chemistry test items. As mentioned in the previous section, each form contained 40 multiple-choice items.

Procedure

Three days were scheduled for the training/workshop session for the treatment group. The teachers reported to a panel room in the Examinations Syndicate, Ministry of Education, Malaysia, at 8.00 a.m. on each day of the scheduled training/workshop session. The first two days were used to "train" or to help these teachers to develop item statistics estimation strategies and skills. The whole of the third day was reserved for the teachers to actually estimate the item statistics.

Training. The training session consisted of two components, one theoretical and one practical. It involved the following steps:

A. Theoretical component. The purpose of and rationale for carrying out the research were first introduced to the teachers. The implications of the success of the project, such as the possibility of the method developed in this research being applied in future estimation procedures, were also explained. The teachers were requested to try their best and to use the first two days of the training/workshop session to develop some strategies and skills for item statistics estimation. The training session then followed this sequence:

a. Definitions of item statistics were first explained to the teachers. An example in which statistics were used to describe the characteristics of an object or a person was given. The definitions of item difficulty and item discrimination index were then introduced. The formulae for computing these item statistics, i.e., the p-value and the point-biserial correlation coefficient, were also explained and illustrated with examples. A handout to enhance the teachers' understanding of these definitions was prepared and distributed. A copy of the handout is included in Appendix B. The concept of the item discrimination index, which was difficult to understand from just examining the formula for a point-biserial correlation coefficient, was illustrated by a computation of the D-index. By definition, the D-index is the difference in the proportion of correct responses between the group scoring in the top 27 percent on the total test and the group scoring in the bottom 27 percent on the same test (Ebel, 1979; p. 376). To simplify computations, the top and bottom 25 percent were used in the example instead of the top and bottom 27 percent. However, teachers were reminded that although the D-index is similar to the point-biserial correlation coefficient in terms of the concept of discrimination, the two indices are quite different in terms of computation.

b. The teachers were informed of the systematic errors and biases caused by the judgmental heuristics described by Tversky and Kahneman (1974). The three judgmental heuristics -- Representativeness, Availability, and Anchoring & Adjustment -- were briefly discussed. The teachers were cautioned against basing their estimation on adjustments to unrealistic initial estimates. They were also advised to be aware of the tendency for estimates to regress toward the mean of item difficulty.

c. The determinants of item characteristics, such as item complexity, option homogeneity, and familiarity of item content, identified by various researchers were discussed.
Specifically, a summary of research findings on the determinants of item difficulty (Table 3.2) was explained and discussed, and each member of the group was provided with a copy of the summary for reference. After the discussion of the research findings related to the determinants of item statistics, the teachers were divided into three groups to study and analyze a set of 10 sample items with known p-values and point-biserial correlation coefficients, with the aim of discovering what factors determine the values of the item statistics of these items.

Table 3.2.--Summary of Research Findings on Item Difficulty.

General Findings

1. Judges are able to estimate relative item difficulty well, i.e., judges are able to rank items according to difficulty level well.
2. Judges either consistently underestimate or overestimate item difficulty levels, i.e., they tend to make a "constant error" of estimation.
3. Judges tend to judge difficult items easier and easy items more difficult than their actual difficulty levels.
4. Accuracy of estimation of item difficulty levels can be increased when the judges are given "anchor" items to guide them in their estimation.
5. People estimate the value of a variable by assuming a rough initial value and making adjustments to yield the final estimate. The adjustments made are, on most occasions, insufficient.
6. Anchor items can prevent judges from adopting erroneous initial values which may lead to inaccurate estimates.
7. The accuracy of estimation of item difficulty depends on how elaborately a judge analyzes the structure and organization of the items.

Determinants of Item Difficulty

8. When "complexity" of an item is defined as the number of steps and the amount of information required to answer the item correctly, complexity has a direct relationship with item difficulty.
9. If the answer to a chemistry test item is based on knowledge of the properties of a chemical that is either unfamiliar to the candidates or that cannot be deduced from more basic knowledge, the difficulty of the item will be high.
10. Items that require complex reasoning with "unknown" or several reagents are more difficult.
11. Items are difficult if the syllabus content on which the items are based involves concepts that are difficult for the students to grasp.
12. Items that require incidental knowledge or obscure facts are more difficult.
13. "Deductive" problems which involve novel/unusual/new situations tend to be more difficult.
14. An item is harder if the options are more homogeneous and easier if the options are more heterogeneous.
15. If an item has more than one basis for choosing the correct answer, it is easier.
16. An item is easier if the stem is general and the options are diverse.

These 10 sample items were selected from items used in Chemistry test papers during the previous 5 years. They represented items from different content areas, with a wide range of p-values and point-biserials. These 10 sample items are referred to as the "anchor" items in this research. Each item was printed on a 5 x 8 pink paper. Besides the item proper, three statistics -- the rank order of the item difficulty (within this set of items), the p-value, and the point-biserial of the item -- were also printed on each of these 5 x 8 pink papers.
A leader was appointed in each of the three subgroups and was given the responsibility of leading the discussions and recording the findings of his/her group regarding the determinants of the item statistics of the "anchor" items. This studying, analyzing, and discussing of the determinants of item difficulty and item discrimination was guided by the summary of the research findings mentioned above. After each group had completed its list of determinants of item statistics, a discussion involving the total group was held, this time led by the investigator. In the total group discussion, each subgroup leader was requested to explain to the total group how his/her subgroup arrived at a particular determinant. Other members were encouraged to voice their opinions as to whether they agreed or disagreed with the suggested determinants and to give reasons for their opinions. An integrated list of determinants of item difficulty was prepared at the end of the total group discussions. Each member was also provided with a copy of the list for reference when they practiced estimating item statistics, which was the next process in the training program.

B. Practical component. Four sets of items were used for practicing item statistics estimation. The first three sets consisted of 5 items each and the last set contained 10 items. These items were also selected from the previous 5 years' Chemistry test papers. The item parameter values (i.e., item p-values and point-biserials computed from the post-analysis of the items based on the total population of about 40,000 students) of these items were available. The items as a whole were selected to be representative of the content areas of the syllabus and to have a wide range of p-values and point-biserials. These items are referred to as "practice" items in this study. Each of them was also printed on an item card, as was done for the "anchor" items described in the previous section. On each of the item cards, below the item proper, three small boxes were drawn and labeled for the teachers to fill in their estimates of the rank order of item difficulty (of the particular item in the set of items given), the p-value, and the point-biserial of the item.

The practice session proceeded as follows:

a. Each teacher was first given a set of 5 "practice" items and was told that they would be required to practice item statistics estimation shortly. Before the estimation practice began they were advised to recall what they had learned about the definitions of p-values and point-biserials, and about the research findings on item statistics estimation. They were also asked to review the list of determinants of item statistics that the group had discussed and finalized the previous day. They were then requested first to rank the items in order of either increasing or decreasing difficulty, and after that to estimate the item p-value and item point-biserial for each item in the set. The estimates given by each teacher for each item in the set were recorded on the blackboard. The parameter values of the item difficulty and item discrimination index of the items were also recorded on the blackboard. Teachers were advised to compare their own estimates with the parameter values and, for those items which they over- or underestimated, to try to find out why that happened. Teachers who could
Teachers who could 72 estimate the item statistics within : 0.05 of the parameter values for 4 out of the 5 items were identified, and were invited to explain to the whole group how they went about estimating the item statistics. With these new insights, the list of determinants of item statistics was studied again as a group to determine how they could be applied in the actual estimation process. The procedure described in Step (a) was repeated with two more sets of five items each. Each practice was followed by feedback and discussions of the accuracy of estimation. Strong emphasis was put on the importance of the teachers' reflections on their errors of estimations, and their efforts to improve estimation skills. The final practice involved a set of 10 items. The procedure was the same except that more items were involved (10 items as compared to 5 items in the previous three practices). At the end of this final practice, refinements on the list of determinants of item statistics was also made. This refined list was to be used by the teachers when they worked on the actual estimation. A copy of the list was included in Appendix C. 73 Estimating The actual estimation of item statistics by the treatment group was scheduled on the third day of the training/workshop session. Each teacher was given two sets of items. Each set contained 50 items. Each item was again printed on an item card. The first set (call this Set A) consisted of the 40 items from Form A and 10 "equating" items. The second set (Set B) consisted of the 40 items from Form B and the same 10 "equating" items used in the first set (the description of Forms A & B was given in the section on "sample items"). The 50 items in each set were numbered from 1 to 50 with the equating items appearing in every interval of 5. The 10 "equating" items were chosen from past years' Chemistry test papers to represent different topics in the syllabus and to have a wide range of item difficulty levels and item discrimination values. The first set of items were printed on green item cards and the second set on yellow. Two spaces were provided and labeled below each item for the teachers to record their estimates of the item statistics. In addition to the two sets of items, each teacher was provided with a separate set of 10 "anchor" items, each printed on a pink card. These 10 "anchor" items were the same 10 "anchor" items used in the training session. As mentioned in the earlier section, the parameter values of item difficulty and item discrimination of each "anchor" item were also printed on the card bearing the item. The 74 "equating" items have the same properties as the "anchor" items: the only difference between them was that the item statistics of the anchor items were made known to the teachers while the item statistics of the equating items were not known to the teachers. The actual estimating process was divided into two parts: one for item difficulty and the other for discrimination index. Item difficulty. Before teachers began their estimation exercise, they were requested to first review the list of determinants of item statistics that they had prepared the previous day, and then the 10 anchor items. They were then required to estimate the item difficulty of the items in Set A first and then items in Set B. Before being distributed to the teachers, the items in each set were shuffled so that their placements in the set would be random. 
Each set was estimated separately, that is, the teachers did not refer to the estimates given to the first set while working on the estimation of the items in the second set. As a strategy, they were advised to read each item carefully, to answer it (the answers to all the items were provided on a separate sheet), and then to place the item in one of three categories: easy, medium, and difficult. The items in each of the three categories were further divided into three more categories according to difficulty. No constraints were put on the final number of difficulty 75 categories used and the number of items to be placed in each category. Teachers made their own decisions as to how many categories they needed to help them in estimating the item difficulty. After the teachers had grouped the items into different difficulty categories, they were advised to review their categorization to see whether any relocation of items into other categories was necessary. This was followed by giving an estimate of the p-value for each item in each category. They were reminded to refer to the anchor items to guide them in the estimation process. Item discrimination indexu .A review of the literature provided no evidence to indicate that judges were able to estimate accurately the discrimination indices of test items. The review also revealed that very little research on factors influencing the accuracy of subjective estimation of item discrimination has been carried out. Furthermore, the mental processes involved in subjective estimation of item discrimination indices are much more complex than those involved in item difficulty estimation. Thus there are reasons to believe that it will be extremely difficult for the teachers to make accurate estimation of item discrimination index. It is also reasonable to believe that estimating the range within which the population value of the item discrimination would be expected to lie will be easier than estimating the specific value of the discrimination index. It is for this reason that teachers 76 were requested to estimate the pagge rather than the specific vaiue of the item discrimination index (in this study the point-biserial) for each item, igg., each teacher estimated the highest and the lowest values of the range within which the point-biserial of the item would be expected to lie. Once the teachers have estimated the rm; of the point-biserial, the next step is to identify a procedure for obtaining a ppipp estimate of the point-biserial from the estimated range. The Bayesian framework provides such a procedure. However, the Bayesian approach (described in Appendix D) requires, in addition to the estimated range, an empirical estimate of the point-biserial from a small sample of students. In the Bayesian terminology, the range (of point- biserial) estimated by each teacher can be considered as the prior information of the point-biserial distribution. The Bayesian approach then combines the prior information with the information present in the data obtained from the small sample of students to produce a posterior distribution. A final estimate of the point-biserial (for each item) was obtained from the posterior distribution. To obtain an empirical estimate of the point- biserial required in the Bayesian approach, the items were tried out in two schools which provided a total of 219 students for the exercise. Each item was tried out on about 77 110 students. The try out procedure was described as follows. 
All of the items, except the equating items, were assembled into two booklets, i.e., two forms: Form A and Form B. Each form contained 40 items. The forms were arranged in an ABABAB... sequence and were distributed in that order to the students who took part in the field trial. This design (equivalent to matrix sampling) ensured that the two forms were administered to two equivalent groups of students from the two selected schools.

Estimation by the Control Group

The teachers in the control group were invited to attend the workshop for one day. They were given an introduction to the rationale of the research, an explanation of the definitions of the item statistics, and an illustration of the computation of these item statistics, in exactly the same way as was done for the treatment group. However, no training was given to them. The teachers were not informed that they served as a control group. They were requested to estimate the item statistics using the same strategies as described for the treatment group, that is, to group the items into categories of the same item difficulty and then estimate the p-values of the items in each category. They were also reminded to make use of the anchor items to guide them in their estimation. Since no training was given, no information about the determinants of item statistics was imparted to the teachers in the control group. The items whose item statistics were to be estimated by the control group were exactly the same as those estimated by the treatment group.

Dependent Variables

There were two dependent variables in this research: one was the accuracy of p-value estimation and the other was the accuracy of point-biserial estimation.

Accuracy of p-value estimation. The accuracy of p-value estimation of an item was defined as the absolute value of the difference between the equated p-value estimate of the item and the population p-value of the item. The procedure for obtaining the equated p-value estimate is presented here. As described in the section on estimating, teachers were required to estimate the p-values of the items in Forms A and B as well as of the equating items embedded in each form. The information present in the equating items (i.e., the population and the estimated p-values) was used to derive the equated p-value estimate for each item. The population p-values of the equating items were first converted to normal deviates by the normalizing procedure. These normal deviates were then transformed linearly to a scale with a mean of 13 and a standard deviation of 4 (similar to the Delta-scale used at ETS). The estimated p-values of these equating items were also converted to the Delta-scale. For each teacher, a best-fitting line (using the least-squares criterion) was computed to represent the linear relationship between the population and the estimated p-values. This line was then used to equate the estimated p-values to the population p-values. Samples of scatterplots showing the relationship between the estimated and the population Deltas of the equating items are shown in Appendix E. The equating process for a given teacher in the treatment group is illustrated as follows.

An illustration of the equating process. Table 3.3 contains the population p-values for the 10 equating items embedded in Form A, and the p-values estimated by a teacher (call this teacher T1) for the same 10 equating items.
The estimated and the population p-values were converted to normal deviates using the Table of the Unit-Normal Distribution (Glass & Hopkins, 1984, p. 522). The relationship between the p-value and the normal deviate is illustrated in Figure 3.1. The normal deviates are the z-scores corresponding to the values of (1 - p) in a normal distribution. The normal deviates therefore have a mean of 0 and a standard deviation of 1.

Table 3.3.--Estimated and Population P-values for the Equating Items.

Equating Item         1    2    3    4    5    6    7    8    9   10
Population p-value   .82  .75  .66  .58  .52  .44  .40  .32  .32  .23
Estimated p-value    .88  .39  .69  .81  .45  .33  .38  .46  .32  .32

[Figure 3.1.--Relationship between P-value and Normal Deviate. The figure shows a unit-normal density with the normal deviate scale running from -3 to +3 and the p-value corresponding to the area above the deviate.]

Each of the normal deviates corresponding to a particular p-value was transformed to a Delta scale with a mean of 13 and a standard deviation of 4, using the following equation:

    Delta value = 4 x normal deviate + 13                                (3.1)

For example, if the p-value of an item is .60, then its normal deviate is -0.253. Substituting this normal deviate in Equation (3.1) gives a Delta value of 11.99 (i.e., 4 x (-0.253) + 13 = 11.99). The estimated and the population Delta values of the equating items are displayed in Table 3.4, and the scatterplot of these two sets of Deltas is given in Appendix E (Figure E1).

Table 3.4.--Estimated and Population Delta Values of the Equating Items.

Equating Item         1     2     3     4     5     6     7     8     9    10
Population Delta     9.3  10.3  11.3  12.2  12.8  13.6  14.0  14.9  14.9  16.0
Estimated Delta      8.3  14.1  11.9   9.5  13.5  14.8  14.2  13.3  14.9  14.9

Using the usual regression analysis procedure, the regression equation for the data in Table 3.4 was found to be:

    Equated Delta = .634 x Estimated Delta + 4.77                        (3.2)

Equation (3.2) was used to "equate" (for Teacher T1) the estimated p-values of the items in Form A to the population p-values. However, before the "equating" could be carried out, the estimated p-values of the items in Form A were first transformed to the Delta scale, using the procedure described above, i.e., converting the p-values to the corresponding normal deviates, which were then transformed to Delta values using Equation (3.1). The resulting estimated Deltas were then entered in Equation (3.2) to obtain the corresponding equated Deltas. For example, if the estimated Delta values of Item 8 and Item 16 were 10.9 and 12.9 respectively, then substituting these estimated Deltas in Equation (3.2) produced the corresponding equated Deltas as follows:

    (1) Estimated Delta = 10.9, Equated Delta = .634 x 10.9 + 4.77 = 11.7
    (2) Estimated Delta = 12.9, Equated Delta = .634 x 12.9 + 4.77 = 12.9

Finally, the equated Deltas were transformed back to the p-value scale using the Unit-Normal Distribution Table. The resulting p-values are referred to as the "equated p-values."

Accuracy of point-biserial estimation. The estimate of the point-biserial involved the combination of two pieces of information, one from the range of the point-biserial correlation estimated by the teachers and the other from the observed point-biserial obtained in the small-sample field trial. The Bayesian approach (see Appendix D) combined these two pieces of information to obtain an estimate of the point-biserial. The accuracy of point-biserial estimation was defined as the absolute value of the difference between the estimated point-biserial and the parameter value of the point-biserial of the item.
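For concreteness, the Delta transformation and the equating step illustrated above (Equations 3.1 and 3.2) can be summarized in a short computational sketch. The fragment below is only an illustration and not the procedure actually programmed for the study; the function names are introduced here for exposition, and numpy and scipy's unit-normal functions are assumed to be available.

```python
import numpy as np
from scipy.stats import norm

def p_to_delta(p):
    """Delta = 4 * z + 13, where z is the unit-normal deviate for (1 - p) (Equation 3.1)."""
    return 4.0 * norm.ppf(1.0 - np.asarray(p)) + 13.0

def delta_to_p(delta):
    """Transform a Delta value back to the p-value scale."""
    return 1.0 - norm.cdf((np.asarray(delta) - 13.0) / 4.0)

# Population and teacher-estimated p-values of the 10 equating items (Table 3.3).
pop_p = np.array([.82, .75, .66, .58, .52, .44, .40, .32, .32, .23])
est_p = np.array([.88, .39, .69, .81, .45, .33, .38, .46, .32, .32])

# Least-squares line relating the estimated Deltas to the population Deltas.
# For the Table 3.4 data this reproduces, up to rounding, the slope (.634) and
# intercept (4.77) of Equation (3.2).
slope, intercept = np.polyfit(p_to_delta(est_p), p_to_delta(pop_p), deg=1)

def equate(est_p_item):
    """Equate a teacher's estimated p-value for an item to the population metric."""
    equated_delta = slope * p_to_delta(est_p_item) + intercept
    return delta_to_p(equated_delta)
```

In this sketch, calling equate on a teacher's estimated p-value carries out, in one step, the conversion to the Delta scale, the application of the teacher's equating line, and the back-transformation to the p-value scale.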
Transformation of the Dependent Variables

The frequency distributions of the dependent variables defined above were found to be positively skewed (Figures 3.2 & 3.3). Since the hypotheses to be tested in this study are based, in part, on the assumption of normal distributions of the dependent variables, the original dependent variables were changed, through the following logarithmic transformation, to a metric in which the distributions were more nearly normal (let Y and R be the original dependent variables and Y* and R* the transformed variables):

    Y* = -ln(.05) + ln(.05 + Y)
    R* = -ln(.05) + ln(.05 + R)

The histograms for the two transformed variables, Y* and R*, are shown in Figures 3.4 & 3.5. The transformed variables were used as the dependent variables in the hypothesis testing.

[Figure 3.2.--Frequency distribution of the original accuracy of p-value estimates (variable: original Y). Summary statistics: Mean = .120, Std Err = .002, Median = .105, Mode = .105, Std Dev = .093, Variance = .009, Kurtosis = .553, S E Kurt = .100, Skewness = .984, S E Skew = .050, Range = .492, Minimum = .000, Maximum = .492, Sum = 288.138.]

[Figure 3.3.--Frequency distribution of the original accuracy of point-biserial estimates (variable: original R). Summary statistics: Mean = .094, Std Err = .022, Median = .073, Mode = .010, Std Dev = .076, Variance = .006, Kurtosis = 2.170, S E Kurt = .103, Skewness = 1.280, S E Skew = .052, Range = .522, Minimum = .000, Maximum = .522, Sum = 209.548.]

[Figure 3.4.--Frequency distribution of the transformed accuracy of p-value estimates (variable: transformed Y). Summary statistics: Mean = 1.078, Std Err = .011, Median = 1.133, Mode = 1.133, Std Dev = .548, Variance = .300, Kurtosis = -.792, S E Kurt = .100, Skewness = -.045, S E Skew = .050, Range = 2.383, Minimum = .000, Maximum = 2.383, Sum = 2587.090.]
[Figure 3.5.--Frequency distribution of the transformed accuracy of point-biserial estimates (variable: transformed R). Summary statistics: Mean = .926, Std Err = .011, Median = .896, Mode = .183, Std Dev = .507, Variance = .257, Kurtosis = -.652, S E Kurt = .103, Skewness = .073, S E Skew = .052, Range = 2.437, Minimum = .000, Maximum = 2.437, Sum = 2074.955.]

Generalizability of Results

Since the subjects were restricted to those Chemistry teachers who have experience in teaching the examination classes, in rating examination papers, and/or in writing Chemistry test items, and to those who were teaching in the Federal Territory of Kuala Lumpur, the generalizability of the results of this study to other teachers and to items in other subject matters is limited. However, the main concern of this study is not the generalizability to other teachers, but the comparison of the teachers' estimation accuracy of item statistics with the estimation accuracy obtained in the field trial of the item pool, and also the comparison of the estimation accuracy of those teachers who received treatment to improve estimation skills with those who did not receive treatment. The nonrandom selection of teachers did not affect the comparison of the estimation accuracy of the treatment group with that of the control group, because random assignment of subjects had formed two equivalent groups and treatment occurred over a short period of time. In this situation, possible threats to the internal validity (Campbell & Stanley, 1963) of the study such as history, maturation, statistical regression, differential selection, experimental mortality, selection-maturation interaction, and experimental treatment diffusion would be under control. Since no pretest was involved, the threats due to pretesting and measuring instruments did not affect the validity of the study. The fact that the teachers in the treatment group could be identified and included in future item statistics estimation exercises made the need for generalizability of the results to other teachers less crucial.

Hypotheses

The major hypotheses of the study are presented here. Each hypothesis is stated in the alternative hypothesis form. Under hypotheses H1a to H1e and H3a to H3e, statistical tests for interactions between the treatment effect, form, and the respective factor (e.g., content area, cognitive level of item, item type) stated in the hypothesis were carried out, and if significant, the interactions would be taken into consideration in the interpretation of the main effects. Wherever appropriate, Tukey's method of multiple comparisons was also carried out to find out which pairs of factor levels were different.

H1: There is a difference in the accuracy of item difficulty estimation of experienced teachers trained in estimation skills and experienced teachers not trained in estimation skills. The difference favors the teachers trained in estimation skills.
H1a: There is a difference between items of different content areas in the accuracy of p-value estimation by the experienced teachers.

H1b: There is a difference between items of different cognitive levels in the accuracy of p-value estimation by the experienced teachers.

H1c: There is a difference between items of different difficulty levels in the accuracy of p-value estimation by the experienced teachers.

H1d: There is a difference between items of different discrimination power in the accuracy of p-value estimation by the experienced teachers.

H1e: There is a difference between items of different types in the accuracy of p-value estimation by the experienced teachers.

H2: There is a difference in the accuracy of item difficulty estimation of the experienced teachers trained and competent in estimation skills and the accuracy of item difficulty estimation obtained in the item analysis of the item pool.

H3: There is a difference in the accuracy of point-biserial estimation of experienced teachers trained in estimation skills and experienced teachers not trained in estimation skills. The difference favors the teachers trained in estimation skills.

H3a: There is a difference between items of different content areas in the accuracy of point-biserial estimation by the experienced teachers.

H3b: There is a difference between items of different cognitive levels in the accuracy of point-biserial estimation by the experienced teachers.

H3c: There is a difference between items of different difficulty levels in the accuracy of point-biserial estimation by the experienced teachers.

H3d: There is a difference between items of different discrimination power in the accuracy of point-biserial estimation by the experienced teachers.

H3e: There is a difference between items of different types in the accuracy of point-biserial estimation by the experienced teachers.

Statistical Analysis

A double-repeated measures analysis of variance with two within-subjects factors and one between-subjects factor was used to analyze the data for Hypothesis H1 in combination with each of the hypotheses H1a to H1e. For example, the combination of Hypothesis H1 with Hypothesis H1a was tested with treatment/control as the between-subjects factor and form and content area as the two within-subjects factors. The three factors were completely crossed, enabling interaction effects to be tested. However, teachers were nested within the treatment factor. A diagrammatic representation of the design is presented in Table 3.5. A double-repeated measures MANOVA was used because each teacher was considered as a block and the same teacher estimated the items in both forms (i.e., Form A and Form B) and also in all the different content areas (i.e., chemical structure, electricity & energy, rates & equilibrium, and descriptive chemistry). The possibility of unequal correlation between the accuracy of p-value estimates for any two levels of the content area factor favored a multivariate approach (Norusis, 1988). Nevertheless, if the chi-square test of sphericity is nonsignificant, a univariate multiple-comparison procedure would be used to test the pairwise comparisons among the levels of the content area (Kirk, 1982).

The hypotheses H1a to H1e and H3a to H3e implied that the items in each of the two forms, A and B, would be grouped according to the factor tested in the hypothesis.
For example, to test hypothesis H1a, the items would be grouped into four different content areas: chemical structure, electricity & energy, rates & equilibrium, and descriptive chemistry.

Table 3.5.--A Double-Repeated Measures Design with One Between-Subjects and Two Within-Subjects Factors.

                                                Form A               Form B
  Treatment                        Teacher   CA1 CA2 CA3 CA4      CA1 CA2 CA3 CA4
  Trained in estimation skills     T1
                                   T2
                                   ...
                                   T15
  Not trained in estimation        T1
  skills                           T2
                                   ...
                                   T15

  Note: CA1 = Chemical Structure; CA2 = Electricity & Energy; CA3 = Rates & Equilibrium; CA4 = Descriptive Chemistry.

For hypotheses H1b and H3b, the items were grouped into knowledge, comprehension, and application levels. The initial classification of the items into the various cognitive levels by the panel of teachers at the time of test construction was used in this study.

For hypotheses H1c and H3c, the items were grouped into easy, medium, or hard categories according to their population p-values. Items with population p-values greater than or equal to 0.65 were grouped into the easy category, items with p-values less than 0.65 and greater than 0.49 were grouped into the medium category, and items with p-values smaller than 0.49 into the hard category.

For hypotheses H1d and H3d, the items were grouped into a low discriminating category (items with population values of r less than or equal to 0.4), a medium discriminating category (r between 0.4 and 0.5), and a high discriminating category (r greater than 0.5).

For hypotheses H1e and H3e, the items were grouped according to single-answer and multiple-answer type.

For each of the above hypotheses, the mean of the accuracy of estimation for the items in each particular category was computed for each teacher. These means were taken as indicators of the accuracy of estimation of the teachers on the particular category of items. One teacher in each of the treatment and the control groups did not provide estimates of the point-biserials for the items. Hence, for the mean accuracy of point-biserial estimation, data were available for only fourteen teachers in each group. The complete tables of the means for the individual teachers on all dependent variables are presented in Appendix F.

Hypothesis H2, which involved the comparison of the accuracy of p-value estimation of the teachers trained in estimation skills with the item pool accuracy of estimation, was tested by a three-way ANOVA. In this case, an accuracy index was computed for each teacher by averaging the accuracy of p-value estimation (i.e., the absolute difference between the equated p-value estimated by the teacher and the population p-value of the item) across the 80 items. The associated standard deviation of these absolute differences was also computed (Table 3.6). The competency of teachers in estimating item difficulty was evaluated by the size of the accuracy index and the standard deviation: the smaller the values of both indicators, the more competent. The ten most competent teachers were selected and their average equated p-value estimates (p-bar) were computed for each item. The absolute differences between these p-bars and the corresponding population p-values represent the accuracy with which the teachers as a group estimated the item difficulty. The accuracy of estimation obtained in the field trial of the item pool was indicated by the absolute differences between the item pool estimated p-values and the corresponding population p-values.

Table 3.6.--Competency Indices of Teachers.
Teacher     Mean      Std. Dev.
T1         .1088      .1053
T2         .1090      .0849
T3         .1106      .0928
T4         .1113      .0853
T5         .1163      .0857
T6         .1048      .0788
T8         .1200      .0918
T9         .1120      .0855
T10        .1123      .0863
T11        .1163      .1034
T12        .1239      .0886
T13        .1010      .0838
T14        .1060      .0804
T15        .1224      .0917

Hypothesis H2 essentially concerned a comparison of the difference between the accuracy of these two methods of estimation.

The five hypotheses H1a to H1e were based on the same data. This created a problem known as inflation of the alpha level. If the usual alpha level of 0.05 were used in testing the hypotheses, then the chance of making a Type I error would be approximately equal to the sum of the alpha levels across the tests, i.e., 0.25. To avoid this problem, these five hypotheses were tested using an alpha level of 0.01. The experimentwise alpha would therefore be fixed at 0.05. The same alpha level was also used for testing hypotheses H3a to H3e, which have a similar error-rate problem.

In order to provide an indication of the accuracy of the teachers' point-biserial estimation, a Spearman rank correlation coefficient between the estimated and the population point-biserials was computed. In this case, an average estimated range within which the population point-biserial was expected to lie was first computed for each item. This was done by averaging, for each item, the upper limits of the ranges estimated by all the teachers in the treatment group, and then averaging the lower limits. The average upper and the average lower limits then constitute the average estimated range. An estimate of the point-biserial for each item was obtained using the Bayesian approach as described in Appendix D. Thus each of the 80 items (from Forms A & B) has an estimated and a population point-biserial. A Spearman rank correlation coefficient was then computed from these two sets of point-biserials (Glass & Hopkins, 1984).

Summary

The accuracy with which two groups of experienced Chemistry teachers estimated item statistics was compared; one group received training in estimation skills and the other was not trained. Accuracy of estimation was also compared between the trained teachers and the accuracy of estimates obtained in the field trial of the item pool. Thirty experienced Chemistry teachers were randomly assigned to two groups: treatment and control. The items to be estimated by the teachers were grouped into two forms, A and B; each form contained 40 items. Equating items were embedded in each form. The teachers were provided with 10 "anchor" items (i.e., items with known population item characteristics) to guide them in the estimation.

The training program consisted of a theoretical component and a practical component. In the theoretical component, teachers were involved in the identification and study of determinants of item statistics, whereas the practical component focused on practicing the application of these determinants in actual item statistics estimation.

Statistical analysis of the data involved a double-repeated measures design with one between-subjects factor and two within-subjects factors. The treatment-control dimension was the between-subjects factor, and form (A & B) was one of the two within-subjects factors.
The other within-subjects factor was one of the following five dimensions: (a) content area (with 4 levels: chemical structure, electricity & energy, rates & equilibrium, and descriptive chemistry), (b) cognitive level (with 3 levels: knowledge, comprehension, and application), (c) difficulty (with 3 levels: easy, medium, and hard), (d) discrimination power (with 3 levels: low, medium, and high), and (e) type of item (with 2 levels: single-answer type and multiple-answer type).

Apart from comparing the trained teachers' estimation accuracy with the accuracy obtained from an item analysis of the item pool, examination of the effects of treatment, content area, cognitive level, difficulty level, discrimination power, and item type on the accuracy of item statistics estimation was an important target of the study.

CHAPTER IV
RESULTS

Introduction

The results of the study are presented in this chapter. Even though the hypotheses tested with respect to the accuracy of estimation of p-values and point-biserials were similar, the results are presented separately. For the dependent variable accuracy of estimation of p-values, Hypothesis H1 in combination with Hypotheses H1a to H1e was tested using a multivariate double-repeated measures analysis of variance, and Hypothesis H2 was tested with a three-way ANOVA in which method of estimation, form, and content area were the three factors. For the dependent variable accuracy of estimation of point-biserials, Hypothesis H3 in combination with Hypotheses H3a to H3e was tested using a multivariate double-repeated measures analysis of variance. The results of the hypothesis testing are presented in the order in which they are mentioned above. All the ANOVA tables are presented in Appendix G.

Accuracy of P-Value Estimation

Results Concerning the Treatment Effect

The test of Hypothesis H1 was carried out by a multivariate double-repeated measures analysis in which treatment was the between-subjects factor, and form and content area were the two within-subjects factors. The hypothesis was stated as,

H1. The accuracy of item difficulty estimation by experienced teachers trained in estimation skills will be better than the accuracy of estimation by experienced teachers not trained in estimation skills.

The difference in the means between the accuracy of p-value estimation by the trained teachers (mean = 1.05, on a scale which ranged from 0 to 3) and that of the untrained teachers (mean = 1.12) was found to be statistically significant, F(1,28) = 16.67, p < 0.05. Thus this hypothesis was accepted (alpha = 0.05), and it can be concluded that the teachers trained in estimation skills could estimate the p-values of the items more accurately than the teachers not trained in estimation skills. The treatment had an effect size of 0.53.

Results Concerning the Effect of Content

The hypothesis was stated as,

H1a. There will be differences among different content areas in the accuracy of p-value estimation by experienced teachers.

The multivariate test of significance for the content effect indicated that the effect was statistically significant, F(3,26) = 72.8, p < 0.01. The average univariate F test was also significant, F(3,84) = 111.73, p < 0.01. As was explained in the section on statistical analysis in Chapter III, in order to avoid the problem of an inflation of the alpha level, the main effects of content area, cognitive level, difficulty level, discrimination power, and item type were each tested at an alpha level of 0.01.
Thus Hypothesis H1a was accepted at the 0.01 level. The other within-subjects factor, i.e., form, has an observed univariate F-ratio of 7.02 (df = 1,28; p < 0.05). This indicated that the main effect of form was significant (alpha = 0.05). The treatment x form x content interaction has a multivariate F-ratio of 2.25 (df = 3,26; p > 0.05). For the treatment x content interaction, the multivariate test has an observed significance level of 0.41, whereas for the treatment x form interaction, the F-ratio has an observed significance level of 0.65. Therefore the interactions (a) treatment x form x content, (b) treatment x content, and (c) treatment x form were all not significant at the 0.01 level. However, the form x content interaction was found to be significant at the 0.01 level by both the multivariate test, F(3,26) = 38.9, p < 0.01, and the univariate test, F(3,84) = 29.75, p < 0.01. The significant interaction is depicted in Figure 4.1.

[Figure 4.1.--Interaction of Form and Content on Accuracy of P-Value Estimation. The plot shows the mean accuracy for Form A and Form B across the four content areas: chemical structure, electricity & energy, rates & equilibrium, and descriptive chemistry.]

The Mauchly sphericity test involving the content factor has an observed significance level of 0.36 (chi-square = 5.5; df = 5). This indicated that the assumption of sphericity is not violated. The results of the a posteriori test of differences among the means of the levels of the content factor at each level of the form factor are shown in Tables 4.1 and 4.2. In order to avoid the risk of an inflated Type I error, the a posteriori multiple comparisons were based on the studentized range statistic, q. The data showed that for Form A, the mean accuracy of p-value estimation for "rates & equilibrium" items was significantly higher than the means for (a) chemical structure, (b) electricity & energy, and (c) descriptive chemistry, at the 0.01 level. However, for Form B, the means of the accuracy for "rates & equilibrium" and for "chemical structure" were each significantly higher than the means for (a) electricity & energy, and (b) descriptive chemistry. [Note: a higher mean accuracy value implies that the p-values of the items in the particular content area were less accurately estimated than in content areas with a lower mean accuracy value.]

Table 4.1.--Multiple Comparisons among the Means of the Different Content Areas (Form A). Studentized range statistic, q.

Content                 Descrip. Chemistry   Chemical Structure   Electricity & Energy   Rates & Equilm.
                        (Mean = .093)        (Mean = .093)        (Mean = .102)          (Mean = .139)
Descrip. Chemistry                           .15                  2.8                    14.4**
Chemical Structure                                                2.7                    14.2**
Electricity & Energy                                                                     11.6**

** Significant at the 0.01 level; the critical value of the studentized range statistic, q = 4.56, df = 84, 4.

Table 4.2.--Multiple Comparisons among the Means of the Different Content Areas (Form B). Studentized range statistic, q.

Content                 Descrip. Chemistry   Electricity & Energy   Chemical Structure   Rates & Equilm.
                        (Mean = .094)        (Mean = .094)          (Mean = .124)        (Mean = .126)
Descrip. Chemistry                           .18                    9.3**                10.1**
Electricity & Energy                                                7.5**                8.3**
Chemical Structure                                                                       .74

** Significant at the 0.01 level; the critical value of the studentized range statistic, q = 4.56, df = 84, 4.

Results Concerning the Effect of Cognitive Level

The hypothesis was stated as,

H1b. There will be a difference among item cognitive levels in the accuracy of p-value estimation by experienced teachers.
The obtained multivariate F-ratio of 45.66 (df = 2,27) was significant at the 0.01 level, supporting the hypothesis that there were differences among item cognitive levels in the accuracy of p-value estimation. The average univariate F-test gave a similar result, F(2,56) = 58.25, p < 0.01. The interactions (a) treatment x form x cognitive level, (b) treatment x form, and (c) treatment x cognitive level were all nonsignificant. However, both the multivariate and the average univariate F-tests indicated that the interaction between form and cognitive level was significant: multivariate F(2,27) = 53.5, p < 0.01; univariate average F(2,56) = 43.3, p < 0.01. A graphic representation of the interaction is shown in Figure 4.2.

[Figure 4.2.--Interaction of Form and Cognitive Level on Accuracy of P-Value Estimation. The plot shows the mean accuracy for Form A and Form B at the knowledge, comprehension, and application levels.]

The Mauchly sphericity test involving the cognitive level effect has an observed significance level of 0.075 (chi-square = 5.2; df = 2), indicating that the assumption of sphericity of the dependent variable was not violated. The results of an a posteriori test of differences among the means of the cognitive levels at different levels of the form factor are presented in Tables 4.3 and 4.4. The multiple comparisons were based on the studentized range statistic, q. The analyses showed that, for Form A, the mean accuracy of p-value estimation of "comprehension" items was significantly higher than the means for (a) "application" items and (b) "knowledge" items, whereas for Form B, the mean for "knowledge" items was significantly higher than the means for (a) "application" items and (b) "comprehension" items. The mean for "comprehension" items was also significantly higher than the mean for "application" items.

Table 4.3.--Multiple Comparisons among the Means of the Different Cognitive Levels (Form A). Studentized range statistic, q.

Cognitive Level    Application      Knowledge        Comprehension
                   (Mean = .090)    (Mean = .099)    (Mean = .116)
Application                         2.9              8.3**
Knowledge                                            5.4**

** Significant at the 0.01 level; the critical value of the studentized range statistic, q = 4.3; df = 56, 3.

Table 4.4.--Multiple Comparisons among the Means of the Different Cognitive Levels (Form B). Studentized range statistic, q.

Cognitive Level    Application      Comprehension    Knowledge
                   (Mean = .094)    (Mean = .110)    (Mean = .126)
Application                         5.1**            10.2**
Comprehension                                        5.1**

** Significant at the 0.01 level; the critical value of the studentized range statistic, q = 4.3; df = 56, 3.

Results Concerning the Effect of Item Difficulty Level

The hypothesis concerning the effect of item difficulty on the accuracy of estimation was stated as,

H1c. There will be differences among items of different difficulty levels in the accuracy of p-value estimation by experienced teachers.

The multivariate F-test for the main effect of difficulty level was significant, with an F-ratio of 69.9 (df = 2,27; p < 0.01). This supports the hypothesis that there are differences among item difficulty levels in the accuracy of p-value estimation. A similar result was obtained from the univariate average F-test (F = 109.5, df = 2,56; p < 0.01). The univariate F-test indicated that the p-values of the "easy" items were estimated significantly less accurately than the average of the "medium" and "hard" items, F(1,28) = 141.17, p < 0.01. There was no difference in the accuracy of p-value estimation between "medium" and "hard" items, F(1,28) = 0.29, p = 0.60.
The interactions (a) treatment x form x difficulty level, (b) treatment x form, and (c) treatment x difficulty level were all nonsignificant. However, both the multivariate and the univariate average F-tests indicated that the interaction between difficulty level and form on the accuracy of p-value estimation was significant: multivariate F(2,27) = 9.4, p < 0.01; univariate average F(2,56) = 10.6, p < 0.01. Figure 4.3 depicts the interaction graphically.

[Figure 4.3.--Interaction of Form and Difficulty Level on Accuracy of P-Value Estimation. The plot shows the mean accuracy for Form A and Form B for easy, medium, and hard items.]

Since Mauchly's test of sphericity indicated that the assumption of sphericity was not tenable (chi-square = 10.1, df = 2; p < 0.05), the residual mean squares appropriate for the specific contrasts of interest (Kirk, 1982) were used in the a posteriori comparisons among the means at different difficulty levels. Tables 4.5 and 4.6 display the results of the post-hoc comparisons. The analyses showed that, for both forms, "easy" items were estimated significantly less accurately than both "medium" and "hard" items, whereas there was no significant difference in the accuracy of estimation between the "medium" and the "hard" items.

Table 4.5.--Multiple Comparisons among the Means of the Different Difficulty Levels (Form A). Studentized range statistic, q.

Difficulty Level    Hard             Medium           Easy
                    (Mean = .088)    (Mean = .095)    (Mean = .134)
Hard                                 2.9              14.6**
Medium                                                11.2**

** Significant at the 0.01 level; the critical value of the studentized range statistic, q = 4.3, df = 56, 3.

Table 4.6.--Multiple Comparisons among the Means of the Different Difficulty Levels (Form B). Studentized range statistic, q.

Difficulty Level    Medium           Hard             Easy
                    (Mean = .089)    (Mean = .093)    (Mean = .147)
Medium                               1.7              15.4**
Hard                                                  13.9**

** Significant at the 0.01 level; the critical value of the studentized range statistic, q = 4.3, df = 56, 3.

Results Concerning the Effect of Different Discrimination Power

The hypothesis involved in this aspect of the study was stated as,

H1d. There will be differences among items of different discrimination power in the accuracy of p-value estimation by experienced teachers.

Both the multivariate and univariate F-ratios for the main effect of discrimination power have an observed significance level of less than 0.01 (multivariate F = 27.4, df = 2,27; univariate average F = 35.9, df = 2,56). Thus the hypothesis that there would be differences among items of different discrimination power in the accuracy of p-value estimation was accepted at the 0.01 level. The interactions (a) treatment x form x discrimination level, (b) treatment x form, and (c) treatment x discrimination level were all not significant at the 0.01 level. The result of the multivariate F-test indicated that the form by discrimination level interaction was significant, F(2,27) = 7.2, p < 0.01, and the univariate average F-test gave a similar result, F(2,56) = 6.2, p < 0.01. A graphic representation of the interaction is shown in Figure 4.4.

[Figure 4.4.--Interaction of Form and Discrimination Level on Accuracy of P-Value Estimation. The plot shows the mean accuracy for Form A and Form B for items of low, medium, and high discrimination.]

The Mauchly sphericity test for the discrimination power factor has an observed chi-square value of 2.04 (df = 2, p > 0.05), indicating that the assumption of sphericity was tenable. The results of the a posteriori test of differences among the means of items of different discrimination levels are presented in Tables 4.7 and 4.8.
The post-hoc analyses indicated that, for Form A, the mean accuracy of p-value estimation for items with low discrimination power was significantly higher than the means for (a) items with medium discrimination power and (b) items with high discrimination power. For Form B, the mean for items with low discrimination power was significantly higher than the mean for items with high discrimination power, and the mean for items with medium discrimination power was also significantly higher than the mean for items with high discrimination power.

Table 4.7.--Multiple Comparisons among the Means of the Different Discrimination Levels (Form A). Studentized range statistic, q.

Discrimination Level    High             Medium           Low
                        (Mean = .095)    (Mean = .105)    (Mean = .113)
High                                     2.6              4.3**
Medium                                                    2.1**

** Significant at the 0.01 level; the critical value of the studentized range statistic, q = 4.3, df = 56, 3.

Table 4.8.--Multiple Comparisons among the Means of the Different Discrimination Levels (Form B). Studentized range statistic, q.

Discrimination Level    High             Medium          Low
                        (Mean = .094)    (Mean = .11)    (Mean = .120)
High                                     6.3**           6.6**
Medium                                                   .31

** Significant at the 0.01 level; the critical value of the studentized range statistic, q = 4.3, df = 56, 3.

Results Concerning the Effect of Item Type

The hypothesis concerning the effect of item type on the accuracy of estimation was stated as,

H1e. The p-values of the single-answer multiple-choice items will be estimated more accurately than those of the multiple-answer multiple-choice items by experienced teachers.

The paired t test for the difference between the accuracy of p-value estimation of the single-answer type and the accuracy of the multiple-answer type was significant, t(28) = 6.0, p < 0.01, supporting the hypothesis that the p-values of single-answer type items were estimated more accurately than those of the multiple-answer type items. The interactions (a) treatment x form x item type, (b) treatment x form, and (c) treatment x item type were all not significant at the 0.01 level. The interaction between form and item type was significant, F(1,18) = 42.6, p < 0.01. The form by item type interaction is represented in Figure 4.5.

[Figure 4.5.--Interaction of Form and Item Type on Accuracy of P-Value Estimation. The plot shows the mean accuracy for Form A and Form B for single-answer and multiple-answer items.]

For Form A, the mean accuracy of p-value estimation for multiple-answer items was significantly higher than the mean for single-answer items (q = 6.5, p < 0.01). However, there was no significant difference between the means of these two item types in Form B.

Results of Teacher and Item Pool Estimation Comparison

The hypothesis concerning the comparison of the accuracy of estimation by teachers with the accuracy of estimation by the item pool field trial was stated as,

H2. There will be a difference in the accuracy of item p-value estimation by experienced teachers trained and competent in estimation skills and the accuracy of the item p-value estimation obtained in the field trial of the item pool.
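Hypothesis H2 was evaluated with a three-way ANOVA crossing method of estimation, form, and content area, with item accuracy as the dependent variable. The fragment below is only a minimal sketch of how such an analysis can be laid out; the data frame, its column names, and the generated values are hypothetical stand-ins for the study's data, and statsmodels' formula interface is assumed to be available.

```python
from itertools import product
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

rng = np.random.default_rng(1)

# Hypothetical balanced layout: 2 methods x 2 forms x 4 content areas x 10 items,
# standing in for the 160 accuracy values summarized in Table 4.9.
cells = product(["teachers", "field_trial"], ["A", "B"],
                ["structure", "electricity", "rates", "descriptive"])
rows = [{"method": m, "form": f, "content": c, "accuracy": rng.normal(1.0, 0.5)}
        for m, f, c in cells for _ in range(10)]
df = pd.DataFrame(rows)

# Three-way ANOVA: method of estimation x form x content area.
model = ols("accuracy ~ C(method) * C(form) * C(content)", data=df).fit()
print(sm.stats.anova_lm(model, typ=1))
```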
The results of the three-way ANOVA for evaluating Hypothesis H2 are displayed in Table 4.9. The distribution of the original dependent variable (i.e., the absolute difference between the estimated p-value and the parameter value of the item difficulty) was positively skewed, as shown in Figure 4.6. To reduce the skewness, the original dependent variable was transformed to a scale in which the distribution was more nearly normal, by the equation:

    Y* = -ln(.05) + ln(.05 + Y)

Figure 4.7 shows the distribution of the transformed variable (Y*).

Table 4.9.--ANOVA Table for Evaluating the Difference between the Accuracy of Estimation by Teachers and the Accuracy by Field Trial.

Source of Variation                    SS      DF      MS      F     Sig. of F
Main effects
  Method of estimation                1.9       1     1.9     .09      .762
  Form                               13.0       1    13.0     .64      .425
  Content                            62.4       3    20.8    1.0       .387
Two-way interactions
  Method by Form                      3.5       1     3.5     .17      .678
  Method by Content                 247.8       3    82.6    4.0       .009
  Form by Content                    37.6       3    12.5     .61      .608
Three-way interaction
  Method by Form by Content          71.5       3    23.8    1.2       .324
Residual                           2940.3     144    20.4
Total                              3389.9     159    21.3

[Figure 4.6.--Frequency distribution of the original accuracy of p-value estimates by competent teachers and by field trial. Values in the figure were obtained from Y multiplied by 10^2. Summary statistics: Mean = 9.15, Std Err = .53, Median = 8.00, Mode = 5.00, Std Dev = 6.66, Variance = 44.34, Kurtosis = 1.05, S E Kurt = .38, Skewness = 1.08, S E Skew = .19, Range = 32.00, Minimum = .00, Maximum = 32.00, Sum = 1464.00.]

[Figure 4.7.--Frequency distribution of the transformed accuracy of p-value estimates by competent teachers and by field trial. Values in the figure were obtained from Y* multiplied by 10^2. Summary statistics: Mean = 9.36, Std Err = .37, Median = 9.56, Mode = 6.93, Std Dev = 4.62, Variance = 21.32, Kurtosis = -.39, S E Kurt = .38, Skewness = -.03, S E Skew = .19, Range = 20.02, Minimum = .00, Maximum = 20.02, Sum = 1497.00.]

Based on the transformed scale, the difference in the means between the accuracy of estimation by the teachers (mean = 0.946, on a scale which ranged from 0 to 3) and the accuracy of field-trial estimation (mean = 0.925) has an F-ratio of 0.092 (df = 1,144; p = 0.76). This result indicated that the accuracy of estimation by the teachers was not significantly different from the accuracy of estimation obtained in the field trial (alpha = 0.05). The main effects of form and content were also not significant at the 0.05 level. However, the method of estimation by content area interaction was significant, F(3,144) = 4.0, p < 0.05. A graphic representation of the interaction is shown in Figure 4.8.

Efficiency of estimation by teachers. The efficiency of p-value estimation by the teachers was defined, in this study, as the extent to which the accuracy
of estimation by the teachers approximates the accuracy of estimation obtained in the field trial of the item pool.

[Figure 4.8.--Interaction of Method of Estimation and Content on Accuracy of P-Value Estimation. The plot shows the mean accuracy of estimation by teachers and by field trial across the four content areas: chemical structure, electricity & energy, rates & equilibrium, and descriptive chemistry.]

The efficiency of estimation was thus estimated by computing the percent change in the mean accuracy of estimation, with the mean accuracy of estimation obtained in the field trial as the baseline, and then subtracting the percent change from 100%. The means and standard deviations of the accuracy of estimation across the items in Forms A and B, for the trained and the untrained groups as well as for the field trial, are presented in Table 4.10. The computation of the efficiency of estimation was based on the original dependent variable rather than the transformed variable and is illustrated as follows:

    Efficiency = 100% - [(Met - Mft) / Mft] x 100%

where Met = mean accuracy of estimation by teachers across all items, and Mft = mean accuracy of estimation by field trial across all items. The efficiency of estimation by the teachers who had been trained in estimation skills was found to be 91%, whereas the efficiency of estimation of the teachers not trained in estimation skills was 78%.

Table 4.10.--Means and Standard Deviations of Accuracy of Estimation for Different Methods.

                                                   Mean                        Std. Dev.
Mode of Estimation                       Form A   Form B   A & B      Form A   Form B   A & B
Estimation by teachers trained and
  competent in estimation skills          .0873    .104     .0906      .071     .077     .074
Estimation by teachers not trained
  in estimation skills                    .1008    .1125    .1066      .085     .083     .084
Estimation by field trial of item pool    .0850    .0898    .0874      .052     .061     .057

Accuracy of Point-Biserial Estimation

Results Concerning the Treatment Effect

As was done for Hypothesis H1, the test of Hypothesis H3 was carried out by a multivariate double-repeated measures analysis in which treatment was the between-subjects factor, and form and content area of items were the two within-subjects factors. The hypothesis was stated as,

H3. The accuracy of point-biserial estimation by experienced teachers trained in estimation skills will be better than the accuracy by experienced teachers not trained in estimation skills.

The difference in the means between the accuracy of point-biserial estimation by the trained teachers (mean = 0.92, on a scale of 0 to 3) and that by the untrained teachers (mean = 0.94) was not significant, F(1,26) = 0.28, p = 0.60. Hence the hypothesis was not accepted, and it was concluded that there was no difference between the accuracy of the trained and the untrained teachers in their point-biserial estimation at the 0.05 level.

Results of Estimated and Population Point-Biserials Comparison

The Spearman rank correlation between the estimated and the population point-biserials of the 80 items (from Form A and Form B) has a value of 0.43. A scatterplot between the two sets of point-biserials for the 80 items is shown in Figure 4.9. For comparison purposes, a scatterplot between the point-biserials estimated by the item analysis research and the population point-biserials (for the same 80 items) is displayed in Figure 4.10. The two scatterplots seem to show a different spread of points. The "teacher vs population" plot seems to have a fairly even spread of points along the whole range of point-biserial values, whereas the "item analysis vs population" plot seems to show a wider spread of points at the lower end of the range.
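The Spearman rank correlation reported above is a single function call once the two sets of point-biserials are in hand. The fragment below is a minimal sketch of that computation, assuming scipy is available; the arrays are synthetic placeholders standing in for the 80 Bayesian estimates and the 80 population values, not the actual study data.

```python
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(0)

# Placeholder data: 80 population point-biserials and noisy "estimates" of them,
# standing in for the Bayesian estimates and population values of the study items.
population_pbis = rng.uniform(0.30, 0.65, size=80)
estimated_pbis = population_pbis + rng.normal(0.0, 0.08, size=80)

# Spearman rank correlation between the two sets of point-biserials.
rho, p_value = spearmanr(estimated_pbis, population_pbis)
print(f"Spearman rho = {rho:.2f}")   # the study reports rho = 0.43 for its data
```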
Results Concerning the Effect of Content

The hypothesis with respect to the effect of content area on the accuracy of point-biserial estimation was stated as,

H3a. There will be differences among different content areas in the accuracy of point-biserial estimation by experienced teachers.

The data indicated that the main effect of content area has a multivariate F-ratio of 27.28 (df = 3,24; p < 0.01) and thus supported the hypothesis. It can be concluded that the accuracy of point-biserial estimation was different for items of different content areas. The interactions (a) treatment x form x content area, (b) treatment x content area, and (c) treatment x form were all not significant at the 0.01 level. However, the interaction between form and content area was found to be significant, multivariate F(3,24) = 11.8, p < 0.01. The significant interaction is represented graphically in Figure 4.11.

[Figure 4.9.--Scatterplot between the estimated and the population point-biserials for the 80 items (x-axis: estimated point-biserial; y-axis: population point-biserial).]

[Figure 4.10.--Scatterplot between the point-biserials from the item analysis and the population point-biserials (x-axis: estimated point-biserial; y-axis: population point-biserial).]

[Figure 4.11.--Interaction of Form and Content on Accuracy of Point-Biserial Estimation. The plot shows the mean accuracy for Form A and Form B across the four content areas: chemical structure, electricity & energy, rates & equilibrium, and descriptive chemistry.]

The Mauchly sphericity test for the levels of the content factor was not significant, chi-square(5) = 1.14, p > 0.05, again indicating that the assumption of sphericity was tenable. Tables 4.11 and 4.12 display the results of the a posteriori multiple comparisons among the means of the different levels of the content factor. The post-hoc comparisons show that, for Form A, the mean accuracy of point-biserial estimation for "rates & equilibrium" items was significantly higher than the means for (a) "chemical structure" items and (b) "descriptive chemistry" items; the mean for "electricity & energy" items was significantly higher than the mean for "chemical structure" items. However, for Form B, none of the means was significantly different from the other means at the 0.01 level.

Table 4.11.--Multiple Comparisons among the Means of the Different Content Areas (Form A). Studentized range statistic, q.

Content                 Chemical Structure   Descrip. Chemistry   Electricity & Energy   Rates & Equilm.
                        (Mean = .080)        (Mean = .088)        (Mean = .106)          (Mean = .120)
Chemical Structure                           2.0                  6.1**                  9.5**
Descrip. Chemistry                                                4.1                    7.5**
Electricity & Energy                                                                     3.4

** Significant at the 0.01 level; the critical value of the studentized range statistic, q = 4.53, df = 78, 4.

Table 4.12.--Multiple Comparisons among the Means of the Different Content Areas (Form B). Studentized range statistic, q.

Content                 Descrip. Chemistry   Electricity & Energy   Chemical Structure   Rates & Equilm.
                        (Mean = .085)        (Mean = .085)          (Mean = .086)        (Mean = .095)
Descrip. Chemistry                           .02                    .28                  2.4
Electricity & Energy                                                .26                  2.4
Chemical Structure                                                                       2.0

Note: the critical value of the studentized range statistic, q = 4.53, df = 78, 4.

Results Concerning the Effect of Cognitive Level

The hypothesis to be tested with regard to this aspect of the study was stated as,

H3b.
There will be a difference among item cognitive levels in the accuracy of point-biserial estimation by experienced teachers.

This hypothesis was supported by the data (multivariate F-ratio = 5.7, df = 2,25; p < 0.01). It was concluded that the point-biserials of items at different cognitive levels were estimated with significantly different accuracy. None of the interaction effects was statistically significant. The Mauchly test of sphericity was also not significant (chi-square = 2.3, df = 2; p > 0.05). Table 4.13 presents the a posteriori comparisons among the means. The data showed that none of the means was significantly different from the others.

Table 4.13.--Multiple Comparisons among the Means of the Different Cognitive Levels. Studentized range statistic, q.

Cognitive Level    Application      Comprehension    Knowledge
                   (Mean = .087)    (Mean = .093)    (Mean = .096)
Application                         2.8              3.8
Comprehension                                        1.0

Note: the critical value of the studentized range statistic, q = 4.3, df = 52, 3.

Results Concerning the Effect of Item Difficulty Level

The hypothesis concerning this effect was stated as,

H3c. There will be differences among items of different difficulty levels in the accuracy of point-biserial estimation by experienced teachers.

The main effect of difficulty level was not significant, with a multivariate F-ratio of 1.17 (df = 2,25; p > 0.01), and the hypothesis was not accepted. All the two-way as well as the three-way interaction effects among the factors were also nonsignificant.

Results Concerning the Effect of Different Discrimination Power

The hypothesis to be tested was stated as,

H3d. There will be differences among items of different discrimination power in the accuracy of point-biserial estimation by experienced teachers.

The result of the multivariate test of significance has an observed F-ratio of 10.4 (df = 2,25; p < 0.01). Thus the hypothesis was accepted. The interaction between form and discrimination level was also significant, multivariate F(2,25) = 5.8, p < 0.01. Figure 4.12 depicts the significant interaction.

[Figure 4.12.--Interaction of Form and Discrimination Level on Accuracy of Point-Biserial Estimation. The plot shows the mean accuracy for Form A and Form B for items of low, medium, and high discrimination.]

The results of the a posteriori tests of differences among the means of the accuracy of point-biserial estimation at different discrimination levels are presented in Tables 4.14 and 4.15. The data showed that the means were not significantly different from each other.

Table 4.14.--Multiple Comparisons among the Means of the Different Discrimination Levels (Form A). Studentized range statistic, q.

Discrimination Level    Medium           High             Low
                        (Mean = .088)    (Mean = .100)    (Mean = .106)
Medium                                   2.8              4.2
High                                                      1.4

Note: the critical value of the studentized range statistic, q = 4.3, df = 52, 3.

Table 4.15.--Multiple Comparisons among the Means of the Different Discrimination Levels (Form B). Studentized range statistic, q.

Discrimination Level    High             Medium           Low
                        (Mean = .094)    (Mean = .110)    (Mean = .120)
High                                     .33              1.5
Medium                                                    1.1

Note: the critical value of the studentized range statistic, q = 4.3, df = 52, 3.

Results Concerning the Effect of Item Type

The hypothesis was stated as,

H3e. The point-biserials of the single-answer multiple-choice items will be estimated more accurately than those of the multiple-answer multiple-choice items by experienced teachers.
The point-biserials of single- answer multiple-choice items were not estimated more accurately than that of the multiple-answer multiple-choice items. All the two-way as well as the three-way effects were also not significant at 0.01 level. Summary The results of the statistical data analyses for this study were presented in two separately sections, one for the accuracy of p-value estimation and the other for the accuracy of point-biserial estimation. Accuracy of P-value Estimation The tests of all the hypotheses involving the accuracy of p-value estimation were presented in Table 4.16 and the results were summarized as follows: 1. The p-value estimation by experienced teachers trained in estimation skills was significantly more accurate that by experienced teachers not trained in estimation skills. 2a. There was a significant difference among the items of different content areas in the accuracy of p-value estimation by experienced teachers. 2b. There was a significant difference among the items of different cognitive levels in the accuracy of p-value estimation by experienced teachers. 2c. There was a significant difference among the items of different difficulty levels in the accuracy of p-value estimation by experienced teachers. 133 Table 4.16.--Summary of Tests of Significance for Item Difficulty Estimation. Double Sphericity Sig. of F Repeated Test Univ Measures Effect Chi-Sq Multiv (Avr F) Form Treatment - .000 & Form - .013 Content Content .358 .000 .000 Treat. x Form - .649 Treat. x Content .405 .309 Form x Content .050 .000 .000 Treat. x Form x Content .106 .207 Form Treatment - .000 & Form - .000 Cogn. Cognitive Level .075 .000 .000 Level Treat. x Form - .408 Treat. x Cogn. Level .808 .749 Form x Cogn. Level .464 .000 .000 Treat. x Form x Cogn. Level .631 .604 Form Treatment - .000 & Form - .022 Diff. Difficulty Level .006 .000 .000 Level Treat. x Form - .293 Treat. x Diff. Level .094 .047 Form x Diff. Level .806 .001 .000 Treat. x Form x Diff. Level .089 .057 Form Treatment - .000 & Form - .001 Discrim. Discrimination .360 .000 .000 Level Treat. x Form - .460 Treat. x Discrim. .136 .169 Form x Discrim. .449 .003 .004 Treat. x Form x .343 .419 Discrim. Form Treatment - .000 & Form - .001 Item Item Type - .000 Type Treat. x Form - .274 Treat. x Item Type - .884 Form x Item Type - .000 Treat. x Form x Item Type .171 2d. 2e. 3a. 3b. 134 There was a significant difference among the items of different discrimination power in the accuracy of p-value estimation by experienced teachers. The p-values of the single-answer multiple-choice items were estimated significantly more accurately than the multiple-answer multiple- choice items by experienced teachers. There was no significant difference between the accuracy of p-value estimation by the teachers competent in estimation skills and the accuracy of p-value estimation obtained by field-trial of item pool. The efficiency of estimation (as compared with the accuracy of estimation by field trial of item pool) by the teachers trained and competent in estimation skills was estimated to be 91% and the efficiency of estimation by teachers not trained in skills was estimated to be 78%. Accuracy of Point-biserial Estimatign The tests of all hypotheses involving the accuracy of point-biserial estimation were presented in Table 4.17 and the results were summarized as follows: 4. 5a. 
4. There was no significant difference between the accuracy of point-biserial estimation by the teachers trained in estimation skills and that of the teachers not trained in estimation skills.

5a. There was a significant difference among the items of different content areas in the accuracy of point-biserial estimation by experienced teachers.

5b. There was a significant difference among the items of different cognitive levels in the accuracy of point-biserial estimation by experienced teachers.

5c. There was no significant difference among the items of different difficulty levels in the accuracy of point-biserial estimation by experienced teachers.

5d. There was a significant difference among the items of different discrimination power in the accuracy of point-biserial estimation by experienced teachers.

5e. There was no significant difference between the single-answer multiple-choice items and the multiple-answer multiple-choice items in the accuracy of point-biserial estimation by experienced teachers.

6. The Spearman rank correlation between the estimated and the population point-biserials was 0.43.

Table 4.17.--Summary of Tests of Significance for Point-Biserial Estimation.

Double-Repeated        Effect                          Sphericity     Sig. of F    Sig. of F
Measures Analysis                                      Test (Chi-Sq)  (Multiv.)    (Univ., Avr F)
Form & Content         Treatment                            -           .603
                       Form                                 -           .000
                       Content                            .951          .000          .000
                       Treat. x Form                        -           .990
                       Treat. x Content                                 .339          .304
                       Form x Content                     .370          .000          .000
                       Treat. x Form x Content                          .303          .422
Form & Cogn. Level     Treatment                            -           .000
                       Form                                 -           .000
                       Cognitive Level                    .316          .009          .001
                       Treat. x Form                        -           .823
                       Treat. x Cogn. Level                             .296          .352
                       Form x Cogn. Level                 .026          .243          .206
                       Treat. x Form x Cogn. Level                      .921          .958
Form & Diff. Level     Treatment                            -           .357
                       Form                                 -           .000
                       Difficulty Level                   .505          .327          .370
                       Treat. x Form                        -           .772
                       Treat. x Diff. Level                             .195          .144
                       Form x Diff. Level                 .041          .033          .134
                       Treat. x Form x Diff. Level                      .276          .208
Form & Discrim. Level  Treatment                            -           .450
                       Form                                 -           .000
                       Discrimination                     .194          .001          .002
                       Treat. x Form                        -           .867
                       Treat. x Discrim.                                .653          .677
                       Form x Discrim.                    .019          .009          .074
                       Treat. x Form x Discrim.                         .905          .874
Form & Item Type       Treatment                            -           .353
                       Form                                 -           .000
                       Item Type                            -           .666
                       Treat. x Form                        -           .781
                       Treat. x Item Type                   -           .191
                       Form x Item Type                     -           .268
                       Treat. x Form x Item Type            -           .860

CHAPTER V
SUMMARY AND CONCLUSIONS

Summary

The quality of a test is determined, in part, by the extent to which the scores produced by the test are reliable and valid. Even though reliability and validity are both important indicators of a test's quality, a valid interpretation of the test scores is possible only when there is consistency in the test scores. Thus reliability of test scores is a necessary, although not sufficient, condition for valid test score interpretations. Psychometricians have developed theories that relate test characteristics such as reliability, mean, standard deviation, and standard error of test scores with item statistics such as the item difficulty and the item discrimination index. These theories enable the test developer to control the quality of a test through the use of item statistics such as the p-value and the point-biserial correlation coefficient. In the Examinations Syndicate, Malaysia, the p-value and the point-biserial of an item are estimated through annual field trials of an item pool. However, this annual field-trial exercise is plagued with practical as well as administrative problems.

The purpose of this study was to investigate an alternative approach for estimating item statistics.
Specifically, this study investigated how accurately experienced Chemistry teachers can estimate the item statistics of the Chemistry test items that will be used in the Malaysian Certificate of Education (MCE) Examination. In addition, this study examined whether the accuracy of estimation can be improved by an intervention program aimed at increasing the competency of the teachers' estimation.

A review of the literature indicated that studies related to the present research can be grouped into three broad categories: (a) Judgment Under Uncertainty, (b) Determinants of Item Difficulty and Discrimination, and (c) Empirical Studies of Accuracy of Item Statistics Estimation. Judgment under uncertainty dealt with the psychology of prediction and expert judgment. This group of studies revealed that judgmental heuristics -- representativeness, availability, and adjustment and anchoring -- are generally useful but often lead to severe and systematic bias. Studies concerning expert judgment focused on the problems of consensus judgment versus individual judgment. The studies recommended that, in order to avoid a normative effect in which individual judgments were unduly influenced by group judgments, judges should be allowed to form their own independent estimates after group discussion. Research on expert judgments also indicates that feedback about the experts' performance relative to the actual state of affairs can lead to improvement.

According to the studies, determinants of item statistics could be broadly divided into (a) intrinsic determinants and (b) extrinsic determinants. Intrinsic determinants include item complexity and the cognitive processes/components required to process item tasks, whereas extrinsic determinants include item language, content familiarity, item format, option homogeneity, grammatical inconsistency, option characteristics, and item context. Although the results of these studies have not been conclusive as to which of these factors affect item statistics, there seems to be evidence that item complexity, the cognitive components required to process item tasks, content familiarity, similarity of item options, item format, and item context are closely related to item difficulty.

Empirical studies of item statistics estimation seem to indicate that judges could estimate the relative but not the absolute item difficulties well, that the accuracy of estimation generally improved when the estimates of judges were pooled, that pooling only the estimates of competent judges provides a more accurate estimation than pooling estimates from the entire group of judges, and that providing "anchor" items improves the accuracy of estimation.

In the present study, 30 experienced teachers who had taught the examination classes in the subject of Chemistry were randomly assigned to one of two groups: the treatment and the control groups. Examination classes are classes in which the teachers prepared students to take the Malaysian Certificate of Education Examination at the end of the academic year. The teachers in the treatment group were invited to attend a 3-day training/workshop session at the Examinations Syndicate, Ministry of Education, Malaysia. The first two days of the training/workshop sessions were devoted to providing the teachers with the opportunity to develop skills and strategies for estimating item statistics. The training session consisted of a theoretical and a practical component.
The theoretical component provided the teachers with the necessary knowledge about item statistics, informed the teachers of the pitfalls involved in judgment under uncertainty, and presented the research findings on the determinants of item statistics. Teachers analyzed some sample items and were helped to develop their own list of determinants of item statistics. The practical component provided opportunity for the teachers to practice and to sharpen their skills in estimation. Feedback was provided after each practice to move the teachers' estimates, by successive approximations, closer to the parameter values of the item difficulty and item discrimination. The third day of the training session was reserved for the teachers to actually estimate the item statistics. The teachers in the control group estimated the item statistics without being trained in estimation skills, although they were also informed of the definitions and the meaning of the item statistics to be estimated.

The items to be estimated by the teachers were grouped into two forms, A and B; each form contained 40 items. Equating items were embedded in each form. The teachers were each provided with 10 "anchor" items (i.e., items with known population values of item characteristics) to guide them in the estimation.

The dependent variables in this research are: (a) the accuracy of p-value estimation, which was defined as the absolute difference between the equated p-value estimate of the item and the population p-value of the item; and (b) the accuracy of point-biserial estimation, which was defined as the absolute difference between the estimated point-biserial and the parameter value of the point-biserial of the item.

It was hypothesized that the trained teachers would be able to estimate the item statistics more accurately than the teachers not trained in estimation skills, and that the accuracy of estimation by the teachers who had been trained and were competent in estimating item statistics was not different from the accuracy of estimation obtained in field-trials of the item pool. It was also hypothesized that each of the factors (a) content area of item, (b) cognitive level of item, (c) difficulty level of item, (d) discrimination power of item, and (e) item type affected the accuracy of estimation by the experienced teachers.

Statistical analyses of the data involved a double repeated measures design with one between-subjects factor and two within-subjects factors. The treatment-control dimension was the between-subjects factor, and form was one of the two within-subjects factors. The other within-subjects factor was one of the five factors (a) to (e) mentioned above. In other words, a double-repeated measures analysis of variance technique was employed to test the treatment effect, the main effects of form, content area of item, cognitive level of item, difficulty level of item, discrimination of item, and item type, as well as the interactions among the factors. The difference between the accuracy of p-value estimation by the teachers trained and competent in estimation skills and the accuracy of estimation obtained in field-trial of the item pool was tested by a three-way ANOVA in which the factors form and content area were incorporated to increase power and to study possible interactions.

Conclusions

Accuracy of P-value Estimation

1. There was statistical evidence that the intervention program was effective in increasing the accuracy of p-value estimation by the experienced Chemistry teachers.
There was no significant interaction effect between treatment and form. There was also no significant interaction effect between treatment and each of the following factors: content area, cognitive level of item, difficulty level of item, discrimination power of item, and item type. Thus form, content area, cognitive level of item, difficulty level of item, discrimination power of item, and item type do not affect the generalization of the treatment effect.

2a. The p-values of items from different content areas were estimated with different accuracy by the experienced Chemistry teachers. Content area interacted with form. In Form A, the accuracy of estimation increased in the order: "rates & equilibrium", "electricity & energy", "chemical structure", and "descriptive chemistry"; whereas in Form B, the order was: "rates & equilibrium", "chemical structure", "electricity & energy", and "descriptive chemistry".

2b. The p-values of items of different cognitive levels were estimated with different accuracy by the experienced Chemistry teachers. Cognitive level interacted with form. In Form A, the accuracy of estimation increased as the cognitive level of items changed from "comprehension" to "knowledge" and then to "application". However, in Form B, the accuracy increased in the order: "knowledge", "comprehension", and "application".

2c. The p-values of items of different difficulty levels were estimated with different accuracy by the experienced Chemistry teachers. Difficulty level interacted with form. In Form A, "hard" items were most accurately estimated, followed by "medium" items and then "easy" items; whereas in Form B, the accuracy decreased in the order: "medium", "hard", and "easy" items.

2d. The p-values of items of different discrimination power were estimated with different accuracy by the experienced Chemistry teachers. In both forms, the high discriminating items were most accurately estimated, followed by the medium and then the low discriminating items. However, discrimination level interacted with form. The difference in accuracy of estimation between items with high discrimination power and items with medium discrimination power was smaller in Form A as compared to that in Form B.

2e. The p-values of single-answer items were estimated more accurately than those of multiple-answer items. However, item type interacted with form. In Form A, the single-answer items were more accurately estimated, whereas in Form B, the multiple-answer items were slightly more accurately estimated.

3a. There was no statistically significant difference between the accuracy of p-value estimation by the teachers who were trained and competent in estimation skills and the accuracy of p-value estimation obtained by field-trial of the item pool.

3b. The efficiency of estimation by the teachers trained and competent in estimation skills (as compared with the accuracy of estimation by field trial of the item pool) was estimated to be 91%, and the efficiency of estimation by the teachers not trained in estimation skills was 78%.

Accuracy of Point-biserial Estimation

4. There was no statistical evidence that the intervention program increased the accuracy of point-biserial estimation by the experienced Chemistry teachers.

5a. The point-biserials of items of different content areas were estimated with different accuracy. Content area interacted with form.
In Form A, the differences in accuracy of estimation between the "chemical structure" items and the "electricity & energy" items, and between the "rates & equilibrium" items and the "descriptive chemistry" items, were larger than the corresponding differences in Form B.

5b. The point-biserials of items of different cognitive levels were estimated with different accuracy. Even though the mean of accuracy of estimation for the "application" items was the lowest (i.e., most accurately estimated), the post hoc comparisons did not indicate any significant differences among the means of the three cognitive levels.

5c. The point-biserials of items of different difficulty levels were not estimated with significantly different accuracy.

5d. The point-biserials of items of different discrimination power were estimated with significantly different accuracy. Discrimination level interacted with form. In Form A, the accuracy of estimation increased in the order: "low discriminating" items, "high discriminating" items, and "medium discriminating" items; whereas in Form B, the order was: "low discriminating" items, "medium discriminating" items, and "high discriminating" items.

5e. There was no statistically significant difference in the accuracy of point-biserial estimation between single-answer items and multiple-answer items.

6. The Spearman rank correlation between the estimated and the population point-biserials for the 80 items was 0.43.

Discussion

Estimating P-values

The study showed that the experienced Chemistry teachers who had been trained in estimation skills were able to estimate the p-values of the items significantly more accurately than did those experienced Chemistry teachers who were not trained in estimation skills. In other words, the intervention program was effective in improving the accuracy of estimation of item difficulties. The estimated effect size of the treatment was 0.53 standard deviations, which was considered a reasonably large value. It represents a fair amount of improvement in accuracy of estimation. However, whether this improvement is large enough to offset the cost involved in training the experienced teachers was not investigated in this study. Another relevant concern is: Was the accuracy of estimation by the trained teachers high enough that it could be used in place of an empirical method of estimation? This question, although not answerable in absolute terms, will be addressed later.

The fact that treatment did not interact with form, content area, cognitive level of item, difficulty level of item, discrimination power of item, or item type suggests that the treatment effect does not depend on the levels of these factors.

The treatment was effective in spite of the small sample size used in the study. This could be attributed to the fact that the design of the intervention program incorporated the findings of three broad areas of research in subjective estimation, namely: (a) the psychology of estimation, (b) the determinants of item statistics, and (c) the empirical studies of subjective estimation of item statistics. The use of experienced teachers, who are familiar with the curriculum, the characteristics of the student population, and the nature of the examination, as the judges in the study could also be a reason for the effectiveness of the treatment. It is reasonable to believe that judges who are not as familiar with these characteristics may not be able to benefit from the intervention program as much as those who are familiar.
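The accuracy measure and the standardized effect size discussed above can be made concrete with a small sketch. All numbers below are invented for illustration, and the pooled-standard-deviation effect size is an assumed formula, not necessarily the exact computation used in this study.

import numpy as np

# Population (field-trial) p-values of five hypothetical items.
population_p = np.array([0.72, 0.55, 0.40, 0.63, 0.28])

# Hypothetical estimates of the same items by one trained and one untrained teacher.
trained_est   = np.array([0.68, 0.64, 0.45, 0.52, 0.35])
untrained_est = np.array([0.80, 0.67, 0.34, 0.76, 0.17])

# Accuracy as defined in this study: absolute difference from the population value
# (smaller values mean more accurate estimation).
acc_trained   = np.abs(trained_est - population_p)
acc_untrained = np.abs(untrained_est - population_p)

# A standardized effect size for the treatment-control contrast (assumed pooled-SD form).
pooled_sd = np.sqrt((acc_trained.var(ddof=1) + acc_untrained.var(ddof=1)) / 2)
effect_size = (acc_untrained.mean() - acc_trained.mean()) / pooled_sd

print("mean accuracy, trained:  ", acc_trained.mean().round(3))
print("mean accuracy, untrained:", acc_untrained.mean().round(3))
print("standardized effect size:", effect_size.round(2))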
The p-values of items of different content areas were estimated with significantly different accuracy. As indicated in Figure 4.1, the "rates & equilibrium" items were least accurately estimated. A likely explanation might be that the content area "Rates & Equilibrium of Chemical Reactions" involves very abstract and difficult concepts. They are difficult to grasp, especially for students who are being exposed to them for the first time. However, once the fundamental concepts in this area are mastered, the whole topic becomes easy. Teachers who have taught the concepts for a number of years will tend to feel that the topic consists of just a few simple rules which can be easily applied to solve the problems in the items. This feeling may bias their judgments when estimating the difficulty of items in this particular content area. In general, to estimate the p-value of an item, a judge has to make judgments concerning not only the intrinsic difficulty of the item, but also the degree to which the students have mastered the concept or concepts on which the item was based. As indicated earlier, items in this content area could be very difficult if students did not completely understand the few fundamental concepts. However, once these fundamental concepts are mastered, most of the items could be answered correctly by routine applications of a few rules. The items then become very easy. Thus the ability to make an accurate judgment about students' degree of mastery of the fundamental concepts is crucial for an accurate estimation of the difficulty of the items in this content area. A small error in the judgment may result in a large error in the accuracy of estimation. This need for an accurate assessment of students' mastery of the fundamental concepts of "rates & equilibrium of chemical reactions", coupled with the tendency for the teachers to feel that the topic is easy, resulted in the teachers' estimations being least accurate in this content area.

The study showed that the p-values of the "chemical structure" items in Form B were less accurately estimated than those in Form A (Figure 4.1). However, for other content areas the items in Form B were either more accurately estimated or about the same as those in Form A. The two forms were constructed to measure the same content areas of Chemistry. The corresponding test statistics, i.e., means, standard deviations, and reliabilities (KR20), of the two forms are similar (the mean Deltas of Form A and Form B are 12.3 and 12.0, their standard deviations are 9.32 and 8.77, and their KR20's are .917 and .909, respectively). Furthermore, the accuracy of estimation for three out of the four content areas ranks in the same manner in both forms (Figure 4.1). The exceptionally poor accuracy of estimation for "chemical structure" items in Form B was unanticipated.

The p-values of items of different cognitive levels were estimated with different accuracy. An examination of the graph of accuracy versus cognitive levels (Figure 4.2) indicated that the "knowledge" items in Form B have an unexpectedly low accuracy of estimation as compared to the same type of items in Form A. A further investigation showed that three of the "knowledge" items in Form B were exceptionally poorly estimated. These two observations seem to indicate that the variability of the accuracy of p-value estimates for "knowledge" items was larger than that for items of the other two cognitive levels.
This larger variation in the accuracy of p-value estimates for "knowledge" items could be attributed, in part, to the fact that "knowledge" items tend to have a larger range of p-values. Items measuring knowledge that is familiar to the examinees will tend to be easy and hence have very high p-values. On the other hand, if a "knowledge" item measures a certain fact or specific knowledge that the examinees either could not remember or did not learn, its p-value will be very low. It is unlikely that the judges would be able to identify which specific knowledge/facts the examinees had not learned or could not remember. This could then lead to a situation in which, for some of the items, the judged p-values differ greatly from the actual p-values, resulting in the p-values of some items being exceptionally poorly estimated.

The p-values of items of different difficulty levels were estimated with varied accuracy. Specifically, the easy items were least accurately estimated. This result can be explained, in part, by the error of central tendency phenomenon (Tinkelman, 1947; Guilford, 1954), which is the tendency for people to avoid making extreme judgments and check the middle of the scale instead. Hence items with high p-values (i.e., easy items) tend to be rated as medium items. However, the phenomenon of error of central tendency did not seem to affect the accuracy of estimation for "hard" items. A probable reason is that teachers have a natural tendency to find out why certain items are difficult and how to help students answer these items correctly. As a consequence, teachers are sensitized to recognize difficult items rather than easy items. Thus experienced teachers are able to estimate the p-values of difficult items more accurately than those of the easy items.

With regard to items with different discrimination power, the study showed that the p-values of low discriminating items were estimated less accurately than those of the high discriminating items. The result was consistent with what was expected. It is generally true that the discrimination index of an item reflects the quality of the item. If an item has low discriminating power, it generally indicates that some ambiguities are present in the stem and/or the options. In many instances, a low discrimination index is also caused by the presence of extraneous difficulties which make the item difficult for both the high and the low ability groups. Extraneous difficulties are difficulties that may arise as a result of examinees not learning or not being taught certain concepts or facts (which the examinees were assumed to know) that are essential for answering the particular item correctly. When estimating the p-value of an item, the teachers first answer the item themselves and then assess how difficult it will be for the examinees to answer it correctly. It is generally true that teachers are more knowledgeable than the examinees in the content areas measured by the item, and hence are unlikely to experience any difficulties in answering the item even if extraneous difficulties are present in the item. In other words, it is difficult for teachers to detect the presence of extraneous difficulties. If teachers are not able to detect irrelevant or extraneous difficulties, it is unlikely that they will be able to estimate accurately the p-values of low discriminating items, because low discriminating items are items that are ambiguous and/or have extraneous difficulties.
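The argument above can be illustrated with a small simulation (invented parameters, not data from this study): an item whose difficulty comes from the tested content discriminates between high- and low-ability examinees, while an item made hard by an extraneous source of difficulty that affects all examinees alike shows a discrimination index near zero.

import numpy as np

rng = np.random.default_rng(0)
n = 2000
ability = rng.normal(0, 1, n)                    # latent ability
total = ability + rng.normal(0, 0.3, n)          # observed total score (a noisy proxy)

# Content-driven item: the probability of success rises with ability.
p_content = 1 / (1 + np.exp(-(ability - 0.5)))
content_item = rng.random(n) < p_content

# Extraneously difficult item: equally hard for everyone (e.g., ambiguous wording).
extraneous_item = rng.random(n) < 0.35

def point_biserial(item, total):
    return np.corrcoef(item.astype(float), total)[0, 1]

print("content-driven item:  p =", content_item.mean().round(2),
      " r_pbis =", point_biserial(content_item, total).round(2))
print("extraneous item:      p =", extraneous_item.mean().round(2),
      " r_pbis =", point_biserial(extraneous_item, total).round(2))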
The results of the study suggested that the p-values of the single-answer type items were more accurately estimated than those of the multiple-answer type items. A possible explanation could be offered as follows: the single-answer format is less complex than the multiple-answer format. In estimating the p-value of a single-answer type item, the judges read the stem, evaluate the options, and then make a judgment about how difficult it would be for the examinees to select the right answer. However, in a multiple-answer type situation, the judges have to make, for each item, as many such judgments as the number of right alternatives. Furthermore, in the case of the multiple-answer format, the judges also have to make judgments on how the various combinations of alternatives affect the item difficulty. These two factors make the p-value estimation of multiple-answer items more difficult and hence less accurate than in the case of single-answer type items. Although the result of the study is consistent with expectations, the significant interaction effect between item type and form limits the generalizability of the finding.

The results of the data analysis indicated that form had a significant interaction with each of the within-subjects factors -- content area, cognitive level, difficulty level, discrimination level, and item type -- investigated in this study. The presence of interactions indicated that the generalization of the main effect of each of the five within-subjects factors must be qualified. A discussion of the generalizations, taking into consideration the presence of interactions, is presented below:

Form x content area interaction. An examination of Figure 4.1 and Tables 4.1 & 4.2 seemed to indicate that the interaction between the Form and Content Area factors resulted mainly from the differential mean accuracy of estimation of "chemical structure" items in Form A and Form B. In other words, if we disregarded the level "chemical structure", the findings that (a) the "rates & equilibrium" items were less accurately estimated than both the "electricity & energy" and the "descriptive chemistry" items, and (b) there was no significant difference between the accuracy of the "electricity & energy" items and the "descriptive chemistry" items, could be generalized to other forms.

Form x cognitive level interaction. The instability of the accuracy of p-value estimation for "knowledge" items across the two forms (see Figure 4.2) appeared to be the main source of variation that contributed to the significant interaction effect between the Form and Cognitive Level factors. Although the instability of "knowledge" items has complicated the interpretation of the main effect of cognitive level, it does not, however, invalidate the generalization (to other forms) that the "application" items are more accurately estimated than the "comprehension" items.

Form x difficulty level interaction. Even though the interaction between the Form and Difficulty Level factors was significant, an examination of the graphical representation of the interaction effect (Figure 4.3) showed that the conclusion that "easy" items were less accurately estimated than either the "medium" items or the "hard" items can still be generalized to other forms. The results in Tables 4.5 & 4.6 also allow the finding that the "medium" items and the "hard" items are estimated with similar accuracy to be generalized to other forms.

Form x discrimination level interaction.
The situation for the interaction between the Form and Discrimination Level factors (Figure 4.4) is slightly different from that for the Form x Difficulty Level interaction. In this case, in spite of the presence of interaction, the finding that the p-values of the "highly discriminating" items were more accurately estimated than those of the "low discriminating" items can still be generalized to other forms. However, no generalization (to other forms) can be made with regard to either the difference in the accuracy of estimation between the "highly discriminating" and the "medium discriminating" items or between the "medium discriminating" and the "low discriminating" items.

Form x item type interaction. Even though the main effect of item type was significant, the interaction between the Form and Item Type factors was also significant. The presence of an interaction indicates that the effect of item type on the accuracy of p-value estimation depends on form, thus limiting the generalization of the main effect.

A comparison of the accuracy of p-value estimation by the experienced teachers trained and competent in estimation skills with the accuracy of estimation by a field-trial of the item pool was carried out in the study. The purpose of the comparison was to address the question as to whether the estimation by these teachers could be used as a substitute for the estimation obtained from the empirical method. The results suggest that the accuracy of estimation by the trained and competent teachers was not significantly different from the accuracy obtained in the item analysis research of an item pool. This conclusion was arrived at by comparing the mean accuracy of estimation by the 10 most competent teachers (competent in terms of estimation skills) with the accuracy obtained in the field-trial of the item pool, using an ANOVA technique. The method of comparison used in this study was different from the usual approach reported in the literature (Bejar, 1981; Lorge & Diamond, 1954; Lorge & Kruglov, 1953; Quereshi & Fisher, 1977; Tinkelman, 1947; Willoughby, 1980), in which the estimated p-values were typically correlated with the empirically determined p-values of the items under study. A shortcoming of using a correlational technique in this type of study is that a high correlation between subjectively estimated p-values and empirically estimated p-values does not necessarily result in high agreement between the two sets of p-values; conversely, high agreement does not necessarily result in high correlation (a numerical illustration of this point is given at the end of this section).

Using the accuracy of estimation obtained in the field-trial of the item pool as the standard of accuracy, the study showed that the accuracy of estimation by the teachers competent in estimation skills was about 91% of the accuracy obtained in the field-trial of the item pool. Whether or not this degree of accuracy is sufficiently high that subjective estimation by experienced teachers can be used in place of empirical estimation should be determined by a practical rather than a theoretical consideration. Apart from the degree of accuracy, the availability of students for field-trials, the administrative constraints involved in field-trials of the item pool, financial expenses, and the constraints of time are some of the factors that will influence the choice between teacher estimation and the empirical procedure for estimating item statistics in the process of test construction. It is only after these factors have been considered that a choice between the two methods can be made.
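As a numerical illustration of the distinction drawn above between correlation and agreement (all values are invented):

import numpy as np

def report(label, est, emp):
    r = np.corrcoef(est, emp)[0, 1]
    mad = np.abs(est - emp).mean()
    print(f"{label}: correlation = {r:.2f}, mean |difference| = {mad:.3f}")

# Case 1: estimates track the ordering perfectly but carry a constant bias,
# giving a correlation of 1.0 despite poor agreement with the empirical values.
emp1 = np.array([0.25, 0.35, 0.45, 0.55, 0.65, 0.75])
report("biased but well-ordered ", emp1 + 0.20, emp1)

# Case 2: every estimate is within .02 of the empirical value, but the empirical
# values barely vary, so agreement is good while the correlation is weak.
emp2 = np.array([0.49, 0.50, 0.51, 0.50, 0.49, 0.51])
est2 = np.array([0.51, 0.50, 0.49, 0.48, 0.50, 0.49])
report("close but weakly related", est2, emp2)

Because correlation is insensitive to constant shifts and is dominated by rank order, the accuracy measure used in this study (the absolute difference) speaks more directly to whether subjective estimates can replace empirical ones.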
Estimating Point-Biserials

The study showed that treatment did not improve the teachers' accuracy in estimating the point-biserials of the items. The nonsignificant treatment effect could be due to the fact that the estimation procedure, in this case, consisted of two components: (a) the teachers' subjective estimation and (b) a small-sample empirical estimation. The data from the treatment and the control groups were each combined with the same data obtained from the empirical estimation procedure to produce the respective sets of data for the treatment group and the control group teachers. Incorporating a common set of data from the empirical estimation has the effect of reducing the final difference between the mean accuracy of point-biserial estimates of the treatment group and that of the control group. This could then result in a nonsignificant difference. It is perhaps relevant to point out that very little research on subjective estimation of point-biserials has been reported in the literature. As a result, the intervention program was not able to benefit from a review of the literature to gain insights concerning subjective point-biserial estimation.

The accuracy of point-biserial estimation was found to be affected by the content area, the cognitive level of the items, and the discrimination power of the items. However, the accuracy was not affected by the difficulty level of the items or the item type. Regarding content areas, the "rates & equilibrium" items were least accurately estimated. In the case of cognitive levels, it was the "knowledge" items that were most poorly estimated; whereas among the different discrimination levels, the accuracy of the "low discriminating" items was the lowest. It is interesting to note that the accuracy of p-value estimation of items in each of these three areas (i.e., "rates & equilibrium", "knowledge", and "low discriminating") was also the lowest when compared to other levels within the appropriate factor. This similarity helps to shed some light on the question of why the point-biserial estimations of items in these areas are lowest. First of all, the accuracy of point-biserial estimation depends on how accurately the proportion of the high ability group who will answer the item correctly and the proportion of the low ability group who will answer the item incorrectly are estimated. To the extent that these proportions are inaccurately estimated, the accuracy of point-biserial estimation will be low. Since the p-values of the items in these three areas were poorly estimated, it follows that the proportion of the high ability group who will answer the items correctly and the proportion of the low ability group who will answer incorrectly would also be poorly estimated. This results in poor point-biserial estimations. Thus the explanation (which was given in the earlier section) for the low accuracy of p-value estimation of items in these three areas could also account for the low accuracy of the point-biserial estimates in these areas.

Both multivariate and univariate F-tests indicated that the interaction between the Form and Content Area factors was significant at the 0.01 level. However, a study of Figure 4.11 and Tables 4.11 & 4.12 showed that, for Form A, the mean accuracy of "rates & equilibrium" items was significantly higher than that of (a) "chemical structure" items and (b) "descriptive chemistry" items. The mean accuracy of "electricity & energy" items was significantly higher than that of "chemical structure" items.
However, such results were not repeated for the items in Form B. Thus it may not be appropriate to generalize the findings concerning the differences among the means to other forms.

With regard to the interaction between the Form and Discrimination Level factors, even though the multivariate F-test indicated that the interaction was significant, the Tukey's test did not show any significant differences among the means in either form. This discrepancy was probably due to the fact that Tukey's multiple-comparison procedure is a post hoc comparison technique which is known to be less powerful (Glass & Hopkins, 1984).

Implications for Further Research

This study explored two major concerns in item statistics estimation. One concerned whether the accuracy of subjective estimation could be improved by an intervention program, and the other concerned the extent to which the accuracy of subjective estimation approached the accuracy of estimation obtained during item analysis. Although the results of the study with respect to both concerns were quite promising (i.e., the treatment effect was significant for p-value estimation, and the accuracy of estimation by the teachers was not significantly different from that obtained in the field-trial of the item pool), additional research in this area is required before subjective estimation procedures can be used in the place of an empirical estimation procedure.

Several directions can be suggested for future research on subjective estimation. One direction that might be taken is to replicate the study using larger samples of teachers from different geographical regions and using Chemistry test items from other years. The present study has indicated that several properties of the items, such as content area, cognitive level of the item, and item type, interacted with test forms. Replication with other forms can shed more light on the pattern of interaction between each of these factors and form.

Only Chemistry test items and Chemistry teachers were used in this study. A relevant question that can be asked is to what extent the results of the study can be generalized to teachers and test items of other subject matters. Thus another direction which future research can take is to replicate the study using teachers and items from other subject matters such as Physics and Biology.

This study showed that the intervention program was effective with two days of training. If the subjective estimation procedure were to be used in place of the empirical estimation procedure, it would be important to find out whether a more intensive training program could improve the accuracy of estimation over and above the improvement achieved in the original two-day training session. To answer this question, research similar to the present study could be carried out with the intensity and duration of the training program as two of the factors in the research design.

The study has also identified the specific level within each of the factors that was least accurately estimated. For example, among the various content areas, "rates & equilibrium" items were least accurately estimated, and among items of different difficulty levels, the estimation of the item statistics of the "easy" items was least accurate. Further research efforts could be taken to explore and to devise procedures to improve the accuracy of estimation in those specific levels.
One possible approach is to train judges using more items in these areas as practice items and to provide feedback to fine-tune their estimation after each practice session. A study in which the types and quantity of practice items used in the training sessions are manipulated may address this issue.

APPENDICES

APPENDIX A

TEACHER DATA

Table A1.--Description of the Teachers in the Treatment Group.
(Columns: Teachers T1-T15, followed by the group Mean and Std. Dev.)

Age: 35 28 35 35 33 40 38 38 40 31 39 43 32 29 28; Mean 34.9, SD 4.55
Teaching Experience (Years): 9 5 9 11 9 16 14 12 15 4 15 20 7 6 4; Mean 10.7, SD 4.48
Average No. of Years Teaching Forms 4 & 5*: 6 4 7 5 3 15 8.5 7.5 4.5 3 13 6 2 4.5 3.5; Mean 6.18, SD 3.67
No. of Years Teaching Form 6*: 0 0 4 1 0 8 0 3 1 0 0 0 0 0; Mean 1.21, SD 2.24
No. of Years Rating Essay Chemistry Test Papers: 7 1 0 0 0 13 1 2 0 0 4 1 4 0; Mean 2.36, SD 3.58
No. of Years Rating Practical Chemistry Test Papers: 0 0 0 10 0 0 0 6 0 0 12 1 0 0 0; Mean 1.93, SD 3.87
Percent of Students Obtaining a Grade of 6 and Better**: 67 55 80 90 75 96 66 55 93 69 39 60 60 70; Mean 69.6, SD 15.5

* Forms 4, 5 & 6 are equivalent to the 10th, 11th & 12th Grades in the American educational system.
** Grading is based on a 1-9 point system; a grade of 6 is considered as having passed with credit.

Table A2.--Description of the Teachers in the Control Group.
(Columns: Teachers T1-T15, followed by the group Mean and Std. Dev.)

Age: 30 35 42 33 33 35 27 31 40 43 33 37 35 39 29; Mean 34.8, SD 4.55
Teaching Experience (Years): 6 12 15 10 10 9 4 6 15 16 8 12 10 11 5; Mean 9.9, SD 3.59
Average No. of Years Teaching Forms 4 & 5*: 3 5 13 10 2 9 4 1.5 13 8 3.5 6 6.5 11 5; Mean 6.71, SD 3.8
No. of Years Teaching Form 6*: 0 12 10 0 5 7 0 6 0 1 0 0 0 2; Mean 3.07, SD 4.03
No. of Years Rating Essay Chemistry Test Papers: 5 0 3 4 2 6 0 1 5 1 5 2 3 0 2; Mean 2.6, SD 1.96
No. of Years Rating Practical Chemistry Test Papers: 0 0 0 0 3 0 2 0 0 13 0 0 0 0 0; Mean 1.2, SD 3.3
Percent of Students Obtaining a Grade of 6 and Better**: 36 75 so 91 50 52 90 71 58 70 52 90 1.0 66 10.7

* Forms 4, 5 & 6 are equivalent to the 10th, 11th & 12th Grades in the American educational system.
** Grading is based on a 1-9 point system; a grade of 6 is considered as having passed with credit.

APPENDIX B

EXPLANATION

Technical Terms

This research involves several technical terms that have to be explained in detail. You are encouraged to understand these terms before you begin the estimation work.

Item Statistics

In the field of educational testing and measurement, the term "item" is used by test developers to mean a question in an objective question paper. Item statistics are figures that indicate the characteristics of an item. It is common for people to use different types of statistics to describe the properties/characteristics of different things. For example, statistics used to indicate the characteristics of a boxer include height, age, and arm-length. In educational measurement, the statistics used to indicate the characteristics of an item are item difficulty and the discrimination index. The following example will clarify the concept. The following item was used in a Mathematics Test for Standard Three:

Example: 28 + 45 =
A. 19
B. 63
C. 73

Item Difficulty is .45
Discrimination Index is .28

Item Difficulty

Item difficulty is a figure that indicates how difficult an item is. It is the ratio of the number of students who answer the item correctly to the total number of students who answer the item.
For example, suppose 100 students answer the item shown in the example above, and out of the 100 students only 45 students answer it correctly. Hence, the item difficulty of the item is 45/100, that is, the item difficulty is .45. In theory, item difficulty is a number that can take values from 0.00 to 1.00. An item difficulty of 0.00 means that the item is too difficult and no one has answered it correctly. On the other hand, an item difficulty of 1.00 shows that the item is too easy and every student who attempted the item got the right answer. However, in general, useful items have item difficulty in the range of .20 to .80.

Discrimination Index

The discrimination index is an item statistic which is more difficult to understand and also more difficult to estimate accurately. It is a number which indicates the extent to which high ability students answer the item (question) correctly and the extent to which low ability students answer the item wrongly. In theory, a discrimination index can take values from -1.00 to +1.00, but in practice its values generally do not go below -.40 and do not go above .70. An item with a high and positive discrimination index means that the majority of the students who answer the item correctly are those students with high ability and that the majority of students who answer the item wrongly are those students with low ability. In other words, items with a high discrimination index are able to discriminate or differentiate high ability students from low ability students. In contrast, items with low and positive discrimination indices (for example, +.05) are not able to differentiate high ability students from low ability students because approximately equal proportions of high ability students and low ability students answer the item correctly. As explained above, a discrimination index may take negative values. A negative discrimination index (for example, -.30) means high ability students answer the item incorrectly and low ability students answer the item correctly. It is clear, then, that items with a high and positive discrimination index are more useful than items with a low and positive discrimination index, and items with a negative discrimination index should not be included in the test.

APPENDIX C

DETERMINANTS OF ITEM DIFFICULTY

The frequency with which the same type of item appears in previous years' test papers is related to the difficulty of the item.
The time of the year when the topic was taught; items from topics taught earlier tend to be easier.
Items based on unfamiliar (obscure) aspects of the syllabus are more difficult.
Look for options that are too obvious; obvious options reduce item difficulty considerably.
Questions involving calculations are generally more difficult. If items involve calculations which are mechanical, the items tend to be easier.
Items requiring direct recall or testing a direct relationship are easier.
The way the stem is phrased often gives an indication of the difficulty of the item.
Items involving the "mole" concept and the "ratio" concept are generally more difficult.
Items requiring transfer of knowledge tend to be more difficult.
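To make the definitions in Appendix B concrete, the sketch below computes an item difficulty and one common form of discrimination index, the difference in proportion correct between an upper and a lower scoring group. The 27% grouping rule and the simulated responses are illustrative assumptions only; the index used elsewhere in this study (the point-biserial) is computed differently.

import numpy as np

rng = np.random.default_rng(1)
n_students, n_items = 200, 40

# Simulated 0/1 responses: higher ability makes a correct answer more likely.
ability = rng.normal(0, 1, n_students)
difficulty = rng.normal(0, 1, n_items)
prob = 1 / (1 + np.exp(-(ability[:, None] - difficulty[None, :])))
responses = (rng.random((n_students, n_items)) < prob).astype(int)

totals = responses.sum(axis=1)
cut = int(0.27 * n_students)            # assumed 27% upper/lower split
order = np.argsort(totals)
lower, upper = order[:cut], order[-cut:]

item = 0                                # examine the first simulated item
p_upper = responses[upper, item].mean()
p_lower = responses[lower, item].mean()

print("item difficulty (p):", responses[:, item].mean().round(2))
print("discrimination index (upper - lower):", round(p_upper - p_lower, 2))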
APPENDIX D

BAYESIAN APPROACH

The theoretical value of a correlation coefficient lies within the range of +1.00 and -1.00. In the framework of Bayesian theory (Box & Tiao, 1973; Iversen, 1984), if no prior knowledge about the distribution of the population correlation coefficient is available, it essentially means that the prior distribution of the population correlation coefficient $\rho$ is rectangular from -1.00 to +1.00, i.e., the unknown $\rho$ could be anywhere in that range. From the "anchor" items and the experience gained in the training sessions, the teachers were expected to narrow down, for each item, the range in which the item discrimination index would lie. This narrower range represents the information about the prior distribution of the point-biserial $\rho$. In a Bayesian framework, this prior distribution is translated to a normal distribution through the following equation:

$$\theta = \frac{1}{2}\ln\!\left(\frac{1+\rho}{1-\rho}\right) \qquad (1)$$

The variable $\theta$ is normally distributed. Thus, if the upper and lower limits of the range of the value of $\rho$ are denoted by $\rho_U$ and $\rho_L$ respectively, then the upper and lower limits of the range of $\theta$, namely $\theta_U$ and $\theta_L$, are computed as follows:

$$\theta_U = \frac{1}{2}\ln\!\left(\frac{1+\rho_U}{1-\rho_U}\right), \qquad \theta_L = \frac{1}{2}\ln\!\left(\frac{1+\rho_L}{1-\rho_L}\right)$$

The mean ($\mu$) of the prior distribution became:

$$\mu = \frac{\theta_U + \theta_L}{2}$$

The prior standard deviation became:

$$\sigma = \frac{\theta_U - \theta_L}{4}$$

The point-biserial ($r$) obtained from the tryout sample of 110 students was transformed onto the $\theta$ scale using equation (1), and this transformed value was denoted by $z$, where

$$z = \frac{1}{2}\ln\!\left(\frac{1+r}{1-r}\right)$$

With the observed discrimination index transformed to $z$, the random variable $z$ follows a normal distribution with variance

$$\operatorname{var}(z) = \frac{1}{n-3},$$

where $n$ is the size of the tryout sample. The information about the discrimination index given in the prior distribution, with mean $\mu$ and standard deviation $\sigma$, could be combined with the observed sample discrimination index by the following equation:

$$\mu'' = \frac{\dfrac{\mu}{\sigma^{2}} + \dfrac{z}{\operatorname{var}(z)}}{\dfrac{1}{\sigma^{2}} + \dfrac{1}{\operatorname{var}(z)}}$$

$\mu''$ represents the mean of the posterior distribution, which could be converted back to the $\rho$-scale. This is taken as the Bayesian estimate of the point-biserial of the item.

APPENDIX E

SAMPLE OF SCATTERPLOTS

[Figures E1-E6, delta plots for individual teachers with the estimated delta on the horizontal axis, are not reproduced here. Their captions are: Figure E1.--Delta Plot for Teacher No. 1 (Treatment Group, Form A); Figure E2.--Delta Plot for Teacher No. 2 (Treatment Group, Form A); Figure E3.--Delta Plot for Teacher No. 3 (Treatment Group, Form A); Figure E4.--Delta Plot for Teacher No. 4 (Control Group, Form A); Figure E5.--Delta Plot for Teacher No. 5 (Control Group, Form A); Figure E6.--Delta Plot for Teacher No. 6 (Control Group, Form A).]

APPENDIX F

TABLES OF MEANS OF ACCURACY

Table F1.--Mean of Accuracy on Different Content Areasa. (P-Value) Form Treatment Teacher A B CA1 CA2 CA3 CA4 CA1 CA2 CA3 CA4 T1 090 095 147 056 113 066 146 079 Trained T2 096 093 122 063 116 095 112 117 .
T3 077 105 114 094 112 092 119 095 In T4 105 110 132 099 114 077 116 111 Estimation Skills TS 086 092 123 095 112 102 113 092 T6 080 078 153 090 144 099 104 100 T7 071 102 151 079 121 093 141 070 T8 097 078 132 074 124 095 115 065 T9 101 095 143 082 106 110 101 076 T10 091 083 156 087 124 121 147 089 T11 080 084 145 109 125 108 110 079 T12 095 102 128 094 106 112 143 089 T13 089 105 148 085 121 105 126 093 T14 097 099 138 087 103 102 127 097 T15 104 104 154 109 111 088 144 107 T1 087 114 140 097 131 103 132 095 Not T2 100 131 133 102 144 108 130 114 Trained T3 119 143 140 130 142 102 125 087 in T4 083 104 120 098 123 089 129 123 Estimation 15 081 094 161 088 135 127 105 071 Skills T6 089 113 145 107 124 084 127 108 T7 095 113 146 086 132 080 131 094 T8 089 113 121 098 135 103 110 091 T9 118 099 130 085 150 082 136 115 T10 088 114 127 086 125 103 150 092 T11 086 105 146 112 129 106 123 101 T12 102 092 147 106 123 116 132 088 T13 101 101 145 094 113 111 132 080 T14 087 085 136 103 130 096 122 091 T15 115 112 153 089 119 105 132 096 8Values in table have been multiplied by 10 . CA1 = Chemical structure; CA2 = Electricity 8 Energy CA3 Rates 8 Equilibrium; CA4 8 Descriptive Chemistry. 178 179 Table F2.--Hean of Accuracy on Different Difficulty Levelsa. (P-Value) Form Treatment Teacher A B 0A1 DAZ DA3 DA1 0A2 DA3 T1 119 089 077 134 080 082 Trained T2 103 D95 078 107 110 113 _ T3 141 084 076 146 081 081 Esti:ation T4 131 120 077 122 089 099 Skills TS 113 090 088 124 086 101 T6 095 D87 097 117 108 120 T7 123 097 073 127 088 103 T8 119 076 083 145 075 080 T9 100 109 090 120 074 100 T10 106 092 096 128 114 119 T11 154 073 078 162 072 080 T12 152 072 099 147 065 109 T13 151 086 085 172 072 081 T14 149 087 081 140 104 072 T15 143 102 100 138 109 085 T1 138 096 099 180 086 074 Not T2 173 112 074 162 107 103 Trained T3 139 139 123 149 081 113 in T4 128 085 095 145 099 097 Estimation T5 124 090 092 136 093 107 Skills T6 137 110 087 141 086 100 T7 130 113 083 144 107 079 T8 141 103 075 170 088 072 T9 106 109 102 172 084 102 T10 149 095 074 165 104 078 T11 157 094 082 175 073 088 T12 153 081 097 174 082 080 T13 153 084 095 160 068 087 T14 163 065 081 171 063 087 T15 141 107 100 144 108 085 3 8Values in table have been multiplied by 10 . DA1 = Easy Item; 0A2 = Medium Items; DA3 = Hard Items. 180 Table F3.--Hean of Accuracy on Different Cognitive Levelsa. (P-Value) Form Treatment Teacher A B CN1 CNZ CN3 CN1 CNZ CN3 T1 102 091 085 117 095 087 Trained T2 097 089 090 123 113 084 . T3 092 115 075 119 109 077 Esti2ation T4 106 116 106 114 108 084 Skills TS 083 117 078 130 088 102 16 084 111 074 129 109 107 T7 087 118 076 120 105 093 T8 093 102 063 119 096 092 T9 096 119 077 120 099 072 T10 093 106 088 116 131 109 T11 088 114 081 122 105 092 T12 103 118 073 126 107 097 T13 090 122 093 133 109 087 T14 108 101 094 132 100 078 115 096 122 123 103 111 125 T1 093 121 112 126 117 104 Not T2 121 141 071 146 122 102 Trained T3 142 136 117 134 111 105 in T4 084 114 101 131 113 098 Estimation TS 086 115 096 111 119 109 Skills T6 108 132 077 125 109 096 T7 108 118 091 121 106 106 T8 093 124 092 137 106 088 T9 116 106 090 148 119 094 T10 099 108 105 129 111 112 111 097 128 089 134 119 085 T12 110 104 102 130 114 098 T13 095 127 087 127 111 078 T14 097 109 074 127 114 083 T15 104 124 115 138 116 069 8Values in table have been multiplied by 103. CN1 = Knowledge; CN2 = Comprehension; CN3 = Application. 181 Table F4.--Hean of Accuracy on Different Discrimination Levelsa. 
(P-Value) Form Treatment Teacher A B DL1 DL2 DL3 DL1 0L2 DL3 T1 117 088 081 110 110 078 Trained T2 094 096 088 123 099 108 . T3 107 115 074 108 127 071 Esti:ation T4 123 104 105 097 106 112 Skills T5 084 119 085 112 110 091 T6 090 099 088 110 116 123 T7 105 096 092 101 122 098 T8 096 092 083 107 115 083 T9 095 112 096 101 108 088 T10 103 094 095 108 130 125 T11 110 108 077 118 121 080 T12 116 103 091 124 110 097 T13 116 104 093 134 127 065 T14 121 086 101 123 109 082 T15 121 124 096 105 131 093 T1 128 115 088 133 127 083 Not T2 135 116 105 145 111 123 Trained T3 142 132 129 112 130 108 in T4 109 103 090 122 118 104 Estimation T5 109 093 099 111 124 104 Skills T6 103 124 106 122 109 102 T7 129 099 100 115 116 100 T8 106 119 092 130 123 077 T9 082 125 109 136 125 105 T10 120 091 101 126 130 090 T11 119 118 090 138 131 071 T12 122 086 110 130 125 087 T13 136 100 087 122 114 087 T14 136 089 071 129 128 069 T15 130 102 123 135 100 105 8Values in table have been multiplied by 103. DL1 Low Discrimination; DLZ = Medium Discrimination; DL3 = High Discrimination. 182 Table F5.--Hean of Accuracy on Different Item Typea. (P-Value) Form Treatment Teacher A 8 SA HA SA MA T1 077 122 101 101 Trained T2 078 116 103 122 _ T3 094 102 105 105 Esti:ation T4 105 119 101 111 Skills T5 092 102 102 113 T6 091 094 118 113 T7 095 100 103 116 T8 075 114 106 099 T9 103 097 099 103 T10 090 109 135 098 T11 090 109 116 096 T12 092 120 115 105 T13 098 112 109 119 T14 093 118 102 114 T15 109 119 108 117 T1 097 128 118 115 Not T2 108 133 127 125 Trained T3 122 154 116 120 in T4 093 111 117 114 Estimation T5 094 110 116 111 Skills T6 108 116 114 108 T7 091 138 112 110 T8 098 117 111 117 T9 102 113 122 125 T10 098 113 120 114 T11 103 116 118 114 T12 102 113 117 115 T13 097 122 111 106 T14 077 130 114 109 T15 105 131 123 097 8Values in table have been multiplied by 103. SA = Single-answer type; HA 2 Multiple-answer type. 183 Table F6.--Hean of Accuracy on Different Content Areasa. (Point Biserial) Form Treatment Teacher A 8 CA1 CA2 CA3 CA4 CA1 CA2 CA3 CA4 T1 082 099 107 086 072 082 093 091 Trained T2 087 106 132 093 100 086 085 061 . T3 072 087 145 103 105 059 107 089 in . T4 079 115 108 106 084 095 116 074 Estimation Skills T5 092 133 118 106 098 073 081 086 T6 061 105 140 084 082 081 105 075 T7 062 097 111 100 084 072 117 086 T8 065 103 137 051 075 100 098 091 T9 072 094 105 111 092 063 107 099 T10 093 105 116 054 068 083 076 058 T11 060 107 134 075 064 077 084 076 T12 112 114 139 040 085 098 095 109 T13 065 109 101 087 062 064 098 083 T14 107 106 119 079 091 100 108 123 T1 092 136 164 109 099 084 089 091 Not T2 106 096 . 109 079 075 089 089 063 Trained T3 098 088 120 097 089 078 092 088 in T4 086 094 104 075 075 099 081 064 Estimation T5 069 096 145 079 110 137 108 116 Skills T6 079 101 118 114 075 083 092 075 T7 069 108 120 082 089 073 079 114 T8 098 095 130 112 103 079 099 073 T9 038 132 110 071 097 067 103 092 T10 091 085 128 086 090 058 072 067 T11 069 084 094 078 072 072 100 064 T12 070 123 134 118 084 113 101 107 T13 099 124 074 095 097 119 074 091 T14 065 121 100 105 094 097 100 071 _-_--____-____--_-------_-_-_--_-_------_-_§ __________________________ 8Values in table have been multiplied by 10 . CA1 = Chemical Structure; CA2 8 Electricity 8 Energy; CA3 Rates 8 Equilibrium; CA4 = Descriptive Chemistry. 184 Table F7.--Mean of Accuracy on Different Difficulty Levelsa. (Point-Biserial) Form Treatment Teacher A 8 DA1 0A2 DA3 DA1 DAZ DA3 T1 084 079 122 089 073 082 Trained T2 089 090 132 081 090 086 . 
T3 087 085 116 102 081 085 ‘n T4 082 099 128 089 091 093 Estimation Skills T5 135 116 096 096 076 083 T6 104 089 096 080 091 086 T7 074 086 112 086 082 096 T8 090 088 086 068 096 106 T9 098 079 109 111 085 068 T10 089 086 108 069 087 061 T11 101 090 089 075 072 073 T12 086 114 101 076 097 115 T13 109 077 099 068 067 085 T14 080 114 107 115 090 101 T1 138 128 105 094 111 075 Not T2 101 091 102 055 110 082 Trained T3 100 093 101 082 082 095 in T4 064 089 115 085 070 082 Estimation T5 120 084 081 110 144 106 Skills T6 129 091 089 051 103 095 T7 121 108 053 077 096 093 T8 109 094 114 099 077 092 T9 088 092 098 095 081 090 T10 107 078 101 093 077 051 T11 086 068 094 106 062 055 T12 117 096 124 093 084 118 T13 080 108 123 091 113 090 T14 127 O88 094 084 105 089 8values in table have been multiplied by 103. DA1 = Easy Item; DA2 = Medium Items; DA3 = Hard Items. 185 Table F8.--Mean of Accuracy on Different Cognitive Levelsa. (Pointosiserial) Form Treatment Teacher A 8 CM1 CM2 CM3 CM1 CM2 CN3 T1 089 097 094 095 074 079 Trained T2 110 102 091 069 097 091 _ T3 100 098 082 098 091 078 Esti:ation T4 096 106 110 077 105 088 Skills TS 116 118 108 095 079 084 T6 100 091 095 076 102 067 T7 084 095 094 092 092 074 T8 098 093 063 077 090 107 T9 104 096 072 092 096 072 T10 092 090 102 067 078 066 T11 101 093 079 072 066 091 T12 097 095 125 099 086 107 T13 108 095 061 081 080 050 T14 105 110 086 103 102 105 T1 136 119 113 087 095 094 Not T2 098 096 098 075 079 085 Trained T3 103 095 091 087 091 078 in T4 095 O95 070 074 089 071 Estimation T5 096 093 088 123 111 122 Skill T6 119 090 089 076 070 108 T7 109 086 086 101 083 077 T8 100 109 103 091 088 092 T9 095 097 083 083 095 088 T10 104 088 084 096 067 053 T11 082 076 085 100 069 051 T12 125 106 094 114 095 084 T13 105 099 113 091 111 078 T14 113 105 073 075 104 092 8Values in table have been multiplied by 103. CN1 = Knowledge; CNZ = Comprehension; CM3 = Application. 186 Table F9.--Mean of Accuracy on Different Discrimination Levelsa. (Point-Biserial) Form Treatment Teacher A 8 DL1 0L2 DL3 DL1 DL2 DL3 T1 100 104 079 085 078 085 Trained T2 104 094 108 082 099 072 . T3 097 089 099 101 097 068 Esti2ation T4 097 107 104 098 101 068 Skills T5 130 101 115 073 088 099 T6 108 083 096 077 078 105 T7 100 094 080 102 072 093 T8 106 065 094 072 095 103 T9 089 096 095 113 097 049 T10 100 074 105 064 062 093 T11 106 071 101 064 067 095 T12 102 072 129 096 088 104 T13 103 080 094 094 062 064 T14 100 092 114 116 114 072 T1 155 097 123 104 093 074 Not T2 097 081 112 079 071 090 Trained T3 095 108 090 100 076 084 in T4 066 090 108 083 087 065 Estimation T5 137 079 069 100 125 131 Skills T6 106 105 092 069 081 095 T7 115 073 098 087 074 108 T8 113 103 098 109 097 057 T9 088 072 114 093 093 080 T10 113 074 094 089 063 070 T11 097 074 073 112 069 040 T12 130 091 111 099 095 106 T13 095 104 113 095 097 097 T14 115 097 092 084 090 103 aValues in table have been multiplied by 103. DL1 = Low Discrimination; 0L2 8 Medium Discrimination; DL3 = High Discrimination. 187 Table F10.--Mean of Accuracy on Different Item-Type'. 
(Point-Biserial) Form Treatment Teacher A 8 SA MA SA MA T1 094 092 074 095 Trained T2 106 096 085 086 _ T3 098 090 088 095 Esti2ation T4 107 095 092 089 Skills T5 118 111 085 087 T6 097 092 086 083 T7 107 064 096 075 T8 091 083 083 099 T9 100 082 094 082 T10 082 112 075 065 T11 095 089 077 068 T12 096 113 096 094 T13 094 089 075 072 T14 102 104 102 105 T1 127 ‘ 120 087 101 Not T2 090 110 083 073 Trained T3 086 115 086 089 in T4 090 089 078 082 Estimation T5 085 106 116 120 Skills T6 101 101 075 090 T7 096 092 085 093 T8 109 096 080 108 T9 097 087 092 086 T10 088 102 070 080 T11 078 084 081 067 T12 122 090 097 104 T13 110 096 093 102 T14 112 082 096 084 aValues in table have been multiplied by 103. SA = Single-answer type; MA = Multiple-answer type. APPENDIX G ANOVA TABLES Table Gl.--Repeated Measures (Form 8 Content) By Treatment. (P-Value Estimation) Effect Between Subjects Treatment Subj w. Treat. Within Subjects Form Treat. x Form Form x Subj w. Treat. Content Treat. x Content Content x Subj w.Treat. Form x Content Treat. x Form x Content Form x Content x Subj w. Treat. df 28 84 Mean Sq. F 16.68 Sig. of F .000 O 013 .649 .000 .309 .000 .207 188 189 Table GZ.—-Repeated Measures (Form 8 Difficulty Level) By Treatment. (P-Value Estimation) Effect Between Subjects Treatment Subj w. Treat. Within Subjects Form Treat. x Form Form x Subj w. Treat. Difficulty Level Treat. x Diff. Level Diff. x Subj w. Treat. Form x Diff. Level x Form x Diff. Treat. Form x Diff. x Subj w. Treat. df 28 56 Mean Sq. F 20.03 Sig. of F .000 .022 .29 .000 .037 .000 .057 190 Table G3.--Repeated Measures (Form 8 Discrimination) By Treatment. (P-Value Estimation) Effect Between Subjeets Treatment Subj w. Treat. Within Subjects Form Treat. x Form Form x Subj w. Treat. Discrimination x Discrim. Treat. Discrim. x Subj w. Treat. Form x Discrim. Treat. x Form x Discrim. Form x Discrim. x Subj w. Treat. df 28 56 Mean Sq. F 19.49 14.41 .68 Sig. of F .000 .000 .416 .000 .169 .004 .419 191 Table G4.--Repeated Measures (Form 8 Cognitive Level) By Treatment. (P-Value Estimation) Effect Between Subjects Treatment Subj w. Treat. Within Subjects Form Treat. x Form Form x Subj w. Treat. Cognitive Level Treat. x Cogn. Level Cogn. Level x Subj w. Treat. Form x Cogn. Level Treat. x Form x Cogn. Level Form x Cogn. Level x Subj w. Treat. df 28 56 Mean Sq. 43.27 F 17.84 28.16 .71 58.25 .29 42.46 .51 Sig. of F .000 .000 .408 .000 .749 .000 .604 192 Table GS--Repeated Measures (Form 8 Item Type) By Treatment (P-Value Estimation) Effect Between Subjects Treatment Subj w. Treat. Within Subjects Form Treat. x Form Form x Subj w. Treat. Item Type Treat. x Item Type Item Type x Subj w. Treat. Form x Item Type Treat. x Form x Item Type Form x Item Type x Subj w. Treat. df 28 28 Mean Sq. 21.34 .97 .72 27.46 .02 .75 .87 F 21.94 12.98 36.48 .02 Sig. of F .000 .001 .274 O 000 .884 .000 .171 Table G6.--Repeated Measures (Form 8 Content) By Treatment. 193 (Point-Biserial Estimation) Effect Between Subjects Treatment Subj w. Treat. Within Subjects Form Treat. x Form Form x Subj w. Treat. Content Treat. x Cont. Cont. x Subj w. Treat. Form x Cont. Treat. x Form x Cont. Form x Cont. x Subj w. Treat. df 26 78 Mean Sq. F .28 25.31 .00 26.40 12.07 .95 Sig. of F .603 C 000 1.000 .000 .304 .000 .422 194 Table G7.--Repeated Measures (Form 8 Difficulty Level) By Treatment. (Point-Biserial Estimation) Effect df Mean Sq. F Sig. of F Between Subjects Treatment 1 3.09 .88 .357 Subj w. Treat. 26 3.51 Within Subjects Form 1 53.04 28.16 .000 Treat. 
BIBLIOGRAPHY

Angoff, W. H. (1971). Scales, norms, and equivalent scores. In R. L. Thorndike (Ed.), Educational measurement (2nd ed.). Washington, DC: American Council on Education.

Beach, B. H. (1975). Expert judgment about uncertainty: Bayesian decision making in realistic settings. Organizational Behavior and Human Performance, 14, 10-59.

Bejar, I. I. (1981). Subject matter experts' assessment of item statistics (Report No. RR-81-47). Princeton, NJ: Educational Testing Service.

Berk, R. A. (1986). A consumer's guide to setting performance standards on criterion-referenced tests. Review of Educational Research, 56(1), 137-172.

Bernknopf, S. (1979, April). A defensible model for determining a minimal cut-off score for criterion-referenced tests. Paper presented at the Annual Meeting of the National Council on Measurement in Education, San Francisco, CA.

Blumberg, P., Alschuler, M. D., & Rezmovic, V. (1982). Should taxonomic levels be considered in developing examinations? Educational and Psychological Measurement, 42, 1-7.

Billings, R. S., & Schaalman, M. L. (1980). Administrators' estimations of the probability of outcomes of school desegregation: A field test of the availability heuristic. Organizational Behavior and Human Performance, 26, 97-114.

Binning, J. F., & Fernandez, G. (1986, August). Heuristic processes in ratings of leader behavior: Assessing item-induced availability biases. Paper presented at the Annual Convention of the American Psychological Association, Washington, DC.

Box, G. E. P., & Tiao, G. C. (1973). Bayesian inference in statistical analysis. Reading, MA: Addison-Wesley.

Campbell, A. C. (1961). Some determinants of the difficulty of non-verbal classification items. Educational and Psychological Measurement, 21, 899-913.

Campbell, D. T., & Stanley, J. C. (1963). Experimental and quasi-experimental designs for research on teaching. In N. L. Gage (Ed.), Handbook of research on teaching. Chicago: Rand McNally.

Chase, C. I. (1964). Relative length of option and response set in multiple choice items. Educational and Psychological Measurement, 24, 861-866.
Chase, C. I. (1974). Measurement for educational evaluation. Reading, MA: Addison-Wesley Publishing Company.

Crawford, W. R. (1968). Item difficulty as related to the complexity of intellectual processes. Journal of Educational Measurement, 5(2), 103-107.

Dudycha, A. L., & Carpenter, J. B. (1973). Effects of item format on item discrimination and difficulty. Journal of Applied Psychology, 58(1), 116-121.

Dunn, T. F., & Goldstein, L. G. (1959). Test difficulty, validity, and reliability as functions of selected multiple-choice item construction principles. Educational and Psychological Measurement, 19, 171-179.

Ebel, R. L. (1979). Essentials of educational measurement (3rd ed.). Englewood Cliffs, NJ: Prentice-Hall, Inc.

Fitzpatrick, A. R. (1984, April). Social influence in standard setting: The effect of group interaction on individuals' judgments. Paper presented at the annual meeting of the American Educational Research Association, New Orleans, LA.

Forsyth, R. A., & Spratt, K. F. (1980). Measuring problem solving ability in mathematics with multiple-choice items: The effect of item format on selected item and test characteristics. Journal of Educational Measurement, 17(1), 31-43.

Glass, G. V., & Hopkins, K. D. (1984). Statistical methods in education and psychology (2nd ed.). Englewood Cliffs, NJ: Prentice-Hall, Inc.

Goodman, B. C. (1972). Action selection and likelihood ratio estimation by individuals and groups. Organizational Behavior and Human Performance, 7, 121-141.

Green, K. E. (1983). Subjective judgment of multiple-choice item characteristics. Educational and Psychological Measurement, 43, 563-570.

Green, K. E. (1984). Effects of item characteristics on multiple-choice item difficulty. Educational and Psychological Measurement, 44, 551-561.

Guilford, J. P. (1954). Psychometric methods (2nd ed.). New York: McGraw-Hill Book Co.

Hackman, J. D. (1982, May). v ' f ' ' ut'onal esea c e 5° ' co 't ve h es ch. Paper presented at the annual forum of the Association for Institutional Research, Denver, CO.

Hughes, H. H., & Trimble, W. E. (1965). The use of complex alternatives in multiple-choice items. Educational and Psychological Measurement, 25(1), 117-126.

Iversen, G. R. (1984). Bayesian statistical inference. Beverly Hills, CA: Sage Publications.

Kahneman, D., & Tversky, A. (1973). On the psychology of prediction. Psychological Review, 80, 237-251.

Kirk, R. E. (1982). Experimental design (2nd ed.). Monterey, CA: Brooks/Cole Publishing Company.

Leary, L. F., & Dorans, N. J. (1985). Implications for altering the context in which test items appear: A historical perspective on an immediate concern. Review of Educational Research, 55(3), 387-413.

Levi, A. S., & Pryor, J. B. (1985, August). Mediators of the availability heuristic in probability estimates of future events. Paper presented at the Annual Convention of the American Psychological Association, Los Angeles, CA.

Lorge, I., & Diamond, L. K. (1954a). The value of information to good and poor judges of item difficulty. Educational and Psychological Measurement, 14, 29-33.

Lorge, I., & Diamond, L. K. (1954b). The prediction of absolute item difficulty by ranking and estimating techniques. Educational and Psychological Measurement, 14, 365-372.

Lorge, I., & Kruglov, L. (1952). A suggested technique for the improvement of difficulty prediction of test items. Educational and Psychological Measurement, 12, 554-561.
Lorge, I., & Kruglov, L. (1953). The improvement of estimates of test difficulty. Educational and Psychological Measurement, 13, 34-46.

Malpas, A. J., & Brown, M. (1974). Cognitive demand and difficulty of GCE O-level mathematics pretest items. British Journal of Educational Psychology, 44, 155-161.

Marascuilo, L. A., & McSweeney, M. (1977). Nonparametric and distribution-free methods for the social sciences. Monterey, CA: Brooks/Cole Publishing Company.

Mehrens, W. A., & Lehmann, I. J. (1984). Measurement and evaluation in education and psychology. New York: Holt, Rinehart and Winston, Inc.

Mehrens, W. A., & Lehmann, I. J. (1987). Using standardized tests in education (4th ed.). New York: Longman.

Melican, G., & Thomas, N. (1984, April). Identification of items that are hard to rate accurately using Angoff's standard setting method. Paper presented at the Annual Meeting of the American Educational Research Association, New Orleans, LA.

Millman, J. (1978). Determinants of item difficulty: A preliminary investigation (CSE Report No. 114). Los Angeles, CA: Center for the Study of Evaluation, UCLA Graduate School of Education.

Mitchell, K. J. (1983). Cognitive processing determinants of item difficulty on the verbal subtests of the Armed Services Vocational Aptitude Battery. Arlington, VA: Army Research Institute for the Behavioral and Social Sciences.

Nedelsky, L. (1954). Absolute grading standards for objective tests. Educational and Psychological Measurement, 14, 3-19.

Nisbett, R. E., & Borgida, E. (1975). Attribution and the psychology of prediction. Journal of Personality and Social Psychology, 32, 932-943.

Nitko, A. J. (1983). Educational tests and measurement: An introduction. New York: Harcourt Brace Jovanovich.

Norusis, M. J. (1988). SPSS/PC+ Advanced Statistics V2.0. Chicago, IL: SPSS Inc.

Plake, B. S., & Huntley, R. M. (1984). Can relevant grammatical cues result in invalid test items? Educational and Psychological Measurement, 44, 687-696.

Pollitt, A., Entwistle, N., Hutchinson, C., & De Luca, C. (1985). What makes exam questions difficult? Edinburgh: Scottish Academic Press.

Quereshi, M. Y., & Fisher, T. L. (1977). Logical versus empirical estimates of item difficulty. Educational and Psychological Measurement, 37, 91-100.

Report on the second national seminar on test management system. (1982). Kuala Lumpur: The Examinations Syndicate, Ministry of Education, Malaysia.

Ryan, J. J. (1968). Teacher judgments of test item properties. Journal of Educational Measurement, 5(4), 301-306.

Scheuneman, J. D., & Steinhaus, K. S. (1987). A theoretical framework for the study of item difficulty and discrimination (Report No. RR-87-44). Princeton, NJ: Educational Testing Service.

Simpson, D. E., & Cohen, E. B. (1985). Problem solving questions for multiple-choice tests: A method for analyzing the cognitive demands of items. Paper presented at the annual meeting of the American Educational Research Association, Chicago, IL.

Strang, H. R. (1977). The effects of technical and unfamiliar options on guessing on multiple-choice test items. Journal of Educational Measurement, 14, 253-259.

Tinkelman, S. (1947). Difficulty prediction of test items. Teachers College Contributions to Education, No. 941. New York: Bureau of Publications, Teachers College, Columbia University.

Tollefson, N., & Chen, J. S. (1986). A comparison of item difficulty and item discrimination of multiple-choice items using "none of the above" options. Paper presented at the Annual Meeting of the Midwest Educational Research Association, Chicago, IL.

Tversky, A., & Kahneman, D. (1974). Judgment under uncertainty: Heuristics and biases. Science, 185, 1124-1131.
Whitely, S. E. (1980). Modeling aptitude test validity from cognitive components. Journal of Educational Psychology, 72, 750-769.

Whitney, D. R., & Board, C. (1972). The effect of selected poor item writing practices on test difficulty, reliability and validity. Journal of Educational Measurement, 9(3), 225-233.

Willoughby, T. L. (1980). Reliability and validity of a priori estimates of item characteristics for an examination of health science information. Educational and Psychological Measurement, 40, 1141-1145.

Winkler, R. L. (1968). The consensus of subjective probability distributions. Management Science, 15, B61-B75.

Winkler, R. L. (1971). Probabilistic prediction: Some experimental results. Journal of the American Statistical Association, 66, 675-685.