This is to certify that the thesis entitled

A COMPARISON OF ITEM CHARACTERISTIC CURVE MODELS FOR A CLASSROOM EXAMINATION SYSTEM

presented by James Bruce Douglass has been accepted towards fulfillment of the requirements for the Ph.D. degree in Counseling and Educational Psychology / Measurement, Evaluation and Research Design.

Major professor

July 24, 1980

A COMPARISON OF ITEM CHARACTERISTIC CURVE MODELS FOR A CLASSROOM EXAMINATION SYSTEM

By

James Bruce Douglass

A DISSERTATION

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of DOCTOR OF PHILOSOPHY

Department of Counseling and Educational Psychology
College of Education
1980

© Copyright by James Bruce Douglass 1980

ABSTRACT

A COMPARISON OF ITEM CHARACTERISTIC CURVE MODELS FOR A CLASSROOM EXAMINATION SYSTEM

By James Bruce Douglass

Item characteristic curve (ICC) models offer the potential of producing a measurement of a person's ability or achievement that does not depend on the particular sample of items in a test or on the particular examinees upon which test analysis is based. The present study is designed to provide methods and results relevant to the introduction of item characteristic curve models into classroom achievement testing. The overall objective is to compare several common ICC models for item calibration and test equating in a classroom examination system.

Parameters for the one-, two- and three-parameter logistic ICC models were estimated for 100-item final examinations taken over the course of four academic terms by 594 to 1082 examinees in an introductory college-level Communications course. The three-parameter model provided unacceptable lower asymptotes for this data set and was given no further consideration. One of the four tests was chosen to provide a base scale to which items from the other examinations were calibrated using the one- and two-parameter models. There were 43 to 53 items in common between the base scale examination and each of the other three. Random samples of 200, 600 and 800 examinees were selected from a total sample of 1082 examinees to investigate the impact of examinee sample size upon item calibration and test equating. Rasch calibration and two methods of two-parameter model calibration were investigated. One method of two-parameter model calibration was based on item difficulty means and standard deviations, and the other was based on mean item difficulties and ratios of item discriminations. Calibrations based on the sample of 200 examinees and all 43 items in the calibration section were quite divergent across methods and also diverged from the calibrations based on the other samples.

The calibration constants estimated from the Rasch model and the two methods using the two-parameter model were used with a true-score method to equate the examinations to a common scale. The Rasch model equatings were very consistent across all sample sizes investigated. Both of the two-parameter methods provided equatings that were less consistent. Consistency of equating provides only indirect evidence of equating accuracy.
To provide a criterion for equating, a test was equated to itself using three separate sets of examinee samples selected from the 1082 examinees in the largest total sample. The three sets represented random, high-low and very high-very low examinee ability differences. Although none of the three methods was uniformly best, the Rasch model provided the most acceptable equating in general.

The stability of Rasch calibrating links, and hence of ability estimation, based on 16 class sections ranging in size from 49 to 66 examinees was investigated for a 43-item test taken from a total test of 100 items. Calibration links were calculated for subsets of the 43 items ranging in size from 7 to 37 in increments of five items. Even with 37 items in the calibrating link, it was found that there was an average bias in ability estimation equivalent to 1/5 of a standard deviation of the ability estimates based on all 43 items.

For this testing system, it is concluded that the Rasch model provided test equatings as good as or better than were provided by either two-parameter method. There was, however, evidence that Rasch item difficulties were less stable across examinations and academic terms than they were across different examinee samples taken from a single test administration. It is recommended that the long-term stability of parameter estimates, and the scales based upon them, be systematically monitored and studied in testing systems using ICC models.

DEDICATION

To my parents Robert E. Douglass and Bette C. Douglass and my sisters Barbara and Karen

ACKNOWLEDGMENTS

A number of people contributed to the successful completion of this dissertation. I would first of all like to thank my dissertation committee for their excellent assistance. Dr. Robert L. Ebel, the Committee Chairman, has served as my mentor for the past five years. I am truly grateful for the benefit of his wisdom, experience and dedication to the profession. Dr. LeRoy Olson has contributed his time freely to the dissertation study and to some of the more practical aspects of my education in measurement. Dr. Neal Schmitt has contributed a number of insights derived from his work in industrial and organizational psychology. I owe much of my original training in statistics to Dr. William Schmidt.

Several people outside of Michigan State University have contributed substantially to this study. I am grateful to Dr. Ronald Hambleton of the University of Massachusetts for his support for and encouragement of the project and for providing a version of the LOGIST computer program. Dr. Gary Marco of the Educational Testing Service provided me with my first opportunity to work with item characteristic curve models and has been a continuing and valuable source of advice. I would also like to thank Dr. Frederick Lord of the Educational Testing Service for his help with the item calibration problem.

I would also like to thank several other persons at Michigan State University. Dr. Cassandra Book, who directed Communication 100, enthusiastically supported the project from the beginning. Eric Eisenberg coordinated the training of Communication 100 instructors and the student testing. Debra Bennett did much of the computer programming during the final phases of the study, saving me many weeks of work. I also thank Karen Tilson for her excellent and patient typing of the several versions of the manuscript.
Finally, I wish to acknowledge the support provided by the Educational Development Program, the Communication Department and the Learning and Evaluation Service at Michigan State University.

TABLE OF CONTENTS

Chapter
I. STATEMENT OF THE PROBLEM
      Objectives
      Historical Perspective
II. COMPARING MODELS
      An Overview of Model Comparisons
      A Process for Comparing Models
      Previous Model Comparisons
III. METHOD
      Sample
      Models
      Estimation of Model Parameters
      Scaling Model Parameter Estimates
      Equating
      Procedure
IV. RESULTS
      Descriptive Data
      Parameter Estimation
      Calibration
         Rasch Model
         Two-parameter Model
      Test Equating and Ability Scaling
         Study 1
         Study 2
         Study 3
V. SUMMARY, CONCLUSIONS AND RECOMMENDATIONS
      Summary
      Conclusions and Discussion
         Parameter Estimation
         Calibration
         Test Equating and Ability Estimation
      Recommendations
APPENDIX
LIST OF REFERENCES

LIST OF TABLES

Table
4.1 Descriptive Data for Study 1
4.2 Descriptive Data for Study 2
4.3 Descriptive Data for Study 3
4.4 Two-parameter Item Difficulty (b) Statistics for Study 1
4.5 Two-parameter Item Difficulty (b) Statistics for Study 2
4.6 Rasch Model Calibration for Study 1
4.7 Rasch Model Calibration for Study 2
4.8 Two-parameter Model Calibration for Study 1
4.9 Two-parameter Model Calibration for Study 2
4.10 Equating Inconsistency Across Models
4.11 Equating Inconsistency Within Models
4.12 Equating Error Across Models
4.13 Rasch Ability Estimates for Study 3
4.14 Rasch Ability Estimate Error Due to Calibration for Study 3
A1 Difficulties (b) of Items Removed Before Calibration
LIST OF FIGURES

Figure
4.1 Spring 1979 Equating
4.2 Fall 1978 Equating
4.3 Winter 1979 Equating
4.4 Fall 1979 Equating
4.5 Fall 1979 Sample of 200 Equating
4.6 Fall 1979 Sample of 600 Equating
4.7 Fall 1979 Sample of 800 Equating
4.8 Rasch Equating Fall 1979 Samples
4.9 B Method Equating Fall 1979 Samples
4.10 A Method Equating Fall 1979 Samples
4.11 Rasch Random Halves
4.12 Rasch Low to High Half
4.13 Rasch Very Low to Very High Half
4.14 2 Par Random Halves
4.15 2 Par Low to High Half
4.16 2 Par Very Low to Very High Half
A1 F78 and S79 Diff Est
A2 W79 and S79 Diff Est
A3 F79 and S79 Diff Est
A4 F79 (N = 200) and S79 Diff Est
A5 F79 (N = 600) and S79 Diff Est
A6 F79 (N = 800) and S79 Diff Est
A7 F79 Random Halves Diff Est
A8 F79 Low-High Halves Diff Est
A9 F79 Very Low-High Halves Diff Est

CHAPTER I

STATEMENT OF THE PROBLEM

Item characteristic curve (ICC) models, also called latent trait or item response models, have drawn increasing interest in recent years from both psychometric researchers and practitioners (Hambleton and Cook, 1977; Wright, 1977; and Hambleton, et al., 1978). The primary appeal of ICC models lies in their potential to solve the problem of deriving comparable scores from different tests given to different people. For example, in Communication 100, an introductory course at Michigan State University, different instructors give different mid-term examinations to different sections. If one section receives lower scores than another, does it mean that the students in that section have less knowledge or that they received a harder test? It would be highly desirable to devise a common scale to which scores from each mid-term could be referenced. Although a common final examination is administered each term, the finals differ across terms. Again a common reference scale would be useful. Students' grades should not depend upon the difficulty of the particular tests they take. Nor should their grades depend on the grades of other students in their particular section. If the potential of ICC models to provide item statistics that are independent of particular examinees and ability scores that are independent of particular items can be realized in the classroom, then score comparability can be achieved in a classroom testing system.

Relatively little ICC research has focused on achievement testing and even less has been addressed to classroom examination. The present study is one of the first to systematically apply ICC models to a classroom examination system. There are two potential problems in applying ICC models to classroom examinations. The first problem is that if different teachers teach the same material in different ways, different factors are likely to result (Lord, 1980). In other words, teachers and teaching methods can affect the factor structure of examinations. Although a single factor is only a sufficient but not necessary condition for the ICC model assumption of unidimensionality (Lord and Novick, 1968), classroom achievement tests are likely to be less unidimensional than aptitude tests. Furthermore the dimensionality of achievement tests could vary from term to term and class to class depending on differences in instructors and instructional methods.

The second problem has been recognized by Bejar, Weiss and Kingsbury (1977). They note that aptitude test item pools can typically be constructed and refined to meet the unidimensionality assumption quite well.
However, a classroom achievement test must fairly represent the entire course content. There is much less freedom to delete areas of content in classroom tests than there is in aptitude tests.

Fortunately the question of interest is not whether the assumptions of ICC models are met. In practice, assumptions are never completely met. The practical questions are:

1. To what extent are the assumptions of the models met?

2. What effects do violations of the assumptions have on the desired properties of the models?

Robustness of the models to violations of their assumptions is a central concern. There is a problem resulting from the robustness issue which has been largely ignored in the ICC literature. As an illustration of the problem, consider Wright's (1977, p. 113) procedure for designing a "best test." His procedure involves only psychometric properties of the items in the test and guesses about the mean and dispersion of the abilities of persons to be tested. There is no mention of item content. Of course, if an item pool is truly unidimensional, then the content of any particular item may be irrelevant. But a model may give consistent and valid results either because its assumptions are met or because it is robust with respect to violations of its assumptions. A model may show evidence of fit despite a lack of unidimensionality. Therefore the content validity of a test is still a relevant concern in test development. The use of latent trait models with classroom achievement tests does not relieve the test constructor of the task of developing a test with a balanced representation of course content.

OBJECTIVES

The purpose of this study is to determine whether item characteristic curve models can be successfully applied to a classroom achievement testing system. The overall objective is to compare the one-, two-, and three-parameter logistic item characteristic curve models for item calibration and test equating in a large multi-section college course. The overall objective will be accomplished through the following specific objectives:

i. To obtain and examine parameter estimates for all three models.

ii. To calibrate items to a common scale using the one- and two-parameter models. Calibration constants will be estimated using different numbers of examinees. This will result in different calibrations which will be examined for consistency.

iii. To equate tests to a common scale using the one- and two-parameter models. Equating will be accomplished using the different calibrations mentioned in ii. This will result in test equatings that will be compared for consistency within and across models.

iv. To equate a test to itself using the one- and two-parameter models. The equating will be accomplished using samples of examinees that have different average levels of ability. The accuracy of equating will be compared for the various models.

v. To obtain ability estimates from different examinee sections using the one-parameter model on a set of items common to all sections. Ability estimates will be compared for each class section assuming that different numbers of items within the common set of items were previously calibrated. This simulates the situation of an instructor adding some of his/her own items to a test composed of previously calibrated items.

HISTORICAL PERSPECTIVE

Lord (1968, 1980) states that Lawley (1943) was the first to present a coherent test theory based on item characteristic curves.
Lawley assumed that there was no guessing, ability was normally distributed and that item intercorrelations were equal. Lazarsfeld (1950) presented a more general theory and may have been the first to use the term "latent traits" (Hambleton, et al., 1978). However, Lord's (1952, 1953a, 1953b) initial work with the ogive model is considered by many to be the real beginning of latent trait theory applied to mental tests. Lord himself does not use the term "latent trait theory". He originally spoke of item characteristic curve theory and now uses the term "item response theory" (Lord, in press). In Lord's early work with the normal ogive, he assumed no guessing but, unlike Lawley, did not assume a normal distribution of ability or equal item intercorrelations.

Lord (1980) stopped work for ten years on item characteristic curve models for three reasons:

1. Maximum likelihood estimators for all parameters were computationally impractical.

2. His early work assumed responses were either correct or incorrect. There was no provision for omitted or not reached items.

3. It was not clear that one can assume that the probability of correct response to an item can be represented as a normal ogive function of ability. In particular, if item-test regressions are not monotonic it would be difficult to fit any simple mathematical model (Lord, 1965).

It had been argued that item-test regressions might start at chance level, then decline before rising, because the worst examinees would answer randomly while slightly better students would choose plausible distractors. Another argument against monotonicity was that the very best examinees might do slightly worse than somewhat less able examinees. The contention was that the very best students might consider factors beyond the scope of the item and thus answer incorrectly. Lord (1965) investigated the monotonicity of item-test regressions using 103,275 examinees and 90 verbal and 60 mathematical items from the May 2, 1964 administration of the Scholastic Aptitude Test. He found that except for chance fluctuations, the item-test regressions were monotonic throughout the range of test scores.

The computational difficulties were greatly reduced by Lord's (1968) adoption of Birnbaum's (1968) three-parameter logistic model. This, coupled with advances in high speed computers, made parameter estimation feasible. The three-parameter model, unlike Lord's earlier two-parameter normal ogive model, assumes that guessing can occur. Lord (1974) also modified the likelihood function to handle omitted and not reached responses as well as correct and incorrect responses. The major impediments to progress on item characteristic curve models were thus removed.

Meanwhile an independent line of development in latent trait models was occurring. In Denmark, Rasch (1960) was working on several different measurement models based on properties he felt mental measurements should possess. One of these models has the mathematical form of a one-parameter logistic item characteristic curve model. This has become known as the Rasch model. Wright (1968) introduced the Rasch model in the United States and has since become its leading proponent.

Nineteen sixty-eight was a very significant year for item characteristic curve theory. In that year, a rigorous mathematical presentation of the three-parameter logistic model (Birnbaum, 1968) was published.
Lord (1968) publicly adopted Birnbaum's model and Wright (1968) introduced the Rasch model in the United States. The next major event occurred in 1977 with the publication of the summer issue of the Journal of Educational Measurement. The entire issue consisted of six invited papers on latent trait models. The issue represented the first attempt to present current developments in latent trait theory at a level understandable to the average measurement specialist. A year later a review of the latent trait literature appeared (Hambleton, et al., 1978). Perhaps the ultimate evidence of the acceptance of latent trait theory was the publication of two textbooks in the area (Wright & Stone, 1979 and Lord, in press). The popularization of latent trait theory combined with the availability of computer programs (Wood, Wingersky & Lord, 1976 and Wright, Mead & Bell, 1979) has led to a proliferation of research in the area.

CHAPTER II

COMPARING MODELS

The first section of this chapter presents a conceptual framework useful for analyzing model comparisons and studies of fit. A process by which models can be compared for the solution of practical testing problems is developed in the second section. The final section contains a review of previous latent trait model comparisons.

AN OVERVIEW OF MODEL COMPARISONS

Model comparisons and tests of fit can be categorized by the antecedent and consequent conditions investigated, by the anticipated generalizability of results and by the degree of artificiality in the situation or data. In their study of the Rasch model, Rentz and Bashaw (1975) distinguish between assumptions of the model, antecedent conditions derivable from the assumptions and consequent conditions derivable from specific objectivity. They associate tests of item fit with antecedent conditions such as unidimensionality, equal item discriminations and the absence of guessing. Others (e.g., Hambleton and Cook, 1977) classify these conditions as assumptions. For Rentz and Bashaw the consequent conditions of primary interest are the invariance of item parameter and ability estimates. Antecedent conditions are closely associated with test construction and consequent conditions are associated with analyzing existing tests.

Antecedent and consequent conditions can be further subdivided (cf. Douglass, 1979). Antecedent conditions can be categorized as assumptions or derivations from assumptions, as operational definitions of models defined by computer programs or estimation procedures, or as situational or design conditions. Assumptions of latent trait models have been discussed in the literature in some detail (Hambleton & Cook, 1977 and Lord & Novick, 1968). The particular models used in this study are discussed in Chapter III. Much less thought seems to have been given to how the models are operationalized. Different computer programs estimate the parameters of the models differently. For example, the LOGIST program is capable of estimating parameters of the one-, two- or three-parameter logistic model. The program uses item response data categorized as correct, incorrect, omitted or not reached. On the other hand, BICAL, which estimates parameters for the one-parameter or Rasch logistic model, only accepts item responses of correct or incorrect.
Although LOGIST and BICAL both estimate parameters of the one-parameter logistic model, somewhat different estimates could result if there are a significant number of omitted or not reached items in a test.

The procedure used to estimate parameters can make a real difference in the results obtained. Using simulated data, Ree (1979) compared four procedures for obtaining estimates for the three-parameter model. He found that none of the four procedures, LOGIST, ANCILLES, OGIVIA, and a transformation of the item-raw score biserial, was uniformly superior. The best procedure depended on the distribution of ability in the calibration sample, the intended use of the parameter estimates and the computer resources available.

A third type of antecedent condition is a situation or design condition. These conditions can reflect constraints in a testing situation of interest or they may be manipulated by the researcher or practitioner. Examples include shapes and ranges of ability distributions, sample sizes of examinees and test items, test content and item types. Situational conditions can affect the usefulness of models by affecting either the assumptions of the models or the goodness of the estimation procedures.

Consequent conditions can be classified by the degree to which they are related to some problem of interest. At one extreme are consequences predicted from the model that have little or no direct relation to any particular practical problem of interest. Some statistical tests of fit are based on this type of consequence. These tests of fit are of very limited usefulness in determining how well a model will work in any given practical situation of interest. Probabilities associated with tests of fit are highly dependent on sample size. Furthermore, there is rarely a known relationship between a fit statistic and a practical criterion of interest. However, this type of fit statistic may have some utility for comparing degrees of fit across models. Of course, fit statistics are quite appropriate when applied to antecedent conditions in an attempt to better meet assumptions. For example, tests of item fit can be very useful in building a test that is more likely to result in desired consequences.

At the other extreme are consequent conditions that are of direct interest for a problem under investigation. For example, if one is equating tests, consistency of equatings based on different examinee samples is a consequent condition of latent trait models that is of direct interest. The comparison of a predicted consequence of a model with an actual consequence may contain little or much useful information.

Studies of model fit and model comparison can also be categorized by the type of data and situations investigated. Data can be real or simulated and situations can be realistic or contrived. Generally the trade-off is between realism and control. With simulated data the parameters of the model are known. Ability and item distributions can be completely specified. Unfortunately the parameters and distributions may not accurately reflect real data sets. On the other hand, real data comes from some actual situation, but the true parameters are unknown. Of course, it is possible to have realistic simulations or unrealistic real data sets. Simulations can include many relevant variables and real data sets may come from a very contrived situation or from a very poor sampling plan.
When considering model comparisons and tests of fit, it can be helpful to consider how general the results are intended to be and how general they are in fact. Cronbach (1975) has argued that the search for generic laws and grand theories, so successful in the natural sciences, is doomed to failure in the behavioral sciences. He argues that the problem is not that human events are unlawful, but that the times are a relevant variable, so generalizations can not be stored up for assembly into a causal network. Cronbach feels that behavioral science will be better served if generalizations from research studies are treated as working hypotheses rather than as conclusions.

Cronbach's views have several implications for those attempting to apply models to human behavior, including item response behavior. One implication is that a model may work quite well in one situation and not at all well in another. It is necessary to be alert to local conditions that may be influencing one's results. A second implication is that it is always necessary to check to see if a model is working satisfactorily in any given situation of interest. Furthermore, since results may change over time, it is necessary to constantly check to see that a model continues to work satisfactorily. If science is thought of as a discipline where laws are determined and theories built, and engineering is thought of as the application of those laws and theories to practical situations, it may be that in the past those in the behavioral sciences have done too much poor science and too little good engineering. Cronbach (1975, p. 126) argues that we can realistically expect to achieve two goals in the behavioral sciences, "... to assess local events accurately, to improve short-run control ... (and) to develop explanatory concepts, concepts that will help people use their heads."

A PROCESS FOR COMPARING MODELS

A process developed by Douglass (1979) for comparing models will be outlined in this section and serves as the basis of the research strategy for this study. The process includes consideration of the three types of antecedent conditions mentioned previously and the effects of antecedent conditions on relevant consequent conditions, and focuses on the particular practical problem of interest in a situational context. The process can be thought of as an elaboration and extension of a strategy suggested by Lord and Novick (1968) for investigating the practical utility of a model. Lord and Novick (1968, p. 383) suggest the following four steps for investigating the practical utility of a model:

1. Estimate the parameters of the model, assuming it is true ...

2. Predict various observable results from the model, using the estimated parameters.

3. Consider whether the discrepancies between predicted results and actual results are small enough for the model to be useful ("effectively valid") for whatever practical application the investigator has in mind.

4. If in step 3 the discrepancies were considered too large, then it may be useful to compare them with the discrepancies to be expected from sampling fluctuations.

Douglass (1979) extends this procedure to include comparisons of different models for consistency, validity, and utility. The basic steps Douglass suggests are:

1. Identify the models to be compared.

2. Obtain different sets of parameter estimates for each model, assuming each is a true representation of reality.
These estimates should be obtained in the actual situation where the selected model will be used and should reflect any sources of error that will be operating in the practical setting. In addition to naturally occurring variation in antecedent conditions, variation can be introduced artificially to simulate the practical situation. For example, different samples can be investigated by splitting a large sample into sub-samples. This will result in discrepant estimates for each model.

3. Observe the effects of the discrepant estimates for each model on each consequent condition of interest.

4. If any given model does not give sufficiently consistent results, then the antecedent conditions should be investigated. Sources of inconsistency include sampling error or bias, estimation procedure error and violation of model assumptions.

5. Compare the results of the models which give consistent results or can be made to give consistent results to one another.

6. Investigate the patterns of convergence and divergence both between and within models along with any available criteria to aid in model selection.

PREVIOUS MODEL COMPARISONS

One of the first comparisons between latent trait models was made by Hambleton and Traub (1973) using data from two verbal tests and one mathematics aptitude test. Hambleton and Traub compared observed and expected test score distributions for the one- and two-parameter logistic models. For the one-parameter model, a person's test score was simply the sum of all items answered correctly. For the two-parameter model, the test score was the sum of the products of each correct item and its ICC discrimination estimate. Comparisons between the one- and two-parameter model were made using a chi-square test of fit applied to the observed and expected distributions of test scores. The comparison of observed and expected distributions of test scores is a good example of using a consequent condition that is predicted from a model but that is not directly related to any practical problem of interest. Of course, the limitation of such a comparison is that it only provides information on which model appears to be generally better. There is no way to determine how much better one model will be than another for any given practical testing problem.

Hambleton and Traub assumed that ability was normally distributed in their study. This allowed them to estimate item characteristic curve difficulties and discriminations using simple functions of classical item difficulties and item-test biserial correlations. The obvious
advantage of assuming a normal distribution of ability is that estimation is very inexpensive and does not require sophisticated computer programs such as LOGIST or BICAL. The disadvantage is that an unnecessary and restrictive assumption is added to the models.

Hambleton and Traub found that the two-parameter model predicted test performance better than the one-parameter model for each of the three sets of test data. The greatest difference was for a short test of 20 items that had a fairly large range of item discriminations (.74).

A recent study by Hutton (1980) is an extension of the work of Hambleton and Traub (1973). She examined the relationship between dimensionality, the range of item discriminations and guessing on distributions of observed and expected number-right scores for the one- and three-parameter logistic models. She compared the observed and expected distributions graphically and with chi-square and Kolmogorov-Smirnov tests of fit. Advantages of the Hutton study over the earlier Hambleton and Traub study include a sample of 25 tests, an attempt to relate dimensionality and range of item discriminations to degree of fit and the use of LOGIST for parameter estimation, thereby making no assumptions about the underlying distribution of ability.

The degree of unidimensionality was represented by the ratio of eigenvalues between the first and second factors of a principal components factoring of the matrix of tetrachoric correlations for each test. The measure of spread of item discriminations for each test was the percentage of point-biserial item-test correlations falling within a .10 band around the mean of the point-biserials for that test. It proved impossible to find an acceptable estimate of guessing that was independent of the latent trait models.

Hutton found that the three-parameter model fit slightly better on the average than the one-parameter model using the Kolmogorov-Smirnov (K-S) test of fit. Correlations between degree of unidimensionality and K-S fit, ranging from -.555 to -.468, were significant for both the one- and three-parameter models at the .05 significance level. Correlations between degree of equality of item discrimination and K-S fit, ranging from -.238 to -.158, were not significant at the .05 level for either model. Hutton concluded that for estimating ability the one-parameter model produces estimates practically as good as those of the three-parameter model.

Several cautions are in order. First, Hutton used number-right score as a criterion. Number-right score is a sufficient statistic for estimating ability under the one-parameter model but not under the three-parameter model. An appropriately weighted score as a criterion for the three-parameter model probably would have made the model look better. Second, as Hutton observed, similarity of results for ability estimation does not imply similarity of results for other applications such as test development, test equating or investigation of item bias.

Reckase (1979) also investigated the relationship between dimensionality and fit of the one- and three-parameter logistic models. He used five real data sets and five data sets simulated to have particular factor structures. The study was designed to answer three questions (p. 225):

(1) What component of the tests is being measured by the two models?
(2) Does the size of the first factor control the estimation of the parameters of the two models?
(3) What is the relationship between the factor analysis, latent trait and item analysis parameters?

He found that in the case where a test is composed of several independent factors, an uncommon situation in practice, the three-parameter model picks just one factor and differentiates among ability levels on that factor, ignoring the other factors. The one-parameter model, on the other hand, estimates ability based on the sum of the independent factors. In the common situation where there is a large first factor in a test and several smaller factors, both methods measure primarily the first factor to the same extent.

The size of the first factor was found to affect parameter estimates. Reckase obtained good ability estimates even when the first factor accounted for less than 10 percent of the test variance, but found that stable item calibration required that at least 20 percent of the test variance be accounted for by the first factor.

A factor analysis of the correlations between item statistics from factor analysis, latent trait analysis and traditional item analysis resulted in three factors which Reckase called difficulty, discrimination and guessing. He found that the one- and three-parameter difficulty estimates, traditional difficulty and the three-parameter discrimination all loaded on the difficulty factor. The undesirable relationship between latent trait difficulty and discrimination was also found by Lord (1975). The three-parameter discrimination also loaded on the discrimination factor, as did the traditional point-biserial. The one-parameter chi-square item fit statistic loaded with the three-parameter "guessing" parameter estimate. Guessing, not multidimensionality or item discrimination, seemed to most greatly affect the one-parameter tests of fit.

Work like that of Reckase, where there is a systematic attempt to relate consequent conditions to variations in antecedent conditions, is clearly useful. Although there is no assurance that his results will generalize to other tests and testing situations, his study does give tentative relationships and information which can serve to guide others.

The three studies just discussed considered consequent conditions that, although predicted from the models, were not directly related to particular practical testing problems of interest. The next three studies all attempt to compare models for consequent conditions of direct interest for a practical testing problem.

Hambleton and Cook (1978) compare the one-, two- and three-parameter models for rank ordering examinees. The correct rank ordering of examinees is of course relevant in any norm-referenced testing situation. They simulated data, using a three-parameter model, to study the effects of variation in range of item discrimination parameters (0, .62 and 1.24), the average value of lower asymptotes of the item characteristic curves (0 and .25), test length (20 and 40 items), and the shape of the ability distribution (uniform and normal) upon the rank ordering of examinees. All tests were constructed to be unidimensional.

Hambleton and Cook found that the three-parameter model was better in ranking examinees in the lower half of the ability distribution than the one-parameter or number-right model. Correlations of ability estimates for 20 item tests were approximately .08 higher for the three-parameter model. The difference went from approximately .75 to .83 for the uniform distribution of ability and .65 to .73 for the normal distribution of ability. For 40 item tests, correlations were approximately .03 higher for the three-parameter model compared to the one-parameter model or number-right model. The averages were approximately .90 compared to .93 for the uniform distribution and .80 compared to .83 for the normal distribution of ability. Except for lower ability examinees, the number-right score did just about as good a job of ranking examinees as the more complicated ability estimates.

One caveat should be noted here. The rank ordering of a set of examinees on the basis of a set of test items is not a very strong use of latent trait ability estimates. Applications requiring better estimates, for example item calibration, test equating or tailored testing (McKinley & Reckase, 1980), show greater differences.

Douglass, Khavari and Farber (1979) compared classical item analysis procedures with a method based on the Rasch model for constructing a scale for assessing alcoholism. Items were selected from a 49 item total scale. The traditional method consisted of choosing items with item-total correlations greater than .20, resulting in a scale with 29 items. The Rasch selection procedure consisted of rejecting those items that had item chi-square fit probabilities less than .001. Thirty-six items remained in the Rasch scale from the original 49.

The two procedures resulted in somewhat different items being selected for each scale. Twenty-four items were common to both scales. The traditional scale contained five unique items and the Rasch scale contained 12 unique items. Despite the differences in items, the correlations between the two scales with the total scale were not significantly different. The correlations between the two scales with an outside criterion were also not significantly different.

There are several methodological problems that limit the generalizability of the results. First, the limited item pool, 49 items, necessitated a great deal of overlap in items selected by the two methods. The overlap would tend to mask differences. Secondly, the selection of a cut-off level for the item chi-square test of fit was arbitrary and is subject to fluctuation depending on sample size. A better method would be to select the best 30 items using the Rasch method and the best 30 using a traditional method, with both selections from a larger item pool. Third, the correlation between the traditional scale and the total scale should be higher, other conditions being equal, since the items were selected because they had relatively high correlations with the total scale. But fourth, other conditions were not equal, since the Rasch scale contains more items from the original scale, 36 of 49, compared to 29 of 49 for the traditional method. In fact, the number of items made the greater difference: the correlations for the Rasch and traditional scales with the total scale were .914 and .881, respectively. The study is admirable for the emphasis on a practical consequent condition of interest but not for its methodology.

In the most massive comparative study of test equating models ever attempted, Marco, Peterson and Stewart (1980) compare nine linear observed-score models, two curvilinear observed-score models, nine linear true-score models and two curvilinear true-score models. The two curvilinear true-score models are based on the one- and three-parameter logistic item characteristic curve models. The LOGIST computer program was used to obtain parameter estimates for both logistic models. The anchor-test method was used as the design method for all test equatings. The April 1975 and November 1975 administrations of the verbal portions of the Scholastic Aptitude Test (SAT) provided the test data for the study. Equating conditions were varied by having anchor tests internal or external to a test and similar or dissimilar in content, difficulty and length to the test being equated.

The consequent condition investigated was the goodness of test equating. This was investigated by two methods. One method was to equate a test to itself. The conversions from raw to scaled scores should be the same in this case. The other method, where two different tests were equated, was to use the most ideal equating method possible to provide a criterion. This was accomplished by equating the tests based on a single sample of 4,731 examinees who had scores on both tests being equated. Equipercentile equatings using the three-parameter logistic model and observed scores provided two criteria for the second method.

Total error in equating was defined as the standardized weighted mean square difference between the estimated criterion score and the criterion score. The total error is equal to the variance of the difference plus the squared bias. Algebraically,

\sum_j f_j d_j^2 / (n s_t^2) = \sum_j f_j (d_j - \bar{d})^2 / (n s_t^2) + \bar{d}^2 / s_t^2,    (2.1)

where d_j = \hat{t}_j - t_j, \hat{t}_j is the estimated criterion score for raw score X_j, t_j is the criterion score for X_j, \bar{d} = \sum_j f_j d_j / n, s_t is the standard deviation of the criterion scores t_j, f_j is the frequency of X_j, n = \sum_j f_j, and the summation is over the range of X for which extrapolation was not necessary for any of the models studied. The multiplication by f_j serves to weight the error more where more scores actually occur. Division by the variance of the criterion scores, s_t^2, standardizes the error terms. The variance of the difference between the estimated criterion score and the criterion score is a measure of the inconsistency between the estimated and actual criterion. The squared bias is the mean difference between the estimated and actual criterion, squared, and reflects systematic error.
Marco, Peterson and Stewart plot the errors of each equating method for each condition studied on graphs with bias on the horizontal axis, the standard deviation of the difference between the estimated and actual criterion on the vertical axis and total error on parabolic lines. It is thereby easy to compare not only total error for the different methods, but also the two types of error. Several conclusions drawn by Marco, Peterson and Stewart are parti- cularly interesting. 1. Types of examinee samples have relatively little and unsystematic effect on the quality of equating results if the anchor test is similar in content and difficulty to the total test. 2. When a test is equated to a test like itself using an easy or hard anchor and random examinee samples, all of the models have small total error. But when dissimilar examinee samples are used, the one- and three-parameter ICC models are clearly superior. 3. When tests differ considerably in difficulty, linear models and the one-parameter ICC model have very large total errors. 4. The three-parameter ICC model gives better results than any of the other models under extreme or unusual conditions; for example, tests of different difficulty 23 given to dissimilar examinee samples. This result is of course consistent with the invariance property of ICC model parameter estimates. As Marco, Peterson and Stewart point out, the strength of their study is the comprehensiveness of the models studied. However, all com- parisons were made using verbal items from the Scholastic Aptitude Test. The SAT verbal items are relatively homogeneous and difficult for cur- rent examinees. These are conditions which favor meeting the ICC model assumption of unidimensionality but not the one-parameter ICC model assumption of zero lower asymptotes. Under these conditions, the three-parameter ICC model should work fairly well regardless of item and examinee samples. The one-parameter ICC model should work reasonably well if the lower asymptote assumption is not challenged by the data. In other words, if item difficulties and examinee abilities are reason- ably similar, little guessing will occur and problems with the one-para- meter model will not become apparent. 0n the other hand, tests of different difficulty will differently encourage guessing, causing poor results from the one-parameter ICC model. Although there have been several studies comparing latent trait "Kniels for different testing applications, none have focused on a class- "Omun examination system. Most model comparisons used simulated data 0!" conveniently available standardized aptitude test data. The method— OTOQy described in the next chapter is designed to facilitate a practical ComDarison of latent trait models for test equating in a classroom Examination system. CHAPTER III METHOD SAMPLE The test data is taken from an introductory communication course offered at Michigan State University. Most students taking the course are freshmen, some are sophomores and a few are juniors or seniors. The course is divided into sections of approximately fifty to seventy students taught by different instructors. Generally each instructor constructs his/her own mid-term examination. For research purposes the Fall 1979 mid-term was constructed by the course coordinator. All sec- tions are given a common final examination. The final examination is constructed by the course coordinator from an item pool of approximately 600 items. Each final examination consists of 100 four-option multiple choice items. 
Test construction is based on principles derived from classical test theory. Items are preferred with high upper-lower discriminations and with difficulties of approximately .30. Each final examination reflects a balanced coverage of the ten content areas of the course. For a more detailed description of the course, the item pool, the test construction, as well as administrative and instructor training concerns, see Eisenberg and Book (1980).

The data for this study includes the Fall 1978, Winter 1979, Spring 1979 and Fall 1979 final examinations taken by 947, 820, 594 and 1082 examinees respectively. The Spring 1979 examination was chosen to provide the base scale for calibrating abilities and item difficulties. Calibration required that the Spring 1979 examination share items with each of the other three examinations. Between the Spring 1979 examination and the Fall 1978, Winter 1979 and Fall 1979 examinations there were 49, 53 and 43 items in common, respectively.

In order to make the methodology and results of this study easier to assimilate, the overall study is subdivided into three parts. Study 1 corresponds to objectives i, ii, and iii in Chapter I. Study 1 investigates the equating consistency across models for the Fall 1978 and Winter 1979 final examinations and within and across models for the Fall 1979 final examination. For the purpose of investigating model consistency for various size examinee samples, independent random samples of 200, 600 and 800 examinees were selected from the 1082 examinees taking the Fall 1979 examination.

Study 2, corresponding to objective iv in Chapter I, attempts to provide a criterion for comparing models by equating the Fall 1979 final examination to itself using examinee samples differing in relative levels of ability. The Fall 1979 mid-term, rather than the final examination, was used to determine the ability groupings so that group assignment would be independent of errors of measurement on the final examination. This was the only use of the Fall 1979 mid-term examination. The mid-term examinations for Fall 1979 all had 37 items in common. The scores on the 37-item subtest were used to select the samples of students for the Fall 1979 final examination.

Three separate sets of samples were selected. The first set was simply a random split of the 937 examinees for which both mid-term and final examination data were available. The second set consisted of very low and very high ability groups. The groups were formed by placing examinees with scores below the median on the mid-term into the very low group and examinees with scores above the median into the very high group. Examinees at the median were randomly assigned to the two groups. The third set consisted of a low group and a high group designed to represent an ability split that was half of the difference, in original standard deviation units, between the very low and very high ability groups. The probabilities that examinees above and below the median should have of going into each of the groups were determined so that the expected difference in group means would be as desired. These probabilities were used with a list of random numbers to assign examinees to the low and high ability groups.

Study 3, corresponding to objective v in Chapter I, is designed to investigate the effect of different numbers of uncalibrated items upon ability estimates within examinee sections. The 43 items also on the Spring 1979 final were selected from the Fall 1979 final examination to simulate a mid-term length examination. Sixteen intact examinee sections ranging in size from 49 to 66 students provided the examinee sample totaling 917.

MODELS

The three models selected for comparison were the three-, two- and one-parameter, or Rasch, logistic item response models. Each of these models specifies a particular mathematical relationship between person ability and the probability of a correct response to a given item (see Hambleton and Cook, 1977, for a good introduction to ICC models). The three-parameter model can be written:

P_g(\theta, a_g, b_g, c_g) = c_g + (1 - c_g)[1 + e^{-1.7 a_g(\theta - b_g)}]^{-1},    (3.1)

where P_g(\theta, a_g, b_g, c_g) is the probability of a correct response to item g by a person of ability \theta; c_g is the probability of examinees of extremely low ability giving a correct response to item g, i.e., the lower asymptote of the ICC; b_g is the point on the ability scale where the slope of the ICC is maximum and is usually called the item's difficulty; a_g is proportional to the slope of P_g at \theta = b_g and is called the item's discrimination; and e is the base of the natural logarithms, approximately equal to 2.718.

The two-parameter model is identical to the three-parameter model when all of the lower asymptotes (c_g) are fixed at a constant, usually zero. The two-parameter model when c_g = 0 can be written:

P_g(\theta, a_g, b_g) = [1 + e^{-1.7 a_g(\theta - b_g)}]^{-1},    (3.2)

where the notation is defined above. If the model is further simplified by substituting \bar{a}, a common value for the item discriminations in a test, for a_g, the one-parameter model is obtained:

P_g(\theta, b_g) = [1 + e^{-1.7 \bar{a}(\theta - b_g)}]^{-1}.    (3.3)

A version of the one-parameter logistic model was developed independently by Rasch (1960). This version, called the Rasch model, is usually written with the constant 1.7\bar{a} incorporated into the \theta (and b) scale and can be written:

P_g(\theta', b_g') = [1 + e^{-(\theta' - b_g')}]^{-1}.    (3.4)

ESTIMATION OF MODEL PARAMETERS

The computer program LOGIST (Wood, Wingersky and Lord, 1976) was used to estimate ability and item parameters for the two- and three-parameter logistic models. Although LOGIST can be used to estimate parameters for the one-parameter model, BICAL (Wright, Mead, and Bell, 1979) was used. One trivial difference between the programs is that the estimates for LOGIST are based on equations (3.1) - (3.3) while BICAL is based on equation (3.4). More importantly, BICAL is much easier to use than LOGIST and has many convenient and useful features not found in LOGIST.

SCALING MODEL PARAMETER ESTIMATES

The degree of uniqueness of the person ability and item parameter scales is determined by the item response models. One critical issue is the extent to which the parameters of a model are determined. For the three-parameter model, consider a single item, g, where P_{g1}(\theta_1, a_1, b_1, c_1) represents one scaling and P_{g2}(\theta_2, a_2, b_2, c_2) represents another scaling. We will consider what P_{g1} = P_{g2} implies about the relationship between the two scales. Notice immediately from equation (3.1) that the minima of the P_g's occur as their respective \theta's approach negative infinity. The minimum of P_{g1} is c_1 and the minimum of P_{g2} is c_2, but we are considering the same item, so c_1 = c_2. In other words, the lower asymptotes of the item characteristic curves are completely determined. Now consider the case where P_{g1} and P_{g2} equal a constant greater than c_g.
ESTIMATION OF MODEL PARAMETERS

The computer program LOGIST (Wood, Wingersky and Lord, 1976) was used to estimate ability and item parameters for the two- and three-parameter logistic models. Although LOGIST can be used to estimate parameters for the one-parameter model, BICAL (Wright, Mead, and Bell, 1979) was used. One trivial difference between the programs is that the estimates for LOGIST are based on equations (3.1) - (3.3) while BICAL is based on equation (3.4). More importantly, BICAL is much easier to use than LOGIST and has many convenient and useful features not found in LOGIST.

SCALING MODEL PARAMETER ESTIMATES

The degree of uniqueness of the person ability and item parameter scales is determined by the item response models. One critical issue is the extent to which the parameters of a model are determined. For the three-parameter model, consider a single item, g, where P_{g1}(\theta_1, a_1, b_1, c_1) represents one scaling and P_{g2}(\theta_2, a_2, b_2, c_2) represents another scaling. We will consider what P_{g1} = P_{g2} implies about the relationship between the two scales. Notice immediately from equation (3.1) that the minima of the P_g's occur as their respective \theta's approach negative infinity. The minimum of P_{g1} is c_1, and the minimum of P_{g2} is c_2, but we are considering the same item, so c_1 = c_2. In other words, the lower asymptotes of the item characteristic curves are completely determined.

Now consider the case where P_{g1} and P_{g2} equal a constant greater than c_g. For the probabilities to be equal, it can be seen from (3.1) and the fact that c_1 = c_2 that

a_1 (\theta_1 - b_1) = a_2 (\theta_2 - b_2),          (3.5)

or

\theta_1 - b_1 = (a_2 / a_1)(\theta_2 - b_2).          (3.6)

But we are considering a single item with P_{g1} = P_{g2}, so a_1, a_2, b_1 and b_2 are all constants. Therefore there exists a constant K equal to each side of equation (3.6). We can write

\theta_1 = K + (a_2 / a_1)\,\theta_2          (3.7)

and

b_1 = K + (a_2 / a_1)\,b_2.          (3.8)

Notice that in the Rasch model (3.7) and (3.8) reduce to

\theta_1' = K + \theta_2'          (3.9)

and

b_1' = K + b_2',          (3.10)

since the a's are not modeled.

The LOGIST program solves the problem of the indeterminacy in the person ability and item difficulty scales by setting the average person ability to approximately zero and the standard deviation of ability to approximately one on any given computer run. Therefore, two separate LOGIST runs with different groups of examinees will result in scales that are linear transformations of one another. In this study we have different examinees but common items across tests, so equation (3.8) is the appropriate one to examine, where now b_1 is on the scale defined by one test and b_2 is on the scale defined by another test. Calibration requires estimation of the transformation represented by (3.8). This can be done in several ways. Perhaps the most obvious way is to use the item difficulty estimates to estimate the parameters of the regression equation

b_1 = \alpha + \beta b_2,          (3.11)

where we have as many pairs (\hat{b}_1, \hat{b}_2) as we have items in common between the two tests. Unfortunately, this is not a very good way to estimate the parameters of the transformation equation (Lord, personal communication, April 1980). The problem is that two inconsistent transformation equations will result depending on which test is regressed on the other. The magnitude of the discrepancy will depend on the correlation between the two sets of difficulty estimates. Identical equations result only if the correlation is one. The equations are increasingly different as the correlation departs from one.

One method (Warm, 1978, pp. 113-116) which avoids this problem is to let

\hat{\alpha} = \bar{b}_1 - (s_{b_1} / s_{b_2}) \bar{b}_2          (3.12)

and

\hat{\beta} = s_{b_1} / s_{b_2},          (3.13)

where \bar{b}_1 is the mean of the \hat{b}_1's, \bar{b}_2 is the mean of the \hat{b}_2's, s_{b_1} is the standard deviation of the \hat{b}_1's, and s_{b_2} is the standard deviation of the \hat{b}_2's. Substituting \hat{\alpha} and \hat{\beta} for \alpha and \beta in equation (3.11) provides a consistent transformation regardless of which test is chosen to define a base scale. Notice that the transformation based on (3.12) and (3.13) can be thought of as a regression equation where the correlation between the b's has been replaced by one. Once \hat{\beta} has been determined, the item discriminations are transformed by the equation obtained by substituting \hat{\beta} for \beta in the equation

a_1 = a_2 / \beta.          (3.14)

An alternative method of estimating \alpha and \beta, developed by the author, uses both the item discriminations and difficulties. This method was obtained by comparing equations (3.8) and (3.11) and noticing that they are identical if

K = \alpha  and  a_2 / a_1 = \beta.          (3.15)

Then a reasonable estimate of \beta is the average of the ratios \hat{a}_2 / \hat{a}_1. Here there are as many pairs (\hat{a}_1, \hat{a}_2) as there are items in common between the two tests. An estimate of \alpha can be obtained by substituting \hat{\beta} for \beta, the mean of the \hat{b}_1's for b_1 and the mean of the \hat{b}_2's for b_2 in equation (3.11) and solving for \alpha. Each method will result in a generally different linear equation for converting abilities and difficulties from the second scale to the first.
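A minimal sketch of the two estimators just described follows; it implements the mean and standard deviation method of equations (3.12) and (3.13) and the discrimination ratio method based on (3.15), the procedures referred to in Chapter IV as the B Method and the A Method. The function names and the use of plain Python lists are assumptions made for the illustration.

```python
from statistics import mean, pstdev

def mean_sigma_calibration(b1, b2):
    """B Method: alpha and beta from the means and standard deviations of
    the common items' difficulty estimates, equations (3.12) and (3.13)."""
    beta = pstdev(b1) / pstdev(b2)
    alpha = mean(b1) - beta * mean(b2)
    return alpha, beta

def ratio_calibration(b1, b2, a1, a2):
    """A Method: beta as the average ratio of the common items'
    discriminations (3.15); alpha solved from (3.11) at the mean difficulties."""
    beta = mean(d2 / d1 for d1, d2 in zip(a1, a2))
    alpha = mean(b1) - beta * mean(b2)
    return alpha, beta

def to_base_scale(b2, a2, alpha, beta):
    """Convert scale-2 estimates to the base scale: b1 = alpha + beta * b2
    and, by equation (3.14), a1 = a2 / beta."""
    return [alpha + beta * b for b in b2], [a / beta for a in a2]
```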
Other methods could be developed that also use both the difficulties and discriminations simultaneously, but the best method of combining the information contained in the discriminations and difficulties is not obvious. Lord (personal communication, April 1980) is currently working on robust estimators for the transformation equation.

The BICAL computer program solves the indeterminacy of the Rasch scale by setting the mean difficulty of the items in any given test to zero. Substituting the mean of the \hat{b}_1''s for b_1' and the mean of the \hat{b}_2''s for b_2' and solving for K in equation (3.10) results in

\hat{K} = \bar{b}_1' - \bar{b}_2'.          (3.16)

This estimated value of K can simply be added to each item difficulty on the second scale to place all the items on a common scale.

EQUATING

Once scaling has been accomplished, test equating is straightforward. If ability scores are being compared, then the same transformation that was applied to item difficulties can be applied to abilities to form ability measures on the same scale. If raw-score equating is desired, then a further complication is introduced; it is necessary to set up a correspondence between raw scores on different tests through the ability estimates. One method of equating uses true-score equating applied to raw scores. The true score on a test for a given ability is simply the expected value of the observed score at that ability level. This is simply the sum of the item characteristic curves at that ability, i.e.,

T(\theta) = E(X \mid \theta) = \sum_g P_g(\theta),          (3.17)

where the summation is over all the items in a test. Equating is simply a matter of tabling or graphing the relation of \sum_g \hat{P}_g(\theta) on one test to that of \sum_g \hat{P}_g(\theta) on a second test for many fixed values of \theta, where \hat{P}_g(\theta) = P_g(\theta, \hat{a}_g, \hat{b}_g, \hat{c}_g) (Lord, 1977).
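The true-score step in equation (3.17) amounts to tabulating two test characteristic curves over a grid of abilities and reading raw-score correspondences from the table. The sketch below does this for the Rasch model; the grid limits and spacing are assumptions for the illustration, and both difficulty lists are presumed to be on the common scale already.

```python
import math

def rasch_true_score(theta, difficulties):
    """Equation (3.17): expected raw score at theta as the sum of Rasch ICCs."""
    return sum(1.0 / (1.0 + math.exp(-(theta - b))) for b in difficulties)

def true_score_equating_table(b_test1, b_test2, lo=-4.0, hi=4.0, step=0.1):
    """Tabulate (theta, true score on test 1, true score on test 2) at many
    fixed abilities; a raw-score to raw-score correspondence can then be
    read or interpolated from the rows."""
    rows = []
    theta = lo
    while theta <= hi + 1e-9:
        rows.append((round(theta, 2),
                     rasch_true_score(theta, b_test1),
                     rasch_true_score(theta, b_test2)))
        theta += step
    return rows
```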
PROCEDURE

Separate estimates for each of the samples mentioned above were obtained for the Rasch model using BICAL. For Studies 1 and 2, examinees with less than 40 of 100 items correct or with person t fits above 2.00 (see Wright, Mead, and Bell, 1979, p. 15) were deleted from the item calibration sample. From 1/2% to 5% of the examinees in a given sample were removed under these criteria. It was felt that persons with scores of less than 40 were likely to be guessing on many items and thereby adding error to item calibrations. Persons with large t fit values were removed since they were demonstrating unusual response patterns which could adversely affect item parameter estimation. For the same reasons, in Study 3 examinees with less than 17 of the 43 items correct or person t fits above 2.00 were removed from the calibration sample.

In Study 1 all item difficulties were transformed to the scale defined by the Spring 1979 examination using equations (3.10) and (3.16). The person abilities were then transformed to a standard scale designed so that the mean ability of the Spring 1979 examinees is 500 and the standard deviation is 100. The transformation equation is

S = \frac{\hat{\theta} - 1.04}{.60}\,(100) + 500,          (3.18)

where S is the scaled score, \hat{\theta} is the estimated ability on the Spring 1979 scale, 1.04 is the person ability mean on the Spring 1979 scale, and .60 is the ability standard deviation on the Spring 1979 scale.

Separate estimates for each sample in Studies 1 and 2 were obtained for the two-parameter model using LOGIST. Person abilities and item difficulties and discriminations for all samples in Study 1 were transformed to the scale determined by the Spring 1979 examination. The method using item difficulties and the average ratio method using item discriminations and average difficulties were used to estimate \alpha and \beta in equation (3.11). This resulted in two different transformations for the two-parameter model. The additional transformation

S = 100\,\hat{\theta} + 500

was applied to the scaled estimates to facilitate comparison between the Rasch and two-parameter models.

LOGIST was used to obtain separate estimates for the three-parameter model on the Fall 1978, Winter 1979, Spring 1979 and Fall 1979 final examination total samples. No calibrations were done.

For Study 2, the three pairs of ability samples taken from the Fall 1979 final examination were used to equate the Fall 1979 examination to itself. For each of the three pairs, one sample in each pair was calibrated to the other using only those 43 items that were used to calibrate the Fall 1979 examination to the Spring 1979 scale. The calibrated abilities were used for true-score to true-score equating. The two methods of item calibration for the two-parameter model provided the basis for two methods of equating, and the Rasch model provided a third.

For Study 3, BICAL was used to obtain Rasch item difficulties and ability estimates using the 43 items selected from the Fall 1979 final and the entire group of Fall 1979 examinees for which section membership was known. Separate BICAL computer runs were also made for each of the 16 sections for the 43-item subtest. Average item difficulties were calculated for subsets of 7, 12, 17, 22, 27, 32 and 37 items from the 43 total items. The averages were calculated for the total group and for each section. This allowed comparisons of calibration constants, and hence ability estimates, acting as though there were different numbers of calibrated items within the total group of 43 items.
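Because the Study 3 calibrating links rest on equation (3.16), a link constant for a section can be computed as the difference between the mean difficulty of a common-item subset in the total-group run and that same subset's mean difficulty in the section's own run. The sketch below shows this computation; the dictionary-based bookkeeping and the function name are implementation choices for the illustration, not features of BICAL.

```python
def link_constant(total_difficulties, section_difficulties, common_items):
    """Equation (3.16) applied to a subset of calibrating items.

    Both mappings give Rasch difficulty estimates (item id -> difficulty)
    from separate BICAL runs.  The returned constant is added to the
    section's difficulties and abilities to place them on the total-group
    scale."""
    total = [total_difficulties[i] for i in common_items]
    section = [section_difficulties[i] for i in common_items]
    return sum(total) / len(total) - sum(section) / len(section)

# In Study 3 this constant would be recomputed for subsets of 7, 12, 17,
# 22, 27, 32 and 37 of the 43 items to see how the link behaves as fewer
# calibrated items are available.
```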
CHAPTER IV

RESULTS

The results will be presented in four sections with separate discussions for each study. The first section contains descriptive data for the various tests and examinee samples. The next three sections present the results of parameter estimation, item calibration and test equating.

DESCRIPTIVE DATA

Table 4.1 contains the basic descriptive information for Study 1. For each final examination and examinee sample, the number of examinees and the test mean, standard deviation and KR-20 reliability are reported. In Table 4.1 all test statistics are based on simple number-right scoring. The means for the three random samples of the Fall 1979 examination range from 68.48 to 69.01 and the standard deviations range from 13.14 to 13.56.

TABLE 4.1
DESCRIPTIVE DATA FOR STUDY 1

Term          Sample    # Examinees   Raw Mean   Raw S.D.   KR-20
Fall 1978     Total          947        69.26      12.95      .89
Winter 1979   Total          820        68.83      12.05      .87
Spring 1979   Total          594        70.62      11.62      .87
Fall 1979     Total         1082        68.67      13.37      .90
Fall 1979     Random         200        69.01      13.14      .89
Fall 1979     Random         600        68.56      13.56      .90
Fall 1979     Random         800        68.48      13.30      .90

Table 4.2 contains the same descriptive information for the Study 2 samples. The means range from 61.66 for the very low ability sample to 76.65 for the very high ability sample. The standard deviation of the very high ability sample is smaller than the others due to a ceiling effect on the final examination. Based on the total Fall 1979 examinee sample standard deviation of 13.37, the two random samples differ by only .06 standard deviation units. The low and high examinee ability samples differ by .64 units and the very low and very high samples differ by 1.12 units.

TABLE 4.2
DESCRIPTIVE DATA FOR STUDY 2

Term        Sample       # Examinees   Raw Mean   Raw S.D.   KR-20
Fall 1979   Random            481        69.49      13.19      .90
Fall 1979   Random            456        68.72      12.82      .89
Fall 1979   Very Low          471        61.66      11.84      .85
Fall 1979   Low               461        64.76      12.77      .88
Fall 1979   High              476        73.34      11.81      .88
Fall 1979   Very High         466        76.65       9.28      .82

The descriptive data for Study 3 are presented in Table 4.3. The average Rasch ability (\hat{\theta}) for the entire sample is 1.11 with a standard deviation of .67. The BICAL program automatically sets the average difficulty of the items in any given test to zero. It can be seen that the average person ability is considerably higher than the average item difficulty. The examinee section means range from .87 to 1.44. Standard deviations range from .52 to .85. Based on the overall standard deviation of .67, the two most extreme sections differ by an average ability of .85 standard deviation units.

TABLE 4.3
DESCRIPTIVE DATA FOR STUDY 3

Section Number   Mean \hat{\theta}   S.D. \hat{\theta}   # Examinees
      1               1.02               .65                 60
      2               1.25               .54                 54
      3               1.10               .67                 65
      4               1.27               .60                 49
      5               1.07               .63                 60
      6                .87               .56                 57
      7               1.23               .61                 57
      8                .98               .76                 57
      9               1.16               .75                 56
     10               1.40               .85                 55
     11                .95               .66                 54
     12               1.29               .81                 55
     13               1.11               .68                 57
     14               1.16               .52                 53
     15               1.44               .72                 66
     16               1.14               .68                 62
    All               1.11               .67                917
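The KR-20 reliabilities reported in Tables 4.1 and 4.2 follow the standard Kuder-Richardson formula. A minimal computation from a 0/1 response matrix is sketched below; the list-of-lists representation of the responses is an assumption made for the example.

```python
def kr20(responses):
    """KR-20 reliability for dichotomous (0/1) item scores.

    `responses` is a list of examinee records, each a list of 0/1 scores on
    the same k items.  KR-20 = (k / (k - 1)) * (1 - sum(p * q) / var(total)),
    where the variance is taken over examinees' number-right scores."""
    n = len(responses)
    k = len(responses[0])
    totals = [sum(person) for person in responses]
    mean_total = sum(totals) / n
    var_total = sum((t - mean_total) ** 2 for t in totals) / n
    pq = 0.0
    for item in range(k):
        p = sum(person[item] for person in responses) / n
        pq += p * (1.0 - p)
    return (k / (k - 1)) * (1.0 - pq / var_total)
```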
Tables 4.4 and 4.5 also contain the difficulty estimates for the lowest and highest of the 100 items in each test sample. Notice that item parameter estimates obtained from LOGIST can obtain absurdly low or high values. Such items are essentially useless for measurement in the present situation. Many have extremely flat ICC slopes leading to arbitrary and unstable difficulty estimates. Rasch model estimates obtained from BICAL were all plausible. How- ever, the item tests of fit provided by BICAL are highly dependent on sample size. For example, for the examinee samples of 200, 600, 800 and 1082 from the Fall 1979 examination, 8, 13, 25, and 29 items, 41 TABLE 4.4 A TWO-PARAMETER ITEM DIFFICULTY (b) STATISTICS FOR STUDY 1 Tenn Fall 1978 Winter 1979 Spring 1979 Fall 1979 Fall 1979 Fall 1979 Fall 1979 Term Fall 1979 Fall 1979 Fall 1979 Fall 1979 Fall 1979 Fall 1979 Sample # Examinees fin Range -3 to +2 Low 5 High 5 Total 947 89 -13.12 31.16 Total 820 84 -34.97 2.48 Total 594 88 -94.54 3.36 Total 1082 94 - 6.05 1.46 Random 200 92 -19.25 1.31 Random 600 96 - 5.57 2.61 Random 800 92 -11.36 2.20 TABLE 4.5 TWO-PARAMETER ITEM DIFFICULTY (5) STATISTICS FOR STUDY 2 # Items with 6's A A Sample # Examinees in Range -3 to +2 Low b Higp_p Random 481 88 - 5.67 1.86 Random 456 90 - 11.75 1.47 Very Low 471 90 - 5.49 4.64 Low 461 88 - 9.95 2.84 High 476 87 - 92.52 1.15 Very High 466 64 —120.21 .93 A Items with b's 42 respectively, were identified as having an overall t fit greater than 2.0. It is clear that the fit statistics are more appropriate for item selection than for testing model fit unless some means of taking sample size into account is used. In conclusion, estimates were successfully obtained for the Rasch and two-parameter models. The next two sections on calibration and equating provide information on the consistency, accuracy and usefulness of the estimates. CALIBRATION Rasch Model For the Rasch model, the Study 1 samples were separately calibrated to the Spring 1979 scale by the method described in Chapter III. Table 4.6 presents the calibrating constants, residual and standardized residual statistics and the numbers of items in the calibrating samples. The residual is simply the difference for the common items between the actual Spring 1979 item difficulties and the item difficulties cali— brated from another test scale. The residual mean is necessarily zero. A totally error-free calibration would result iniaresidual standard deviation of zero. The standardized residual is simply a residual divided by the stan- dard error of that residual. The standard error of a residual is - 2 + 2 8 SR ' (51 52) (4.1) 2 2 where 51 is the error variance for an item in test 1 and $2 is the error variance for the same item in test 2. In general, different items will have different residual standard errors. 43 ”mm Ammv mpoom mnmfi maveam may op Amy mpmom oco scam mmuoswpmm xupauwwva so»? msgommcoeu cows: cowpmacm use mo snow mgh me ou.~ mo.- om. mug. cow soucom me mm.~ mo.- mm. onfi. oou Eczema ma mw.H mo.- um. muH. com Eoucom me mm.~ mo.- um. NNH. NmoH pogo» mm om.~ No. Hm. HmH. omm pouch me Ho.m mo. mm. Nno. Nam Pope» msmufi .o.m com: .o.m x mmmcwsmxu * «PmEmm mcwpmen Fosuwmmm _o=uwmmm Poauwmmm -wpmo * um~w ewNw -ugoucmpm -ugoccoam a >oshm mom zo~pozhm «Om zo~h mnmfi Fpou 3o; mNmH Fpod soucom mum“ Ppou mFmEmm acme 46 consistency compared to the other samples. The effects of calibra- tion and item difficulty estimate error due to different examinee ability samples will be seen in the test equating section. 
Two-Parameter Model

For the two-parameter model, both methods of calibration previously described were separately applied to the various samples in Studies 1 and 2. As with the Rasch model, the Spring 1979 examination provided the common scale for the Study 1 samples. Table 4.8 contains the calibrating constants estimated from the item difficulties method (B Method) and from the ratio of item discriminations method (A Method) for Study 1. It also contains correlations between the difficulty estimates and numbers of common items between Spring 1979 and each of the other examinations used for calibration. A comparison of Tables 4.6 and 4.8 reveals that three items in the Fall 1978 examination and two items in the Winter 1979 examination were not used for calibration. Scatterplots of item difficulties estimated for the common items revealed that all five items were clearly outliers. All five greatly affected the calibration equations, so they were removed and the calibration equations were recalculated. Figures A1 to A6 in the Appendix are scatterplots of the difficulty estimates for the items actually used for calibration in Study 1. Table A1 in the Appendix contains the difficulty estimates for the items that were removed.

The potential effect of outliers on item calibration can be seen in Table 4.8. The random sample of 200 resulted in one outlying item (see Table A1). Notice that the calibrations are quite different with just the one outlier removed. Also notice that the correlation between the two sets of difficulty estimates improves when the outlier is removed.

[Tables 4.8 through 4.12 and the TEST EQUATING section of this chapter report the two-parameter calibration constants, difficulty-estimate correlations and numbers of calibrating items for the Study 1 and Study 2 samples, and the true-score equating results for Studies 1 and 2, including comparisons of equating consistency across models and examinee sample sizes. These landscape pages are not legible in this copy.]
[Figures 4.11 through 4.16: plots of expected raw scores for the equatings of the Fall 1979 examination to itself (random halves, low to high, and very low to very high samples) under the Rasch and two-parameter models. The plots are not legible in this copy.]

Ability estimates within sections can be in error both through differences in the item difficulty estimates across calibration samples and by incorrect calibration of abilities across tests. The first source of error is investigated in Table 4.13 and the second in Table 4.14.

Table 4.13 contains Rasch ability estimates corresponding to raw scores based on the 43-item subtest selected from the Fall 1979 examination. Estimates are presented based on examinees taking the Spring 1979 and Fall 1979 examinations. Estimates for the two class sections, taking the Fall 1979 examination, that have the most extreme ability estimates compared to the other 14 sections are also presented. The standard deviations of ability are taken from the Fall 1979 sample but differ from the other samples by less than .05 at every ability level. The discrepancy between the estimates based on the Spring 1979 and Fall 1979 samples varies from .00 in the middle of the test score range to .19 at a raw score of one. No discrepancy in Table 4.13 is greater than .40 standard errors.

TABLE 4.13
RASCH ABILITY ESTIMATES FOR STUDY 3

Raw Score   Spring 1979   Fall 1979   Section 4 (Fall 1979)   Section 15 (Fall 1979)   S.E.
   42           3.95         3.77            4.02                    3.76              1.02
   41           3.23         3.07            3.30                    3.05               .73
   40           2.79         2.64            2.86                    2.63               .60
   39           2.47         2.33            2.53                    2.32               .53
   38           2.21         2.08            2.27                    2.07               .48
   37           1.99         1.87            2.05                    1.86               .45
   36           1.80         1.69            1.85                    1.68               .42
   35           1.62         1.53            1.68                    1.51               .40
   34           1.47         1.38            1.52                    1.37               .38
   33           1.32         1.24            1.37                    1.23               .37
   32           1.19         1.11            1.23                    1.10               .36
   31           1.06          .99            1.10                     .98               .35
   30            .93          .87             .97                     .86               .34
   29            .82          .76             .85                     .75               .33
   28            .70          .65             .74                     .65               .33
   27            .59          .55             .62                     .54               .33
   26            .48          .45             .51                     .44               .32
   25            .38          .35             .40                     .34               .32
   24            .27          .25             .30                     .25               .32
   23            .17          .15             .19                     .15               .32
   22            .06          .05             .08                     .05               .32
   21           -.04         -.04            -.02                    -.04               .32
   20           -.14         -.14            -.13                    -.14               .32
   19           -.25         -.24            -.24                    -.24               .32
   18           -.35         -.34            -.35                    -.34               .32
   17           -.46         -.44            -.46                    -.44               .32
   16           -.57         -.54            -.57                    -.54               .33
   15           -.68         -.65            -.69                    -.64               .33
   14           -.80         -.75            -.81                    -.75               .34
   13           -.92         -.87            -.93                    -.86               .34
   12          -1.05         -.98           -1.06                    -.98               .35
   11          -1.18        -1.11           -1.20                   -1.10               .36
   10          -1.32        -1.24           -1.35                   -1.23               .37
    9          -1.47        -1.38           -1.50                   -1.37               .38
    8          -1.63        -1.53           -1.67                   -1.52               .40
    7          -1.80        -1.69           -1.86                   -1.68               .42
    6          -2.00        -1.88           -2.07                   -1.86               .45
    5          -2.23        -2.09           -2.31                   -2.07               .48
    4          -2.49        -2.34           -2.59                   -2.32               .53
    3          -2.82        -2.66           -2.94                   -2.64               .61
    2          -3.26        -3.09           -3.41                   -3.06               .74
    1          -3.99        -3.80           -4.17                   -3.77              1.02

(Each column gives the estimated ability corresponding to the raw score under the calibration sample named in the heading; S.E. is the standard error of the ability estimate.)

Table 4.14 contains the mean, standard deviation, maximum negative and maximum calibration constant error (E_k) across the 16 sections for Study 3. The criteria for determining the correct calibration constant are the calibration based on all 43 items and either the scale determined by the 43 items on the Fall 1979 examinee sample or the 100 items on the Spring 1979 examinee sample. The Spring 1979 examination sample provides the more realistic criterion.

TABLE 4.14
RASCH ABILITY ESTIMATE ERROR DUE TO CALIBRATION FOR STUDY 3

Comparison Scale (Term, # Items)   # Calibrating Items   Average E_k   S.D. E_k   Max. Negative E_k   Max. E_k
Fall 1979, 43                              37               -.002        .036          -.050             .061
Fall 1979, 43                              32                .002        .043          -.076             .077
Fall 1979, 43                              27               -.002        .074          -.128             .128
Fall 1979, 43                              22               -.007        .096          -.175             .188
Fall 1979, 43                              17               -.013        .120          -.316             .154
Fall 1979, 43                              12               -.006        .146          -.358             .170
Fall 1979, 43                               7               -.023        .233          -.570             .330
Spring 1979, 100                           37               -.119        .036          -.167            -.056
Spring 1979, 100                           32               -.076        .043          -.154            -.001
Spring 1979, 100                           27               -.120        .074          -.246             .010
Spring 1979, 100                           22               -.138        .096          -.306             .057
Spring 1979, 100                           17               -.192        .120          -.495            -.025
Spring 1979, 100                           12               -.149        .146          -.501             .027
Spring 1979, 100                            7               -.068        .233          -.615             .285

Notice from Table 4.14 that there is a systematic error in calibration constants. Using the Spring 1979 criterion, the average calibration error across sections ranges from -.068 to -.192. Even with only six of 43 items removed from calibration, there is an average error of -.119 and a maximum error of -.167. Assuming a person ability standard deviation of .67, on the average persons in all the sections would be assigned ability estimates .18 standard deviation units too low, and for the most discrepant section ability estimates would be .25 standard deviations too low.
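Written out for the Spring 1979 criterion with 37 calibrating items, the conversion from calibration error to ability bias is simply a rescaling by the ability standard deviation (here \bar{E}_k denotes the tabled average error and s_{\hat{\theta}} = .67 the ability standard deviation; the symbols are named only for this check):

\frac{\bar{E}_k}{s_{\hat{\theta}}} = \frac{-.119}{.67} \approx -.18, \qquad \frac{-.167}{.67} \approx -.25 .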
Descriptive data on the examinations and examinee samples and parameter estimation information were presented in the first two sections of this chapter. Calibration statistics and their impact on test equating and ability estimation comprised the remainder of the research results. The next and final chapter summarizes the study, draws conclusions from the findings and presents recommendations based on the findings for classroom examination systems.

CHAPTER V

SUMMARY, CONCLUSIONS AND RECOMMENDATIONS

SUMMARY

Item characteristic curve models offer the potential of producing a measurement of a person's ability or achievement that does not depend on the particular sample of items in a test or on the particular examinees upon which test analysis is based. There has been much research effort devoted to ICC models using simulated or aptitude test data. Perhaps because of the apparent multidimensionality of achievement tests, there has been very little research focused on tests of academic achievement. If ICC models are to be of any use in classroom testing, more ICC research must be focused on achievement testing in general and on classroom achievement testing in particular.
The present study is designed to provide methods and results relevant to the introduction of item characteristic curve models into classroom achievement testing. The overall objective is to compare several common ICC models for item calibration and test equating in a classroom examination system.

Parameters for the one-, two- and three-parameter logistic ICC models were estimated for 100-item final examinations given over the course of four academic terms. The three-parameter model provided unacceptable lower asymptotes for this data set and was given no further consideration. One of the four tests was chosen to provide a common scale to which items from the other examinations were calibrated using the one- and two-parameter models. Random samples of 200, 600 and 800 examinees were selected from a total sample of 1082 examinees to facilitate the investigation of the impact of examinee sample size upon item calibration and test equating. Rasch calibration and two methods of two-parameter model calibration were investigated. The calibrations based on the examinee sample of 200 and the entire set of items in the calibration section were quite divergent across methods and also diverged from the calibrations based on the other samples.

The calibration constants estimated from the Rasch model and the two methods using the two-parameter model were used to equate the examinations to a common scale. The Rasch model equatings were very consistent across all sample sizes investigated. The two-parameter methods provided less consistent equatings than those provided by the Rasch model. Consistency of equating provides only indirect evidence of equating accuracy. To provide a criterion for equating, three separate sets of examinee samples were selected from the 1082 examinees in the largest total sample. The three sets represented random, high-low and very high-very low examinee ability differences. Although none of the three methods was uniformly best, the Rasch model provided the most acceptable equating in general.

The stability of Rasch calibrating links, and hence ability estimation, based on examinee class sections was investigated for a 43-item test taken from a total test of 100 items. Calibration links were calculated for subsets of the 43 items ranging in size from 7 to 37 in increments of five items. Even with 37 items in the calibrating link, it was found that there was an overall bias in ability estimation of approximately 1/5 of a standard deviation of ability. For the most extreme group, there was a bias of 1/4 of a standard deviation using 37 items.

CONCLUSIONS AND DISCUSSION

The previous section provided a brief review of the problem, methods and results. This section presents some conclusions that can be drawn from the results and is followed by a final section containing recommendations for further research.

Parameter Estimation

There are two reasonable explanations for the inadequacy of lower asymptote estimates for the three-parameter model. First, for the four examinations studied, there is very little useful information available for estimating the lower asymptotes. In order to accurately estimate the probability of very low ability examinees answering a question correctly, it is necessary to have some very low ability examinees in the sample. The examinations had means and standard deviations in the neighborhood of 69 and 13, respectively. Obviously there are not very many examinees at all close to a chance score of 25.
The second explanation is that the problem of adequately estimating lower asymptotes has not been solved. However, newer versions of LOGIST are using maximum likelihood estimation procedures that should give better estimates.

Unless there are very low scoring persons in an examinee sample, it is probably futile to attempt to estimate parameters for the three-parameter model. For common classroom achievement tests this is unlikely. Several alternatives are available. One is to give the test as a pretest to a group of examinees and combine their responses with those of an ordinary group. A second alternative is to set the lower asymptotes for all items at a reasonable value, for example .05 below chance for well-written items. Further research is needed to identify the models for which parameters can be estimated well under different conditions.

Calibration

Problems still remain with calibration methods for the two-parameter model. One undesirable property of both methods used in this study is that it is necessary to remove some of the items from the calibrating sample before determining a calibration equation. The removal of these outliers is a subjective and somewhat arbitrary task. The removal or retention of even one item can make a meaningful difference in the calibration equation. Lord (personal communication, April 1980) is pursuing a promising technique using robust estimators. He is attempting to estimate the parameters of calibration equations by using estimates that are little affected by outliers and yet have high efficiency in a variety of situations.

Neither of the two methods considered in this study for calibrating two-parameter estimates provided more consistent results than the other across different size samples. It should be noted that goodness of calibration depends on the degree of linearity between item difficulty estimates for common items across tests. The correlations for the tests in this study were not impressive. For the total examinee samples they ranged from .57 to .87 (Table 4.8). The higher coefficients were obtained only after outlying items had been discarded. The lack of a strong linear relationship may limit the accuracy of any calibration technique.

The Rasch calibrations in general displayed remarkable consistency across types and sizes of examinee samples taken from a single test
Test Equating and Ability Estimation True-score test equating depends upon parameter estimates and item calibration, so it is not surprising to find that the equating results mirror the calibration findings. The Rasch equatings across the various sample sizes were extremely consistent. Both of the two-parameter cali- bration methods resulted in reasonably consistent equatings for samples of 600 and 800 compared to 1082 and very divergent equatings for samples of 200 when all 43 items were used for calibration. The consistency of the Rasch model, even with examinee samples of 200 and all items used for calibration, coupled with its generally supe- rior equatings of a test with itself, provides support for its adoption. However, two caveats need to be made. The first is that in Study 3 the 83 Rasch calibrations were consistent but incorrect. The second caveat is that equating a test to itself seems somewhat biased in favor of a model with less parameters (Marco, Peterson and Stewart, 1980). True-score equating is accomplished by setting up a correspondence between the sums of the ICC's taken from two tests at a number of different ability points. When a test is equated 'to itself the proper equating will occur if the sums of the ICC's are the same at every point. Since the items are the same across the tests when a test is equated to itself, any consistent assignment of parameter estimates across the two samples will result in a correct equating. The two-parameter model has two parameter estimates that may differ across samples for each item. The one-parameter model only has one opportunity to differ for a given item. Therefore, the criterion of equating a test to itself seems to favor the one-parameter model which has only half as many opportunities to produce incorrect results. The equatings of the test to itself were somewhat surprising in that the amount of difference between ability groups appeared to have no con- sistent relationship to the goodness of equating. However, this is con- sistent with the finding of Marco, Peterson and Stewart (1980) that types of examinee samples have little effect on equating if the anchor test is similar in content and difficulty to the tests being equated. In summary, the Rasch model, despite some concern that one criterion may be biased in its favor, seems to be providing better achievement test equatings than the two-parameter model in this study. It is clearly more consistent than the two-parameter model, especially for smaller sample sizes. Given the greater ease of using and interpreting the Rasch model and its lower cost, the Rasch model seems to most warrant further study in the Communication 100 testing system. 84 RECOMMENDATIONS In general the problem of scale drift associated with item charac- teristic curve models needs to be investigated more fully. Any design for implementing ICC methods in a testing system should include a syste- matic study of changes in item parameter estimates over time. Items with unstable estimates can be identified and removed from item pools. Studies should be designed to investigate and control the amount of scale drift over time. In the context of the Communication 100 testing system and others like it, there are three specific recommendations. First, continue investigating the Rasch model and calibrating items to the Spring 1979 scale. The Rasch model seems to be working well enough to warrant fur- ther study. 
The second recommendation is to work to improve the results of the Rasch model by evaluating items currently in the pool, as well as new items, using criteria relevant to the Rasch model and using these criteria to select items for retention or replacement. The criteria should include comparisons of items on the item tests of fit supplied by BICAL and on the stability of difficulty estimates over time.

The third recommendation is to explicitly study and attempt to control scale drift. Methods of controlling scale drift include improving the fit of items to the Rasch model and removal of items displaying unstable difficulty estimates from the item pool and from calibration sections before calibration. The effect of scale drift on test equating can be determined by administering a test previously given, equating the test in the normal way and then comparing the new equating to the original equating.

Classroom achievement tests present some unique problems to those seeking the advantages of ICC models. They tend to be less unidimensional than standardized aptitude tests and their dimensionality may change over time due to instructor or curriculum changes. Typically, they are of relatively low difficulty for the examinees, so lower asymptotes of the three-parameter model may be hard to estimate. Examinee sample sizes are typically much smaller than for standardized tests and item quality is usually lower. However, the potential advantages of ICC models are great. If ICC models can be made to work properly, estimates of ability or achievement will not be dependent on the particular sample of items in an examination nor upon the particular students in a classroom. The advantages of ICC models are sufficiently great, and the results of attempts to apply ICC models sufficiently positive, to warrant further research on and applications of ICC models.

APPENDIX

[Figures A1 through A6: scatterplots of the two-parameter difficulty estimates for the common items actually used for calibration in Study 1, each plotting the Spring 1979 estimates against those from another examination or examinee sample. The plots are not legible in this copy.]
[Table A1: two-parameter difficulty estimates for the outlying common items that were removed before calibration. The table was printed in landscape orientation and is not legible in this copy.]

LIST OF REFERENCES

Angoff, W. H. Scales, norms, and equivalent scores. In R. L. Thorndike (Ed.), Educational measurement (2nd ed.). Washington, D.C.: American Council on Education, 1971, pp. 508-600.

Bejar, I., Weiss, D., & Kingsbury, G. Calibration of an item pool for the adaptive measurement of achievement (Research Report No. 77-5). Minneapolis, Minn.: University of Minnesota, Psychometric Methods Program, Department of Psychology, 1977.

Birnbaum, A. Some latent trait models and their use in inferring an examinee's ability. In F. M. Lord & M. R. Novick, Statistical theories of mental test scores. Reading, Mass.: Addison-Wesley, 1968.

Cormier, G. An investigation of the fit of the Rasch measurement model to data from the Medical Admission Test. Unpublished doctoral dissertation, Michigan State University, 1977.

Cronbach, L. J. Beyond the two disciplines of scientific psychology. American Psychologist, 1975, 30, 116-127.

Douglass, F. M., Khavari, K. A., & Farver, P. D. A comparison of classical and latent trait item analysis procedures. Educational and Psychological Measurement, 1979, 39, 337-352.

Douglass, J. B. A process for testing a mathematical model for the solution of a practical problem: applications to test equating. Paper presented at the Annual Meeting of the American Educational Research Association, San Francisco, California, April 1979. (ERIC Document Reproduction Service No. ED 175 942)

Eisenberg, E. M., & Book, C. Applying latent trait theory to a course examination system: administration, maintenance and training. Paper presented at the Annual Meeting of the American Educational Research Association, Boston, Massachusetts, April 1980.

Hambleton, R. K., & Cook, L. L. Latent trait models and their use in the analysis of educational test data. Journal of Educational Measurement, 1977, 14, 75-96.

Hambleton, R. K., & Cook, L. L. Some results on the robustness of latent trait models. Paper presented at the Annual Meeting of the American Educational Research Association, Toronto, Canada, April 1978.

Hambleton, R. K., Swaminathan, H., Cook, L., Eignor, D., & Gifford, J. Developments in latent trait theory: models, technical issues and applications. Review of Educational Research, 1978, 48, 467-510.
Hambleton, R. K., & Traub, R. E. Analysis of empirical data using two logistic latent trait models. British Journal of Mathematical and Statistical Psychology, 1973, 26, 195-211.

Hutton, L. R. Some empirical evidence for latent trait model selection. Paper presented at the Annual Meeting of the American Educational Research Association, Boston, Mass., April 1980.

Lawley, D. N. On problems connected with item selection and test construction. Proceedings of the Royal Society of Edinburgh, 1943, 61, 273-287.

Lazarsfeld, P. F. The logical and mathematical foundation of latent structure analysis. In S. A. Stouffer et al., Measurement and prediction. Princeton: Princeton University Press, 1950.

Lord, F. M. A theory of test scores. Psychometric Monograph, 1952, No. 7.

Lord, F. M. An application of confidence intervals and of maximum likelihood to the estimation of an examinee's ability. Psychometrika, 1953, 18, 57-75. (a)

Lord, F. M. The relation of test score to the trait underlying the test. Educational and Psychological Measurement, 1953, 13, 517-548. (b)

Lord, F. M. An empirical study of item-test regression. Psychometrika, 1965, 30, 373-376.

Lord, F. M. An analysis of the Verbal Scholastic Aptitude Test using Birnbaum's three-parameter logistic model. Educational and Psychological Measurement, 1968, 28, 989-1020.

Lord, F. M. Estimation of latent ability and item parameters when there are omitted responses. Psychometrika, 1974, 39, 247-264.

Lord, F. M. The 'ability' scale in item characteristic curve theory. Psychometrika, 1975, 40, 205-217. (a)

Lord, F. M. Evaluation with artificial data of a procedure for estimating ability and item characteristic curve parameters (ETS RB 75-33). Princeton, N.J.: Educational Testing Service, 1975. (b)

Lord, F. M. Practical applications of item characteristic curve theory. Journal of Educational Measurement, 1977, 14, 117-138.

Lord, F. M. Item response theory and application. Paper presented at the Annual Meeting of the American Educational Research Association, Boston, Mass., April 1980.

Lord, F. M. Personal communication, April 1980.

Lord, F. M. Applications of item response theory to practical testing problems. Hillsdale, N.J.: Erlbaum, in press.

Lord, F. M., & Novick, M. R. Statistical theories of mental test scores. Reading, Mass.: Addison-Wesley, 1968.

McKinley, R. L., & Reckase, M. D. A successful application of latent trait theory to tailored achievement testing (Research Report 80-1). Columbia, Mo.: University of Missouri, Tailored Testing Research Laboratory, Educational Psychology Department, 1980.

Marco, G. L. Item characteristic curve solutions to three intractable testing problems. Journal of Educational Measurement, 1977, 14, 139-160.

Marco, G. L. Personal communication, April 1980.

Marco, G. L., Peterson, N. S., & Stewart, E. E. An evaluation of linear and equipercentile equating methods. Paper presented at the ETS Research Conference on Test Equating, Princeton, N.J., April 2-3, 1980.

Mosteller, F., & Tukey, J. W. Data analysis and regression. Reading, Mass.: Addison-Wesley, 1977.

Rasch, G. Probabilistic models for some intelligence and attainment tests. Copenhagen, Denmark: Danmarks Paedagogiske Institut, 1960.

Reckase, M. D. Unifactor latent trait models applied to multifactor tests: results and implications. Journal of Educational Statistics, 1979, 4, 207-230.
Ree, M. J. Estimating item characteristic curves. Applied Psychological Measurement, 1979, 3, 371-385.

Rentz, R. R., & Bashaw, W. L. Equating reading tests with the Rasch model, Volume I final report, Volume II technical reference tables. Athens, Ga.: University of Georgia, Educational Research Laboratory, 1975.

Warm, T. A. A primer of item response theory. 1978. (ERIC Document Reproduction Service No. ED 171 730)

Wood, R. L., Wingersky, M. S., & Lord, F. M. LOGIST: A computer program for estimating examinee ability and item characteristic curve parameters (Research Memorandum 76-6). Princeton, N.J.: Educational Testing Service, 1976.

Wright, B. D. Sample-free test calibration and person measurement. Proceedings of the 1967 Invitational Conference on Testing Problems. Princeton, N.J.: Educational Testing Service, 1968.

Wright, B. D. Solving measurement problems with the Rasch model. Journal of Educational Measurement, 1977, 14, 97-116.

Wright, B. D., Mead, R., & Bell, S. BICAL: Calibrating items with the Rasch model (Research Memorandum No. 23B). Chicago: University of Chicago, Statistical Laboratory, Department of Education, 1979.

Wright, B. D., & Stone, M. Best test design: Rasch measurement. Chicago, Ill.: MESA Press, 1979.