This is to certify that the dissertation entitled

Designing Optimal Item Pools for Computerized Adaptive Tests with Exposure Control

presented by Lixiong Gu has been accepted towards fulfillment of the requirements for the Ph.D. degree in Counseling, Educational Psychology, and Special Education.

Major Professor's Signature / Date

MSU is an Affirmative Action/Equal Opportunity Institution

DESIGNING OPTIMAL ITEM POOLS FOR COMPUTERIZED ADAPTIVE TESTS WITH EXPOSURE CONTROLS

By

Lixiong Gu

A DISSERTATION

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of

DOCTOR OF PHILOSOPHY

Department of Counseling, Educational Psychology, and Special Education

2007

ABSTRACT

DESIGNING OPTIMAL ITEM POOLS FOR COMPUTERIZED ADAPTIVE TESTS WITH EXPOSURE CONTROLS

By Lixiong Gu

Computerized adaptive testing requires a well-designed item pool containing an appropriate number of items to build an individualized test that matches the examinee's ability level. An optimal item pool can be defined as a pool consisting of appropriate items for each individual test that is capable of reaching the desired level of precision.
It also contains well-balanced items that will achieve optimal item usage and lower the cost of item creation. One method for developing an optimal item pool is Reckase's (2003) method, a Monte Carlo procedure for determining the properties of an optimal item pool. This study extends the method to item pools calibrated with the three-parameter logistic (3PL) model and applies it to situations in which no exposure control, the Sympson-Hetter procedure, or the a-stratified procedure is imposed to control the item exposure rate. The procedures for designing the item pool and two approaches to simulating test items are presented. The performance of each optimal item pool is evaluated alongside that of the operational item pools.

DEDICATION

To my wife Yanxuan, my parents, and my sister Lishu

ACKNOWLEDGEMENTS

I am deeply indebted to Professor Mark D. Reckase for his guidance in academics, the dissertation work, and my career path. Without his constant support and insightful comments, this work would not have been possible. I would also like to thank the three other members of my dissertation committee, Dr. Ken Frank, Dr. Richard Houang, and Dr. Hua-Hua Chang, for their helpful suggestions on this study. I am extremely grateful to Dr. Linda Chard for her critiques of the writing and assistance in editing the dissertation. Thanks also go to Dr. Mary Pommerich and Dr. Daniel Segall, who provided two operational item pools for this study; to Wei He, who shared her MATLAB programs on designing item pools for the 1PL model; and to Raymond Mapuranga for his comments on an early version of my dissertation. I am also grateful to Dr. Dianne Henderson-Montero and Dr. Venessa Lall, who supported me in balancing my time between operational work at Educational Testing Service and the work on my dissertation. My deep gratitude goes to my wife Yanxuan for her love and support, and to my parents and my sister, for their understanding and encouragement.
TABLE OF CONTENTS

LIST OF TABLES ........ vii
LIST OF FIGURES ........ x
Chapter I Introduction ........ 1
1.1 Research Context ........ 6
1.2 Summary ........ 13
Chapter II Item Pool Design and Components of Computerized Adaptive Testing ........ 15
2.1 Brief History of Computerized Adaptive Testing ........ 15
2.2 Pros and Cons of CAT ........ 16
2.3 Components of Computerized Adaptive Testing ........ 17
2.3.1 Item Pool ........ 18
2.3.2 Scoring Procedure ........ 20
2.3.3 Item Selection Procedure ........ 23
2.3.4 Stopping Rule ........ 26
2.4 Practical Constraints in Item Selection ........ 26
2.5 Exposure Control Methods ........ 30
2.5.1 Sympson-Hetter Exposure Control ........
31
2.5.2 a-Stratified Adaptive Testing ........ 34
2.6 Item Pool Design and Its Relationship with Other Components of CAT ........ 37
Chapter III Reckase's Simulation Method and Extensions to 3PL ........ 40
3.1 Basic Concepts of Reckase's Simulation Method ........ 40
3.2 Reckase's Method for Optimal Item Pool Calibrated with 1PL ........ 43
3.3 Reckase's Method Applied to 3PL ........ 48
3.3.1 Extending the "Bin" Concept ........ 50
3.3.2 Strategies to Generate Items for Item Pool Simulation with 3PL ........ 55
3.3.2.1 Prediction Model (PM) Strategy ........ 57
3.3.2.2 Minimum Test Information (MTI) Strategy ........ 58
3.3.3 Post-simulation Adjustment ........ 60
3.4 Design Adjustments to Different Exposure Control Methods ........ 65
3.4.1 Item Pool Design without Exposure Control ........ 65
3.4.2 Item Pool Design with Sympson-Hetter Exposure Control ........ 65
3.4.3 Item Pool Design with a-Stratified Exposure Control ........ 66
Chapter IV Methods ........ 69
4.1 Operational Item Pools ........ 69
4.2 Simulation Procedure ........
70
Step 1: Modeling CAT Procedures ........ 71
Step 2: Generating Examinee Population ........ 71
Step 3: Generating Item Parameters ........ 71
Step 4: Generating Response Data ........ 72
Step 5: Post-Simulation Adjustment ........ 73
4.3 Control Variables ........ 73
4.4 Evaluating Simulated and Operational Item Pools ........ 74
Chapter V The Performance of the Item Pools without Exposure Control ........ 80
5.1 Item Pools for Tests without Content Balance ........ 80
5.2 Item Pools for Tests with Content Balance ........ 89
5.3 Summary ........ 97
Chapter VI The Performance of the Item Pool with Sympson-Hetter Exposure Control ........ 99
6.1 Item Pools for Tests without Content Balance ........ 99
6.2 Item Pools for Tests with Content Balance ........ 107
6.3 Summary ........ 117
Chapter VII The Performance of the Item Pools with a-Stratified Exposure Control ........ 118
7.1 Item Pools for Tests without Content Balance ........ 118
7.2 Item Pools for Tests with Content Balance ........
126
7.3 Summary ........ 136
Chapter VIII Discussion ........ 137
8.1 A Revisit to the Definition of "Optimal" ........ 137
8.2 Implications on the Practice of Item Pool Development ........ 139
8.3 Implications on Item Pool Management ........ 141
8.4 Reckase's Method versus the Mathematical Programming Method ........ 142
8.5 Limitations and Future Studies ........ 143
APPENDIX ........ 145
REFERENCES ........ 174

LIST OF TABLES

Table 4.1 Simulation Design ........ 74
Table 5.1 Item Pool Size and Item Parameter Statistics for Arithmetic Reasoning without Exposure Control ........ 83
Table 5.2 Summary Statistics of the Performance of the Item Pools ........ 83
Table 5.3 Item Pool Size and Item Parameter Statistics for General Science without Exposure Control ........ 91
Table 5.4 Summary Statistics of the Performance of the Item Pools ........ 92
Table 5.5 Number of Over- and Under-Exposed Items by Content ........ 92
Table 6.1 Item Pool Size and Item Parameter Statistics for Arithmetic Reasoning with Sympson-Hetter Exposure Control ........ 102
Table 6.2 Summary Statistics of the Performance of the Item Pools ........ 102
Table 6.3 Item Pool Size and Item Parameter Statistics for General Science with Sympson-Hetter Exposure Control ........ 110
Table 6.4 Summary Statistics of the Performance of the Item Pools ........ 111
Table 6.5 Number of Over- and Under-Exposed Items by Content ........ 111
Table 7.1 Item Pool Size and Item Parameter Statistics for Arithmetic Reasoning with a-Stratified Exposure Control ........ 121
Table 7.2 Summary Statistics of the Performance of the Item Pools ........ 121
Table 7.3 Item Pool Size and Item Parameter Statistics for General Science with a-Stratified Exposure Control ........ 129
Table 7.4 Summary Statistics of the Performance of the Ideal Item Pools ........ 130
Table 7.5 Percentage of Over- and Under-Exposed Items by Content ........ 130
Table A.1 Item Distribution for the Operational Item Pool - Arithmetic Reasoning ........ 146
Table A.2 Item Distribution for Item Pool Designed by MTI Method and without Exposure Control - Arithmetic Reasoning ........ 147
Table A.3 Item Distribution for Item Pool Designed by PM Method and without Exposure Control - Arithmetic Reasoning ........ 148
Table A.4 Item Distribution for Item Pool Simulated with MTI Method and with Sympson-Hetter Exposure Control - Arithmetic Reasoning ........ 149
Table A.5 Item Distribution for Item Pool Simulated with PM Method and with Sympson-Hetter Exposure Control - Arithmetic Reasoning ........ 150
Table A.6 Item Distribution for Item Pool Simulated with MTI Method and with a-Stratified Exposure Control - Arithmetic Reasoning ........ 151
Table A.7 Item Distribution for Item Pool Simulated with PM Method and with a-Stratified Exposure Control - Arithmetic Reasoning ........ 152
Table A.8 Item Distribution for the Operational Item Pool - General Science Content 1 ........ 153
Table A.9 Item Distribution for the Operational Item Pool - General Science Content 2 ........ 154
Table A.10 Item Distribution for the Operational Item Pool - General Science Content 3 ........ 155
Table A.11 Item Distribution for the Optimal Item Pool Designed by MTI and without Exposure Control - General Science Content 1 ........ 156
Table A.12 Item Distribution for the Optimal Item Pool Designed by MTI and without Exposure Control - General Science Content 2 ........ 157
Table A.13 Item Distribution for the Optimal Item Pool Designed by MTI and without Exposure Control - General Science Content 3 ........ 158
Table A.14 Item Distribution for the Optimal Item Pool Designed by PM and without Exposure Control - General Science Content 1 ........ 159
Table A.15 Item Distribution for the Optimal Item Pool Designed by PM and without Exposure Control - General Science Content 2 ........ 160
Table A.16 Item Distribution for the Optimal Item Pool Designed by PM and without Exposure Control - General Science Content 3 ........ 161
Table A.17 Item Distribution for the Optimal Item Pool Designed by MTI and with Sympson-Hetter Exposure Control - General Science Content 1 ........ 162
Table A.18 Item Distribution for the Optimal Item Pool Designed by MTI and with Sympson-Hetter Exposure Control - General Science Content 2 ........ 163
Table A.19 Item Distribution for the Optimal Item Pool Designed by MTI and with Sympson-Hetter Exposure Control - General Science Content 3 ........ 164
Table A.20 Item Distribution for the Optimal Item Pool Designed by PM and with Sympson-Hetter Exposure Control - General Science Content 1 ........ 165
Table A.21 Item Distribution for the Optimal Item Pool Designed by PM and with Sympson-Hetter Exposure Control - General Science Content 2 ........ 166
Table A.22 Item Distribution for the Optimal Item Pool Designed by PM and with Sympson-Hetter Exposure Control - General Science Content 3 ........ 167
Table A.23 Item Distribution for the Optimal Item Pool Designed by MTI and with a-Stratified Exposure Control - General Science Content 1 ........ 168
Table A.24 Item Distribution for the Optimal Item Pool Designed by MTI and with a-Stratified Exposure Control - General Science Content 2 ........ 169
Table A.25 Item Distribution for the Optimal Item Pool Designed by MTI and with a-Stratified Exposure Control - General Science Content 3 ........ 170
Table A.26 Item Distribution for the Optimal Item Pool Designed by PM and with a-Stratified Exposure Control - General Science Content 1 ........ 171
Table A.27 Item Distribution for the Optimal Item Pool Designed by PM and with a-Stratified Exposure Control - General Science Content 2 ........ 172
Table A.28 Item Distribution for the Optimal Item Pool Designed by PM and with a-Stratified Exposure Control - General Science Content 3 ........ 173

LIST OF FIGURES

Figure 1.1 Steps of computerized adaptive testing ........ 2
Figure 3.1 Demonstration of determining bin width ........ 42
Figure 3.2 Items used for two individual examinees ........ 45
Figure 3.3 Item pool for two examinees ........ 46
Figure 3.4 Item pool for 5000 examinees ........ 47
Figure 3.5 Item information provided by two different items ........ 49
Figure 3.6 Bins defined by both a- and b-parameters ........ 51
Figure 3.7 Item distribution by b-Bins and ab-Bins ........ 54
Figure 3.8 Bivariate plot of b-parameter and a-parameter for operational item pool ........ 56
Figure 3.9 Demonstration of items in one bin offering more information than items in another bin ........ 61
Figure 3.10 Items in the order of information provided most in each b-bin ........ 63
Figure 3.11 Item usage in the order of information provided most in each b-bin ........ 64
Figure 3.12 Item distribution for optimal item pool before adjustment ........ 64
Figure 3.13 Item distribution for optimal item pool after post-simulation adjustment ........ 65
Figure 5.1 Item distribution for item pools without exposure control ........ 81
Figure 5.2 Test-retest overlap rate conditional on θ ........
84
Figure 5.3 Item exposure rate by difficulty level ........ 85
Figure 5.4 Average test information conditional on true θ ........ 86
Figure 5.5 Conditional standard error of measurement (CSEM) ........ 87
Figure 5.6 Conditional bias ........ 88
Figure 5.7 Conditional mean square error (CMSE) ........ 88
Figure 5.8 Item distribution for item pools with content balancing and without exposure control ........ 89
Figure 5.9 Test-retest overlap rate conditional on θ ........ 93
Figure 5.10 Item exposure rate by difficulty level ........ 94
Figure 5.11 Average test information conditional on true θ ........ 95
Figure 5.12 Conditional standard error of measurement (CSEM) ........ 96
Figure 5.13 Conditional bias ........ 96
Figure 5.14 Conditional mean square error (CMSE) ........ 97
Figure 6.1 Item distributions for item pools with Sympson-Hetter exposure control ........ 99
Figure 6.2 Test-retest overlap rate conditional on θ ........ 103
Figure 6.3 Item exposure rate by difficulty level ........ 104
Figure 6.4 Average test information conditional on true θ ........ 105
Figure 6.5 Conditional standard error of measurement (CSEM) ........
106
Figure 6.6 Conditional bias ........ 106
Figure 6.7 Conditional mean square error (CMSE) ........ 107
Figure 6.8 Item distribution for item pools with Sympson-Hetter exposure control ........ 108
Figure 6.9 Test-retest overlap rate conditional on θ ........ 112
Figure 6.10 Item exposure rate by difficulty level ........ 113
Figure 6.11 Average test information conditional on true θ ........ 115
Figure 6.12 Conditional standard error of measurement (CSEM) ........ 116
Figure 6.13 Conditional bias ........ 116
Figure 6.14 Conditional mean square error (CMSE) ........ 117
Figure 7.1 Item distribution for item pools without content balancing and with a-stratified exposure control ........ 118
Figure 7.2 Test-retest overlap rate conditional on θ ........ 122
Figure 7.3 Item exposure rate by difficulty level ........ 123
Figure 7.4 Average test information conditional on true θ ........ 124
Figure 7.5 Conditional standard error of measurement (CSEM) ........ 125
Figure 7.6 Conditional bias ........ 125
Figure 7.7 Conditional mean square error (CMSE) ........
126
Figure 7.8 Item distribution for item pools with content balancing and without exposure control ........ 127
Figure 7.9 Test-retest overlap rate conditional on θ ........ 131
Figure 7.10 Item exposure rate by difficulty level ........ 132
Figure 7.11 Average test information conditional on true θ ........ 133
Figure 7.12 Conditional standard error of measurement (CSEM) ........ 134
Figure 7.13 Conditional bias ........ 135
Figure 7.14 Conditional mean square error (CMSE) ........ 135

Chapter I Introduction

Since its introduction in the early 1970s, computerized adaptive testing (CAT) has been used extensively in educational and psychological assessments (Lord, 1971; Reckase, 1974; Weiss, 1976). The objective of adaptive testing is to build individualized tests by selecting items based on the examinee's current ability estimate. Test takers do not receive questions that are either too difficult or too easy, but are always challenged by appropriate items during the entire course of the testing. From the test administrator's perspective, when examinees are given the questions that maximize the information about their ability levels obtained from each item response, reduced standard errors and satisfactory measurement precision can be achieved with only a handful of properly selected items. This potentially leads to more efficient item usage and more accurate ability estimates (Weiss, 1976). Adaptive tests are often administered with computers, which can quickly update the estimate of the examinee's ability after each item and then select subsequent items based on the estimate. Therefore,
when adaptive tests are mentioned, they are often referred to as CAT.

Figure 1.1 shows a typical CAT process. It begins with an item pool (also called an item bank) that contains an adequate number of items calibrated using Item Response Theory (IRT) (Lord, 1980). The CAT algorithm is most commonly an iterative process involving a procedure for obtaining ability estimates based on candidate item performance and an algorithm for sequencing the set of test items to be administered to candidates. A new ability estimate is computed based on the responses to all of the administered items, and the "best" next item is administered. This process is repeated until it meets a certain stopping criterion, such as a time limit, the number of items administered (test length limit), the change in the ability estimate, content coverage, a precision indicator such as the standard error, or a combination of factors.

[Figure 1.1 Steps of computerized adaptive testing: select an item from the pool using the initial ability estimate, administer the item, obtain the response, and update the provisional trait estimate; if fewer items than the test length have been administered, select the next item, otherwise compute the final score.]

Computerized adaptive tests offer many advantages over paper-and-pencil tests. Examinees get flexible testing schedules and the opportunity to obtain their scores immediately following administration of the test. More importantly, more accurate ability estimates and higher test reliability and validity are achieved. Test developers may be able to adopt new item formats in computerized administration that are not possible with paper-and-pencil tests. CAT has also dramatically changed the way tests are administered. Traditional paper-and-pencil tests are usually administered in classrooms or a large auditorium, where desks and chairs are the only requirements for a testing site. CAT, however, needs more expensive equipment (a computer system) and a more individualized space to ensure privacy and security.
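As a rough illustration of the loop in Figure 1.1, the sketch below pairs maximum-information item selection with a simple provisional ability update under the three-parameter logistic (3PL) model. This is not the operational algorithm studied in this dissertation: the item parameters are randomly generated, and the step-halving score update is a crude stand-in for maximum likelihood or Bayesian scoring.

```python
import math
import random

def p3pl(theta, a, b, c):
    """3PL probability of a correct response at ability theta."""
    return c + (1 - c) / (1 + math.exp(-1.7 * a * (theta - b)))

def info(theta, a, b, c):
    """Fisher information of a 3PL item at theta."""
    p = p3pl(theta, a, b, c)
    return (1.7 * a) ** 2 * ((1 - p) / p) * ((p - c) / (1 - c)) ** 2

def simulate_cat(pool, true_theta, test_length=10):
    """Fixed-length CAT: repeatedly pick the unused item with maximum
    information at the provisional theta, score it against the simulated
    examinee, and nudge theta up or down with a shrinking step."""
    theta, used, step = 0.0, set(), 1.0
    for _ in range(test_length):
        # maximum-information selection over unadministered items
        item = max((i for i in range(len(pool)) if i not in used),
                   key=lambda i: info(theta, *pool[i]))
        used.add(item)
        a, b, c = pool[item]
        correct = random.random() < p3pl(true_theta, a, b, c)
        theta += step if correct else -step
        step *= 0.7  # crude stand-in for an MLE/EAP update
    return theta, used

random.seed(1)
# hypothetical pool: (a, b, c) triples spread across difficulty
pool = [(round(random.uniform(0.5, 2.0), 2),
         round(random.uniform(-3.0, 3.0), 2), 0.2) for _ in range(200)]
est, used = simulate_cat(pool, true_theta=1.0)
```

Even this toy version exhibits the usage pattern discussed later in this chapter: highly discriminating items near the provisional ability estimate are chosen first, so a handful of items in the pool absorb most of the exposure.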
Specialized test centers with appropriate computer equipment have been set up that accommodate only a few examinees at a time. However, tests can be offered on a nearly continuous basis, with multiple administrations per day throughout the week, so that large testing volumes can still be accommodated. Continuous testing gives test administrators and examinees more freedom, yet it also poses difficulties in test development and test security.

Computerized adaptive testing administers different items to different examinees according to their ability levels. It thus requires a large number of diversely distributed items to measure each person with precision and efficiency (Embretson, 2001). Items are needed at the extreme levels of difficulty, although it may be arduous for even an experienced item writer to produce such items. The high cost of item development becomes one of the major impediments to cost-effective implementation of computerized adaptive testing. A quality item usually needs to go through a lengthy and costly procedure, including item writing, content and editorial review, pretesting, item analysis, and final review, before it can be chosen for an operational test. In particular, item writing relies on experienced human item writers, who may rather slowly produce items of varying levels of quality. Some of the items that are produced may fail to meet screening criteria or may be unable to achieve adequate psychometric properties based on an empirical tryout.

Until the Internet became widely used, CAT was considered more secure than paper-and-pencil tests because different examinees receive different test items. It is difficult to artificially promote one's score by merely studying a few items (Wainer, 1990). One would have to learn a large portion of the item pool in order for pre-knowledge to have any impact on an examinee's score, because the test is individually tailored to his or her ability level.
However, the increasingly popular Internet makes it easy for examinees to share test-related materials with each other. Presently, a student taking a test on a future date can obtain information from a friend who took the test on the previous day (Davis, 2002).

Item leakage and the high costs of item development have been the primary driving forces for research on designing and maintaining CAT item pools for better item usage. One of the solutions is to design high-quality items more efficiently. A promising technique is item cloning, also called model-based item generation. This approach starts with a formal description of a set of "parent items" along with algorithms to derive families of clones from them (Bejar, 1993, 1996; Bejar & Yocom, 1991; Hively, Patterson, & Page, 1968; Osburn, 1968). These parents are also known as "item forms," "item templates," or "item shells." More recent item generation research has tried to model the relationship between the parts of an item and its psychometric properties, such as item difficulty and item discrimination (Bejar, Lawless, Morley, Wagner, & Bennett, 2003; Glas & van der Linden, 2003; Graf, Peterson, Steffen, & Lawless, 2005). Items with expected content coverage and psychometric properties (e.g., item difficulty and discrimination) can be produced by varying the essential parts of an item (Deane, Graf, Higgins, Futagi, & Lawless, 2006; Singley & Bennett, 2002). Comprehensive reviews of item-cloning techniques are given in Bejar (1993).

Item cloning offers some hope of producing a large number of items in a very short time, but developing the item models required for cloning can be very difficult and costly. A more realistic solution is to optimize the blueprint for item pool development. In common computerized adaptive testing practice, items are selected to maximize item information at the estimated ability level, so that the most is learned about an examinee's ability (Wainer, 1990).
This practice produces variability in the frequency with which items are used, because items differ in the desirability of their characteristics for measuring an examinee's ability level. For example, items of average difficulty will be selected for administration more often because of the assumed normal distribution of examinee ability. Additionally, more discriminating items will be selected more often because they tend to be more informative. If item writers know how many items with certain psychometric properties are needed for an item pool, they will not waste resources on developing too many items that are rarely used while creating too few items that may be frequently used.

This project investigates the application of a simulation method developed by Reckase (2003, 2004) to designing the blueprint for a CAT item pool. Because an optimal item pool will be different under different test situations, this project also explores item pool design when different exposure control methods are used, specifically the Sympson-Hetter and a-stratified methods. The first chapter briefly introduces the background against which this project was developed. The second chapter reviews the literature on computerized adaptive testing, exposure control, and the various methods developed to design the blueprint that optimizes item pool construction. Chapter 3 illustrates Reckase's method in detail and provides extensions of the method to applications with the three-parameter logistic IRT model. Following these sections, simulation studies are conducted to demonstrate the item pool design process and investigate the effectiveness of Reckase's method compared to a real item pool. Finally, the implications of this study and future directions are presented.

1.1 Research Context

The idea of an item pool is not new. It evolved long before computerized adaptive testing became popular.
Even in conventional paper-and-pencil tests, a well-designed item pool provides test developers and teachers a convenient yet powerful tool for producing high-quality tests. Item pools have been called by such terms as "item banks," "question banks," "item collections," "item reservoirs," and "test item libraries." Although distinctions among some of these terms can be made, they all refer to a relatively large collection of easily accessible test questions (Millman & Arter, 1984). In other words, an item pool has to include a number of items that exceeds by several times the number to be used in any one test. Items in the pool are indexed, structured, or otherwise assigned information that can be used to facilitate their selection for a test. Millman and Arter (1984) categorized this information into assigned item characteristics (e.g., keywords or other subject matter classifiers) and measured item characteristics (e.g., item difficulty or discrimination). The latter are also known as psychometric characteristics.

The concept of an item pool is expanded in computerized adaptive testing. Two kinds of item pools are distinguished in a typical CAT program. One is often called the master pool, which includes as many items as can possibly be created for testing use. The other kind is the operational item pool, a smaller subset of the master pool that, by design, has to be small enough that the computer may easily retrieve items and item exposure can be minimized, yet large enough to provide items with the required characteristics. Due to the continuous nature with which many CATs are administered, the useful life of an operational item or the entire operational item pool is limited. After a certain number of uses, items are retired and put back into the master pool. Some items can be reused only after a reasonably long time.
One question often asked during item pool design is how many items should be in a pool. Ideally, the more items the better, because a large pool allows more choices in test assembly, and seldom do the same items appear in tests repeatedly. With larger pools, it is difficult for examinees to memorize answers. This matters in situations where learners have access to the item pool. Larger pools also mean that more items matching the content, item format, and statistical requirements are available (Millman & Arter, 1984). The caveats, however, are: (a) the items added to the pool should be well written, content valid, and statistically fit; and (b) the total number of items should be manageable and easily retrievable.

In paper-and-pencil test situations, test items are not reused as often. Millman and Arter (1984) and Prosser (1974) suggested the rule of thumb that an item pool contain 10 items for each one that could be used on a testing occasion and 50 items for each class hour of presented material. In computerized adaptive testing, Luecht (1998) suggests that between 3,800 and 21,000 items may be needed to begin a CAT program when sufficient pool size, multiple pools, and item pretesting are taken into consideration. Guidelines that have been suggested for the appropriate size of the operational item pool are 150-200 items, or from six to twelve times the test length of an operational form (Luecht, 1998; Patsula & Steffan, 1997; Stocking, 1994; Weiss, 1985). However, issues of item exposure, item retirement, and pool rotation may require this number to be much larger. An often overlooked issue in item pool design is how to construct a blueprint that outlines the optimal composition of items with desirable assigned and psychometric characteristics.
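As a quick back-of-envelope check, the rules of thumb above translate directly into concrete pool sizes. The 20-item test and the 45-hour course used below are hypothetical values chosen only for illustration.

```python
# Pool sizes implied by the guidelines cited above, for a hypothetical
# 20-item adaptive test and a hypothetical 45-hour course.
test_length = 20

paper_pool = 10 * test_length                 # 10 items per test slot
course_pool = 50 * 45                         # 50 items per class hour
cat_pool_range = (6 * test_length,            # six to twelve times the
                  12 * test_length)           # operational test length
```

Even these modest guideline values show why exposure control and pool rotation push the required counts far higher in practice.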
The blueprint, as the outcome of item pool design, can tell item writers to write items not only by format (multiple-choice or constructed-response) and content coverage, but also by the desired psychometric characteristics of the items. The blueprint is optimal in that it consists of appropriate items for each individual test that is capable of reaching the desired level of precision. An optimal blueprint also contains well-balanced items to achieve optimal item usage and lower the cost of item creation.

Optimizing an item pool may not be an important issue in paper-and-pencil tests, where item exposure is not much of a concern. Usually such an item pool may require only a few items in each assigned item characteristic, more moderately difficult items, and some extremely easy and extremely difficult items. In computerized adaptive testing, items have more chances to be overexposed, and the costs of developing new items are so high that an operational item pool needs a more balanced item composition in order to reduce the exposure of often-used items and increase the exposure of less-used items. A better way to address this problem is to design and develop item pools in a more systematic and empirical manner.

The item-writing process is usually guided by appropriately designed test specifications that outline the content attributes and their distributions. Requirements for statistical attributes, such as the range of difficulty, may be provided but are often difficult to satisfy, simply because the values of statistical attributes for individual items are not easily predicted. However, at the item pool level they often show persistent patterns of correlation with content attributes. These patterns can be used to minimize the item-writing effort.
Through careful modeling of the CAT procedure, test specifications for the item pool can be developed with computer simulations that forecast the number of items needed with specific attributes (van der Linden, 1999; Reckase, 2003). The methods compared here are for the design of a single item pool and can serve as tools for monitoring the item-writing process.

Only a few empirical studies on optimal item pool design have been documented for computerized adaptive testing. Among them there are two bodies of research. One assumes the existence of a master item pool, with research focusing on the best way to allocate items into multiple operational item pools. The other focuses on the design of the operational item pool independent of existing items. These studies assume no existing items and focus on the design of a blueprint for item pool construction in order to provide precise ability estimation and minimize item exposure.

Stocking and Swanson's (1998) system of rotating item pools assumed the presence of a master item pool from which several smaller operational pools were generated. The number of operational pools each item is included in can be manipulated so that items with higher exposure rates are assigned to a smaller number of pools and items with lower rates to a larger number of pools. By randomly rotating the operational pools during testing, uniformly distributed exposure rates for the test items can be achieved. Ariel, Veldkamp, and van der Linden (2004) also presented a mathematical method to calculate the optimal way to allocate items from a master item pool into multiple operational pools. The effectiveness of the rotating-pools method, however, relies on the quality of the master pool. Items have to be available in the master pool to be assigned into the smaller operational pools. Apparently, if the master pool contains difficult items only, even the optimally assembled rotating pools would not have easy items.
The fundamental issue in item pool design is how to design an item pool without assuming the existence of the master item pool, in order to explore the ideal characteristics, such as the size and the item distribution, that the item pool should have to function efficiently. Research concerning the design of the item pool from scratch focuses on developing a blueprint for an item pool: a document that specifies the attributes of the items needed in a new pool or an extension of an existing pool. The blueprint is designed to allow for the assembly of a prespecified number of test forms from the bank, each with its own set of specifications. The resulting item pool would allow the CAT procedure to generate adequate measurement precision for a majority of the test takers, even with the constraints of exposure control or content balancing. A favorable consequence is that the number of unused items in the bank is also minimized. As will become clear, the blueprint specifies not only the number of items with each content coverage, but also the number of items with certain psychometric properties, particularly the ranges of the IRT parameters.

Boekkooi-Timminga (1991) used integer programming to calculate the number of items needed for future test forms. She used a sequential approach that maximized the test information function (TIF) under the one-parameter logistic (Rasch) model. These results were then used to improve the composition of an existing item bank. Subsequently, several methods for the construction of rotating item pools have been demonstrated in empirical studies, some achieving the design goal with integer programming methods (for a review of these methods, see Ariel, Veldkamp, & van der Linden, 2004). Veldkamp and van der Linden (1999) described five steps to design an optimal blueprint for a CAT item pool with a mathematical programming method.
First, a set of specifications for the CAT is analyzed and all item attributes figuring in the specifications are identified. Second, using the specifications, an integer programming model for the assembly of the shadow tests in the CAT simulation is formulated. Third, the population of examinees is identified and an estimate of its ability distribution is obtained, for example, from historical data. Fourth, the CAT simulation is carried out using the integer programming model for the shadow tests and sampling simulees from the ability distribution. Counts of the number of times items are drawn from the cells in the classification table are collected. Fifth, the blueprint is calculated from these counts, adjusting them to obtain optimal projections of the item exposure rates.

Because the basic idea of mathematical programming is to optimize resource allocation under the assumption of limited resources, in this case the optimization is constrained by the prespecified resources. It thus assumes a design space that is defined as the Cartesian product of all item attributes figuring in the specifications of the tests in the program. Each combination of the attributes represents a virtual item. For example, a combination of content coverage and a-, b-, and c-parameters represents a virtual item available for the optimal item pool. Because item parameters are real numbers, there are infinitely many combinations of item attributes. To simplify the problem, discrete values are chosen to represent the possible values of the item parameters. For example, b = -3.0, -2.9, ..., 2.9, 3.0 represent the possible b-parameter values and a = 0.1, 0.2, ..., 2.9, 3.0 represent the possible values for the a-parameters. Any combination of the a- and b-parameters represents a virtual item. Veldkamp and van der Linden (1999) demonstrated that modeling the constraints for the CAT version of the GMAT led to a design space containing 12,096 items.
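The design-space construction described above can be sketched in a few lines of Python. The discretization steps, the content areas, and the resulting count below are illustrative assumptions, not the GMAT specification discussed in the text.

```python
from itertools import product

# Hypothetical discretization, following the example in the text:
# b-parameters from -3.0 to 3.0 and a-parameters from 0.1 to 3.0,
# both in steps of 0.1.
b_values = [round(-3.0 + 0.1 * k, 1) for k in range(61)]   # 61 values
a_values = [round(0.1 + 0.1 * k, 1) for k in range(30)]    # 30 values
content_areas = ["arithmetic", "algebra", "geometry"]      # illustrative only

# Each (content, a, b) combination is one "virtual item" in the design space.
design_space = list(product(content_areas, a_values, b_values))
# 3 * 30 * 61 = 5490 virtual items for this toy discretization
```

Finer grids or additional attributes (c-parameters, item format, and so on) multiply the size of the space, which is why the simulation can become computationally arduous.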
Formulating constraints is an important step in the mathematical programming method. Van der Linden (1998) laid out three kinds of constraints based on their mathematical types: categorical item attributes, quantitative item attributes, and inter-item dependencies. Categorical item attributes are attributes that characterize the content, format, or author of items. Quantitative item attributes are quantitative properties items have, such as word counts, difficulty parameters, and discrimination indices. Inter-item dependencies deal with possible relations of exclusion and inclusion between the items in the pool, such as items in so-called enemy sets, in which items cannot be included in the same test.

The advantage of the mathematical programming method is that it is able to model complicated test specifications. Once the constraints are identified and transformed into numerical constraints, special software is available to simulate the optimal item pool. However, item pool design with the mathematical programming method is closely tied to the shadow test procedure in item selection and requires knowledge of special optimization software. Depending on the way item attributes are partitioned, the design space can be very large and the simulation process becomes computationally arduous.

Reckase (2003, 2004) took a slightly different approach and avoided using mathematical programming. This approach does not assume pre-existing items. Instead, items are simulated (in terms of their IRT parameters) to match the current ability estimates and provide optimal information. Reckase's method first partitions the target item pool into smaller ones based on different non-statistical attributes, such as content. Then the CAT process is simulated to construct the small item pools simultaneously. The simulation starts with an examinee randomly drawn from the expected examinee distribution to receive the adaptive test.
Each item is simulated to be the optimal item based on the current ability estimate. The same procedure is repeated for subsequent examinees, and the items needed to support a large sample of examinees are tallied and become the optimal item pool. Exposure control rules can be built into the simulation to decide how many times an item can be reused. This procedure has been demonstrated successfully with widely available programming software in the design of CAT item pools for the TABE and NCLEX.

1.2 Summary

The present study reports on the development of optimal item pools for computerized adaptive tests in an investigation of the relative merits of two different strategies. A modified version of Reckase's method is applied to designing optimal item pools calibrated with the three-parameter logistic model. Sympson-Hetter and a-stratified exposure control methods, as well as content balancing, are investigated.

For the purposes of this research, it was desirable to have an operational pool of items measuring an empirically significant dimension of ability while balancing the content areas measured. Operational item pools for two sections of the CAT-ASVAB were chosen as the design target in this study. The final item pools were designed to meet the criteria described by van der Linden (2000): (1) each would be sufficiently large to allow several thousand overlapping subtests to be drawn from its items; (2) the items would span the entire range of item difficulty relative to the population of interest; and (3) each would consist of an appropriate mix of high- and low-discriminating items to lower the item creation cost while meeting the needs of test precision. This study compared simulated optimal item pools to operational item pools on item distribution and on performance for examinees randomly sampled from the expected examinee distribution.
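The core of the Monte Carlo design logic used in this study can be sketched in miniature under several simplifying assumptions: a standard normal examinee distribution, fixed a- and c-parameters, a crude step-size update standing in for a full MLE or Bayesian scoring procedure, and no exposure control. All numerical settings below are hypothetical.

```python
import math
import random
from collections import Counter

random.seed(1)
D = 1.7  # scaling constant of the 3PL model

def p3pl(theta, a, b, c):
    """3PL probability of a correct response."""
    return c + (1 - c) / (1 + math.exp(-D * a * (theta - b)))

def simulate_blueprint(n_examinees=500, test_length=20,
                       a=1.2, c=0.2, bin_width=0.4):
    """Tally the b-parameters of the 'optimal' items generated while
    simulating adaptive tests; the tallies form the pool blueprint."""
    bins = Counter()
    for _ in range(n_examinees):
        theta = random.gauss(0.0, 1.0)   # expected examinee distribution
        theta_hat = 0.0                  # start at the population mean
        for j in range(1, test_length + 1):
            b = theta_hat                # optimal item matches the estimate
            bins[round(b / bin_width) * bin_width] += 1
            if random.random() < p3pl(theta, a, b, c):
                theta_hat += 2.0 / j     # crude shrinking-step update in
            else:                        # place of MLE/Bayesian scoring
                theta_hat -= 2.0 / j
    return bins

blueprint = simulate_blueprint()
```

The resulting counts per b-parameter bin indicate how many items with that difficulty the pool would need; adding content partitions or an exposure cap on each bin follows the same tallying idea.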
The simulation study took into consideration the distribution of the examinee population, content balancing, and the expected precision of ability estimates. The following research questions are investigated in this study:

1. What does the optimal item pool designed for a computerized adaptive test look like when the item selection procedure imposes no exposure control, when it incorporates the Sympson-Hetter method, or when it incorporates the a-stratified method?
2. What do the optimal item pools designed for a computerized adaptive test look like when the test does not need content balancing, and when the test needs content balancing?
3. Do optimal item pools designed by Monte Carlo simulation perform better than the real operational item pools in terms of empirical criteria?

Chapter II
Item Pool Design and Components of Computerized Adaptive Testing

An optimal item pool design is based on an in-depth understanding of the mechanism of CAT. This chapter introduces computerized adaptive testing, its history, its pros and cons, and the components of the CAT procedure, which include the item pool, the item selection procedure, the ability estimation method, and the stopping rules. Special attention will be given to exposure control procedures, which are an integral part of the item selection procedure, and to how the design of the item pool should be based on the analysis of its relationship with the other CAT components.

2.1 Brief History of Computerized Adaptive Testing

Computerized adaptive testing (CAT), as its name suggests, is adaptive testing delivered by computers. While CAT has only recently become a major force in measurement practice, the idea of adaptive testing is not new. It has always been recognized that items that are too easy or too difficult contribute little to the information about an examinee's ability level. By eliminating the need to administer items of inappropriate difficulty, adaptive testing can shorten testing time,
increase measurement precision, and reduce measurement error due to boredom, frustration, or guessing (Wainer, 1990).

The first adaptive test is known to be Alfred Binet's (1905) intelligence test, which is still in use today in a more modern version. Since the concern was with the diagnosis of the individual child, rather than the group, Binet realized he could tailor the test to the individual by a simple strategy: first rank-ordering the items in terms of difficulty, then starting to test the child with a subset of items targeted at his approximation of the candidate's ability. If the child gave correct answers, harder item subsets were administered until the child answered a few questions in a row incorrectly. If the child failed the initial item subset, then easier item subsets would be administered until the child succeeded frequently. From this information, the child's ability level could be estimated.

Lord's (1980) Flexilevel testing procedure and its variants, such as Henning's (1987) Step procedure and Lewis and Sheehan's (1990) Testlets, are a refinement of Binet's method. The items are stratified by difficulty level, and several subsets of items are formed at each level. The test then proceeds by administering subsets of items and moving up or down in accord with the success rate on each subset. After the administration of several subsets, the final ability estimate is obtained. Though a crude approach, these methods can produce approximately the same results as more sophisticated CAT techniques (Yao, 1991).

The use of computers facilitated a further advance in adaptive testing, with convenient administration and selection of single items. Reckase's (1974) study is an early example of this methodology of computerized adaptive testing. Since the mid-1990s, many well-known large-scale, high-stakes testing programs, such as the GRE, GMAT, NCLEX, and LSAT, have switched from paper-and-pencil tests to computerized adaptive tests.
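The subset-branching logic shared by Binet's procedure and the Flexilevel-style methods could be expressed as a single update rule. The success-rate thresholds and level bounds below are hypothetical choices for illustration, not values taken from the cited sources.

```python
# Subsets are stratified into difficulty levels 0 (easiest) .. n_levels - 1.
def next_level(level, n_correct, subset_size, n_levels):
    """After a subset is administered, move up one difficulty stratum on a
    high success rate, down on a low one, and otherwise stay put."""
    rate = n_correct / subset_size
    if rate >= 0.7:          # hypothetical "succeeded frequently" threshold
        level += 1
    elif rate <= 0.3:        # hypothetical "failed the subset" threshold
        level -= 1
    return max(0, min(n_levels - 1, level))
```

Iterating this rule over several subsets and reading off the final level is, in essence, the crude ability estimate these pre-computer procedures produced.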
2.2 Pros and Cons of CAT

The advantages of and cautions about CAT have been well documented (e.g., Rudner, 1998; Wainer, 1990; Wainer & Eignor, 2000). One advantage is that, in general, CAT greatly increases the flexibility of test management (e.g., Grist, Rudner, & Wise, 1989; Weiss & Kingsbury, 1984). Tests are individually paced, so an examinee does not have to wait for others to finish before going on to the next section. Self-paced administration also offers extra time for examinees who need it, potentially reducing one source of test anxiety. A number of options for timing and formatting, notably interactive audio and video, can be offered. Therefore, CAT has the potential to accommodate a wider range of item types.

Besides flexibility, CAT increases test efficiency by shortening the test length without reduction in measurement precision. A shorter test reduces examinee frustration, boredom, and fatigue, factors that can significantly affect an examinee's test results. Since examinees see only those items appropriate for their ability level, they remain engaged and challenged. In addition, individualized tests allow easy removal of faulty items. With CAT, a poorly performing or incorrect item would affect only a segment of the test takers, and even for those, the self-correcting nature of CAT would make it unlikely there would be any impact on the pass-fail decision.

2.3 Components of Computerized Adaptive Testing

Reckase (1989) listed four major components of a computerized adaptive test: the item pool, the item selection procedure, the scoring (ability estimation) procedure, and the stopping rule. Item exposure control and content balancing have recently been extensively studied to constrain the item selection so that items are selected not only by their statistical appeal but also by content specifications and security concerns.
An optimal item pool should be determined by the other components of the CAT, namely the test length, the expected distribution of the examinee population, the ability estimation and item selection procedures, and the target item exposure and overlap rates (Bergstrom & Lunz, 1999).

2.3.1 Item Pool

The adaptive feature of CAT makes it unnecessary to use pre-designed test forms like those of a paper-and-pencil test. Instead, it requires an item pool from which all tests will be drawn. In practice, there are two kinds of item pools: one is called the master pool and the other is called the operational pool. The master pool is an inventory of test items maintained to supply the testing program. An operational pool is a pool of items from which individual adaptive tests are actually assembled. Typically, a master pool is much less structured than an operational pool. Its items may be in various stages of development, while the items in an operational item pool have passed all preparatory stages and the pool is ready for test assembly. The focus of this study is to investigate methods for designing an optimal operational item pool.

Ideally, the item pool would have a sufficient number of high-quality items to allow several thousand overlapping subtests to be drawn from its items. It would have a sufficient number of items in each desired content area to meet the test specifications. It would span a wide range of item difficulties relative to the population of interest to allow the CAT to estimate ability levels for a broad range of examinees (Urry, 1977). In addition, care must be taken to ensure that the item pool consists of appropriate items to reduce the over- and under-exposure rates while meeting the test precision requirement (Davis, 2002; Wainer, 1990).
Guidelines for the appropriate size of the item pool are 150-200 items, or from six to twelve times the test length of an operational form (Luecht, 1998; Patsula & Steffan, 1997; Stocking, 1994; Weiss, 1985). However, issues of item exposure, item retirement, and pool rotation may require this number to be much larger. Due to the continuous nature with which many CATs are administered, the useful life of an item or an item pool is limited. Luecht (1998) suggests that between 3,800 and 21,000 items may be needed to begin a CAT program when sufficient pool size, multiple pools, and item pretesting are taken into consideration. Strategies to extend the life of a pool, such as drawing multiple overlapping pools from an item vat (Patsula & Steffan, 1997), have been proposed. However, the cost and effort to create and maintain a CAT item pool remain formidable and far exceed those of paper-and-pencil testing, which makes the optimal design of the item pool more important.

An item pool is not only a reservoir of items, but also an organized list of items with clearly defined attributes attached to them. Van der Linden (2000) distinguished three types of item attributes: quantitative, categorical, and logical. Quantitative attributes are item attributes that take on numerical values. Examples of quantitative attributes are word counts, expected response times, statistics such as item p-values and IRT parameters, and the frequency of previous item or stimulus usage. Categorical attributes divide or partition the item pool into subsets of items with the same attribute. Examples of categorical attributes include content category, response format of items (e.g., constructed response or multiple-choice), and use of auxiliary material (e.g., a graph or table). Logical attributes differ from quantitative and categorical attributes in that they are not properties of single items or tests but of pairs,
triples, and so forth. The logical attributes involve relations of exclusion and inclusion between items or tests. For example, a relation of exclusion between items exists if they cannot be selected for the same test because one has a clue to the solution of the other (so-called "enemy items"). A relation of inclusion exists if items belong to a set with a common stimulus and the selection of any item implies the selection of more than one.

2.3.2 Scoring Procedure

One of the advantages of CAT is the ability to administer items suitable to an examinee's ability level. This is achieved by repeatedly estimating the ability level after each item is administered. At the beginning of the test administration, an initial value for the ability level is arbitrarily provided, since no information is known about an examinee and no item has been administered. This value is commonly the expected mean ability level of the testing population or a random number around the expected mean ability level. If prior information is available, it may help determine an initial value that is closer to the examinee's real ability level, thus facilitating the subsequent estimations. After each item is administered, an examinee's ability level is re-estimated based on his or her responses to all previously answered items.

Maximum likelihood estimation (MLE) and Bayesian estimation approaches are the two commonly used ability estimation methods. MLE determines the most likely ability level for an examinee, given the response string to items with specified parameters, by multiplying together the individual probabilities of a correct or incorrect response given theta to compute a joint probability with the function

L(\mathbf{u} \mid \theta) = \prod_{i=1}^{n} P_i(u_i \mid \theta, a_i, b_i, c_i),    (3)

where P_i(u_i \mid \theta, a_i, b_i, c_i) is the probability of getting response u_i on item i given an examinee's true ability \theta and item parameters a_i, b_i, and c_i, and n is the number of items.
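A minimal sketch of scoring with the likelihood of Equation (3), assuming made-up 3PL item parameters and a bounded grid search over theta in place of an iterative maximizer:

```python
import math

D = 1.7  # scaling constant of the 3PL model

def p3pl(theta, a, b, c):
    """3PL probability of a correct response."""
    return c + (1 - c) / (1 + math.exp(-D * a * (theta - b)))

def log_likelihood(theta, responses, items):
    """Log of the likelihood in Equation (3)."""
    total = 0.0
    for u, (a, b, c) in zip(responses, items):
        p = p3pl(theta, a, b, c)
        total += math.log(p) if u == 1 else math.log(1.0 - p)
    return total

def mle_theta(responses, items, lo=-4.0, hi=4.0, step=0.01):
    """Bounded grid-search MLE: all-correct or all-incorrect response
    strings return a boundary value instead of diverging."""
    n = int(round((hi - lo) / step))
    grid = [lo + step * k for k in range(n + 1)]
    return max(grid, key=lambda t: log_likelihood(t, responses, items))

# Hypothetical three-item pool: (a, b, c) triples.
items = [(1.0, -1.0, 0.2), (1.2, 0.0, 0.2), (0.8, 1.0, 0.2)]
theta_hat = mle_theta([1, 1, 0], items)
```

Because the grid is bounded at plus or minus 4, degenerate response strings simply return a boundary value, which is one of the practical workarounds for unbounded estimates.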
The maximum likelihood estimate of an examinee's true ability \theta is \hat{\theta}, the value that maximizes the likelihood function (or, equivalently, the log-likelihood function). Mathematically, this can be done by taking the derivative of the likelihood function, setting the result equal to zero, and solving for \theta. Iterative numerical methods such as the Newton-Raphson method (Wainer, 1990) are typically used to solve this equation. MLE ability estimates are popular in CAT contexts due in part to their desirable theoretical properties, such as asymptotic consistency and asymptotic normality. Problems, however, occur in solving the likelihood equation when examinees get all items correct or all items incorrect, because such response patterns yield unbounded ability estimates. These problems are often handled in CAT by setting arbitrary minimum and maximum ability estimates for such response patterns (e.g., -4 and +4) or by using a Bayesian-based ability estimate until the examinee answers at least one item correctly and one item incorrectly.

Owen's (1969) Bayesian sequential ability estimation technique was proposed as part of his adaptive testing strategy, which selects items that minimize the expected value of the Bayesian posterior variance. This ability estimation procedure, however, has proven useful in adaptive testing strategies using other item selection criteria as well. Owen's Bayesian method begins with a prior distribution of ability, in effect an assumption that the examinee is a member of a population with a normal distribution of ability with known mean and variance. After each test question, the mean and variance are updated using a statistical procedure that combines the information in the prior distribution with the observed score (right or wrong) on the most recent test question, and the parameters of that question's IRT model.
The updated values of the ability distribution parameters specify a normal "posterior" distribution, which is used as the prior distribution for the next question. This process continues until the end of the test. At that point, the posterior mean is used as the estimate of the examinee's ability scale location. Owen's formula for updating the prior mean is as follows:

\mu(\theta_j \mid u_i) = \frac{\int \theta \, P(u_i \mid \theta) \, h(\theta) \, d\theta}{\int P(u_i \mid \theta) \, h(\theta) \, d\theta}.    (4)

Owen (1975) showed that after each item is administered, closed-form approximations can be used to update the estimates of the posterior mean and variance.

2.3.3 Item Selection Procedure

Maximum information (MI) item selection chooses the item that provides the most Fisher information at the provisional ability estimate,

I_j(\theta) = \frac{[P_j'(\theta)]^2}{P_j(\theta) Q_j(\theta)},    (1)

where P_j(\theta) is the probability of a correct response, given \theta, and Q_j(\theta) is the probability of an incorrect response. Plugging the item parameters into Equation (1), it can be simplified for the dichotomous three-parameter logistic item response model (Hambleton, Swaminathan, & Rogers, 1991; Lord, 1980):

I_j(\theta) = \frac{D^2 a_j^2 (1 - c_j)}{(c_j + e^{D L_j})(1 + e^{-D L_j})^2},    (2)

where L_j = a_j(\theta - b_j), D = 1.7, a_j is the item discrimination parameter, b_j is the item difficulty parameter, and c_j is the pseudo-chance-level parameter (i.e., the probability of a very low-\theta examinee correctly answering the item). Equation (2) indicates that the item information increases as b_j approaches \theta, as a_j increases, and as c_j approaches 0 (Hambleton et al., 1991). Unconstrained maximum information selection chooses the item that maximizes the Fisher information evaluated at \hat{\theta}, the provisional proficiency estimate for the examinee after the preceding items. When the items that constitute a CAT are selected using MI, the precision of \hat{\theta} increases as each item is administered (Hambleton et al., 1991).

In practice, maximum information item selection is often based on a previously computed table in which items are sorted by the information they provide at each of a number of proficiency values (an "Info Table"). Item selection is equivalent for all \theta's in an interval around a tabulated value.
Rather than evaluating the Fisher information for each item in the pool at the current value of \theta each time the next item is to be selected, the information need only be evaluated once for each item at each tabulated point. Item selection based on an Info Table is slightly less efficient but less computationally burdensome than maximum information item selection.

The MPP (maximum posterior precision) strategy, also called Owen's Bayesian method, selects the item that maximizes the expected posterior precision of the ability estimate or, equivalently, minimizes the expected posterior variance of the ability estimate. At the initial stages of a test, MPP item selection might differ from MI item selection due to its use of the posterior distribution. However, at later stages, the posterior variance approaches the reciprocal of the test information (Chang & Stout, 1993). Therefore, the MPP strategy might provide results that are similar to those of the MI strategy. MPP is computationally easier than MI because of its quick approximation to the posterior ability distribution (Davey & Parshall, 1995). This advantage made MPP more popular when computing power was limited. However, increases in available computing power and the fact that with the MPP strategy the estimated ability level varies as a function of item order have made maximum information more widely used (Wainer, 1990). A hybrid strategy was also developed, using Owen's Bayesian sequential procedure to update ability after each item response and selecting items sequentially by referring to precomputed item information lookup tables. This strategy was evaluated by several researchers and seemed to retain the advantages of MI and MPP while avoiding their disadvantages (Wetzel & McBride, 1986). These statistically motivated item selection procedures can be tempered by practical considerations such as item exposure rates and content balancing, both of which will be described in detail later in this chapter.
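The 3PL information function of Equation (2) and the two selection strategies built on it, direct maximum information and the precomputed Info Table, can be sketched together. The three-item pool and the tabulated theta points below are hypothetical.

```python
import math

D = 1.7  # scaling constant of the 3PL model

def item_information(theta, a, b, c):
    """Item information function of Equation (2) for the 3PL model."""
    L = a * (theta - b)
    return (D ** 2 * a ** 2 * (1 - c)) / (
        (c + math.exp(D * L)) * (1 + math.exp(-D * L)) ** 2)

# Hypothetical three-item pool: (a, b, c) triples.
pool = [(0.8, -1.5, 0.2), (1.4, 0.0, 0.2), (1.0, 1.5, 0.2)]

def pick_max_info(theta_hat, administered):
    """Unconstrained MI selection: the unused item with the largest
    information at the provisional ability estimate."""
    candidates = [i for i in range(len(pool)) if i not in administered]
    return max(candidates, key=lambda i: item_information(theta_hat, *pool[i]))

# "Info Table" variant: tabulate information once for every item at a few
# theta points, then select from the column nearest the provisional estimate.
theta_points = [-2.0, -1.0, 0.0, 1.0, 2.0]
info_table = {t: [item_information(t, *it) for it in pool]
              for t in theta_points}

def pick_from_table(theta_hat, administered):
    t = min(theta_points, key=lambda p: abs(p - theta_hat))
    candidates = [i for i in range(len(pool)) if i not in administered]
    return max(candidates, key=lambda i: info_table[t][i])
```

The table version trades a little selection accuracy near the boundaries between tabulated points for a large reduction in per-item computation, which is exactly the trade-off described above.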
2.3.4 Stopping Rule

There are two methods to determine when to stop administering items in a CAT. One is “fixed length,” which requires all examinees to take the same number of items. With the same test length for all examinees, test-taking time is similar across examinees, making test administration more predictable. However, measurement precision will differ across ability levels, which causes difficulty in reporting test reliability. The other method, the “variable length” method, requires that examinees continue to take items until reaching a pre-specified level of precision. It is, however, hard to explain to non-experts why different numbers of items are administered. In terms of item pool use, variable-length tests tend to perform better than fixed-length tests as they minimize test length (Bergstrom & Lunz, 1999). However, simulation studies showed that fixed-length testing was more efficient because highly informative items were typically concentrated over a restricted range of examinee ability. In variable-length testing, examinees falling outside a certain range of ability tend to receive long tests, with each additional item providing very little information and fatigue influencing test precision (Segall, Moreno, & Hetter, 1997). In practice, some adjustments are made to compensate for the shortcomings of either method. For example, in some certification tests, every examinee has to take a minimum number of items to cover enough content. If the desired precision level cannot be reached for an examinee whose ability is close to the cut-score, more items are given until the precision level or a maximum test length is reached.

2.4 Practical Constraints in Item Selection

Without additional constraints, item selection algorithms select a single best item at each step of testing for each examinee.
In practice, some constraints that play no formal role in an IRT model are often imposed on item selection algorithms to achieve desirable patterns of item usage. For example, a given examinee should not receive a particular item twice; the number of items in each content area must not exceed the proportions specified by the test specification; and the same item should not be administered more than a certain number of times. Among these practical considerations, item exposure control and content balancing are two of the most frequently addressed in CAT operations. The exposure rate of an item is defined as the ratio between the number of times the item is administered and the total number of examinees. Because CATs are administered frequently to small groups of examinees, there is a risk that items with high exposure rates might become known to examinees (Mills & Stocking, 1996). Thus, high item exposure rates can decrease test security. Item selection based purely on a statistical model, such as maximum information or maximum posterior precision, is the primary cause of high item exposure rates. For example, if no auxiliary information is used to start the CAT, every examinee sees the same item first and one of a single pair of items second. Those three items would soon become public knowledge for any widely used CAT. Research has demonstrated that, typically, CAT item selection results not only in the pool’s most informative items being administered most often, but also in a very small percentage of the pool’s items accounting for a very large percentage of administered items (Wainer, 2000; Wainer & Eignor, 2000). In other words, the most popular items are popular by a large margin, resulting in an exponential decline in item usage as rank increases. Furthermore, with maximum information item selection, two examinees with the same ability estimate will likely see the same item (Hetter & Sympson, 1997).
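The exposure-rate definition above (administrations divided by total examinees) can be tallied directly from an administration log. A minimal sketch, with the log format assumed:

```python
from collections import Counter

def exposure_rates(administration_log, n_examinees):
    """administration_log: one entry per administration, each entry an
    item id. Returns each administered item's exposure rate, i.e. the
    number of administrations divided by the number of examinees."""
    counts = Counter(administration_log)
    return {item: n / n_examinees for item, n in counts.items()}
```

Running this over a simulated CAT's log is also how the skew described above (a few items carrying most administrations) is typically diagnosed.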
While rarely used items seem to be a waste of resources spent on item creation, frequently exposed items may cease to be a valid measure of ability because examinees may have prior knowledge of items, either from taking a pretest or from friends who took the test before (Parshall, Davey, & Nering, 1998). The probability that the same examinee would take items he or she saw in a pretest can be minimized if items that are pretested together are put into separate item pools (Stocking & Lewis, 2000). It is hard, however, to deal with the threat from what Luecht (1998) called “examinee collaboration networks (ECNs),” which are global groups of examinees who pool their resources and test experience to discover a sufficient number of test items from an item pool to artificially increase scores (Davis, 2002). This situation is exacerbated by the fact that CATs are often administered continuously (Stocking & Lewis, 2000). To lower the risk of overexposing test items, mechanisms are imposed on the item selection function to control the item exposure rate. Estimation efficiency is thus traded off against a more evenly distributed item exposure, a benefit from the point of view of test security. Another consideration in item selection is the balance of item content. Content areas in a single test are often associated with the notion of multidimensionality. There have been extensive debates on whether a test composed of different content areas should be treated as unidimensional or multidimensional. If unidimensional IRT is used in the item selection and scoring procedures of a CAT, one of the major assumptions is that performance on items within a given content area can be characterized by a unidimensional IRT ability. Violations of the unidimensional adaptive testing model may have serious implications for validity and testing fairness (Segall, Moreno, & Hetter, 1997).
When content areas are disparate or introduce additional dimensionality, a logical option is to use a multidimensional model and estimate separate ability levels within a single test (Parshall, Davey, & Nering, 1998). Another option is to split the item pool by administering separate tests, with separate ability estimation for each content area (Segall, Moreno, & Hetter, 1997). However, it is often not practical to divide a single test into several separate ones, given the goals of the testing program (Thissen & Mislevy, 2000). When content areas are shown to measure a single ability dimension, it is possible to design the item pool with item quantities proportional to the desired content coverage for each examinee’s test (Green, Bock, Humphreys, Linn, & Reckase, 1984; Segall, Moreno, & Hetter, 1997). For example, in a test of general science it might be desirable to constrain the 15-item adaptive test to include seven life science items, seven physical/earth science items, and a single chemistry item, all in prespecified ordinal positions in the test. As with item exposure control, programmatic restrictions on the item selection procedure are necessary to ensure the desired content coverage during the CAT. In a fixed-length CAT, the ordinal positions for each item type or content may be specified a priori, or a spiraling scheme rotating through the various kinds of items may be used. With a variable-length CAT, the algorithm rotates through the various types of items so that balance is approximately maintained at each possible stopping point. Like item exposure control, one possible drawback of any content balancing method is that the most informative item in the selected content area may not be the most informative item available in the item pool. This can threaten measurement precision (in a fixed-length test) and can result in longer tests (in a variable-length test) due to the administration of sub-optimal items.
Some alternatives have been proposed to balance the content areas while maintaining test efficiency. The first is to balance fixed proportions of information based on the same fixed target content percentages of items required by the test specifications. The second is to balance the proportions of information based on target content percentages of total test information that are conditional on estimated ability (Davey & Thomas, 1996; Thomasson, 1997).

2.5 Exposure Control Methods

Way (1998) classified exposure control procedures into two categories: randomization (Davis & Dodd, 2001; Kingsbury & Zara, 1989; Lunz & Stahl, 1998; McBride & Martin, 1983) and conditional selection procedures. Instead of always administering the most informative item, randomization procedures select several items near the maximum information level and then randomly administer one of the selected items. Although relatively easy to implement, randomization procedures make it hard to specify a maximum exposure rate for specific items. Conditional selection procedures address this problem by assigning exposure control parameters to each item and adjusting the administration rate accordingly. Conditional selection procedures, however, require a time-consuming iterative process to obtain the exposure control parameters. If the item pool or the ability distribution of the examinee population changes, the same process must be repeated to reset the exposure control parameters. In addition to the randomization and conditional selection procedures, Chang and Ying (1996) developed the a-stratified procedure, in which items with low discrimination are administered first, followed by items with high discrimination as more accurate estimates of the examinees’ ability levels are obtained.

2.5.1 Sympson-Hetter Exposure Control

The Sympson-Hetter exposure control procedure (Sympson & Hetter, 1985) is one of the most commonly used conditional selection procedures.
This procedure assigns to each item an exposure control parameter value that is based on the frequency of item selections during an iterative CAT simulation. Items with high administration frequencies are assigned smaller exposure control parameters, which range from zero to one. During test operations, the exposure control parameter of the selected item is compared to a random number, which also ranges from zero to one. If the exposure control parameter is larger than the random number, the item is administered. If it is smaller, the item is put back into the item pool and the same process is applied to the next best item. The item exposure control parameter acts as a threshold. By controlling the thresholds, the S-H method limits the administration of frequently used items in CAT and ensures a maximum item exposure rate for less often used items. The exposure control parameters in the S-H method are usually set by a series of iterative simulations of real CAT administrations. Simply put, each parameter is the ratio of the target exposure rate to the probability of the item being selected in testing. How it works can be shown as follows: Let S_j denote the selection of item j for a randomly sampled examinee, and let A_j denote the administration of that item. The exposure rate for item j can be interpreted as P(A_j), the probability that item j is administered to a randomly sampled examinee. The S-H method separates item administration from item selection via the probability relation P(A_j) = P(A_j | S_j) P(S_j) and controls P(A_j) by controlling P(A_j | S_j), the proportion of selections that lead to administration. For any given exposure rate r_j > 0, P(A_j) ≤ r_j can be achieved by setting P(A_j | S_j) ≤ r_j / P(S_j). If P(S_j) is known or can be approximated, this method can easily be implemented by generating a uniform (0,1) random variable. Hetter and Sympson (1997) described the steps for setting the exposure control parameter K_i for the items.
The probability of administering an item depends on the relationship between a random number k and the item’s K value. Given that an item is selected, a random number k is generated and compared to K_i: if k < K_i the item is administered; otherwise the item is retained in the pool, the next highest information item is selected, and the same procedure is applied. There are five steps to set the exposure control parameters:

Step 1. Generate the first set of K_i values, which are 1.0 for every item. This results in an n-by-one vector for n items. Denote its i-th element as K_i, associated with item i.

Step 2. Administer adaptive tests to a random sample of simulees. At each step of a test, identify the most informative item i available at the examinee’s current ability estimate, then generate a pseudo-random number x from the uniform distribution (0,1). Administer item i if x is less than or equal to the corresponding K_i. Whether or not item i is administered, it is excluded from further administration for the remainder of this examinee’s test. Note that for the first simulation, all the K_i’s are equal to 1.0 and every item is administered, if selected.

Step 3. Keep track of the number of times each item in the pool is selected (NS) and the number of times it is administered (NA) in the total simulee sample. When the complete sample has been tested, compute for each item P(S), the probability that the item is selected, and P(A), the probability that the item is administered:

P(S) = NS/NE, P(A) = NA/NE,

where NE = total number of examinees.

Step 4. Use the value of the expected exposure rate r and the P(S) values computed above to compute new K_i as follows: If P(S) > r, then new K_i = r/P(S). If P(S) ≤ r, then new K_i = 1.0. Make sure that there are at least n items in the item pool that have new K_i = 1.0.

Step 5. Given the new K_i, go back to Step 2.
Use the same examinees and repeat Steps 2, 3, and 4 until the maximum value of P(A) obtained in Step 3 approaches a limit slightly above r and then oscillates in successive simulations. The K_i obtained from the final round of computer simulations are the exposure-control parameters to be used in real testing. The S-H method effectively limits the exposure rates of all items. However, because items that are not selected cannot be administered, items with small probabilities of being selected will still have small exposure rates; thus, the S-H method does not increase exposure rates for underexposed items. In addition, while the exposure of an item across θ levels may be controlled, the same control may not hold for examinees at a particular level of ability. For instance, even though the exposure of an item may be controlled such that it is administered to no more than 30% of the examinees overall, it may be administered to examinees of high ability 100% of the time. Furthermore, implementation of this method requires knowledge about P(S_j), which is associated with the θ distribution of the examinee population. Hence, it is necessary to specify this distribution a priori and then approximate the value of P(S_j) using simulation. Many variations of the S-H technique were proposed afterwards. Parshall, Davey, and Nering (1998) developed the conditional Sympson-Hetter procedure, in which the exposure control parameters are determined based on ability level. Stocking and Lewis (1995) extended the technique to utilize a multinomial model and proposed another version of the technique (Stocking & Lewis, 1998) that conditions the exposure control parameter not only on the frequency with which the item is selected but also on θ level.
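The five-step calibration can be sketched as follows. This is a minimal illustration, not Hetter and Sympson's operational code: the fixed per-examinee item rankings stand in for a full CAT simulator, which would re-rank items as each provisional θ estimate evolves.

```python
import random

def calibrate_sh(rankings, target_r, n_items, n_rounds=10, seed=1):
    """Iteratively set Sympson-Hetter K parameters (Steps 1-5 above).
    `rankings`: one information-ranked item list per simulated examinee
    (a stand-in assumption for the CAT simulator)."""
    rng = random.Random(seed)
    K = [1.0] * n_items                           # Step 1: all K_i = 1.0
    for _ in range(n_rounds):                     # Step 5: iterate
        NS = [0] * n_items                        # selection tallies
        for ranked in rankings:                   # Step 2: one simulee each
            for j in ranked:
                NS[j] += 1                        # item j was selected
                if rng.random() < K[j]:           # S-H screen passed:
                    break                         # administer j, stop here
        P_S = [ns / len(rankings) for ns in NS]   # Step 3
        K = [target_r / p if p > target_r else 1.0
             for p in P_S]                        # Step 4
    return K
```

An item that tops every examinee's ranking has P(S) = 1, so its K settles at the target rate r, which is exactly the r/P(S) adjustment described in the steps.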
This addition to the S-H technique (often referred to as the conditional Sympson-Hetter technique, or CSH, when a multinomial model is not used) is desirable because it overcomes the major disadvantage of the S-H method by establishing an exposure control parameter for each item at a number of different θ levels.

2.5.2 a-Stratified Adaptive Testing

The use of stratification in testing based on item response theory is not new. Poststratification, in which stratification is applied according to an examinee’s test results, has been widely used in assessing differential item functioning (Dorans & Kulick, 1986; Holland & Thayer, 1988; Shealy & Stout, 1993). Weiss (1973) proposed a stratified CAT design in which stratification was performed according to item difficulty. Chang and Ying’s (1999) a-stratified adaptive testing (STR) design was proposed primarily to address the persistent concern of overusing items with high discrimination indices in item pools. A CAT item selection procedure based on maximum item information tends to select items with higher discrimination parameters more often in the beginning stage of the test than items with lower discrimination parameters. Because estimation of θ tends to be quite inaccurate early in the test, it seems wasteful to use highly discriminating items at this point (Chang et al., 2003). With the STR method, the item pool is divided into a number of strata based on the values of the discrimination parameters of the items. During the test, item selection is always constrained to one stratum, selecting items with maximum information (van der Linden & Pashley, 2000) or with the smallest distance between the value of their difficulty parameter b_j and the current estimate of θ (Chang et al., 1999). Early in the test, items are administered from the stratum with the lowest values of the discrimination parameter. As the test progresses, strata with higher values are used.
As a consequence, a-stratification forces a more balanced exposure for all items, particularly if the strata in the item pool are chosen to have equal size and n_r ≤ n/R (Chang et al., 2003). A simple a-stratified selection method can be described as follows:

1. Partition the item bank into K levels according to the item a values;
2. Partition the test into K stages;
3. In the k-th stage, select n_k items from the k-th level based on the similarity between b and θ̂, then administer the items (note that n_1 + ... + n_K equals the test length);
4. Repeat Step 3 for k = 1, 2, ..., K.

Note that item selection with the a-stratified design is based on matching the b-parameter with θ̂ rather than maximizing item information (Chang & Ying, 1999). This simpler criterion is used because the a values are similar within a level. Thus, for the 2PLM, maximizing item information is equivalent to matching b with θ. For the more general 3PLM, matching b with θ when item a values are the same very closely approximates maximizing item information (Chang & Ying, 1996). Thus, this simpler selection method should maintain high efficiency. It should also result in more evenly distributed item exposure rates. In practice, the strata consisting of items with high a values tend to have high b values. A shortage of low-b items in those strata could cause the available low-b items to be selected more frequently (Chang, Qian, & Ying, 2001; Parshall, Davey, & Nering, 1998). A refined STR selection method is a-stratified with b blocking (BSTR), which balances the distributions of b values among all strata. The basic idea of the BSTR method is to force each stratum to have a balanced distribution of b values to ensure a good match to θ for different examinees. In most of the stratified designs (e.g., Chang et al., 1999), four strata have been used.
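A minimal sketch of the stratify-then-match-b steps above; the (a, b) item representation and both function names are assumptions for illustration:

```python
def stratify_by_a(items, K):
    """Step 1: partition item indices into K strata of near-equal size
    by ascending discrimination. Each item is an (a, b) pair."""
    order = sorted(range(len(items)), key=lambda j: items[j][0])
    size = len(items) // K
    return [order[k * size:(k + 1) * size] for k in range(K)]

def pick_item(stratum, items, theta, used):
    """Step 3: within the current stratum, choose the unused item whose
    b-parameter is closest to the provisional theta (b-matching, not
    maximum information)."""
    available = [j for j in stratum if j not in used]
    return min(available, key=lambda j: abs(items[j][1] - theta))
```

Early stages draw from the low-a strata and later stages from the high-a strata, so the highly discriminating items are saved for when the θ estimate is more accurate.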
Hau, Wen, and Chang’s (2002) simulation study shows that the optimal number of strata for a specific application depends on the item pool structure, test length, and other testing conditions. There is a diminishing return: dividing the pool into too many strata can lead to small strata in which there are no items of close difficulty for a particular examinee. Their study also shows that when item difficulty is normally distributed in an item pool, the optimal strata are quite independent of the pool size and of the correlation between item discrimination and difficulty.

2.6 Item Pool Design and Its Relationship with Other Components of CAT

Parshall, Davey, and Nering (1998) discuss three often-conflicting goals of item selection in CAT. First, item selection must maximize measurement precision by selecting the item maximizing information or posterior precision at the examinee’s current ability level. Second, item selection must seek to protect the security of the item pool by limiting the degree to which items may be exposed. Third, item selection must ensure that examinees receive a content-balanced test. Stocking and Swanson (1998) add a fourth goal to this list, stating that item selection must also maximize item usage so that all items in a pool are used, thereby ensuring good economy of item development. Stocking and Lewis (2000) portray the item selection problem as a balloon: pushing in on one side will cause a bulge to appear on another. An optimally designed item pool seeks the best compromise among the conflicting goals. To allow several thousand overlapping subtests to be drawn from its items, the item pool must have a sufficient number of high-quality items. This is partly decided by the number of examinees the item pool serves and the distribution of those examinees. With item security considerations, the more examinees taking the test, the more items should be in the item pool.
The CAT item selection procedure picks items with a difficulty level approximately comparable to the ability estimates of the examinees; therefore, it is expected that items in the pool have a difficulty distribution similar to the examinee ability distribution. It is desirable to have items in the pool span a wide range of item difficulty relative to the population of interest so that the CAT can estimate ability levels for a broad range of examinees (Urry, 1977). Test length, which is closely tied to the stopping rules in CAT, also plays an important part in determining the number of items needed in an item pool. For a fixed-length test, if the tests for individuals have no overlapping items, the number of items in a bank must be exactly the number of items in each form multiplied by the number of test takers. In reality, items can be used repeatedly within certain security constraints. Even with item overlap, it is expected that the more items a test requires, the more items are needed in an item pool. Stocking (1994) recommended that the item pool have at least 12 times as many items as the length of the test. Variable-length CAT usually reduces the number of items needed for individual examinees. In this case, the number of items needed for an item pool is correlated with the distribution of the test takers, i.e., the number of examinees at each ability level. For the same item response pattern, different estimation methods may lead to slightly different ability estimates and, in turn, influence the choice of the most suitable item. Different item selection rules, such as picking the item that maximizes the information or minimizes the posterior variance at the current ability estimate, may choose different items as the most appropriate one for the examinee. Both situations cause different item usage and require different items in an optimal item pool.
Requirements on content balancing also call for different compositions of the items in an item pool. For example, if the test blueprint for a 40-item math test requires 20 arithmetic reasoning items and 20 problem-solving items, the optimal item pool would contain a similar number of items for both content areas. The goal is to have a sufficient number of items in each desired content area to assemble an individual test with the balanced content coverage required by the test design. In addition, care must be taken to ensure that the item pool consists of the appropriate items to reduce over- and under-exposure while meeting the test precision requirement. Item overuse causes security concerns because the more examinees take the same item, the more likely that item will be disclosed to the public. Item underuse potentially increases item development costs. It has been commonly recognized that a tradeoff exists between test efficiency and item exposure control. A choice needs to be made that maximizes efficiency within the limits of security constraints, and that is essentially a matter of optimization. An optimally designed item pool should be able to compensate for the exposure control and cause very little decrease in the efficiency of ability estimation.

Chapter III Reckase’s Simulation Method and Extensions to 3PL

This chapter first introduces the key concepts of Reckase’s simulation method for optimal item pool design. It then discusses the simulation procedure when the method is applied to items calibrated with the one-parameter logistic model (1PL) and the potential problems with the three-parameter logistic model (3PL). Finally, the extensions of the method and their applications in situations where exposure control methods are built into the item selection process are discussed in detail.

3.1 Basic Concepts of Reckase’s Simulation Method

An item pool can be described by a list of item parameters for the items in the pool.
The basic idea of Reckase’s method is to determine the item parameters using randomly sampled examinees from the expected examinee distribution. Simulated computerized adaptive tests are administered to the examinees, assuming that each item administered has the item parameters best suited to the provisional ability estimate. After a certain number of examinees have taken the test, the union of the “virtual” items is the optimal item pool for the CAT program. Theoretically, every θ estimate is unique and the items optimally suited to that estimate have unique item parameters. The simulation process described above would therefore lead to as many items in the item pool as the total number of items administered to examinees, equal to the test length multiplied by the number of examinees. In practice, however, items whose parameters differ by a small amount function very similarly. Such items are redundant in the item pool in that any one of them could be used to estimate the ability level of a person with very small loss in precision. The concept of a “bin” is introduced to account for the redundancy of items with similar parameters. A bin is an item reservoir whose boundary is defined by numerical attributes of the items, so that the items within a bin have similar attributes and are exchangeable in use. If items are calibrated with the 1PL, the item difficulty parameter (b-parameter) is the main feature that controls the selection of test items. The bins are, therefore, defined as ranges on the IRT θ-scale. For example, two consecutive bins with width 0.2 on the θ-scale are denoted as (0.0, 0.2) and (0.2, 0.4). Items with b-parameters 0.11 and 0.13 are considered exchangeable in CAT item selection because they both belong to the bin (0.0, 0.2). The item pool, therefore, can be considered as a list of “bins” containing items with similar properties.
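The binning and union-of-items ideas can be sketched as follows, assuming 1PL items represented only by their b-parameters. Bin width 0.2 follows the example above; because items are reusable across examinees, each bin of the pool needs only the largest count any single simulated examinee drew from it. The b-values below are illustrative:

```python
import math
from collections import Counter

def bin_of(b, width=0.2):
    """Bin index for a b-parameter: (0.0, 0.2) -> 0, (0.2, 0.4) -> 1,
    (-0.2, 0.0) -> -1, and so on."""
    return math.floor(b / width)

def pool_union(per_examinee_bins):
    """Merge per-examinee bin tallies: since an item can be reused for
    later examinees, each bin needs only the maximum count any one
    examinee required, not the sum."""
    pool = Counter()
    for tally in per_examinee_bins:
        for k, n in tally.items():
            pool[k] = max(pool[k], n)
    return pool

# Administered b-parameters for two simulated examinees (illustrative):
A = Counter(bin_of(b) for b in [0.11, -0.07, 0.31])
B = Counter(bin_of(b) for b in [0.13, 0.05, 0.31, 0.45])
```

Here b = 0.11 and b = 0.13 land in the same bin and are treated as exchangeable; the merged pool holds 5 items rather than the naive 7, which is the same reuse logic that shrinks 30 items to 23 in the two-examinee example later in this chapter.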
The bins that define an item pool should have a width that is sufficiently small that all items in a bin can be considered equally good for estimating the ability level of an examinee. If the bin width is too large, items in the same bin may vary in their usefulness for estimating the ability level. The approach taken here to determine bin width is to identify the range on the θ-scale that includes the maximum of the item information function and the region around the maximum where the information is not much lower. “Not much lower” is arbitrarily defined as 98% of the maximum. Certainly, an argument could be made for using 96% or 97% as well.

Figure 3.1 Demonstration of determining bin width

The end product of the optimal item pool design is an array of integers (x_1, x_2, ..., x_B), one for each of the B bins, which tells how many items are needed in each bin to assemble all tests in a program. If no exposure control is used, the integers are bounded between zero and the test length L, because items in each bin can be reused and no single test requires more than L items from any bin. When item exposure control is assumed, some bins may contain more items so that the shared exposure rates for items from the highly exposed bins are below the target exposure rate.

3.2 Reckase’s Method for Optimal Item Pools Calibrated with the 1PL

When items are calibrated with the 1PL, item difficulty is the only psychometric factor that decides whether an item provides the most information at the θ estimate. Therefore, when designing optimal item pools calibrated with the 1PL, Reckase’s (2003) method focuses on matching the item b-parameters to the provisional θ estimates. Reckase’s method consists of four steps. The first step is to understand clearly the characteristics of the CAT program, because the item pool design must model the test procedure as closely as possible.
It is important to identify the distribution of the expected examinee population, the test length, and the type of items the test uses. For the CAT process, item exposure control, the scoring algorithm, the ability estimation procedure, and the stopping rules should be clearly specified and strictly followed during the item pool simulations. The second step is to identify the categorical attributes required for the items, such as content area, and divide the item pool into smaller pools according to these attributes. If a test has more than one categorical attribute requirement, each separate attribute introduces a partition of the item pool. This step simplifies the simulation procedure by focusing on determining the optimal item through quantitative attributes such as its psychometric characteristics. The third step of the process for determining the optimal item pool is to administer a simulated CAT to examinees randomly sampled from the expected ability distribution. If ability follows a standard normal distribution, the initial ability level for the examinee is zero on the θ metric. The first item is the same for all examinees: an item with maximum information at an ability level of zero. The next optimal item is based on the examinee’s responses to previous items and the estimate of the examinee’s ability level. Subsequent items are selected to have maximum information at the most recent ability estimate. If items are calibrated with the 1PL, the optimal item is the one with a b-value equal to the current θ estimate. As the test items are selected and administered, they are tallied in bins based on their b-values. Assuming the bin width is 0.25, the histograms in Figure 3.2 show the distribution of items across bins for two individual examinees, A and B, with true ability levels of -0.095 and 0.032, taking a 15-item fixed-length test.
Figure 3.2 Items used for two individual examinees

As the number of test administrations increases, the number of items needed in the ideal pool increases as well, but 15 items are not added each time because many of the needed items are already in the pool. Figure 3.2 shows that examinees A and B took some items from different bins, but also took some items from the same bins. For example, one item from bin (-0.25, 0) is administered to examinee A while two items from the same bin are given to examinee B. Because the one item administered to examinee A can be reused for examinee B, only one additional item in that bin is added to the optimal item pool. Therefore, instead of 30 items being needed for an optimal item pool to support two examinees, only 23 items are needed if other constraints are not taken into account. Figure 3.3 displays the distribution of the items in the pool for two examinees.

Figure 3.3 Item pool for two examinees

To determine the required size of an ideal item pool, a large number of tests are administered and the required item pool is tallied. Additional items are added to the item pool as more examinees take the test. In the fourth step, when the expected number of examinees has been administered the test, the union of the items forms a distribution of items that represents an optimal item pool. The sum of the number of items in the bins is the total number of items needed. Figure 3.4 shows an example of an optimal item pool for 5000 examinees taking a 15-item fixed-length test.
Figure 3.4 Item pool for 5000 examinees

This strategy works well for item pools calibrated with 1PL, where item difficulty is the only factor determining the amount of information an item provides. In this case, items with b-parameters the same as an ability estimate will always provide maximum information at that ability estimate. Therefore, they are always the optimal items at the ability estimate compared to items with b-parameters different from the ability estimate. When items are calibrated with 2PL or 3PL, they may differ in the amount of information they provide even with the same b-parameters, simply because they have different a- or c-parameters. Extensions to Reckase's method, therefore, are needed to account for the differences in a- and c-parameters in designing item pools calibrated with 3PL.

3.3 Reckase's Method Applied to 3PL

As mentioned above, determining the optimal item pool calibrated with 3PL is more complicated than with 1PL, because the information an item can provide at an ability level is determined by the combination of three parameters: the discrimination parameter a, the difficulty parameter b, and the pseudo-guessing parameter c. An item could provide an arbitrarily large amount of information at any ability level, given that the b-parameter is close to the theta level and the a-parameter is extremely large. Although it is impossible to have items with infinitely large a-parameters, it is common to have items vary widely in their a-parameters. This implies that at a certain ability level, an item reaching the maximum information it could provide is not necessarily the item providing maximum information at that theta level.
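The dependence of information on all three parameters can be made concrete with the Fisher information function of a 3PL item. The sketch below assumes the standard form with scaling constant D = 1.7; with b and c held fixed, information at θ = b grows with the square of a:

```python
import math

# Standard 3PL response probability and Fisher item information (a sketch;
# the dissertation's exact computational details are not shown here).
D = 1.7

def p3pl(theta, a, b, c):
    """3PL probability of a correct response."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))

def info3pl(theta, a, b, c):
    """3PL item information at theta."""
    p = p3pl(theta, a, b, c)
    return (D * a) ** 2 * ((1.0 - p) / p) * ((p - c) / (1.0 - c)) ** 2

high_a = info3pl(0.0, 1.2, 0.0, 0.2)   # same b and c, larger a
low_a = info3pl(0.0, 0.8, 0.0, 0.2)
```

Evaluating both items at θ = 0 shows the higher-a item providing roughly twice the information of the lower-a item, even though their b- and c-parameters are identical.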
On the other hand, an item providing its highest information at one ability level may provide more information than any other item in the item pool over a range of ability levels. As demonstrated in Figure 3.5, an item with parameters a = 1.2, b = 0.0, and c = 0.2 provides more information at ability level -0.28 than an item with parameters a = 0.8, b = -0.5, c = 0.2, even though the latter reaches its peak in the amount of information it can provide at this ability level.

Figure 3.5 Item information provided by two different items (a = 1.2, b = 0.0, c = 0.2 versus a = 0.8, b = -0.5, c = 0.2)

Therefore, the optimal item for an ability level should not be defined as the item providing its most information at that ability level. In addition, it is unrealistic to define the optimal item pool as one that contains items with the highest possible a-parameters. Instead, the optimal item pool should contain items with a range of discrimination parameters so that tests assembled from it provide the precision the testing program requires. This study explores two strategies proposed to simulate a realistically optimal item pool. One focuses on simulating items that meet the minimum precision needed for an examinee taking the test. The other takes into consideration the relationship between the a-parameters and b-parameters in real operational items, so that the simulated item parameters stay within realistic boundaries.

Before introducing both strategies, it is important to extend the "bin" concept to fit the three-parameter IRT model.

3.2.1 Extending the "Bin" Concept

Under the framework of the three-parameter IRT model, the maximum amount of information an item can provide is determined by all three parameters.
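The Figure 3.5 comparison can be replicated numerically. The sketch below (assuming D = 1.7 and the standard 3PL information form) applies maximum-information selection to the two items from the figure and confirms that the a = 1.2 item wins at θ = -0.28 even though the a = 0.8 item peaks near that ability level:

```python
import math

# Maximum-information item selection over a small pool (a sketch).
D = 1.7

def info3pl(theta, a, b, c):
    p = c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))
    return (D * a) ** 2 * ((1.0 - p) / p) * ((p - c) / (1.0 - c)) ** 2

def pick_max_info(theta, items):
    """items is a list of (a, b, c) tuples; return the most informative one."""
    return max(items, key=lambda it: info3pl(theta, *it))

items = [(1.2, 0.0, 0.2), (0.8, -0.5, 0.2)]   # the two items of Figure 3.5
chosen = pick_max_info(-0.28, items)           # the a = 1.2 item is selected
```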
An item with high discrimination (i.e., a high a-value) generally provides more information than one with low discrimination. However, Chang and Ying (1999) demonstrated that it may provide less information at a theta estimate that is far from the examinee's true theta. An item with a smaller c-parameter provides more information at its maximum level, but c-parameters usually vary only slightly across items, so they have little influence on the amount of information items give. Therefore the a- and b-parameters are the two primary factors determining how much information an item is capable of providing at an ability level. Items that function similarly have similar a- and b-parameters. This leads to the extension of the "bin" concept introduced in item pool simulation with 1PL, where a bin is defined as the interval of b-parameter values within which items provide similar amounts of information over a range of ability levels. With 3PL, the boundary of a "bin" is defined by both the a- and b-parameters. This forms a grid partitioning the plane formed by values of a and b. As illustrated graphically in Figure 3.6, each cell defined by a range of a- and b-parameters is denoted as an ab-bin, whereas the marginal total across each row is denoted as an a-bin and the marginal total across each column is denoted as a b-bin. Items with parameters within the boundary of any cell defined by both a- and b-parameters provide similar information over the entire range of ability levels, and provide maximum information at ability levels around the boundary of the bin in which they are located.
Figure 3.6 Bins defined by both a- and b-parameters (a-boundaries at 0.89, 1.26, 1.55, 1.79, 2.00, and 2.19; b-boundaries from -1.40 to 1.68 in steps of 0.28)

While the boundaries of the b-bins are determined by dividing the θ-metric (or, equivalently, the metric of the b-parameters) into equal intervals, the widths of the a-bins are set to be different, because the maximum amount of information an item can provide is proportional to a quadratic function of the a-parameter, assuming the c-parameter is constant (Lord, 1980). Equation (7) shows the relationship between the a-parameter and the maximum information, M_i, an item provides:

M_i = [D² a_i² / (8(1 − c_i)²)] [1 − 20c_i − 8c_i² + (1 + 8c_i)^{3/2}]    (7)

It can further be shown that the difference between the maximum information values (ΔM) for items with different a-parameters is

ΔM = {D² [1 − 20c − 8c² + (1 + 8c)^{3/2}] / (8(1 − c)²)} Δa²    (8)

Plugging in the average c-parameter of the existing items, which is around 0.2, the resulting constant is 0.5; therefore

ΔM = 0.5 Δa²    (9)

Therefore, the boundaries of the a-bins within which changes in the a-parameters cause little information change can be calculated. The grid defined by a-parameter intervals and b-parameter intervals becomes the boundary of the ab-bins. If 0.4 is considered a small information change and 0.28 a small b-parameter change, the bins defined by both a- and b-parameters are shown in Figure 3.7. For simplicity, an ab-bin is denoted by its b-parameter boundaries and a-parameter boundaries: (b_lower bound : b_upper bound, a_lower bound : a_upper bound). For example,
items with a-parameters between 0.89 and 1.26 and b-parameters between 0.00 and 0.28 are in the ab-bin (0.00:0.28, 0.89:1.26). They are considered interchangeable in item selection.

Distinctions are made, however, between the functions of b-bins and a-bins. As mentioned above, the closeness of the b-parameter to the ability level determines where an item performs best and provides the most information. On the other hand, the value of the a-parameter determines how much information an item can provide around the ability level where it functions best. With the maximum-information item selection approach, if an item with high information at an ability level is available, it will be picked over the low-information items. An optimally designed item pool, thus, should provide sufficient items within each b-bin, and make sure that items with adequately high a-parameters are available when needed. In other words, b-bins tally the number of items needed that perform best over the ability levels around the b-bin. Within each b-bin, the a-bins record at most how many highly discriminating items are needed. The item pool simulation produces an array of integers x = (x_1, x_2, ..., x_B), which tells how many items are needed in each b-bin, and a matrix X = (X_1, X_2, ..., X_B), where each element X_B is an integer vector (y_B1, y_B2, ..., y_BA) indicating at most how many items are needed in each ab-bin within a b-bin. In both cases, B is the number of b-bins and A is the number of ab-bins within each b-bin. The reason they are recorded in two different structures is that x_B is usually not the same as the sum of X_B in the early stage of the item pool design. After the CAT simulation, the y_B's from ab-bins with the lowest item discrimination are set to zero so that Σ y_BA = x_B and only the highest-discriminating items required by the simulation are in the optimal item pool blueprint.
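Equations (7) through (9) and the post-simulation zeroing step can be sketched together in code. D = 1.7 and the pool-average c = 0.2 are assumed here, and the data layout for the ab-bin counts is ours:

```python
import math

D = 1.7

def max_info(a, c):
    """Equation (7): the maximum information a 3PL item can provide."""
    return (D ** 2 * a ** 2) / (8.0 * (1.0 - c) ** 2) * (
        1.0 - 20.0 * c - 8.0 * c ** 2 + (1.0 + 8.0 * c) ** 1.5)

K = max_info(1.0, 0.2)          # the constant multiplying a^2; roughly 0.5

def a_bin_boundaries(delta_m, n_bounds):
    """Boundaries whose successive maximum-information values differ by delta_m.

    Inverting equation (9): equal information steps are equal steps in a^2.
    """
    step = delta_m / K
    return [math.sqrt(i * step) for i in range(1, n_bounds + 1)]

def trim_to_total(y, x_b):
    """Zero counts from the lowest-a ab-bins of one b-bin until sum(y) == x_b."""
    y = list(y)                 # ordered from lowest to highest discrimination
    excess = sum(y) - x_b
    i = 0
    while excess > 0 and i < len(y):
        drop = min(y[i], excess)
        y[i] -= drop
        excess -= drop
        i += 1
    return y

bounds = a_bin_boundaries(0.4, 6)         # close to Figure 3.6's 0.89, 1.26, ...
trimmed = trim_to_total([4, 3, 5, 2], 8)  # keep only the 8 highest-a items
```

With ΔM = 0.4, the computed boundaries land close to the 0.89, 1.26, 1.55, 1.79, 2.00, 2.19 sequence shown in Figure 3.6, which is consistent with those boundaries having been generated from equation (9).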
Visual displays of the two matrices are shown in Figure 3.7, where the plot on top shows how many items in each b-bin are needed for the optimal pool and the plot on the bottom distinguishes the ab-bins with gray scales and shows the number of items needed for each ab-bin within a b-bin.

Figure 3.7 Numbers of items needed per b-bin (top) and per ab-bin within each b-bin (bottom)

P_ij(θ_j) = c_i + (1 − c_i) / (1 + exp(−D a_i (θ_j − b_i)))

where P_ij(θ_j) is the probability that a person j = 1, ..., J with an ability parameter θ_j gives a correct response to an item i = 1, ..., I; a_i is the value of the discrimination parameter, b_i of the difficulty parameter, and c_i of the guessing parameter of item i. Because the examinee's true θ was known in the simulation, P_ij was computed after each item administered to the examinee was simulated. Then a random number m_ij was drawn from a uniform distribution U(0,1) and compared to P_ij. If m_ij was equal to or less than P_ij, then a 1 was assigned as the response; otherwise a 0 was assigned.

Step 4: Post-Simulation Adjustment

Five replications were conducted for each combination of methods and control variables so that a relatively stable approximation of the optimal item pool could be obtained. The blueprints and the item exposure counts from the five replications were averaged before a post-simulation adjustment was done.

4.3 Control Variables

Two independent variables were controlled for in all item pool designs: design method and exposure control method. Both AR and GS have the same target exposure rate (1/3 for the Sympson-Hetter method), which is the same target the operational procedure uses. For simplicity, four strata were assumed for the a-stratified method. Each of the two item pools, AR and GS, had its unique control variables. Specifically,
(1) there was no content balancing for AR, while three contents were administered in a fixed order for GS; (2) each item pool defined the bin width differently on the θ-metric, depending on the characteristics of the operational item pools. The simulation design is illustrated in Table 4.1.

Table 4.1 Simulation Design

  Item pools:            Arithmetic Reasoning (AR); General Science (GS)
  Test length:           15
  Examinee distribution: N(0,1)
  Exposure control:      no exposure control; Sympson-Hetter (target exposure rate 1/3); a-stratified (4 strata)
  Design method:         Prediction Model; Minimum Test Information
  Bin width:             b-bin: 0.20 for AR and 0.26 for GS; a-bin: Δa² = 0.8
  Content balancing:     single content for AR; three contents for GS

4.4 Evaluating Simulated and Operational Item Pools

Two types of distribution were considered in the item pool evaluation: (a) 6,000 θ's were simulated from N(0,1), and these values were treated as the true abilities for the examinees, and (b) 65 fixed values ranging from -4 to 4 with an interval of 0.125 were selected (i.e., θ = -4.0, -3.875, ..., 3.875, 4.0). Five hundred examinees were set to have an identical latent ability at each θ level. The former is to evaluate general performance, and the latter is to compute statistics conditional on θ.

The item pool evaluation criteria used by Chang and Ying (1999) and Reckase (2005) were adopted for this study. Precision-of-proficiency-estimation criteria include the average test information at each theta level, bias, mean square error (MSE), and the correlation coefficient between estimated and true person parameters. Test security indicators include the skewness of the item exposure rate distribution, the percentage of overexposed items, the item overlap rate, and the percentage of underexposed items.

Conditional Test Information

Test information is the sum of all the Fisher item information in the test. In a fixed-length CAT, it can be taken as an index of test efficiency.
The larger the amount of information a test provides, the more efficient the test is.

Conditional Standard Error of Measurement (CSEM)

At each fixed θ point, the standard error of measurement (SEM) was calculated by the formula:

SEM(θ_i) = sqrt[ Σ_{j=1}^{N_i} (θ̂_ij − θ̄_i)² / (N_i − 1) ]

where N_i = 500 is the number of replications (i.e., the number of adaptive tests administered) at each fixed θ point, and θ̄_i = (1/N_i) Σ_{j=1}^{N_i} θ̂_ij is the mean of the ability estimates over the N_i replications at θ_i.

Bias and Mean Square Error (MSE)

These quantities are defined as follows:

Bias = (1/N) Σ_{j=1}^{N} (θ̂_j − θ_j)    (19)

and

MSE = (1/N) Σ_{j=1}^{N} (θ̂_j − θ_j)²    (20)

where N is the number of simulees, and θ̂_j is the estimator for the jth simulee with ability level θ_j.

Conditional Bias and Conditional Mean Square Error (CMSE)

These quantities are defined as

Conditional Bias = (1/N_i) Σ_{j=1}^{N_i} (θ̂_ij − θ_i)    (21)

and

CMSE = (1/N_i) Σ_{j=1}^{N_i} (θ̂_ij − θ_i)²    (22)

where θ_i = -4.0, -3.875, ..., 3.875, 4.0 for i = 1, 2, ..., 65, respectively, and θ̂_ij (j = 1, 2, ..., 500) is the corresponding estimator of θ_i. These values are estimated as the conditional averages of the errors and squared errors in the final estimates of θ_i in the simulations. As additional overall measures of the quality of the final estimates of θ, the estimates of the bias and MSE functions in (19) and (20) were averaged over all simulated values of θ in the study. They give a picture of item pool performance at individual ability levels.

Skewness of the Item Exposure Rate Distribution

A χ² statistic proposed by Chang and Ying (1999) is used to measure the skewness of the item exposure rate distribution. It is defined as follows:

χ² = Σ_{i=1}^{n} (r_i − L/n)² / (L/n)    (23)

where r_i is the observed exposure rate for the ith item, L is the test length, and n is the number of items in the item pool. Equation (23) captures the discrepancy between the observed and the ideal item exposure rates, and it quantifies the efficiency of item pool usage.
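Equations (19) through (23) translate directly into code. This is a sketch with our own variable names: theta_hat and theta_true are parallel lists of estimates and true abilities, and rates holds the observed exposure rates for an n-item pool used with L-item tests:

```python
import math

def bias(theta_hat, theta_true):
    """Equation (19): mean estimation error."""
    return sum(h - t for h, t in zip(theta_hat, theta_true)) / len(theta_true)

def mse(theta_hat, theta_true):
    """Equation (20): mean squared estimation error."""
    return sum((h - t) ** 2 for h, t in zip(theta_hat, theta_true)) / len(theta_true)

def csem(theta_hats):
    """Conditional SEM at one fixed theta: SD of the estimates over replications."""
    n = len(theta_hats)
    mean = sum(theta_hats) / n
    return math.sqrt(sum((h - mean) ** 2 for h in theta_hats) / (n - 1))

def chi_sq_skewness(rates, L):
    """Equation (23): discrepancy of observed exposure rates from the ideal L/n."""
    ideal = L / len(rates)
    return sum((r - ideal) ** 2 / ideal for r in rates)

b = bias([0.1, -0.1, 0.2], [0.0, 0.0, 0.0])
balanced = chi_sq_skewness([0.25, 0.25, 0.25, 0.25], L=1)   # ideal usage gives 0
```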
A low χ² value implies that most of the items are fully used. The ratio of two χ² measures follows an F distribution and can be used to compare the exposure rates of two methods:

F_{method1, method2} = χ²_{method1} / χ²_{method2}    (24)

If F < 1, then method 1 is regarded as superior to method 2 in terms of the overall balance of item exposure rates.

Percentage of Overexposed Items

The exposure rate of an item can be defined as the ratio of the observed number of item administrations to the total number of examinees. A moderate level of item exposure is generally desired. A high exposure rate for an item means an increased risk of the item being known by prospective examinees. If so, both test security and validity are threatened by the high item exposure rate. Therefore, the percentage of overexposed items is taken as an important criterion for evaluating the success of a CAT program. The expected exposure rate was set equal to 1/3 for both tests in the ASVAB (Segall, Moreno, & Hetter, 1997).

Percentage of Underexposed Items

A low item exposure rate means that the item is rarely used. An item pool with too many low-exposure items is a sign of underutilization of the pool. Both the cost-effectiveness of developing the items and the appropriateness of the item selection method are challenged by low item exposure rates. In this study, an item with an exposure rate lower than .02 is considered underexposed.

Test Overlap Rate and Conditional Test Overlap Rate

The test overlap rate is the expected number of common items encountered by two randomly selected examinees divided by the expected test length. Ideally, the number of common items between any two randomly sampled examinees should be minimized. The test overlap rate can be calculated by (1) counting the number of common items for each of the N(N − 1)/2 pairs of examinees, (2) summing all the N(N − 1)/2 counts, and (3) dividing the total count by LN(N − 1)/2 (Chang & Ying, 1999).
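The test-security indicators just described can be sketched together. The function names are ours; the thresholds (the 1/3 target for overexposure, the .02 floor for underexposure) come from the text, and the overlap computation uses per-item administration counts rather than enumerating examinee pairs, which gives the same result:

```python
def f_ratio(chi_sq_method1, chi_sq_method2):
    """Equation (24): F < 1 favors method 1's overall exposure balance."""
    return chi_sq_method1 / chi_sq_method2

def exposure_summary(rates, target=1.0 / 3.0, floor=0.02):
    """Proportions of over- and under-exposed items in a pool."""
    n = len(rates)
    over = sum(1 for r in rates if r > target) / n
    under = sum(1 for r in rates if r < floor) / n
    return over, under

def overlap_rate(m_counts, N, L):
    """Expected proportion of shared items between two random examinees.

    m_counts[i] is the number of times item i was administered across all
    N fixed-length CATs of length L.
    """
    return sum(m * (m - 1) for m in m_counts) / (L * N * (N - 1))

over, under = exposure_summary([0.40, 0.30, 0.01, 0.20])
# Toy check: two 2-item tests {1, 2} and {1, 3} drawn from a 3-item pool
# share exactly one item, so the overlap rate is 1/2.
rate = overlap_rate([2, 1, 1], N=2, L=2)
```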
The following equation summarizes the calculation (Chen, Ankenmann, & Spray, 1999):

T = [ Σ_{i=1}^{n} m_i(m_i − 1) ] / [ LN(N − 1) ]    (25)

where N denotes the number of fixed-length CATs administered, L is the number of items in each of the CATs, n is the number of items in the pool, and m_i is the number of times item i was administered across all N CATs.

The conditional test overlap rate computes the test overlap rates for the 500 tests administered at each of the sixty-five fixed θ's. The same procedure Chang and Ying (1999) described in Equation (25) can be used to compute the test overlap rate for the tests administered at each fixed θ. In this case, N is 500 and m_i is the number of times item i was administered across all 500 CATs. The conditional test overlap rate gives a more accurate picture of test overlap at a particular ability level, instead of the average across all ability levels.

Chapter V
The Performance of the Item Pools without Exposure Control

5.1 Item Pools for Tests without Content Balance

Figure 5.1 compares the distributions of the operational item pool and the two optimal item pools designed by MTI and PM, assuming no exposure control. Table 5.1 presents the sizes and the summary statistics of the item parameters for the three pools. The optimal item pools consist of the fewest items. This is not surprising, partly because both assume no exposure control, while the operational pool is designed for tests with Sympson-Hetter exposure control. Table 5.1 indicates that all item pools have items that span a wide range of difficulty levels, roughly from -2.5 to 2.5. However, the items in the optimal item pools have slightly smaller ranges. The operational pool has a large number of items with b-parameters between 0.0 and 1.5, while the optimal pools display a more even distribution across b-bins. The MTI pool consists of the fewest items, and their a-parameters are more concentrated, ranging from 1.275 to 1.781.
The PM pool shows characteristics of the item parameters similar to those of the operational pool, in which difficult items tend to have high a-parameters and easy items tend to have moderate to low a-parameters.

Figure 5.1 Item distribution for item pools without exposure control (a. Operational item pool, Content Area 1: 137 items; b. Item pool designed by MTI, Content Area 1: 82 items; c. Item pool designed by PM, Content Area 1: 101 items)

The overview of the evaluation results for these item pools is presented in Table 5.2. The ability estimates from all pools exhibit a certain level of positive bias; however, the magnitudes of the bias are negligible. The MSEs from the optimal item pools are smaller than that from the operational pool. The MTI pool and the PM pool resulted in higher correlation coefficients than the operational pool.

Table 5.2 Summary Statistics of the Performance of the Item Pools
Table 5.1 Item Pool Size and Item Parameter Statistics for Arithmetic Reasoning

Table 5.2 also shows that the optimal item pools have a smaller test-retest overlap rate despite having fewer items. This indicates that the magnitude of the item overlap rate may not be related to the pool size when the combination of items in the pool is optimal. The plots of the conditional test-retest overlap rate in Figure 5.2 reveal that the optimal item pools have higher overlap rates for θ levels below approximately -2.00 and above 2.00. However, in practice, there are very few examinees at these ability levels.

Figure 5.2 Test-retest overlap rate conditional on θ

Both optimal item pools have significantly smaller percentages of underexposed items. Although the MTI pool has a higher percentage of overexposed items, this is reasonable given that it is the smallest pool and no exposure control was imposed. Increasing the pool size reduced the item overlap rate.

Figure 5.3 Item exposure rate by difficulty level (a. Operational item pool; b. Item pool designed by MTI; c. Item pool designed by PM)

Figure 5.3 plots the item exposure rate for individual items in the order of their difficulty levels. Extremely easy and extremely difficult items tend to have smaller exposure rates, but underexposed items occur across all difficulty levels, especially in the operational item pool. Table 5.2 indicates that the MTI pool has the fewest underexposed items, and Figure 5.3 shows that items with extreme difficulty levels are utilized more often in the MTI pool. As shown in Figure 5.4, the three item pools resulted in quite different average test information plots at the various fixed θ levels.
The plot for the PM item pool looks similar to the one for the operational pool, but it provides more information over most ability levels. The MTI item pool provides significantly less information over ability levels between approximately -1.5 and 2.0, but the amount of information it provides over a long range of ability levels exceeds the target information, which is 10.0 between ability levels ±2.0 and 8.0 beyond ability levels ±2.0.

Figure 5.4 Average test information conditional on true θ

Figures 5.5 to 5.7 present the CSEM, conditional bias, and CMSE for the three item pools. Figure 5.6 shows a significant increase in the bias of ability estimation, which is positive for ability levels below around -2.0 and negative for ability levels above around 2.0. This is not surprising because of the short test length and the Bayesian estimation method. The charts show that MTI performs better for ability levels below -2.0 and PM performs better for ability levels over 2.5.

Figure 5.5 Conditional standard error of measurement (CSEM)

Figure 5.6 Conditional bias

Figure 5.7 Conditional mean square error (CMSE)

5.2 Item Pools for Tests with Content Balance

As can be seen in Figure 5.8, the distribution of the item pools with content balancing shares a very similar pattern with the distribution of the item pools without content balancing.
Both optimal pools show a more even distribution of difficulty levels, but the PM pool has more highly discriminating items among the difficult items and very few among the easy items. The MTI pool has mostly moderately large a-parameters, regardless of the difficulty level. Table 5.3 lists the item pool sizes and the summary statistics of the items in the pools. Compared to the operational pool, on average the MTI pool and the PM pool have slightly higher a-parameters and lower b-parameters, but the ranges of the item parameters are smaller.

Figure 5.8 Item distribution for item pools with content balancing and without exposure control (a. Operational item pool: 59, 58, and 13 items in Content Areas 1-3; b. Item pool designed by MTI: 46, 48, and 13 items; c. Item pool designed by PM: 41, 46, and 12 items)

Table 5.3 Item Pool Size and Item Parameter Statistics for Tests with Content Balance

Both optimal pools consist of fewer items in the two larger contents
but have similar numbers of items in the content with only one item in the test. The MTI pool has the fewest items.

Table 5.4 Summary Statistics of the Performance of the Item Pools

Table 5.5 Number of Over- and Under-Exposed Items

The overviews of the evaluation results for these item pools are presented in Tables 5.4 and 5.5. The ability estimates from all pools exhibit slightly positive bias. The MSEs from the optimal item pools are smaller than that from the operational pool. In addition, both optimal item pools result in higher correlation coefficients. The results show that the optimal item pools have smaller test-retest overlap rates despite having fewer items. Figure 5.9 also shows that the MTI item pool has the smallest test-retest overlap rates at most ability levels.

Figure 5.9 Test-retest overlap rate conditional on θ

Figure 5.10 plots the item exposure rate for individual items in each item pool by content level. Regardless of the content, both optimal item pools have significantly smaller numbers of underexposed items. On the other hand,
Table 5.4 shows that the percentages of overexposed items are similar for all item pools, although the optimal item pools have fewer items.

Figure 5.10 Item exposure rate by difficulty level (a. Operational item pool; b. Item pool designed by MTI; c. Item pool designed by PM)

Figure 5.11 displays the plot of conditional test information for the item pools. It can be seen that the PM pool results in a plot similar to the operational pool but provides more information at most ability levels. The MTI item pool provides a similar amount of information, which exceeds the target information, over ability levels between approximately -2.0 and 2.0. The CSEM, conditional bias, and CMSE for the three item pools are presented in Figures 5.12 to 5.14. The charts show that at most ability levels, the three item pools perform very similarly. The MTI item pool results in the smallest bias and CMSE for ability levels below approximately -1.5.

Figure 5.11 Average test information conditional on true θ

Figure 5.12 Conditional standard error of measurement (CSEM)

Figure 5.13 Conditional bias
Figure 5.14 Conditional mean square error (CMSE)

5.3 Summary

The results suggest that regardless of the constraints of content balancing, the optimal item pools perform better than the operational item pool in terms of pool size, test security, and measurement accuracy, although each design method has its preferable features. The operational item pool performs better over a certain range of ability levels because a large number of items, including very discriminating items, are clustered around these levels. Optimal item pools, especially ones designed with MTI, provide information more evenly over most ability levels and provide sufficient measurement precision with a minimum number of items. All optimal pools, compared to the operational pools, save about 20 or more items and yield better correlations. In addition, the optimal pools have a significantly lower percentage of items with exposure rates below 0.02. With or without content balancing, the PM item pools resulted in the highest correlation and the lowest item overlap rate. Overall, it seems that an item pool designed with the MTI method performs the best, which indicates that the optimal item pool needs the fewest items to achieve desirable precision if all the items have moderate item discrimination and are distributed roughly uniformly over a wide range of difficulty levels.

Chapter VI
The Performance of the Item Pool with Sympson-Hetter Exposure Control

6.1 Item Pools for Tests without Content Balance

The blueprints of the optimal item pools with SH exposure control are based on the blueprints of those without exposure control. Specifically, more items are added in the bins where items tend to be selected more often than the desired exposure rates.
This relationship is reflected in the item distributions illustrated in Figure 6.1: compared to the optimal pools without exposure control, there are noticeably more items with b-parameters between -1.0 and 1.0 in the optimal pools designed by both MTI and PM.

Figure 6.1 Item distributions for item pools with Sympson-Hetter exposure control (panels: a. Operational Item Pool, Content Area 1: 137 items; b. Item Pool Designed by MTI, Content Area 1: 95 items; c. Item Pool Designed by PM, Content Area 1: 120 items)

Table 6.1 shows the item pool size and the summary statistics of the item parameters within each pool. The MTI pool consists of the fewest items, and their a-parameters vary between 1.307 and 1.777, a smaller range than in the other two pools. The optimal item pool designed by PM has more items with high a-parameters. However, the maximum a-parameter value in the operational pool is 3.141, compared to 1.777 in the MTI pool and 2.633 in the PM pool. The MTI pool has 13 more items and the PM pool 19 more items than the corresponding pools without exposure control, but either pool is still smaller than the operational pool. The added items are mostly highly discriminating items, because such items tend to have higher exposure rates; this leads to a slightly higher average a-parameter for the optimal item pools.

Table 6.2 gives a performance overview of the item pools. On average, all three pools yielded slightly positive bias in the ability estimates. The operational pool displayed the smallest bias, but the difference from the optimal pools is negligible. Both optimal pools exhibited better performance on all other criteria.
The PM pool resulted in the highest correlation coefficient and the lowest mean square error. The MTI pool, however, consists of the fewest items: 42 fewer than the operational pool and 25 fewer than the PM pool.

Table 6.1 Item pool size and item parameter summary statistics

Table 6.2 Summary statistics of the performance of the item pools

As shown in Figure 6.2, all three pools exhibit smaller test-retest overlap rates compared to the item pools without exposure control. This is expected because item selection with the exposure control method tends to utilize more items. Of the item pools with exposure control, the operational pool clearly has the smallest test-retest overlap rate at ability levels above 2.0. However, Table 6.2 shows that, on average, the optimal item pools have slightly smaller test-retest overlap rates despite having fewer items.

Figure 6.2 Test-retest overlap rate conditional on θ

Figure 6.3 shows the item exposure rate for individual items in each pool, ordered by item difficulty. It can be seen that the exposure control mechanism works very well, with the exposure rates for all individual items at or below the target exposure rate. The MTI pool seems to utilize items more evenly and has the fewest under-exposed items.
The operational item pool seems to have large numbers of difficult items under-exposed.

Figure 6.3 Item exposure rate by difficulty level (panels: a. Operational Item Pool; b. Item Pool Designed by MTI; c. Item Pool Designed by PM)

A closer look at measurement precision at individual ability levels is provided by the conditional test information plots in Figure 6.4. The plots for the item pools with exposure control look very similar to those without exposure control. Because of the added items, the optimal pools with SH exposure control yield more information at some ability levels and closely match the information provided at the other levels by the optimal pools without exposure control. The operational pool, on the other hand, produces less information at ability levels between -0.5 and 0.75 when SH exposure control is used.

Figure 6.4 Average test information conditional on true θ

The plots for conditional SEM, conditional bias, and conditional MSE are presented in Figures 6.5 to 6.7. Smaller values indicate better accuracy in the ability estimates. The plots indicate that all item pools yield similar performance at ability levels between -2.0 and 2.5. The MTI pool performs better at ability levels below -2.0, and the PM pool performs better at ability levels above 2.5.
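The Sympson-Hetter procedure evaluated in this chapter attaches a control parameter k_i to each item: after an item is selected by maximum information, it is actually administered only with probability k_i, and the k_i are recalibrated over repeated simulations until no item's administration rate exceeds the target. A simplified single-cycle sketch (the function names and the adjustment rule shown are a common textbook form, not necessarily the exact variant used in this study):

```python
import random

def sh_administer(ranked_items, k, rng):
    """Walk the candidate list (ranked by information) and administer the
    first item that passes its exposure-control probability experiment."""
    for item in ranked_items:
        if rng.random() <= k[item]:
            return item
    return ranked_items[-1]  # fall back if every experiment fails

def sh_adjust(k, selection_rate, target):
    """One Sympson-Hetter adjustment cycle: throttle items selected more
    often than the target rate; reset all others to 1."""
    for item, p_sel in selection_rate.items():
        k[item] = min(1.0, target / p_sel) if p_sel > target else 1.0
    return k

# Toy demonstration with three items and a 1/3 target exposure rate.
k = {0: 1.0, 1: 1.0, 2: 1.0}
k = sh_adjust(k, {0: 0.9, 1: 0.2, 2: 0.1}, target=1.0 / 3.0)
print(k)  # item 0 is throttled to about 0.37; items 1 and 2 stay at 1.0
```

This also explains why the SH blueprints above need extra items in the popular bins: throttling a high-information item forces the selection routine to fall through to the next-best item, which must exist in the pool.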
Figure 6.5 Conditional standard error of measurement (CSEM)

Figure 6.6 Conditional bias

Figure 6.7 Conditional mean square error (CMSE)

6.2 Item Pools for Tests with Content Balance

As shown in Figure 6.8, the optimal item pools for tests with content balancing display the same pattern as those without content balancing: items are added to the bins where items tend to exceed the desired exposure rates. Because Content 3 appears only once in a test, no additional items are added in comparison to the pools without exposure control. An interesting fact is that, in the optimal item pools, Content 2 has slightly more items than Content 1, although both contents appear in a test with an equal number of items.

Figure 6.8 Item distribution for item pools with Sympson-Hetter exposure control (panels: a. Operational Item Pool, Content Areas 1/2/3: 59/58/13 items; b. Item Pool Designed by MTI, Content Areas 1/2/3: 52/54/13 items; c. Item Pool Designed by PM, Content Areas 1/2/3: 51/54/12 items)
Figure 6.8 (Cont'd) Item distribution for item pools with Sympson-Hetter exposure control

Table 6.3 shows the item pool size and the summary statistics of the item parameters within each pool. The total numbers of items in the two optimal item pools are similar, and both are smaller than the operational pool. Compared to the item pools without exposure control, the MTI pool has 12 more items and the PM pool has 18 more items. On average, items in the optimal pools have higher a-parameters and relatively lower b-parameters than items in the operational pool.

Table 6.3 Item pool size and item parameter statistics by content area

Table 6.4 Summary statistics of the performance of the item pools

Table 6.5 Numbers of over- and under-exposed items by content area
Tables 6.4 and 6.5 display the overview of the evaluation results for these item pools. The performance shows the same pattern as for the item pools without content balancing. The MSE from the optimal item pools is smaller than that from the operational pool, and the PM item pool results in the highest correlation coefficient.

Table 6.4 also shows that the optimal item pools have smaller test-retest overlap rates despite having fewer items. The plots of conditional test-retest overlap rates in Figure 6.9 reveal that the operational pool has the lowest overlap rates at ability levels below -1.5 and the MTI pool has the lowest overlap rates at levels beyond approximately 1.0. This is consistent with the findings from the item pools without exposure control, but different from those for the item pools without content balancing, where the operational pool has the lowest overlap rates at the two extremes of the ability range.

Figure 6.9 Test-retest overlap rate conditional on θ

Figure 6.10 Item exposure rate by difficulty level (panels: a. Operational Item Pool; b. Item Pool Designed by MTI; c. Item Pool Designed by PM)

Tables 6.4 and 6.5 also show that, compared to the operational pool, both optimal item pools have significantly smaller numbers of under-exposed as well as over-exposed items. The number of under-exposed items is similar to the pools without exposure control, but the number of over-exposed items decreases substantially.
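The test-retest overlap rate plotted here and in Figure 6.2 can be read as the proportion of items that reappear when a simulee is retested from the same pool; a minimal sketch (the averaging over many simulees per ability level used in the study is omitted):

```python
def overlap_rate(first_test, retest):
    """Proportion of retest items that already appeared in the first test."""
    return len(set(first_test) & set(retest)) / len(retest)

# Two 5-item tests drawn for the same simulee share two items.
print(overlap_rate([1, 2, 3, 4, 5], [4, 5, 6, 7, 8]))  # 0.4
```

A deterministic maximum-information rule with no exposure control would give an overlap of 1.0 at every θ, which is why adding any exposure control pushes these curves down.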
Figure 6.10 shows the item exposure rate for individual items, ordered by difficulty level, in each item pool. Extremely easy and extremely difficult items tend to have lower exposure rates, but under-exposed items appear across all difficulty levels, especially in the operational item pool. As indicated in Table 6.4, the MTI item pool has the fewest under-exposed items and utilizes items at the extreme difficulty levels more often.

Figure 6.11 presents the conditional test information plots. The plots for the item pools with exposure control look very similar to those without exposure control. The optimal pools with exposure control yield similar or more information than the pools without exposure control, except that the MTI pool with exposure control yields a smaller amount of information at ability levels between 0 and 1.5. The operational pool produces less information at ability levels between -1.0 and 1.0 when compared to the same pool without exposure control.

Figure 6.11 Average test information conditional on true θ

Figure 6.12 presents the CSEM for the three item pools. It shows that the MTI pool performs better at ability levels below -2.0 and the PM pool performs better at ability levels above 0.0. The operational pool has the largest SEM at ability levels between -2.0 and 1.5. Figures 6.13 and 6.14 show that the MTI pool yields the smallest bias and the smallest MSE at most ability levels. The operational pool and the PM pool produce similar MSE.

Figure 6.12 Conditional standard error of measurement (CSEM)

Figure 6.13 Conditional bias
Figure 6.14 Conditional mean square error (CMSE)

6.3 Summary

The results suggest that all optimal pools, compared to operational pools, save about 10 or more items while performing better in terms of pool size, test security, and measurement accuracy. Tests assembled from optimal pools have smaller test-retest overlap rates. In addition, optimal pools have significantly lower percentages of items with exposure rates below 0.02.

Chapter VII
The Performance of the Item Pools with a-Stratified Exposure Control

7.1 Item Pools for Tests without Content Balance

The a-stratified exposure control selects items by the closeness of the b-parameter to the provisional ability estimate instead of by maximum information. Therefore, it is not necessary to adjust the item pool after the CAT simulations. The item distributions for the OP, MTI, and PM pools are displayed in Figure 7.1.

Figure 7.1 Item distribution for item pools without content balancing and with a-stratified exposure control (panels: a. Operational Item Pool, Content Area 1: 137 items; b. Item Pool Designed by MTI, Content Area 1: 158 items; c. Item Pool Designed by PM, Content Area 1: 240 items)

As can be seen, the shapes of the distributions look similar for the MTI pool and the PM pool, although the MTI pool is much smaller. The PM pool has more difficult items with high a-parameters, while the MTI pool has more moderately difficult items with high a-parameters.
Both optimal pools seem to require more items with moderate to low b-parameters and fewer items with high b-parameters.

Table 7.1 presents the sizes of the three pools and the summary statistics for the item parameters within each pool. The size of either optimal item pool is larger than the size of the operational pool by 40 or more items. The PM pool has the smallest b-parameter range and the largest average a-value, with a maximum of 3.520 and minimum of 1.146. The range of a-parameter values is 1.275 to 3.394 for items in the MTI pool and 0.746 to 3.141 for the operational pool.

The overview of the evaluation results for these item pools is presented in Table 7.2. The average bias of the ability estimates is positive for the operational pool and the MTI pool and negative for the PM pool, but the magnitudes of the bias are negligible. The optimal item pools yield smaller MSE and higher correlation coefficients than the operational pool does. Of the item pools, the MTI pool results in the highest correlation coefficient and the smallest MSE. Table 7.2 also shows that the MTI pool has the lowest test-retest overlap rate. The PM pool, however, has the highest overlap rate despite having the largest pool size. The plots of conditional test-retest overlap rate shown in Figure 7.2 reveal that the MTI pool has the lowest overlap rate at most ability levels.

Table 7.1 Item pool size and item parameter summary statistics

Table 7.2 Summary statistics of the performance of the item pools
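The a-stratified selection rule described at the opening of this chapter can be sketched as follows: the pool is partitioned into strata of ascending a-parameters, and within the stratum assigned to the current test stage the unused item whose b-parameter is closest to the provisional ability estimate is chosen. The stratum count and item parameters below are illustrative, not the study's:

```python
def build_strata(items, n_strata):
    """Sort the pool by a-parameter and cut it into n_strata equal slices,
    from least to most discriminating."""
    ranked = sorted(items, key=lambda it: it["a"])
    size = len(ranked) // n_strata
    return [ranked[i * size:(i + 1) * size] for i in range(n_strata)]

def pick_item(stratum, theta_hat, used):
    """Within the active stratum, pick the unused item whose b-parameter is
    closest to the provisional ability estimate."""
    available = [it for it in stratum if it["id"] not in used]
    return min(available, key=lambda it: abs(it["b"] - theta_hat))

# Illustrative 20-item pool: a rises with id, b cycles over -2..2.
pool = [{"id": i, "a": 0.5 + 0.1 * i, "b": float(i % 5) - 2.0} for i in range(20)]
strata = build_strata(pool, n_strata=4)
first = pick_item(strata[0], theta_hat=0.0, used=set())
print(first["id"])  # 2: the low-a item with b closest to 0
```

Because low-a strata are consumed early and high-a strata are reserved for late stages, highly discriminating items are no longer grabbed first at every administration, which evens out the exposure rates shown in Figure 7.3.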
Figure 7.2 Test-retest overlap rate conditional on θ

As can be seen in Table 7.2, the MTI pool has the lowest percentage of items with an exposure rate below 0.02 and the lowest percentage of items with an exposure rate higher than the target rate of 1/3. The PM pool has a lower percentage of over-exposed items, but has over half of its items under-exposed. Individual item exposure rates are shown in Figure 7.3. It is clear that items in the MTI pool are used more evenly. By contrast, a few items in the PM item pool are used much more often than most items, leading to a highly skewed item exposure rate distribution for the PM pool.

Figure 7.3 Item exposure rate by difficulty level (panels: a. Operational Item Pool; b. Item Pool Designed by MTI; c. Item Pool Designed by PM)

Figure 7.4 shows the average test information at fixed ability levels. The plots for the PM pool and the operational pool look similar in shape, but the PM pool provides more information at most ability levels. The MTI pool provides a similar amount of information between ability levels of -1.5 and 1.5, and provides information exceeding 10 over the range between -2.0 and 2.0.

Figure 7.4 Average test information conditional on true θ

Figures 7.5 to 7.7 present the CSEM, conditional bias, and CMSE for the three item pools. The charts show that all three item pools yield similar SEM, bias, and MSE over ability levels between -2.0 and 2.0, and the results are mixed for ability levels beyond ±2.0.
Figure 7.5 Conditional standard error of measurement (CSEM)

Figure 7.6 Conditional bias

Figure 7.7 Conditional mean square error (CMSE)

7.2 Item Pools for Tests with Content Balance

The item distributions for the OP, MTI, and PM pools are displayed in Figure 7.8. The results are similar to those without content balancing. The shapes of the item distributions look similar for the MTI pool and the PM pool. The PM pool has more highly discriminating and difficult items in both Content 1 and Content 2, while the MTI pool has more moderately difficult items with high a-parameters in all three content areas. Unlike the operational pool, which has more items with both high a-parameters and high b-parameters, both optimal pools seem to require more items with moderate to low b-parameters and fewer items with high b-parameters.

Figure 7.8 Item distribution for item pools with content balancing and without exposure control (panels: a. Operational Item Pool, Content Areas 1/2/3: 59/58/13 items; b. Item Pool Designed by MTI, Content Areas 1/2/3: 66/75/14 items; c. Item Pool Designed by PM, Content Areas 1/2/3: 87/98/17 items)
Figure 7.8 (Cont'd) Item distribution for item pools with content balancing and without exposure control

Table 7.3 presents the item pool sizes for the three pools and the summary statistics for the item parameters within each pool. The results are similar to those where no content balancing was present. The sizes of both optimal item pools are larger than the size of the operational pool by 40 or more items. The MTI pool has the largest average a-parameter values, with a maximum of 4.046 and minimum of 1.244. The operational pool has the smallest b-parameter range. All three item pools have similar numbers of items in Content 3. Interestingly, although Content 1 and Content 2 appear the same number of times in a test, the optimal pools require more items in Content 2 than in Content 1.

Table 7.3 Item pool size and item parameter statistics by content area

Table 7.4 Summary statistics of the performance of the item pools

Table 7.5 Percentage of over- and under-exposed items by content
The overview of the evaluation results for these item pools is presented in Table 7.4. The results are similar to those without content balancing. The PM pool shows slightly negative bias for the ability estimates. The average bias from the operational pool and the MTI pool is positive, but the magnitudes of the bias are negligible. The optimal item pools yield smaller MSE and higher correlation coefficients than the operational pool does. Of all the pools, the MTI pool resulted in the highest correlation coefficient and the smallest MSE.

The plots of conditional test-retest overlap rate shown in Figure 7.9 draw a picture consistent with the plots for the item pools without content balancing. The MTI pool has the lowest overlap rate at all ability levels. The summary statistics in Table 7.4 also show that the MTI pool has the lowest average test-retest overlap rate. The PM pool, however, has the highest overlap rate despite having the largest pool size.

Figure 7.9 Test-retest overlap rate conditional on θ

Figure 7.10 Item exposure rate by difficulty level (panels: a. Operational Item Pool; b. Item Pool Designed by MTI; c. Item Pool Designed by PM)

Table 7.5 also shows that the MTI pool has the lowest percentage of items with an exposure rate below 0.02, the lowest percentage of items with an exposure rate higher than the target rate of 1/3, and the lowest skewness of the item exposure rates.
The PM pool has over half of its items under-exposed and is the most skewed pool in terms of item usage. Individual item exposure rates are shown in Figure 7.10. It is clear that items in the MTI pool are used more evenly, while only a few items in the PM item pool are used often and the others are used less.

Figure 7.11 shows the average test information over ability levels ranging from -4.0 to 4.0. The amount of information the operational pool provides seems quite low; even at its peak levels, it is below 10. The plot for the PM item pool looks similar to that for the operational pool, but the PM pool provides more information over a range of ability levels. The MTI pool provides a similar amount of information between ability levels -2.0 and 1.5, and provides information exceeding 10 in the range between -2.0 and 2.0.

Figure 7.11 Average test information conditional on true θ

Figures 7.12 to 7.14 present the CSEM, conditional bias, and CMSE for the three item pools. The charts show that all three item pools yield similar SEM, bias, and MSE over ability levels between -2.0 and 2.0. The MTI pool performs better at ability levels below -2.0 and above 2.0.

Figure 7.12 Conditional standard error of measurement (CSEM)
Tests assembled from MTI pools have smaller test-retest overlap rates and significantly lower percentages of over- and under-exposed items. The item pool designed by the PM did not perform as well as the operational pool and. the pool designed by MTI, despite having more items. One possible reason is that the way items are stratified is slightly different from the way items are simulated. It might yield better results if items are stratified with Chang and van der Linden’s (2003) 0-1 linear programming method during the evaluation. stage. 136 Chapter VIII Discussion This chapter first revisits the definition of “optimal” in item pool design and discusses how this study successfully addresses the criteria of optimality the definition implies. It then presents the implication of Reckase’s method on the practice of item pool development and maintenance. Finally, the limitations of this study and the expected future research are discussed. 8.] A Revisit to the Definition of “Optimal /-. This study investigated two approaches to design optimal blueprints for CAT item po’o‘l_‘s‘.\Except for the item pool designed for a-stratified exposure control by the PM method, optimal pools designed by either method perform better than the operational pools no matter whether exposure control and content balancing are considered or not. Optimal item pool design looks for the most desirable or favorable combination of items to form an item pool that would support the assembly of a large number of individualized computerized adaptive tests. There is, however, no single pool that is absolutely optimal, as it is constrained by a number of factors and different compositions of the items that may yield similar measurement precisions. That is why the two “optimal” pools look quite different and each is still optimal in some sense. 
A general objective for an optimal item pool design is to meet the three criteria described by van der Linden (1999): 1) sufficiently large to allow several thousand overlapping subtests to be drawn from its items; 2) consisting of items spanning the entire range of item difficulty relative to the population of interest; and 3) consisting of an appropriate mix of high and low discriminating items to lower the item creation cost 137 while meeting the needs of test precision. It is not hard to meet the first criteria, where minimum size can be translated to the test length divided by the target exposure rate (1.0 if no exposure rate is considered). Simulation studies can easily realize the first two requirements by randomly sampling a large number of examinees from the expected examinee population. simulating a CAT administration to them, and tallying the number of items needed in different difficulty levels. All optimal item pools designed in this study are at least five times the test length and span a wide range of item difficulties. Acceptable precision can be achieved with difficulty ranges slightly smaller than the items in the operational pools in that the ranges of the item difficulty level for optimal pools are all smaller. The third criterion. item creation cost, was not estimated directly in this study. However. mechanisms to indirectly approximate the item writing cost and the effort to minimize it were utilized in Reckase’s item pool design method. Item generations with both MTI and PM methods imply the relationship between item parameters and the cost of item creations. The difference between PM and MTI can be primarily the difference between the assumptions. The MTI method assumes that highly discriminating items (i.e., items with high a-parameters) are more difficult to write and cost more to create, therefore, the MTI design tries to limit the number of high discrimination items by simulating items to meet the minimum test information requirement. 
The PM method assumes that the more the items with certain characteristics among the existing items, the less expensive to create items with the same characteristics. Therefore, it minimizes the item creation cost by modeling the item characteristics (i.e., the relationship between IRT parameters) and simulating items that are more like existing items. 138 Overall, the optimally designed item pools perform better than the operational pools obtained from CAT-ASVAB. The results show that the MTI design generally leads to smaller pools and contain items with lower a-parameters. The PM pools, on the other hand, maintain the correlation between item parameters but do not perform as well as the MTI pool. The operational pools, on the other hand, provide more measurement precision over some range of latent ability levels. A closer look at the Arithmetic Reasoning pool finds more highly discriminating items at the range of b-parameters between 0 and 1.5. In practice, when operational item pools retire frequently, such high discriminating items may be difficult to replace. It introduces doubts on whether or not the same performance over similar ability levels can be easily duplicated. Item pools designed with Reckase’s method have more items evenly distributed over a wider range of ability levels. As a result, optimal pools perform better than the operational ASVAB item pools at most latent ability levels. The results imply that improvement may be made to the operational item pools by adjusting the item distributions to make them closer to what optimal item pool blueprints demand. For example. the arithmetic reasoning pool may perform better by adding more moderately discriminating items in the lower end of the latent ability levels and taking out some highly discriminating items in the higher end of the latent ability levels for future use. 
8.2 Implications on the Practice of Item Pool Development

This study is based on the assumption that examinees are normally distributed with a population mean ability of 0 and variance of 1. However, in reality, examinee distributions are not always normal, and the expected distribution may not match the actual examinee distribution, which can only be determined once the tests are administered. The question raised is how robust the design is to violations of the distributional assumption. There are two situations, and thus two treatments are required. In the case where the expected distribution is not normal, it is possible to sample the examinees from a predefined examinee distribution, which can be constructed from previous test administrations. On the other hand, since this is a simulation study, violation of the assumptions may threaten the validity of the study and affect the results. The extent of the potential impact could be a topic for future research. The end product of the item pool design is a blueprint listing the number of needed items in each bin, that is, items with a- and b-parameters in a certain range. Similar to the function of a test blueprint for a paper-and-pencil test, the item pool blueprint serves as a guide for item selection or item creation for the item pool. It portrays the optimal item composition an item pool should have and, therefore, is a target item developers should try to match. Items with the desired content coverage and statistics can either be selected from previously written ones or created by item writers. This method has not been tested in practice. In this study, all items required by the design method are assumed to be available when comparing the optimally designed item pools to the operational pools. It seems hard to produce items with exactly the same item parameters required by the item pool design blueprint.
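The blueprint's bins can be realized by cutting the a- and b-scales into intervals and counting items per cell. A minimal sketch, in which the bin edges are illustrative rather than the ones used in this study:

```python
import bisect

# Illustrative bin edges; not the study's actual cut points.
A_EDGES = [0.00, 0.89, 1.26, 1.55, 1.79, 2.00]
B_EDGES = [-3.0, -2.0, -1.0, 0.0, 1.0, 2.0, 3.0]

def bin_of(value, edges):
    """Index of the half-open interval [edges[i], edges[i+1]) holding value."""
    return bisect.bisect_right(edges, value) - 1

def blueprint_counts(items):
    """Count items per (a-bin, b-bin) cell; items are (a, b) pairs."""
    counts = {}
    for a, b in items:
        cell = (bin_of(a, A_EDGES), bin_of(b, B_EDGES))
        counts[cell] = counts.get(cell, 0) + 1
    return counts

items = [(1.3, -0.5), (1.3, -0.4), (0.9, 1.2)]
print(blueprint_counts(items))  # {(2, 2): 2, (1, 4): 1}
```

Comparing such counts for an operational pool against the blueprint's required counts shows, bin by bin, where items must be written and where a surplus can be banked.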
However, with advances in item modeling research, it will become more and more feasible to create large numbers of similar items with the desired psychometric properties. Because the PM method takes into account the correlation between a- and b-parameters, the blueprint designed by the PM method may be easier to fulfill. The MTI pools achieve acceptable measurement precision with a minimum number of items, but it is uncertain how hard it would be to find or create the proper items. On the other hand, improvements to the design method, such as combining the two methods to take advantage of the good features of each, would make the design more practical. In addition, it should be pointed out that by defining the width of the "bins," the blueprint requires similar items within a certain range rather than items with exact parameter values. Future studies are needed to investigate how hard it is to supply the items the blueprint requires.

8.3 Implications on Item Pool Management

In practice, operational item pools are not static. In most testing programs, tests are administered from the bank and new items are pretested on a continuous basis. Obsolete items are removed from time to time. Thus, monitoring item usage and replenishing new items are two important tasks of item pool management (van der Linden & Veldkamp, 2000). The item pool design methods presented here can easily be adapted for use in item pool management, both at the master pool level and at the operational pool level. The master item pool is a union of operational item pools. The distribution of the optimal master pool could simply be a number of replications of the operational pool distribution. In other words, if the master pool will support ten smaller operational pools, the optimal number of items in each bin of the master pool is simply ten times the number in that bin of the optimal pool designed by the simulation method.
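The replication rule, and the exposure-rate refinement of it discussed in this section, can be sketched per bin. The bin labels, R, and the exposure rates below are hypothetical; rounding up to whole items is an added assumption:

```python
import math

def master_pool_bins(operational_bins, R, exposure_rates=None):
    """Master-pool item counts per bin. Plain replication multiplies
    each bin count X_AB by R; with expected exposure rates r_AB, the
    count becomes max(ceil(R * r_AB * X_AB), X_AB), so heavily exposed
    bins receive more items and lightly exposed bins receive fewer."""
    if exposure_rates is None:
        return {k: R * x for k, x in operational_bins.items()}
    return {
        k: max(math.ceil(R * exposure_rates[k] * x), x)
        for k, x in operational_bins.items()
    }

bins = {"bin1": 5, "bin2": 5}
print(master_pool_bins(bins, R=10))  # {'bin1': 50, 'bin2': 50}
print(master_pool_bins(bins, R=10,
                       exposure_rates={"bin1": 0.30, "bin2": 0.05}))
# {'bin1': 15, 'bin2': 5}
```

The max with X_AB acts as a floor: no bin of the master pool is ever smaller than the corresponding bin of a single operational pool, even when its expected exposure is very low.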
Alternatively, the union method can take into consideration the expected exposure rates for the items in each bin, in which case the number of items needed in each bin of the master pool can be expressed as

X*_AB = max(R × X_AB × r_AB, X_AB),

where R is the number of operational item pools the master item pool must support, X_AB is the number of items in bin AB of the optimal operational pool, and r_AB is the expected exposure rate for the items in that bin. In this way, the master item pool has more items in the most exposed bins and fewer items in the least exposed bins.

8.4 Reckase's Method versus the Mathematical Programming Method

The results show that the extensions to Reckase's method work well in designing optimal item pools in situations where items are calibrated with the 3PL model. Compared to the mathematical programming method, Reckase's method simulates the CAT procedure directly and, therefore, is more flexible in accommodating different item selection and ability estimation processes and is easier to implement. Constraints on non-statistical attributes (e.g., content balancing) are absorbed into the first stage of the design by partitioning the target pool into smaller ones. No special software is needed. The mathematical programming method is more mathematically structured, quantifying all the constraints and searching for the optimal solutions with linear programming, but it also requires the use of a "shadow test" item selection approach in the CAT simulation. Reckase's method emphasizes the randomness of the item parameters in simulation, while the mathematical programming method focuses on optimizing predefined "pseudo" items. In the end, since both model the same CAT process, the simulation results should be similar. While taking different approaches, Reckase's method and the mathematical programming method are similar in many ways. One of the important similarities is between the PM item simulation approach and the mathematical programming method in the way item costs are minimized in the item pool design process.
The mathematical programming method defines a cost function that is the inverse of the number of real items with a certain combination of attributes, including IRT parameters. It assumes that the more real items there are with a given combination of item parameters, the less costly it is to create items with that combination. The idea is essentially the same as in the PM method, in which the simulation is more likely to generate items along the regression line of b-parameters on a-parameters, where more real items are clustered. Either method may be able to borrow ideas from the other to improve the item pool design. No literature has described the design of item pools with a-stratified exposure control by the mathematical programming method. Chang and van der Linden (2003) described a 0-1 linear programming method to optimize the stratification of a-parameters for an existing item pool, but not for "pseudo items." As explored in this study, it may be possible to simulate the item pool design by varying the target information at different stages of the test.

8.5 Limitations and Future Studies

Due to limited resources, the prediction models in this study were based on one operational item pool. In practice, it is possible to use multiple recent item pools to obtain a more accurate estimate of the attributes of the items written for the testing programs. Previous research showed that the bin width might influence the number of items required in the optimal pool. With the post-simulation adjustment utilized in this study, unnecessary items in the bins are trimmed, so the bin width might not influence the size of the final pool. However, future studies are needed to investigate the impact. The optimal item pool blueprints designed for computerized adaptive testing with a-stratified exposure control appear to require more items than those designed for CAT with Sympson-Hetter exposure control.
One of the reasons is that the post-simulation adjustment was not applied, because of the different item selection procedure a-stratified exposure control uses. Future research is expected to explore appropriate post-simulation adjustments for item pools whose item selection procedures differ from maximum-information-based ones. While this study investigated the optimal item pool design with the PM and MTI methods separately, both methods have shortcomings. The MTI method tends to result in items with low correlations between their a- and b-parameters. The PM method, while maintaining a correlation similar to that of the items in the operational pools, tends to perform better over some ability levels than others. It is important for future research to explore ways to combine the two design methods so that item generation takes into account the item parameter correlations while meeting the minimum information requirement.

APPENDIX

Table A.1 Item Distribution for the Operational Item Pool — Arithmetic Reasoning [table values illegible in scan]
Table A.2 Item Distribution for Item Pool Designed by MTI Method and without Exposure Control — Arithmetic Reasoning [table values illegible in scan]

Table A.3 Item Distribution for Item Pool Designed by PM Method and without Exposure Control — Arithmetic Reasoning [table values illegible in scan]

Table A.4 Item Distribution for Item Pool Simulated with MTI Method and with Sympson-Hetter Exposure Control — Arithmetic Reasoning [table values illegible in scan]

Table A.5 Item Distribution for Item Pool Simulated with PM Method and with Sympson-Hetter Exposure Control — Arithmetic Reasoning [table values illegible in scan]
Table A.6 Item Distribution for Item Pool Simulated with MTI Method and with a-Stratified Exposure Control — Arithmetic Reasoning [table values illegible in scan]

Table A.7 Item Distribution for Item Pool Simulated with PM Method and with a-Stratified Exposure Control — Arithmetic Reasoning
(rows: b-parameter intervals; columns: a-parameter intervals)

b \ a         0.00  0.89  1.26  1.55  1.79  2.00  2.19  2.37  2.53  2.68  2.83  2.97  Total
-inf  -3.00      0     0     0     0     0     0     0     0     0     0     0     0      0
-3.00 -2.80      0     0     0     0     0     0     0     0     0     0     0     0      0
-2.80 -2.60      0     0     0     0     0     0     0     0     0     0     0     0      0
-2.60 -2.40      0     6     0     0     0     0     0     0     0     0     0     0      6
-2.40 -2.20      0     2     5     0     0     0     0     0     0     0     0     0      7
-2.20 -2.00      0     2     6     0     0     0     0     0     0     0     0     0      8
-2.00 -1.80      0     0     9     0     0     0     0     0     0     0     0     0      9
-1.80 -1.60      0     0    10     0     0     0     0     0     0     0     0     0     10
-1.60 -1.40      0     0     7     3     0     0     0     0     0     0     0     0     10
-1.40 -1.20      0     0     6     4     0     0     0     0     0     0     0     0     10
-1.20 -1.00      0     0     3     7     0     0     0     0     0     0     0     0     10
-1.00 -0.80      0     0     1     8     2     0     0     0     0     0     0     0     11
-0.80 -0.60      0     0     0     7     3     0     0     0     0     0     0     0     10
-0.60 -0.40      0     0     0     5     5     0     0     0     0     0     0     0     10
-0.40 -0.20      0     0     0     0     8     2     0     0     0     0     0     0     10
-0.20  0.00      0     0     0     0     8     3     0     0     0     0     0     0     11
 0.00  0.20      0     0     0     0     3     5     2     0     0     0     0     0     10
 0.20  0.40      0     0     0     0     1     7     2     0     0     0     0     0     10
 0.40  0.60      0     0     0     0     0     3     5     2     0     0     0     0     10
 0.60  0.80      0     0     0     0     0     2     6     2     0     0     0     0     10
 0.80  1.00      0     0     0     0     0     0     5     4     1     0     0     0     10
 1.00  1.20      0     0     0     0     0     0     1     6     2     0     0     0      9
 1.20  1.40      0     0     0     0     0     0     0     3     4     2     0     0      9
 1.40  1.60      0     0     0     0     0     0     0     1     5     3     0     0      9
 1.60  1.80      0     0     0     0     0     0     0     0     2     4     2     0      8
 1.80  2.00      0     0     0     0     0     0     0     0     0     2     4     2      8
 2.00  2.20      0     0     0     0     0     0     0     0     0     0     4     3      7
 2.20  2.40      0     0     0     0     0     0     0     0     0     0     2     4      6
 2.40  2.60      0     0     0     0     0     0     0     0     0     0     1     4      5
 2.60  2.80      0     0     0     0     0     0     0     0     0     0     0     4      4
 2.80  3.00      0     0     0     0     0     0     0     0     0     0     0     0      0
 3.00   inf      0     0     0     0     0     0     0     0     0     0     0     0      0
Total            0    10    47    34    30    22    21    18    14    11    13    17    237

Table A.8 Item Distribution for the Operational Item Pool — General Science Content 1 [table values illegible in scan]

Table A.9 Item Distribution for the Operational Item Pool — General Science Content 2 [table values illegible in scan]
[Values in Tables A.10 through A.27 are illegible in the scan; only the captions are recoverable.]

Table A.10 Item Distribution for the Operational Item Pool — General Science Content 3
Table A.11 Item Distribution for the Optimal Item Pool Designed by MTI and without Exposure Control — General Science Content 1
Table A.12 Item Distribution for the Optimal Item Pool Designed by MTI and without Exposure Control — General Science Content 2
Table A.13 Item Distribution for the Optimal Item Pool Designed by MTI and without Exposure Control — General Science Content 3
Table A.14 Item Distribution for the Optimal Item Pool Designed by PM and without Exposure Control — General Science Content 1
Table A.15 Item Distribution for the Optimal Item Pool Designed by PM and without Exposure Control — General Science Content 2
Table A.16 Item Distribution for the Optimal Item Pool Designed by PM and without Exposure Control — General Science Content 3
Table A.17 Item Distribution for the Optimal Item Pool Designed by MTI and with Sympson-Hetter Exposure Control — General Science Content 1
Table A.18 Item Distribution for the Optimal Item Pool Designed by MTI and with Sympson-Hetter Exposure Control — General Science Content 2
Table A.19 Item Distribution for the Optimal Item Pool Designed by MTI and with Sympson-Hetter Exposure Control — General Science Content 3
Table A.20 Item Distribution for the Optimal Item Pool Designed by PM and with Sympson-Hetter Exposure Control — General Science Content 1
Table A.21 Item Distribution for the Optimal Item Pool Designed by PM and with Sympson-Hetter Exposure Control — General Science Content 2
Table A.22 Item Distribution for the Optimal Item Pool Designed by PM and with Sympson-Hetter Exposure Control — General Science Content 3
Table A.23 Item Distribution for the Optimal Item Pool Designed by MTI and with a-Stratified Exposure Control — General Science Content 1
Table A.24 Item Distribution for the Optimal Item Pool Designed by MTI and with a-Stratified Exposure Control — General Science Content 2
Table A.25 Item Distribution for the Optimal Item Pool Designed by MTI and with a-Stratified Exposure Control — General Science Content 3
Table A.26 Item Distribution for the Optimal Item Pool Designed by PM and with a-Stratified Exposure Control — General Science Content 1
Table A.27 Item Distribution for the Optimal Item Pool Designed by PM and with a-Stratified Exposure Control — General Science Content 2
0.52 O 0.52 0.78 0 0.78 1.04 0 1.04 1.30 0 1.30 1.56 O 1.56 1.82 0 1.82 2.08 0 2.08 2.34 0 2.34 2.60 O 2.60 2.86 0 2.86 00 0 0 Total OOOOOOOOOOOOOOO—‘wwO‘x-bbboo OOOOOOOOOO—N-fi-MONONWNOOOOOO OOOOOOONwwmAN—OOOOOOOOOO OOOO——WWWNOOOOOOOOOOOOOO OONWWWNOOOOOOOOOOOOOOOOO OOOOOOOOOOOOOOOOOOOOOOOOO OOOOOOOOOOOOOOOOOOOOOOOOO OOOOOOOOOOOOOOOOOOOOOOOOO OOOOOOOOOOOOOOOOOOOOOOOOO OCOOOOOOOOOOOOOOOOOOOOOOO OOOOOOOOOOOOOOOOOOOOOOOOO OCONWA-BMUIGUIGONQO‘QNQMON-bhboo to kit to O to O m U.) C 172 Table A28 Item Distribution for the Optimal Item Pool Designed by PM and with a- Stratified Exposure Control — General Science Content 3 \0.00 0.89 1.26 1.55 [.79 2.00 2.|9 2.37 2.53 2.68 2.83 2.97 b 0.89 126 1.55 l.79 2.00 2.19 2.37 2.53 2.68 2.83 2.97 00 Total -00 -2.86 -2.86 -2.60 -2.60 -2.34 -2.34 -2.08 -2.08 -l.82 -l.82 -l.56 -I.56 -l.30 -l.30 -l.04 -l.04 -O.78 -0.78 -O.52 -0.52 -0.26 -026 0.00 0.00 0.26 0.26 0.52 0.52 0.78 0.78 1.04 1.04 1.30 1.30 1.56 1.56 1.82 1.82 2.08 2.08 2.34 2.34 2.60 2.60 2.86 2.86 00 Total ooooooooooocococooooooooo 4;ooooooooooooooooo———~ooo Aoooooooooocoo————ooooooo Aooooooooo————ooooooooooo moooo—————ooooooooooooooo ooooooooooooooooocooooooo ooooooocoooooooooocooooco ooooooooooooooooooooooooo coooooooooocooooooooooooo ooooooooooooooooooooooooo ooooooooooooooooooocooooo ooooooooooooooooooooooooo oooo—-—-——-——-—-—-—~—-—-—-——u—-—-—-—-—-ooo \l 173 REFERENCES 174 REFERENCES Ariel, A., Veldkamp, B. P., & van der Linden, W. J. (2004). Constructing rotating item pools for constrained adaptive testing. Journal of Educational Measurement, 41(4), 345- 360. 'v Bejar, I. I. (1993.). A generative approach to psychological and educational measurement. In N. Frederiksen, R. J. Mislevy & I. I. Bejar (Eds. ), Test theory for a new generation of tests (pp. 323-359). Hillsdale, NJ: Erlbaum. Bejar I. I. “(1 996). Generative response modeling. Leveraging the computer as a test ' delivery medium (ETS RR- 96 13). Princeton, NJ: ETS. Bejar, I 1., Lawless R. R..Morley,M. E. 
Wagner M E., Bennett,R. E. ,& Revuelta, J. (2003). A feasibility study of on- -the fly item generation in adaptive testing. The Journal of Technology Learning and Assessment 2(3). http: //www.jtla. org. .HM-m _,‘--n'--‘. a" Bergstrom, B. A., & Lunz, M. E. (1999). CAT for certification and licensure. In F. Drasgow & J. Olson-Buchanan (Eds.), Innovations in Computerized Assessment (pp. 67-91). Mahwah, NJ: Lawrence Erlbaum Associates, Inc. Binet, A., & Simon, Th. A. (1905). Méthode nouvelle pour le diagnostic du niveau intellectuel des anorrnaux. L'Année Psychologique, 11, 191-244. I, Boekkooi-Timminga, E. (1991). A method for designing Rasch Model-based item banks. Paper presented at the annual meeting of the Psychometric Society, Princeton, NJ. Brown, J. M., & Weiss, D. J. (1977). An adaptive testing strategy for achievement test batteries (No. 77-6). Minneapolis: University of Minnesota, Psychometric Methods Program. Chang, H. H. (2004). Understanding computerized adaptive testing: From Robins-Monro to Lord and beyond. In David Kaplan (Ed.) The Sage handbook of quantitative methodology for the social sciences. (pp. 117-136). Thousand Oaks, CA: Sage Publications, Inc. Chang, S. W., Ansley, T. N., & Lin, S. H. (2000). Performance of item exposure control methods in computerized adaptive testing: Further explorations. Paper presented at the Annual Meeting of the American Educational Research Association. Chang, H. H., Qian, J ., & Ying, Z. (2001). Alpha-stratified multistage computerized adaptive testing with beta blocking. Applied Psychological Measurement, 25(4), 333-341. 175 Chang, H. H., & Stout, W. (1993). The asymptotic posterior normality of the latent trait in an IRT model. Psychometrika, 58, 37-52. Chang, H. H., & van der Linden, W. J. (2003). Optimal stratification of item pools in a- stratified computerized adaptive testing. Applied Psychological Measurement, 27(4), 262-274. Chang, H. H., & Ying, Z. (1999). 
Alpha—stratified multistage computerized adaptive testing. Applied Psychological Measurement, 23(3), 211-222. Chang, H. H., & Ying, Z. (1996). A global information approach to computerized adaptive testing. Applied Psychological Measurement, 20(3), 213-229. C hen. S. Y., Ankenmann, R. D., & Spray, J. A. (1999). Exploring the relationship between item exposure rate and test overlap rate in computerized adaptive testing (N o. ACT-RR-99-5): American College Testing Program, Iowa City, IA. Davey, T., & Parshall, C. G. (1995). New algorithms for item selection and exposure control with computerized adaptive testing. Paper presented at the Annual Meeting of the American Educational Research Association, San Francisco, CA. Davey, T., & Thomas, L. (1996). Constructing adaptive tests to parallel conventional programs. Paper presented at the annual meeting of the American Educational Research Association, New York. Davis, L. L. (2002). Strategies for controlling item exposure in computerized adaptive testing with polytomously scored items. Unpublished doctoral dissertation, University of Texas, Austin. Davis, L. L., & Dodd, B. G. (2001). An examination of testlet scoring and item exposure constraints in the verbal reasoning section of the MCA T. MCAT Monograph Series: Association of American Medical Colleges. ' Deane, P. ,,Graf E. A. Higgins, D. ,Futagi, Y., Lawless, R. (2006). Model analysis and modefcreatzon Capturing the task-model structure of quantitative item domains (No. ETS- RR-06— 11). Educational Testing Service, Princeton, NJ. Dorans, N. J ., & Kulick, E. M. (1986). Demonstrating the utility of the standardization approach to assessing unexpected differential item performance on the Scholastic Aptitude Test. Journal of Educational Measurement. 23, 355-368. ------- “"“x Embretson, S. E. (2001’). Generating abstract reasoning items with cognitive theory. In S. H. Irvine & P. C. Kyllonen (Eds) Item generation for test development. Mahwah, NJ: Erlbaum Publishers. 
Forsythe, G. E., Malcolm, M. A., & Moler, C. B. (1976). Computer methods for mathematical computations. Englewood Cliffs, NJ: Prentice-Hall.

Graf, E. A., Peterson, S., Steffen, M., & Lawless, R. (2005). Psychometric and cognitive analysis as a basis for the design and revision of quantitative item models (ETS RR-05-25). Princeton, NJ: Educational Testing Service.

Green, B. F., Bock, R. D., Humphreys, L. G., Linn, R. L., & Reckase, M. D. (1984). Technical guidelines for assessing computerized adaptive tests. Journal of Educational Measurement, 21(4), 347-360.

Grist, S., Rudner, L., & Wise, L. (1989). Computer adaptive tests (ERIC Digest 107). Washington, DC: American Institute for Research and ERIC Clearinghouse on Tests, Measurement, and Evaluation. (ERIC No. ED 315 425)

Hambleton, R. K., Swaminathan, H., & Rogers, H. J. (1991). Fundamentals of item response theory. Newbury Park, CA: Sage.

Hau, K. T., Wen, J. B., & Chang, H. H. (2002). Optimum number of strata in the a-stratified computerized adaptive testing design. Paper presented at the annual meeting of the American Educational Research Association, New Orleans, LA.

Henning, G. (1987). A guide to language testing. Cambridge, MA: Newbury House.

Hetter, R. D., & Sympson, J. B. (1997). Item exposure control in CAT-ASVAB. In W. A. Sands, B. K. Waters, & J. R. McBride (Eds.), Computerized adaptive testing: From inquiry to operation (pp. 141-144). Washington, DC: American Psychological Association.

Hively, W., Patterson, H. L., & Page, S. H. (1968). A "universe-defined" system of arithmetic achievement items. Journal of Educational Measurement, 5, 275-290.

Holland, P. W., & Thayer, D. T. (1988). Differential item performance and the Mantel-Haenszel procedure. In H. Wainer & H. Braun (Eds.), Test validity (pp. 129-145). Hillsdale, NJ: Lawrence Erlbaum Associates.

Koch, W. R., & Dodd, B. G. (1989). An investigation of procedures for computerized adaptive testing using partial credit scoring. Applied Measurement in Education, 2, 335-357.

Lewis, C., & Sheehan, K. (1990). Using Bayesian decision theory to design a computerized mastery test. Applied Psychological Measurement, 14, 367-386.

Lord, F. M. (1971). The self-scoring flexilevel test. Journal of Educational Measurement, 8, 147-151.

Lord, F. M. (1980). Applications of item response theory to practical testing problems. Hillsdale, NJ: Lawrence Erlbaum Associates.

Luecht, R. M. (1998). A framework for exploring and controlling risks associated with test item exposure over time. Paper presented at the annual meeting of the National Council on Measurement in Education.

MathWorks. (2005). MATLAB: The language of technical computing, Version 7, Release 14, Student Version [Computer software]. Natick, MA: Author.

McBride, J. R., & Martin, J. T. (1983). Reliability and validity of adaptive ability tests in a military setting. In D. J. Weiss (Ed.), New horizons in testing (pp. 223-226). New York: Academic Press.

Millman, J., & Arter, J. A. (1984). Issues in item banking. Journal of Educational Measurement, 21(4), 315-330.

Mills, C. N., & Stocking, M. L. (1996). Practical issues in large-scale computerized adaptive testing. Applied Measurement in Education, 9, 287-304.

Owen, R. J. (1969). A Bayesian approach to tailored testing (RB-69-92). Princeton, NJ: Educational Testing Service.

Owen, R. J. (1975). A Bayesian sequential procedure for quantal response in the context of adaptive mental testing. Journal of the American Statistical Association, 70(350), 351-356.

Parshall, C., Davey, T., & Nering, M. (1998). Test development exposure control for adaptive tests. Paper presented at the annual meeting of the National Council on Measurement in Education, San Diego, CA.

Patsula, L. N., & Steffen, M. (1997). Maintaining item and test security in a CAT environment: A simulation study. Paper presented at the annual meeting of the National Council on Measurement in Education.

Prosser, F. (1974). Item banking. In G. Lippey (Ed.), Computer-assisted test construction (pp. 29-66). Englewood Cliffs, NJ: Educational Technology Publications.

Reckase, M. D. (1976). An application of the Rasch simple logistic model to tailored testing. Paper presented at the annual meeting of the American Educational Research Association, Chicago, IL.

Reckase, M. D. (1989). Adaptive testing: The evolution of a good idea. Educational Measurement: Issues and Practice, 8(3), 11-15.

Reckase, M. D. (2003). Item pool design for computerized adaptive tests. Paper presented at the annual meeting of the National Council on Measurement in Education, Chicago, IL.

Reckase, M. D., & He, W. (2004). The ideal item pool for the NCLEX-RN examination: Report to NCSBN. East Lansing, MI: Michigan State University.

Reckase, M. D., & He, W. (2005). Ideal item pool design for the NCLEX-RN exam. East Lansing, MI: Michigan State University.

Rudner, L. (1998). Item banking. Practical Assessment, Research & Evaluation, 6(4).

Segall, D. O., Moreno, K. E., & Hetter, R. D. (1997). Item pool development and evaluation. In W. A. Sands, B. K. Waters, & J. R. McBride (Eds.), Computerized adaptive testing: From inquiry to operation (pp. 117-130). Washington, DC: American Psychological Association.

Shealy, R., & Stout, W. F. (1993). A model-based standardization approach that separates true bias/DIF from group differences and detects test bias/DTF as well as item bias/DIF. Psychometrika, 58, 159-194.

Singley, M. K., & Bennett, R. E. (2002). Item generation and beyond: Applications of schema theory to mathematics assessment. In S. H. Irvine & P. C. Kyllonen (Eds.), Item generation for test development (pp. 361-384). Mahwah, NJ: Lawrence Erlbaum Associates.

Stocking, M. L. (1994). Three practical issues for modern adaptive testing item pools (ETS RR-94-5). Princeton, NJ: Educational Testing Service.

Stocking, M. L., & Lewis, C. (1995). A new method for controlling item exposure in computer adaptive testing (Research Report 95-25). Princeton, NJ: Educational Testing Service.

Stocking, M. L., & Lewis, C. (1998). Controlling item exposure conditional on ability in computerized adaptive testing. Journal of Educational and Behavioral Statistics, 23, 57-75.

Stocking, M. L., & Lewis, C. (2000). Methods of controlling the exposure of items in CAT. In W. J. van der Linden & C. A. W. Glas (Eds.), Computerized adaptive testing: Theory and practice (pp. 163-182). Dordrecht, Netherlands: Kluwer Academic Publishers.

Stocking, M. L., & Swanson, L. (1998). Optimal design of item pools for computerized adaptive tests. Applied Psychological Measurement, 22, 271-279.

Sympson, J. B., & Hetter, R. D. (1985). Controlling item-exposure rates in computerized adaptive testing. Paper presented at the 27th annual meeting of the Military Testing Association, San Diego, CA.

Thissen, D., & Mislevy, R. J. (2000). Testing algorithms. In H. Wainer (Ed.), Computerized adaptive testing: A primer (2nd ed., pp. 101-133). Mahwah, NJ: Lawrence Erlbaum Associates.

Thomasson, G. L. (1995). New item exposure control algorithms for computerized adaptive testing. Paper presented at the annual meeting of the Psychometric Society, Minneapolis, MN.

Urry, V. W. (1977). Tailored testing: A successful application of latent trait theory. Journal of Educational Measurement, 14(2), 181-196.

van der Linden, W. J. (1998). Optimal assembly of psychological and educational tests. Applied Psychological Measurement, 22, 195-211.

van der Linden, W. J. (2005). Linear models for optimal test design. New York: Springer-Verlag.

van der Linden, W. J., & Pashley, P. J. (2000). Item selection and ability estimation in adaptive testing. In W. J. van der Linden & C. A. W. Glas (Eds.), Computerized adaptive testing: Theory and practice (pp. 1-25). Boston: Kluwer.

van der Linden, W. J., Scrams, D. J., & Schnipke, D. L. (1999). Using response-time constraints to control for speededness in computerized adaptive testing. Applied Psychological Measurement, 23, 195-210.

Veldkamp, B. P., & van der Linden, W. J. (1999). Designing item pools for computerized adaptive testing (Research Report 99-03). Enschede, Netherlands: University of Twente, Faculty of Educational Science and Technology.

Wainer, H. (1990). Computerized adaptive testing: A primer. Hillsdale, NJ: Lawrence Erlbaum Associates.

Wainer, H. (2000). Rescuing computerized testing by breaking Zipf's law. Journal of Educational and Behavioral Statistics, 25(2), 203-224.

Wainer, H., & Eignor, D. (2000). Caveats, pitfalls, and unexpected consequences of implementing large-scale computerized testing. In H. Wainer (Ed.), Computerized adaptive testing: A primer (2nd ed., pp. 271-299). Mahwah, NJ: Lawrence Erlbaum Associates.

Weiss, D. J. (1976). Adaptive testing research in Minnesota: Overview, recent results, and future directions. In C. L. Clark (Ed.), Proceedings of the first conference on computerized adaptive testing (pp. 24-35). Washington, DC: United States Civil Service Commission.

Weiss, D. J. (1985). Adaptive testing by computer. Journal of Consulting and Clinical Psychology, 53(6), 774-789.

Weiss, D. J., & Kingsbury, G. G. (1984). Application of computerized adaptive testing to educational problems. Journal of Educational Measurement, 21, 361-375.

Wetzel, C. D., & McBride, J. R. (1986). Reducing the predictability of adaptive item sequences. Paper presented at the annual conference of the Military Testing Association, San Diego, CA.

Yao, T. (1991). CAT with a poorly calibrated item bank. Rasch Measurement Transactions, 5(2), 141.