LIBRARY
Michigan State University

This is to certify that the dissertation entitled OPTIMAL ITEM POOL DESIGN FOR A HIGHLY CONSTRAINED COMPUTERIZED ADAPTIVE TEST presented by Wei He has been accepted towards fulfillment of the requirements for the Ph.D. degree in Measurement and Quantitative Methods.

Major Professor's Signature

Date

OPTIMAL ITEM POOL DESIGN FOR A HIGHLY CONSTRAINED COMPUTERIZED ADAPTIVE TEST

By

Wei He

A DISSERTATION

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of DOCTOR OF PHILOSOPHY

Measurement and Quantitative Methods

2010

ABSTRACT

OPTIMAL ITEM POOL DESIGN FOR A HIGHLY CONSTRAINED COMPUTERIZED ADAPTIVE TEST

By Wei He

Item pool quality has been regarded as one important factor in realizing enhanced measurement quality for the computerized adaptive test (CAT) (e.g., Flaugher, 2000; Jensema, 1977; McBride & Weiss, 1976; Reckase, 1976, 2003; van der Linden, Ariel, & Veldkamp, 2006; Veldkamp & van der Linden, 2000; Xing & Hambleton, 2004). However, studies on how to identify the desired features of an item pool for a CAT are rare. Unlike the problem of item pool assembly, in which an item pool is assembled from an available master pool according to the desired specifications, no actual items are available yet in the problem of item pool design (van der Linden, Ariel, & Veldkamp, 2006). Since no actual items are available when designing an item pool, designing an item pool that is optimal intuitively becomes a desired goal. This study focuses on designing an optimal item pool for a CAT using the weighted deviations model (WDM; Stocking & Swanson, 1993) item selection procedure. Drawing on Reckase (2003) and Gu (2007), this study extends the bin-and-union method proposed by Reckase (2003) to a CAT with a large set of complex non-statistical constraints. The method used to generate optimal item features combines methods based on McBride and Weiss (1976) and Gu (2007) for statistical features with a sampling method based on test specifications for non-statistical features. The end product is an item blueprint describing the items' statistical and non-statistical attributes, the item number distribution, and the optimal item pool size. A large-scale operational CAT program served as the CAT template in this study. Three key factors considered to potentially impact optimal item pool features were manipulated: item generation method, expected amount of item information change, and b-bin width. Optimal item pool performance was evaluated and compared with that of an operational item pool in light of a series of criteria including measurement accuracy and precision, item pool utilization, test security, constraint violation, and classification accuracy. A demonstrative example of how to use the identified optimal item pool features for item pool assembly is provided. How to apply optimal item pool features to item pool management, operational item pool assembly, and item writing is also discussed.
To the loving memory of my grandpa Lianbi Guo

To my husband Chuan, my daughter Megan, and my parents

ACKNOWLEDGEMENTS

The completion of this dissertation marks the end of my long journey in Ph.D. study, which would not have been possible without the guidance, support, and encouragement of many people. My deepest appreciation and thanks go to my academic advisor and dissertation chair, Professor Mark Reckase. I feel grateful and blessed to have worked closely with him since the first year of my graduate study. I thank him for his unremitting mentoring and scholarly insights that kindled my enthusiasm for research; I thank him for his support and trust that motivated me to achieve the best that I can and to turn the seemingly impossible into the achievable. He has provided me with the most exceptional mentor-student relationship I could have dreamed of. I attribute most of my achievements in graduate school to his guidance, support, encouragement, and warm care.

I would like to thank the other members of my committee, Professor Richard Houang, Professor Sharif Shakrani, and Professor Alexander von Eye, for their insights, suggestions, and assistance not only with this dissertation but throughout my Ph.D. study. My special gratitude also extends to Dr. Edward Wolfe, who has provided me with support and advice during my Ph.D. study. I also feel blessed to have had the opportunity to work closely with him; the completion of my Ph.D. study owes much to his scholarly guidance and constant support. I would like to thank Lixiong for always being available to answer my questions while I worked on this dissertation and Linda for her generous help in proofreading it.

I would like to express my gratitude to my friends Shufang and Rachel. Shufang, it is such a blessing to have you as a friend and a big sister, not only in Shanghai but also here in the United States. Thank you for being such a great helper in my life! Rachel, thank you so much for your constant warm care since I landed in this country! I also thank the other friends and professors who have enriched my life as a doctoral student, including Hui Jin, Sungwom Ngudgratoke, Dipendra Subedi, Qi Chen, Qi Diao, Yunfei Wu, Muthoni, Chueh-an Hsieh, Hong Jiao, Shudong Wang, Rui Gao, Auntie Dorothy and Uncle Rex, Dr. Yeow Meng Thum, Dr. Kim Maier, and numerous others.

I am very honored and grateful for the College Board Research Fellowship, the Robert Ebel Scholarship, and the Dissertation Completion Fellowship. I would also like to thank the National Council of State Boards of Nursing for providing me with valuable opportunities to work on their research projects.

Finally, I would like to thank my parents and my sister for their immeasurable support and unquestioning love. I owe the most to my dear husband for his unconditional love and support and to my daughter Megan for being such a tremendous source of happiness and inspiration in my life!

TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES
ACRONYMS

CHAPTER I  INTRODUCTION
1.1 Background
1.2 Statement of Research Questions
1.3 Significance of the Study
CHAPTER II  LITERATURE REVIEW
2.1 Introduction to Computerized Adaptive Testing
2.2 Item Pool Features, Population Distributions, and Other CAT Components
2.3 Practical Constraints in Item Selection in CAT
2.3.1 Content-balancing techniques
2.3.2 Item exposure control procedure
2.4 Optimal Item Pool Design Methods for CAT
2.4.1 The binary integer programming method
2.4.2 The bin-and-union method and its extension

CHAPTER III  METHODOLOGY AND RESEARCH DESIGN
3.1 Methodology
3.1.1 Defining a bin map
3.1.2 Generating optimal items
3.1.3 Modeling the WDM procedure
3.1.4 Post-adjusting item pool size
3.2 Research Design
3.2.1 CAT model
3.2.2 Simulation design
3.2.3 Research procedure
3.2.4 Evaluation criteria

CHAPTER IV  RESULTS
4.1 Characteristics of Candidate ROPs
4.2 Performance of Candidate Optimal Item Pools
4.2.1 Evaluation results from using conditional θ points
4.2.2 Evaluation results from using 20,000 examinees randomly sampled from the target examinee population
4.3 A Demonstrative Example

CHAPTER V  SUMMARY, DISCUSSION, AND IMPLICATION
5.1 Summary of Research
5.2 Discussion
5.3 Implication
5.3.1 Implication for item pool management and assembly for CAT
5.3.2 Implication for item writing and development
5.4 Limitation and Future Research Direction
APPENDIX
REFERENCE

LIST OF TABLES

Table 3.1 Another view of ab-bin/ab-block
Table 3.2 Information on exam constraints and weights
Table 3.3 Simulation design
Table 4.1 Summary descriptive statistics of 24 candidate ROPs
Table 4.2 Number of tests witnessing constraint violation given by the OP and the candidate ROPs
Table 4.3 Overall performance statistics for the OP and the candidate ROPs using a random sample of 20,000 examinees
Table 4.4 Classification accuracy rates for each of four performance levels
Table 4.5 Item pool statistics for the Demo_Pool, the OP, and the ROP_21
Table 4.6 Conditional biases, MSEs, and SEs given by the OP, the Demo_Pool, and the ROP_21
Table 4.7 Overall performance statistics for the OP, the Demo_Pool, and the ROP_21 using a random sample of 20,000 examinees
Table A.1 Conditional biases given by the OP and the candidate ROPs
Table A.2 Conditional MSEs given by the OP and the candidate ROPs
Table A.3 Conditional SEs given by the OP and the candidate ROPs

LIST OF FIGURES

Figure 2.1 A flowchart of CAT administration
Figure 2.2 Increase of item pool size with the increase of number of examinees
Figure 3.1 Percentage of maximum item information conditional on the distance between b-value and θ
Figure 3.2 An illustrative example of ab-bin/ab-block
Figure 4.1 Item number distribution in each ab-block for the operational item pool
Figure 4.2 Item number distribution in each ab-block for ROP1, ROP2, ROP13, and ROP14
Figure 4.3 Item number distribution in each ab-block for ROP9, ROP10, ROP21, and ROP22
Figure 4.4 Item discrimination and difficulty parameter distributions for the OP, ROP1, ROP2, ROP13, and ROP14
Figure 4.5 Item discrimination and difficulty parameter distributions for the OP, ROP9, ROP10, ROP21, and ROP22
Figure 4.6 Item pool information for the OP and 8 candidate ROPs
Figure 4.7 Distributions of item attributes for the candidate ROP_1, ROP_2, ROP_13, and ROP_14
Figure 4.8 Distributions of item attributes for the candidate ROP_9, ROP_10, ROP_21, and ROP_22
Figure 4.9 Graphical representation of conditional bias
Figure 4.10 Graphical representation of conditional mean square error
Figure 4.11 Graphical representation of conditional standard error
Figure A.1 Item Number Distribution in each ab-block for ROP3
Figure A.2 Item Number Distribution in each ab-block for ROP4
Figure A.3 Item Number Distribution in each ab-block for ROP5
Figure A.4 Item Number Distribution in each ab-block for ROP6
Figure A.5 Item Number Distribution in each ab-block for ROP7
Figure A.6 Item Number Distribution in each ab-block for ROP8
Figure A.7 Item Number Distribution in each ab-block for ROP11
Figure A.8 Item Number Distribution in each ab-block for ROP12
Figure A.9 Item Number Distribution in each ab-block for ROP15
Figure A.10 Item Number Distribution in each ab-block for ROP16
Figure A.11 Item Number Distribution in each ab-block for ROP17
Figure A.12 Item Number Distribution in each ab-block for ROP18
Figure A.13 Item Number Distribution in each ab-block for ROP19
Figure A.14 Item Number Distribution in each ab-block for ROP20
Figure A.15 Item Number Distribution in each ab-block for ROP23
Figure A.16 Item Number Distribution in each ab-block for ROP24
Figure A.17 Item Discrimination and Difficulty Parameter Distributions for ROP3
Figure A.18 Item Discrimination and Difficulty Parameter Distributions for ROP4
Figure A.19 Item Discrimination and Difficulty Parameter Distributions for ROP5
Figure A.20 Item Discrimination and Difficulty Parameter Distributions for ROP6
Figure A.21 Item Discrimination and Difficulty Parameter Distributions for ROP7
Figure A.22 Item Discrimination and Difficulty Parameter Distributions for ROP8
Figure A.23 Item Discrimination and Difficulty Parameter Distributions for ROP11
Figure A.24 Item Discrimination and Difficulty Parameter Distributions for ROP12
Figure A.25 Item Discrimination and Difficulty Parameter Distributions for ROP15
Figure A.26 Item Discrimination and Difficulty Parameter Distributions for ROP16
Figure A.27 Item Discrimination and Difficulty Parameter Distributions for ROP17
Figure A.28 Item Discrimination and Difficulty Parameter Distributions for ROP18
Figure A.29 Item Discrimination and Difficulty Parameter Distributions for ROP19
Figure A.30 Item Discrimination and Difficulty Parameter Distributions for ROP20
Figure A.31 Item Discrimination and Difficulty Parameter Distributions for ROP23
Figure A.32 Item Discrimination and Difficulty Parameter Distributions for ROP24
Figure A.33 Distributions of item attributes for the candidate ROP_1 to ROP_4
Figure A.34 Distributions of item attributes for the candidate ROP_5 to ROP_8
Figure A.35 Distributions of item attributes for the candidate ROP_9 to ROP_12
Figure A.36 Distributions of item attributes for the candidate ROP_13 to ROP_16
Figure A.37 Distributions of item attributes for the candidate ROP_17 to ROP_20
Figure A.38 Distributions of item attributes for the candidate ROP_21 to ROP_24

(Images in this dissertation are presented in color.)

ACRONYMS

ASVAB: Armed Services Vocational Aptitude Battery
CAT: Computerized Adaptive Testing
CCAT: Constrained Computerized Adaptive Testing
EAP: Expected a Posteriori
ETS: Educational Testing Service
DP: Davey and Parshall
GMAT: Graduate Management Admission Test
GRE: Graduate Record Exam
KS: Kolmogorov-Smirnov test
MCCAT: Modified Constrained Computerized Adaptive Testing
MLE: Maximum Likelihood Estimation
MMM: Modified Multinomial Model
MPI: Maximum Priority Index
MRP: Mixed Random and Prediction
MTI: Minimum Test Information
NCSBN: National Council of State Boards of Nursing
PPT: Paper-and-Pencil Test
R: Random Procedure
ROP: Range-Optimal Item Pool
SH: Sympson-Hetter
SL: Stocking and Lewis Unconditional Multinomial Procedure
SLC: Stocking and Lewis Conditional Multinomial Procedure
STA: Shadow Test Approach
TOEFL: Test of English as a Foreign Language
WDM: Weighted Deviations Model
WPM: Weighted Penalty Model

CHAPTER I
INTRODUCTION

1.1 BACKGROUND

Computerized adaptive testing (CAT) has been demonstrated to be a mature mode of testing, as witnessed by the successful application of CAT to several large-scale educational assessment programs in the last two decades. Examples of these large-scale testing programs include the Graduate Record Examination (GRE), the Graduate Management Admission Test (GMAT), the Test of English as a Foreign Language (TOEFL), the NCLEX® exam series by the National Council of State Boards of Nursing (NCSBN), and the Armed Services Vocational Aptitude Battery (ASVAB). An appealing feature of CAT is its ability to achieve, at the individual examinee level, considerable gains in both measurement precision and efficiency with fewer items than would be required on a conventional paper-and-pencil test (Eggen & Straetmans, 2000; Lewis & Sheehan, 1990; Lord, 1977; Wainer, 2000; Weiss, 1982). This measurement advantage is realized mainly by administering each candidate an individualized test in which each item is sequentially tailored to the examinee's current ability estimate (denoted by θ hereafter). The tailoring is achieved by item selection rules and may take different forms depending on the rule. Currently, the two most popular CAT item selection algorithms are based on maximum Fisher information and on minimum posterior variance (van der Linden & Pashley, 2000). The former selects the item that maximizes Fisher information at the current θ, whereas the latter selects the item that minimizes the posterior variance evaluated at the examinee's current θ.
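To make the two selection criteria concrete, the sketch below computes Fisher information for three-parameter logistic (3PL) items and applies the maximum-information rule. It is an illustration rather than the algorithm of any particular operational program; the three-item pool, the scaling constant D = 1.7, and the current estimate of 0.1 are all hypothetical values.

```python
import numpy as np

def info_3pl(theta, a, b, c, D=1.7):
    """Fisher information of a 3PL item at ability theta."""
    p = c + (1 - c) / (1 + np.exp(-D * a * (theta - b)))
    return (D * a) ** 2 * ((1 - p) / p) * ((p - c) / (1 - c)) ** 2

# Hypothetical three-item pool: columns are a, b, c.
pool = np.array([[1.2, -0.5, 0.2],
                 [0.8,  0.0, 0.2],
                 [1.5,  0.4, 0.2]])

theta_hat = 0.1                      # current ability estimate
infos = info_3pl(theta_hat, pool[:, 0], pool[:, 1], pool[:, 2])
print("item selected:", int(np.argmax(infos)))   # maximum-information rule
```

A posterior-variance-based rule would instead evaluate, for each candidate item, the expected posterior variance of θ after administering it and take the minimum.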
It can be easily anticipated that the optimal tailoring occurs when every item in the pool exactly matches the desired features requested by an item selection rule. For the maximum information selection rule, for example, when the Rasch model is used as the item response model, the optimal tailoring occurs when an item is available whose item difficulty parameter (denoted by b hereafter) equals the current proficiency estimate. A recent study by Reckase and He (2009b) demonstrated negligibly small bias and mean squared error (MSE) in ability estimates when there is optimal tailoring. Other CAT components being equal, a CAT that has every desired item requested by the CAT algorithm in the pool is expected to yield better measurement outcomes than a CAT that does not. In other words, the measurement outcome quality of adaptive testing, like that of any other test, is correlated with item pool quality. Flaugher (2000) discussed the relationship between the quality of the item pools and the job the adaptive algorithm can do:

Obviously, the higher the quality of the item pools, the better the job the adaptive algorithm can do. The best and most sophisticated adaptive program cannot function if it is held in check by a limited pool of items, or items of poor quality (p. 38).

Even in the early 1970s, the inception of CAT research, researchers started to notice and, either implicitly or explicitly, acknowledge that item pool characteristics may play a role in an adaptive test's achieving "the best attainable results" (McBride & Weiss, 1976, p. 9). Following this philosophy, several studies such as Jensema (1972; 1977) and McBride and Weiss (1976) created "ideal" and "perfect" item pools to explore the properties of Owen's Bayesian adaptive ability testing procedures. The characteristics of the "ideal" item pool simulated in McBride and Weiss (1976) followed those in Jensema (1972): item discrimination parameters (denoted by a hereafter) were set as high as possible, preferably exceeding .8; item guessing parameters (denoted by c hereafter) were set equal to .2; and item b parameters were evenly and uniformly distributed along the proficiency scale. A "perfect" item pool differed from an "ideal" item pool mainly in two respects. One was that a "perfect" item pool was created to behave as if it contained an unlimited number of items at any specifiable difficulty level. The other was that the items' b-values were optimal in that they were calculated from a formula given by Birnbaum (1968, p. 464), which defines the location at which an item provides its maximum information given its $a_i$ and $c_i$ and assuming $\hat{\theta}_i = \theta_i$; the items' a-values were then obtained from a linear equation regressing a-values on the optimal b-values. The importance of item pool quality in realizing CAT's measurement quality continued to be emphasized in the CAT research literature of the 1990s and 2000s, such as in Dodd et al. (1995), Embretson (2001), Flaugher (2000), Gorin et al. (2005), van der Linden (1998), Wang and Vispoel (1998), Wang and Kolen (2001), and Xing and Hambleton (2004). Guidelines on the desired features of a CAT item pool are also recommended and discussed in studies such as Luecht (1998), Patsula and Steffen (1997), Stocking (1994), and Way (1998). Stocking (1994) analyzed five operational item pools for five fixed-length CAT tests for the purpose of estimating the pool size sufficient to meet the content and statistical requirements of the CAT tests.
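For the 3PL model, the Birnbaum (1968) result referred to above places the maximum of an item's information function at θ_max = b + (1/(Da)) ln[(1 + √(1 + 8c))/2], which reduces to θ_max = b when c = 0. The small sketch below, with hypothetical parameter values, evaluates the formula and its inversion, the latter being the sense in which the b-values of a "perfect" pool were chosen:

```python
import math

def theta_max_info(b, a, c, D=1.7):
    """Ability at which a 3PL item yields maximum information
    (Birnbaum, 1968); reduces to theta = b when c = 0."""
    return b + math.log((1 + math.sqrt(1 + 8 * c)) / 2) / (D * a)

def optimal_b(theta, a, c, D=1.7):
    """Invert the relation: the b-value a 'perfect' item should have
    so that its information peaks exactly at the given theta."""
    return theta - math.log((1 + math.sqrt(1 + 8 * c)) / 2) / (D * a)

print(theta_max_info(b=0.0, a=1.0, c=0.2))   # about 0.157
print(optimal_b(theta=0.157, a=1.0, c=0.2))  # about 0.0
```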
Based on that examination, Stocking recommended as a rule of thumb that a pool size of six to eight linear forms should be adequate to support a CAT test whose length is one-half that of the linear test. Additionally, Stocking (1994) recommended that a CAT item pool containing 12 times the length of the CAT test be used for high-stakes tests. However, very little was addressed in the Stocking study about what makes a high-quality item pool from the perspective of the items' psychometric properties. In general, the studies described above are inadequate for depicting a full picture of the desired characteristics of an item pool for CAT. First, there is no universal understanding of the desired characteristics of an item pool. Second, very little attention has been paid to how the uniqueness of a particular CAT algorithm might affect item pool features. For example, "high-quality item pools" may look different for a CAT that aims at measuring examinees' abilities equally well over a certain interval and a CAT that aims at classifying examinees into pass/fail categories by using a cut score. Third, the relationship between the target examinee population and item pool features has barely received any attention. It was not until the late 1990s that item pool assembly and design for CAT developed into independent topics that began to receive wide attention from researchers. Examples of studies on item pool assembly can be seen in Ariel, Veldkamp, and van der Linden (2004), Belov and Armstrong (2009), Stocking and Swanson (1998), van der Linden, Ariel, and Veldkamp (2006), and Way, Steffen, and Anderson (1998). Compared with studies on item pool assembly, fewer studies have been conducted on item pool design; existing studies include Gu (2007), Reckase (2003), Reckase and He (2004; 2009a), and Veldkamp and van der Linden (2000). Item pool assembly and item pool design are two distinct notions. van der Linden, Ariel, and Veldkamp (2006) provided a clear explanation of their differences. According to them, in an item pool design problem, no actual items are available yet. By simulating adaptive test administrations, an optimal item pool blueprint is generated in which the distribution of the numbers of items over the space of all possible combinations of the relevant statistical and non-statistical attributes of the items is described. In other words, designing an item pool is similar to painting on a sheet of white paper, which may require careful planning before setting to work with paint brushes in order to produce a good piece of artwork. In an item pool assembly problem, however, the actual items are already available in a master pool, and what needs to be done is to assemble an item pool from this available master pool according to the desired specifications. This study focuses on item pool design only. Terms such as item pool design, development, and generation are used interchangeably. Since there are no actual items available when designing an item pool, designing an item pool that is optimal intuitively becomes a desired goal. Then a series of questions arises: What constitutes the desired features of an optimal item pool, both statistical and non-statistical? How can we identify the desired features of an optimal item pool? How large an item pool is considered appropriate for a CAT program, and how can we estimate the optimal size?
1.2 STATEMENT OF RESEARCH QUESTIONS

Intuitively, we would expect optimal item pool features for different CAT programs to differ due to various factors in the design of an adaptive test. These factors include the item selection algorithm, constraints on item content, the exposure control procedure, the termination rule, overlap restrictions, and the target examinee population. For example, it is reasonable to speculate that a long adaptive test may require more items in the item pool than a short one, or that, for the same adaptive test, implementation of an item exposure control procedure may require more items in the item pool than no exposure control at all. Despite the expected differences in optimal features, when it comes to optimal item pool design, our ultimate goal is to design an item pool that can accommodate the uniqueness of a specific CAT algorithm and contribute to the realization of the best attainable measurement outcomes no matter what the adaptive algorithm is. This expectation is reflected in several definitions of the optimal CAT item pool, which, from different perspectives, capture the key factors that should be considered in item pool design. For example, van der Linden, Ariel, and Veldkamp (2006) defined an optimal item pool as one

consist(ing) of a maximal number of combinations of items that (a) meet all content specifications for the test and (b) are most informative at a series of ability levels reflecting the shape of the distribution of the ability estimates for a population of examinees (p. 82).

Veldkamp and van der Linden (2000) pointed out that the optimal blueprint, the product of the item pool design effort, specifies what attributes the items in the CAT pool should have and how many items of each type are needed. Reckase (2003) defined an optimal CAT item pool as one that always has an item available for selection that matches the characteristics specified by the item selection rule. For example, the maximum item information selection method for the Rasch model requires an item with a b-value equal to the current proficiency estimate. Reckase further pointed out that the characteristics of the optimal pool depend on the item selection rule, the stopping rule, the examinee population, and the item exposure control procedure. Because an optimal item pool would be prohibitively large if items were available for every possible proficiency estimate, the notion of the optimal item pool was later modified into the range-optimal item pool (ROP; Reckase & He, 2009b). How a ROP is developed will be discussed shortly. To summarize, the definitions described above point to at least three basic elements that need to be considered when designing an optimal item pool. That is, no matter what the adaptive algorithm is, the optimal item pool features should address statistical features, non-statistical features, and item pool size. Statistical features may include item parameters, whereas non-statistical features may include content specifications, key distribution, and cognitive skills, to name a few. The previous literature on optimal CAT item pool design is focused on two major approaches, discussed respectively in Veldkamp and van der Linden (2000) and in Reckase (2003), Reckase and He (2004; 2009a), and Gu (2007). They represent two lines of approaches: mathematical programming and heuristic.
For the former, the CAT is administered by the shadow-test approach (STA; van der Linden & Reese, 1998), realized by 0-1 (binary) linear integer programming in which an objective function is maximized subject to a set of specific constraints. The key point differentiating the STA from other approaches is that, in the STA, items are not selected directly from the pool but from a shadow test, i.e., a full-size test assembled prior to selecting each item in the adaptive test. A detailed description of the STA is given in the second chapter. This line of research argues that an item pool designed by means of the STA should be optimal because adaptive testing with shadow tests can guarantee test administrations that always meet all content specifications while the item selected at each step is optimal for ability estimation. However, as Chang (2007) and Robin et al. (2005) pointed out, one potential limitation of this method is that commercial software such as CPLEX or LINDO has to be counted on to obtain the solution. As a result, the source code may not be accessible to end users, posing difficulty for practitioners in that they have no control over the program's refinement or modification if needed. What is more, the solution may not always be feasible.

The approach developed by Reckase (2003) can be viewed as heuristic in nature. Unlike the binary linear programming method, Reckase's method is straightforward and easy to handle. The basic idea behind Reckase's method in the case of the Rasch model uses a set of "bins", each of which covers a certain width on the proficiency scale and collects items, and a "union" mechanism, which is used to determine the item pool size. To design an item pool, first of all, the item pool is partitioned into smaller ones according to a non-statistical attribute, such as content area. The simulation starts with an examinee who is randomly sampled from the expected population and administered the target CAT test. Each item administered is assumed to be optimal because it satisfies not only the statistical but also the non-statistical constraints. The items administered to this examinee are allocated to the bins, and the same procedure is then repeated for subsequent examinees. Items in each bin are treated as equivalent in use, and the bins are treated as mutually exclusive. Because items selected for one person can be used for another, the ideal item pool is the union of the item sets administered to the individual examinees. Using a large number of examinees from the expected examinee population, it can be anticipated that the number of items that need to be added will diminish as more examinees are sampled and that, ultimately, the pool size will asymptote to a value that satisfies the requirements of all sampled examinees. Thus, the end products of the above procedures include the item pool size, the item number distribution, and the items' psychometric properties. The successful application and extension of this method to operational CAT programs are reported in Reckase and He (2004; 2005; 2009a) and Gu (2007). However, those applications are restricted to a CAT test with only one non-statistical constraint, i.e., the attribute used to partition the item pool. Further research is needed if a CAT is expected to satisfy a complex set of content constraints, a common practice in operational adaptive testing programs. For example, a verbal adaptive test in Stocking and Swanson (1993) had 41 non-statistical constraints to satisfy.
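The asymptote argument is easy to reproduce in simulation. The following sketch is a minimal, hypothetical rendering for a Rasch-based, fixed-length CAT with a single content area: bins 0.4 wide on the θ scale, EAP scoring over a grid, and a union formed by keeping, for each bin, the largest count that any single examinee required. None of these settings are taken from the operational template used in this study; they are placeholders that show the mechanism.

```python
import numpy as np

rng = np.random.default_rng(7)
WIDTH, TEST_LEN, N_EXAMINEES = 0.4, 30, 2000
GRID = np.linspace(-4, 4, 81)
PRIOR = np.exp(-0.5 * GRID ** 2)            # standard normal prior

def rasch_p(theta, b):
    return 1.0 / (1.0 + np.exp(-(theta - b)))

pool_needs = {}     # bin index -> number of items the pool must hold
growth = []         # running pool size after each simulated examinee
for _ in range(N_EXAMINEES):
    theta_true = rng.normal()
    like, theta_hat, used = PRIOR.copy(), 0.0, {}
    for _ in range(TEST_LEN):
        j = round(theta_hat / WIDTH)        # nearest b-bin
        used[j] = used.get(j, 0) + 1        # one more item drawn from bin j
        b = j * WIDTH                       # bin centre acts as the item's b
        u = rng.random() < rasch_p(theta_true, b)
        like *= rasch_p(GRID, b) if u else 1 - rasch_p(GRID, b)
        theta_hat = float(np.sum(GRID * like) / np.sum(like))   # EAP update
    for j, n in used.items():               # union step across examinees
        pool_needs[j] = max(pool_needs.get(j, 0), n)
    growth.append(sum(pool_needs.values()))

print("pool size after union:", growth[-1])
```

Plotting growth against the examinee index reproduces the leveling-off pattern of pool size described above.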
In adaptive testing, an item has to be sequentially selected that provides maximum information at the updated θ and at the same time meets the non-statistical constraint requirements. A common solution for realizing this goal is to force the item selection algorithm to combine the objective of maximum information with a strategy that imposes the same set of non-statistical constraints on the item selection for each examinee (van der Linden, 2005a). So far, four approaches have been developed to deal with complex content constraints: the shadow test approach, the weighted deviations model (WDM; Stocking & Swanson, 1993), the weighted penalty model (WPM; Shin et al., 2009), and the maximum priority index (MPI; Cheng & Chang, 2009). Among these four methods, the WDM is widely used in operational testing programs (Buyske, 2005); examples include the GRE and ACCUPLACER™. Therefore, this study intends to extend Reckase (2003) and Gu (2007) to design an optimal item pool for a CAT using the WDM item selection approach. Using an operational CAT program as a template, the following research questions are addressed:

Q1. What are the desired features of the optimal item pools for a CAT test using the WDM item selection procedure? The desired features include the optimal pool size, the distribution of the numbers of items, the items' statistical (i.e., psychometric) properties, and the items' non-statistical attribute distributions.

Q2. How do the optimal item pools perform in comparison to the operational item pool (OP) in light of pool size, pool utilization, constraint management, measurement accuracy and precision, and classification accuracy?

1.3 SIGNIFICANCE OF THE STUDY

Any CAT program can be viewed as unique due to the interaction among many factors such as the CAT algorithm and the target examinee population. As a result, the desired pool features may vary across CAT programs. The methodology presented in this study is very easy to implement and extremely helpful in identifying the desired item pool features. The end product of this methodology can be viewed as an item pool specifically tailored to the target CAT program and is therefore expected to ensure high-quality measurement outcomes. Once the desired item pool features are identified, they can serve multiple purposes. First, they can shed light on the best attainable measurement outcomes that a CAT algorithm can achieve. This objective can easily be achieved through simulation in which the CAT in question is administered to the same target examinee sample using the optimal and the operational item pools respectively, and the results are then compared. Second, they can serve as a template for future item pool assembly and at the same time provide meaningful guidance for monitoring item writing and item pool maintenance. The methodology discussed in this study extends Reckase (2003) and Gu (2007) to CAT programs with a complex set of non-statistical attributes, as is common in operational CAT programs. By manipulating several key elements that are expected to affect the application of the proposed methodology, this study also provides detailed guidance on the effective application and adaptation of this method to other operational CAT programs.
CHAPTER II
LITERATURE REVIEW

2.1 INTRODUCTION TO COMPUTERIZED ADAPTIVE TESTING

Adaptive testing provides a feasible solution to the problem that very little information about examinees' abilities can be learned if the items are either far too difficult or far too easy for them. By matching item difficulty to the examinee's ability level, much more information about the examinee's ability can be gained, thereby improving measurement precision and efficiency. In fact, the idea of adaptive testing is not new. Its original use can be dated back to Alfred Binet's (1905) intelligence test, whose administration employed an adaptive strategy. Focusing on diagnosing an individual's rather than a group's intelligence, the Binet-Simon test, with items sorted according to mental group or age group, was administered in such a way that the examinee started the test with the item set deemed appropriate for his or her age group; depending on the examinee's responses, the item sets administered were adjusted until the examinee's appropriate mental group could be identified with sufficient certainty. Testing formats such as Lord's flexilevel test (1971b) and Weiss's stradaptive test (1973) can also be viewed as variants of adaptive testing. The application of CAT to large-scale testing programs in real time can be attributed to two major factors: one is the theoretical foundations laid out by researchers such as Lord (1970, 1971a, 1977, 1980) and Weiss (1976, 1978), including item response theory and item selection strategies borrowed from the bioassay field; the other is the rapid development of computer technology in the 1980s, which enables instantaneous computation.

The potential advantages of CAT have been well addressed in several studies (e.g., Wainer, 2000; Way, 1998). In summary, they include shorter tests, enhanced accuracy and efficiency in trait estimation, immediate score reporting, greater flexibility in test scheduling and management, and easier adoption of innovative item formats. Since the end of the last century, however, test security has become a very thorny and challenging issue in CAT, mainly due to the nature of CAT as a type of continuous testing in which items tend to be used for a certain period of time, leaving many opportunities for item theft. Once operational items become known through venues such as being posted on a website, as tends to occur these days, later examinees are very likely to benefit from item pre-knowledge by receiving inflated scores.

To administer a CAT, at least six components are necessary: 1) a pre-calibrated item pool; 2) a psychometric (e.g., item response theory) model;
3) an item selection rule, i.e., the method used to adaptively select an item for administration; examples include maximum Fisher information and Owen's Bayesian item selection procedure (Owen, 1975); 4) a starting point, i.e., the location where an examinee is assumed to be on the proficiency scale before the test starts; 5) a scoring rule, i.e., the method used to sequentially update the examinee's interim ability estimate; examples include maximum likelihood estimation (MLE; Birnbaum, 1968) and expected a posteriori (EAP; Bock & Mislevy, 1982); and 6) a termination rule, i.e., the criteria used to stop administration of a test; examples include the fixed-length test, in which all examinees are administered a test of the same length, and the variable-length test, in which, in general, the level of measurement precision of the final trait estimate is used to determine whether to stop the test. In addition, item exposure control and content balancing procedures are often two important components that need to be considered by a CAT program. The former is meant to ensure test security, whereas the latter is meant to ensure test validity. The presence of these two components imposes constraints on CAT, resulting in the so-called constrained CAT (CCAT). How these two constraints are realized in item selection will be discussed shortly. CAT works in a very simple and straightforward way. Figure 2.1 presents a flowchart for a typical CAT administration. As can be observed in this figure, CAT administration is an iterative process in which the provisional ability estimate is updated immediately after an item is administered, and this process continues until the test termination rule is satisfied.

Figure 2.1 A flowchart of CAT administration (select the first item for administration; score the item; update θ; evaluate against the termination rule; if the rule is not satisfied, select another item based on the updated θ)
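The loop in Figure 2.1 maps directly onto code. Below is a minimal sketch that assumes a Rasch item pool, maximum-information selection (which, under the Rasch model, amounts to choosing the unused item whose b-value is closest to the current estimate), simulated responses, EAP scoring over a grid, and a fixed-length termination rule; the pool and all settings are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
pool_b = np.linspace(-3, 3, 200)        # hypothetical Rasch item pool
GRID = np.linspace(-4, 4, 81)
PRIOR = np.exp(-0.5 * GRID ** 2)

def cat_once(theta_true, test_len=20):
    theta_hat, like, seen = 0.0, PRIOR.copy(), set()
    for _ in range(test_len):                 # termination rule
        # item selection rule: unused item with b closest to the estimate
        i = min((k for k in range(len(pool_b)) if k not in seen),
                key=lambda k: abs(pool_b[k] - theta_hat))
        seen.add(i)
        # score the (simulated) response under the Rasch model
        u = rng.random() < 1 / (1 + np.exp(-(theta_true - pool_b[i])))
        # scoring rule: EAP update of the interim ability estimate
        p = 1 / (1 + np.exp(-(GRID - pool_b[i])))
        like *= p if u else 1 - p
        theta_hat = float(np.sum(GRID * like) / np.sum(like))
    return theta_hat

print(cat_once(theta_true=0.8))
```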
2.2 ITEM POOL FEATURES, POPULATION DISTRIBUTIONS, AND OTHER CAT COMPONENTS

The item pool serves as a resource for the creation of a CAT whose goals, according to Parshall, Davey, and Nering (1998), are three-fold: 1) to maximize measurement precision by selecting the item that maximizes the information or the posterior precision at the examinee's current ability level; 2) to ensure that tests measure the same traits for each examinee by administering a content-balanced test; and 3) to protect the security of the item bank by controlling the rates at which items are administered. These three goals are more often than not in conflict with one another. Stocking and Lewis (2000) compared the item selection problem in CAT to an inflated balloon: pushing against one side may address one issue but will cause a bulge on another side of the balloon. With these goals in conflict with one another, optimal item pool design for a CAT should seek to balance them.

For the first goal, maximizing measurement precision, the maximization in practice tends to occur over a targeted range of ability levels. This goal requires item pools to have characteristics that make them appropriate measurement tools for the targeted range of ability levels. First, items should be of high quality in the sense of carrying characteristics that match the item selection rule. For example, when the maximum item information criterion is used for item selection under the three-parameter item response theory model, high-quality items can be defined by their discriminating power. In general, the higher the discriminating power an item has, the more information the item provides. The desired item characteristics may be different if another item selection criterion is used. Second, the item pool characteristics should take the targeted examinee population into account so as to provide maximum measurement precision where the examinees are located. For example, it is expected that a CAT item pool used to select high-ability examinees for gifted programs may not provide a satisfactory level of measurement precision for an exam that is used to place low-ability examinees in remedial courses. For the former, an item pool with most of the items located at the high-ability level might be more appropriate, whereas for the latter, an item pool with most of the items located at the low-ability level might be more appropriate. Dodd, Koch, and De Ayala (1993) indicated that trait estimates in CAT are more accurate when the item pool characteristics and the latent trait distribution of the examinees match each other. In addition to statistical suitability, items should also provide sufficient coverage of the content specifications. An optimal item pool is expected to support assembling a content-balanced CAT for each individual examinee according to the target test specification. Research (e.g., Chang & Ying, 1999; Cheng & Chang, 2009; Way, 1998) has documented that maximizing measurement precision may come at the cost of overexposing certain items. For example, when the three-parameter item response theory model is applied, the CAT algorithm tends to capitalize on the differential discriminating power of the items in the pool, resulting in disproportionate usage of the item pool. To equalize item usage, an item exposure control procedure is often implemented for the purpose of protecting item pool integrity. Some studies (Chang & Ansley, 2003) have documented the trade-off between item exposure control and measurement precision. Therefore, an optimal item pool is expected to ease this tension by protecting test security without compromising measurement precision.

To summarize, an optimally designed item pool for a CAT should soothe the tension among these conflicting goals so that they can be realized in a well-balanced and satisfactory manner. An optimal item pool is expected to provide desirable measurement accuracy and precision, make efficient use of items, ensure balanced content coverage, and protect test security.
2.3 PRACTICAL CONSTRAINTS IN ITEM SELECTION IN CAT

Most CAT programs are constrained in that an item selected for administration is expected not only to maximize statistical information at the current θ but also to satisfy pre-specified non-statistical constraints, typically a set of content specifications defined in terms of the combinations of attributes the items in the test should have. In addition, item exposure issues are important factors that are always attended to in CAT for the sake of test security.

2.3.1 CONTENT-BALANCING TECHNIQUES

In a conventional paper-and-pencil test (PPT), the requirement that each individual test have the same content specification is easily met, since every examinee is administered the same test. For a CAT, however, this requirement has to be realized by forcing the item selection algorithm to combine the objective of maximizing information with a strategy that imposes the same set of content specifications on the items selected for administration (van der Linden, 2005a). The procedure for ensuring the same set of content specifications for each individual CAT is called content balancing. Several approaches have been proposed to ensure content balancing. These approaches include Kingsbury and Zara's (1991) constrained CAT method, the weighted deviations model (WDM) approach (Stocking & Swanson, 1993), the shadow-test approach (STA; van der Linden & Reese, 1998), the modified multinomial model (MMM; Chen & Ankenmann, 2004), the modified CCAT (MCCAT; Leung, Chang, & Hau, 2003b), the two-phase item selection procedure for flexible content balancing (Cheng, Chang, & Yi, 2007), the weighted penalty model (WPM; Shin et al., 2009), and the maximum priority index (MPI) method (Cheng & Chang, 2009). Comparative studies of the performance of some of these methods can be found in studies such as those by Cheng, Chang, and Yi (2007), Cheng and Chang (2009), Leung, Chang, and Hau (2003a), and van der Linden (2005). Among the above methods, CCAT, MCCAT, and MMM can be viewed as methods along the same line, in that the item pool is partitioned into several sub-pools by a key attribute such as content area and items are spirally selected across the different sub-pools to meet the pre-specified objective. This line of methods is limited to the situation in which an item carries only one attribute, i.e., the one used to partition the item pool. Comparatively speaking, the STA, the WDM, the WPM, and the MPI are more flexible in dealing with a large set of item constraints. The STA is a mathematical programming method, whereas the other three are heuristic. Among these four methods, the WDM is widely used in several operational testing programs (Buyske, 2005), and the STA is a method that has been widely researched in the CAT literature. Provided below are descriptions of the WDM and the STA.

2.3.1.1 WEIGHTED DEVIATIONS MODEL

The weighted deviations model (WDM) method, originally developed by Stocking and Swanson (1993) out of concern about possibly poor-quality item pools in large-scale test assembly, is perhaps one of the most popular heuristic methods. The WDM explicitly accounts for non-statistical and statistical item properties, with the desired balance between measurement and construct concerns reflected by the weights selected by the test designers. Unlike in the STA, the content specifications in the WDM are formulated as goals rather than constraints. Deviations from the content targets are weighted and incorporated into the objective function together with the distance of the current item information from the target value. In CAT, the WDM approach sequentially selects the item with the smallest sum of weighted deviations. The WDM heuristic essentially consists of three steps when selecting an item. First, for every item not already in the test, the deviation for each of the constraints is computed as if the item were added to the test. Second, the weighted deviations across all constraints are summed. Finally, the item with the smallest weighted sum of deviations is selected. The formal statement of the WDM is provided below.
Let $N$ denote the number of items in the item pool, $K$ the number of constraints, $w_k$ the weight assigned to constraint $k$, and $L_k$ and $U_k$ the lower and upper bounds for constraint $k$, respectively. Let $d_{L_k}$ and $d_{U_k}$ denote the deficit from the lower bound and the surplus over the upper bound, respectively; let $e_{L_k}$ and $e_{U_k}$ denote the excess over the lower bound and the deficit from the upper bound, respectively; and let $d_\theta$ and $e_\theta$ denote the deficit from and the excess over the target information $I_0$. $g_{ik}$ is 1 if item $i$ has property $k$ and 0 otherwise. $x_i$ is a binary decision variable: it equals 1 if the $i$th item is included in the test and 0 otherwise. The model is formulated in [1]:

Minimize
$$\sum_{k=1}^{K} w_k d_{L_k} + \sum_{k=1}^{K} w_k d_{U_k} + w_\theta d_\theta \qquad [1]$$

subject to
$$\sum_{i=1}^{N} g_{ik} x_i + d_{L_k} - e_{L_k} = L_k, \quad k = 1, \ldots, K, \quad \text{for the lower bounds;}$$
$$\sum_{i=1}^{N} g_{ik} x_i - d_{U_k} + e_{U_k} = U_k, \quad k = 1, \ldots, K, \quad \text{for the upper bounds;}$$
$$\sum_{i=1}^{N} I_i(\theta) x_i + d_\theta - e_\theta = I_0;$$
$$d_{L_k}, d_{U_k}, e_{L_k}, e_{U_k} \geq 0, \quad k = 1, \ldots, K;$$
$$d_\theta, e_\theta \geq 0;$$
$$x_i \in \{0, 1\}, \quad i = 1, \ldots, N.$$

Ideally, when the WDM is used in CAT for sequential item selection, the item selected for administration is as informative as possible at the examinee's estimated ability level while at the same time contributing as much as possible to the satisfaction of all other constraints. The WDM has several advantages. Like any other heuristic approach, the WDM can always ensure a feasible solution in item selection. Test specialists can assign the weights flexibly, based on the priority placed on the content specifications and on measurement precision. The disadvantage of this algorithm is the uncertainty in balancing the content specifications. Studies have documented that the WDM may violate some of the constraints (Cheng & Chang, 2009; Robin et al., 2005; van der Linden, 2005). As an added concern, the WDM usually takes a considerable amount of time to adjust the heuristic, i.e., to find the best weights, for a new problem.
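A minimal sketch of a single WDM selection step is given below. It simplifies the Stocking and Swanson (1993) heuristic in one respect: the projected deficit for a constraint is counted only when the lower bound could no longer be reached even if all remaining slots were devoted to that constraint. The incidence matrix, bounds, weights, and information values in the usage example are hypothetical.

```python
import numpy as np

def wdm_pick(cands, g, counts, L, U, w, infos, cur_info, target, w_info,
             slots_left):
    """One WDM selection step: return the candidate item with the
    smallest weighted sum of projected deviations (simplified)."""
    best, best_dev = None, np.inf
    for i in cands:
        dev = 0.0
        for k in range(len(L)):
            n_k = counts[k] + g[i, k]    # count for k if item i were added
            # deficit: lower bound unreachable even using remaining slots
            dev += w[k] * max(0, L[k] - n_k - (slots_left - 1))
            dev += w[k] * max(0, n_k - U[k])          # surplus over U
        dev += w_info * max(0, target - (cur_info + infos[i]))
        if dev < best_dev:
            best, best_dev = i, dev
    return best

# Hypothetical two-constraint example.
g = np.array([[1, 0], [0, 1], [1, 1]])    # item-by-property incidence
print(wdm_pick(cands=[0, 1, 2], g=g, counts=[0, 0], L=[1, 1], U=[2, 2],
               w=[1.0, 1.0], infos=[0.3, 0.5, 0.4], cur_info=0.0,
               target=6.0, w_info=1.0, slots_left=4))
```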
In general, the way that the STA works falls into a category called constrained sequential optimization which typically includes two types of test specifications: objectives and constraints. In the STA, the statistical information from the test items at the current ability estimate can be viewed as the objective function to be optimized and all other specifications can be treated as constraints subject to which the optimization has to take place (van der Linden, 1998; van der Linden, Ariel, & Veldkamp, 2006; Veldkamp & van der Linden, 2000). One of the STA’s merits is that it can guarantee non-violation of test specification. It is very flexible as well. However, there is a tradeoff between the speed and optimality of its solution. For larger problems, exact solution may not be possible in realistic times (van der Linden, 1998). In addition, implementation of STA has to rely on the commercial software and may pose difficulty to practitioners if they want to modify and refine the source codes (Chang, 2007). Robin et al. (2005) conducted a comparative study on the performance of the WDM and the STA using item pools from three existing CAT programs at Educational Testing Service (ETS). Their results indicated that, in general, the STA does not produce dramatically better results than the WDM. The STA does not violate any content objective whereas the WDM has low rates of violations for some minor constraints. In 22 terms of psychometric quality and resource usage, both the STA and the WDM perform very similarly. Robin and his collaborators (2005) concluded that there was no compelling reason to believe that many of the practical issues that have arisen in the past few years in CAT can be cured simply by switching algorithms. Robin and his collaborators also discussed the concerns over using the STA at the end of their study. 2.3.2 ITEM EXPOSURE CONTROL PROCEDURE The implementation of item exposure control procedures in CAT aims at maintaining test security by constraining the administration of more popular items that would otherwise become compromised due to repeated administration. In CAT, the item selection rule generally seeks an item that can provide the maximum information at the current ability estimate, which, tends to pick certain items too often causing the issue of overexposure. When items become overexposed, examinees may become familiar with these items even before the actual test and have inflated test scores as a result; those overexposed items may become decreasingly less difficult. A detailed summary of the CAT item exposure control procedures developed between 1983 and 2005 is described in Georgiadou, Triantafillou, and Economides generally grouped into five categories: 1) randomization, 2) conditional selection, 3) stratified, 4) combined, and 5) multiple stage adaptive test design procedures. In randomized item selection, the next item to be selected is randomly chosen out of a group of N most optimal items. Procedures such as the 5-4-3-2-1 proposed by McBride and Martin (1983) and the randomesque procedure (Kingsbury & Zara, 1991) belong to the first category. In conditional item selection, the probability that a selected item is 23 administered is conditioned on the frequency with which the item is selected within a particular targeted population. The most fundamental, perhaps also the most commonly- used conditional selection procedure, is the Sympson-Hetter (SH) procedure (Hetter & Sympson, 1997; Sympson & Hetter, 1985). 
Based on the SH, a series of other conditional selection procedures has been developed, such as the Davey and Parshall (DP) procedure (Davey & Parshall, 1995; Parshall et al., 1998), the Stocking and Lewis unconditional multinomial (SL) procedure (Stocking & Lewis, 1995), and the Stocking and Lewis conditional multinomial (SLC) procedure (Stocking & Lewis, 1998). Chang and Ying (1999) proposed the a-stratified procedure and indicated that this procedure can satisfactorily control item exposure by better balancing item use rates. This procedure has been further explored by incorporating additional elements such as the SH item exposure control procedure and content balancing (Chang et al., 2001; Leung et al., 2003). As its name suggests, a combined strategy attempts to combine different methods to develop more robust strategies that can perform better than an individual strategy alone. Examples of combined strategies include the Progressive Restricted strategy (Revuelta & Ponsoda, 1998), Nering, Davey, and Thompson's Hybrid strategy (Nering, Davey, & Thompson, 1998), and content constraints in a-stratified adaptive testing using a shadow-test approach (van der Linden & Chang, 2005b). Several studies (e.g., Chang & Ansley, 2003; Chang & Twu, 1998; Chang et al., 2000; Chang et al., 2003; Davey & Parshall, 1995; French & Thompson, 2003; Revuelta & Ponsoda, 1998) have been conducted to evaluate the performance of different item exposure strategies. The study by Chang and Ansley (2003), which systematically compared the performance of five exposure control algorithms, reported that the SLC procedure best serves the purposes of controlling the observed exposure rates to the desired values as well as producing the lowest test overlap rate. In addition, they reported tradeoffs between item exposure control and measurement precision.

2.4 OPTIMAL ITEM POOL DESIGN METHODS FOR CAT

As discussed in the first chapter, item pool design, as distinct from item pool assembly, aims at generating an optimal blueprint which can guide item pool assembly. So far, two major methods have been proposed to design optimal item pools for CAT. One is the mathematical linear programming approach presented in Veldkamp and van der Linden (2000), and the other is the bin-and-union method originally presented in Reckase (2003) and later extended by Gu (2007).

The approach discussed in Boekkooi-Timminga (1990) can be viewed in some sense as a prototype of applying linear programming to design IRT-based item pools for both linear tests and CAT. A brief description of this approach is provided here because, as will be revealed shortly, the optimal item pool design approaches presented in both Veldkamp and van der Linden (2000) and Reckase (2003) borrow some elements from Boekkooi-Timminga (1990). To determine the characteristics of the desired item pool, Boekkooi-Timminga partitioned the ability continuum into several intervals (also called clusters), assuming the items in the same interval to have equal information functions. Using a sequential approach, Boekkooi-Timminga calculated the numbers of items needed for the test forms by maximizing their information functions based on the Rasch model. Since items were collected in mutually exclusive clusters, the item number distribution could be determined. Boekkooi-Timminga also demonstrated how optimal item features could be used to determine whether an existing item pool can meet test construction requirements.
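The clustering logic is easy to make concrete: under the Rasch model the item information function depends only on the distance between θ and b, so items whose difficulties fall in the same narrow interval have nearly identical information functions and can be counted together. The interval edges below are hypothetical and only illustrate the bookkeeping.

```python
import math

def rasch_info(theta, b):
    """Rasch item information, I(theta) = P(1 - P)."""
    p = 1.0 / (1.0 + math.exp(-(theta - b)))
    return p * (1.0 - p)

def cluster_index(b, edges):
    """Index of the ability interval (cluster) containing difficulty b."""
    for k in range(len(edges) - 1):
        if edges[k] <= b < edges[k + 1]:
            return k
    return None  # outside the partitioned range

# Hypothetical partition of the ability continuum into five clusters.
edges = [-3.0, -1.8, -0.6, 0.6, 1.8, 3.0]
```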
2.4.1 THE BINARY INTEGER PROGRAMMING METHOD

The method used in Veldkamp and van der Linden (2000) to design an item pool for CAT can be viewed as a CAT version of the method described in van der Linden, Veldkamp, and Reese (2000), in that both studies adopted a binary integer programming method to produce an optimal blueprint. In general, the whole process involves four major steps. First, the set of specifications for the CAT is analyzed, and all item attributes are identified and formulated in a series of classification tables. These classification tables are set up for categorical and quantitative attributes respectively. Categorical attributes are partitioned by their Cartesian products. Quantitative attributes, such as item parameters, are partitioned into several clusters; for example, the item difficulty parameter can be divided into intervals like $(-\infty, -2.5), (-2.5, 2), (2, 2.5), (2.5, \infty)$, and the same approach is used for the item discrimination parameter. Consequently, a blueprint having C-by-Q cells is formulated, where C denotes the number of cells from the categorical classification tables and Q the number from the quantitative tables. Second, using this table, an integer programming model to assemble the shadow tests in the CAT simulation is formulated. Third, the CAT is administered to simulees from the target examinee population, with the integer programming model used for the shadow tests. The ability distribution of the simulees can be obtained from historic data. Finally, the numbers of times items in each cell of the classification table are administered are counted and collected. These counts are then adjusted to obtain optimal projections of the item exposure rates, and the adjusted counts constitute the final blueprint. In Veldkamp and van der Linden's study, the three-parameter item response theory (3PL IRT) model was used and $c_i$ was fixed at a common value. The Cartesian product of all categorical and quantitative attributes yielded 96 × 126 = 12,096 cells.

2.4.2 THE BIN-AND-UNION METHOD AND ITS EXTENSION

Reckase's bin-and-union approach, originally documented in Reckase (2003), was proposed out of the motivation to identify the "desired features of an item pool" (p. 1). Unlike the integer programming approach used in Veldkamp and van der Linden (2000) for item pool design, this method is very flexible, straightforward, and easy to handle, with end-products including the optimal item pool size, the distribution of item numbers, and the items' psychometric properties. To implement the bin-and-union method, five major procedures are required for each non-statistical attribute area, for example, content area: 1) specify the CAT procedure; 2) simulate the CAT with examinees from the expected population; 3) determine the ROP for each examinee, which is composed of the items administered to that individual examinee; these items are allocated into different bins based on the items' psychometric properties, items in each bin are considered equivalent in use, and the bins are mutually exclusive; 4) find the union of the ROPs over examinees; and 5) if the union of ROPs is formed sequentially after each examinee is sampled, the number of items will asymptote to the ROP pool size given the use of a large sample of examinees. In the Reckase method, the items are collected and allocated in a set of bins. As in Boekkooi-Timminga (1990), all items in each bin are considered equivalent in use and the bins are mutually exclusive.
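One common reading of the union step is as a union of multisets: for each bin, the pool must hold at least as many items as the most demanding single examinee required, so the running pool size grows and then levels off as new examinees stop adding bins. The sketch below assumes each ROP is represented as a Counter mapping bin label to item count; all names are illustrative.

```python
from collections import Counter

def union_of_rops(rops):
    """Sequentially union ROPs (multisets of bin -> item count).

    rops -- iterable of Counter objects, one per simulated examinee
    Returns the final pool (bin -> items needed) and the running pool
    size after each examinee, which should asymptote for large samples.
    """
    pool = Counter()
    growth = []
    for rop in rops:
        for bin_label, n in rop.items():
            pool[bin_label] = max(pool[bin_label], n)  # multiset union
        growth.append(sum(pool.values()))              # current pool size
    return pool, growth
```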
Figure 2.2 illustrates how the item pool increases in size as the number of sampled examinees increases.

Figure 2.2 Increase of item pool size with the increase of the number of examinees. [Figure: item pool size (y-axis, 0 to 600) plotted against the number of examinees (x-axis, 1,000 to 10,000); the curve rises steeply at first and then levels off.]

The successful application of this method can be found in Reckase and He (2005; 2009), in which this methodology was used to design the optimal item pool for an operational CAT program. The results of those two studies indicated that the item pool features identified by this approach, including item pool size, item number distribution, and statistical properties, can sustain the successful implementation of the operational exam by allowing better measurement precision and maintaining test security. The bin width used to collect items, the exposure control procedure implemented by a particular CAT, and content balancing play a key role in the resulting item pool features.

By using the CAT algorithm implemented by the Computerized Adaptive Testing-Armed Services Vocational Aptitude Battery (CAT-ASVAB) as a template, Gu (2007) successfully developed optimal pools that perform better than the operational item pool. Gu extended the Reckase method by using the 3PL IRT model and incorporating the Sympson-Hetter and a-stratified item selection procedures. Specifically, Gu worked out several key components required to determine the optimal pool size and optimal item features, including the bin map constituted by a certain number of blocks, with the items in each of these blocks treated as equivalent. In addition, Gu developed two methods that can generate items with the required features defined by the CAT algorithm. Gu's results indicated that the optimal item pools perform better than the operational item pool in terms of having a smaller item pool size, better measurement accuracy and test security, and more efficient item use. In comparison, the implementation of the binary linear programming method requires special knowledge and software, whereas the bin-and-union method allows practitioners full control over the whole process. Existing research related to Reckase's bin-and-union method, however, is limited to the situation in which the item pool is partitioned into several sub-pools by one key attribute only. If a CAT has to deal with a more complex set of constraints, the bin-and-union method needs to be further extended, and this is the focus of the current study.

CHAPTER III
METHODOLOGY AND RESEARCH DESIGN

3.1 METHODOLOGY

This section documents several key components required to design an optimal item pool based on the bin-and-union method. They include how to define a bin map, how to generate optimal items, how to simulate non-statistical attributes, how to model the WDM procedure, and how to post-adjust the pool size. Descriptions of some of these components can also be found in Gu (2007).

3.1.1 DEFINING A BIN MAP

IRT provides a powerful item selection method in CAT through the use of the item information function. The item information function depicts the contribution that an item makes to ability estimation at points along the proficiency scale. The concept of a 'bin' is used to collect and tally items in Reckase (2003), where the one-parameter logistic IRT model is used.
A 'bin' takes a certain width on the b-parameter or θ scale, and items collected in the same bin can be used interchangeably since there is only a negligible difference in the item information that they can provide. In the case of the one-parameter logistic IRT model, for example, a small distance between the b-parameter and θ causes only a slight reduction in the maximum item information. Figure 3.1 depicts the percentage of the maximum item information that an item with a b-value at a certain distance from θ can provide, as opposed to that provided when the b-value is well matched to θ, under the Rasch model.

Figure 3.1 Percentage of maximum item information conditional on the distance between the b-value and θ. [Figure: percentage of maximum item information (y-axis) plotted against the distance between item difficulty and ability (x-axis, -2.8 to 2.8).]

However, when other IRT models, for example, the two- or three-parameter models, are used, the story is somewhat different, in that the item information is largely determined by an item's discrimination power (i.e., the magnitude of the a-parameter; the item information is generally higher when a takes a high value), while it is the item's difficulty, i.e., the b-parameter, that decides the location at which the contribution of item information is realized. Birnbaum (1968) demonstrated that an item provides its maximum information at $\theta_{max}$, where

$$\theta_{max} = b_i + \frac{1}{Da_i}\ln\left[.5\left(1 + \sqrt{1 + 8c_i}\right)\right] \qquad [2]$$

If $c_i$ equals 0, i.e., in the cases of the one- and two-parameter IRT models, an item provides its maximum information at the ability level exactly equal to its difficulty. If $c_i$ is greater than 0, i.e., in the case of the three-parameter IRT model, an item provides its maximum information at an ability level slightly higher than its difficulty. In summary, among items that have the same or very similar b-values, those having higher a-values can provide more information, given that the c-value is a constant.

The above discussion suggests that when the idea of a 'bin' is used to design the optimal item pool for a CAT program employing the three-parameter IRT model, the contribution of different a-values conditional on a certain range of b must be considered. In other words, the creation of a bin should take not only the b- but also the a-parameter into consideration. We call a bin of this type an ab-block/ab-bin. Lord (1980) further illustrated that the highest information that a logistic item with $a_i$ and $c_i$ can provide is a quadratic function of the a-parameter, given that the c-parameter is a constant. The relationship between the maximum information that an item can provide and the a-parameter is depicted in the following equation:

$$M_i = \frac{D^2 a_i^2}{8(1-c_i)^2}\left[1 - 20c_i - 8c_i^2 + (1+8c_i)^{3/2}\right] \qquad [3]$$

If we slightly rearrange [3] and add a Δ denoting change, [3] becomes [4], which indicates the change in item information produced by a change in the a-parameter:

$$\Delta M = \frac{D^2\left[1 - 20c_i - 8c_i^2 + (1+8c_i)^{3/2}\right]}{8(1-c_i)^2}\,\Delta a^2 \qquad [4]$$

where D equals 1.7. A c-value of .25 leads to $\Delta M = .447\,\Delta a^2$. Since the a-value is conventionally set starting from zero, the boundaries of the a-parameter conditional on the same range of b can be determined step by step given the expected amount of information change. Figure 3.2 provides an illustrative example of a bin map developed by the methods discussed above.
This bin map is composed of ab-bins/ab-blocks, the basic units for collecting and tallying items when the 3PL IRT model is employed. The blocks in this figure were determined by a .4 change in item information when the b-range was set as .4. As a result, a total of 128 ab-blocks were used to collect and tally items. How ab-bins/ab-blocks and b-bins differ is also visually demonstrated in this figure. For example, the shaded block bounded by the b-range between -3.2 and -2.8 and the a-range between 0 and .89443 represents an ab-block in which items with b- and a-parameters falling into these two ranges are considered to be equivalent in use by providing similar item information. Another view of the ab-blocks is presented in Table 3.1. A b-bin can be viewed as a collection of multiple ab-blocks. As in the case of the Rasch model, in which a series of b-bins is used to estimate the items needed in a test, the set-theoretic "union" mechanism is still used to determine the number of items in each ab-block and the item pool size.

Figure 3.2 An illustrative example of ab-bins/ab-blocks. [Figure: grid of ab-blocks with b-boundaries at -3.2, -2.8, ..., 3.2 on the horizontal axis and a-boundaries at 0, 0.89443, 1.2649, 1.5492, 1.7889, 2, 2.1909, 2.3664, and infinity on the vertical axis. Note. Inf = infinite.]

Table 3.1 Another view of ab-bins/ab-blocks

ID    b(lb)   b(ub)   a(lb)     a(ub)
1     -∞      -3.2    0         0.89443
2     -∞      -3.2    0.89443   1.2649
3     -∞      -3.2    1.2649    1.5492
4     -∞      -3.2    1.5492    1.7889
5     -∞      -3.2    1.7889    2
6     -∞      -3.2    2         2.1909
7     -∞      -3.2    2.1909    2.3664
8     -∞      -3.2    2.3664    ∞
9     -3.2    -2.8    0         0.89443
10    -3.2    -2.8    0.89443   1.2649
11    -3.2    -2.8    1.2649    1.5492
12    -3.2    -2.8    1.5492    1.7889
13    -3.2    -2.8    1.7889    2
14    -3.2    -2.8    2         2.1909
15    -3.2    -2.8    2.1909    2.3664
16    -3.2    -2.8    2.3664    ∞

Note. lb = lower bound; ub = upper bound.

3.1.2 GENERATING OPTIMAL ITEMS

The quality of the items in an item pool is an important determinant of the success of a CAT program. Three approaches based on Gu (2007) and McBride and Weiss (1976) were used to generate optimal item features for this study. The first is referred to as the random procedure (R), the second is the mixed random and prediction procedure (MRP), and the third is the minimum test information procedure (MTI). In order for the optimal item features to be useful and realistic for the target operational CAT program, the historic data were analyzed for the information necessary for the simulation. The historic data included the distributions of the operational item parameters, the relationships among the item parameters, the test reliability, and the examinees' ability estimates. According to the analysis of all 314 operational items, the a- and b-parameters of the operational items were not statistically correlated. The a-parameter was normally distributed with a mean of .97938 and a standard deviation of .394903. This normal distribution was confirmed by the Kolmogorov-Smirnov (KS) test, which retained the null hypothesis with p = .2. For the c-parameter, the results indicated that a beta distribution, i.e., c ~ beta(2.734, 15.839), described its distribution better than others. Therefore, a beta distribution was used to generate the c-parameter.
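Returning to the bin map in Figure 3.2 and Table 3.1: since [4] makes ΔM proportional to Δ(a²), successive a-boundaries can be generated by stepping a² in equal increments of ΔM divided by the quadratic coefficient. The sketch below treats the reference c-value as an input, because the exact value used to set the published boundaries is not stated; as the closing comment notes, a coefficient of .5 reproduces the cutpoints in Table 3.1.

```python
import math

def a_boundaries(delta_m, c, n_cuts, D=1.7):
    """Successive a-parameter cutpoints implied by equation [4].

    delta_m -- expected amount of item information change per bin
    c       -- reference c-value at which the coefficient is evaluated
    n_cuts  -- number of boundaries to generate
    """
    k = D**2 * (1 - 20*c - 8*c**2 + (1 + 8*c)**1.5) / (8 * (1 - c)**2)
    # From [4], delta_M = k * delta(a^2): each bin widens a^2 by delta_m / k.
    return [math.sqrt(j * delta_m / k) for j in range(1, n_cuts + 1)]

# c = .25 gives k ~ .447, as in the text; a coefficient of k = .5 with
# delta_m = .4 reproduces the Table 3.1 cutpoints exactly:
# 0.89443, 1.26491, 1.54919, 1.78885, 2.00000, 2.19089, 2.36643
```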
3.1.2.1 RANDOM PROCEDURE (R)

The R procedure can be applied to the situation in which the a- and b-parameters are statistically independent. To generate an optimal item using the R procedure, the following steps were followed. For a specific item, 1) Generated both $a_i$ and $c_i$ from their respective target distributions. 2) Given that both $a_i$ and $c_i$ were already known, calculated $b_i$ using [2], where $\theta_{max}$ was the current ability estimate.

3.1.2.2 MIXED RANDOM AND PREDICTION PROCEDURE (MRP)

As its name suggests, the MRP is a mixed method. The procedure for the random part followed 3.1.2.1, whereas the prediction part primarily followed McBride and Weiss (1976), in which a "perfect" item pool was simulated with optimal item parameters identified by regressing the a- on the b-parameters. To generate the regression equation, all operational items were first divided into three groups based on the magnitude of their b-values. The correlation between the a- and b-parameters for each group was calculated with SPSS, and only the low b-parameter group (i.e., b-values lower than -1.2103) had a statistically significant correlation between the a- and b-parameters. Therefore, a simple regression was run for this particular group, predicting a from b. The regression equation can be written as

$$a_i = 1.414 + .294\,b_i + e_i$$

where $e_i$ is a random component following a normal distribution $N(0, \sigma_e^2)$ and $\sigma_e$ is calculated by $\sigma_e = s_a\sqrt{1 - r_{ab}^2} = .400133$, following McBride and Weiss (1976). To generate an optimal item using the MRP, the following steps were followed. For a specific item, if its $b_i$ value, which can be approximated by the current ability estimate at each step of test administration, was above -1.2103, then the R procedure described in 3.1.2.1 was used to generate the optimal item features. Otherwise, the following procedure was adopted: 1) Generated $c_i$ from the target distribution. 2) Generated $a_i$ with $a_i = 1.414 + .294\,b_i + e_i$, where $b_i$ can be approximated by the ability estimate obtained at each item selection step and $e_i$ was drawn from $N(0, .400133^2)$. 3) Recalculated $b_i$ with [2].

3.1.2.3 MINIMUM TEST INFORMATION PROCEDURE (MTI)

To implement the minimum test information approach, the first step was to specify the target test information. Based on the historic information on the test and the distribution of ability estimates, the target minimum test information can be specified via the following two equations:

$$S_e = S_0\sqrt{1 - r_{xx'}} \qquad [5]$$

$$I_{\hat\theta} = \frac{1}{S_e^2} \qquad [6]$$

where $S_0$ represents the standard deviation of ability estimates, $S_e$ represents the standard error of estimate, $r_{xx'}$ represents the test reliability, and $I_{\hat\theta}$ represents the test information. Once $I_{\hat\theta}$ was known, the expected information that each item should provide could be obtained by dividing $I_{\hat\theta}$ by the test length. Because the actual information that an item provides conditional on the current ability estimate may not be exactly as expected, the target item information needed to be updated once an item was administered. The following formula was used to update the target item information:

$$I_i = \frac{T_{target} - T_{admin}}{L_{target} - L_{admin}} \qquad [7]$$

where T represents test information and L represents test length. According to the historic data, the reliability of the operational CAT program is .91, and examinee ability is distributed with a mean of -.3813 and a standard deviation of .9768. Therefore, the target minimum test information is approximately 12.8.
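Equations [5]-[7] reduce to a few lines of arithmetic, sketched below with hypothetical function names. One caveat: with S0 = .9768 the result is about 11.6, whereas the stated 12.8 is reproduced when the target-population standard deviation of .9318 reported in Section 3.2.1 is used as S0.

```python
import math

def target_test_information(s0, reliability):
    """Target minimum test information via [5] and [6]."""
    se = s0 * math.sqrt(1 - reliability)  # [5] standard error of estimate
    return 1.0 / se**2                    # [6] information = 1 / SE^2

def next_item_target(t_target, t_admin, l_target, l_admin):
    """Updated per-item information target via [7]."""
    return (t_target - t_admin) / (l_target - l_admin)

print(target_test_information(0.9318, 0.91))       # ~12.8
print(target_test_information(0.9318, 0.91) / 20)  # ~0.64 per item at the start
```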
Note that in this study the target test information was set differently for different examinees. For examinees with true abilities between -1.6245 and 1.088235, the target test information was set as 12.8; for examinees with true abilities between 1.088235 and 2.5 or between -1.6245 and -2.5, the target test information was set as 9.8; and for the rest of the examinees, the target test information was set as 6.8. The two numbers -1.6245 and 1.088235, along with -.05397, were the three cut scores used in this study to place examinees into different proficiency levels. To generate an optimal item using the MTI approach, the following steps were followed. For a specific item, 1) Generated $c_i$ from the target distribution. 2) Calculated $a_i^2$ from [8], which was derived by rearranging [3]; $M_i$ can be replaced by $I_i$ from [7]:

$$a_i^2 = \frac{8(1-c_i)^2 M_i}{D^2\left[1 - 20c_i - 8c_i^2 + (1+8c_i)^{3/2}\right]} \qquad [8]$$

3) Given that both $a_i$ and $c_i$ were already known, calculated $b_i$ using [2], where $\theta_{max}$ was the current ability estimate.

3.1.3 MODELING THE WDM PROCEDURE

3.1.3.1 SIMULATING ITEM ATTRIBUTES

For the operational CAT mimicked in this study, Table 3.2 presents the relevant information on the types of constraints, weights, and minimum and maximum bounds of the properties that were expected to be modeled.

Table 3.2 Information on exam constraints and weights

Category            Constraint            Code   Weight   Minimum   Maximum
Item Format         Sentence Correction   C1     10       10        10
                    Construction Shift    C2     10       10        10
Errors              Comma                 C3     10       6         8
                    Coordination          C4     10       6         8
                    Sentence Logic        C5     10       6         8
Content Area        Arts                  C6     5        2         5
                    Practical Affairs     C7     5        5         8
                    Social Science        C8     5        2         5
                    Science               C9     5        2         5
                    Human Sources         C10    5        2         5
Content Diversity   Male Reference        C11    10       0         1
                    Female Reference      C12    10       0         1
                    White                 C13    10       0         1
                    Non-white             C14    10       0         1
Keys                                      C15    1        3         7
                                          C16    1        3         7
                                          C17    1        3         7
                                          C18    1        3         7

As Table 3.2 indicates, five broad categories are used to describe each individual item, and under each broad category several sub-categories are subsumed. For example, there are two sub-categories under Item Format and three sub-categories under Errors. A close examination was undertaken of the descriptions of these item attributes and the distributions of the operational items' attributes. In all, the examination revealed the following: 1) all items possess only one property under the categories Item Format, Errors, and Keys; 2) for Content Area, 271 items come from only one content area while the remaining 43 items span two content areas (these 43 items span content areas C6 and C10); 3) for Content Diversity, 138 items possess none of the properties under this category, 107 items possess only one property, and 69 items possess two properties. Note that an item can possess at most two properties under this category, since C11 and C12 are exclusive of each other, as are C13 and C14. Based on this table, along with the information on the distributions of the operational items' attributes, the following method was used to identify the attributes of each individual optimal item. For an item, a zero row vector of size 1-by-18 was generated, with each cell indicating a specific property. An item was assumed to possess only one property under all categories except Content Diversity. For the category Item Format, one of the two numbers 1 and 2 was generated; if 1 was selected, then a 1 was marked under C1, indicating that the item possessed this attribute. Next, for Errors, a number among 3, 4, and 5 was randomly drawn; if, for example, 3 was selected, a 1 was marked under C3. For Content Area, the item attribute was simulated by referring to the distribution from the real data.
Specifically, a number between 1 and 100 was randomly selected and divided by 100. Based on which of five ranges this number fell into, the content area from C6 to C10 was marked correspondingly. The five ranges were 1) less than .213376, 2) between .213376 and .509554, 3) between .509554 and .665605, 4) between .665605 and .818471, and 5) greater than .818471. The procedure used to identify the category under Content Diversity was analogous to what was done for Content Area in that the distribution from the real data was followed. A number was first generated from the uniform distribution between 0 and 1. If this number was less than .43949, then the item possessed none of the properties under Content Diversity. If this number was between .43949 and .780255, then a number between 11 and 14 was randomly selected and a 1 was marked in the corresponding cell of the row vector. If this number was greater than .780255, then the item was allowed to possess two properties: one from either C11 or C12 and the other from either C13 or C14. For Keys, a number between 15 and 18 was randomly drawn and a 1 was marked in the cell corresponding to the selected number.

3.1.3.2 MODELING THE WDM ITEM SELECTION PROCEDURE

Recall that when the WDM is used for item selection, the item that has the smallest sum of deviations is selected for administration. The deviation sum consists of two components: one comes from the weighted deviation from the target item information, and the other comes from the weighted sum of deviations from the lower and upper bounds. Since the item parameters generated by the three different methods are expected to be optimal, minimizing the total weighted sum of deviations is equivalent to minimizing the weighted sum of deviations from the lower and upper bounds. Based on this logic, the WDM item selection procedure was modeled through the following steps:

1) Generated an item for administration with the R, MRP, or MTI procedure.
2) For this item, generated 322 different combinations of item attributes based on the description in 3.1.3.1. The number 322 was used because it equals the number of all possible combinations of the attributes simulated in this study. Recall that the WDM works in such a way that, for every item not already in the test, the weighted sum of deviations has to be computed as if the item were added to the test. Therefore, from the second item forward, the item to be included in the exam should automatically take into account the attribute(s) of the previous item(s) by not possessing the attributes that have already been satisfied by the items administered. That is to say, each of the 322 combinations of item attributes generated for each individual item had to satisfy this requirement.
3) Computed the weighted sum based on [1]. Since 322 different combinations of item attributes were generated for each item, each item was expected to have 322 weighted sums. The combination with the smallest weighted sum of deviations identified the optimal item for administration. The lower bounds, upper bounds, and weights followed those used in the operational exam.
4) Calculated the probability of a correct response for this optimal item using the target IRT model. The probability was compared with a random number generated from the uniform distribution between 0 and 1. If the probability of a correct response was greater than the randomly generated number, the item was scored 1 for a correct response, and 0 otherwise.
5) Updated the estimate of θ.
6) Repeated the above steps until the target test length was reached.
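The attribute draw in 3.1.3.1, which feeds Step 2 above, is summarized in the sketch below; it returns one 1-by-18 indicator vector using the empirical ranges just listed, and the function name is illustrative.

```python
import random

def simulate_attributes():
    """Draw a 1-by-18 attribute vector for one optimal item (per 3.1.3.1)."""
    v = [0] * 18
    v[random.choice([0, 1])] = 1            # Item Format: C1 or C2
    v[random.choice([2, 3, 4])] = 1         # Errors: C3, C4, or C5
    u = random.random()                     # Content Area: empirical ranges
    cuts = [.213376, .509554, .665605, .818471, 1.0]
    v[5 + next(i for i, cut in enumerate(cuts) if u < cut)] = 1  # C6..C10
    u = random.random()                     # Content Diversity
    if u > .780255:                         # two properties
        v[random.choice([10, 11])] = 1      # C11 or C12
        v[random.choice([12, 13])] = 1      # C13 or C14
    elif u >= .43949:                       # exactly one property
        v[random.choice([10, 11, 12, 13])] = 1
    v[random.choice([14, 15, 16, 17])] = 1  # Keys: C15..C18
    return v
```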
3.1.4 POST-ADJUSTING ITEM POOL SIZE

Recall that each ab-block is mutually exclusive when used to collect and tally items. However, when the CAT algorithm searches for an item to administer, the search takes place in the whole item pool. Furthermore, an item provides its maximum information at an ability level slightly higher than its difficulty in the case of the three-parameter IRT model, as Equation [2] indicates. This implies that an item collected in an individual ab-block may not necessarily provide the most information within the range of abilities equivalent to the b-range covered by that ab-block, as expected. For example, Gu (2007) graphically showed that an item from an ab-block A with an a-parameter between 1.26 and 1.55 and a b-parameter between -.84 and -.56 can provide more information at θ between -1.12 and -.84 than items from an ab-block B with an a-parameter between 0 and .89 and a b-parameter between -1.12 and -.84. As a result, an item from A rather than B may be more likely to be selected when the interim ability estimate is between -1.12 and -.84. With regard to item pool size, this implies that the item pool size identified by summing the numbers in each ab-block may carry more items than needed. Consequently, an adjustment procedure is needed to trim these redundant items so that items can be used more effectively. To achieve this goal, the following procedure was adopted. The first step involved determining the number and the location of the conditional θ points used to calculate maximum item information. In general, this number can be set as the number of b-bins used to collect and tally items. As to location, the conditional θ points can be any point within each b-bin. In this study, the midpoints of the b-bins were used to downsize the item pool; in total, 14 conditional θ points were used (θ = -2.6, -2.2, ..., 2.6). In the second step, the maximum information of every item at each conditional θ point was calculated, and the items were then rank-ordered in descending order at each conditional θ point. For each conditional θ point, the number of items needed was equivalent to the number identified in the simulation for the corresponding b-bin. For example, if the simulation indicated that 11 items were needed for all ab-bins conditional on b-values ranging from -2.8 to -2.4, then the 11 items that provided the highest information at θ = -2.6 were selected. In this way, the items that provided the highest information at each conditional θ point were identified. Finally, all items identified in the second step were collected together, and the unique items were kept as the optimal items needed for the exam. Thus, the optimal item pool size was equivalent to the total number of these unique items.
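The post-adjustment can be written as a rank-and-union over the conditional θ points. The sketch below computes 3PL item information at each midpoint, keeps the n most informative items there, and returns the unique survivors; the array names and vectorized layout are assumptions.

```python
import numpy as np

def trim_pool(a, b, c, theta_points, needed, D=1.7):
    """Post-adjust pool size (Section 3.1.4 sketch).

    a, b, c      -- parameter arrays for the untrimmed pool
    theta_points -- conditional theta points (b-bin midpoints)
    needed       -- items required at each point, from the simulation
    """
    keep = set()
    for theta, n in zip(theta_points, needed):
        p = c + (1 - c) / (1 + np.exp(-D * a * (theta - b)))          # 3PL, per [9]
        info = D**2 * a**2 * ((1 - p) / p) * ((p - c) / (1 - c))**2   # item information
        keep.update(np.argsort(info)[::-1][:n].tolist())  # n most informative items
    return sorted(keep)  # the unique optimal items define the trimmed pool
```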
3.2 RESEARCH DESIGN

This section describes the CAT model, the simulation design, the research procedure, and the series of criteria used to evaluate the performance of the candidate optimal item pools.

3.2.1 CAT MODEL

An operational large-scale CAT program served as the template. The IRT response model was the three-parameter logistic model. All items, which were stand-alone independent items, were selected for administration based on the WDM. The initial ability was set as 0 for each individual simulee. To obtain the current ability estimate before both correct and incorrect responses were available, the expected a posteriori (EAP) method was used with a N(0,1) prior. Once both correct and incorrect responses were obtained, the maximum likelihood estimation method was used. The test length was set as 20. No item exposure control procedure was implemented. The target examinee population followed a normal distribution with a mean of -.3813 and a standard deviation of .9318. The 3PL IRT model is given below:

$$P_i(\theta) = c_i + (1-c_i)\,\frac{\exp[Da_i(\theta-b_i)]}{1+\exp[Da_i(\theta-b_i)]} \qquad [9]$$

where D = 1.7, $P_i$ indicates the probability of responding to an item correctly for an examinee with latent ability θ, $a_i$ indicates the item discrimination parameter, $b_i$ indicates the item difficulty parameter, and $c_i$ indicates the item guessing parameter.

One thing that needs to be noted is that, to be able to use the WDM, information such as the weights, the lower and upper bounds, the target item information, and the item information weight has to be known beforehand. The simulation in this study employed the weight and the lower and upper bounds for each constraint as described in Table 3.2. The target item information was set as 83.25 by reference to research conducted by Cheng and Chang (2009). To determine the weight of item information, a series of item information weights including .1, .5, 1, 5, and 10 was tested using the operational item pool. The results suggested that an item information weight of 1 yielded the most stable results. As a result, 83.25 and 1 were used consistently throughout the entire study as the target item information and the item information weight respectively.

3.2.2 SIMULATION DESIGN

Three factors were manipulated in this study: item generation method, expected amount of item information change (i.e., ΔM), and b-bin width. The simulation design, described in Table 3.3, involved 24 (6 × 2 × 2) conditions. In other words, 24 candidate ROPs were developed.

Table 3.3 Simulation design

Condition   Item pool   Item generation method   b-bin width   Expected amount of item information change
1           ROP_1       R1                       0.4           0.4
2           ROP_2       R1                       0.4           0.2
3           ROP_3       R2                       0.4           0.4
4           ROP_4       R2                       0.4           0.2
5           ROP_5       MRP1                     0.4           0.4
6           ROP_6       MRP1                     0.4           0.2
7           ROP_7       MRP2                     0.4           0.4
8           ROP_8       MRP2                     0.4           0.2
9           ROP_9       MTI1                     0.4           0.4
10          ROP_10      MTI1                     0.4           0.2
11          ROP_11      MTI2                     0.4           0.4
12          ROP_12      MTI2                     0.4           0.2
13          ROP_13      R1                       0.8           0.4
14          ROP_14      R1                       0.8           0.2
15          ROP_15      R2                       0.8           0.4
16          ROP_16      R2                       0.8           0.2
17          ROP_17      MRP1                     0.8           0.4
18          ROP_18      MRP1                     0.8           0.2
19          ROP_19      MRP2                     0.8           0.4
20          ROP_20      MRP2                     0.8           0.2
21          ROP_21      MTI1                     0.8           0.4
22          ROP_22      MTI1                     0.8           0.2
23          ROP_23      MTI2                     0.8           0.4
24          ROP_24      MTI2                     0.8           0.2

The only difference between the methods marked R1 and R2 was in how an optimal item was determined. Specifically, for an item generated with R1, its $a_i$ and $c_i$ were generated first; once $a_i$ and $c_i$ were known, its $b_i$ parameter was calculated by [2]. For an item generated with R2, however, 20 a-parameters and 20 c-parameters were first generated from their respective target distributions; all possible pairings of these 20 a-values and 20 c-values were then taken, resulting in 400 combinations. With [2], the corresponding $b_i$ was calculated for each combination, and one item was randomly selected for calculating the weighted sum of deviations. The motivation behind R2 is that, in theory, more than one item can provide maximum information at a given $\theta_{max}$. The same distinction holds for MRP1 and MRP2.
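The R and MTI generation steps both end by inverting [2] to place an item's difficulty so that its information peaks at the current ability estimate. The sketch below shows one hypothetical rendering of each; the fitted distributions are those reported in 3.1.2, and negative a-draws, which the normal distribution permits, would simply be redrawn in practice.

```python
import math
import random

D = 1.7

def b_from_theta_max(theta_max, a, c):
    """Invert Birnbaum's [2]: difficulty whose information peaks at theta_max."""
    return theta_max - math.log(0.5 * (1 + math.sqrt(1 + 8 * c))) / (D * a)

def generate_item_R(theta_hat):
    """R procedure: draw a and c from their target distributions, solve for b."""
    a = random.gauss(0.97938, 0.394903)    # fitted normal for the a-parameter
    c = random.betavariate(2.734, 15.839)  # fitted beta for the c-parameter
    return a, b_from_theta_max(theta_hat, a, c), c

def generate_item_MTI(theta_hat, target_info):
    """MTI procedure: draw c, back a out of [8], then solve for b."""
    c = random.betavariate(2.734, 15.839)
    a2 = 8 * (1 - c)**2 * target_info / (
        D**2 * (1 - 20 * c - 8 * c**2 + (1 + 8 * c)**1.5))
    a = math.sqrt(a2)
    return a, b_from_theta_max(theta_hat, a, c), c
```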
For MTI1, the target test information was set differently for examinees with different true abilities, in light of the fact that the operational CAT program is used for placement purposes. The additional feature of MTI2 was that the target test information for the first ten items for all examinees was expected to be only 1/3 of the target test information set for the whole test, with the remainder expected to be reached by the last 10 items. As a result, the target item information was different for the first and last 10 items.

3.2.3 RESEARCH PROCEDURE

To develop each candidate ROP, the following procedures were carried out.

Step I. Identified the item pool size, the distribution of item numbers, and the item attributes by using 10,000 examinees drawn from the target examinee population.

Step II. Generated item pools based on the results from Step I. Their sizes were then adjusted, and the resulting pools became the ROPs. Item attributes for each item were determined by randomly sampling from the set of item attributes collected within each ab-bin in the trimmed item pool.

Step III. Evaluated the performance of the candidate ROPs against a series of criteria, assuming the items contain no estimation error. Evaluation was conducted using 3,000 examinees at each θ point from -3 to 3 in increments of .5 and a random sample of 20,000 examinees from the target examinee population. Note that the same 20,000 examinees were used consistently across all ROPs so that the results would be comparable.

3.2.4 EVALUATION CRITERIA

The performance of each candidate ROP was compared with that of the operational item pool against the series of criteria listed below. In all comparisons, the evaluation results from the operational item pool served as the baseline. Since both conditional samples and a random sample were used in the simulation, two broad types of indices, conditional and overall, were used, with the conditional indices including bias, the standard error (SE) defined in [10], the mean square error (MSE), and constraint violation (CV). The procedures used to calculate the conditional bias and MSE are very similar to [11] and [12] respectively.

$$SE(\hat\theta) = \sqrt{\frac{1}{N-1}\sum_{j=1}^{N}\left(\hat\theta_j - \bar{\hat\theta}\right)^2}, \qquad \bar{\hat\theta} = \frac{1}{N}\sum_{j=1}^{N}\hat\theta_j \qquad [10]$$

The overall indices include the following:
Precision of proficiency estimation: bias; mean square error (MSE); correlation coefficient between estimated and true abilities.
Item pool utilization: number of underexposed items; skewness of the item exposure rate distribution, χ² (Chang & Ying, 1999).
Test security: number of overexposed items; item overlap rate.
Constraint violation (CV).
Classification accuracy rate.
Pool size.

What follows are the definitions of some of the evaluation criteria.

Bias

Bias is defined as:

$$Bias = \frac{1}{N}\sum_{i=1}^{N}\left(\hat\theta_i - \theta_i\right) \qquad [11]$$

where $\hat\theta_i$ and $\theta_i$ are the estimated and true ability of the ith examinee.

MSE

MSE is defined as:

$$MSE = \frac{1}{N}\sum_{i=1}^{N}\left(\hat\theta_i - \theta_i\right)^2 \qquad [12]$$

where $\hat\theta_i$ and $\theta_i$ are the estimated and true ability of the ith examinee.

Number of overexposed and underexposed items

The exposure rate for an item is calculated by dividing the total number of times the item is administered by the total number of examinees. A commonly used cutoff value for evaluating whether an item is overexposed is 0.2 (e.g., see Eignor, Stocking, Way, & Steffen, 1993; Hau & Chang, 2001). An item with an exposure rate lower than 0.02 is considered underexposed.
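The exposure-rate bookkeeping behind these two counts is a one-liner per item, as in the sketch below; the 0.2 and 0.02 cutoffs are the ones just cited, and the function name is illustrative.

```python
from collections import Counter

def exposure_summary(administrations, n_examinees, over=0.20, under=0.02):
    """Exposure rates plus counts of overexposed and underexposed items.

    administrations -- iterable of item ids, one entry per administration
    n_examinees     -- number of simulated examinees
    """
    counts = Counter(administrations)
    rates = {item: n / n_examinees for item, n in counts.items()}
    n_over = sum(r > over for r in rates.values())
    n_under = sum(r < under for r in rates.values())
    return rates, n_over, n_under
```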
Constraint violation

Constraint violation is captured by 1) the total number of tests with constraint violations and 2) the number of different levels of constraint violation.

Skewness of the item exposure rate distribution, χ²

χ² can be calculated by the following equation:

$$\chi^2 = \sum_{i=1}^{n}\frac{(r_i - L/n)^2}{L/n} \qquad [13]$$

where $r_i$ is the observed exposure rate for the ith item, L is the test length, and n is the total number of items in the pool. According to Chang and Ying (1999) and personal communication with Hua-Hua Chang (January 15, 2010), this χ² index measures the departure of the items' actual exposure from uniform item exposure and thus quantifies the efficiency of item pool usage. A useful generalization of this index is the F ratio of the χ² values from two methods, i.e., $F_{method1,method2} = \chi^2_{method1} / \chi^2_{method2}$, which can be used to compare the exposure rates of the two methods. If $F_{method1,method2} < 1$, then method 1 is regarded as superior to method 2 in terms of producing a better overall balance of exposure rates. This index is used only as a descriptive statistic.

Item overlap rate

The item overlap rate (sometimes called the test overlap rate) is defined as the percentage of common items shared by two randomly selected examinees. The following equation describes how to calculate the average item overlap rate:

$$R = \frac{T}{C_2^N} \qquad [14]$$

where T is the total number of items shared by all pairs of the N examinees in the test, $C_2^N$ gives the number of pairs among the N examinees, and $\sum_{i=1}^{N} L_i$ is the total number of items administered to the N examinees.

Classification accuracy

Classification accuracy is defined as the percentage of examinees who are correctly classified into the different proficiency levels described in the operational testing program's technical manual. To determine the cut scores, an equation depicting the relationship between scaled scores and θ estimates was derived. Based on this equation, three cut-offs on the θ scale were found for the equivalent scaled scores. They were -1.6245 (53)¹, -.05397 (86), and 1.088235 (110). As a result, the examinees were classified into four different proficiency levels: 1) Not Proficient (NP), 2) Partially Proficient (PP), 3) Proficient (P), and 4) Advanced (A). ¹The number inside the parentheses indicates the equivalent scaled score.

CHAPTER IV
RESULTS

This section consists of three major parts. The first part presents the characteristics of all 24 candidate ROPs, including their sizes and their statistical and non-statistical attributes. The second part summarizes the performance of all 24 candidate ROPs and compares their performance with that of the operational item pool against the criteria described above. The last part presents an illustrative example in which the identified pool features were used as a template to guide operational item pool assembly; the performance of this new item pool was evaluated and compared with that of the OP.

4.1 CHARACTERISTICS OF CANDIDATE ROPS

Table 4.1 presents descriptive statistics for all 24 candidate ROPs. Clearly, all candidate ROPs contained fewer items than the OP, and the magnitude of the difference varied by item generation method (i.e., R, MRP, and MTI) and expected amount of change in item information (i.e., .2 and .4). In general, the R method tended to produce slightly larger item pools than the other two methods, other things being equal, followed by the MRP and the MTI methods respectively.
For example, using a .4 b-bin width and a .4 expected amount of item information change, the R1, MRP1, and MTI1 item generation methods respectively produced 192, 189, and 168 items in ROP_1, ROP_5, and ROP_9. Using a .8 b-bin width and a .2 expected amount of item information change, the R1, MRP1, and MTI1 item generation methods respectively produced 240, 229, and 209 items in ROP_14, ROP_18, and ROP_22. In addition, the item pools produced with the .2 expected amount of item information change contained roughly one third more items than those produced with the .4 expected amount of change, other things being equal. This result was anticipated, as a .2 expected amount of item information change implies using more ab-blocks to collect and tally items than a .4 expected amount of change. All item pools produced with the .2 expected amount of item information change contained more than 200 items, whereas those produced with the .4 expected amount of change contained fewer than 200 items.

With regard to the item discrimination parameter, the average a-values in all candidate ROPs, almost all of them above 1.2, were higher than that in the OP (i.e., .979). Specifically, both the R and MRP item generation methods tended to produce item pools with higher average a-values than the MTI method, regardless of the expected amount of change in item information and the b-bin width. For example, the average a-values for pools ROP_1 to ROP_8 were around 1.6, whereas the average a-values for pools ROP_9 to ROP_12 dropped to a range between 1.3 and 1.45. Likewise, the average a-values for pools ROP_13 to ROP_20 were around 1.45, whereas the average a-values for pools ROP_21 to ROP_24 dropped to a range between 1.2 and 1.32. In addition, the item pools developed with the .8 b-bin width (i.e., ROP_13 to ROP_24) tended to have smaller average a-values than those developed with the .4 b-bin width (i.e., ROP_1 to ROP_12). Meanwhile, the minimum a-values in the item pools using the .4 b-bin width (i.e., ROP_1 to ROP_12) tended to exceed .9, at least .1 higher than those in the remaining item pools (i.e., ROP_13 to ROP_24) developed using the .8 b-bin width. With regard to the item difficulty parameter, the average b-values in all candidate ROPs were much higher than not only that of the OP but also the average ability (i.e.,
.emmw..---w.e.m.~..-.me.m._..--.ewmd.“ ........ ewes-.-..e.~a..m--.-o.e.wo ..... We...“ ..... enmmm: e: 83 $2 $3 23 :2. sea mt: $2. 33 mm; 33 23 niece EN :3 $3 EB om; $2- 82 82 Rad- m2: $5 Ed 83 Niece .9: Rod 82 meg :3 82- 32 82 $3. :3 £3 :2. SE 38m .-.me ..... Sena ...... we. ...... wm.m-----.e....._..m ......... m E... -.immense..-$312.83.. ........ guesswwwm lemme ..... m sweetieadm- ram 52 as: am :32 £2 as: am :82 £2 .32 em :32 .8; .8.— m m m as: £62 03338.0 VNB 8.3239. 8332.6ch befiszw :4 05m... SN 23 ES N86 use News NZ ea: 28- ea; acre. 88 82 em ace 3: was we; 83 we; we; Sea 22 £2. mews $3 $8 «a: niece e8 33 82 $3 £3 £2- Sea 82 83. eta Ea 98¢ a: mace e2 23 :3 Sea was 82- $2 $2 38. em; :3 $2 22 niece ..-mmm. ..... N mono ..... _. emdlwmoso ..... mead ........ gem:..owww-..owwm--.-me.m..q.. .llama.---mwww-.-.~.w.m.d ..... N 9.2 ..... owlmmm- a: «:3 was E; and 82- 82 E: 33. ea; $2 $2 5: 218m ea 23 5.2 83 mm; need. :3 we: 83- a; £3 8.3 $2 2.8m m2 :3 See at; was $2- 82 $2 23- see e:.~ 22 ea: trace ..-MmN. ..... w._.ouo.-.-wow.d-.-.wmoeo..-.-w.e._..m ........ «swam.-..Nmmm-..wmmw-.-fimd...-.-.-..wem.m.-.-wwwwasw~muo ..... 0 ME ..... Steam- o: :3 82 Sea is :2- :2 $3 News News a; :2 GE Brace ea :3 8% ES 33 32- £3 £2 £2. 33 eta 88 ea: Steam m: was 32 £3 £3 Need- 82 a: was- 23 a; 32 an: Menace :13...“ ..... N a“... ...... w a... ....... w as. ..... www.mssismmm. ..... N Edisemmnm.idem? ----.-.mm.e..m.--.wmmw.-.-w.mm.d ..... m NW... ........... mm- ram 52 as am as: a: :2 em :82 a: :2 em :82 Bed .8.— m m m as: u. see 3. ezee The graphical representation of item number distribution in each ab-block for the operational item pool and several selected candidate ROPs (e.g., ROP_I, ROP_2, ROP_13, and ROP_14; and ROP_9, ROP_IO, ROP_21, and ROP_22) are provided in Figures 4.1, 4.2, and 4.3. The item number distributions for the rest of the candidate ROP3—very similar to what are presented here—are provided in the APPENDIX. Obviously, the items in the OP were distributed quite differently from those in the candidate ROPs in three major aspects. First, the OP carried quite a few items with a,- values between 0 and .8944 while the candidate ROPs carried none or very few items with a,- values within this range. Second, regardless that the total number of items in each b-bin was different in item pools developed with different b—bin width and expected amount of item information change, the item number distributions of all candidate ROPs tended to be more uniform than that of the OP. For example, ROP_l, ROP_2, ROP_13, and ROP_14 had roughly 15, 20, 30, and 37 items respectively in most of their b-bins. The difference in deve10ping ROP_l and ROP_2 was in using different expected amount of change in item information: .4 for ROP_l and .2 for ROP_2; and the same was for ROP_13 and ROP_14. The difference in developing ROP_l and ROP_13 was in using different b-bin width: .4 for ROP_l and .8 for ROP_13; and the same was for ROP_2 and ROP_14. The findings about ROP_9, ROP_IO, ROP_21, and ROP_22 were also very similar to those about ROP_I, ROP_2, ROP_13, and ROP_14 except that the total number of items in each b-bin was slightly smaller. Third, for the OP, the largest proportion of items was those having a,- values between 0 and .8944 and then followed by items having a,- values between .8944 and 1.2649. However, for most of the candidate ROPs, a large proportion of items were those having a,- values between .8944 and 1.5492. 
Figure 4.1 Item number distribution in each ab-block for the operational item pool. [Figure: frequency of items in each ab-block, plotted by b-bin and a-range.]

Figure 4.2 Item number distribution in each ab-block for ROP_1, ROP_2, ROP_13, and ROP_14. [Figure: four panels showing item frequencies by ab-block for the four ROPs.]

Figure 4.3 Item number distribution in each ab-block for ROP_9, ROP_10, ROP_21, and ROP_22. [Figure: four panels showing item frequencies by ab-block for the four ROPs.]

Figures 4.4 and 4.5 further illustrate the differences in the distributions of the item discrimination and difficulty parameters for the OP and the aforementioned eight candidate ROPs. The a- and b-parameter distributions for the rest of the candidate ROPs can be found in the APPENDIX. To summarize, both the item a- and b-parameters of the candidate ROPs were distributed quite differently from those of the OP. Compared with the OP, the distributions of the item a-parameter in the candidate item pools appear truncated at a value of approximately .8, whereas the distributions of the item b-parameter in the candidate item pools tended to be uniform along the proficiency scale. Across the candidate ROPs themselves, however, both the item a- and b-parameter distributions were quite similar in shape, except that the numbers of items in the candidate ROPs developed with the .2 expected change in item information were larger than in those developed with the .4 expected change.
Figure 4.4 Item discrimination and difficulty parameter distributions for the OP, ROP1, ROP2, ROP13, and ROP14. [Figure: paired histograms of the a-parameter and b-parameter distributions for each pool.]

Figure 4.5 Item discrimination and difficulty parameter distributions for the OP, ROP9, ROP10, ROP21, and ROP22. [Figure: paired histograms of the a-parameter and b-parameter distributions for each pool.]

Figure 4.6 presents the item pool information for the OP and the aforementioned eight candidate ROPs. The pool information for the rest of the candidate ROPs can be found in the APPENDIX. Note that the operational item pool was the largest, containing 314 items in total. Clearly, the pool information curve of the OP was quite different from that of each candidate ROP, mainly in that the curve for each candidate ROP was much flatter than that for the OP and remained flat over a wide range of the proficiency scale. This suggests that such an item pool can support equally good measurement for examinees across a wide range of abilities. For the OP, the pool information reached its peak in the proficiency range between -1.2 and -.8 and then decreased along both sides of the proficiency scale. By the time the pool information for the OP dropped to approximately 60, that is, the largest pool information that could be provided by ROP_13, the range of the ability scale covered by the OP's curve was between -2 and .8. For ROP_13, however, this range was between -2 and 1.6, suggesting that the ROP can provide equally good measurement outcomes for examinees with latent abilities within this range. Figure 4.6 also indicates that the pool information differs across the individual ROPs at different proficiency points. This difference was the result of different pool characteristics, such as different average a-values, as discussed before, as well as different item pool sizes.
As the next section will indicate, all candidate ROPs can perform better than the OP. The better performance can be partly attributed to the optimality of the non-statistical attributes, which are presented below.

Figure 4.6 Item pool information for the OP and 8 candidate ROPs. [Figure: pool information I(θ) plotted against θ from -4 to 4 in two panels; Panel A: OP (314), ROP1 (192), ROP2, ROP13 (173), and ROP14 (240); Panel B: OP (314), ROP9 (168), ROP10 (229), ROP21 (139), and ROP22 (209). Note. ( ) indicates item pool size.]

Examples of the distributions of item attributes for the aforementioned eight candidate ROPs and the OP are presented in Figures 4.7 and 4.8. Again, the distributions of item attributes for the rest of the candidate ROPs can be found in the APPENDIX. In these figures, the x-axis represents the item attributes as described in Table 3.2. Overall, all candidate ROPs shared very similar distributions of item attributes, apart from slight differences. The largest differences in percentage between the OP and the candidate ROPs tended to occur in Attributes 3, 11, 12, and 14.

Figure 4.7 Distributions of item attributes for the candidate ROP_1, ROP_2, ROP_13, and ROP_14. [Figure: proportion of items possessing each attribute C1-C18 for the OP and the four ROPs.]
Figure 4.8 Distributions of item attributes for the candidate ROP_9, ROP_10, ROP_21, and ROP_22
(Line plot of the proportion of items possessing each attribute C1 to C18 for the OP, ROP9, ROP10, ROP21, and ROP22.)

4.2 PERFORMANCE OF CANDIDATE OPTIMAL ITEM POOLS

4.2.1 EVALUATION RESULTS FROM USING CONDITIONAL ABILITY POINTS

Figures 4.9 to 4.11 portray the conditional biases, MSEs, and SEs given by all candidate ROPs and the OP. Tables A1 to A3 in the Appendix also summarize these conditional statistics. Obviously, all candidate ROPs yielded much better measurement accuracy and precision, as indicated by much lower biases, MSEs, and SEs. Compared with the OP, the following findings can be observed with regard to bias. First, all candidate ROPs yielded smaller biases across the entire θ scale than the OP, and the differences were more conspicuous at the two tails of the θ scale (i.e., θ = -3, -2.5, 2.5, 3) than in the middle range. It is interesting to notice that the OP yielded negative bias at both tails of the θ scale, whereas the candidate ROPs yielded negative bias only at the low-ability levels. The reason can be attributed to the considerable shortage of operational items with b-values at the high-ability end of the proficiency scale, as Figure 4.4 indicates. When an item pool contained a sufficient number of items with b-values at the high-ability end, as was the case for the candidate ROPs, the conditional bias at those ability points was close to zero. In addition, the magnitudes of bias at θ = -3 were larger in the candidate ROPs developed with a .4 expected change in item information than in those developed with a .2 change, other things being equal. Second, the OP tended to underestimate examinees at both tails of the θ scale. Although the candidate ROPs also tended to substantially underestimate examinees in the low-ability range, the underestimation was much less severe than in the OP and occurred within a narrower ability range. For example, the biases at θ = -3 and θ = -2.5 were .28 and .19 in absolute value for the OP; for the candidate ROPs, the largest absolute values were .24 and .08 at θ = -3 and -2.5, respectively. Likewise, the bias at θ = 3 was .47 in absolute value for the OP, whereas the largest value for the candidate ROPs was only .05. A similar pattern can be observed for the conditional MSEs. In general, the MSE curve for the OP stayed consistently above those for the candidate ROPs. The MSEs at the low-ability levels, i.e., below -2.5, were much larger than those at the other ability levels, followed by those at the high ability points, i.e., 2.5 and 3.
The reason can be attributed to the finding that a certain proportion of examinees with true abilities below -2.5 were significantly underestimated. The differences in MSE between the OP and the candidate ROPs were much smaller at θ points between -1.5 and 0.5 than at the rest of the θ points. As for the MSEs given by the candidate ROPs, very similar values can be observed at most of the θ points, i.e., -2 to 3; the largest difference was observed at θ = -3. For example, the MSEs for ROP17 to ROP20 were quite different at θ = -3. In terms of conditional standard error, Figure 4.11 indicates that, at all conditional θ points, the ability estimates given by the OP are consistently more dispersed than those given by the candidate ROPs, suggesting more precise ability estimates from the candidate ROPs. However, low-ability examinees, i.e., those with true abilities below -2.5, tended to be measured less precisely than others no matter which item pool was used, the OP or a candidate ROP.

Table 4.2, which reports the number of tests with constraint violations for all candidate ROPs and the OP, indicates that all candidate ROPs controlled constraint violation substantially better than the OP. For example, the OP produced a total of 9016 individual tests with a constraint violation across the whole θ scale, whereas for the candidate ROPs even the highest number, given by ROP_3, was only 171. This number was considered tolerable. For all candidate ROPs and the OP, no individual test had more than one constraint violation.

Figure 4.9 Graphical representation of conditional bias
(Six panels of conditional bias curves over θ for the OP and ROP1 through ROP24.)

Figure 4.10 Graphical representation of conditional mean square error
(Six panels of conditional MSE curves over θ for the OP and ROP1 through ROP24.)

Figure 4.11 Graphical representation of conditional standard error
(Six panels of conditional SE curves over θ for the OP and ROP1 through ROP24.)
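The conditional indices plotted in Figures 4.9 through 4.11 can be computed directly from simulated ability estimates. A minimal sketch follows; the replication count and the noise model standing in for the CAT estimates are hypothetical.

import numpy as np

def conditional_stats(theta_true, theta_hat):
    # theta_hat holds the ability estimates from replications at one true theta
    bias = theta_hat.mean() - theta_true               # conditional bias
    mse = np.mean((theta_hat - theta_true) ** 2)       # conditional mean square error
    se = theta_hat.std(ddof=1)                         # conditional standard error
    return bias, mse, se

# Hypothetical usage: 500 replications at each conditional theta point; the
# normal noise below stands in for the CAT's actual estimation error
rng = np.random.default_rng(1)
for theta in (-3.0, -2.5, -2.0, -1.0, 0.0, 1.0, 2.0, 2.5, 3.0):
    theta_hat = theta + rng.normal(0.0, 0.3, 500)
    print(theta, [round(v, 3) for v in conditional_stats(theta, theta_hat)])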
Table 4.2 Number of tests having constraint violation given by the OP and the candidate ROPs, by conditional θ point
(Two panels, unrecoverable from the scan; columns are the conditional θ points from -3 to 3, rows are the OP and ROP_1 through ROP_24. See the summary in the text above.)

4.2.2 EVALUATION RESULTS FROM USING 20,000 EXAMINEES RANDOMLY SAMPLED FROM THE TARGET EXAMINEE POPULATION

Table 4.3 compares the overall summary performance statistics for the OP and all candidate ROPs. To summarize, all candidate ROPs yielded much better performance on almost all criteria. All candidate ROPs unanimously gave better measurement accuracy and precision than the OP, as witnessed by lower bias and MSE. In general, all candidate ROPs and the OP tended to overestimate examinees' abilities. However, the average biases given by the candidate ROPs tended to be around .1, approximately .1 lower than that given by the OP. With regard to MSE, the results suggest that the candidate ROPs tended to yield remarkably lower average MSEs (i.e., .04 to .07) than the OP (i.e., .13). The magnitude of the average MSE given by each candidate ROP appears to be related to its average a-value. For example, ROP_21 and ROP_22, the two candidate ROPs with the lowest a-values, yielded the highest MSEs, i.e., .07. The correlation coefficient between the true and the estimated abilities given by each candidate ROP was at least .02 higher than that given by the OP. Similar to the findings about MSE, the correlation coefficients given by the candidate ROPs also tended to be related to their average a-values; those with lower a-values tended to yield slightly lower correlations. For example, ROP_21 and ROP_22 reported correlation coefficients of .96, whereas ROP_1 and ROP_2 reported .98. With regard to item use, all candidate ROPs made more efficient use of items than the OP. This better use of items is supported by evidence in two areas: the number of underexposed items and the ratio of the χ² statistic for the OP over that for each candidate ROP, where χ² summarizes how far the observed item exposure rates depart from perfectly balanced use. First, compared with the OP, the candidate ROPs yielded at least 18% fewer underexposed items. In general, the candidate ROPs developed with a .4 expected amount of change in item information tended to make better use of items, having smaller percentages of underexposed items than those developed with a .2 expected amount of change; the difference was about 15% in most cases. For example, ROP_21 yielded 25% underexposed items whereas ROP_22 yielded 44%. Second, the ratio of χ² for the OP over χ² for each candidate ROP was unanimously above 2, suggesting that all candidate ROPs made more balanced use of items than the OP.
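The exposure-balance statistic is only partially legible in the scan; reading it as the familiar χ² index of item exposure rates, which is an assumption rather than a confirmed detail of the study, the computation would look like the following.

import numpy as np

def exposure_chi_square(exposure_counts, n_examinees, test_length):
    # Observed exposure rate of each item in the pool
    r = exposure_counts / n_examinees
    # Under perfectly balanced use, every item is exposed at rate L / pool size
    r_ideal = test_length / len(exposure_counts)
    return np.sum((r - r_ideal) ** 2 / r_ideal)

# Hypothetical usage: 20-item tests, a 200-item pool, 20,000 examinees
rng = np.random.default_rng(2)
counts = rng.multinomial(20 * 20_000, rng.dirichlet(np.ones(200)))
print(exposure_chi_square(counts, 20_000, 20))
# The ratio chi2_OP / chi2_ROP then compares two pools: values above 1
# indicate that the ROP uses its items more evenly than the OP.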
The candidate ROPs developed with a .4 expected amount of change in item information tended to yield a higher ratio than those developed with a .2 expected amount of change, suggesting that the former made better use of items than the latter. With regard to the item overlap rate, it is somewhat surprising that the overlap rate given by the OP was at least .1 higher than that given by each candidate ROP, despite the fact that the OP contained more items than any candidate ROP (a computational sketch of this index follows Table 4.3 below). Among the candidate ROPs, meanwhile, differences in size did not make a significant difference in the item overlap rate. In terms of constraint violation, significantly fewer tests with constraint violations were observed for the candidate ROPs than for the OP. In comparison with the OP, in which at least 8% of tests had a constraint violation, no constraint violation was observed in the tests using candidate ROPs when rounded to two decimal places. None of the tests had more than one constraint violation for either the OP or the candidate ROPs. In terms of overexposed items, interestingly, very similar numbers of overexposed items (i.e., approximately 35 items when the overexposure rate was set at .2) were observed for both the OP and the candidate ROPs, regardless of the differences in size between the OP and the candidate ROPs and among the candidate ROPs themselves. When the overexposure rate was set at .3, the difference in the percentage of overexposed items between the OP and the candidate ROPs was smaller than when the rate was set at .2. Overall, this finding seems to be related to the underlying nature of the maximum item information selection rule, which tends to be highly sensitive to even slight differences in item information. Because no item exposure control procedure was implemented in this study, items with high a-values were always selected over the others.
Table 4.3 Overall summary performance statistics for the OP and the 24 candidate ROPs, using 20,000 examinees randomly sampled from the target population
(Three panels, unrecoverable from the scan, covering the OP and ROP_1 through ROP_24 and reporting statistics including bias, MSE, the correlation between true and estimated abilities, counts and proportions of overexposed and underexposed items at the .2 and .3 thresholds, the item overlap rate, χ², the number of tests having constraint violation, and item pool size. See the summary in the text.)
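A sketch of the item overlap rate computation referenced above; the simulated tests are hypothetical placeholders, and the statistic is estimated by sampling random examinee pairs rather than enumerating all of them.

import numpy as np

def average_overlap_rate(tests, test_length, n_pairs=5000, seed=3):
    # tests: one set of administered item IDs per examinee
    rng = np.random.default_rng(seed)
    total = 0.0
    for _ in range(n_pairs):
        i, j = rng.choice(len(tests), size=2, replace=False)
        total += len(tests[i] & tests[j]) / test_length
    return total / n_pairs

# Hypothetical usage: 1,000 simulated 20-item tests from a 200-item pool
rng = np.random.default_rng(4)
tests = [set(rng.choice(200, size=20, replace=False).tolist()) for _ in range(1000)]
print(average_overlap_rate(tests, 20))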
Table 4.4 presents the number and percentage of examinees correctly classified into each of four performance levels, as well as the differences in classification accuracy rates at each performance level between the OP and each candidate ROP, when three cut scores (i.e., -1.6245, -.05397, and 1.088235) are used. As Table 4.4 indicates, at each performance level, the classification accuracy rate given by any candidate ROP is higher than that given by the OP. The classification accuracy rates given by the candidate ROPs developed with the .2 expected amount of item information change tended to be slightly higher at all performance levels than those given by the candidate ROPs developed with the .4 expected amount of change, other things being equal. In comparison, the candidate ROPs developed by the random procedure tended to classify examinees more accurately than those developed by the MTI procedure. In general, this higher classification accuracy rate seems to be related to the magnitude of the average a-values: item pools with higher a-values tended to yield more accurate classifications.

Table 4.4 Classification accuracy rates for the OP and the candidate ROPs at each of the four performance levels (Not Proficient, Partially Proficient, Proficient, Advanced)
(Table unrecoverable from the scan; see the summary in the text above.)
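The classification step itself is a simple binning of true and estimated abilities against the three cut scores. The sketch below shows the bookkeeping; the ability arrays are hypothetical stand-ins for the simulated data.

import numpy as np

CUTS = np.array([-1.6245, -0.05397, 1.088235])  # the three cut scores used above
LEVELS = ["Not Proficient", "Partially Proficient", "Proficient", "Advanced"]

def classify(theta):
    # np.searchsorted maps each theta to one of the four performance levels
    return np.searchsorted(CUTS, theta)

def accuracy_by_level(theta_true, theta_hat):
    true_lvl, est_lvl = classify(theta_true), classify(theta_hat)
    for k, name in enumerate(LEVELS):
        mask = true_lvl == k
        rate = np.mean(est_lvl[mask] == k) if mask.any() else float("nan")
        print(f"{name}: n={mask.sum()}, accuracy={rate:.3f}")

# Hypothetical usage with simulated truths and noisy estimates
rng = np.random.default_rng(5)
theta_true = rng.normal(-0.3813, 1.0, 20_000)
theta_hat = theta_true + rng.normal(0.0, 0.3, 20_000)
accuracy_by_level(theta_true, theta_hat)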
4.3 A DEMONSTRATIVE EXAMPLE

This section demonstrates how the identified optimal features can be used to guide item pool assembly by serving as a model item pool. The target ROP features were taken from ROP_21, which has average a-, b-, and c-values of 1.21, -.237, and .135, respectively. This ROP was chosen because its average a-value was the smallest and it contained the smallest number of items among all candidate ROPs. The item pool assembled based on the ROP_21 features is called the Demo_Pool; its characteristics are described in Table 4.5 along with those of the OP and ROP_21. In total, 84 out of 314 operational items strictly met the optimal psychometric configurations identified in ROP_21. To equalize the sizes of the Demo_Pool and ROP_21 so that the results would be comparable, another 55 operational items were added to the Demo_Pool. The selection of these 55 items was based on two criteria: 1) the selected items, all with a-values below 0.89443, should have a-values as close to 0.89443 as possible; and 2) the selected items should have b-values above -0.8. The second criterion was adopted in order to move the average b-value closer to that of ROP_21 (i.e., -0.237) as well as to the average ability of the target examinee population (i.e., -0.3813). Table 4.5 indicates that both the average a- and b-values of the Demo_Pool are lower than those of the OP. In addition, the a-values were more dispersed in the Demo_Pool than in the OP, but the b-values were less dispersed in the Demo_Pool than in the OP. Tables 4.6 and 4.7 compare the performance of the OP, the Demo_Pool, and ROP_21 produced by both conditional and random examinee samples. The random examinee sample was the same as that used in the studies described in the previous chapter. Table 4.8 compares the classification accuracy rates given by these three item pools. In general, these tables reveal the following findings. First of all, the results in Table 4.6 from using conditional θ points indicate that, across almost the entire θ scale, the Demo_Pool yields similar or even better measurement accuracy and precision than the OP except at the θ points -3, -2.5, 2.5, and 3.
This result can be largely attributed to the lack of items in the Demo_Pool with b-values close to those θ points. Second, the overall performance statistics in Table 4.7 indicate that the Demo_Pool performs better than the OP on almost every criterion, including better measurement accuracy and precision, higher classification accuracy rates, more efficient item use, and a higher correlation coefficient between the true and the estimated abilities. To some degree, it seems that the items removed in the course of assembling the Demo_Pool did not contribute much to the measurement outcomes. The pool configuration of ROP_21, identified as the end product of this optimal item pool design effort, can successfully serve as a model item pool in practical applications.

Table 4.5 Item pool statistics for the Demo_Pool, the OP, and the ROP_21

Item parameters  Item pool   Mean     SD     Max    Min     Pool Size
a-parameter      OP           0.979   0.395  2.125   0.022  314
                 Demo_Pool    1.175   0.337  2.125   0.470  139
                 ROP_21       1.210   0.249  2.111   0.736  139
b-parameter      OP          -0.746   0.896  3.617  -2.730  314
                 Demo_Pool   -0.501   0.929  2.082  -2.345  139
                 ROP_21      -0.237   1.553  2.753  -2.792  139
c-parameter      OP           0.147   0.080  0.500   0.007  314
                 Demo_Pool    0.162   0.083  0.500   0.009  139
                 ROP_21       0.135   0.067  0.337   0.019  139

Table 4.6 Conditional statistics given by the OP, the Demo_Pool, and the ROP_21
(Table unrecoverable from the scan: conditional bias, MSE, and SE at θ points from -3 to 3 for the three pools.)

Table 4.7 Overall performance statistics for the OP, the Demo_Pool, and the ROP_21 using a random examinee sample
(Table largely unrecoverable from the scan; see the summary in the text.)

Table 4.8 Classification accuracy rates at each performance level

Proficiency level             OP      ROP_21  Demo_Pool
Not Proficient (#)            1521    1559    1501
Not Proficient (Prop)         0.828   0.848   0.817
Partially Proficient (#)      9542    9665    9592
Partially Proficient (Prop)   0.863   0.875   0.868
Proficient (#)                4596    4799    4673
Proficient (Prop)             0.773   0.807   0.786
Advanced (#)                  983     961     922
Advanced (Prop)               0.798   0.827   0.793
Total (#)                     16586   16984   16688
Total (Prop)                  0.829   0.849   0.834

Note. # indicates number; Prop = proportion.
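As a computational companion to this demonstrative example, the sketch below reconstructs the two-step selection logic described in Section 4.3: first take operational items that fall inside the model pool's ab-blocks, then top up with items whose a-values sit just below the block boundary of 0.89443 and whose b-values exceed -0.8. The block list and parameter arrays are hypothetical, and the code is a reconstruction of the logic in the text, not the study's actual program.

import numpy as np

def assemble_demo_pool(a, b, target_blocks, pool_size, a_threshold=0.89443):
    # a, b: parameter arrays for the operational items
    # target_blocks: (a_lo, a_hi, b_lo, b_hi, count) tuples from the model pool
    chosen = []
    for a_lo, a_hi, b_lo, b_hi, count in target_blocks:
        idx = np.where((a >= a_lo) & (a < a_hi) & (b >= b_lo) & (b < b_hi))[0]
        chosen.extend(idx[:count].tolist())      # take up to the blueprint count
    # Top-up step mirroring the two criteria in the text: among the remaining
    # items with a below the threshold, prefer a close to it and b above -0.8
    taken = set(chosen)
    rest = [i for i in range(len(a))
            if i not in taken and a[i] < a_threshold and b[i] > -0.8]
    rest.sort(key=lambda i: a_threshold - a[i])  # closest to the threshold first
    chosen.extend(rest[:max(0, pool_size - len(chosen))])
    return chosen

# Hypothetical usage with simulated operational parameters
rng = np.random.default_rng(8)
a_op = rng.normal(0.98, 0.4, 314).clip(0.1, 2.2)
b_op = rng.normal(-0.75, 0.9, 314)
blocks = [(0.89443, 1.0954, lo, lo + 0.4, 3) for lo in np.arange(-2.8, 2.8, 0.4)]
print(len(assemble_demo_pool(a_op, b_op, blocks, 139)))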
CHAPTER V SUMMARY, DISCUSSION, AND IMPLICATION

5.1 SUMMARY OF RESEARCH

This study introduced a heuristic approach to developing an optimal item pool for a CAT using the WDM item selection method. An item blueprint was produced as the end product of this approach, describing items' statistical and non-statistical attributes, the item number distribution, and the optimal item pool size. Specifically, three different methods were used to identify the items' optimal statistical attributes; the prototype of two of these methods can be traced back to McBride and Weiss (1976), in which "ideal" or "perfect" item pools were simulated with item features identified by using a formula created by Birnbaum (1968). To identify the optimal non-statistical features required by the WDM, a sampling approach was developed that was based primarily on the test specifications and on the distribution of the operational items' non-statistical attributes. To determine the optimal item pool size, the bin-and-union method was used, which partitions the space constituted by the item a- and b-values into smaller regions called ab-bins or ab-blocks in this study. In other words, an individual ab-bin/ab-block is bounded by a specific range of a- and b-values. Blocks are mutually exclusive, but items within a block are treated as equivalent, meaning that they can be used interchangeably. The items administered to each simulated examinee were tallied by block, items administered in common to more than one examinee were counted once, and the item pool size was determined as the union of the item sets required across a large number of examinees drawn from the target examinee population. By manipulating three factors (b-bin width, item generation method, and expected amount of item information change), 24 candidate ROPs were generated, which varied in item pool size and in average a- and b-values. To summarize, the item pool sizes varied from 139 (ROP_21) to 270 (ROP_2); the average a-values varied from 1.182 (ROP_22) to 1.62 (ROP_2); and the average b-values varied from -.096 (ROP_5) to -.358 (ROP_3), all higher than the average ability of -.3813. Compared with the OP, almost 90% of the candidate ROPs contained items with minimum a-values exceeding .8, similar to what Urry (1977) recommended as the preferred characteristics of a CAT item pool, whereas the b-parameters of the candidate ROPs, rather than following a normal distribution as the examinee abilities do, tended to be uniformly distributed along the proficiency scale. Among the three manipulated factors, the b-bin width and the expected amount of item information change tended to affect item pool size, whereas the item generation method tended to affect the item a-values. For example, with other factors being equal, using a .2 expected amount of change in item information tended to require about 1/3 more items than using a .4 expected amount of change. When the MTI item generation method was used, setting the maximum individual test information at the average test information tended to produce an item pool with less discriminating power than the other two item generation methods. The distributions of the non-statistical attributes of the different candidate ROPs were also comparable to each other. Despite the differences in the optimal features of the different candidate ROPs, the evaluation results indicated that all candidate ROPs unanimously performed better than the OP: they achieved better measurement accuracy and precision, made more efficient and balanced use of items, and violated test constraints at a negligible level.
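Because the bin-and-union bookkeeping is central to the summary above, a minimal sketch may help, under one plausible reading of the union step (each block keeps the largest count that any single simulated examinee required). The bin edges shown follow the equally spaced a² boundaries visible in the appendix figures, while the b-bin width was one of the manipulated factors; both are assumptions for illustration.

import numpy as np
from collections import Counter

def bin_and_union(administered, a_edges, b_edges):
    # administered: for each simulated examinee, a list of (a, b) pairs of the
    # items that the CAT actually delivered to that examinee
    required = Counter()
    for test in administered:
        counts = Counter(
            (int(np.searchsorted(a_edges, a)), int(np.searchsorted(b_edges, b)))
            for a, b in test
        )
        for block, n in counts.items():
            # the union across examinees keeps the largest count that any one
            # examinee needed in this block
            required[block] = max(required[block], n)
    return required, sum(required.values())

# Hypothetical edges and simulated administrations
a_edges = np.sqrt(np.arange(0.4, 5.2, 0.4))   # .63246, .89443, 1.0954, ...
b_edges = np.arange(-2.8, 3.2, 0.4)
rng = np.random.default_rng(9)
sims = [[(rng.uniform(0.5, 2.2), rng.normal(0, 1)) for _ in range(20)]
        for _ in range(1000)]
blocks, pool_size = bin_and_union(sims, a_edges, b_edges)
print(pool_size)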
It is interesting to observe that, despite the fact that the OP contained more items than any candidate ROP, the total number of overexposed items in each candidate ROP differed little from that in the OP, and the item overlap rate given by the OP was higher than that given by any candidate ROP. In addition, slight differences in the performance of the different candidate ROPs were also observed. In general, the candidate ROPs with higher average a-values (e.g., ROP_1 and ROP_2) tended to yield better measurement precision and more accurate classification than those with lower average a-values (e.g., ROP_21 and ROP_22). The candidate ROPs developed with a .2 expected amount of item information change also tended to yield more underexposed items than those developed with a .4 expected amount of change. In summary, the resulting statistics suggest that the methodology introduced in this study can generate item pools with characteristics capable of supporting the good functioning of the WDM item selection algorithm: examinees are administered a content-balanced exam and, at the same time, are estimated with decent accuracy and precision.

5.2 DISCUSSION

Stocking (1994) addressed several issues with regard to item pool size in the context of high-stakes admission programs administered in the form of a CAT. Examining five operational item pools for five fixed-length CAT tests, Stocking recommended as a rule of thumb that a CAT item pool should be approximately 12 times the length of the CAT exam for high-stakes exams. Way (1998) commented on this rule of thumb as "prudent advice" (p. 23) and "a valuable guideline" (p. 24). Stocking also concluded that roughly the item content of a small number of conventional paper-and-pencil forms should be sufficient to construct a single CAT pool. As Table 4.1 indicates, the 24 candidate ROPs developed by using the bin-and-union method in this study vary in size from 139 to 272, with four item pools containing more than 240 items. Calculating the ratio of item pool size to test length (i.e., 20), the ratios were between 6 and 12 for 20 candidate ROPs; for the remaining four item pools, they were between 12 and 14. In a study by Reckase and He (2009), in which the bin-and-union method was also used to develop an optimal ROP for a variable-length CAT, the results likewise indicated a ratio of item pool size to test length of approximately 13. In other words, the bin-and-union method, to a large degree, corroborates Stocking's recommendation as to what constitutes an adequate item pool size for a CAT program. In addition to the item pool size, the bin-and-union method used in the current study also depicts the characteristics of the a- and b-parameters. The a-parameters in almost all ROPs tended to peak somewhere around 1.1, with minimum values generally exceeding .8, whereas the b-parameters tended to be uniformly distributed along the whole proficiency scale, not exactly matching the distribution of θ used for the simulation in this study. In general, the item parameter characteristics suggested in this study are in line with those recommended by Urry (1977) and Jensema (1977), both of whom recommend a uniform distribution of item difficulty as one of the important item bank requirements for a CAT.
This requirement appears to be reasonable in that a maximum information-based item selection method, such as the WDM method, attempts to select for administration the item that will yield the most information. If there are not sufficient numbers of items available at a particular difficulty level, the item selection algorithm will have to select an item that is not appropriate and that will yield a less than optimal amount of item information. As a result, the improved measurement quality expected of tailored testing may not be achieved. An example of this appeared in a previous section of this study: the operational item pool yielded substantial bias and error for examinees located at the high-ability end of the proficiency scale due to the shortage of items with high b-values, whereas the candidate ROPs did not experience the same problem. Recall that when employing either the bin-and-union or the linear programming method to design an item pool, the CAT simulation is conducted with the operational CAT algorithm and with examinees drawn from the target examinee population. It is therefore expected that these two characteristics built into the simulation jointly affect the optimal item pool features. In other words, an optimal item pool should carry features that meet the unique needs of each individual CAT program. In this sense, it might be appropriate to argue that the optimal item pool features for different CAT programs may be different. This explains the different findings on the optimal features of CAT item pools in the literature. For example, Dodd, Koch, and De Ayala (1993) indicated that trait estimates in CAT are more accurate when the item pool characteristics and the latent trait estimates match each other. In another study, by Reckase and He (2004), on a variable-length CAT with a termination rule that evaluates whether the final ability estimate falls within or outside the bound set by a certain confidence interval around a cut score, the optimal distribution of b-values was found to be uniform along the proficiency scale but with a peak around the cut score, barely related to the expected examinee ability distribution. Therefore, the optimal item parameter characteristics yielded by the current study may not necessarily work for other, different CAT programs. As Table 4.3 indicates, most of the candidate ROPs tended to yield numbers of overexposed items very similar to that of the OP, although they contain fewer items. If evaluated by the proportion of overexposed items relative to item pool size, all candidate ROPs had a higher percentage of overexposed items than the OP, suggesting that the candidate ROPs could potentially pose a more severe test security concern than the OP if test security were really an issue. In fact, this result can be ascribed to factors including the use of the maximum information-based item selection rule and the absence of an item exposure control procedure in the targeted operational CAT program. Using the operational CAT pool configurations, such as the distributions of the a- and c-parameters and the correlation between the a- and b-parameters, as a template to generate optimal item features also plays a role. As well documented in the literature (e.g., Wainer, 2000; Way, 1998), the maximum information item selection rule is very sensitive to even very slight differences in item information.
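This sensitivity can be made concrete with a stripped-down simulation: with no exposure control, a pure maximum-information rule concentrates administrations on the few most discriminating items near each ability estimate. The sketch below uses hypothetical parameters and, for simplicity, holds the ability estimate fixed within a test, so it illustrates the mechanism rather than the operational WDM algorithm.

import numpy as np

D = 1.7  # assumed scaling constant, as in the earlier sketch

def item_information(theta, a, b, c):
    # Fisher information of a 3PL item at ability theta
    p = c + (1.0 - c) / (1.0 + np.exp(-D * a * (theta - b)))
    return (D * a) ** 2 * ((1.0 - p) / p) * ((p - c) / (1.0 - c)) ** 2

rng = np.random.default_rng(6)
a = rng.normal(1.0, 0.4, 200).clip(0.3, 2.2)   # hypothetical 200-item pool
b = rng.normal(-0.5, 1.0, 200)
c = np.full(200, 0.15)
exposure = np.zeros(200, dtype=int)

for _ in range(2000):                          # 2,000 simulated examinees
    theta_hat = rng.normal(-0.38, 1.0)         # stand-in for a provisional estimate
    info = item_information(theta_hat, a, b, c)
    for _ in range(20):                        # 20-item test, no exposure control
        best = int(np.argmax(info))            # always the single most informative item
        exposure[best] += 1
        info[best] = -np.inf                   # an administered item cannot repeat

print("share of pool never used:", np.mean(exposure == 0))
print("maximum item exposure rate:", exposure.max() / 2000)

Even in this simplified setting, a handful of high-a items absorbs most administrations while a sizable share of the pool is never used, the same pattern of unbalanced exposure discussed in this section.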
If a maximum information-based criterion is used for item selection in a CAT free of any item exposure control procedure, items with high discrimination are very likely to be overexposed, while many low or even moderately discriminating items are never selected. A solution to the unbalanced item exposure rates caused by using the maximum information-based item selection criterion is to implement an item exposure control procedure. Because no item exposure control procedure was implemented in our target CAT program, overexposure may arise as a natural consequence in this study. It is anticipated that using an item exposure control procedure would help ease this concern. The procedures needed to develop an optimal item pool for a CAT implementing item exposure control can be found in Gu (2007). On a side note, the target CAT is used to place examinees into different levels of courses. Therefore, in cases where overexposure causes a test security issue and examinees gain higher scores due to security breaches, they might simply throw themselves into a kind of self-cheating condition in which they are likely to end up learning little if placed in a class beyond their ability level. In addition to the number of overexposed items, the item overlap rate was also used in this study as another index of test security because it indicates the percentage of common items administered to two randomly selected examinees. The higher the proportion of identical questions that different examinees receive, the greater the risk to test security. Way (1998) recommended using the item overlap rate as a "global picture of how often items are used" (p. 22). The results in this study indicated that all candidate ROPs yielded lower item overlap rates than the OP despite the fact that the OP contained more items. This result, from another perspective, suggests that the test security issue may not be as severe as the number of overexposed items implies. A study by Chen, Ankenmann, and Spray (2003) found that equalizing the item exposure rates in a pool can reduce the average test overlap rates.
Table 4.3 indicates that all candidate ROPs have more homogeneous item exposure rates than the OP, as witnessed by the χ² values for each candidate ROP. This result explains why the candidate ROPs produced lower test overlap rates than the OP. As discussed in the previous chapter, the R and the MRP item generation methods tended to yield candidate ROPs with higher average a-values than the MTI procedure, other factors being equal. The lower a-values under the MTI procedure can be attributed to the magnitudes of the minimum test information set for examinees with different true abilities. This study adopted the average test information calculated from the historical data as the target test information for examinees whose true abilities fell within the range of the three cut scores used to classify examinees into four performance levels; for examinees whose true abilities fell beyond the range of the cut scores, the target test information was set at a lower level. It is anticipated that raising the magnitudes of the minimum test information would yield candidate ROPs with higher average a-values. In light of the characteristics of the operational item pool used in this study, i.e., an average a-value of .979, it seems that the way the target test information was set works well in that it yields candidate ROPs with reasonably realistic item parameter values.

5.3 IMPLICATION

The results of this study indicate that the optimal item pools performed much better than the operational item pool in almost every respect. This finding was not surprising in some sense, given that the item characteristics, both statistical and non-statistical, were designed to be optimal. Rather, what is more interesting is how to make good use of these optimal features so that a CAT program can achieve the best attainable measurement outcome. Therefore, this section discusses how to apply the optimal item pool configurations to item pool assembly, item pool management, and item writing.

5.3.1 IMPLICATION FOR ITEM POOL MANAGEMENT AND ASSEMBLY FOR CAT

In practice, two types of item pools are often used for an operational CAT program: master and operational item pools. A master item pool, also called a "vat" by some researchers (Way, 1998; Way, Steffen, & Anderson, 1998), stores a large quantity of items from which an operational item pool is assembled. The operational item pool is the one that provides the resources from which tests are selected. Vat management can be seen as a dynamic process, for its role is to maintain up-to-date information on each existing or newly created item with respect to its psychometrics, usage history, and availability status. As discussed before, test security is a concern for CAT because of its nature as a type of continuous testing. The concern becomes more severe if only one single static pool is used. Wang and Kolen (2001) summarized three approaches that have been proposed to ease the concern potentially caused by using one single static pool. In the first approach, the operational item pool is updated and refreshed with new items periodically. In the second approach, the old operational item pool is replaced by an alternate new item pool. In the third approach, multiple item pools are used simultaneously and rotated among different testing sites. No matter which approach is used, however, a key issue is to ensure the comparability of the different item pools so that the exams are fair to examinees who are administered the CAT from different item pools or different versions of the same item pool. Both Wainer (2000) and Wang and Kolen (2001) identified creating parallel item pools as one basic approach to achieving comparability for a CAT based on alternate pools. To create parallel item pools, the critical issue is to understand the characteristics of item pools that most affect the comparability of CAT scores. Wang and Kolen listed several factors, among which are item pool size and the item pool assembly procedures and constraints used to assemble the pools. Wang and Kolen also called for more research to identify the most influential characteristics of item pools that operationally define the concept of parallelism of pools and to study what level of parallelism is needed in order to achieve a desired level of comparability of CAT scores. Wainer (2000) implicitly pointed out that parallelism between new and previous CAT pools should consider both content and statistical characteristics.
Obviously, if a candidate ROP developed in this study is used as a model item pool to guide operational item pool assembly out of a vat, parallelism can be ensured automatically. An additional advantage of building on the item pool configuration of a candidate ROP is that an item pool assembled to match the desired features can provide the best attainable measurement quality for a given CAT algorithm. In fact, a similar idea has already been discussed by Way and his colleagues (2001), who proposed analyzing the characteristics, including the psychometric and item properties, of operational item pools that had functioned well in the past and then, based on those characteristics, generating a model pool to ensure the satisfactory functioning of a CAT. The item pool features identified by the method discussed in this study can conveniently serve the role of the model item pool proposed by Way and his colleagues (2001). When constructing multiple parallel pools, the characteristics of this model pool can serve as a template for the other item pools to mimic. What is more, because it is considered "optimal," using a candidate ROP developed in this study as the model pool is also expected to result automatically in reasonably decent measurement outcomes. In addition to serving as a model item pool, the characteristics of the candidate ROP can also help to manage a vat or simplify vat management. To achieve this goal, a vat can first be structured by bin. For example, a bin may take a certain width on the proficiency scale only, if the Rasch model is used, or take the shape of an ab-block, as discussed in this study, if the 3PL IRT model is used. Based on their features, items can be stored in the different bins or ab-blocks. The ideal number of items in each b-bin or ab-block can be worked out by multiplying the number of items per bin in the model pool by the number of operational item pools that the vat intends to support. Alternatively, if the operational item pools are intended to overlap to a certain degree, then the number of items in each b-bin or ab-block can easily be calculated from the intended overlap rates. When an operational item pool is assembled, the required number of items can be sampled from the corresponding b-bin or ab-bin in the vat. If the operational item pool needs to be refreshed, the items that are needed can be sought directly from the corresponding b-bins or ab-blocks, which store items that are equivalent in use. In summary, the optimal item pool configurations can not only guide the assembly of high-quality item pools but also provide useful insights into how to manage and maintain a vat. This is a research area worth further exploration.

5.3.2 IMPLICATION FOR ITEM WRITING AND DEVELOPMENT

One of the elements that the optimal blueprint includes is the distribution of items over each non-statistical attribute, for example, the attributes that an individual item should possess and the number of items that should possess a certain attribute or a combination of multiple attributes at the same time. These features can be used to guide the item writing process, in which item writers can be instructed to write items with the desired attributes based on the blueprint. For example, item writers can be instructed to write an item with a certain combination of attributes described in the test blueprint, as sketched below.
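One way to operationalize this is to sample item-writing assignments so that the commissioned items reproduce the blueprint's attribute distribution. In the sketch below, the attribute combinations and their target proportions are hypothetical placeholders for the actual blueprint entries.

import numpy as np

# Hypothetical blueprint: target proportions for attribute combinations
blueprint = {
    ("C3", "algebra", "multiple-choice"): 0.20,
    ("C11", "geometry", "multiple-choice"): 0.35,
    ("C12", "data", "gridded"): 0.25,
    ("C14", "number", "multiple-choice"): 0.20,
}

def writing_assignments(blueprint, n_items, seed=7):
    # Sample attribute combinations in proportion to the blueprint targets
    rng = np.random.default_rng(seed)
    combos = list(blueprint)
    probs = np.array([blueprint[k] for k in combos])
    picks = rng.choice(len(combos), size=n_items, p=probs / probs.sum())
    return [combos[i] for i in picks]

for combo in writing_assignments(blueprint, 5):
    print("write one item with attributes:", combo)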
Once a certain proportion of items have been written according to the specifications in the blueprint, they can be tested in a sequential manner so as to gauge how far they are from the intended goal. As van der Linden and his colleagues (1999) suggested, the best way to view these optimal item blueprints is to treat them as tools for continuous pool management rather than as a one-shot item pool design.

5.4 LIMITATION AND FUTURE RESEARCH DIRECTION

Like most studies, this study is not free from limitations. First, when the item pool characteristics of the candidate ROPs were evaluated, the item parameters were treated as known true values containing no estimation error. For the operational items, however, item parameter estimates were used, implying that the values contain estimation error. In CAT, the item selection criterion always involves optimization, for example, choosing the item that provides maximum item information. When the item estimates contain errors, a process known as "capitalization on chance" may occur during CAT item selection, which generally takes place over items calibrated with estimation error (van der Linden & Glas, 2000). Capitalization on chance exploits the fact that the optimal values of a function of the item parameters result from the true values of the parameters as well as from large estimation errors (van der Linden & Glas, 2000). It has been reported that capitalization on chance can result in worse ability estimates than expected (Hambleton & Jones, 1994). Since the items in the candidate ROPs were assumed to contain no estimation error, future studies should examine how estimation errors may affect the performance of these candidate ROPs. Second, the CAT algorithm considered in this study selects only independent, stand-alone items. In practice, a popular testing format is the item set, which refers to a group of items related to each other through a common stimulus, for example, a cluster of questions referring to a common reading passage or graph. Future studies should consider how to identify optimal features for set-based items. Third, as discussed in the literature review chapter, employing the idea of bins to determine item pool characteristics can be traced back to the early 1990s, to Boekkooi-Timminga (1990). In both Boekkooi-Timminga's study and the current study, the b-bins or ab-blocks were treated as mutually exclusive when collecting and tallying items. When a CAT algorithm searches for an item to administer, however, the search takes place over the whole item pool rather than within a single b-bin or ab-block. Therefore, the item pool size calculated by adding up all the items in each b-bin or ab-block may include more items than needed; this explains the need for a post-adjustment procedure to trim out the redundant items. The trimming procedure used in this study, though compatible with the use of maximum information-based item selection criteria, tended to remove items with a-values lower than .8. As a result, the candidate ROPs tended to have items with a-values that were too high, which may make the identified optimal features less useful in reality, because items with high a-values are hard to produce in practice. Thus, future research should work out a trimming procedure that can do this job while items are being sorted into the ab-bins.
It is expected that such a new trimming strategy would result in candidate ROPs with lower, more practically realistic a-values. Fourth, when using b-bins or ab-blocks to collect items, the items in each b-bin or ab-block were treated as equivalent because they shared very similar item information. This idea of treating items in the same bin or ab-block as interchangeable can be applied to other areas, for example, linear paper-and-pencil test assembly. To apply this idea, items can first be allocated to different b-bins or ab-blocks based on their information. Then the expected number of items can be selected out of each group based on the target test specification. Cheng and Chang (2008) have conducted some preliminary research in this area, and there are more research opportunities along this line. Finally, one of the major purposes of designing an optimal item pool for a CAT is to provide a reference establishing the best attainable measurement results that a CAT algorithm can produce. Despite the fact that historical data from the operational test, for example, the operational item pool characteristics and the target examinee ability distribution, are recommended as an indispensable part of the simulation, it is still very likely that the identified optimal features may appear too optimal. Future studies may want to consider developing a tolerance region around the identified optimum so that the results can be more useful in practice.

APPENDIX

(Figures A.1 through A.16 are histograms of item counts per ab-block: bars grouped by a-bin, with a-bin boundaries at .63246, .89443, 1.0954, 1.2649, 1.4142, 1.5492, 1.6733, 1.7889, 1.8974, 2, 2.0976, and 2.1909, plotted over b-bins from -2.8 to 2.8.)

Figure A.1 Item Number Distribution in each ab-block for ROP3
Figure A.2 Item Number Distribution in each ab-block for ROP4
Figure A.3 Item Number Distribution in each ab-block for ROP5
Figure A.4 Item Number Distribution in each ab-block for ROP6
Figure A.5 Item Number Distribution in each ab-block for ROP7
Figure A.6 Item Number Distribution in each ab-block for ROP8
Figure A.7 Item Number Distribution in each ab-block for ROP11
Figure A.8 Item Number Distribution in each ab-block for ROP12
Figure A.9 Item Number Distribution in each ab-block for ROP15
Figure A.10 Item Number Distribution in each ab-block for ROP16
Figure A.11 Item Number Distribution in each ab-block for ROP17
Figure A.12 Item Number Distribution in each ab-block for ROP18
Figure A.13 Item Number Distribution in each ab-block for ROP19
Figure A.14 Item Number Distribution in each ab-block for ROP20
[Figure A.15. Item Number Distribution in each ab-block for ROP23]
[Figure A.16. Item Number Distribution in each ab-block for ROP24]

[Figures A.17 through A.32 show, for each candidate ROP, paired frequency histograms of the item discrimination (a) and difficulty (b) parameter distributions. Only the captions are reproduced here.]

[Figure A.17. Item Discrimination and Difficulty Parameter Distributions for ROP3]
[Figure A.18. Item Discrimination and Difficulty Parameter Distributions for ROP4]
[Figure A.19. Item Discrimination and Difficulty Parameter Distributions for ROP5]
[Figure A.20. Item Discrimination and Difficulty Parameter Distributions for ROP6]
[Figure A.21. Item Discrimination and Difficulty Parameter Distributions for ROP7]
[Figure A.22. Item Discrimination and Difficulty Parameter Distributions for ROP8]
[Figure A.23. Item Discrimination and Difficulty Parameter Distributions for ROP11]
[Figure A.24. Item Discrimination and Difficulty Parameter Distributions for ROP12]
[Figure A.25. Item Discrimination and Difficulty Parameter Distributions for ROP15]
[Figure A.26. Item Discrimination and Difficulty Parameter Distributions for ROP16]
[Figure A.27. Item Discrimination and Difficulty Parameter Distributions for ROP17]
[Figure A.28. Item Discrimination and Difficulty Parameter Distributions for ROP18]
[Figure A.29. Item Discrimination and Difficulty Parameter Distributions for ROP19]
[Figure A.30. Item Discrimination and Difficulty Parameter Distributions for ROP20]
[Figure A.31. Item Discrimination and Difficulty Parameter Distributions for ROP23]
[Figure A.32. Item Discrimination and Difficulty Parameter Distributions for ROP24]

[Figures A.33 through A.38 plot the proportion of items carrying each non-statistical attribute (C1 through C18), comparing the operational pool (OP) with four candidate ROPs per panel; proportions range from 0 to about 0.7. Only the captions are reproduced here.]

[Figure A.33. Distributions of item attributes for the candidate ROP_1 to ROP_4]
[Figure A.34. Distributions of item attributes for the candidate ROP_5 to ROP_8]
[Figure A.35. Distributions of item attributes for the candidate ROP_9 to ROP_12]
[Figure A.36. Distributions of item attributes for the candidate ROP_13 to ROP_16]
[Figure A.37. Distributions of item attributes for the candidate ROP_17 to ROP_20]
'2" ' .-"‘ \ " ' ---:"‘E. ' o 2 r- . \ O‘ V“ \‘y -N- 0.1 — y \ _ o L Ll L l l l 1 1 l L A - 1 I ‘.~. 1 l l 1 C1 C2 C3 C4 CS C6 C7 C8 C9 C10 C11 C12C13C14 C15 C16 C17 C18 Attribute Figure A.38 Distributions of item attributes for the candidate ROP_21 to ROP_24 0.7 If T I I I I f I I I I I f I I I I + 0P 0.6 *- F. "'0" ROPZ‘I I ’t ..... 9 ..... ROP22 0.5 * ... _ °' ‘.\ "'V'“ ROP23 End ‘ t --+-- ROP24 r CI '\ 3 §O.3~ _ IL . -. i A I: V‘s?»- 0.2 '- ’I’ \\ I. ‘8.” “- 9 ' I .‘E‘: \ I “b”... «<7- “:fi" 0 1 l l I l l l l L I ‘D~'-'=Zr:fi.'r I“?! l 1 1 C1 C2 C3 C4 C5 C6 C7 C8 C9 C10 C11 C12C13C14 C15C16C17C18 Attribute 127 8o :3 So So So So 86 So So :3- 3o cod 8o. 2 mom :3 :3 so So So So So cod :3 cod So cod 3o- :Imom 3o 85 8o 85 go 8.0 So So So 3o 85 So- 85. 218m 85 3d 8d 85 8o 86 So cod 8d :3 So- 85 :.o- 38m .--auo---..m.§ ..... e do..-Masada.-.-a..m---._.oqo..-..e.a..o----._.q.os-.a..o----om..o-.sadiaomd. ........ wing... ..... - 8.? :3 so So So So So So So 85 85 8.? :.o- sumom 3o .86 8d 86 So So So So So 85 :3 86. Ed. 35m 85 So 23 So So So 85 So 8d 85 So 8.? 3o- 35m .--.Nmno:--mqme.---.~.o.o.--.mosm---adssad-.smorgasmad----adde.-.-awassadidmum. ........ «...QO ..... - So. 85 So So So :3 cod :3 So So So cod 26. 35m So 8d :3 so So So So So So So So 85 N3 Nunez 3o 86 8o cod :3 So 85 5o cod 2; :3 Se 25. 35m N3..--.E.m..-.-m°uo..-.mo..m Emma.-.«$530..-.w.o..o.-.-.~.q.os-.3.¢.-..&..m.eamuse”.-.figureseesmw ........ - m 3 n 3 _ m... a me. _- 3- N- 2- m- 8.. as. be? mnNQx 233328 2% 33 KO 8% 3 53m 8de Ngofihzeb _.< 033. 128 mod :3 Nod So 86 So so 86 :3 Se 86 No? 2.? .NN mom mod :8 Ned Nod 85 So So Nod 8o Nod So :3 z? N.NImom mod :3 Nod 85 Nod 86 So So Noe :3 So 8.? N_.? NNImom 3o :3 Nod 85 8d :3 So Nod so Nod So 8.? 2.? Ndom ..-wd..o-.-.w3 ..... _. one.--modsmmd.-.-a.m-.-mo.os--.ad.-.-mq.o-.-dmd---.&.d---.mo..¢..-..$..m. ........ a ~49.” ...... Nod 86 Nod Nod Nod So Nod So 86 So So 8.? 3.? 213m 85 86 So So So 8d :3 So 8o 23 Nod 5.? 2.? Edam 5o Nod Nod So :3 So :3 So :3 Nod 56 3.? ON? :Imom .--moqo ..... _ .3 ..... _. crammed.Ian?-..dessauce--qu.---.a.o-::&d.---.Ndd-.--wmd.-..o._..m. ........ e. 3.0m ...... So Nod 3o 86 So :3 3o Nod Noo So Nod 8.? 8.? flumom Nod Nod Nod 86 86 So so So so Nod 23 5.? 2.? idem 8d Nod Nod Nod Nod So cod :3 Noe Nod 55 No? ON? fldom ...Nwsmm.-.Nm..?-.-moqo..-warm.-.wm..o.-..$..m-.-mono?.3..o.-.-~¢.o-.ammo...-..eou?..-m.m?.s.wwuow..-i-i..._d ........ - N mN N w. _ m... a m? _- m..- N- m.N- m- 8.. so: mam—Ede. u. :5 _.< 2.3 129 :.0 00.0 00.0 00.0 00.0 00.0 00.0 00.0 00.0 00.0 00.0 00.0 0N0 N_ 00.0 00.0 00.0 00.0 00.0 00.0 00.0 00.0 00.0 00.0 00.0 00.0 :.0 0N0 :Imom S0 :.0 00.0 00.0 00.0 00.0 00.0 00.0 00.0 00.0 00.0 00.0 0N0 0350 00.0 :.0 00.0 00.0 00.0 00.0 00.0 00.0 00.0 00.0 00.0 00.0 00.0 0:090 200.0. -20.030500”?-0.0..0..-.0¢.0-..0m..0-..00..0.--.00..0.--.00..0.-.-.0.0...0---gas-0.3 ..................... ...- ........... 00.0 00.0 00.0 00.0 00.0 00.0 00.0 00.0 00.0 00.0 00.0 :.0 00.0 00.0 00.0 00.0 00.0 00.0 00.0 00.0 3.0 00.0 00.0 N00 00.0 00.0 00.0 00.0 00.0 00.0 00.0 00.0 00.0 00.0 00.0 :.0 -.00..0. --0040..-.0.0q0.--00..0---0.0..0-..00..0.-..0.0..0.--wads-00.0.-.-.0.0..0-.-.wm..0.--.0.0..0 ..................... ..- ........... 00.0 00.0 00.0 00.0 00.0 00.0 00.0 00.0 00.0 00.0 00.0 00.0 00.0 00.0 00.0 00.0 00.0 00.0 00.0 00.0 00.0 00.0 00.0 00.0 00.0 00.0 00.0 00.0 00.0 00.0 00.0 00.0 00.0 00.0 00.0 2.0 10000. .-Nm0..-....Nu0..-0.N...0...0440-.4.00.0.-..00..0.-..00..0;;00..0.-.-an0..-.0.N...0--.mw..0 ................................... 0 0.N N 0.0 _ 0.0 0 0.0- _- 0.? 
[Table A.2. Conditional results for the OP and the candidate ROPs at ability levels from -3 to 3 in steps of 0.5. The rotated table is not legible in the source, so its values are not reproduced here.]

[Table A.3. Conditional results for the OP and the candidate ROPs at ability levels from -3 to 3 in steps of 0.5. The rotated table is not legible in the source, so its values are not reproduced here.]

REFERENCES

Ariel, A., Veldkamp, B. P., & van der Linden, W. J. (2004). Constructing rotating item pools for constrained adaptive testing. Journal of Educational Measurement, 41, 345-359.
Belov, D. I., & Armstrong, R. D. (2009). Direct and inverse problems of item pool design for computerized adaptive testing. Educational and Psychological Measurement, 69(4), 544-547.

Binet, A., & Simon, Th. A. (1905). Méthode nouvelle pour le diagnostic du niveau intellectuel des anormaux. L'Année Psychologique, 11, 191-244.

Birnbaum, A. (1968). Some latent trait models and their use in inferring an examinee's ability. In F. M. Lord & M. R. Novick, Statistical theories of mental test scores (Chapters 17-20). Reading, MA: Addison-Wesley.

Bock, R. D., & Mislevy, R. (1982). Adaptive EAP estimation of ability in a microcomputer environment. Applied Psychological Measurement, 6(4), 431-444.

Boekkooi-Timminga, E. (1990). A method for designing IRT-based item banks (Research Rep. No. 90-7). The Netherlands: University of Twente, Department of Education.

Buyske, S. (2005). Optimal design in educational testing. In M. P. F. Berger & W. K. Wong (Eds.), Applied optimal designs (pp. 1-19). West Sussex, UK: Wiley.

Chang, H. (2007). Book review: Linear models for optimal test design. Psychometrika, 72, 279-281.

Chang, H., & Ying, Z. (1999). a-Stratified multistage computerized adaptive testing. Applied Psychological Measurement, 23, 211-222.

Chang, H., & Ying, Z. (2009). Nonlinear sequential designs for logistic item response theory models with applications to computerized adaptive tests. The Annals of Statistics, 37, 1466-1488.

Chang, H., Qian, J., & Ying, Z. (2001). a-Stratified multistage computerized adaptive testing with b blocking. Applied Psychological Measurement, 25, 333-341.

Chang, S., & Ansley, T. N. (2003). A comparative study of item exposure control methods in computerized adaptive testing. Journal of Educational Measurement, 40(1), 71-103.

Chang, S., & Twu, B. Y. (1998, September). A comparative study of item exposure control methods in computerized adaptive testing (ACT Research Report Series, ACT-RR-98-3). Iowa City, IA: ACT.

Chen, S., & Ankenmann, R. D. (2004). Effects of practical constraints on item selection rules at the early stages of computerized adaptive testing. Journal of Educational Measurement, 41(2), 149-174.

Chen, S. Y., Ankenmann, R. D., & Spray, J. A. (2003). The relationship between item exposure and test overlap in computerized adaptive testing. Journal of Educational Measurement, 40, 129-145.

Cheng, Y., & Chang, H. (2008). A new heuristic for parallel form assembly based on information curve matching. Paper presented at the annual meeting of the National Council on Measurement in Education, Chicago, IL.

Cheng, Y., & Chang, H. (2009). The maximum priority index method for severely constrained item selection in computerized adaptive testing. British Journal of Mathematical and Statistical Psychology, 62, 369-383.

Cheng, Y., Chang, H., & Yi, Q. (2007). Two-phase item selection procedure for flexible content balancing in CAT. Applied Psychological Measurement, 31, 467-482.

Davey, T., & Nering, M. (2002). Controlling item exposure and maintaining item security. In C. N. Mills, M. T. Potenza, J. J. Fremer, & W. C. Ward (Eds.), Computer-based testing: Building the foundation for future assessments (pp. 165-191). Mahwah, NJ: Erlbaum.

Davey, T., & Parshall, C. G. (1995). New algorithms for item selection and exposure control with computerized adaptive testing. Paper presented at the annual meeting of the American Educational Research Association, San Francisco, CA.
Dodd, B. G., De Ayala, R. J., & Koch, W. R. (1995). Computerized adaptive testing with polytomous items. Applied Psychological Measurement, 19, 5-22.

Eggen, T. J. H. M., & Straetmans, G. J. J. M. (2000). Computerized adaptive testing for classifying examinees into three categories. Educational and Psychological Measurement, 60, 713-734.

Eignor, D. R., Way, W. D., Stocking, M. L., & Steffen, M. (1993). Case studies in computer adaptive test design through simulation (Research Rep. No. 93-56). Princeton, NJ: Educational Testing Service.

Embretson, S. E. (2001). Generating abstract reasoning items with cognitive theory. In S. H. Irvine & P. C. Kyllonen (Eds.), Item generation for test development. Mahwah, NJ: Erlbaum.

Flaugher, R. (2000). Item pools. In H. Wainer (Ed.), Computerized adaptive testing: A primer (pp. 37-59). Mahwah, NJ: Lawrence Erlbaum.

French, B., & Thompson, T. (2003, April). The evaluation of exposure control procedures for an operational CAT. Poster presented at the annual meeting of the American Educational Research Association (AERA), Chicago, IL.

Georgiadou, E., Triantafillou, E., & Economides, A. A. (2007). A review of item exposure control strategies for computerized adaptive testing from 1983 to 2005. The Journal of Technology, Learning, and Assessment, 5(8). Retrieved from http://escholarship.bc.edu/cgi/viewcontent.cgi?article=1093&context=jtla

Gorin, J. S., Dodd, B. G., Fitzpatrick, S. J., & Shieh, Y. (2005). Computerized adaptive testing with the partial credit model: Estimation procedure, population distributions, and item pool characteristics. Applied Psychological Measurement, 29(6), 433-456.

Gu, L. (2007). Designing optimal item pools for computerized adaptive tests with exposure controls. Unpublished doctoral dissertation, Michigan State University.

Hambleton, R. K., & Jones, R. W. (1994). Item parameter estimation errors and their influence on test information functions. Applied Measurement in Education, 7, 171-186.

Hau, K. T., & Chang, H. (2001). Item selection in computerized adaptive testing: Should more discriminating items be used first? Journal of Educational Measurement, 38(3), 249-266.

Hetter, R., & Sympson, B. (1997). Item exposure control in CAT-ASVAB. In W. Sands, B. Waters, & J. McBride (Eds.), Computerized adaptive testing: From inquiry to operation (pp. 141-144). Washington, DC: American Psychological Association.

Jensema, C. J. (1972). An application of latent trait mental test theory to the Washington Pre-College Testing Program. Unpublished doctoral dissertation, University of Washington.

Jensema, C. J. (1977). Bayesian tailored testing and the influence of item bank characteristics. Applied Psychological Measurement, 1, 111-120.

Kingsbury, G. G., & Zara, A. R. (1991). Procedures for selecting items for computerized adaptive tests. Applied Measurement in Education, 2, 359-375.

Leung, C. K., Chang, H., & Hau, K. T. (2003a). Incorporation of content balancing requirements in stratification designs for computerized adaptive testing. Educational and Psychological Measurement, 63, 257-270.

Leung, C. K., Chang, H., & Hau, K. T. (2003b). Computerized adaptive testing: A comparison of three content balancing methods. The Journal of Technology, Learning, and Assessment, 2(5).

Lord, F. M. (1970). Some test theory for tailored testing. In W. H. Holtzman (Ed.), Computer-assisted instruction, testing, and guidance (pp. 139-183). New York: Harper & Row.

Lord, F. M. (1971a). Robbins-Monro procedures for tailored testing. Educational and Psychological Measurement, 31, 3-31.
Lord, F. M. (1971b). The self-scoring flexilevel test. Journal of Educational Measurement, 8, 147-151.

Lord, F. M. (1977). A broad-range tailored test of verbal ability. Applied Psychological Measurement, 1, 95-100.

Lord, F. M. (1980). Applications of item response theory to practical testing problems. Hillsdale, NJ: Lawrence Erlbaum.

Luecht, R. M. (1998). A framework for exploring and controlling risks associated with test item exposure over time. Paper presented at the annual meeting of the National Council on Measurement in Education.

McBride, J. R., & Martin, J. T. (1983). Reliability and validity of adaptive ability tests in a military setting. In D. J. Weiss (Ed.), New horizons in testing (pp. 223-236). New York, NY: Academic Press.

McBride, J. R., & Weiss, D. J. (1976). Some properties of a Bayesian adaptive ability testing strategy (Research Rep. No. 76-1). Minneapolis, MN: University of Minnesota, Psychometric Methods Program, Department of Psychology.

Nering, M. L., Davey, T., & Thompson, T. (1998, June). A hybrid method for controlling item exposure in computerized adaptive testing. Paper presented at the annual meeting of the Psychometric Society, Urbana, IL.

Owen, R. J. (1975). A Bayesian sequential procedure for quantal response in the context of adaptive mental testing. Journal of the American Statistical Association, 70, 351-356.

Parshall, C., Davey, T., & Nering, M. (1998). Test development exposure control for adaptive tests. Paper presented at the annual meeting of the National Council on Measurement in Education, San Diego, CA.

Patsula, L. N., & Steffen, M. (1997). Maintaining item and test security in a CAT environment: A simulation study. Paper presented at the annual meeting of the National Council on Measurement in Education.

Reckase, M. D. (1976). The effect of item pool characteristics on the operation of a tailored testing procedure. Paper presented at the spring meeting of the Psychometric Society, Murray Hill, NJ.

Reckase, M. D. (2003). Item pool design for computerized adaptive tests. Paper presented at the annual meeting of the National Council on Measurement in Education, Chicago, IL.

Reckase, M. D., & He, W. (2004). The ideal item pool for the NCLEX-RN examination: Report to NCSBN. East Lansing, MI: Michigan State University.

Reckase, M. D., & He, W. (2005). Ideal item pool design for the NCLEX-RN exam. East Lansing, MI: Michigan State University.

Reckase, M. D., & He, W. (2009a). Optimal item pool design for the 2009 NCLEX exam: Report to the National Council of State Boards of Nursing (NCSBN). East Lansing, MI: Michigan State University.

Reckase, M. D., & He, W. (2009b). The influence of item pool quality on the functioning of computerized adaptive tests. Paper presented at the annual meeting of the Psychometric Society, Cambridge, UK.

Revuelta, J., & Ponsoda, V. (1998). A comparison of item exposure control methods in computerized adaptive testing. Journal of Educational Measurement, 34, 311-327.

Robin, F., van der Linden, W. J., Eignor, D. R., Steffen, M., & Stocking, M. L. (2005). A comparison of two procedures for constrained adaptive test construction (ETS Research Rep. No. RR-04-39). Princeton, NJ: Educational Testing Service.

Shin, C., Chien, Y., Way, W. D., & Swanson, L. (2009). Weighted penalty model for content balancing in CATs. Pearson. Retrieved from http://www.pearsonedmeasurement.com/downloads/research/Weighted%20Penalty%20Model.pdf

Stocking, M. L. (1994). Three practical issues for modern adaptive testing item pools (ETS Research Rep. No. 94-05). Princeton, NJ: Educational Testing Service.
Stocking, M. L., & Lewis, C. (1995). A new method of controlling item exposure in computerized adaptive testing (Research Rep. No. 95-25). Princeton, NJ: Educational Testing Service.

Stocking, M. L., & Lewis, C. (1998). Controlling item exposure conditional on ability in computerized adaptive testing. Journal of Educational and Behavioral Statistics, 23(1), 57-75.

Stocking, M. L., & Lewis, C. (2000). Methods of controlling the exposure of items in CAT. In W. J. van der Linden & C. A. W. Glas (Eds.), Computerized adaptive testing: Theory and practice (pp. 163-182). The Netherlands: Kluwer Academic Publishers.

Stocking, M. L., & Swanson, L. (1993). A method for severely constrained item selection in adaptive testing. Applied Psychological Measurement, 17, 277-292.

Stocking, M. L., & Swanson, L. (1998). Optimal design of item banks for computerized adaptive tests. Applied Psychological Measurement, 22, 271-279.

Sympson, J. B., & Hetter, R. D. (1985). Controlling item-exposure rates in computerized adaptive testing. Paper presented at the 27th Annual Meeting of the Military Testing Association, San Diego, CA.

Urry, V. W. (1977). Tailored testing: A successful application of latent trait theory. Journal of Educational Measurement, 14(2), 181-196.

van der Linden, W. J. (1998). Optimal assembly of psychological and educational tests. Applied Psychological Measurement, 22, 195-211.

van der Linden, W. J. (2000). Constrained adaptive testing with shadow tests. In W. J. van der Linden & C. A. W. Glas (Eds.), Computerized adaptive testing: Theory and practice (pp. 27-52). Boston: Kluwer.

van der Linden, W. J. (2005). Linear models for optimal test design. New York: Springer.

van der Linden, W. J. (2005a). A comparison of item-selection methods for adaptive tests with content constraints. Journal of Educational Measurement, 42, 283-302.

van der Linden, W. J., & Chang, H. (2005b). Implementing content constraints in alpha-stratified adaptive testing using a shadow test approach (Computerized Testing Report 01-09). Law School Admission Council. Retrieved from http://www.lsacnet.org/Research/ct/implementing-content-constraints-in-alpha-stratified-adaptive-testing-using-shadow-test-approach.pdf

van der Linden, W. J., & Glas, C. A. W. (2000). Capitalization on item calibration error in adaptive testing. Applied Measurement in Education, 13(1), 35-53.

van der Linden, W. J., & Pashley, P. J. (2000). Item selection and ability estimation in adaptive testing. In W. J. van der Linden & C. A. W. Glas (Eds.), Computerized adaptive testing: Theory and practice (pp. 1-25). Boston: Kluwer.

van der Linden, W. J., & Reese, L. (1998). A model for optimal constrained adaptive testing. Applied Psychological Measurement, 22(3), 259-270.

van der Linden, W. J., Ariel, A., & Veldkamp, B. P. (2006). Assembling a CAT item pool as a set of linear tests. Journal of Educational and Behavioral Statistics, 31, 81-99.

van der Linden, W. J., Veldkamp, B. P., & Reese, L. M. (2000). An integer programming approach to item bank design. Applied Psychological Measurement, 24(2), 139-150.

Veldkamp, B. P., & van der Linden, W. J. (2000). Designing item pools for computerized adaptive testing. In W. J. van der Linden & C. A. W. Glas (Eds.), Computerized adaptive testing: Theory and practice (pp. 149-162). The Netherlands: Kluwer Academic Publishers.

Wainer, H. (2000). Rescuing computerized adaptive testing by breaking Zipf's law. Journal of Educational and Behavioral Statistics, 25, 203-224.
Wang, T., & Kolen, M. J. (2001). Evaluating comparability in computerized adaptive testing: Issues, criteria and example. Journal of Educational Measurement, 38(1), 19-49.

Wang, T., & Vispoel, W. P. (1998). Properties of ability estimation methods in computerized adaptive testing. Journal of Educational Measurement, 35(2), 109-135.

Way, W. D. (1998). Protecting the integrity of computerized testing item pools. Educational Measurement: Issues and Practice, 17, 17-27.

Way, W. D., Steffen, M., & Anderson, G. S. (1998). Developing, maintaining, and renewing the item inventory to support computer-based testing. Paper presented at the colloquium Computer-Based Testing: Building the Foundation for Future Assessments, Philadelphia, PA.

Way, W. D., Swanson, L., Steffen, M., & Stocking, M. L. (2001). Refining a system for computerized adaptive testing pool creation (Research Rep. No. 01-18). Princeton, NJ: Educational Testing Service.

Weiss, D. J. (1973). The stratified adaptive computerized ability test (Research Rep. No. 73-3). Minneapolis: University of Minnesota, Department of Psychology, Psychometric Methods Program.

Weiss, D. J. (1976). Adaptive testing research at Minnesota: Overview, recent results, and future directions. In C. L. Clark (Ed.), Proceedings of the First Conference on Computerized Adaptive Testing (pp. 24-35). Washington, DC: United States Civil Service Commission.

Weiss, D. J. (1978). Proceedings of the 1977 Computerized Adaptive Testing Conference. Minneapolis: University of Minnesota, Department of Psychology, Computerized Adaptive Testing Laboratory.

Weiss, D. J. (1982). Improving measurement quality and efficiency with adaptive testing. Applied Psychological Measurement, 6, 473-492.

Wise, S. L., & Kingsbury, G. G. (2000). Practical issues in developing and maintaining a computerized adaptive testing program. Psicologica, 21, 135-155. Retrieved from http://www.uv.es/psicologica/articulosdfiyz.OO/wise.pdf

Xing, D., & Hambleton, R. K. (2004). Impacts of test design, item quality, and item bank size on the psychometric properties of computer-based credentialing examinations. Educational and Psychological Measurement, 64(1), 5-21.