INCORPORATING MIXED ITEM FORMATS IN CAT: A COMPARISON OF SHADOW TEST AND BIN-STRUCTURED APPROACHES

By

Xin Luo

A DISSERTATION

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of

Measurement and Quantitative Methods, Doctor of Philosophy

2015

ABSTRACT

INCORPORATING MIXED ITEM FORMATS IN CAT: A COMPARISON OF SHADOW TEST AND BIN-STRUCTURED APPROACHES

By Xin Luo

Current operational CATs mainly use dichotomous items. However, including polytomous and set-based items in CAT is attracting growing attention, and few studies have investigated how to assemble a mixed-item-format CAT efficiently. The requirements for assembling a CAT are often in conflict with each other; the test assembly approach should advance progress toward all objectives. The shadow test approach (STA) is one of the most appealing CAT assembly methods because it is flexible and can handle many complex constraints simultaneously. However, the STA solves the optimization problem anew for each examinee, which may cause problems in operational CATs, such as context effects and difficulty in item replacement. These problems can be partially solved by the bin-structured method, which finds a single standardized solution that divides the item pool and solves the constrained combinatorial optimization problem. Although the bin-structured method is promising for future applications, it is relatively new: research on it is still rare, none of it involves CAT with mixed item formats, and no study has investigated which factors may influence the quality of its results.

This study compared mixed-item-format CAT with dichotomous-item-based CAT to determine whether the mixed CAT has advantages over the dichotomous-item-based CAT and what challenges it brings. Furthermore, it compared three CAT test assembly approaches, namely the STA, a combination of the STA and the bin-structured method, and the bin-structured method alone, in the context of CAT containing mixed item formats. The psychometric models used in the item pool, the item parameter distribution, the test length, and the imposed test constraints were manipulated to simulate various real test situations. The results supported incorporating polytomous and set-based items into CAT, as the mixed CAT had higher accuracy and stability than the binary CAT. However, the mixed CAT had a fairly skewed exposure rate distribution, and further analysis showed that the highly exposed items were all polytomously scored items. Another relevant problem for the mixed CAT was its low item usage efficiency, as many items (mainly dichotomous items) were unused. This study also supported the application of the bin-structured method in mixed CAT, as it produced outcomes equal to or better than the traditional STA. Meanwhile, it can also simplify the computation involved in CAT, standardize the look of the test, provide good control over the content sequence in advance, and facilitate item replacement and exposure control.

Copyright by XIN LUO 2015

ACKNOWLEDGMENTS

I am deeply indebted to my academic advisor and dissertation chair, Dr. Mark Reckase, for providing me the great opportunity to pursue advanced study in MQM, Michigan State University. I have benefited tremendously from his wisdom, insight, and knowledge. I appreciate his guidance in academics, in my dissertation, and also in my career development.
Without his constant support, encouragement, warm care, and help, this work would not have been possible. I also would like to express my sincere appreciation to my dissertation committee members, Dr. Kimberly Maier and Dr. Richard Houang in MQM, MSU, Dr. Joseph Martineau at the Center for Assessment, and Dr. Timothy Davey at ETS, for their superb instructions and suggestions. Their insightful comments and reviews helped me greatly in the dissertation work. I am also deeply grateful to Dr. Spyros Konstantopoulos, Dr. Edward Roeber, and Dr. Tenko Raykov, who have been providing me with support and advice during my doctoral study. I also thank Dr. Hongyun Liu and Dr. Tao Xin from Beijing Normal University for their guidance since my undergraduate study. My gratitude also extends to the psychometrics research teams at CTB/McGraw-Hill, ETS, and the National Council of State Boards of Nursing for providing me with valuable chances to work on their research and internship projects. My special thanks go to Qi Diao, Hao Ren, Ada Woo, Doyoung Kim, Qian Hong, Xiao Luo, Lixiong Gu, Longjuan Liang, Priya Kannan, Richard Tannenbaum, and Wei He. I also thank Dr. Wang at Qualcomm for his great suggestions on my work.

I appreciate my friends, Liyang Mao, Keyin Wang, Tingqiao Chen, Chi Chang, Jiahui Zhang, Emre Gonulates, Shuyi Chen, Wei Li, Xinge Ji, Xuechun Zhou, Unhee Ju, Xi Wang, Fei Chen, and Huili Liu, for lighting up my life during the past five years. I would like to give special thanks to Guangwei Sun for his care throughout my doctoral study and his contribution to the editorial work of my dissertation. I also thank my significant friends Wen Guo, Yangbing Xu, and Tong Lu from Beijing Normal University for their immeasurable support. Finally, I would like to thank my parents and my grandparents for their unquestioning love.

TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES
KEY TO ABBREVIATIONS
Chapter 1: Introduction
Chapter 2: Literature Review
  2.1 Item Format
    2.1.1 Dichotomous Items
    2.1.2 Polytomous Items
    2.1.3 Set-based Items
  2.2 Introduction to Computerized Adaptive Testing
    2.2.1 A Brief History of CAT
    2.2.2 Advantages of CAT
    2.2.3 Procedure for Administering a CAT
      Item Pool
      Psychometric Model
      Item Selection Rule
      Starting Point
      Scoring Rule
      Stopping Rule
  2.3 CAT Assembly Approaches
    2.3.1 Goals of CAT Assembly
    2.3.2 Assembly Design in CAT
      STA
      Bin-Structured Method
Chapter 3: Methods and Procedures
  3.1 Generate Item Pools
    3.1.1 Data Source
    3.1.2 Generate the Original Item Pool
    3.1.3 Recalibrated Item Pool
    3.1.4 Nested Difficulty 3PLM Pool
    3.1.5 Nested Difficulty 2PLM Pool
    3.1.6 Balanced Item Pool
    3.1.7 Heterogeneous Testlet Pool
  3.2 Simulation of CAT Procedures
    3.2.1 Long Tests
      Original Pool
      Nested Difficulty 3PLM Pool
      Recalibrated Pool
      Nested Difficulty 2PLM Pool
      Balanced Item Pool
      Heterogeneous Testlet Pool
    3.2.2 Short Tests
  3.3 Evaluation Criteria
    3.3.1 Measurement Criteria
      Conditional Statistics
      Overall Statistics
    3.3.2 Content Balance
    3.3.3 Test Security
    3.3.4 Item Usage
Chapter 4: Results
  4.1 Research Question 1
    4.1.1 Measurement Criteria
      Conditional Result
      Overall Result
    4.1.2 Test Security Criteria
      Item Exposure
      Overlap Rate
    4.1.3 Item Usage
  4.2 Research Question 2
    4.2.1 Measurement Criteria
      Conditional Result: (1) Conditional Bias; (2) Conditional Absolute Bias (CAB); (3) Conditional Standard Error of Measurement (CSEM); (4) Test Information Conditional Standard Error of Measurement (TCSEM)
      Overall Result: (1) Bias; (2) Mean Absolute Bias (MAB); (3) Root Mean Squared Error (RMSE)
    4.2.2 Content Balance
    4.2.3 Test Security
      Distribution of Item Exposure Rate: (1) Original Item Pool; (2) Nested Difficulty 3PLM Pool; (3) Recalibrated Pool; (4) Nested Difficulty 2PLM Pool; (5) Balanced Pool; (6) Heterogeneous Pool
      Overlap Rate: (1) Overall Overlap Rate; (2) Conditional Overlap Rate (COR)
    4.2.4 Item Usage
Chapter 5: Summary and Discussion
  5.1 Summary of This Study
    5.1.1 Measurement Criteria
    5.1.2 Content Balance
    5.1.3 Item Exposure Rate Distribution
    5.1.4 Item Usage
  5.2 Discussion of Major Findings
    5.2.1 Incorporating Polytomous Items into CAT
    5.2.2 Comparing STA and Bin-Structured Method
    5.2.3 Developing Bins Properly
  5.3 Implications and Limitations
BIBLIOGRAPHY

LIST OF TABLES

Table 2.1 Item Parameters for GPCM Items
Table 2.2 An Example for CAT Assembly Using STA (van der Linden & Reese, 1998)
Table 2.3 Item Pool (Davey, 2005)
Table 2.4 CAT Constraints (Davey, 2005)
Table 2.5 An Example for a Template (Davey, 2005)
Table 2.6 Dividing Items into Bins (Davey, 2005)
Table 2.7 Example of First Five Items Selected (Davey, 2005)
Table 3.1 OSSLT Test Specification
Table 3.2 Modified Test Specification for Heterogeneous Testlet Pool
Table 3.3 Test Specification for 22-Item CAT
Table 4.1 Overall Bias of Ability Estimate
Table 4.2 Overall Mean Absolute Bias (MAB)
Table 4.3 RMSE of Estimate
Table 4.4 Number of Items Achieving the Highest Exposure Rate
Table 4.5 Overall Overlap Rate
Table 4.6 Proportion of Unused Items
Table 5.1 Comparing Item Usage of Combination Method in Different Pools

LIST OF FIGURES

Figure 2.1 Steps for Administering a CAT (He, 2010)
Figure 2.2 ICCs for 2PLM Items
Figure 2.3 ICC for 3PLM Item
Figure 2.4 Item Category Response Probability Curves for a = 0.93, b = -1.28, d = [0, 1.3, 1.07, -2.37]
Figure 2.5 Information for 2PLM Items
Figure 2.6 Item Information for Polytomous Items with GPCM
Figure 3.1 Test Information for OSSLT 2015 (English)
Figure 3.2 Original Pool Information
Figure 3.3 Ability Distribution of English Population (OSSLT, 2005)
Figure 3.4 Recalibrated Pool Information
Figure 3.5 Balanced Pool Information
Figure 3.6 Summary of Six Pools
Figure 3.7 Five CAT Simulations in the Original Pool
Figure 4.1(a) Conditional Bias for the Original Pool, 44 Items
Figure 4.1(b) Conditional Bias for the Nested Difficulty 3PLM Pool, 44 Items
Figure 4.1(c) Conditional Bias for the Recalibrated Pool, 44 Items
Figure 4.1(d) Conditional Bias for the Nested Difficulty 2PLM Pool, 44 Items
Figure 4.1(e) Conditional Bias for the Balanced Pool, 44 Items
Figure 4.1(f) Conditional Bias for the Heterogeneous Pool, 44 Items
Figure 4.1(g) Conditional Bias for the Original Pool, 22 Items
Figure 4.1(h) Conditional Bias for the Nested Difficulty 3PLM Pool, 22 Items
Figure 4.1(i) Conditional Bias for the Recalibrated Pool, 22 Items
Figure 4.1(j) Conditional Bias for the Nested Difficulty 2PLM Pool, 22 Items
Figure 4.1(k) Conditional Bias for the Balanced Pool, 22 Items
Figure 4.2(a) Conditional Absolute Bias for the Original Pool, 44 Items
Figure 4.2(b) Conditional Absolute Bias for the Nested Difficulty 3PLM Pool, 44 Items
Figure 4.2(c) Conditional Absolute Bias for the Recalibrated Pool, 44 Items
Figure 4.2(d) Conditional Absolute Bias for the Nested Difficulty 2PLM Pool, 44 Items
Figure 4.2(e) Conditional Absolute Bias for the Balanced Pool, 44 Items
Figure 4.2(f) Conditional Absolute Bias for the Heterogeneous Pool, 44 Items
Figure 4.2(g) Conditional Absolute Bias for the Original Pool, 22 Items
Figure 4.2(h) Conditional Absolute Bias for the Nested Difficulty 3PLM Pool, 22 Items
Figure 4.2(i) Conditional Absolute Bias for the Recalibrated Pool, 22 Items
Figure 4.2(j) Conditional Absolute Bias for the Nested Difficulty 2PLM Pool, 22 Items
Figure 4.2(k) Conditional Absolute Bias for the Balanced Pool, 22 Items
Figure 4.3(a) Conditional SEM for the Original Pool, 44 Items
Figure 4.3(b) Conditional SEM for the Nested Difficulty 3PLM Pool, 44 Items
Figure 4.3(c) Conditional SEM for the Recalibrated Pool, 44 Items
Figure 4.3(d) Conditional SEM for the Nested Difficulty 2PLM Pool, 44 Items
Figure 4.3(e) Conditional SEM for the Balanced Pool, 44 Items
Figure 4.3(f) Conditional SEM for the Heterogeneous Pool, 44 Items
Figure 4.3(g) Conditional SEM for the Original Pool, 22 Items
Figure 4.3(h) Conditional SEM for the Nested Difficulty 3PLM Pool, 22 Items
Figure 4.3(i) Conditional SEM for the Recalibrated Pool, 22 Items
Figure 4.3(j) Conditional SEM for the Nested Difficulty 2PLM Pool, 22 Items
Figure 4.3(k) Conditional SEM for the Balanced Pool, 22 Items
Figure 4.4(a) TCSEM for the Original Pool, 44 Items
Figure 4.4(b) TCSEM for the Nested Difficulty 3PLM Pool, 44 Items
Figure 4.4(c) TCSEM for the Recalibrated Pool, 44 Items
Figure 4.4(d) TCSEM for the Nested Difficulty 2PLM Pool, 44 Items
Figure 4.4(e) TCSEM for the Balanced Pool, 44 Items
Figure 4.4(f) TCSEM for the Heterogeneous Pool, 44 Items
Figure 4.4(g) TCSEM for the Original Pool, 22 Items
Figure 4.4(h) TCSEM for the Nested Difficulty 3PLM Pool, 22 Items
Figure 4.4(i) TCSEM for the Recalibrated Pool, 22 Items
Figure 4.4(j) TCSEM for the Nested Difficulty 2PLM Pool, 22 Items
Figure 4.4(k) TCSEM for the Balanced Pool, 22 Items
Figure 4.5(a) CTI for the Original Pool, 44 Items
Figure 4.5(b) CTI for the Nested Difficulty 3PLM Pool, 44 Items
Figure 4.5(c) CTI for the Recalibrated Pool, 44 Items
Figure 4.5(d) CTI for the Nested Difficulty 2PLM Pool, 44 Items
Figure 4.5(e) CTI for the Balanced Pool, 44 Items
Figure 4.5(f) CTI for the Heterogeneous Pool, 44 Items
Figure 4.5(g) CTI for the Original Pool, 22 Items
Figure 4.5(h) CTI for the Nested Difficulty 3PLM Pool, 22 Items
Figure 4.5(i) CTI for the Recalibrated Pool, 22 Items
Figure 4.5(j) CTI for the Nested Difficulty 2PLM Pool, 22 Items
Figure 4.5(k) CTI for the Balanced Pool, 22 Items
Figure 4.6(a) Exposure Rate Distribution for the Original Pool, 44 Items
Figure 4.6(b) Exposure Rate Distribution for the Original Pool, 22 Items
Figure 4.7(a) Exposure Rate Distribution for the Nested Difficulty 3PLM Pool, 44 Items
Figure 4.7(b) Exposure Rate Distribution for the Nested Difficulty 3PLM Pool, 22 Items
Figure 4.8(a) Exposure Rate Distribution for the Recalibrated Pool, 44 Items
Figure 4.8(b) Exposure Rate Distribution for the Recalibrated Pool, 22 Items
Figure 4.9(a) Exposure Rate Distribution for the Nested Difficulty 2PLM Pool, 44 Items
Figure 4.9(b) Exposure Rate Distribution for the Nested Difficulty 2PLM Pool, 22 Items
Figure 4.10(a) Exposure Rate Distribution for the Balanced Pool, 44 Items
Figure 4.10(b) Exposure Rate Distribution for the Balanced Pool, 22 Items
Figure 4.11 Exposure Rate Distribution for the Heterogeneous Pool, 44 Items
Figure 4.12(a) COR for the Original Pool, 44 Items
Figure 4.12(b) COR for the Nested Difficulty 3PLM Pool, 44 Items
Figure 4.12(c) COR for the Recalibrated Pool, 44 Items
Figure 4.12(d) COR for the Nested Difficulty 2PLM Pool, 44 Items
Figure 4.12(e) COR for the Balanced Pool, 44 Items
Figure 4.12(f) COR for the Heterogeneous Pool, 44 Items
Figure 4.12(g) COR for the Original Pool, 22 Items
Figure 4.12(h) COR for the Nested Difficulty 3PLM Pool, 22 Items
Figure 4.12(i) COR for the Recalibrated Pool, 22 Items
Figure 4.12(j) COR for the Nested Difficulty 2PLM Pool, 22 Items
Figure 4.12(k) COR for the Balanced Pool, 22 Items
Figure 5.1 Distribution of Item Information at θ = -1 in the Recalibrated Pool
Figure 5.2 Distribution of Item Information at θ = -1 in the Nested Difficulty 3PLM Pool

KEY TO ABBREVIATIONS

2PLM: Two-Parameter Logistic Model
3PLM: Three-Parameter Logistic Model
ASVAB: Armed Services Vocational Aptitude Battery
CAB: Conditional Absolute Bias
CAT: Computerized Adaptive Testing
CB: Conditional Bias
CCAT: Constrained Computerized Adaptive Testing
CSEM: Conditional Standard Error of Measurement
CTI: Conditional Test Information
CTT: Classical Test Theory
EAP: Expected a Posteriori
EQAO: Education Quality and Accountability Office
GMAT: Graduate Management Admission Test
GPCM: Generalized Partial Credit Model
GRE: Graduate Record Examination
ICC: Item Characteristic Curve
IRT: Item Response Theory
MAB: Mean Absolute Bias
MAP: Maximum a Posteriori
MC: Multiple-Choice Item
MCCAT: Modified Constrained Computerized Adaptive Testing
MI: Maximum Information
MLE: Maximum Likelihood Estimation
MML: Marginal Maximum Likelihood
MMM: Modified Multinomial Model
MPI: Maximum Priority Index
NAEP: National Assessment of Educational Progress
NCLEX: National Council Licensure Examination for Registered Nurses
OSSD: Ontario Secondary School Diploma
OSSLT: Ontario Secondary School Literacy Test
RMSE: Root Mean Squared Error
SBAC: Smarter Balanced Assessment Consortium
SEM: Standard Error of Measurement
STA: Shadow Test Approach
TCSEM: Conditional Standard Error of Measurement Obtained from the Test Information
TOEFL: Test of English as a Foreign Language
WLE: Weighted Likelihood Estimation
WDA: Weighted Deviation Algorithm
WPM: Weighted Penalty Model

Chapter 1: Introduction

The merits of computerized adaptive testing (CAT) have been widely acknowledged.
Compared with traditional linear tests, CAT adaptively selects items suited to each examinee's ability and can improve measurement precision and test efficiency. Meanwhile, it facilitates instant score reporting and enables the test to adopt items of various types (Wainer, 2000; Weiss & Schleisman, 1999). Over the past decades, CAT has been successfully applied to several large-scale testing programs, such as the GRE, GMAT, and TOEFL.

Although current operational CATs mainly consist of dichotomous items, including polytomous and set-based items in CAT is attracting growing attention. Compared with dichotomous items, polytomous and set-based items can provide more item information and are more appropriate for measuring advanced cognitive activities. Meanwhile, dichotomous items still have significant advantages: many of them can be answered within limited testing time, and their scoring is more convenient. The prospects of combining dichotomous, polytomous, and set-based items in CAT programs are promising (Parshall, Davey, & Pashley, 2002; SBAC, 2012), but few studies have investigated how to assemble a mixed-item-format CAT efficiently.

Generally, there are three requirements for assembling a CAT (Davey, 2005). The first is to measure each student's ability accurately with as few items as possible; the main benefit of CAT in improving test efficiency derives from meeting this goal. The second is to guarantee that each test fulfills the predetermined content specifications; this is driven by the demand for enhancing test validity. The third is to avoid item over-exposure and ensure test security. Exposure control is important for ensuring test fairness and also for reducing the cost of item pool development, as item replacement is often expensive. Since an item usually must go through a complicated development and review procedure before it is considered qualified for use (Gu, 2007), how to avoid over-exposure and reduce the number of unused items is worthy of research. These requirements are always in conflict with one another; an optimal solution that best advances progress toward all objectives is desired in test assembly.

Currently, different CAT assembly approaches have been developed to find combinations of items that measure the target trait accurately while satisfying all test constraints. The shadow test approach (STA) is one of the most appealing methods, as it can handle complex constraints (van der Linden & Reese, 1998). The goal of the STA is to optimize an objective function (e.g., test information) under a set of constraints. In contrast to other approaches, before selecting each item the STA uses binary linear integer programming to assemble a full-size test (i.e., the shadow test) that provides accurate measurement while satisfying all test constraints; the item is then selected from this shadow test rather than from the entire pool.

As with most conventional CAT assembly methods, one drawback of the STA is that the sequence in which items appear is not predictable and varies across examinees, which may lead to context effects (Davey, 2005). Another problem resulting from the unpredictable item administration is that decisions made at an early stage may rule out items that are important at a later stage, so that no feasible solution can be obtained. In addition, changing a handful of items may influence the performance of the whole pool (Davey, 2005), which makes item replacement and exposure control difficult.
An approach named the "bin-structured" method was proposed (Davey, 2005) to attack these problems in CAT assembly. Instead of building totally individualized tests, the bin-structured method applies a single solution to partition the item pool into non-overlapping bins. The items in a given bin are interchangeable in terms of the test construction rules (e.g., content area). The test is assembled by selecting one item from each bin, so the number of bins equals the test length. Within this general solution for partitioning the item pool, a wide variety of specific item combinations remain available for item selection, which makes the bin-structured method no less adaptive than the STA.

Considering the recent trend toward incorporating polytomous and set-based items in applications of computerized adaptive testing (CAT), and the lack of research into the delivery of a CAT consisting of mixed item formats, this study investigated the features of mixed CAT and how to assemble a mixed CAT efficiently, and therefore has important practical and theoretical implications. Specifically, the following two research objectives were addressed:

1. Compare the mixed-item-based CAT with the dichotomous-item-based CAT to see whether the mixed CAT has advantages over the dichotomous-item-based CAT and what challenges it brings (e.g., high exposure rates for polytomous items).

2. Compare a highly individualized test assembly design (specifically, the STA) with a bin-structured approach in the context of CAT containing mixed item formats, across a variety of item pools with different psychometric models and item parameter distributions. The test length and imposed test constraints were also manipulated to simulate various real test situations and to investigate how the results vary.

Chapter 2: Literature Review

This chapter consists of three sections. First, the three item formats involved in this study (i.e., dichotomous items, polytomous items, and testlets) are defined, and their advantages and disadvantages are compared. Second, a brief introduction to computerized adaptive testing (CAT) is presented, including its history and development, its advantages, and the elements of a CAT. The third section provides a review of several current CAT assembly methods, focusing on the methods investigated in this research, i.e., the shadow test and bin-structured methods.

2.1 Item Format

In most current educational tests, items can be classified into two general categories: discrete items and set-based items (van der Linden, 2000). Discrete items are independent of each other and can be further classified as dichotomous or polytomous items. Set-based items refer to a set of items related to a common stimulus; the items are often related to each other in some way. Previous research has explored the differences between these item formats in the cognitive abilities and skills they can measure, content coverage, reliability, validity, scoring efficiency, etc. (Cao, 2007). Some major differences are discussed below.

2.1.1 Dichotomous Items

Here is a question from the NAEP Grade 4 Science test (NAEP, 2015):

A thermometer shows that the outside air temperature is colder than the temperature at which water turns to ice. However, ice on the sidewalk melts. What probably caused this?
A. The air heating the sidewalk
B. The sidewalk reflecting sunlight into the air
C. The wind causing the ice on the sidewalk to melt
D. The sunlight making the sidewalk warmer than the air

This is a typical dichotomous item: although four choices are provided, only option D is regarded as the correct answer. Dichotomous items are items with only two score categories, e.g., correct (scored as 1) or incorrect (scored as 0; Lord, 1958). Dichotomous items are widely used in educational testing and psychological assessment. For example, multiple-choice (MC) items with only one correct answer, or questions from a personality inventory, are often scored dichotomously.

Dichotomous items have come to dominate research and application in CAT for several reasons: an examinee can answer many dichotomous items in a short time period, which allows the test to cover a broad range of content and to extract a representative sample of the examinee's knowledge (Rupp, 2004); the scoring of dichotomous items is objective, fast, convenient, and inexpensive; and several item selection algorithms have proved effective in dichotomous-item-based CAT (Chang & Ying, 1996). However, some dichotomous items, like MC items, are more easily influenced by test-wiseness and guessing (Burton, 2001; Oosterhof, 1996) and may result in overestimated scores. For example, examinees may rule out some alternatives without knowing which one is the correct answer; in this case, the validity of the test is compromised. Furthermore, dichotomous items are not optimal for evoking complex cognitive activities. Although some studies indicate that well-designed dichotomous items can also elicit evidence of complex cognitive abilities (Haladyna, 1994; Hamilton, Nussbaum, & Snow, 1997), the spectrum of abilities that can be reached by dichotomous items is still constrained by their nature (Martinez, 1999). Some cognitive activities involving creative or divergent production are hard to assess with dichotomous items (Martinez, 1999). If a test intended to evaluate complex constructs adopts only dichotomous items, the construct might be under-represented, and the validity will be questionable (Messick, 1995). Therefore, to measure higher-order cognitive functioning, more complex item formats like polytomous items or set-based tests are needed (Zhou, 2012), as these items can assess a broader range of cognitive ability.

2.1.2 Polytomous Items

The item below is from the Education Quality and Accountability Office (EQAO) Grade 4 Writing test (EQAO, 2014):

Your class has agreed to do some volunteer work in your school this year. Each student can work in an area of his or her choosing. Write a detailed paragraph explaining what you choose to do and why.

In contrast to being scored simply as correct or incorrect, the response to this item is evaluated on a 6-point scale, where 0 means the response is almost unreadable and 5 indicates high writing proficiency. Items scored in more than two categories are referred to as polytomous items (Muraki, 1992). Constructed-response items, ordered-response items, and multiple-response items often adopt polytomous scoring. Over the past few decades, there has been an increasing demand for incorporating polytomous items into CAT (van Rijn, Eggen, Hemker, & Sanders, 2002), and several item selection strategies developed for polytomous CAT have also contributed to the growing popularity of polytomous-item-based CAT (Choi & Swartz, 2009).
Moreover, although the scoring of polytomous items requires detailed rubrics and is more complicated and time-consuming than that of dichotomous items, advances in automated scoring have improved the feasibility of including polytomous items in CAT (Attali & Burstein, 2006).

Compared to dichotomous items, polytomous items can provide more information about the trait level of an examinee (Bock, 1972; Drasgow, Levine, Williams, McLaughlin, & Candell, 1989; Thissen & Steinberg, 1984). Furthermore, they can reflect the association among knowledge and skills and measure more complex constructs (Bock, 1972), which may not be easily accomplished by simple dichotomous items such as MC or true/false items. Another advantage is that polytomous items can capture examinees' cognitive activities by recording their solution processes, providing diagnostic information and facilitating educational instruction (Lukhele, Thissen, & Wainer, 1994; Martinez, 1999). Besides, developments in computer technology facilitate the delivery of innovative items, and innovative item formats often require polytomous scoring, which also makes polytomous items more appealing in CAT (van der Linden & Glas, 2000). However, though the use of polytomous items shows promise for measuring complex abilities and obtaining higher measurement precision, developing and using these items may be costly and time-consuming. Hence, how to avoid over-exposure of these items is a main objective of CAT assembly and is discussed in this research.

2.1.3 Set-based Items

Set-based items, also known as testlets, refer to items grouped into clusters around a common stimulus (Wainer & Kiely, 1987). For example, in a reading test, a reading passage is followed by several questions related to the passage; the questions associated with the same passage are regarded as a testlet. The items within a testlet usually share some similarities, demonstrate some homogeneity in content or assessed skills, and are not independent (Wainer, Bradlow, & Wang, 2007).

Set-based items allow for more complicated, interrelated sets of items and make efficient use of the examinee's testing time, as examinees need less time for reading and understanding the stimulus materials. Set-based items also make the task more realistic, as many real-world tasks require solving related problems in a stepwise fashion; therefore, including set-based items could potentially improve construct validity. Similar to polytomous items, set-based items are also appropriate for measuring higher-level skills. For instance, the development of performance-based testing has been a great spur to the popularity of set-based items, as set-based items may help elucidate more information about the complex cognitions required in performance tests (van der Linden, 2000).

Assembling set-based tests is much more complex than building discrete-item-based tests, as the specifications for set-based tests are more complicated (van der Linden, 2000). Constraints for set-based tests may involve at least four levels: individual items, stimuli, item sets, and the entire test (van der Linden, 2000). Several studies have investigated how to assemble set-based tests, but mainly in linear form (van der Linden, 2000).
Assembly methods proposed in previous research include: (1) using separate decision variables to select items and stimuli simultaneously (van der Linden, 1992); (2) simultaneously selecting pivot items; in this method, the items that best represent the stimuli are defined as the pivot items and are drawn for administration (van der Linden, 2000); (3) the power-set approach, whose basic idea is that if an item set contains n items, the set has at most 2^n - 1 different subsets, and the test can be assembled using separate decision variables indicating whether each subset is included in the test; (4) two-stage selection, where Stage 1 picks an item set and Stage 2 selects items from the selected sets; and (5) selecting all items in a set; in this method, if a stimulus is selected, all the items related to it are included in the test, and no within-set selection is performed. Davey (2005) suggests using the entire set, rather than the item, as the unit of selection, as the latter strategy complicates test assembly.

Other issues related to using set-based items are how to develop high-quality items and how to handle the inter-correlation among items within the same set. When the violation of local independence is serious, two approaches are generally used to model set-based items: the first is to fit a testlet response model (Wainer et al., 2007), and the second is to treat the testlet as a polytomous item (Cook, Dodd, & Fitzpatrick, 1998). In this study, the testlet is treated as an intact polytomously scored unit in item selection, and no within-testlet adaptation is conducted. However, although it adopts polytomous scoring, the set-based item is still regarded as a unique item format, different from the polytomous item, when developing the blueprint and selecting items in CAT. It should also be noted that a testlet may cover several content areas or cognitive skills simultaneously, which introduces within-testlet heterogeneity and distinguishes testlet-based items from polytomous items.

In summary, no single item format is better than another in all respects; a mixed-format test may combine their strengths while compensating for some weaknesses, achieving broad content coverage, high reliability and validity, efficient scoring, and an integrated measurement scope for high-level cognitive abilities. In conclusion, a test with a mixture of different item formats may provide more efficient, valid, and comprehensive measurement. This trend is more obvious in CAT, where polytomous and set-based items hold promise for future application, as the computer provides various options for using innovative items, while dichotomous items continue to have value.

2.2 Introduction to Computerized Adaptive Testing

Computerized adaptive testing (CAT) has been widely used in educational and psychological testing. CAT assembles individualized tests by administering items suited to each examinee's ability, and therefore shortens the test without losing precision.

2.2.1 A Brief History of CAT

Although CAT has attracted attention in educational practice only since the mid-1990s, the idea of adaptive testing is much older. The initial attempt at an adaptive test derives from Binet's intelligence test (Binet & Simon, 1905). Binet and Simon tested students with a subset of items targeted at their approximate ability instead of using the whole test.
If a student answered these items correctly, harder items were then administered; otherwise, easier items were administered (Binet & Simon, 1905). In this way, adaptive tests are able to eliminate items of inappropriate difficulty, thereby increasing test efficiency and measurement accuracy. Other early adaptive tests include Lord's flexilevel testing (1971) and Weiss's stradaptive test (1973). In these methods, each difficulty level has several item sets, and whether an examinee gets a harder or easier set depends on his or her performance on the previous set. Since the 1990s, the application of computers has facilitated further advancement in adaptive testing (Mills & Stocking, 1996). Currently, adaptive testing has been successfully applied to many large-scale assessments, such as the National Council Licensure Examination for Registered Nurses (NCLEX), the Armed Services Vocational Aptitude Battery (ASVAB), the Graduate Record Examination (GRE), and the Smarter Balanced Assessment Consortium (SBAC). The popularity of computerized adaptive testing (i.e., CAT) has increased mainly due to two factors: the progress of psychometric theories, such as Item Response Theory (IRT; Lord, 1980; Weiss, 1978), and the rapid development of computer technology facilitating instantaneous computation (van der Linden & Glas, 2000; He, 2010).

2.2.2 Advantages of CAT

The advantages of CAT over linear tests have been well documented (Gu, 2007; Wainer, 2000; Way, 1998). First, by giving examinees items of appropriate difficulty, CAT decreases test length, increases test efficiency, and reduces examinee fatigue (Chang, 2004). While linear tests usually cannot provide enough information for students at the ends of the ability continuum, a CAT can maintain measurement precision across the whole ability continuum (Chang, 2004).

Second, the removal of poorly performing items is easier in an individualized CAT, and an item with undesirable psychometric characteristics (e.g., high differential item functioning) will affect only some of the examinees. Even for these examinees, as long as sufficient items are administered, the final estimate of their ability will converge to their true ability value (i.e., the ability value in theory). This self-correcting feature of CAT is likely to decrease the impact of a small number of poorly developed items and to avoid severely biased estimates of student ability.

Third, because each examinee receives an individualized set of items, CAT reduces the opportunity for cheating. Fourth, CAT facilitates calculation of scores without a time lag, and therefore allows immediate score delivery, which is very appealing to test-takers (van der Linden, 2010). Fifth, each examinee can control his or her own testing pace, which reduces test anxiety and makes the test more flexible. Finally, the use of the computer opens the potential for a variety of novel item formats, such as items containing interactive video, which may improve test validity. These attractive features have led to extensive use of CAT in educational and psychological assessments. To examine how CAT improves test efficiency, the section below demonstrates the process of administering a CAT.

2.2.3 Procedure for Administering a CAT

As CATs proceed in an iterative way, the design and administration of a CAT are significantly different from those of a linear test. Figure 2.1 (He, 2010) provides a good illustration of the adaptive nature of CAT: the first item is selected and administered, the response is scored and the ability estimate updated, a new item is selected based on the updated estimate, and the cycle repeats until the stopping rule is met.

Figure 2.1 Steps for Administering a CAT (He, 2010)

Once an examinee's response has been obtained, the ability is estimated based on a pre-specified scoring rule.
Then a new item that optimizes an item selection criterion (e.g., maximizing information at the current ability estimate while also meeting pre-specified practical requirements such as content balance) is selected and administered. The ability estimate is then updated based on the examinee's performance on all administered items. This process continues until a pre-specified stopping rule is met. Generally, the CAT procedure is defined through six essential components:

Item Pool

The items are drawn from a pre-calibrated item pool containing adequate numbers of items along the whole ability continuum. In order to provide precise estimates over a broad range of ability, a large item pool is suggested (Luecht, 1998; Patsula & Steffan, 1997). Meanwhile, though exposure control and content balance are not necessary parts of CAT, they are often required since they improve test security and validity. The requirements of having sufficient items in each content area, avoiding item over-exposure to enhance security, and retiring items reinforce the need for a large pool. Considering the cost and effort needed to develop and maintain an item pool, how to maintain a reasonable level of item exposure and facilitate item replacement is important. The method involved in this study, the bin-structured method, may throw some light on this issue.

Psychometric Model

The psychometric model is typically based on IRT. IRT encompasses a set of models connecting the probability of answering an item correctly with an unobservable, hypothesized trait (i.e., a latent trait). This study is conducted within the framework of unidimensional IRT (Lord, 1980), which entails three basic assumptions: (1) the test measures along only one latent trait; (2) the responses to different items are independent given the latent trait value; and (3) a monotonically increasing function can be specified to represent the interaction between items and the person trait, i.e., the probability of answering an item correctly increases as the latent trait increases. These three assumptions outline a general class of unidimensional IRT models (Reckase, 2009). Based on the number of scored responses, these models can be divided into two families: dichotomous models (e.g., the one-, two-, and three-parameter logistic models; Lord, 1980) and polytomous models (e.g., the nominal response model, Bock, 1972; the partial credit model, Masters, 1982; the generalized partial credit model, Muraki, 1992; and the graded response model, Samejima, 1969). In this study, the two-parameter logistic model (2PLM) and the three-parameter logistic model (3PLM) are used for dichotomous items, as the original dichotomous item calibration was conducted with the 3PLM with fixed a- and c-parameters (OSSLT, 2014), and the 2PLM is widely used for modeling dichotomous items in operational CATs. The generalized partial credit model (GPCM) is used for polytomous and set-based items, since the original data used in this study adopted the GPCM to calibrate polytomous items.

The 2PLM is defined as:

P_j(\theta) = \frac{\exp[D a_j (\theta - b_j)]}{1 + \exp[D a_j (\theta - b_j)]}    (2.1)

where θ is the person (ability) parameter, a_j is the discrimination of item j, b_j is its difficulty, D is a scaling constant to approximate the normal ogive model, and P_j(θ) is the probability of a correct response (Lord, 1980).
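To make Equation 2.1 concrete, here is a minimal sketch of the 2PLM response function in Python; the function name and the example parameter values are illustrative, not taken from the dissertation:

```python
import math

def p_2pl(theta, a, b, D=1.7):
    """Probability of a correct response under the 2PLM (Eq. 2.1)."""
    return 1.0 / (1.0 + math.exp(-D * a * (theta - b)))

# An examinee of average ability facing a moderately easy, discriminating item:
print(p_2pl(theta=0.0, a=1.2, b=-0.5))  # roughly 0.74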
Figure 2.2 shows the item characteristic curves (ICCs) for three 2PLM items.

Figure 2.2 ICCs for 2PLM Items

Under the 2PLM, an examinee with very low proficiency has little chance of answering a difficult item correctly. In real tests, however, especially multiple-choice-based tests, even low-proficiency examinees still have a notable probability of responding correctly to an item. In response to this phenomenon, the 3PLM includes a lower-asymptote parameter c, also known as the guessing or pseudo-chance parameter, indicating the probability of a correct response by an examinee of extremely low ability. The 3PLM is defined as:

P_j(\theta) = c_j + (1 - c_j) \frac{\exp[D a_j (\theta - b_j)]}{1 + \exp[D a_j (\theta - b_j)]}    (2.2)

where c_j is the lower-asymptote parameter for item j, and all other notation has the same meaning as in the 2PLM. Figure 2.3 shows an item modeled with the 3PLM; the lower end of the ICC levels off at the value of c_j rather than at zero.

Figure 2.3 ICC for 3PLM Item

The GPCM is an extension of the 2PLM to polytomous items (Davis, 2004). The GPCM is appropriate for modeling an item that comprises a series of ordered problem-solving steps, where examinees get partial credit for completing a step. For example, solving the math problem below requires two steps:

2 + 3 * 4 = ?

The first step is to get 3 * 4 = 12, and the second step is 2 + 12 = 14. The examinee gets partial credit for completing either step and full credit for getting both steps correct. The GPCM is defined as:

P_{jk}(\theta) = \frac{\exp\left[\sum_{v=0}^{k} D a_j (\theta - b_j + d_v)\right]}{\sum_{c=0}^{m_j} \exp\left[\sum_{v=0}^{c} D a_j (\theta - b_j + d_v)\right]}    (2.3)

where P_{jk}(θ) is the probability of obtaining score k on item j, θ is the person ability, D is the scaling constant fixed at 1.7 to approximate the normal ogive model, a_j is the discrimination parameter, b_j is the overall item difficulty parameter, m_j is the highest scoring category for item j, and d_v is the category threshold parameter. To resolve the indeterminacies in item estimation, for each item j, d_0 is set to 0 and the sum of the threshold parameters is also set to 0 (Muraki, 1992). Figure 2.4 illustrates the probability of each score for an item with four score categories (0-3).

Figure 2.4 Item Category Response Probability Curves for a = 0.93, b = -1.28, d = [0, 1.3, 1.07, -2.37]
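As an illustration of Equation 2.3, the sketch below computes GPCM category probabilities in Python, using the item parameters from Figure 2.4; the function name is an illustrative choice:

```python
import math

def gpcm_probs(theta, a, b, d, D=1.7):
    """Category response probabilities under the GPCM (Eq. 2.3).

    d is the list of category thresholds [d_0, ..., d_m] with d_0 = 0.
    Returns P(score = 0), ..., P(score = m) at ability theta.
    """
    # Cumulative sums of D*a*(theta - b + d_v) for v = 0..k form the numerators.
    cum, nums = 0.0, []
    for d_v in d:
        cum += D * a * (theta - b + d_v)
        nums.append(math.exp(cum))
    total = sum(nums)
    return [n / total for n in nums]

# Item from Figure 2.4: a = 0.93, b = -1.28, d = [0, 1.3, 1.07, -2.37]
probs = gpcm_probs(theta=0.0, a=0.93, b=-1.28, d=[0, 1.3, 1.07, -2.37])
print([round(p, 3) for p in probs])  # probabilities for scores 0..3, summing to 1
```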
Item Selection Rule

The CAT process mainly adopts two methods for selecting the next item to administer: the item information method and the Bayesian approach (van der Linden & Pashley, 2000; Zhou, 2011). The item information method selects the item that maximizes information at the current ability estimate; it includes maximum information (MI; Lord, 1980), Kullback-Leibler information (Chang & Ying, 1996; Veldkamp, 2003), and the general weighted information method (Veerkamp & Berger, 1997; Choi & Swartz, 2009; van Rijn, Eggen, Hemker, & Sanders, 2002). The Bayesian method incorporates a weight function of a prior ability distribution into the information function to form the posterior distribution; it comprises maximum posterior-weighted information (van der Linden, 1998), maximum expected information (van der Linden, 1998), and the minimum expected posterior variance method (van der Linden, 1998). Various studies have compared the performance of different item selection methods under a number of IRT models, test lengths, and other CAT constraints (Veldkamp, 2003; van Rijn et al., 2002; Ho, 2010), and in general found no significant difference between MI and the other methods. Therefore, MI is used in this study, as its computation is easier. MI selects the item with maximum Fisher information at the current ability estimate.

Fisher information (or simply information) indicates how much information an observable random variable (i.e., the response to an item) carries about the unknown parameter on which the probability of the random variable depends (Pratt, 1976). For a given dichotomous item j, the information is:

I_j(\theta) = \frac{[P_j'(\theta)]^2}{P_j(\theta)\,[1 - P_j(\theta)]}    (2.4)

where P_j'(θ) denotes the derivative of the item response function with respect to θ. Specifically, for the 2PLM, the item information is:

I_j(\theta) = D^2 a_j^2\, P_j(\theta)\,[1 - P_j(\theta)]    (2.5)

Figure 2.5 presents the information for 2PLM items. It shows that for a fixed b-parameter, items with higher a-parameters have higher information. This may cause concerns about over-exposure of highly discriminating items, which was studied in this research. Furthermore, for each 2PLM item, the information peaks at θ = b.

Figure 2.5 Information for 2PLM Items

For the 3PLM in Equation 2.2, the information is:

I_j(\theta) = \frac{D^2 a_j^2 (1 - c_j)}{(c_j + e^{L_j})(1 + e^{-L_j})^2}    (2.6)

where L_j is equal to D a_j(θ - b_j) (see Hambleton, Swaminathan, & Rogers, 1991; Lord, 1980). For the GPCM, the item information given ability θ is:

I_j(\theta) = D^2 a_j^2 \left[\sum_{k=0}^{m_j} k^2 P_{jk}(\theta) - \left(\sum_{k=0}^{m_j} k\, P_{jk}(\theta)\right)^{2}\right]    (2.7)

where P_{jk}(θ) is defined in Equation 2.3. Figure 2.6 shows the information for five polytomous items with four score categories (see item parameters in Table 2.1). It indicates that items with a high discrimination parameter have more information, and that the information function is more peaked when the distance between the first and last threshold parameters is shorter (Dodd & Koch, 1987). When the distance between two adjacent threshold parameters is large, the information function may not be unimodal (Akkermans & Muraki, 1997; Muraki, 1993). Furthermore, if the step parameters are in ascending order, the information function is more peaked.

Table 2.1 Item Parameters for GPCM Items

a      b      d_0   d_1    d_2    d_3
0.93   -1.28  0     1.3    1.07   -2.37
0.73   -1.28  0     1.3    1.07   -2.37
0.93   -1.28  0     2      1.07   -3.07
0.93   -1.28  0     1.07   2      -3.07
0.93   -1.28  0     -2.37  1.07   1.3

Figure 2.6 Item Information for Polytomous Items with GPCM

The sum of item information across items is the test information, which is equal to the reciprocal of the variance of the ability estimate:

I(\theta) = \sum_{j} I_j(\theta) = -E\!\left[\frac{\partial^2 \ln l}{\partial \theta^2}\right] = \frac{1}{\operatorname{Var}(\hat{\theta})}    (2.8)

where \hat{\theta} is the maximum likelihood estimate (MLE) of the true ability θ, and l is the likelihood of a given response pattern. As larger information indicates smaller standard error, items with higher information are always desired in CAT when adopting the MI item selection method. However, it is not possible for all items in the pool to have high a-parameters, and the MI method may threaten the security of informative items, as items with large discrimination parameters are more vulnerable to over-exposure. It may also result in inefficient use of the item pool, as items with less information are seldom picked. Furthermore, the selected item maximizes the information at the estimated ability \hat{\theta} rather than at the true ability θ; this may waste informative items in the early CAT stages, when \hat{\theta} is not yet accurate (Chang & Ying, 1996). Some research proposes dividing the item pool into strata based on the value of the a-parameter, selecting items from the stratum with the lowest a-parameters at the beginning, and saving the highly discriminating items for later stages (Chang & Ying, 1999). This strategy facilitates highly efficient and more balanced use of the item pool (Gu, 2007), and it was incorporated in this study when developing the bins.
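To tie Equations 2.5 and 2.7 to the MI selection rule, here is a minimal Python sketch that scores every remaining item's Fisher information at the current ability estimate and picks the maximizer; the pool structure and function names are illustrative assumptions, not the dissertation's implementation:

```python
import math

def gpcm_probs(theta, a, b, d, D=1.7):
    """Category probabilities under the GPCM (Eq. 2.3); d[0] = 0."""
    cum, nums = 0.0, []
    for d_v in d:
        cum += D * a * (theta - b + d_v)
        nums.append(math.exp(cum))
    total = sum(nums)
    return [n / total for n in nums]

def info_2pl(theta, a, b, D=1.7):
    """2PLM Fisher information (Eq. 2.5): D^2 a^2 P (1 - P)."""
    p = 1.0 / (1.0 + math.exp(-D * a * (theta - b)))
    return (D * a) ** 2 * p * (1.0 - p)

def info_gpcm(theta, a, b, d, D=1.7):
    """GPCM Fisher information (Eq. 2.7): D^2 a^2 [E(k^2) - E(k)^2]."""
    probs = gpcm_probs(theta, a, b, d, D)
    mean = sum(k * p for k, p in enumerate(probs))
    mean_sq = sum(k * k * p for k, p in enumerate(probs))
    return (D * a) ** 2 * (mean_sq - mean ** 2)

def select_mi(theta_hat, pool, administered):
    """Maximum-information rule: the unused item with the largest
    information at the current ability estimate."""
    best = None
    for i, item in enumerate(pool):
        if i in administered:
            continue
        info = (info_gpcm(theta_hat, item["a"], item["b"], item["d"])
                if "d" in item else
                info_2pl(theta_hat, item["a"], item["b"]))
        if best is None or info > best[1]:
            best = (i, info)
    return best[0]

# Tiny illustrative pool: two dichotomous items and one GPCM item.
pool = [{"a": 1.2, "b": 0.0},
        {"a": 0.8, "b": 1.0},
        {"a": 0.93, "b": -1.28, "d": [0, 1.3, 1.07, -2.37]}]
print(select_mi(theta_hat=0.0, pool=pool, administered=set()))
```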
Starting Point

As stated above, CAT aims to select items that are highly informative at the examinee's ability level. At the very beginning of a CAT, however, no information is available about the examinee's ability. In this case, CAT adopts a binary sort algorithm (Zhu & Fan, 1999). A binary sort algorithm first compares the target value to the middle value of the sorted sequence; if the target value is smaller than the middle value, the search continues on the lower half of the sequence, otherwise on the upper half. In CAT, as a starting point, the initial estimate of ability is usually set within the middle range of the ability continuum; as a consequence, CAT usually picks an item of medium difficulty first (Green et al., 1984; Hambleton, Zaal, & Pieters, 1991; Hulin, Drasgow, & Parsons, 1983; Wainer, 1990). The estimate of ability is updated based on performance on this initial item, and the item with maximum information at the updated estimate is selected and administered as the second item. Although some research claims that the starting point is unimportant as long as the CAT has reasonable length, e.g., more than 25 items (Lord, 1987; Hulin et al., 1983), Wainer and Kiely (1987) argue that an inappropriate starting point may increase test anxiety and frustration; moreover, starting items that are much too easy or too difficult may give examinees a misleading impression of the test. Hence, in this study the starting point was located around the medium ability level, as in most CAT practice and research.

Scoring Rule

In CAT, after each item is administered, the examinee's ability is re-estimated. The two families of approaches most widely used for updating the ability estimate are: (1) maximum likelihood estimation, including the maximum likelihood estimate (MLE; Lord, 1980; Birnbaum, 1968), marginal maximum likelihood (MML; Bock & Aitkin, 1981), and weighted likelihood estimation (WLE; Warm, 1989); and (2) Bayesian estimation, including expected a posteriori (EAP; Bock & Aitkin, 1981) and maximum a posteriori (MAP; Samejima, 1969). As MLE is the basis of all the methods in the first category and was applied in this study, it is introduced first; a brief description of Bayesian estimation follows.

In MLE, for a given examinee, the responses across test items are assumed to be locally independent, so the likelihood is the product of the probabilities of the observed correct or incorrect responses on each item. Under the 2PLM or 3PLM, the likelihood is:

L(u \mid \theta) = \prod_{i=1}^{n} P_i(\theta)^{u_i} \, [1 - P_i(\theta)]^{1 - u_i}    (2.9)

where u is the response string, P_i(θ) is the probability of a correct response on item i given θ and the item parameters, u_i is the response to item i (0 for incorrect, 1 for correct), and n is the number of administered items. The maximum likelihood estimate is the value \hat{\theta} that maximizes L given the response pattern u and the collection of item parameters. For the GPCM, the response has more than two possible values, and the likelihood can be formulated as:

L(u \mid \theta) = \prod_{i=1}^{n} P_{ik}(\theta)    (2.10)

where k is the score obtained on item i, and the other notation is the same as in Equation 2.9. The MLE is also the value at which the first derivative of the (log-)likelihood equals 0 (Pfanzagl, 1994):

\left.\frac{\partial \ln L(u \mid \theta)}{\partial \theta}\right|_{\theta = \hat{\theta}} = 0    (2.11)

As no closed-form expression is available for the MLE of θ, an iterative numerical procedure such as the Newton-Raphson algorithm is used (Segall, 2005). MLE has the desirable property of asymptotic consistency: as the number of items n goes up, the MLE converges in probability to the true value. In addition, the MLE is asymptotically normal, i.e., \hat{\theta} has a normal distribution with mean equal to the true value θ and variance identical to the reciprocal of the test information (see Equation 2.8).
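Here is a minimal sketch of the Newton-Raphson iteration for the MLE under the 2PLM, assuming a mixed correct/incorrect response string; the function name and convergence tolerance are illustrative choices, not the dissertation's code:

```python
import math

def mle_2pl(responses, items, theta0=0.0, tol=1e-6, max_iter=50):
    """Newton-Raphson solution of Eq. 2.11 for the 2PLM.

    responses: list of 0/1 scores; items: list of (a, b) pairs; D = 1.7.
    Requires at least one correct and one incorrect response; otherwise
    the MLE diverges to plus or minus infinity.
    """
    D = 1.7
    theta = theta0
    for _ in range(max_iter):
        d1, d2 = 0.0, 0.0
        for u, (a, b) in zip(responses, items):
            p = 1.0 / (1.0 + math.exp(-D * a * (theta - b)))
            d1 += D * a * (u - p)             # score function for the 2PLM
            d2 -= (D * a) ** 2 * p * (1 - p)  # second derivative of ln L
        step = d1 / d2
        theta -= step                          # Newton update
        if abs(step) < tol:
            break
    return theta

# Three items answered 1, 0, 1:
print(mle_2pl([1, 0, 1], [(1.2, -0.5), (0.9, 0.0), (1.1, 0.8)]))
```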
Due to these theoretical characteristics, MLE is widely used in CAT (Samejima, 1969; Hambleton & Swaminathan, 1985). However, when the response string consists of only correct or only incorrect responses (or only the highest or lowest score category in polytomous-item-based tests), a positively or negatively infinite ability estimate results, which causes problems for the next item selection. This can be solved by setting an arbitrary boundary (e.g., -4 and +4) for estimates from such response patterns, or by adopting a Bayesian estimate until the examinee has both correct and incorrect responses. Another problem with MLE is that it is biased: \hat{\theta} is over-estimated for positive extreme values of θ and under-estimated for negative extreme values (Lord, 1980). This trend is obvious in short tests, while in long tests the MLE is asymptotically unbiased.

An alternative to MLE is the Bayesian method, which assumes a prior distribution of ability, i.e., that the examinee comes from a population with a normal distribution of ability whose mean and variance are known. After each test question is answered, a posterior distribution is formed by combining the prior distribution with the response:

f(\theta \mid u) = \frac{f(\theta)\, L(u \mid \theta)}{f(u)}    (2.12)

where f(θ | u) is the posterior distribution, f(θ) is the prior distribution, and f(u) is the likelihood of a given response string u in the population, which is a constant. If the mean of this posterior distribution is used to update the ability estimate, the approach is named expected a posteriori (EAP); if the mode is used, it is named maximum a posteriori (MAP). When the same number of items is administered, the Bayesian method yields a smaller standard error than MLE by absorbing additional information from the prior distribution, and it always produces a finite estimate. However, though the Bayesian method may overcome some drawbacks of MLE, one limitation is that the choice of prior may have significant influence on the final estimate, as the estimates shrink toward the mean of the prior. The estimate can be seriously biased if an inappropriate prior is used (Wang & Vispoel, 1998; Lord, 1986; Warm, 1989). Numerous studies have compared ability estimation methods in CAT, in both dichotomous and polytomous cases (Chen, Hou, Fitzpatrick, & Dodd, 1997; Chen, Hou, & Dodd, 1998; Wang & Wang, 2001; Ho, 2010). Generally, the results suggest comparable effects of MLE and the other methods (Ho, 2010). In this study, MLE was used to yield ability estimates.
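For contrast with the MLE sketch above, here is a hedged sketch of EAP estimation by numerical quadrature over a standard normal prior; the grid width and point count are arbitrary illustrative choices:

```python
import math

def eap_2pl(responses, items, grid=None, D=1.7):
    """EAP estimate: posterior mean over a discrete ability grid (Eq. 2.12).

    Uses a standard normal prior N(0, 1); responses/items as in mle_2pl.
    """
    if grid is None:
        grid = [-4 + 0.1 * i for i in range(81)]  # -4 to 4 in steps of 0.1
    posterior = []
    for theta in grid:
        prior = math.exp(-0.5 * theta ** 2)  # N(0,1) kernel; constants cancel
        like = 1.0
        for u, (a, b) in zip(responses, items):
            p = 1.0 / (1.0 + math.exp(-D * a * (theta - b)))
            like *= p if u == 1 else (1.0 - p)
        posterior.append(prior * like)
    total = sum(posterior)
    return sum(t * w for t, w in zip(grid, posterior)) / total

# Unlike the MLE, the EAP stays finite even for an all-correct response string:
print(eap_2pl([1, 1, 1], [(1.2, -0.5), (0.9, 0.0), (1.1, 0.8)]))
```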
Compared with fixed - length test, variable - length rule may improve test efficiency and item pool use, as it often minimizes test length while remaining high test accuracy (Bergstrom & Lunz, 1999). The drawback of this procedure is that to explain to the examinees why they have to take test of different length. Furthermore, in variable - length test, examinees of extremely high or low proficiency are likely to receive long t ests, especiall y when the item pool has no highly informative items for these extreme examinees, and then different fatigue level may have an effect on the results from the CAT (S egall, Moreno, & Hetter, 1997). Segall (2005) suggests imposing some adjustm ents to moderat e some of the operational difficulties, such as implementing an upper - bound for the variable - length tests. All of these components discussed above influence the design and the effect iveness of the CAT procedure (Chang, Qian, & Ying, 2001; Kingsbury & Zar a, 1989; Zhou, 2011). In addition, experience, should also be taken into consi deration when designing a CAT. For example, in CAT item selection , some items are used in most of the administrations, while other items are seldom used; how frequently an item appears in a test (i.e., the item exposure rate) depends on its psychometric properties, overall examinee ability distribution in the test - taking population, an d the quality and availability of other items in the pool (Gu, 2007). Items with high exposure rate s 25 indicate a waste of resou rces spent on item developing. Severa l exposure control methods have been developed to avoid the over - exposure and maintain reasonable item usage (Cheng & Chang, 2009; Hetter & Sympson, 1997). Another requirement for CAT is to guarantee each test meet s the same test specifications and c over s all the desired contents (i.e., keep the content balanced). The requirements for obtaining higher information, maintaining exposure rate and keeping the content balance d have direct influence on the test assembly, which will be further discussed in next section . 2.3 CAT Assembly Approaches 2.3.1 Goals of CAT Assembly Generally there are three require ments for assembling a CAT (Davey, 2005). First, as stated earlier, one of the major targets for CAT is to achieve higher measurement efficiency by administering informative items. By matching it em difficulty to the current estimate , CAT can reduce test length without losing measurement precisi on (Lord, 1980; Weiss, 1983; Robin, 2005). The strategies of selecting highly inform ative items have been stated in detail in the previous section. The second hurdle in CAT development is to balance content . In conventional paper - pencil test ing , all the examinees take the same test, and the requirement for content coverage can be met ea sily as long as the single test form fulfills the test specification . In contrast, CAT builds individualized tests by adaptively selecting items , and different tests should have comparable content coverage specified by the test blueprint . As a consequenc e, the item selection method should be adjusted to achieve maximized information while ensuring content balance ( Cordova, 1997; Stocking & Swanson, 1993; van der Linden, 1998 ; van der Linden & Reese, 1998; van d er Linden, 2005). Cons idering the threats to test validity and fairness brought by an unbalanced test, several models such as the weighted penalty model 26 (WPM) and the weighted deviation algorithm (WDA) have been developed t o ensure content balance. 
The third requirement is to avoid item over-exposure and ensure test security. The item exposure rate is the ratio between the number of times a certain item is administered and the total number of examinees. An extremely low exposure rate means the item is rarely used and indicates a waste, while a high exposure rate threatens test security and validity. The problem is more severe when item development is time-consuming and expensive (e.g., for polytomous items and set-based items) and when the test is high-stakes. As shown earlier, selecting items merely according to a statistical criterion (e.g., maximum information) is the main reason for item over-exposure (van der Linden, 2004). Several procedures, such as randomization, conditional selection procedures, and the a-stratified strategy, have been applied to control exposure rates.

In summary, the objective of CAT assembly is to construct efficient tests while meeting all the demands for content balance and test security (He, 2010; Davey & Parshall, 1995; Wainer, Dorans, Flaugher, Green, Mislevy, Steinberg, & Thissen, 1990; Sands, Waters, & McBride, 1997; van der Linden & Glas, 2000; Mills, Potenza, Fremer, & Ward, 2002). When a CAT moves to operational implementation, other issues besides these three main requirements sometimes have to be taken into account. For example, some tests, like the NCLEX, have limits on total testing time. Other issues include how to eliminate item context effects in CAT, since the location of an item within a test can affect examinees' performance on that item, and how to diminish examinee nervousness at the beginning of the test. Some of these issues will be addressed in this study. These requirements are always in conflict with one another, and a compromise balancing all goals is needed in test assembly (Davey, 2005).

2.3.2 Assembly Design in CAT
A variety of test assembly methods have been proposed and successfully implemented, including the constrained CAT method (CCAT; Kingsbury & Zara, 1991), the modified CCAT (MCCAT; Leung, Chang, & Hau, 2003), the weighted deviations model (WDM; Stocking & Swanson, 1993), the modified multinomial model (MMM; Chen & Ankenmann, 2004), the weighted penalty model (WPM; Shin, Chien, Way, & Swanson, 2009), the maximum priority index (MPI) method (Cheng & Chang, 2009), the shadow-test approach (STA; van der Linden & Reese, 1998), and the bin-structured method (Davey, 2005). Many studies have compared these methods (Chen & Ankenmann, 2004; Cheng & Chang, 2009; van der Linden, 2005). Among them, CCAT, MCCAT, MMM and the bin-structured method partition the item pool into several sub-pools by some key features, such as content area, and items are drawn from these sub-pools sequentially. One limitation of these methods is that they are applicable only when an item carries a limited number of attributes, namely the ones used to divide the item pool (He, 2010). In contrast, the STA, the WDM, the WPM, and the MPI can handle more constraints and are more flexible. Among these four methods, the STA adopts a mathematical programming method while the others are heuristic. This study involves one method from each of these two categories of test assembly approaches: the STA and the bin-structured approach. The reason for choosing the STA is that it can deal with complex constraints and does not require judgment-based weights, which are not available for the test used in this study.
On the other hand, though the bin-structured method holds advantages over conventional methods, especially in terms of exposure control and standardizing the look of the test, it has not been studied thoroughly, and no research has been conducted for the mixed-item-format case. This study aims to fill this void. A more detailed description of these two methods is provided below.

STA
The STA was proposed by van der Linden and Reese (1998) and since then has been widely researched in different CAT contexts (He, 2010). In general, the STA treats test assembly as a constrained combinatorial optimization problem (Nemhauser & Wolsey, 1988; Rao, 1985; Wagner, 1969), where the goal is to find a solution that is optimal in terms of one attribute while meeting a variety of constraints with respect to other attributes. As a consequence, two kinds of test specifications are defined and distinguished in the STA: (1) the objective, which requires a test attribute function (e.g., the test information or the posterior variance of the ability estimate) to reach its maximum or minimum value and can be written as a function to be optimized; and (2) the constraints, which limit an attribute (e.g., the number of items in each content area) to a certain range and can be formulated as equations (or inequalities). The constraints can be further classified into three categories: constraints on categorical attributes (e.g., item format), on quantitative properties (e.g., expected testing time), and on item dependencies (e.g., item enemies). The test assembly issue is then an optimization problem subject to a set of constraints. In other words, in the STA the test information at the current ability estimate can be regarded as the objective function to be optimized, and this optimization problem is subject to all other specifications, which are viewed as constraints (van der Linden, 1998; van der Linden, Ariel, & Veldkamp, 2006; Veldkamp & van der Linden, 2000).

Here is an example of how the STA defines the goal of test assembly as a constrained combinatorial optimization problem.

Objective: maximize Σ_{i=1}^{N} I_i(θ̂) x_i, i.e., maximize the test information at θ̂, where N is the number of items in the whole item pool and x_i is an indicator variable specifying which items are included in the test.

Constraints:
x_i ∈ {0, 1}, i = 1, ..., N, i.e., if item i is selected when assembling a shadow test, x_i is valued as 1; otherwise x_i is 0;
Σ_{i ∈ Format 1} x_i ≤ 5, i.e., at most 5 items of Format 1 (e.g., dichotomous items);
Σ_{i ∈ Format 2} x_i ≥ 8, i.e., at least 8 items of Format 2 (e.g., polytomous items);
Σ_{i ∈ Content Area 1} x_i ≤ 10, i.e., at most 10 items in Content Area 1;
Σ_{i ∈ Content Area 2} x_i = 3, i.e., exactly 3 items in Content Area 2;
Σ_{i ∈ Content Area 3} x_i ≥ 9, i.e., at least 9 items in Content Area 3;
Σ_{i=1}^{N} x_i = 20, i.e., the total test length is 20 items;
Σ_{i=1}^{N} w_i x_i ≤ 2000, i.e., the total word count is at most 2000, where w_i is the number of words in item i.

The basic idea of the STA is to assemble an optimal test using linear programming. In the STA, before each item is administered, a full-length test that satisfies all requirements and has maximum information, named a shadow test, is assembled, as shown in the example above; the item with maximum information is then picked from this shadow test instead of from the pool. In other words, the item administered is the item in the current shadow test that is optimal at the current ability estimate and has not already been used. After the new item is administered, the unused items of the shadow test are released back to the pool and the ability is re-estimated. This cycle of assembling a shadow test and selecting an item to administer is repeated until the stopping rule is met.
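To make the formulation above concrete, the following sketch assembles a single shadow test as a 0-1 linear program. It mirrors a subset of the example constraints (the word-count and already-administered-item constraints are omitted for brevity), and it uses the open-source PuLP/CBC solver with simulated information values and item attributes; these are stand-ins for illustration, not the solver (MOSEK in Matlab, see Chapter 3) or the data used in this dissertation.

```python
# A minimal sketch of one shadow test as a 0-1 linear program.
# Item information, formats, and content areas are simulated.
import random
import pulp

random.seed(1)
N, TEST_LEN = 100, 20
info = [random.random() for _ in range(N)]           # I_i(theta_hat), simulated
fmt = [random.choice([1, 2]) for _ in range(N)]      # item format
area = [random.choice([1, 2, 3]) for _ in range(N)]  # content area

prob = pulp.LpProblem("shadow_test", pulp.LpMaximize)
x = [pulp.LpVariable(f"x{i}", cat="Binary") for i in range(N)]

prob += pulp.lpSum(info[i] * x[i] for i in range(N))             # objective
prob += pulp.lpSum(x) == TEST_LEN                                # total length
prob += pulp.lpSum(x[i] for i in range(N) if fmt[i] == 1) <= 5   # Format 1
prob += pulp.lpSum(x[i] for i in range(N) if fmt[i] == 2) >= 8   # Format 2
prob += pulp.lpSum(x[i] for i in range(N) if area[i] == 1) <= 10 # Area 1
prob += pulp.lpSum(x[i] for i in range(N) if area[i] == 2) == 3  # Area 2
prob += pulp.lpSum(x[i] for i in range(N) if area[i] == 3) >= 9  # Area 3

prob.solve(pulp.PULP_CBC_CMD(msg=False))
shadow = [i for i in range(N) if x[i].value() == 1]
print(len(shadow), shadow[:10])
```

In an operational STA, this program is re-solved after every administered item, with an extra constraint forcing x_i = 1 for all items already taken, which is why solution time matters when the pool is large and the constraint set is long.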
He (2010) provides a brief description of a typical STA procedure:
Step 1: Give an initial estimate of the ability as the starting point.
Step 2: Assemble the first shadow test that satisfies all requirements (e.g., constraints on content area, item format, total testing time, exposure rate, etc.) and optimizes the objective function (e.g., maximizes the test information).
Step 3: From the shadow test assembled in Step 2, select and administer the item that provides maximum information at the current ability estimate, and return all the other items in the shadow test to the bank.
Step 4: Update the ability estimate according to some scoring rule (e.g., MLE).
Step 5: Assemble a new shadow test that is optimal and meets all constraints while containing the items already administered.
Step 6: Repeat Steps 2-5 until a stopping rule (e.g., a pre-specified test length) is reached.

This description indicates several properties of a shadow test: (1) it is a fixed-size linear test, as no sequential selection is performed within a given shadow test; (2) it includes all items already taken by the examinee; (3) it provides maximum information at the current ability estimate; and (4) it satisfies all the test specifications required by the CAT.

An example by van der Linden and Reese (1998) may help in understanding the procedure. Assume the goal is to assemble a 5-item CAT for a given examinee. In Table 2.2, each column is a shadow test assembled at the current θ̂; the starred number in each column is the item with maximum information, which is selected to be administered, and all the other items are released back into the pool. The entries above the dash in each column are the items that have already been administered; the starred numbers move into the next column, as the items already administered must be in the newly assembled shadow test. For this examinee, Items 39, 14, 41, 22, and 6 are administered.

Table 2.2 An Example of CAT Assembly Using the STA (van der Linden & Reese, 1998)
Shadow Test 1   Shadow Test 2   Shadow Test 3   Shadow Test 4   Shadow Test 5
-               39              39              39              39
13              -               14              14              14
27              8               -               41              41
28              14*             22              -               22
39*             41              37              22*             -
41              49              41*             37              6*

Note: Each column lists the item numbers selected for one shadow test; the item marked with * (bold in the original) is administered and must appear in all following shadow tests.

Compared with other CAT assembly approaches, the STA ensures that every administered test meets the test specifications. Furthermore, it is very flexible and can deal with many constraints simultaneously. However, an exact solution for a shadow test may be impossible to obtain in realistic time if too many constraints are imposed and the item pool is large (van der Linden, 1998). Furthermore, the STA solves the optimization problem uniquely for each examinee, so the order in which items appear is not predictable and varies across examinees, which may raise concerns about context effects (Davey, 2005). Third, sometimes changing even one or two items in a pool of hundreds of items may greatly affect performance (Robin, 2005; Davey, 2005). Therefore, item replacement, item repair and item retirement may be difficult under the STA, and this difficulty is more obvious in large-scale CAT programs where items must be developed and replaced continuously (Davey, 2005). These problems can be partially solved by the bin-structured method, which is introduced next.

Bin-Structured Method
The idea of bin-structured CAT assembly was first proposed by Manfred Steffen (Robin, 2005).
It aims to find a single standardized solution for dividing the item pool and solving the constrained combinatorial optimization problem, as obtaining a unique routine for every examinee may not add much value (Davey, 2005). The basic procedure of bin-structured CAT assembly is: (1) the test construction rules determine which item properties, such as cognitive level, specific subject, content area and format, are specified in the blueprint and will guide the CAT assembly; (2) the item pool is divided into non-overlapping and homogeneous clusters according to these identified item properties, and each cluster is regarded as a bin; the items in the same bin are interchangeable in terms of the test construction rules, and the number of bins equals the desired test length; and (3) the test developers determine a sequence in which to arrange these bins. Such an ordered sequence is called a template and is applied to all examinees; by design, it satisfies the test blueprint. During item administration, each item is selected from one bin rather than from all the available items in the pool, and each bin contributes exactly one item. As the test constraints relevant to test construction properties such as content area have been handled in the design of the template, the main target of item selection at each step is to select an informative item while controlling the exposure rate within each bin. Therefore, the specific solution for any examinee is unique and adaptive, while the assembled test is more standardized than under the STA.

Davey (2005) gives an example to illustrate how the bin-structured method works. Suppose a math test covers three content areas (Arithmetic, Algebra and Geometry) and two item formats (Problem Solving and Data Sufficiency). The item pool has 13 items, as Table 2.3 shows:

Table 2.3 Item Pool (Davey, 2005)
Item   Content      Format
1      Arithmetic   Problem Solving
2      Arithmetic   Problem Solving
3      Arithmetic   Problem Solving
4      Algebra      Problem Solving
5      Algebra      Problem Solving
6      Algebra      Problem Solving
7      Geometry     Problem Solving
8      Arithmetic   Data Sufficiency
9      Arithmetic   Data Sufficiency
10     Arithmetic   Data Sufficiency
11     Arithmetic   Data Sufficiency
12     Algebra      Data Sufficiency
13     Geometry     Data Sufficiency

Now assume each examinee is required to take a 6-item test with the following constraints:

Table 2.4 CAT Constraints (Davey, 2005)
Specification   Classification            Number of Items
1               Arithmetic content        3
2               Algebra content           2
3               Geometry content          1
4               Problem Solving format    3
5               Data Sufficiency format   3

A variety of solutions can satisfy the requirements in Table 2.4. Which one should be chosen depends on the quality of the items of the different types and the goal of the test. Here is one reasonable design satisfying all the constraints:

Table 2.5 An Example of a Template (Davey, 2005)
             PS   DS   Total
Arithmetic   1    2    3
Algebra      1    1    2
Geometry     1    0    1
Total        3    3    6

Since the CAT has a fixed length of 6 items, the items in the pool can be divided into 6 bins, as shown in Table 2.6.

Table 2.6 Dividing Items into Bins (Davey, 2005)
Bin   Content / Format   Items
1     Ar / PS            1, 2, 3
2     Ar / DS            8, 9
3     Ar / DS            10, 11
4     Al / PS            4, 5, 6
5     Al / DS            12
6     G / PS             7

The items collected in the same bin have the same content and format. All examinees use this template when taking the CAT, but which specific items are administered is determined adaptively: Examinee 1 may take Items 1, 8, 10, 4, 12, and 7, while Examinee 2 may take Items 1, 9, 11, 5, 12, and 7.
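A minimal code sketch of Davey's example may also help: the pool of Table 2.3 is stored as the six bins of Table 2.6, and a test is built by taking the most informative item from each bin in template order. The information values here are simulated; in an operational CAT they would be evaluated at the current ability estimate, which is updated after each administered item.

```python
# A minimal sketch of bin-structured selection for Davey's (2005) example.
import random

random.seed(2)
bins = {                       # bin -> interchangeable items (Table 2.6)
    1: [1, 2, 3],              # Arithmetic / Problem Solving
    2: [8, 9],                 # Arithmetic / Data Sufficiency
    3: [10, 11],               # Arithmetic / Data Sufficiency
    4: [4, 5, 6],              # Algebra / Problem Solving
    5: [12],                   # Algebra / Data Sufficiency
    6: [7],                    # Geometry / Problem Solving
}
info = {i: random.random() for i in range(1, 14)}  # simulated item information

def administer(template):
    """Each bin contributes exactly one item, in the fixed template order."""
    return [max(bins[b], key=lambda i: info[i]) for b in template]

print(administer([1, 2, 3, 4, 5, 6]))   # one 6-item test satisfying Table 2.4
```

Because the template already encodes the blueprint, no counting of content areas or formats is needed at selection time; the within-bin choice is the only adaptive decision.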
It should be noted that Bin 2 and Bin 3 are identically defined, and during the CAT only one item is drawn from each bin. Another observation is that the Geometry / Data Sufficiency item is not included in any bin, as such an item is not needed by the template in Table 2.5. One implication is that the items used in a bin-structured method may be only a subset of the whole pool; the design of the bins therefore affects item usage efficiency.

The bin-structured method has several practical advantages. First, compared with assembling a CAT independently for each examinee as the STA does, the bin-structured method specifies the ordering of item delivery explicitly, standardizing the look of the test across examinees, and therefore eliminates context effects across examinees (Robin, 2005). It administers the items in a controlled and predictable way rather than chaotically, which may be more acceptable to examinees. The merit of assembling a CAT this way is more obvious when the item pool is small or has only a limited number of items of a certain type. For example, in the example above, a sequential selection without a template may run into a dead end, as Table 2.7 shows.

Table 2.7 Example of the First Five Items Selected (Davey, 2005)
Position   Item   Content   Format
1          12     Al        DS
2          1      Ar        PS
3          3      Ar        PS
4          13     G         DS
5          2      Ar        PS

To satisfy the test requirements, an Algebra / Data Sufficiency item is needed as the sixth item. However, the pool contains only one Al/DS item and it has already been used. In other words, in sequential selection an early item choice can have a severe influence on the later stages (Davey, 2005), as the use of each item cannot be predicted. This will not happen in the bin-structured method.

Because the bins in the bin-structured method do not interact with each other, exposure control can be conducted within bins without influencing the other bins or the entire pool (a sketch of such within-bin selection follows at the end of this section), and item replacement is more convenient (Davey, 2005). Other CAT assembly methods such as the WDA often require tedious preliminary simulations to control item exposure (Robin, 2005). Furthermore, the bin-structured method guarantees that the test construction rules are satisfied by developing an appropriate template in advance, which significantly simplifies item selection and test administration. Also, as the item selection at each step is restricted to one bin, an item competes only with items in the same bin, and the calculation burden is therefore greatly reduced. Finally, the control of item enemies is easy: the item enemies can be put in one bin, and choosing one will exclude its enemies because each bin contributes only one item.

Although the bin-structured approach adopts a uniform routine for all examinees and seems less flexible than the STA, it is not necessarily less adaptive if the bins are developed properly (Davey, 2005). Furthermore, it can be combined with other test assembly methods. Robin (2005) incorporated the bin-structured model into the WDA and found that the bin-structured approach worked equally well compared with the conventional WDA in terms of measurement efficiency, content balance, exposure rate, and efficient item use. However, as a relatively new method, research on the bin-structured method is still rare, none of it uses mixed-item-format-based CAT, and no study has investigated what factors may influence the effect of the bin-structured method.
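The within-bin exposure control mentioned above can be illustrated with a short sketch: one item is drawn from a bin while items that have reached an exposure ceiling are skipped. The randomesque pick among the top few informative eligible items is one common device, used here purely for illustration; it is not a rule prescribed by the studies cited.

```python
# A minimal sketch of within-bin selection with an exposure ceiling.
# Because bins do not interact, this bookkeeping never touches other bins.
import random

def pick_from_bin(bin_items, info, times_used, n_examinees, r_max=0.2, top_k=3):
    """Select one item from a bin, skipping items at the exposure ceiling."""
    eligible = [i for i in bin_items
                if times_used[i] / max(n_examinees, 1) < r_max]
    if not eligible:                 # every item capped: fall back to full bin
        eligible = list(bin_items)
    best = sorted(eligible, key=lambda i: info[i], reverse=True)[:top_k]
    choice = random.choice(best)
    times_used[choice] += 1
    return choice

# Hypothetical usage: item 8 is near the 0.2 ceiling, item 9 is not.
print(pick_from_bin([8, 9], {8: 0.9, 9: 0.7}, {8: 19, 9: 2}, n_examinees=100))
```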
Chapter 3: Methods and Procedures
The main purposes of this study were (1) to investigate whether a mixed-item-based CAT has advantages over a dichotomous-item-based CAT and what challenges it brings; and (2) to compare the STA with the bin-structured method in mixed-item CAT assembly, and to explore what factors might influence the assembly effects. A simulation study was conducted, as a simulation can set up a variety of conditions to evaluate the effects of different factors and also provides the true values as a baseline for assessing bias. This chapter describes the methodological framework of the simulation study. The first section describes the procedure for developing the item pools. Next, the procedure for the CAT simulation is described: the CAT specifications with respect to content area, item format and required cognitive skills are laid out, and this section also illustrates how the STA and the bin-structured method assemble the CAT under different constraint sets. The final section describes the criteria used to assess the CAT assembly approaches.

3.1 Generate Item Pools
3.1.1 Data Source
The item pool was based on the Education Quality and Accountability Office (EQAO) Grade 10 Ontario Secondary School Literacy Test (OSSLT; http://www.eqao.com/). EQAO has existed for almost 20 years with the purpose of providing comparable year-to-year information on student learning. EQAO provides several province-wide assessments: the Assessments of Reading, Writing and Mathematics, Primary and Junior Divisions; the Grade 9 Assessment of Mathematics; and the OSSLT. To simulate a situation where both dichotomously and polytomously scored items (including polytomous items and testlets) are involved, this study focused on the OSSLT. The OSSLT is administered on an annual basis and aims to evaluate Grade 10 students' reading and writing skills; its cut-scores are set through a modified Angoff method (OSSLT, 2015). Students must complete the OSSLT successfully in order to receive the Ontario Secondary School Diploma (OSSD). As the OSSLT is a graduation requirement meant to ensure that students who complete it have acquired minimum reading and writing skills, it is relatively easy. This can be seen from the test information function shown in Figure 3.1.

Figure 3.1 Test Information for OSSLT 2015 (English)

The content of the OSSLT is based on the reading and writing curriculum requirements specified by The Ontario Curriculum to be acquired before the end of Grade 9. The reading part assesses students' abilities (1) to understand explicit information and ideas in the various texts required by the curriculum (noted as R1); (2) to understand implicit information and ideas (noted as R2); and (3) to connect what they read with their background knowledge and personal experience (noted as R3). The writing component evaluates students' abilities to communicate ideas and details, using correct spelling, grammar and punctuation, in the written forms required by the curriculum. Four skills are measured by the writing test: (1) to organize main ideas (noted as W1); (2) to organize relevant information (noted as W2); (3) to use conventions (noted as W3); and (4) to develop a topic (noted as W4).

The data in this study are from the performance of English-speaking students on the 2015 Operational Test of the OSSLT. The 2015 OSSLT contains 38 multiple-choice questions, 4 open-response questions, 4 short writing questions and 4 long writing questions.
The long writing items, however, were not used in this study, as they were not field-tested and real CATs seldom adopt such an item format. That left 8 polytomously scored items, consisting of six 4-point Likert scale items and two 3-point Likert scale items; the 3-point Likert scale scores were excluded from this study as they were not included in the OSSLT analyses. Hence, the entire study was based on 44 items, among which 34 are reading items and 10 are writing items (see Table 3.1).

Table 3.1 OSSLT Test Specification
              Reading          Writing               Total
              R1   R2   R3     W1   W2   W3   W4
Dichotomous   7    17   6      2    2    4    0      38
Polytomous    0    2    2      0    0    0    2      6

Test forms are assembled for both the English and French versions from March 2014 field-test materials. Before the test is administered, all materials and questions are reviewed and approved by a content review committee (i.e., the Assessment Development Committee), which consists of educators from across the province. Meanwhile, another group of equity experts, known as the Sensitivity Committee, reviews all test items and materials to guarantee that they are fair and free from bias. In the field test, approximately 5000 English students and 500 French students are randomly selected to answer each multiple-choice item. The samples used to score the polytomous items (including open-response items and short writing items) contain 1200 students in English and 500 students in French. EQAO requires comparable procedures for both English and French students, but the French sample size is small. Therefore, when calibrating the items with the 3PLM, the OSSLT fixes the a-parameter of the dichotomous items at 0.588 and the c-parameter at 0.2; this modified 3PLM is also known as a modified Rasch model. The slope of the GPCM is also fixed at 0.588. The IRT parameters, classical test theory (CTT) difficulty, and cognitive skills measured are available in the OSSLT report (OSSLT, 2014). Overall, the OSSLT provides reliable, objective and high-quality scores (OSSLT, 2005): the reliability coefficient is above 0.85, and the correct classification rate is 0.90 (OSSLT, 2014).

3.1.2 Generate the Original Item Pool
As stated before, 44 items were kept as the basis for this study, among which 38 were dichotomous and 6 (including polytomous items and testlets) adopted polytomous scoring with four score categories. The item cloning method (Glas & van der Linden, 2003) was used to expand the item pool. The procedure for cloning items was as follows. Represent the parent items (i.e., the items used to produce the new items) as p = 1, ..., P with item parameters ξ_p, and the items within family p as i_p = 1, ..., I_p. For each item, the item parameters form a vector noted as ξ; for instance, in the 3PLM, ξ = [a, b, c]. The item parameters within a family were assumed to follow a multivariate normal distribution:

ξ_p ~ N(μ, Σ)

where μ is the mean vector of the item parameters and Σ is the covariance matrix. In this study, for the GPCM only the overall difficulty and the thresholds were generated, since the slopes were fixed: μ was the vector consisting of the average overall difficulty and thresholds of the 6 polytomous items in the original OSSLT tests, and Σ was the corresponding covariance matrix. All 6 parent item parameter vectors were drawn from the multivariate normal distribution with mean μ and covariance Σ. Then, given the parent parameters ξ_p, the item parameters cloned within family p were sampled from a multivariate normal distribution with mean ξ_p and a covariance matrix with entries equal to half of the entries of Σ, as the variability within the collection of items cloned from the same parent item should be much smaller than the variability between families (Enright, Morley, & Sheehan, 2002; Hively, Patterson & Page, 1968; Macready, 1983; Macready & Merwin, 1973; Meisner, Luecht & Reckase, 1993).
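A minimal sketch of this two-stage sampling for one GPCM family is given below. The μ vector and Σ matrix are illustrative assumptions, standing in for the statistics actually computed from the six original OSSLT polytomous items.

```python
# A minimal sketch of the two-stage item-cloning procedure described above.
import numpy as np

rng = np.random.default_rng(3)
mu = np.array([0.0, -0.8, 0.1, 0.7])     # [overall difficulty, 3 thresholds]
Sigma = np.diag([0.5, 0.3, 0.3, 0.3])    # between-family covariance (assumed)

def clone_family(n_items):
    """Stage 1: draw a parent from N(mu, Sigma); stage 2: draw the family's
    items from N(parent, Sigma/2), i.e., halved within-family variability."""
    parent = rng.multivariate_normal(mu, Sigma)
    items = rng.multivariate_normal(parent, Sigma / 2, size=n_items)
    return parent, items

parent, items = clone_family(25)         # 25 items per family -> a 25x pool
print(parent, items.shape)
```

Halving the covariance in stage 2 keeps clones clustered around their parent, which is the intended structure: families differ from each other more than items within a family differ among themselves.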
For the dichotomous items, as the a- and c-parameters were fixed in the OSSLT, only the b-parameter in the 3PLM was generated, and the above procedure shrank to the univariate case: μ was the mean of the b-parameters of the original 38 items, and Σ was their variance. The format, content area and cognitive skill of the items within a family were kept the same as those of the parent item. The final pool contained 950 dichotomous items and 150 4-point Likert items; that is, the size of the final pool was 25 times that of the original OSSLT. The pool information is shown in Figure 3.2.

Figure 3.2 Original Pool Information

Similar to the original 44-item OSSLT test, Figure 3.2 indicates that the entire pool has more items that are informative at lower abilities. This pool was based on the original calibration of the OSSLT, i.e., the a- and c-parameters in the 3PLM were fixed, and the slopes in the GPCM were also fixed. To identify this pool in the following discussion, it is referred to as the original pool.

3.1.3 Recalibrated Item Pool
In real CAT implementations, the modified 3PLM with fixed a- and c-parameters adopted by the OSSLT is seldom used. To make the conclusions more generalizable, 2PLM and GPCM pools without fixed slopes were generated. As Figure 3.3 (OSSLT, 2015) shows, the English population of the OSSLT has a normal ability distribution N(0.22, 0.91). 5000 examinees were randomly drawn from N(0.22, 0.91), and their responses to the 950 dichotomous items and 150 polytomous items in the original pool were generated through the 3PLM and GPCM with slopes equal to 0.588. This yielded a 5000 × 1100 response matrix, which was then calibrated with flexMIRT (Cai, 2012) using the 2PLM and GPCM. This new pool consisted of the recalibrated item parameters; the other specifications (i.e., content area, item format and cognitive skill measured) for each item were kept the same as in the original pool.

Figure 3.3 Ability Distribution of the English Population (OSSLT, 2015)

The recalibrated pool information is shown in Figure 3.4. Compared with the original pool, the recalibrated pool tends to have more informative items for low-ability examinees, due to the rescaling and the error introduced by estimation and sampling.

Figure 3.4 Recalibrated Pool Information

3.1.4 Nested Difficulty 3PLM Pool
In a test, some content areas are sometimes harder than others (Leong, 2006; Ahmed, Pollitt, Crisp, & Sweiry, 2003), as the concepts, ideas, facts and principles involved in each area are different. In this case, forcing examinees to take items of inappropriate difficulty harms measurement precision. The item parameters were the same as in the original pool, but the easiest 850 items (i.e., the 750 items with the lowest b-parameters in the 3PLM and the 100 items with the lowest overall difficulty in the GPCM) were labeled as the reading items, while the other 250 items were labeled as the writing items. Due to the effect of the GPCM thresholds, the two difficulty ranges were not strictly non-overlapping, which made the simulation more realistic.
Within each reading/writing category, the cognitive skill requirement was randomly assigned to each item, while the proportion of each skill category remained the same as in the original item pool (i.e., the distribution of items measuring each skill was the same as in Table 3.1).

3.1.5 Nested Difficulty 2PLM Pool
The item parameters were the same as in the recalibrated pool, but the 750 items with the lowest b-parameters in the 2PLM and the 100 items with the lowest overall difficulty in the GPCM were regarded as the reading items, and the other 250 items were the writing items.

3.1.6 Balanced Item Pool
As both the original and the recalibrated pool provided more information for low-proficiency students, a more balanced pool was generated to explore the influence of the shape of the item pool on CAT assembly. For the dichotomous items, the a-parameters in the 2PLM from the recalibrated pool were retained, while the b-parameters were simulated from a uniform distribution on [-3, 3]. For the GPCM, the slopes and threshold parameters of the recalibrated items were retained, while the overall difficulty parameters were also randomly drawn from [-3, 3]. The specifications for each item, including the requirements for cognitive skills and content area, were the same as in the recalibrated pool. Figure 3.5 shows that the information of the balanced item pool is not skewed.

Figure 3.5 Balanced Pool Information

3.1.7 Heterogeneous Testlet Pool
When the items within a testlet are homogeneous in content and cognitive skills, for example when all the individual items within a testlet measure R3, they can be merged into a single polytomously scored unit. In practice, however, the individual items within a testlet often measure different skills and abilities. For instance, in a given reading testlet, the first item may measure understanding of the main idea, the second item may assess vocabulary, and the third item may require the examinees to make implicit inferences. To simulate such tests, half of the polytomous-scoring items in the balanced pool were randomly assigned two or three cognitive skills within a same content area. For example, in the balanced pool a testlet consisting of 3 individual items measured only R3 and could be modeled with a 4-point GPCM; in the heterogeneous testlet pool, the parameters of the GPCM remained the same, but the testlet was supposed to measure both R2 and R3 (e.g., with two individual items measuring R2 and one individual item measuring R3). As stated before, all the items in a selected testlet were administered and no within-testlet adaptation was performed; the testlet was regarded as an intact unit when calculating the information, in both the balanced pool and the heterogeneous testlet pool. The heterogeneity, however, influences the content balance control.

In sum, six pools were generated in this study: the original pool, the nested difficulty 3PLM pool, the recalibrated pool, the nested difficulty 2PLM pool, the balanced pool, and the heterogeneous pool. Each nested pool shares the same item parameters as its source pool but carries different content labels, as does the heterogeneous pool relative to the balanced pool.

Figure 3.6 Summary of Six Pools

3.2 Simulation of CAT Procedures
The goal of this study was to compare dichotomous-item-based CAT with mixed-item-format-based CAT, and to explore which CAT assembly method is more efficient and convenient under various conditions. The manipulated variables included the test length, item pool shape, IRT model used, and imposed test constraints.

3.2.1 Long Tests
The long test required each examinee to complete 44 items. To depict a whole picture of how the CAT assembly approaches work along the entire ability continuum, the ability level ranged from -4 to 4 with a step size of 0.1.
The whole procedure was replicated 100 times, which means each level had 100 examinees for computing the conditional bias and standard error of measurement of the ability estimate. The starting ability estimate was randomly picked from [-0.5, 0.5], and the ability estimate was updated through MLE.

Original Pool
Using the original pool, five CATs were implemented:

(1) All 44 items were drawn from the 950 dichotomous items; no constraint was imposed.

(2) The 44 items were selected from the entire mixed item pool (i.e., 950 dichotomous items and 150 polytomous items); no constraint was imposed.

(3) The 44 items were selected from the entire pool using a shadow test with the constraints in Table 3.1, and the maximum exposure rate for each item was fixed at 0.2. For a given Examinee J, when selecting the K-th item (K ≤ 44), the shadow test was formulated as:

Objective: maximize Σ_{i=1}^{N} I_i(θ̂) x_i, i.e., maximize the information at the current ability estimate θ̂, where N is the number of items in the whole item pool, i.e., N = 1100.

Constraints:
x_i ∈ {0, 1}, i = 1, ..., N, i.e., if item i is selected when assembling a shadow test, x_i is valued as 1; otherwise x_i is 0;
Σ_{i ∈ binary R1} x_i = 7, i.e., draw 7 binary R1 items;
Σ_{i ∈ binary R2} x_i = 17, i.e., draw 17 binary R2 items;
Σ_{i ∈ polytomous R2} x_i = 2, i.e., draw 2 polytomous R2 items;
Σ_{i ∈ binary R3} x_i = 6, i.e., draw 6 binary R3 items;
Σ_{i ∈ polytomous R3} x_i = 2, i.e., draw 2 polytomous R3 items;
Σ_{i ∈ binary W1} x_i = 2, i.e., draw 2 binary W1 items;
Σ_{i ∈ binary W2} x_i = 2, i.e., draw 2 binary W2 items;
Σ_{i ∈ binary W3} x_i = 4, i.e., draw 4 binary W3 items;
Σ_{i ∈ polytomous W4} x_i = 2, i.e., draw 2 polytomous W4 items;
t_i / m ≤ 0.2 for i = 1, ..., N, i.e., the exposure rate of each item should be lower than 0.2, where t_i is the number of times item i has been administered among the J examinees who have taken the test so far, and m is the total number of examinees;
x_{i_k} = 1 for k = 1, ..., K − 1, where i_k is the item administered at the k-th step; that is, the decision variables for the items already administered to this examinee must equal 1, so the items already administered to Examinee J must be in the shadow test.

After assembling the shadow test, the item with maximum information among those not yet administered was selected and administered; the ability was then re-estimated, and a new shadow test fulfilling all the test specifications was assembled at the new θ̂. This procedure was repeated until the examinee completed the 44-item CAT.

(4) The 44 items were selected from the entire pool using a combination of the bin-structured method and the shadow test: the item format (polytomous vs. dichotomous) and the content areas (reading vs. writing) were controlled by the bin structure, while the specifications for cognitive skills were fulfilled by the shadow test. According to the test blueprint in Table 3.1, the items in the mixed pool were divided into 30 reading dichotomous bins (each including items of R1, R2 and R3), 4 reading polytomous bins (each including items of R2 and R3), 8 writing dichotomous bins (each covering W1, W2 and W3), and 2 writing polytomous bins (involving only W4). Each bin had 25 items of the same format and content. The sequence of the bins was exactly the same as the item order in the paper-and-pencil OSSLT, i.e., 24 reading binary items --- 2 reading polytomous items --- 6 reading binary items --- 2 reading polytomous items --- 4 writing binary items --- 2 writing polytomous items --- 4 writing binary items.
After determining the order of the bins, a shadow test was used to satisfy the requirements for cognitive level, test information and exposure rate, but the shadow test picked exactly one item from each bin, in the specified order. In other words, besides the constraints in (3), one additional constraint for the shadow test was:

Σ_{i ∈ Bin_k} x_i = 1, for k = 1, ..., 44, i.e., draw one item from each bin.

(5) The 44 items were selected from the entire pool, but in contrast to (4), here the bin structure took over the constraints on content area, item format, and also cognitive level. The entire mixed item pool was divided into 7 binary R1 bins, 17 binary R2 bins, 2 polytomous R2 bins, 6 binary R3 bins, 2 polytomous R3 bins, 2 binary W1 bins, 2 binary W2 bins, 4 binary W3 bins and 2 polytomous W4 bins. Each bin contained 25 items that were interchangeable with respect to content, format and cognitive level. In other words, the number of bins in (5) was the same as in (4), but the criteria used to develop the bins were different. The order of the bins was: 7 binary R1 --- 17 binary R2 --- 2 polytomous R2 --- 6 binary R3 --- 2 polytomous R3 --- 2 binary W1 --- 2 binary W2 --- 2 polytomous W4 --- 4 binary W3, which was consistent with the OSSLT. A shadow test was used to control the exposure rate and achieve high test information; that is, the specifications for the shadow tests were:

Objective: maximize Σ_{i=1}^{N} I_i(θ̂) x_i, i.e., maximize the information at the current θ̂, where N is the number of items in the whole item pool.

Constraints:
Σ_{i ∈ Bin_k} x_i = 1, for k = 1, ..., 44, i.e., draw one item from each bin;
t_i / m ≤ 0.2 for i = 1, ..., N, i.e., the exposure rate of each item should be lower than 0.2, where t_i counts the administrations among the J examinees tested so far and m is the total number of examinees;
x_{i_k} = 1 for k = 1, ..., K − 1, where i_k is the item administered at the k-th step; the items already administered must be in the shadow test.

In (4) and (5), when ordering the bins within a same category (e.g., the bins of binary R1), the early bins had b-parameters closer to 0, as the starting point of θ̂ was within (-0.5, 0.5), and later bins covered a broader range of difficulty.

Among the five procedures above, the comparison between (1) and (2) reveals whether mixed-item-format-based CAT can improve on the performance of dichotomous-item-based CAT, and what challenges it might bring. As polytomous items can provide more information, the mixed CAT was expected to yield higher measurement accuracy; however, the polytomous items might also have higher exposure rates precisely because they are more informative. The difference between (2) and (3)(4)(5) indicates the influence of imposing test constraints; (2) was expected to produce more accurate ability estimates, since the requirements for content balance and exposure rate may compromise test efficiency. Furthermore, (3), (4) and (5) were compared to explore which CAT assembly method performed better, i.e., the effects of using the bin structure. In sum, five CAT simulations were conducted in the original pool, as Figure 3.7 indicates: the binary CAT without constraint, the mixed CAT without constraint, the STA with the constraints in Table 3.1, the combination method with the constraints in Table 3.1, and the bin-structured method with the constraints in Table 3.1.

Figure 3.7 Five CAT Simulations in the Original Pool
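The sketch below shows the overall simulation loop that these procedures share: a starting estimate drawn from [-0.5, 0.5], selection of the most informative item at the current estimate, response simulation, and an MLE update until the fixed length is reached. For brevity it selects from the whole pool, corresponding to the unconstrained procedure (2); the shadow-test and bin steps sketched earlier would replace the selection line. All item parameters are simulated, not the OSSLT pool's.

```python
# A minimal, self-contained sketch of one simulated examinee's CAT, under
# assumed 2PL item parameters (the constrained procedures would restrict the
# candidate set at the selection step).
import numpy as np

rng = np.random.default_rng(7)
a = rng.uniform(0.5, 2.0, 200)          # 2PL discriminations (simulated pool)
b = rng.normal(0.0, 1.0, 200)           # 2PL difficulties

def p(theta, i):
    return 1.0 / (1.0 + np.exp(-a[i] * (theta - b[i])))

def info(theta, i):                     # 2PL item information
    pi = p(theta, i)
    return a[i] ** 2 * pi * (1 - pi)

def mle(items, u, grid=np.linspace(-4, 4, 801)):
    ll = np.zeros_like(grid)
    for i, ui in zip(items, u):
        pi = 1.0 / (1.0 + np.exp(-a[i] * (grid - b[i])))
        ll += ui * np.log(pi) + (1 - ui) * np.log(1 - pi)
    return grid[np.argmax(ll)]

def run_cat(true_theta, test_len=44):
    theta = rng.uniform(-0.5, 0.5)      # starting ability estimate
    items, u = [], []
    for _ in range(test_len):
        unused = [i for i in range(len(a)) if i not in items]
        item = max(unused, key=lambda i: info(theta, i))   # selection step
        items.append(item)
        u.append(int(rng.random() < p(true_theta, item)))  # simulate response
        theta = mle(items, u) if 0 < sum(u) < len(u) else (4.0 if u[0] == 1 else -4.0)
    return theta, items

print(run_cat(1.0)[0])                  # final estimate for true theta = 1.0
```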
Nested Difficulty 3PLM Pool
This pool labeled all the easy items in the original pool as reading items and the hard items as writing items. When no test specification constraint was added, as in simulations (1) and (2) in the original pool, this nested pool functioned in the same way as the original pool; it differed from the original pool only when content balance was required. Therefore, three CATs were implemented using the nested difficulty 3PLM pool: (1) the shadow test with the constraints in Table 3.1; (2) items divided into 44 bins according to format and content, with the shadow test controlling the cognitive levels (as in (4) in the original pool); and (3) items divided into 44 bins according to format, content and cognitive level (as in (5) in the original pool). For (2) and (3), the procedure for developing the bins and the specifications for the shadow test were the same as in the corresponding procedures in the original pool.

Recalibrated Pool
The five CAT procedures in the original pool were repeated in the recalibrated pool to explore the influence of adopting different IRT models. Each bin contained 25 items. In the original pool, the magnitude of the b-parameter was used as the criterion for dividing the bins. In contrast, the recalibrated pool took the a-parameters into consideration: when developing the bins, within a same bin category (e.g., the binary reading bins), the early bins had items with lower a-parameters, and later bins were more discriminating. This strategy borrowed the idea of the a-stratified design for CAT (Chang & Ying, 1999), which holds that in the early stage of a CAT the estimated ability may be far from the true ability, so administering highly informative items at the beginning is a waste.

Nested Difficulty 2PLM Pool
The three CAT procedures in the nested difficulty 3PLM pool were repeated in this pool. Again, when developing the bins, later bins within a given item category (e.g., binary R3) had higher a-parameters.

Balanced Item Pool
The five procedures in the original pool were conducted in the balanced item pool to investigate the influence of pool shape. For a given bin category, early bins contained items with lower a-parameters, while later bins had higher a-parameters.

Heterogeneous Testlet Pool
In all the pools above, the items within a testlet were homogeneous, and the testlet could be regarded as a polytomous item. In the heterogeneous testlet pool, however, the polytomous items and the testlets were different: although both adopted polytomous scoring and were modeled by the GPCM, a testlet involved multiple cognitive skills, which made the content balance procedure tricky. In this pool, the test specifications in Table 3.1 were modified to those in Table 3.2.

Table 3.2 Modified Test Specification for the Heterogeneous Testlet Pool
              Reading          Writing               Total
              R1   R2   R3     W1   W2   W3   W4
Dichotomous   7    17   6      2    2    2    0      36
Polytomous    0    1    1      0    0    0    1      3
Testlet       3 Reading        2 Writing             5

The five testlet-based items contained 15 individual items in total, as each testlet consisted of 3 individual items and thus had four scoring categories (0-3). For the nine individual items in reading, one additional requirement was that each of R1, R2 and R3 be measured by at least one item; for the six individual items in writing, each of W1, W2 and W3 was likewise measured by at least one item.
Two CATs were assembled in the heterogeneous pool: (1) the shadow test with the constraints in Table 3.2 and a maximum exposure rate of 0.2, as for the shadow test in the original pool; and (2) the combination of the shadow test and the bin-structured method, where the bin structure controlled the item format and content area, and the shadow test took charge of the requirements for test information, cognitive level, and exposure rate, as in the combination method in the original pool. It should be noted that the procedure in the original pool where the cognitive skills were also controlled by the bin structure was not applicable here, since a testlet involves several skills and it was hard to build cognitive-skill-interchangeable bins.

3.2.2 Short Tests
To investigate whether test length would influence the results, a 22-item CAT was also simulated using all the pools above except the heterogeneous pool. The proportion of each item type is similar to the 44-item CAT, as Table 3.3 shows. All the CAT procedures for the long tests were repeated with the constraints in Table 3.3.

Table 3.3 Test Specification for the 22-Item CAT
              Reading          Writing               Total
              R1   R2   R3     W1   W2   W3   W4
Dichotomous   3    9    3      1    1    2    0      19
Polytomous    0    1    1      0    0    0    1      3

In summary, two test lengths (44/22) × six item pools (original 3PLM / nested difficulty 3PLM / recalibrated 2PLM / nested difficulty 2PLM / balanced / heterogeneous testlet) × five CAT assembly approaches (dichotomous only / mixed format without constraint / shadow test / combination of shadow test and bin-structured method / pure bin-structured) were simulated, with the exceptions noted above for the heterogeneous pool. The MOSEK package in Matlab was used to solve the optimal information problem. Each simulation covered examinees at 81 evenly spaced ability levels within [-4, 4], and all simulations were repeated 100 times.

3.3 Evaluation Criteria
Each testing simulation was evaluated by measurement, content, security, and item usage efficiency criteria.

3.3.1 Measurement Criteria
The evaluation of measurement was based on overall and conditional results (Robin, 2008). The overall statistics were obtained from all 8100 (i.e., 81 ability levels × 100 replications) examinees. Conditional statistics were obtained from the 100 replications at each given ability level. Both estimated indexes and true indexes were computed. Estimated standard errors of measurement (SEM) were obtained through MLE and the test information. Furthermore, since one merit of a simulation study is that the true values are known, the bias, mean absolute bias (MAB) and root mean squared error (RMSE) can be calculated from the true estimation error (θ̂ − θ). Smaller SEM, bias, MAB, and RMSE values indicate more accurate results.

Conditional Statistics
Given r = 1, 2, ..., 100 replications at a given ability level θ, and writing θ̂_r for the final estimate in replication r, the true conditional bias (CB) is:

CB(θ) = (1/100) Σ_{r=1}^{100} (θ̂_r − θ)      (3.1)

The true conditional absolute bias (CAB) is:

CAB(θ) = (1/100) Σ_{r=1}^{100} |θ̂_r − θ|      (3.2)

The conditional standard error of measurement (CSEM) is:

CSEM(θ) = sqrt( (1/100) Σ_{r=1}^{100} (θ̂_r − θ̄)² )      (3.3)

where θ̄ is the mean of the 100 estimates at θ. The conditional standard error of measurement can also be obtained from the test information as:

CSEM(θ) = 1 / sqrt( I(θ) )      (3.4)

Overall Statistics
The overall statistics pool over all the examinees to form a single index evaluating the effect of test assembly. The true overall bias (Bias) is:

Bias = (1/m) Σ_{j=1}^{m} (θ̂_j − θ_j)      (3.5)

The mean absolute bias (MAB) is:

MAB = (1/m) Σ_{j=1}^{m} |θ̂_j − θ_j|      (3.6)

The root mean squared error (RMSE) is:

RMSE = sqrt( (1/m) Σ_{j=1}^{m} (θ̂_j − θ_j)² )      (3.7)

where j refers to Examinee j and m is the total number of examinees.
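A compact sketch of how these statistics can be computed from a replications-by-ability matrix of final estimates is given below; the array layout (rows are replications, columns are true ability levels) is an assumption of the illustration.

```python
# A minimal sketch of the conditional (Eq. 3.1-3.3) and overall (Eq. 3.5-3.7)
# statistics, given estimates theta_hat[r, q] for R replications at each of
# Q true ability levels theta[q].
import numpy as np

def conditional_stats(theta_hat, theta):
    """theta_hat: (R, Q) estimates; theta: (Q,) true ability levels."""
    err = theta_hat - theta                  # theta_hat_rq - theta_q
    cb = err.mean(axis=0)                    # Eq. 3.1, conditional bias
    cab = np.abs(err).mean(axis=0)           # Eq. 3.2, conditional |bias|
    csem = theta_hat.std(axis=0, ddof=0)     # Eq. 3.3, spread around the mean
    return cb, cab, csem

def overall_stats(theta_hat_j, theta_j):
    """Pooled over all m examinees (Eq. 3.5-3.7)."""
    err = theta_hat_j - theta_j
    return err.mean(), np.abs(err).mean(), np.sqrt((err ** 2).mean())
```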
3.3.2 Content Balance
Content balance was evaluated by the proportion of assembled tests that satisfied the specifications in Tables 3.1 to 3.3. Under each condition, the rates of deviation from the specifications for content, format, and cognitive skills were calculated separately. As the shadow test and bin-structured methods force the item selection rule to incorporate the test specifications, all the tests should meet the requirements.

3.3.3 Test Security
CAT evaluations commonly use the item exposure rate and the average item overlap to evaluate item exposure and test security (Way, 1998). Specifically, the number of items achieving the maximum exposure rate, the distribution of item exposure rates, and the distribution of overlap rates were reported in this study. As defined earlier, the item exposure rate is the relative frequency with which an item is administered across all CAT administrations:

er_i = t_i / m      (3.8)

where t_i refers to how many times item i is administered, and m is the total number of examinees.

Another index used to evaluate test security was the test overlap rate. For a pair of fixed-length CATs, the between-test overlap is the proportion of items appearing on both tests; the mean of the between-test overlaps across all possible pairwise tests is the average between-test overlap (Way, 1998). In this study, suppose m is the number of examinees (m = 8100) and l is the test length (l = 44 or 22). The overlap rate was calculated by (1) counting the number of shared items for each of the m(m − 1)/2 pairs of examinees, (2) summing across all the m(m − 1)/2 examinee pairs, and (3) dividing the total count by l · m · (m − 1)/2. A lower overlap rate indicates a higher security level.

3.3.4 Item Usage
Ideal item usage is achieved when all the items are utilized with equal frequency (Chang & Ying, 1999). Therefore, the distribution of the item exposure rates and the number of never-used items can also measure the efficiency of item pool usage.
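The sketch below computes the exposure rates of Equation 3.8 and the average between-test overlap. Rather than enumerating all m(m − 1)/2 pairs, it uses the algebraically equivalent per-item tally (an item administered t_i times is shared by t_i(t_i − 1)/2 pairs), which is far cheaper; this shortcut is an implementation choice of the illustration, not necessarily the procedure used in this dissertation.

```python
# A minimal sketch of the security indexes in 3.3.3.
import numpy as np

def exposure_rates(tests, pool_size):
    """tests: one list of item indexes per examinee; returns er_i = t_i / m."""
    t = np.zeros(pool_size)
    for test in tests:
        t[test] += 1
    return t / len(tests)

def average_overlap(tests, pool_size, test_len):
    """Average between-test overlap via per-item pair counts."""
    m = len(tests)
    t = np.zeros(pool_size)
    for test in tests:
        t[test] += 1
    shared_pairs = np.sum(t * (t - 1) / 2)     # sum over items of C(t_i, 2)
    return shared_pairs / (test_len * m * (m - 1) / 2)
```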
Chapter 4: Results
This chapter summarizes the results of the simulation study described in Chapter 3. The results are divided into two sections corresponding to the two research objectives proposed in Chapter 1.

4.1 Research Question 1
To answer the question of whether the mixed CAT had advantages over the dichotomous-item-based CAT, and what challenges the mixed CAT brought, the mixed CAT and the dichotomous CAT without any constraint were compared on the measurement, test security and item pool usage criteria. No content balance evaluation was conducted, since no content constraint was added in this case.

4.1.1 Measurement Criteria
The measurement criteria evaluated two facets of the CAT ability estimate: accuracy and stability. While the conditional results demonstrate how the findings vary across ability levels, the overall results provide summary information about the effectiveness of each method and facilitate interpretation (Robin, 2001). Therefore, both overall and conditional results are reported.

Conditional Result
The bias and absolute bias indicate accuracy, while the conditional standard error of measurement (CSEM) shows the variation of the estimate around its mean, with a small value indicating a stable estimate. The test-information-based conditional standard error of measurement (TCSEM), i.e., the standard error of measurement calculated through the test information, was also provided; again, a small value means high stability. The difference between the CSEM and the TCSEM is that the CSEM refers to the variation of the estimate around its mean, while the TCSEM indicates the variation of the estimate around the true value; furthermore, the TCSEM assumes that the item parameters are true and the model fits the data well. There was no obvious difference in the conditional bias of the ability estimates between the mixed and the dichotomous CAT, but at all ability levels the mixed CAT had smaller absolute bias, CSEM and TCSEM, and larger test information. See Figures 4.1(a)-(k) to 4.5(a)-(k) for details.

Overall Result
Compared with the dichotomous CAT, the mixed CAT had smaller mean bias, mean absolute bias, and RMSE under all simulation conditions. See Tables 4.1 to 4.3 for details.

4.1.2 Test Security Criteria
Item Exposure
The mixed CAT had a more skewed item exposure rate distribution than the dichotomous CAT. In the mixed CAT, several items were administered to most of the examinees, while almost 90% of the items were never used. Further analysis showed that in the mixed CAT the items with the highest exposure rates were all polytomously scored items. See Figures 4.6 to 4.10 for details.

Overlap Rate
The mixed CAT had a higher overall overlap rate. Also, along the whole ability continuum, it had a higher conditional overlap rate than the dichotomous CAT. See Figure 4.12(a)-(k) and Table 4.5 for details.

4.1.3 Item Usage
Since a more skewed item exposure rate distribution indicates lower efficiency, the efficiency of item usage was lower in the mixed CAT than in the dichotomous CAT. Furthermore, under most circumstances more than 85% of the items in the mixed CAT were never used. See Table 4.6 for details.

In sum, the mixed CAT led to higher measurement accuracy and stability, in terms of both the overall and the conditional indexes. However, it had a higher overlap rate, more highly exposed items, and less balanced item usage. Operational CAT assembly should take these issues into consideration. The section below compares three constrained CAT assembly methods on their effectiveness in dealing with these problems.

4.2 Research Question 2
As stated before, the second research objective was to compare the STA with the bin-structured method in mixed-item CAT assembly and to explore what factors might influence the assembly effects. In this section, the results are organized and presented according to the four criteria (i.e., measurement, content balance, test security, and item usage) used to evaluate the assembly approaches. In the figures and tables that follow, "mix" refers to the CAT in which items were picked from the mixed item pool containing dichotomously and polytomously scored items without any constraint; "STA" refers to the shadow test with test constraints on item format, content area, cognitive ability and exposure rate; "combination" refers to the method that used bins to control item format and content area while using the STA to fulfill the demands for cognitive ability and exposure rate; and "bin" refers to the pure bin-structured method, in which all constraints were handled by the bin structure.

4.2.1 Measurement Criteria
Conditional Result
(1) Conditional Bias
Figures 4.1(a) to (k) reveal no substantial difference in the conditional bias of the ability estimates among the three constrained CAT assembly methods. In other words, incorporating the bin structure leads to measurement accuracy comparable to the STA. Under all the methods, the ability was overestimated at the lower end of the ability continuum and underestimated at the upper end, and this trend was more obvious in the unbalanced pools.
Furthermore, in the unbalanced pools the magnitude of the bias was larger for highly proficient examinees than for examinees of extremely low proficiency, as the pools contained fewer informative items for measuring the high ability levels. Compared with the other pools, the balanced pool had a flatter conditional bias pattern and much smaller bias at the extreme abilities, because the balanced pool provides more information at the ends of the ability continuum (see Figure 3.5). The heterogeneous pool had a pattern similar to the balanced pool, as the item parameters were the same in these two pools. At a given ability level, the short tests had larger bias than the long tests.

Figures 4.1(a)-(k): Conditional bias for the original, nested difficulty 3PLM, recalibrated, nested difficulty 2PLM, balanced and heterogeneous pools (44 items; panels a-f) and for the original, nested difficulty 3PLM, recalibrated, nested difficulty 2PLM and balanced pools (22 items; panels g-k).

(2) Conditional Absolute Bias (CAB)
Figures 4.2(a) to (k) show that along the entire ability continuum, the three constrained CAT assembly approaches had similar absolute bias, which was larger than that of the unconstrained mixed CAT. In the unbalanced pools, the absolute bias for high-proficiency examinees was larger than for examinees at other ability levels. For all the methods, compared with the unbalanced pools, the pattern of the absolute bias in the balanced pool was more uniform, and the values at the upper end of the ability continuum were much smaller than in the unbalanced pools, as the balanced pool provides more information for high-ability examinees. When the pool was balanced, all four approaches involving polytomous items had smaller absolute bias than the binary CAT. When the pool was unbalanced, within the relatively low ability range, the CATs incorporating polytomous items performed better than the unconstrained binary CAT even when constraints were imposed on the mixed CAT, as the mixed pool contained many informative items in this range. Shorter tests had larger absolute bias than long tests at a given ability level.
Figures 4.2(a)-(k): Conditional absolute bias for the original, nested difficulty 3PLM, recalibrated, nested difficulty 2PLM, balanced and heterogeneous pools (44 items; panels a-f) and for the original, nested difficulty 3PLM, recalibrated, nested difficulty 2PLM and balanced pools (22 items; panels g-k).

(3) Conditional Standard Error of Measurement (CSEM)
Figures 4.3(a) to (k) indicate that at all proficiency levels no difference is found among the three constrained CAT assembly methods, and the mixed CAT without constraint always has smaller CSEM values than the other methods. In the unbalanced pools, the CSEM was higher for highly proficient examinees. The CSEM in the balanced pool had a more uniform shape than in the other pools, as the balanced pool provides informative items for examinees along the whole ability continuum (see Figure 3.5). When the pool was balanced, the binary CAT had the largest CSEM at all ability levels. When the pool was unbalanced, within the range where the pool could provide more information, i.e., at the relatively low ability levels, the binary CAT still had a larger CSEM than the CAT assembly approaches using polytomous items. Shorter tests had larger CSEMs.

Figures 4.3(a)-(k): Conditional SEM for the original, nested difficulty 3PLM, recalibrated, nested difficulty 2PLM, balanced and heterogeneous pools (44 items; panels a-f) and for the original, nested difficulty 3PLM, recalibrated, nested difficulty 2PLM and balanced pools (22 items; panels g-k).

(4) Test-Information-Based Conditional Standard Error of Measurement (TCSEM)
The findings in Figures 4.4(a) to (k) are similar to those in Figures 4.3(a) to (k). Among the three constrained mixed CATs, in the original and nested difficulty 3PLM pools the STA had a slightly smaller TCSEM than the combination and bin-structured methods, especially at the ability levels where the pool had limited informative items; in the other pools the three methods did not have obvious differences. Under all conditions, the mixed CAT without constraint had the smallest TCSEM values. In the unbalanced pools the TCSEM was higher at the high ability levels. In the balanced pool the binary CAT had the highest TCSEM along the entire ability continuum, while in the unbalanced pools it performed worse than the STA, combination and bin-structured methods only at the ability levels where the mixed pool could provide high test information. Long tests had smaller TCSEMs than short ones.
Again, the balanced pool yielded a much flatter TCSEM plot, as the balanced pool can construct equally informative tests along the entire ability continuum. These findings were also supported by the conditional test information (CTI; see Figures 4.5(a) to (k)). Among the three constrained mixed-CAT assembly approaches, the STA provided slightly higher information for examinees with extremely high or low proficiency in the unbalanced pools, since in these pools the quality of the bins was compromised and the STA had more options for item selection than the bin-structured method; in the balanced pool this advantage of the STA diminished. The mixed CAT without constraint always provided the maximum test information. Where the mixed pool could provide high information, i.e., at all ability levels in the balanced pool and at the relatively low ability levels in the unbalanced pools, including polytomous items enhanced the test information. In addition, it should be noted that the CTI in the balanced pool had a bimodal shape, as the pool information provided by the polytomous items in this pool was bimodal (see Figure 3.5).

Figures 4.4(a)-(k): TCSEM for the original, nested difficulty 3PLM, recalibrated, nested difficulty 2PLM, balanced and heterogeneous pools (44 items; panels a-f) and for the original, nested difficulty 3PLM, recalibrated, nested difficulty 2PLM and balanced pools (22 items; panels g-k).

Figures 4.5(a)-(k): CTI for the same pools and test lengths as Figures 4.4(a)-(k).

Overall Result
(1) Bias
Table 4.1 indicates that the mixed CAT without constraint always has smaller overall bias than any other CAT assembly method, and that when constraints are imposed, the STA has smaller bias than the other two approaches in the unbalanced pools. Short tests had larger bias than the corresponding long tests. The overall bias in the balanced pool was 0, while the other pools led to slightly positive bias.

Table 4.1 Overall Bias of the Ability Estimate
                        Binary   Mix    STA    Combination   Bin
Long   Original         0.03     0.01   0.02   0.03          0.03
       Nested 3PLM      0.03     0.01   0.02   0.03          0.03
       Recalibrated     0.02     0.01   0.02   0.04          0.02
       Nested 2PLM      0.02     0.01   0.02   0.03          0.03
       Balanced         0.00     0.00   0.00   0.00          0.00
       Heterogeneous    0.00     0.00   0.00   0.00          NA
Short  Original         0.04     0.02   0.03   0.05          0.05
       Nested 3PLM      0.04     0.02   0.04   0.03          0.03
       Recalibrated     0.03     0.02   0.03   0.03          0.04
       Nested 2PLM      0.03     0.02   0.03   0.03          0.04
       Balanced         0.00     0.00   0.00   0.00          0.00

(2) Mean Absolute Bias (MAB)
(2) Mean Absolute Bias (MAB)

Table 4.2 shows that the mixed CAT has the smallest overall mean absolute bias in all simulation conditions, while the binary CAT has the largest MAB; the STA outperforms the combination and bin-structured methods when the pools are unbalanced. Long tests had smaller MAB than short tests.

Table 4.2 Overall Mean Absolute Bias (MAB)

                     Binary   Mix    STA   Combination   Bin
Long
  Original            0.34   0.21   0.31      0.34      0.34
  Nested 3PLM         0.34   0.21   0.33      0.36      0.35
  Recalibrated        0.35   0.22   0.31      0.35      0.34
  Nested 2PLM         0.35   0.22   0.34      0.35      0.34
  Balanced            0.25   0.12   0.20      0.20      0.20
  Heterogeneous       0.25   0.12   0.19      0.19      NA
Short
  Original            0.47   0.28   0.41      0.46      0.46
  Nested 3PLM         0.47   0.28   0.46      0.48      0.47
  Recalibrated        0.46   0.28   0.43      0.45      0.45
  Nested 2PLM         0.46   0.28   0.46      0.47      0.46
  Balanced            0.35   0.15   0.30      0.28      0.27

(3) Root Mean Squared Error (RMSE)

Table 4.3 shows that the smallest overall RMSE is obtained by the mixed CAT without constraint. The STA had more stable estimates than the combination and bin-structured methods in unbalanced pools. The estimates from the short tests had larger RMSE than those from the long tests.

Table 4.3 RMSE of Ability Estimate

                     Binary   Mix    STA   Combination   Bin
Long
  Original            0.44   0.28   0.41      0.47      0.46
  Nested 3PLM         0.44   0.28   0.44      0.47      0.47
  Recalibrated        0.46   0.29   0.42      0.47      0.46
  Nested 2PLM         0.46   0.29   0.46      0.47      0.46
  Balanced            0.32   0.15   0.25      0.25      0.25
  Heterogeneous       0.32   0.15   0.24      0.24      NA
Short
  Original            0.60   0.37   0.53      0.61      0.60
  Nested 3PLM         0.60   0.37   0.60      0.63      0.61
  Recalibrated        0.61   0.38   0.56      0.61      0.60
  Nested 2PLM         0.61   0.38   0.62      0.62      0.61
  Balanced            0.45   0.20   0.38      0.36      0.35

4.2.2 Content Balance

As expected, all assembled CATs fulfilled the pre-determined requirements for content area, cognitive ability, and item format. This is because the STA and the bin-structured method combine the goal of administering highly informative items with an algorithm that imposes the test constraints on item selection (van der Linden, 2005; He, 2010).

4.2.3 Test Security

Distribution of Item Exposure Rate

(1) Original Item Pool

Figures 4.6(a) and (b) indicate that among the three CAT procedures with constraints, the STA has the longest tail in the exposure rate distribution and the fewest items reaching the maximum exposure rate of 0.2. In other words, the STA had more balanced item exposure and higher item usage efficiency than the combination and bin-structured methods.

[Figures 4.6(a)-(b): Exposure Rate Distribution for the Original Pool, 44 and 22 Items]

(2) Nested Difficulty 3PLM Pool

This pool yielded a tendency similar to the original pool: compared with the combination method and the bin-structured method, the shadow test approach had fewer highly exposed items and fewer unused items.

[Figures 4.7(a)-(b): Exposure Rate Distribution for the Nested Difficulty 3PLM Pool, 44 and 22 Items]

(3) Recalibrated Pool

In contrast with the original pool and the nested difficulty 3PLM pool, among the three constrained CATs the combination and bin-structured methods had longer tails in the exposure rate distribution, and the numbers of items reaching the maximum exposure rate for these two methods were smaller than for the STA. In addition, the combination method performed slightly better than the bin-structured method.
[Figures 4.8(a)-(b): Exposure Rate Distribution for the Recalibrated Pool, 44 and 22 Items]

(4) Nested Difficulty 2PLM Pool

The nested difficulty 2PLM pool presented a pattern similar to the recalibrated pool. The STA had more items reaching the maximum exposure rate and also more unused items. The combination method still performed better than the bin-structured method.

[Figures 4.9(a)-(b): Exposure Rate Distribution for the Nested Difficulty 2PLM Pool, 44 and 22 Items]

(5) Balanced Pool

Compared with the STA, the combination and pure bin-structured methods had fewer unused or highly exposed items, and the difference was more obvious than in the unbalanced pools. The combination method outperformed the bin-structured method.

[Figures 4.10(a)-(b): Exposure Rate Distribution for the Balanced Pool, 44 and 22 Items]

(6) Heterogeneous Pool

The difference between the STA and the strategies incorporating the bin structure was more obvious for the heterogeneous pool. The combination method had more balanced item exposure, i.e., fewer unused items and fewer highly exposed items.

[Figure 4.11: Exposure Rate Distribution for the Heterogeneous Pool, 44 Items]

In sum, for the original pool and the nested difficulty 3PLM pool (which was also based on the original pool), the STA had fewer highly exposed items. For the recalibrated pool, the nested difficulty 2PLM pool (also based on the recalibrated pool), the balanced pool, and the heterogeneous pool, the combination and bin-structured methods performed better than the STA in test security, and the combination method had the least skewed item exposure rate distribution. The improvement from incorporating the bin-structured strategy was more obvious in the balanced pool. In all cases, the most skewed exposure rate distribution occurred for the mixed CAT without constraint, where the polytomous items were vulnerable to over-exposure.

To facilitate the comparison among the three constrained CAT assembly approaches, Table 4.4 shows the number of items achieving the highest exposure rate (i.e., 0.2) under each method. The information conveyed by Table 4.4 is the same as above: for the original pool and the nested difficulty 3PLM pool, the STA had fewer items achieving the maximum exposure rate; in all the other pools, especially the balanced pool, the combination and bin-structured methods were better than the STA. Among the six pools, the balanced pool had the fewest items vulnerable to high exposure. Long tests put more items at risk of being highly exposed because more items were administered.

Table 4.4 Number of Items Achieving the Highest Exposure Rate

                                Long Test                Short Test
                          STA  Combination  Bin    STA  Combination  Bin
  Original                110      141      148     46      61        63
  Nested Difficulty 3PLM  110      147      153     50      61        74
  Recalibrated            118      107      117     53      49        53
  Nested Difficulty 2PLM  128      118      116     59      51        54
  Balanced                 90       28       35     38      11        22
  Heterogeneous            83       29       NA     NA      NA        NA

* Note: Each pool contains 1100 items. The heterogeneous pool was examined only with the long test and without the pure bin-structured method (cf. Tables 4.1 to 4.3).
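The exposure rates underlying Table 4.4 can be tallied directly from the administration records. A minimal sketch, assuming each simulee's test is recorded as a list of item IDs (the helper names are illustrative, not the study's code):

    from collections import Counter

    def exposure_rates(tests):
        # tests: one list of administered item IDs per simulee
        counts = Counter(item for test in tests for item in test)
        n_examinees = len(tests)
        return {item: count / n_examinees for item, count in counts.items()}

    def n_items_at_ceiling(rates, r_max=0.2):
        # count of items that have reached the exposure ceiling, as in Table 4.4
        return sum(1 for rate in rates.values() if rate >= r_max)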
Overlap Rate

(1) Overall Overlap Rate

Table 4.5 summarizes the overall overlap rate under each condition. The short tests had lower overall overlap rates than the long tests. When constraints were imposed, the STA performed best in terms of overlap rate for the original pool and the nested difficulty 3PLM pool; in all the other pools, the combination and bin-structured methods led to lower overall overlap rates. All the constrained CATs had smaller overlap rates than the unconstrained CATs. The difference between the combination and bin-structured methods was not obvious.

Table 4.5 Overall Overlap Rate

                                Binary   Mix    STA   Combination   Bin
Long Test
  Original Pool                  0.26   0.38   0.14      0.17      0.17
  Nested Difficulty 3PLM Pool    0.26   0.38   0.15      0.17      0.17
  Recalibrated Pool              0.31   0.39   0.17      0.14      0.15
  Nested Difficulty 2PLM Pool    0.31   0.39   0.17      0.15      0.15
  Balanced Pool                  0.24   0.32   0.16      0.10      0.12
  Heterogeneous Pool             0.24   0.32   0.15      0.11      NA
Short Test
  Original Pool                  0.22   0.32   0.12      0.15      0.15
  Nested Difficulty 3PLM Pool    0.22   0.32   0.14      0.16      0.17
  Recalibrated Pool              0.31   0.32   0.16      0.14      0.15
  Nested Difficulty 2PLM Pool    0.31   0.32   0.16      0.14      0.15
  Balanced Pool                  0.25   0.22   0.15      0.09      0.12

* Note: Red indicates the CAT assembly approach with the lowest overall overlap rate.

(2) Conditional Overlap Rate (COR)

Figures 4.12(a) to (k) show the overlap rate conditional on ability level. In all cases, the mixed CAT had the highest conditional overlap rate along the whole ability continuum, followed by the binary CAT. Among the constrained CATs, the STA generally had a higher conditional overlap rate, and the bin-structured method performed slightly better than the combination method; the overlap rate for examinees of extremely high or low proficiency was higher than for examinees of medium ability, as the pool contained more informative items within the middle range of the ability continuum. The advantage of the combination and bin-structured methods was more obvious at extreme ability levels.

One may worry that the early replications were less controlled, and therefore overlapped more, while the later replications were highly constrained, because the simulation completed one replication covering the whole ability continuum (i.e., -4 to 4) before proceeding to the next. However, a comparison between the first fifty replications and the last fifty indicated no difference in conditional overlap rate.

[Figures 4.12(a)-(k): COR for the Original, Nested Difficulty 3PLM, Recalibrated, Nested Difficulty 2PLM, Balanced, and Heterogeneous Pools with 44 items, and for the first five of these pools with 22 items]

One observation was that at the medium ability levels, the STA had a quite low overlap rate in the original pool and the nested difficulty 3PLM pool, even lower than the other two constrained CAT assembly methods. A plausible explanation is given in Chapter 5.
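The overall overlap rate in Table 4.5 can be estimated as the average proportion of items shared by pairs of administered tests. A minimal sketch of that computation (exhaustive over all pairs, so practical only for moderate numbers of simulees; the function name is illustrative):

    from itertools import combinations

    def overall_overlap_rate(tests, test_length):
        # average proportion of items shared by each pair of administered tests
        pair_overlaps = [len(set(t1) & set(t2)) / test_length
                         for t1, t2 in combinations(tests, 2)]
        return sum(pair_overlaps) / len(pair_overlaps)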
4.2.4 Item Usage

The figures for the item exposure rate distributions also provide information for evaluating item usage: a more skewed distribution indicates lower efficiency. As shown above, when constraints were imposed, the STA had higher efficiency in the original pool and the nested difficulty 3PLM pool, while in all the other pools the bin-structured and combination methods were more efficient.

The number of unused items can also serve as an item usage index. Table 4.6 presents the proportion of unused items under each condition. The mixed CAT always had the most unused items. For the original and nested difficulty 3PLM pools, the STA had fewer unused items; this is consistent with the long tails in the exposure rate distributions in Figures 4.6 and 4.7. For the other pools, the combination and bin-structured methods had fewer wasted items than the STA.

Table 4.6 Proportion of Unused Items

                          Binary   Mix    STA   Combination
Long Test
  Original                 0.00   0.86   0.08      0.35
  Nested Difficulty 3PLM   0.00   0.86   0.08      0.44
  Recalibrated             0.76   0.87   0.66      0.34
  Nested Difficulty 2PLM   0.76   0.87   0.66      0.34
  Balanced                 0.72   0.85   0.61      0.21
  Heterogeneous            0.72   0.85   0.72      0.23
Short Test
  Original                 0.01   0.90   0.11      0.49
  Nested Difficulty 3PLM   0.01   0.90   0.12      0.62
  Recalibrated             0.87   0.90   0.81      0.61
  Nested Difficulty 2PLM   0.87   0.90   0.82      0.60
  Balanced                 0.84   0.88   0.79      0.50

Chapter 5: Summary and Discussion

This chapter contains a summary and a discussion. First, the research objectives, the methodology used in this study, and the results are summarized. The second section discusses the major findings. The last section discusses the implications and limitations of this study and provides suggestions for future research.

5.1 Summary of This Study

The main purposes of this study were (1) to investigate whether a mixed-item-based CAT had advantages over a dichotomous-item-based CAT and what challenges it brought, and (2) to compare the STA with the bin-structured method in mixed-item CAT assembly and to explore the factors that might influence the assembly results. A simulation study was conducted to compare five CAT test assembly approaches (i.e., binary CAT, mixed CAT, STA, combination of STA and bin-structured method, and bin-structured method) in a variety of testing situations specifying the test objectives and constraints. The goal of the simulated CAT was to construct efficient, content-balanced (covering content areas, item formats, and cognitive skills), and secure tests. The effectiveness of assembly was evaluated through four types of criteria: measurement, content balance, test security, and item usage. The shape of the item pool, the test length, and the imposed constraints were manipulated to explore how the findings varied.

5.1.1 Measurement Criteria

No difference in the conditional bias of the ability estimates among the five CAT assembly methods was found. Ability was overestimated at the lower end of the ability continuum and underestimated at the upper end, and the magnitude of the underestimation was larger. This trend was more obvious when an unbalanced pool was used, as the pool contained fewer informative items for measuring the high ability levels. Conditional on ability level, short tests had larger bias than long tests. Along the entire ability continuum, the mixed CAT always had the smallest absolute bias and SEM among the five CAT assembly approaches. The absolute bias and SEM for high-proficiency examinees were larger than for examinees at other ability levels. The three constrained CAT assembly approaches had similar results.
When the pool was balanced, all four approaches involving polytomous items had smaller absolute bias and SEM than the binary CAT. When the pool was unbalanced, within the relatively low ability range, the CATs incorporating polytomous items performed better than the unconstrained binary CAT even when constraints were imposed on the mixed CAT, as the pool contained many informative items in this range. At a given ability level, shorter tests had larger absolute bias and SEM than long tests. The TCSEM results also reinforced these findings. In terms of overall measurement, the mixed CAT worked best. The STA had smaller bias, MAB, and RMSE than the combination and bin-structured methods in unbalanced pools.

5.1.2 Content Balance

All the simulations satisfied the content balance requirements.

5.1.3 Item Exposure Rate Distribution

In sum, with the original pool and the nested difficulty 3PLM pool (also based on the original pool), the STA had fewer high-exposure items and a lower overall overlap rate. For the recalibrated pool, the nested difficulty 2PLM pool (also based on the recalibrated pool), the balanced pool, and the heterogeneous pool, the combination and pure bin-structured methods performed better than the STA in test security, and the combination method had the least skewed item exposure rate distribution. The improvement from incorporating the bin-structured strategy was more obvious in the balanced pool. In all cases, the most skewed exposure rate distribution and the highest overlap rate occurred in the mixed CAT without constraint, where the polytomous items were vulnerable to over-exposure.

Conditional on ability level, the STA generally had a higher conditional overlap rate, and the bin-structured method performed slightly better than the combination method; the overlap rate for examinees of extremely high or low ability was higher than for examinees of medium ability, as the pool contained more informative items within the middle range of the ability continuum. The advantage of the combination and bin-structured methods was more obvious at extreme ability levels.

5.1.4 Item Usage

The efficiency of item usage was lowest in the mixed CAT. When constraints were imposed, the STA had higher efficiency in the original pool and the nested difficulty 3PLM pool, while in all the other pools the bin-structured and combination methods were more efficient.

5.2 Discussion of Major Findings

5.2.1 Incorporating Polytomous Items into CAT

As stated before, polytomous items are receiving growing attention in CAT, as they can evaluate an examinee's partial knowledge, assess higher-level cognitive skills, and improve test validity. The development of polytomous response models and progress in computing power allow for flourishing future applications of polytomous items in CAT, and expanding their use in CAT is already on the agenda. This study confirmed the contribution of polytomous items to building an effective CAT, as in all conditions the mixed CAT led to smaller bias, absolute bias, and SEM than the other CAT assembly methods. However, one consequent problem is over-exposure of the polytomous items, as the highly informative items tend to be selected more frequently. This problem was also verified in this study: the mixed CAT had the most skewed exposure rate distribution, and further analysis showed that the highly exposed items were all polytomous items.
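This pull toward polytomous items reflects their information advantage: under the generalized partial credit model, item information is the squared (scaled) slope times the variance of the category score, which typically exceeds the information of a comparable dichotomous item. A minimal sketch, assuming Muraki's GPCM with a common slope a and step difficulties (illustrative, not the study's code):

    import numpy as np

    def info_gpc(theta, a, steps, D=1.7):
        # GPCM category probabilities: cumulative logits, with 0 for category 0
        z = np.concatenate(([0.0], np.cumsum(D * a * (theta - np.asarray(steps)))))
        p = np.exp(z - z.max())
        p /= p.sum()
        k = np.arange(len(p))            # category scores 0, 1, ..., m
        # item information = (D*a)^2 times the variance of the category score
        return (D * a) ** 2 * (np.sum(k ** 2 * p) - np.sum(k * p) ** 2)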
Considering the tedious work of developing polytomous items, protecting them from severe security problems is a critical issue. A related problem for the mixed CAT was its low item usage efficiency, as many items (mainly dichotomous items) were unused. When constraints are added to CAT assembly, especially rules to control the exposure rate, CAT efficiency is compromised because maximum information is no longer the unique criterion for selecting items; this was reflected in the increased bias and SEM. This influence might be aggravated if the psychometric properties of the items were entangled with the categorical attributes specified in the blueprint, e.g., content area. This is why the nested difficulty 3PLM pool performed worse than the original pool (i.e., had larger MAB and RMSE), and the nested difficulty 2PLM pool performed worse than the recalibrated pool: the requirement for content balance forced the examinees to take less informative items, and therefore the efficiency of the CAT was reduced. However, the constraints may balance item usage: fewer items suffered from high exposure. This study only set an upper bound for the exposure rate, but a lower bound could also be set to reduce the number of underused items. Appropriate boundary values should be determined to guarantee that the pool has reasonable item usage while the assembled CAT can still estimate ability efficiently.

5.2.2 Comparing STA and the Bin-Structured Method

Both the STA and the bin-structured method have the same goal for test construction: optimize measurement efficiency while ensuring that the test satisfies all the test specifications. But they proceed in different ways. The STA finds a unique and optimal solution for every examinee; this effectively constructs highly informative tests, but the cost is also high. Because the search for the best solution is conducted over the entire pool, the computation in the STA may be formidable. The bin-structured approach, on the other hand, partitions the item bank into non-overlapping item sets so that each item selection step is completed within a bin, which greatly simplifies the item selection procedure.

Besides reducing the computational burden, the bin-structured method has other advantages. By dividing the items into bins in accordance with the pre-specified test length and specifications, the bin-structured method automatically produces content-valid tests. This template takes care of the feasibility issues that most item selection algorithms have to face, and it can also be reviewed in advance to enhance test validity (Robin, 2005). Furthermore, because the bin-structured method adopts a single template for all examinees, tests are more similar across examinees; examinees are less likely to be disturbed by unexpected item topic or format sequences (Davey, 2005), and context effects are diminished. By eliminating factors that are irrelevant to the target trait but influence performance, the bin-structured method makes tests more comparable across examinees.

The bin structure can also help to improve item usage. In this study, the bin-structured method had a lower conditional overlap rate than the STA, especially for examinees of extremely high or low ability, and it also had a more balanced item exposure rate distribution in most of the pools, with the only exceptions being the original pool and the nested difficulty 3PLM pool. An explanation for this result is provided later.
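The contrast can be made concrete with a sketch of the bin-structured selection step: each position in the test draws the most informative unused item from its own bin rather than from the whole pool. This is a minimal illustration, assuming items are dicts with parameter fields and reusing the info_3pl helper sketched earlier; the operational details in the study may differ:

    def select_from_bin(theta_hat, bin_items, administered):
        # search only the current bin for the most informative unused item;
        # items are dicts like {"id": 7, "a": 1.2, "b": 0.3, "c": 0.15}
        candidates = [it for it in bin_items if it["id"] not in administered]
        return max(candidates,
                   key=lambda it: info_3pl(theta_hat, it["a"], it["b"], it["c"]))

    def administer_bin_structured(theta_hat, bins):
        administered, test = set(), []
        for bin_items in bins:           # one selection per bin, in template order
            item = select_from_bin(theta_hat, bin_items, administered)
            administered.add(item["id"])
            test.append(item)
            # an operational CAT would re-estimate theta_hat after each response
        return test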
In addition, because item selection is conducted within each bin, item replacement and exposure control are easier in the bin-structured method.

One intuition is that the bin-structured method may construct less informative tests than the STA, as the STA selects the next item from the entire pool and is less restricted. However, this study showed that this was not necessarily the case: the bin-structured method had conditional bias and SEM comparable to the STA. The reason was that when the bins were developed, the later bins contained more informative items, which improved the effectiveness of the bin-structured method. This emphasizes the importance of producing and organizing bins properly. An example of item usage in the bin-structured method (see Table 5.1) is given below to underline the significance of producing proper bins.

5.2.3 Developing Bins Properly

Whether the bin-structured method performs well depends on the quality of the bins, and whether the bins can be divided efficiently is influenced by the characteristics of the item pool. In most cases in this study, the bin-structured method could assemble CATs as good as or better than the STA. One exception was that in the original and nested difficulty 3PLM pools, the STA had slightly smaller TCSEM than the combination and bin-structured methods, especially at the extreme ability levels. This was because the item pools had fewer informative items for examinees within this ability range, so the quality of the developed bins was compromised, while the STA had more options as it searched for the optimal item in the whole pool.

An example may help illustrate the influence of the item pool on the bin-structured method. Table 5.1 compares the combination method in the nested difficulty 3PLM pool and the balanced pool. When dividing the items of the nested difficulty 3PLM pool into bins, the distance between the b-parameter and 0 was used as the criterion: items in early bins were closer to 0, and items in later bins were farther from 0. However, as the pool contains more easy items than hard items, the later bins may have a large proportion of items with low b-parameters and only a few items appropriate for measuring high-proficiency examinees. Therefore, as shown in Table 5.1, in a given bin of the nested difficulty 3PLM pool only a limited number of items are selected, and they are administered to many examinees, while in the balanced pool the item usage distribution is flatter. As a result, when the combination method was adopted, the overlap rate in the nested difficulty 3PLM pool (0.1714) was much higher than in the balanced pool (0.1027).

Table 5.1 Comparing Item Usage of the Combination Method in Different Pools

              Bin 10               Bin 20               Bin 30
Item ID   Nested   Balanced   Nested   Balanced   Nested   Balanced
   1          8      1621         1        18        50       396
   2       1621       177       803        55         1       226
   3          0       116         1        64         3       117
   4          0       191         0       147         2        45
   5       1460         5         5       166        10        11
   6       1621        89         0       177         7       716
   7          0       261      1621       197        21       391
   8          0       515         0       234         5       154
   9          0       114         1       255         2        27
  10          0        73         1       493         2        29
  11          0       202         5       500         1      1540
  12          0        11      1621       508        66       234
  13          0       565       550       544         6       226
  14         67         4        92       567         6       522
  15          0        57         3       568         0      1302
  16       1621      1445         0       598      1621       133
  17          0       763         7       666         4       272
  18          0        85      1621       690        19       179
  19         81       265         1       697        16       465
  20          0       131         0       703      1382       217
  21          0       113        64       807      1621       211
  22          0       173         1       825      1621       615
  23       1621       811      1621       861        12         0
  24          0        48        76       896         1         0
  25          0       265         5       947      1621        72

* Note: "Nested" = Nested Difficulty 3PLM Pool; "Balanced" = Balanced Pool. Entries are the number of examinees to whom each item in the bin was administered.
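The binning criterion just described can be sketched as follows; this illustrates the distance-from-zero rule only and is not the study's implementation:

    def build_bins(items, n_bins):
        # rank items by the distance of their difficulty from 0; early bins get
        # the items closest to 0, later bins the items farthest from 0
        ranked = sorted(items, key=lambda it: abs(it["b"]))
        size = len(ranked) // n_bins     # any remainder could go to the last bin
        return [ranked[k * size:(k + 1) * size] for k in range(n_bins)]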
One relevant observation was that for the ability range [-2, 0], the STA had quite a low overlap rate in the original pool and the nested difficulty 3PLM pool, even lower than the other two constrained CAT assembly methods. One possible explanation is that, compared with the other pools, the original pool and the nested difficulty 3PLM pool had more items that were informative for this ability range and therefore provided more options for the STA. As a consequence, the STA achieved lower overlap rates in these two pools. Figures 5.1 and 5.2 support this explanation.

[Figure 5.1: Distribution of Item Information at θ = -1 in the Recalibrated Pool]

[Figure 5.2: Distribution of Item Information at θ = -1 in the Nested Difficulty 3PLM Pool]

In sum, the bin structure can improve test security without losing measurement accuracy, but only when the item pool is big enough or balanced enough that each bin contains items appropriate for measuring the whole ability continuum. Test developers should first check how the pool functions and then decide whether the bin structure can be used.

5.3 Implications and Limitations

The major findings from this simulation study verified the enhancement of measurement accuracy brought by including polytomous items in CAT; however, the study also identified the over-exposure problem of polytomous items. In a mixed CAT, therefore, the item selection procedure should contend with how to maintain high measurement efficiency while guaranteeing test security and content balance. This study supported the application of the bin-structured method in mixed CAT, as it can produce outcomes equal to or even better than the traditional STA with respect to the four major criteria. Meanwhile, it can also simplify the computation involved in CAT, standardize the look of the test, provide good advance control over the content sequences, and facilitate item replacement and exposure control. In fact, the bin-structured method has other advantages that were not revealed in this study. One example is the handling of item enemies: item enemies can be put in the same bin, and since each bin contributes only one item, selecting one item rules out its enemies.

This study also had some limitations. First, all simulations adopted a fixed-length stopping rule, and the number of bins was equal to the test length. Real CATs also widely use other stopping rules, such as fixed precision; in that case the test length varies across examinees, and how many bins are needed requires further investigation. One plausible solution is to develop as many bins as the maximum test length and have each CAT pick only a subset of the bins for item selection. But test developers should organize the sequence of bins carefully so that each administration can satisfy the requirements for content balance.
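A fixed-precision rule of the kind mentioned here would stop testing once the provisional standard error falls below a target. A minimal sketch, reusing the csem helper from the earlier sketch (the threshold and length cap are illustrative):

    def should_stop(theta_hat, administered_items, se_target=0.30, max_length=44):
        # stop once the provisional SEM is small enough, or at the length cap;
        # csem() is the helper sketched earlier, with items as (a, b, c) tuples
        if len(administered_items) >= max_length:
            return True
        return csem(theta_hat, administered_items) <= se_target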
Second, this study set fixed proportions for each content area. Further research could set upper and lower bounds for the content balance requirements; for example, the CAT might require that at least 10% of the items be polytomous. This change would complicate bin development, as the number of bins in each content area would no longer be determined. Again, one possible solution is to make the number of bins of a certain type equal to the upper bound of the requirement for that category.

Third, only one template was used in this study. Operational CATs can develop multiple templates to further reduce the overlap rate and improve item usage. The number of required templates depends on the characteristics of the item pool and the tested population. Future studies could investigate how to develop parallel templates efficiently.

Fourth, the number of items in each bin was kept the same in this study. Future research could vary the bin size across CAT stages. For example, the later bins might contain more items than the early bins, since the later bins are expected to provide accurate measurement along a wider ability range. Future research could also investigate the minimum number of items needed in each bin.

Fifth, the original OSSLT was designed for a single cut-score, so follow-up analyses could focus the results around the cut-score and investigate its influence. Another potential research direction is to use the bin structure in computerized classification testing, where the goal is to classify examinees in an adaptive way.

Finally, the simulation used an evenly spaced distribution for examinee ability, which may overweight the tails of the distribution. Future research could use a normal or empirical distribution, or attach different weights, to see how the results vary.

In sum, this study supported the application of polytomous items in CAT, as they can enhance test validity as well as measurement accuracy and stability. However, it also showed that polytomous items are more vulnerable to over-exposure. Both the STA and the bin-structured method could help control the item exposure rate in mixed-item-based CAT while satisfying all the test requirements. When the item pool was not severely skewed and the bins could be developed properly, the bin-structured method was recommended.

BIBLIOGRAPHY

Akkermans, W., & Muraki, E. (1997). Item information and discrimination functions for trinary PCM items. Psychometrika, 62(4), 569-578.

Ahmed, A., Pollitt, A., Crisp, V., & Sweiry, E. (2003). Writing examination questions.

Attali, Y., & Burstein, J. (2006). Automated essay scoring with e-rater® V.2. The Journal of Technology, Learning and Assessment, 4(3).

Bergstrom, B. A., & Lunz, M. E. (1999). CAT for certification and licensure. Innovations in computerized assessment, 67-91.

Binet, A., & Simon, T. (1905). New methods for the diagnosis of the intellectual level of subnormals. L'Année Psychologique, 12, 191-244.

Birnbaum, A. (1968). Some latent trait models and their use in inferring an examinee's ability. Statistical theories of mental test scores, 395-479.

Bock, R. D. (1972). Estimating item parameters and latent ability when responses are scored in two or more nominal categories. Psychometrika, 37(1), 29-51.

Bock, R. D., & Aitkin, M. (1981). Marginal maximum likelihood estimation of item parameters: Application of an EM algorithm. Psychometrika, 46(4), 443-459.
Burton, R. F. (2001). Quantifying the effects of chance in multiple choice and true/false tests: Question selection and guessing of answers. Assessment & Evaluation in Higher Education, 26(1), 41-50.

Cai, L. (2012). flexMIRT: Flexible multilevel item factor analysis and test scoring [Computer software]. Seattle, WA: Vector Psychometric Group, LLC.

Cao, Y. (2008). Mixed-format test equating: Effects of test dimensionality and common-item sets. ProQuest.

Chang, H. (2004). Understanding computerized adaptive testing. The SAGE handbook of quantitative methodology for the social sciences, 117.

Chang, H. H., Qian, J., & Ying, Z. (2001). a-Stratified multistage computerized adaptive testing with b blocking. Applied Psychological Measurement, 25(4), 333-341.

Chang, H. H., & Ying, Z. (1996). A global information approach to computerized adaptive testing. Applied Psychological Measurement, 20(3), 213-229.

Chang, H. H., & Ying, Z. (1999). a-Stratified multistage computerized adaptive testing. Applied Psychological Measurement, 23(3), 211-222.

Chen, S. Y., & Ankenman, R. D. (2004). Effects of practical constraints on item selection rules at the early stages of computerized adaptive testing. Journal of Educational Measurement, 41(2), 149-174.

Chen, S. K., Hou, L., & Dodd, B. G. (1998). A comparison of maximum likelihood estimation and expected a posteriori estimation in CAT using the partial credit model. Educational and Psychological Measurement, 58(4), 569-595.

Chen, S. K., Hou, L., Fitzpatrick, S. J., & Dodd, B. G. (1997). The effect of population distribution and method of theta estimation on computerized adaptive testing (CAT) using the rating scale model. Educational and Psychological Measurement, 57(3), 422-439.

Cheng, Y., & Chang, H. H. (2009). The maximum priority index method for severely constrained item selection in computerized adaptive testing. British Journal of Mathematical and Statistical Psychology, 62(2), 369-383.

Choi, S. W., & Swartz, R. J. (2009). Comparison of CAT item selection criteria for polytomous items. Applied Psychological Measurement.

Cook, K. F., Dodd, B. G., & Fitzpatrick, S. J. (1998). A comparison of three polytomous item response theory models in the context of testlet scoring. Journal of Outcome Measurement, 3(1), 1-20.

Davey, T. (2005, April). An introduction to bin-structured adaptive testing. Paper presented at the annual meeting of the American Educational Research Association, Montreal.

Davey, T., & Parshall, C. G. (1995). New algorithms for item selection and exposure control with computerized adaptive testing.

Davis, L. L. (2004). Strategies for controlling item exposure in computerized adaptive testing with the generalized partial credit model. Applied Psychological Measurement, 28(3), 165-185.

Dodd, B. G., & Koch, W. R. (1987). Effects of variations in item step values on item and test information in the partial credit model. Applied Psychological Measurement, 11(4), 371-384.

Drasgow, F., Levine, M. V., Williams, B., McLaughlin, M. E., & Candell, G. L. (1989). Modeling incorrect responses to multiple-choice items with multilinear formula score theory. Applied Psychological Measurement, 13(3), 285-299.

Enright, M. K., Morley, M., & Sheehan, K. M. (2002). Items by design: The impact of systematic feature variation on item statistical characteristics. Applied Measurement in Education, 15(1), 49-74.

Glas, C. A., & van der Linden, W. J. (2003). Computerized adaptive testing with item cloning. Applied Psychological Measurement, 27(4), 247-261.
Green, B. F., Bock, R. D., Humphreys, L. G., Linn, R. L., & Reckase, M. D. (1984). Technical guidelines for assessing computerized adaptive tests. Journal of Educational Measurement, 21(4), 347-360.

Gu, L. (2007). Designing optimal item pools for computerized adaptive tests with exposure controls. Unpublished doctoral dissertation, Michigan State University.

Haladyna, T. M. (1994). A research agenda for licensing and certification testing validation studies. Evaluation & the Health Professions, 17(2), 242-256.

Hambleton, R. K., & Swaminathan, H. (1985). Item response theory: Principles and applications (Vol. 7). Springer Science & Business Media.

Hambleton, R. K., Swaminathan, H., & Rogers, H. J. (1991). Fundamentals of item response theory (Measurement Methods for the Social Sciences Series, Vol. 2).

Hambleton, R. K., Zaal, J. N., & Pieters, J. P. (1991). Computerized adaptive testing: Theory, applications, and standards. In Advances in educational and psychological testing: Theory and applications (pp. 341-366). Springer Netherlands.

Hamilton, L. S., Nussbaum, E. M., & Snow, R. E. (1997). Interview procedures for validating science assessments. Applied Measurement in Education, 10(2), 181-200.

He, W. (2010). Optimal item pool design for a highly constrained computerized adaptive test. Unpublished doctoral dissertation, Michigan State University.

Hetter, R. D., & Sympson, J. B. (1997). Item exposure control in CAT-ASVAB.

Hively, W., Patterson, H. L., & Page, S. H. (1968). A "universe-defined" system of arithmetic achievement tests. Journal of Educational Measurement, 275-290.

Ho, T. H. (2010). A comparison of item selection procedures using different ability estimation methods in computerized adaptive testing based on the generalized partial credit model. University of Texas at Austin.

Hulin, C. L., Drasgow, F., & Parsons, C. K. (1983). Item response theory: Application to psychological measurement. Dorsey Press.

Kingsbury, G. G., & Zara, A. R. (1989). Procedures for selecting items for computerized adaptive tests. Applied Measurement in Education, 2(4), 359-375.

Leong, S. C. (2006). On varying the difficulty of test items. In annual meeting of the International Association for Educational Assessment, Singapore. Retrieved from http://www.iaea2006.seab.gov.sg/conference/download/papers/On (Vol. 20).

Leung, C. K., Chang, H. H., & Hau, K. T. (2003). Computerized adaptive testing: A comparison of three content balancing methods. The Journal of Technology, Learning and Assessment, 2(5).

Linn, R. L. (1995). High-stakes uses of performance-based assessments: Rationale, examples, and problems of comparability. In T. Oakland & R. K. Hambleton (Eds.), International perspectives on academic assessment (pp. 49-73). Norwell, MA: Kluwer Academic Publishers.

Livingston, S. A., & Rupp, S. L. (2004). Performance of men and women on multiple-choice and constructed-response tests for beginning teachers. ETS Research Report Series, 2004(2), i-25.

Lord, F. M. (1958). Some relations between Guttman's principal components of scale analysis and other psychometric theory. Psychometrika, 23(4), 291-296.

Lord, F. M. (1971). A theoretical study of two-stage testing. Psychometrika, 36(3), 227-242.

Lord, F. M. (1980). Applications of item response theory to practical testing problems. Routledge.

Luecht, R. M. (1998). Computer-assisted test assembly using optimization heuristics. Applied Psychological Measurement, 22(3), 224-236.
Lukhele, R., Thissen, D., & Wainer, H. (1994). On the relative value of multiple-choice, constructed-response, and examinee-selected items on two achievement tests. Journal of Educational Measurement, 31(3), 234-250.

Macready, G. B. (1983). The use of generalizability theory for assessing relations among items within domains in diagnostic testing. Applied Psychological Measurement, 7(2), 149-157.

Macready, G. B., & Merwin, J. C. (1973). Homogeneity within item forms in domain referenced testing. Educational and Psychological Measurement.

Martinez, M. E. (1999). Cognition and the question of test item format. Educational Psychologist, 34(4), 207-218.

Masters, G. N. (1982). A Rasch model for partial credit scoring. Psychometrika, 47(2), 149-174.

Meisner, R., Luecht, R., & Reckase, M. D. (1993). The comparability of the statistical characteristics of test items generated by computer algorithms (ACT Research Rep. No. 93-9). Iowa City, IA: American College Testing.

Messick, S. (1995). Validity of psychological assessment: Validation of inferences from persons' responses and performances as scientific inquiry into score meaning. American Psychologist, 50(9), 741.

Mills, C. N., Potenza, M. T., Fremer, J. J., & Ward, W. C. (Eds.). (2005). Computer-based testing: Building the foundation for future assessments. Routledge.

Mills, C. N., & Stocking, M. L. (1996). Practical issues in large-scale computerized adaptive testing. Applied Measurement in Education, 9(4), 287-304.

Muraki, E. (1992). A generalized partial credit model: Application of an EM algorithm. ETS Research Report Series, 1992(1), i-30.

Muraki, E. (1993). Information functions of the generalized partial credit model. ETS Research Report Series, 1993(1), i-12.

Nemhauser, G. L., & Wolsey, L. A. (1988). Integer and combinatorial optimization. Interscience Series in Discrete Mathematics and Optimization. John Wiley & Sons.

Ontario. (2014). EQAO: Education Quality and Accountability Office. Toronto: The Office.

Oosterhof, A. (1996). Developing and using classroom assessments. New Jersey: Prentice Hall.

Parshall, C. G., Davey, T., & Pashley, P. (2000). Innovative item types for computerized testing. In Computerized adaptive testing: Theory and practice (pp. 129-148). Springer Netherlands.

Patsula, L. N., & Steffen, M. (1997). Maintaining item and test security in a CAT environment: A simulation study (Laboratory of Psychometric and Evaluative Research Report No. 309).

Pfanzagl, J. (1994). Parametric statistical theory. Walter de Gruyter.

Pratt, J. W. (1976). F. Y. Edgeworth and R. A. Fisher on the efficiency of maximum likelihood estimation. The Annals of Statistics, 501-514.

Rao, S. S. (1988). Combined structural and control optimization of flexible structures. Engineering Optimization, 13(1), 1-16.

Reckase, M. (2009). Multidimensional item response theory. New York: Springer.

Robin, F. (2001). Development and evaluation of test assembly procedures for computerized adaptive testing.

Robin, F. (2005). A comparison of conventional and bin-structured test administration. In annual meeting of the American Educational Research Association, Montreal, Canada.

Samejima, F. (1969). Estimation of latent ability using a response pattern of graded scores. Psychometric Monograph, No. 17.

Sands, W. A., Waters, B. K., & McBride, J. R. (1997). Computerized adaptive testing: From inquiry to operation. American Psychological Association.
Segall, D. O. (2005). Computerized adaptive testing. Encyclopedia of Social Measurement. Amsterdam: Elsevier.

Segall, D. O., Moreno, K. E., & Hetter, R. D. (1997). Item pool development and evaluation.

Shin, C. D., Chien, Y., Way, W. D., & Swanson, L. (2009). Weighted penalty model for content balancing in CATS. Retrieved November 14, 2012.

Smarter Balanced Assessment Consortium: Technology-enhanced items guidelines. (2012). Retrieved from www.smarterbalanced.org/

Stocking, M. L., & Swanson, L. (1993). A method for severely constrained item selection in adaptive testing. Applied Psychological Measurement, 17(3), 277-292.

Thissen, D., & Steinberg, L. (1984). A response model for multiple choice items. Psychometrika, 49(4), 501-519.

van der Linden, W. J. (1992). Selecting passage based items for achievement tests [Internal report]. Iowa City, IA: American College Testing.

van der Linden, W. J. (1998). Optimal assembly of psychological and educational tests. Applied Psychological Measurement, 22(3), 195-211.

van der Linden, W. J. (1998). Bayesian item selection criteria for adaptive testing. Psychometrika, 63(2), 201-216.

van der Linden, W. J. (2000). Constrained adaptive testing with shadow tests. In Computerized adaptive testing: Theory and practice (pp. 27-52). Springer Netherlands.

van der Linden, W. J. (2005). A comparison of item selection methods for adaptive tests with content constraints. Journal of Educational Measurement, 42(3), 283-302.

van der Linden, W. J., Ariel, A., & Veldkamp, B. P. (2006). Assembling a computerized adaptive testing item pool as a set of linear tests. Journal of Educational and Behavioral Statistics, 31(1), 81-99.

van der Linden, W. J., & Glas, C. A. (2000). Capitalization on item calibration error in adaptive testing. Applied Measurement in Education, 13(1), 35-53.

van der Linden, W. J., & Pashley, P. J. (2010). Item selection and ability estimation in adaptive testing. In Elements of adaptive testing (pp. 3-30). Springer New York.

van der Linden, W. J., & Reese, L. M. (1998). A model for optimal constrained adaptive testing. Applied Psychological Measurement, 22(3), 259-270.

van der Linden, W. J., & Veldkamp, B. P. (2004). Constraining item exposure in computerized adaptive testing with shadow tests. Journal of Educational and Behavioral Statistics, 29(3), 273-291.

Van Rijn, P. W., Eggen, T. J. H. M., Hemker, B. T., & Sanders, P. F. (2002). Evaluation of selection procedures for computerized adaptive testing with polytomous items. Applied Psychological Measurement, 26(4), 393-411.

Veldkamp, B. P. (2003). Item selection in polytomous CAT. In New developments in psychometrics (pp. 207-214). Springer Japan.

Veerkamp, W. J., & Berger, M. P. (1997). Some new item selection criteria for adaptive testing. Journal of Educational and Behavioral Statistics, 22(2), 203-226.

Veldkamp, B. P., & van der Linden, W. J. (2000). Designing item pools for computerized adaptive testing (pp. 149-162). Springer Netherlands.

Wagner, H. M. (1969). Principles of operations research: With applications to managerial decisions. Prentice-Hall.

Wainer, H. (2000). CATs: Whither and whence. ETS Research Report Series, 2000(2), i-15.

Wainer, H., Bradlow, E. T., & Wang, X. (2007). Testlet response theory and its applications. Cambridge University Press.
Wainer, H., Dorans, N. J., Flaugher, R., Green, B. F., & Mislevy, R. J. (2000). Computerized adaptive testing: A primer. Routledge.

Wainer, H., & Kiely, G. L. (1987). Item clusters and computerized adaptive testing: A case for testlets. Journal of Educational Measurement, 185-201.

Wang, T., & Vispoel, W. P. (1998). Properties of ability estimation methods in computerized adaptive testing. Journal of Educational Measurement, 35(2), 109-135.

Wang, S., & Wang, T. (2001). Precision of Warm's weighted likelihood estimates for a polytomous model in computerized adaptive testing. Applied Psychological Measurement, 25(4), 317-331.

Warm, T. A. (1989). Weighted likelihood estimation of ability in item response theory. Psychometrika, 54(3), 427-450.

Way, W. D. (1998). Protecting the integrity of computerized testing item pools. Educational Measurement: Issues and Practice, 17, 17-27.