"(w r A'. S‘- 1" 0.y.’ 0.2.: «a... - ~,L "Y ' . WEN? ' .U. m ‘1; . i' ' ‘32 2: 22:23-2- A. I’l" w... r» 22.2 fl.» ' '4’. '3' “I If: '._-— w... - if: i '2: 4 .5216”: 2..., . .h - *3: 33.“. 3F”. In ‘ :49: c .1 w. 5. a E 2 Hon .5? at”: 3 in: a" ' 2;. 522 a... sum ‘ 2%» .... N. .-.w .u. fiQ‘n *— ? - I .' 5 £1. ‘rfi "Q- n-' ’ .. c , mm» ' rx‘l‘ - f :«n» ox. 2:: 3', .. mm, .‘x’ u nu»...- W .— .""4 n o.- m. >=: -: “ «MN .1“ -. u. 2... \ .31. . a: ...- ‘- m.- 3. m,“ w This is to certify that the dissertation entitled DEVELOPMENT AND EVALUATION OF AN AUTOMATED MULTIDIMENSIONAL TEST ASSEMBLY METHOD BASED ON QUADRATIC KNAPSACK PROGRAMMING presented by Raymond Mapuranga has been accepted towards fulfillment of the requirements for the Ph.D. degree in CEPSE 7er4) £9 [Qua/2, Major Professofs Signature I/BI/O'7 / / . Date MSU is an Affirmative Action/Equal Opportunity Institution - --—-— u-n---—n-c-n—Q-I-I-O-u-l—.-o--.-.-—.-l-Q--.--.-.--C-n-O-I-C‘o-D-I-C-O’I-O_O-Q-h-.-.-.--._ _-_._.-.—-—-—._-—_- .-. L E :11: L :Ij‘ti {Y I‘H'Cliii‘ in 53139 L,“ ‘i: iirz: it‘y PLACE IN RETURN BOX to remove this checkout from your record. TO AVOID FINES return on or before date due. MAY BE RECALLED with earlier due date if requested. DATEDUE DAIEDUE DATEDUE AUG 2 9 2009 JUL 0 1 2009 1m; 0 :3 MO L, . 092109 6/07 p:lCIRC/DateDue.indd-p.1 DEVELOPMENT AND EVALUATION OF AN AUTOMATED MULTIDIMENSIONAL TEST ASSEMBLY METHOD BASED ON QUADRATIC KNAPSACK PROGRAMMING By Raymond Mapuranga A DISSERTATION Submitted to Michigan State University in partial fulfillment of the requirements for the degree of DOCTOR OF PHILOSOPHY Department of Counseling, Educational Psychology, and Special Education 2007 ACKNOWLEDGMENTS I would like to thank Mark Reckase for his tremendous help, direction and support. The other dissertation committee members: Kimberly Maier, Richard Houang and Suzanne Wilson, also provided outstanding help and direction. Michael Jodoin made significant contributions during the entire duration of this project. Blakely and Adam’s wonderful friendship and encouragement made the completion of this work possible. I am indebted to Dinoj for his technical help and superior insights. Some other people that were extremely helpful are: Joseph Martineau, Catherine McClellan, Lixiong Gu, Jonathan Manalo and Leah Walker. Thanks to Max for his love and kindness. A special thank you to my parents for making sacrifices that have allowed me to reach this point in my formal education. The financial support of the Spencer Research Training Grant from Michigan State University and the ETS Graduate Fellowship allowed the timely completion of this work. iii TABLE OF CONTENTS LIST OF TABLES ............................................................................................................ vii LIST OF FIGURES ............................................................ A ............................................. viii KEY TO ABBREVIATIONS .......................................................................................... xiii CHAPTER 1 INTRODUCTION ........................................................................................ 1 1.1 Test Specifications ..................................................................................................... 2 1.2 Combinatorial Optimization and Heuristic Techniques ............................................ 4 1.3 Item Response Theory and Dimensionality ............................................................... 
1.4 Test Assembly Process
1.5 Achievement and Certification Tests
1.6 Motivation
1.7 Purpose

CHAPTER 2 LITERATURE REVIEW
2.1 Multidimensional Two Parameter Logistic Model
2.2 Item Vectors
2.3 Item Content Clusters
2.4 Multidimensional Information
2.4.1 Ackerman Information
2.4.2 Reckase Information
2.4.3 Difference between Reckase and Ackerman Information
2.5 D-optimality Criterion
2.6 0-1 Linear Programming
2.6.1 Veldkamp's Automated Multidimensional Test Assembly Approach
2.7 Quadratic Knapsack Programming

CHAPTER 3 METHODOLOGY
3.1 Modeling the Test Assembly Problem
3.2 The QKP Objective Function
3.3 The Quadratic Knapsack Programming Heuristic
3.3.1 Solving Nonlinear Problems
3.3.2 Phases of the QKP Heuristic
3.3.3 Details of the QKP Heuristic
3.4 Random Item Selection Method
3.5 Item Pool Simulations
3.6 Test Length
3.7 Simulees
3.8 Test Assembly Model
3.9 Unidimensional Item Response Theory Simulations
3.10 Data Analyses and Evaluation Criteria
3.10.1 Computational Efficiency
3.10.2 Maximum Information
3.10.3 Pairwise Comparisons Using Clamshell Plots
3.10.4 Relative Efficiency

CHAPTER 4 RESULTS
4.1 Simulated Data
4.2 D-optimality
4.3 Comparison of Constructed Tests
4.4 Computational Efficiency
4.5 Maximum Information
4.6 Pairwise Comparisons Using Clamshell Plots
4.7 Relative Efficiency

CHAPTER 5 DISCUSSION AND CONCLUSIONS
5.1 Unique Contributions
5.2 Computational Experiments and Evaluation Criteria
5.2.1 Item Pool and Ability Simulations
5.2.2 Computations of D-optimality and Characteristics of Assembled Tests
5.2.3 Computational Efficiency
5.2.4 MAXINFO and Pairwise Comparisons with Clamshell Plots
5.2.5 Unidimensional Relative Efficiency of Assembled Tests
5.3 Limitations
5.4 Future Research

REFERENCES

LIST OF TABLES

2.1 Item Parameters for a 20-Item Test
3.1 Item Characteristics for Low, Moderate and High Discrimination Item Pools
4.1 Descriptive Statistics of Parameters for 300-Length Item Pools
4.2 Descriptive Statistics of Clusters for Low, Moderate and High Discrimination Item Pools of Length 300
4.3 Descriptive Statistics of Parameters for 900-Length Item Pools
4.4 Descriptive Statistics of Clusters for Low, Moderate and High Discrimination Item Pools of Length 900
4.5 Items Selected for 21-Item QKP, LP and RND Assembled Tests Using the 300-Item Low Discrimination Item Pool
4.6 Items Selected for 21-Item QKP, LP and RND Assembled Tests Using the 900-Item Low Discrimination Item Pool

LIST OF FIGURES

1.1 Sample Specifications for a Fictitious Mathematics Test Blueprint
1.2 The Coding of Items in a Pool as 0-1 Decision Variables
1.3 Schematic of the Test Assembly Process
2.1 Item Response Surface for the M2PLM
2.2 Sample Plot of Item Vectors
2.3 Angular Distance in Two-Dimensional Space
2.4 Dendrogram of Item Content Clusters Based on Item Parameters in Table 2.1
4.1 D-Optimality Criterion Results for the 300-Length Low Discrimination Item Pool
4.2 D-Optimality Criterion Results for the 300-Length Moderate Discrimination Item Pool
4.3 D-Optimality Criterion Results for the 300-Length High Discrimination Item Pool
4.4 D-Optimality Criterion Results for the 900-Length Low Discrimination Item Pool
4.5 D-Optimality Criterion Results for the 900-Length Moderate Discrimination Item Pool
4.6 D-Optimality Criterion Results for the 900-Length High Discrimination Item Pool
4.7 Real Time Performance of Each Assembly Method for the 300-Length Low Discrimination Item Pool
4.8 CPU Time Performance of Each Assembly Method for the 300-Length Low Discrimination Item Pool
4.9 Real Time Performance of Each Assembly Method for the 300-Length Moderate Discrimination Item Pool
4.10 CPU Time Performance of Each Assembly Method for the 300-Length Moderate Discrimination Item Pool
4.11 Real Time Performance of Each Assembly Method for the 300-Length High Discrimination Item Pool
4.12 CPU Time Performance of Each Assembly Method for the 300-Length High Discrimination Item Pool
4.13 Real Time Performance of Each Assembly Method for the 900-Length Low Discrimination Item Pool
4.14 CPU Time Performance of Each Assembly Method for the 900-Length Low Discrimination Item Pool
4.15 Real Time Performance of Each Assembly Method for the 900-Length Moderate Discrimination Item Pool
4.16 CPU Time Performance of Each Assembly Method for the 900-Length Moderate Discrimination Item Pool
4.17 Real Time Performance of Each Assembly Method for the 900-Length High Discrimination Item Pool
4.18 CPU Time Performance of Each Assembly Method for the 900-Length High Discrimination Item Pool
4.19 MAXINFO Results for a 300-Length Low Discrimination Item Pool
4.20 MAXINFO Results for a 300-Length Moderate Discrimination Item Pool
4.21 MAXINFO Results for a 300-Length High Discrimination Item Pool
4.22 MAXINFO Results for a 900-Length Low Discrimination Item Pool
4.23 MAXINFO Results for a 900-Length Moderate Discrimination Item Pool
4.24 MAXINFO Results for a 900-Length High Discrimination Item Pool
4.25 Pairwise Comparison of MINF for 21-Item QKP- and RND-Assembled Tests Using the Low Discrimination 300-Length Item Pool
4.26 Pairwise Comparison of MINF for 21-Item QKP- and LP-Assembled Tests Using the Low Discrimination 300-Length Item Pool
4.27 Pairwise Comparison of MINF for 63-Item QKP- and RND-Assembled Tests Using the Low Discrimination 300-Length Item Pool
4.28 Pairwise Comparison of MINF for 63-Item QKP- and LP-Assembled Tests Using the Low Discrimination 300-Length Item Pool
4.29 Pairwise Comparison of MINF for 21-Item QKP- and RND-Assembled Tests Using the Moderate Discrimination 300-Length Item Pool
4.30 Pairwise Comparison of MINF for 21-Item QKP- and LP-Assembled Tests Using the Moderate Discrimination 300-Length Item Pool
4.31 Pairwise Comparison of MINF for 63-Item QKP- and RND-Assembled Tests Using the Moderate Discrimination 300-Length Item Pool
4.32 Pairwise Comparison of MINF for 63-Item QKP- and LP-Assembled Tests Using the Moderate Discrimination 300-Length Item Pool
4.33 Pairwise Comparison of MINF for 21-Item QKP- and RND-Assembled Tests Using the High Discrimination 300-Length Item Pool
4.34 Pairwise Comparison of MINF for 21-Item QKP- and LP-Assembled Tests Using the High Discrimination 300-Length Item Pool
4.35 Pairwise Comparison of MINF for 63-Item QKP- and RND-Assembled Tests Using the High Discrimination 300-Length Item Pool
4.36 Pairwise Comparison of MINF for 63-Item QKP- and LP-Assembled Tests Using the High Discrimination 300-Length Item Pool
4.37 Pairwise Comparison of MINF for 21-Item QKP- and RND-Assembled Tests Using the Low Discrimination 900-Length Item Pool
4.38 Pairwise Comparison of MINF for 21-Item QKP- and LP-Assembled Tests Using the Low Discrimination 900-Length Item Pool
4.39 Pairwise Comparison of MINF for 63-Item QKP- and RND-Assembled Tests Using the Low Discrimination 900-Length Item Pool
4.40 Pairwise Comparison of MINF for 63-Item QKP- and LP-Assembled Tests Using the Low Discrimination 900-Length Item Pool
4.41 Pairwise Comparison of MINF for 21-Item QKP- and RND-Assembled Tests Using the Moderate Discrimination 900-Length Item Pool
4.42 Pairwise Comparison of MINF for 21-Item QKP- and LP-Assembled Tests Using the Moderate Discrimination 900-Length Item Pool
4.43 Pairwise Comparison of MINF for 63-Item QKP- and RND-Assembled Tests Using the Moderate Discrimination 900-Length Item Pool
4.44 Pairwise Comparison of MINF for 63-Item QKP- and LP-Assembled Tests Using the Moderate Discrimination 900-Length Item Pool
4.45 Pairwise Comparison of MINF for 21-Item QKP- and RND-Assembled Tests Using the High Discrimination 900-Length Item Pool
4.46 Pairwise Comparison of MINF for 21-Item QKP- and LP-Assembled Tests Using the High Discrimination 900-Length Item Pool
4.47 Pairwise Comparison of MINF for 63-Item QKP- and RND-Assembled Tests Using the High Discrimination 900-Length Item Pool
4.48 Pairwise Comparison of MINF for 63-Item QKP- and LP-Assembled Tests Using the High Discrimination 900-Length Item Pool
4.49 Test Information Functions for 21-Item Tests Assembled from the 300-Length, Low Discrimination Item Pool with the QKP, LP, and RND Methods
4.50 Efficiency Functions for 21-Item Tests Assembled from the 300-Length, Low Discrimination Item Pool with the QKP, LP, and RND Methods
4.51 Test Information Functions for 63-Item Tests Assembled from the 300-Length, Low Discrimination Item Pool with the QKP, LP, and RND Methods
4.52 Efficiency Functions for 63-Item Tests Assembled from the 300-Length, Low Discrimination Item Pool with the QKP, LP, and RND Methods
4.53 Test Information Functions for 21-Item Tests Assembled from the 300-Length, Moderate Discrimination Item Pool with the QKP, LP, and RND Methods
4.54 Efficiency Functions for 21-Item Tests Assembled from the 300-Length, Moderate Discrimination Item Pool with the QKP, LP, and RND Methods
4.55 Test Information Functions for 63-Item Tests Assembled from the 300-Length, Moderate Discrimination Item Pool with the QKP, LP, and RND Methods
4.56 Efficiency Functions for 63-Item Tests Assembled from the 300-Length, Moderate Discrimination Item Pool with the QKP, LP, and RND Methods
4.57 Test Information Functions for 21-Item Tests Assembled from the 300-Length, High Discrimination Item Pool with the QKP, LP, and RND Methods
4.58 Efficiency Functions for 21-Item Tests Assembled from the 300-Length, High Discrimination Item Pool with the QKP, LP, and RND Methods
4.59 Test Information Functions for 63-Item Tests Assembled from the 300-Length, High Discrimination Item Pool with the QKP, LP, and RND Methods
4.60 Efficiency Functions for 63-Item Tests Assembled from the 300-Length, High Discrimination Item Pool with the QKP, LP, and RND Methods
4.61 Test Information Functions for 21-Item Tests Assembled from the 900-Length, Low Discrimination Item Pool with the QKP, LP, and RND Methods
4.62 Efficiency Functions for 21-Item Tests Assembled from the 900-Length, Low Discrimination Item Pool with the QKP, LP, and RND Methods
4.63 Test Information Functions for 63-Item Tests Assembled from the 900-Length, Low Discrimination Item Pool with the QKP, LP, and RND Methods
4.64 Efficiency Functions for 63-Item Tests Assembled from the 900-Length, Low Discrimination Item Pool with the QKP, LP, and RND Methods
4.65 Test Information Functions for 21-Item Tests Assembled from the 900-Length, Moderate Discrimination Item Pool with the QKP, LP, and RND Methods
4.66 Efficiency Functions for 21-Item Tests Assembled from the 900-Length, Moderate Discrimination Item Pool with the QKP, LP, and RND Methods
4.67 Test Information Functions for 63-Item Tests Assembled from the 900-Length, Moderate Discrimination Item Pool with the QKP, LP, and RND Methods
4.68 Efficiency Functions for 63-Item Tests Assembled from the 900-Length, Moderate Discrimination Item Pool with the QKP, LP, and RND Methods
4.69 Test Information Functions for 21-Item Tests Assembled from the 900-Length, High Discrimination Item Pool with the QKP, LP, and RND Methods
4.70 Efficiency Functions for 21-Item Tests Assembled from the 900-Length, High Discrimination Item Pool with the QKP, LP, and RND Methods
4.71 Test Information Functions for 63-Item Tests Assembled from the 900-Length, High Discrimination Item Pool with the QKP, LP, and RND Methods
4.72 Efficiency Functions for 63-Item Tests Assembled from the 900-Length, High Discrimination Item Pool with the QKP, LP, and RND Methods

KEY TO ABBREVIATIONS

3PLM      Three-parameter logistic model
ATA       Automated test assembly
AMTA      Automated multidimensional test assembly
CV        Coefficient of variation
IRT       Item response theory
LP        Linear programming
M2PLM     Multidimensional two-parameter logistic model
MAXINFO   Maximum information
MDIFF     Multidimensional difficulty
MDISC     Multidimensional discrimination
MINF      Multidimensional information
MIRT      Multidimensional item response theory
QKP       Quadratic knapsack programming
RE        Relative efficiency
RND       Random item selection algorithm
UIRT      Unidimensional item response theory

CHAPTER 1 INTRODUCTION

In recent years, research in psychometrics has given more attention to the optimal assembly of tests in a fully automated fashion (van der Linden, 1998, 2005; Luecht, 1998). This process, called automated test assembly (ATA), can be described as a set of methods in which test items are selected from an item pool (also called an item bank) in order to construct one or more tests that meet both statistical (e.g., information, reliability, discrimination, difficulty) and substantive (i.e., content and cognitive classifications) test specifications. In practice, the statistical and substantive properties tend to fall into two classes: first, constraints, which are test specifications that must be met for an acceptable test to be assembled; and second, goals, which are those specifications the test should meet as well as possible, given the available items. Normally, substantive specifications form constraints, and statistical specifications, within some broad acceptability criteria, form goals (van der Linden, 1998; Veldkamp, 2002).
The word "automated" of course refers to the use of computers to facilitate this process. Hence, ATA can be viewed as an optimization method in which the idea is to optimize some statistical test property (e.g., test information, test reliability, item discrimination, or item difficulty) for a constructed test, either by maximizing or minimizing that property or by setting it equal to some predefined value. ATA, otherwise known as optimal test design or automated test construction, is a relatively new method in test construction (van der Linden, 2005). The conceptualization and impetus for using this method came from the suggestions of several researchers who espoused the use of optimization techniques for assembling tests (e.g., Feuerman & Weiss, 1973; Votaw, 1952; Yen, 1983). The first attempts at ATA gave acceptable solutions, but did not fully automate assembly, since the intervention of test assemblers was required in order for solutions to be found (e.g., Theunissen, 1985).

The most feasible solution to ATA was first presented by van der Linden and Timminga (1989). This paper showed that the various types of content and statistical criteria which govern the construction of a test could be formulated by the test developer in an ATA model in the same way these criteria had traditionally been specified. These criteria were expressed in the form of linear equalities or inequalities. The latter criteria (also called constraints) defined limitations to which assembly of the test should adhere. Additionally, in this ATA model, the goal for the development of a particular test (e.g., to maximize test information, to maximize test reliability, or to fit a specific target information function) could be specified as an "objective function". Therefore, when the goal of a test is defined by the objective function and the test specifications are defined as constraints, van der Linden and Timminga (1989) showed that computers could be used to assemble a test that optimally met the needs of the test developer. This remarkable innovation marked the birth of ATA.

1.1 Test Specifications

Test specifications are the rules that govern ATA. These rules are defined by statements that formulate requirements for attributes and/or characteristics of a test or its items. They are classified as: a) constraints, if they require an attribute or characteristic to satisfy an upper or lower bound; or b) objectives, if the characteristics or attributes are required to take a maximum or minimum value (van der Linden, 2005). A succinct classification of types of test specifications is as follows (Davey, 2005):

1. Statistical. These often specify numbers of items in various difficulty and discrimination categories and can also be based on averaged statistics, IRT information, test reliability, etc. Statistical rules usually comprise the ATA objective function.

2. Intangible. Such specifications occur in situations where test developers or content experts give a qualitative evaluation of test items. Good examples of this specification are item enemies, which are pairs of questions that should not be included together in a test.

3. Substantive. These specify numbers of items in each of various categories. They usually allow a permissible range, either because it is required or because it does not make a difference. Specification of these items is often based on marginal totals, as shown below:

              Problem    Data          Operations   Total
              Solving    Sufficiency
  Arithmetic                                        5
  Algebra                                           3
  Geometry                                          5
  Statistics                                        2
  Total       5          5             5            15

Figure 1.1: Sample Specifications for a Fictitious Mathematics Test Blueprint

4. Formal. These are very similar to substantive rules, and some are crossed in the same multi-way table. They are sometimes less lenient with regard to permissible range (e.g., the number of items allowed in each section of a test).

5. Item eligibility. These occur when rules govern the re-use of items across forms and administrations. Such specifications can be simple (e.g., "Never!") or complicated (e.g., CAT item exposure rules).
1.2 Combinatorial Optimization and Heuristic Techniques

Combinatorial optimization models are mathematical formulations used to find an optimal combination of a given set of resources in order to meet a desired objective. These models are most useful in situations where some or all of the resources can be represented as integer values. In other words, the models can be used to search through a discrete set of units so that the best subset meeting a number of specified constraints is found. These models are often called integer programming models, with the term "programming" referring to the planning of decisions that need to be made when only a finite number of alternative possibilities exist (Hoffman & Padberg, 1996; Nemhauser & Wolsey, 1988). In ATA, this implies constructing an optimal test by selecting test items from an item pool so that the resulting test meets all the predetermined test specifications (van der Linden, 2005). For optimal test assembly models, only two choices of integers exist, and they are characterized as yes and no decisions as follows:

x_i = \begin{cases} 1, & \text{if item } i \text{ is included in the test} \\ 0, & \text{if item } i \text{ is excluded from the test} \end{cases}

Hence, a string of zeros and ones can be used to code this test assembly problem, as shown in Figure 1.2 below. Figure 1.2 adds clarity to the description of constraints given above. For example, suppose that our constraint was to construct a test of length m. Since our decision variables can be coded as either 0 or 1, this test length constraint can be represented using summation notation as

\sum_{i=1}^{I} x_i = m, \quad (1.1)

meaning that setting the sum of the x_i variables to the desired value m meets the desired test length constraint. This is standard notation which appears in recent ATA literature (e.g., van der Linden, 2005).

  Item       1     2     ...   i     ...   I
  Variable   x_1   x_2   ...   x_i   ...   x_I
  Test 1     0     0     ...   0     ...   0
  Test 2     1     0     ...   0     ...   0
  Test 3     1     1     ...   0     ...   0
  ...
  Test n     1     1     ...   1     ...   1

Figure 1.2: The Coding of Items in a Pool as 0-1 Decision Variables (van der Linden, 2005)
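To make the 0-1 coding concrete, the following minimal Python sketch enumerates every decision vector that satisfies the test-length constraint of Equation (1.1) for a toy pool and keeps the best one under a simple linear objective. The pool values are hypothetical, and brute-force enumeration is shown only to illustrate the coding of Figure 1.2; it is not the solution method developed in this dissertation.

```python
from itertools import combinations

# The 0-1 coding of Figure 1.2 in miniature: each candidate test is a vector of
# decision variables x_i in {0, 1}, and Equation (1.1) requires sum(x) == m.
# Item "utility" values and the linear objective are hypothetical.
pool_info = [0.42, 0.31, 0.55, 0.18, 0.47, 0.29]   # one value per item in the pool
I, m = len(pool_info), 3                           # pool size, required test length

best_x, best_value = None, float("-inf")
for chosen in combinations(range(I), m):           # all x with sum_i x_i == m
    x = [1 if i in chosen else 0 for i in range(I)]
    value = sum(x[i] * pool_info[i] for i in range(I))
    if value > best_value:
        best_x, best_value = x, value

print(best_x, best_value)  # picks the three highest-utility items
```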
As noted previously, the general test construction model is comprised only of linear constraints, and therefore integer programming techniques can be used to arrive at a solution. Thus, the combinatorial optimization problem to solve in ATA entails choosing the most appropriate or relevant subset of test items from an item pool until an optimal test has been assembled. Once the combinatorial optimization model has been created, several methods can be used to solve it. There are at least three different ways of solving integer programming problems (Hoffman & Padberg, 1996):

a) Enumerative techniques. These techniques find the best solutions by enumerating the possibilities available for solving the problem and discarding less desirable choices. Branch-and-bound algorithms are an example of how this method has been used in ATA. Here, "branching" refers to the cataloging of all possible solutions, and "bounding" is the comparison and elimination of possible solutions using known upper and lower bounds (e.g., Adema, 1992).

b) Lagrangian relaxation and decomposition techniques. These techniques incorporate constraints into the objective function in a Lagrangian fashion and solve subproblems iteratively until optimal values are found. They are complex because special attention has to be given to the integrality property of the functions involved (e.g., Veldkamp, 1998).

c) Cutting plane algorithms based on polyhedral combinatorics. Here, the set of constraints is replaced with alternative formulations of the potential solutions. These methods have not yet been applied to ATA problems and present a potential area for future research.

Several other heuristic techniques have also been used for solving ATA problems. Veldkamp (2005) defines heuristic algorithms as procedures used for the construction of a solution to a test assembly problem based on a plausible intuitive idea. The most notable ATA heuristics are NWADH (Luecht, 1998) and WDM (Stocking & Swanson, 1998; Swanson & Stocking, 1993). These heuristics assemble tests sequentially by adding one item at a time until the desired overall value (e.g., maximum information, maximum reliability) of the test has been fulfilled. Hence, the utility of each item is based on the extent to which it increases the optimality of the test's targeted statistical properties. It is important to remember that, due to the sequential selection of items, the utility of a particular item depends on the state of the test at the moment when it is added.
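The sequential logic these heuristics share can be sketched as a simple greedy loop. The sketch below is a generic illustration of sequential selection only, not the published WDM or NWADH procedures; the item structure and the additive objective are hypothetical.

```python
# Generic greedy sketch of sequential ATA heuristics: items are added one at a
# time, each time taking the item whose addition most improves the current
# objective, so an item's utility depends on the state of the test so far.
def greedy_assemble(pool, m, objective):
    """pool: candidate items; m: test length; objective: f(list_of_items) -> float."""
    test, available = [], list(pool)
    for _ in range(m):
        # Re-evaluate every remaining item against the current partial test.
        best = max(available, key=lambda item: objective(test + [item]))
        test.append(best)
        available.remove(best)
    return test

# Hypothetical usage with a simple additive objective:
items = [{"id": 1, "info": 0.4}, {"id": 2, "info": 0.7}, {"id": 3, "info": 0.5}]
print(greedy_assemble(items, 2, lambda t: sum(i["info"] for i in t)))  # ids 2, 3
```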
The preceding summary of combinatorial optimization and heuristic techniques illustrates that a large number of methods exist for solving ATA problems. In practice, combinations of heuristic and standard optimization approaches are often used (Hoffman & Padberg, 1996). Hence, given that ATA is a relatively new area of psychometrics, it is important that research focus on implementing other optimization techniques to improve on existing methods.

1.3 Item Response Theory and Dimensionality

In educational assessment, examinee performance on a test item is typically scored 1 when the examinee answers the item correctly and 0 when the examinee answers it incorrectly. These item responses are said to be dichotomously scored. A set of mathematical models is then used to model test item characteristics and examinee ability from the resulting matrices of 0s and 1s. These types of models belong to the general statistical framework called item response theory (IRT). In IRT it is assumed that the performance of an examinee on a test can be predicted in terms of his or her latent traits or abilities. When it is assumed that a single ability underlies test performance, the IRT models are said to be unidimensional (Lord, 1980; Hambleton & Swaminathan, 1984). An example of a unidimensional item response theory (UIRT) model is the three-parameter logistic model (3PLM), which models the probability, P_i(θ_j), of a correct response on item i for an examinee with ability θ_j. It can be represented mathematically as follows (Lord, 1980):

P_i(\theta_j) = c_i + (1 - c_i)\,\frac{\exp[a_i(\theta_j - b_i)]}{1 + \exp[a_i(\theta_j - b_i)]}, \quad (1.2)

where a_i is the item discrimination parameter, which gives the rate of change of the probability of a correct response with changes in ability level; b_i is the item difficulty parameter; and c_i is the lower asymptote parameter (also known as the guessing parameter), which is the probability of a correct response approached when the abilities assessed by the items are very low (i.e., approach -∞).
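Equation (1.2) translates directly into a short function, as sketched below. The parameter values in the example are hypothetical.

```python
import numpy as np

# Equation (1.2) transcribed directly: theta is the examinee ability; a, b, and c
# are the discrimination, difficulty, and lower-asymptote parameters defined above.
def p_3pl(theta, a, b, c):
    """Probability of a correct response under the 3PLM."""
    return c + (1.0 - c) / (1.0 + np.exp(-a * (theta - b)))

# Hypothetical item: even a very low-ability examinee keeps probability near c = 0.2.
print(p_3pl(theta=0.0, a=1.2, b=-0.5, c=0.2))   # about 0.72
print(p_3pl(theta=-6.0, a=1.2, b=-0.5, c=0.2))  # approaches the lower asymptote 0.2
```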
While unidimensional models provide a good method for modeling latent traits, advances in psychometric theory and practice have shown that many educational and psychological tests are inherently multidimensional (Ackerman et al., 2003; Reckase, 1997). In other words, two or more latent traits are thought to govern the response process. Good examples of items exhibiting this property are mathematical "story problems": to answer such items correctly, both mathematical and verbal skills are required (Reckase, 1985). This extension of IRT to a multidimensional framework is modeled using multidimensional item response theory (MIRT) models. The MIRT model used in this study will be described in Section 2.1.

1.4 Test Assembly Process

IRT is the psychometric theory that commonly underlies the test assembly process (van der Linden, 1998). In this process, three main steps are followed: (a) an IRT model is chosen and the test items in the item bank are calibrated; (b) the desired properties of the test are specified (e.g., test length, categorical test attributes such as item content and cognitive level, or quantitative specifications such as word count or the number of minutes it takes an examinee to answer a test question); and (c) a model and/or algorithm that selects items from the item bank is formulated so that the test specifications are met (van der Linden, 1998; Veldkamp, 2002a, 2002b). A combinatorial optimization or heuristic technique (or a combination of both), called the ATA algorithm, is often used to complete the last step.

Although an ATA algorithm significantly streamlines the process of test assembly, it is merely one component of a very complicated process of quality control as well as human and system-based evaluation. Figure 1.3 presents the entire cycle of test assembly, which begins when items are constructed by item writers and ends when those items are retired from use. After new items are constructed based on specifications such as item type and item content, they are placed in an item bank. An item bank is a collection of items that may be easily accessed for use when assembling tests. In established testing programs, item banks are constructed only after accurate IRT calibration and thorough goodness-of-fit testing. These steps are completed to ensure that items are measuring the same or comparable constructs with known values for the item parameters (van der Linden, 2005).

[Figure 1.3: Schematic of the Test Assembly Process. A cycle linking new items (item content and types), the item bank (item history and statistics), test assembly (volumes, item use type, item statistics, administration schedule), human and system-based evaluation, and the flagging of problem items and retired items.]

The main advantage of using IRT for creating item banks is that new items are always calibrated on the same scale as existing or retired items. This is possible because the item statistics obtained do not depend on the sample of examinees used in the calibration of test items. It is advantageous to maintain the same scale because this facilitates the comparability of test scores across different years or test forms. Moreover, if item banks consist of items that have valid content with statistically reliable and stable item parameter estimates, the quality of tests produced using item banks is superior to that of tests produced by test developers preparing test items themselves (Hambleton & Swaminathan, 1984).

When a test is assembled using an ATA algorithm, only items that fulfill specific requirements, such as exposure, security constraints, content categories, total time available to take the test, and item type, are made available. Therefore, this assembly pool consists only of items that are eligible for use on the particular test form(s) under construction. These test specifications are often formalized in the ATA model as test constraints that need to be fulfilled in order for an optimal test to be constructed. Note that this description loosely meets the requirements of both computer adaptive and paper-and-pencil tests. The computer is used to assemble initial drafts of test forms. These test forms are then iteratively reviewed and revised by committees and returned to the ATA algorithm until a test that meets all required specifications and constraints has been constructed. This process, although tedious, is a marked improvement on old methods in which humans constructed entire tests from start to finish.

1.5 Achievement and Certification Tests

Several MIRT studies have shown that multidimensional models are better at explaining the structure of data than unidimensional models (e.g., McKinley & Reckase, 1983; Muraki & Engelhard, 1985). Further, it has been argued that it is advantageous to use MIRT for increasing measurement efficiency and measurement precision, especially in situations where tests report multiple correlated scores (Luecht, 1996; Segall, 1996). Examples of such testing contexts include K-12 assessments and certification tests, as explained below.

Certification tests determine whether an individual has mastered a unit of instruction or skill, and they provide information about what an individual knows, not how his or her performance compares to the norm group. Many certification tests report scores from different subscales that may be correlated and cover a large domain of integrated knowledge (Luecht, 1996). These tests also cover numerous combinations of crossed and nested specifications even though the purpose of the test is to report a single test score (e.g., Federation of State Medical Boards & National Board of Medical Examiners, 1996). Hence, these tests have the dual purpose of making pass/fail decisions and reporting subscores based on sets of items that come from various categories (Luecht, 1996; Segall, 1996). Therefore, certification tests are likely candidates for calibration and construction using MIRT.

In K-12 state assessments, the use of multiple skills is even more apparent due to the variety of content standards and specifications that need to be met in each test. In other words, in practice many test frameworks require that test items measure several inter-related (and often highly correlated) content strands or content areas (e.g., Michigan Educational Assessment Program, 2003). These test construction frameworks can be problematic when content is confounded with item difficulty (Segall, 1996).
Research has shown that in K-12 state assessments, content strands or subscales in a test can be clearly defined using MIRT (e.g., Martineau et al., 2006). Hence, it would ideally be more helpful to teachers and educators to be able to understand student achievement not only as a single score, but as sets of scores on the different dimensions of the test. Overall, there is compelling substantive and statistical evidence to support MIRT as the best approach for understanding, reporting, and assembling tests.

1.6 Motivation

ATA methods for multidimensional tests have recently been formulated (van der Linden, 1996; Veldkamp, 2002a). These methods have only been applied to ATA situations in which the intent was to maximize test information. This maximization used the variance of the ability estimators to formulate the objective function and, in conjunction with several test constraints, allowed optimal multidimensional tests to be constructed. However, an inherent disadvantage of this formulation is that, since the resulting objective function is nonlinear, it was necessary to convert it into a linear form so the model would be suitable for 0-1 linear programming (van der Linden, 1996, 1998; Veldkamp, 2002a, 2002b).

Using 0-1 linear programming (LP) can be problematic when the mathematical functions involved are not linear. Researchers have shown that some degree of nonlinearity is the rule and not the exception in many situations where mathematical programming is used. Even a slight degree of nonlinearity can cause LP calculations to differ substantially from the true optimum (Baumol & Bushnell, 1967). Thus, errors may arise when a linear programming calculation is used to solve a problem involving any level of nonlinearity. Typically, when LP methods are applied in ATA, researchers compensate for nonlinearity by using linear approximations to convert the mathematical functions involved into variations suitable for LP (e.g., Veldkamp, 2002; van der Linden, 1996).

An example most pertinent to this study used a Taylor series approximation to convert the nonlinear objective function into a linear form, a process called linearization (Veldkamp, 2002a). Unfortunately, however, linearization introduces numerical, computational, and mathematical error. Furthermore, linearization is a problem because the inappropriate use of linearity assumptions leads to model misspecifications. Research suggests that, in a mathematical programming context, the curvature of these mathematical functions can produce distortions when a linear approximation is used. That is, the LP calculation can easily produce a final solution that is worse than the intermediate solutions obtained in preceding computational iterations (Baumol & Bushnell, 1967). Additionally, it has been shown that linearization techniques have the disadvantage of increasing the problem size and computational requirements (Glover, 1975; Glover & Woolsey, 1974; Walters, 1967; Zangwill, 1965). Hence, there is a need to find alternative methods for automated multidimensional test assembly (AMTA), as articulated in the section that follows.

1.7 Purpose

Reducing the mathematical and computational error that results from linearization is desirable. Hence, an ideal approach to this problem requires the formulation of a solution which exploits the specialized mathematical structure of the automated multidimensional test assembly model when the intention is to maximize information.
Therefore, the purpose of this dissertation is to develop and evaluate a new method for AMTA that is not reliant on linearization and yet can assemble tests that maximize multidimensional test information, increasing measurement precision while simultaneously reducing both mathematical and computational error. This new approach incorporates elements of the combinatorial optimization and heuristic techniques described previously.

In Chapter 2, the multidimensional item response theory (MIRT) model is introduced, along with associated statistics and evaluation criteria of test quality. Additionally, an optimal design theory criterion and the mathematical programming methods of LP and QKP are described. In Chapter 3, an objective function suitable for QKP is derived and an algorithm for selecting test items that maximize multidimensional information is developed. A computational experiment for comparing the QKP, LP, and random (RND) test assembly methods is described along with various evaluation factors and criteria. Results from the computational experiments and the performance of the QKP algorithm are presented in Chapter 4. Finally, in Chapter 5 I discuss the study's conclusions and suggest future directions for this work.

CHAPTER 2 LITERATURE REVIEW

An overview of ATA and its associated psychometric and procedural aspects was provided in Chapter 1. The purpose of Chapter 2 is to give a detailed description of the methods and procedures pertinent to this study. Specifically, this chapter reviews multidimensional item response theory (MIRT) and the design of multidimensional tests based on Birnbaum's (1968) framework. Additionally, an optimality criterion suitable for maximizing multidimensional information (MINF) is reviewed, along with the 0-1 linear programming (LP) approach to automated multidimensional test assembly (AMTA) established by Veldkamp (2002a). Lastly, a new combinatorial optimization approach to AMTA, called quadratic knapsack programming (QKP), is described and its suitability for multidimensional test assembly is justified.

2.1 Multidimensional Two Parameter Logistic Model

This study only considers the multidimensional two-parameter logistic model (Ackerman et al., 2003; Reckase, 1997) as a way to illustrate this new method. However, it should be noted that extensions to other MIRT models are possible and would provide interesting directions for future research. For simplicity, only the case of two θs (θ_1, θ_2) will be illustrated, where each θ represents an unknown trait parameter. The multidimensional two-parameter logistic model (M2PLM) can be represented as follows (van der Linden, 1996):

P_i(\theta_1, \theta_2) = P(U_{ij} = 1 \mid a_{1i}, a_{2i}, d_i, \theta_1, \theta_2) = \frac{\exp(a_{1i}\theta_1 + a_{2i}\theta_2 + d_i)}{1 + \exp(a_{1i}\theta_1 + a_{2i}\theta_2 + d_i)}, \quad (2.1)

where U_{ij} is a response variable that takes the value 1 if the response of person j = 1, ..., M to item i = 1, ..., N is correct, and 0 otherwise; (a_{1i}, a_{2i}) are the discrimination parameters of item i for θ_1 and θ_2, respectively; and d_i is a scalar parameter related to the difficulty of the item. In this study, it is assumed that the item parameters (a_{1i}, a_{2i}, and d_i) are known, and the M2PLM is used to estimate θ_{1j} and θ_{2j} from a set of dichotomous responses, as discussed in Section 1.3 of the preceding chapter.
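Equation (2.1) is likewise straightforward to compute. The following sketch evaluates the M2PLM response probability; the item parameters in the example are hypothetical.

```python
import numpy as np

# Equation (2.1) transcribed directly: a1 and a2 are the item's discrimination
# parameters for theta1 and theta2, and d is the scalar related to difficulty.
def p_m2pl(theta1, theta2, a1, a2, d):
    """Probability of a correct response under the M2PLM."""
    z = a1 * theta1 + a2 * theta2 + d
    return np.exp(z) / (1.0 + np.exp(z))

# Hypothetical item: for an examinee at the origin of the ability space,
# the probability reduces to exp(d) / (1 + exp(d)).
print(p_m2pl(0.0, 0.0, a1=1.5, a2=0.2, d=1.0))  # = exp(1)/(1+exp(1)), about 0.73
```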
There are several statistical indices associated with measurement and calibration in the MIRT framework. First, MDISC enables items to be compared on a general measure of quality. It gives the relationship of the item response to the combination of dimensions that yields the strongest relationship, as follows (Reckase & McKinley, 1991; Reckase, 1997):

\text{MDISC}_i = \left[ \sum_{k=1}^{m} a_{ik}^2 \right]^{1/2}, \quad (2.2)

where a_{ik} is the ith item's discrimination on the kth dimension (i.e., θ_1 and θ_2). Second, the difficulty of MIRT items is indexed with a statistic called MDIFF (Reckase, 1997):

\text{MDIFF}_i = \frac{-d_i}{\left[ \sum_{k=1}^{m} a_{ik}^2 \right]^{1/2}}. \quad (2.3)

MDIFF gives the distance from the origin of the θ-space to the point of steepest slope in a direction from the origin. It has the same interpretation as the b-parameter in UIRT.

Assume each item i is represented by a vector a_i that consists of the elements of a row of the matrix a, an n by m matrix containing the a_{ik} and a_{jk} discrimination parameters. The angle between the directions of steepest slope for two items can be formulated as (Reckase et al., 2000):

\alpha_{ij} = \arccos \left[ \frac{\sum_{k=1}^{m} a_{ik} a_{jk}}{\left( \sum_{k=1}^{m} a_{ik}^2 \right)^{1/2} \left( \sum_{k=1}^{m} a_{jk}^2 \right)^{1/2}} \right], \quad (2.4)

where α_{ij} is the angle between the lines from the origin of the space to the points of steepest slope for items i and j. Equation (2.4) is based on the direction cosines, which represent the direction of greatest slope from the origin. These cosines specify the directions of measurement in multidimensional space (Reckase, 1997).
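A small sketch shows how Equations (2.2) through (2.4) operate on the discrimination vectors. The item vectors below are hypothetical two-dimensional examples.

```python
import numpy as np

# Direct transcriptions of Equations (2.2)-(2.4).  `a` is an item's vector of
# discrimination parameters (a_i1, ..., a_im) and `d` its difficulty scalar.
def mdisc(a):
    """Multidimensional discrimination, Equation (2.2)."""
    return float(np.sqrt(np.sum(np.asarray(a) ** 2)))

def mdiff(a, d):
    """Multidimensional difficulty, Equation (2.3)."""
    return -d / mdisc(a)

def angle_between(a_i, a_j):
    """Angle (degrees) between the steepest-slope directions of items i and j, Eq. (2.4)."""
    cos_alpha = np.dot(a_i, a_j) / (mdisc(a_i) * mdisc(a_j))
    return float(np.degrees(np.arccos(np.clip(cos_alpha, -1.0, 1.0))))

# Hypothetical items: one oriented mostly toward theta1, one toward theta2.
print(mdisc([1.5, 0.2]), mdiff([1.5, 0.2], 1.0))   # about 1.51 and -0.66
print(angle_between([1.5, 0.2], [0.1, 1.4]))       # about 78 degrees
```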
Equation (2.1) defines a surface which gives the probability of a correct response to a test item. It operates as a function of the location of examinees in the ability space specified by the θ-vector. Assuming there are only two statistical constructs (θ_1 and θ_2) underlying performance on a particular psychological trait or educational achievement domain, the probability response surface can be represented graphically as shown in Figure 2.1.

[Figure 2.1: Item Response Surface for the M2PLM with parameters a1 = 1.5, a2 = 0.2, and d = 1; the vertical axis gives the probability of a correct response over the (θ_1, θ_2) plane.]

The item response surface (IRS) of Figure 2.1 has the following parameter values: a1 = 1.5, a2 = 0.2, and d = 1. It is monotonically increasing and is bounded by 0 and 1, with its height representing the probability of answering the item correctly given the two abilities (i.e., θ_1 and θ_2). Additionally, multidimensional item discrimination (MDISC) is a scalar, while multidimensional difficulty (MDIFF) is a composite of item discrimination and difficulty on each dimension. Recall that in UIRT, discrimination and difficulty are simply scalars.

2.2 Item Vectors

Essential information about a test item can be displayed graphically using item vectors. These vectors are used to represent the difficulty of items, the direction of maximum discrimination, and the items' discriminating power (Reckase, 1985). Based on the MIRT model of Equation (2.1), the direction of these vectors shows the weighted combination, or composite, of abilities for which each item provides the best measurement. Items with similar direction cosines measure the same set of abilities (Ackerman et al., 2003; Reckase, Ackerman & Carlson, 1988).

In Figure 2.2, a sample plot of item vectors is shown. These item vectors vary in difficulty: items in the upper right quadrant are more difficult than items in the lower left quadrant. The items in this diagram are generally aligned in two distinct directions called content clusters. One cluster consists of items that are closer to the θ_1-axis, while the other cluster consists of items that are closer to the θ_2-axis. This implies that the two content clusters measure two different, yet related, sets of constructs, since their directions are not orthogonal. Note that this description only applies to situations in which a test is sensitive to only two skills or dimensions.

The vectors of Figure 2.2 have three specific characteristics. Item discrimination is represented by the length of the vector and is indexed by the MDISC value of Equation (2.2). The location of the base of the vector in multidimensional space signifies difficulty, as given by MDIFF in Equation (2.3). Lastly, the angular direction of each item relative to the positive θ_1-axis represents the location of the item as indexed by the direction cosine formula of Equation (2.4).

[Figure 2.2: Sample Plot of Item Vectors, plotted in the (θ_1, θ_2) plane.]

Figure 2.3 further illustrates the concept of angles between test items. The diagram indicates that the angle between two item vectors defines a plane in multidimensional space, since the vectors originate from the origin and have terminal points. The angle α_ij represents the angle between item i and item j, and it is related to the direction cosine by the formula of Equation (2.4).

[Figure 2.3: The Angular Distance in Two-Dimensional Space; two item vectors plotted against Dimension 1 (θ_1) and Dimension 2 (θ_2), separated by the angle α_ij.]

2.3 Item Content Clusters

The preceding description of item content clusters was given to illustrate that, since test items can be represented as vectors, they can be used to identify sets of items that measure similar combinations of skills. Moreover, achievement and certification tests are constructed so that they are efficient at measuring combinations, or clusters, of skills. Hence, when item vectors point in the same direction, they measure the same combination of knowledge and skills, and the angle between items evaluates the difference in the knowledge or skills they measure. For example, when the angle between two items is 0 degrees, those items measure the same combination of skills; if the angle is 90 degrees, they measure completely different skills (Reckase & Martineau, 2004).

For example, in the National Assessment of Educational Progress (NAEP) frameworks, assessments are constructed to fulfill multiple subscales. A typical example is the 2006 NAEP framework for Economics, which contained three subscales: i) Market Economy, ii) National Economy, and iii) International Economy (National Assessment Governing Board, 2005). If test items are correctly matched to each of these three subscales, then it would be expected that three distinct "item content clusters" of similarly aligned sets of items could be distinguished in multidimensional space using the M2PLM or other MIRT models.
The similarities, or patterns of proximities (i.e., measures of closeness or association), among item vectors can be understood through multivariate statistical procedures. These patterns or similarities can then be used to identify sets of items that measure the same combinations of abilities. Measures of proximity are indexed using the direction cosines given in Equation (2.4). Cluster analysis has been used successfully to identify the aforementioned item patterns (Kim, 2001; Martineau et al., 2006; Miller & Hirsch, 1992). Kim (2001) determined an optimal cluster analysis method for identifying groups of items with similar orientation in the θ-space. The grouping of items is then analyzed qualitatively with the purpose of assigning substantive meanings to the resulting content clusters (or groups of items). Hence, in an achievement test, it would be expected that groups of items which measure the same content appear in the same groups or "content clusters" (e.g., Martineau et al., 2006).

An example from Reckase (2006) can be used to illustrate the concept of content clusters using cluster analysis. Table 2.1 shows the item parameters for 20 test items that are two-dimensional (i.e., have two a-parameters). Using direction cosines, these 20 items can be subdivided into smaller and more meaningful groups according to their similarity. That is, a dendrogram can be used to display test items that measure similar skills, as shown in Figure 2.4. At an average angle of 1.5, there are two distinguishable clusters; at an average angle of 0.5, there are four distinguishable clusters; and so on. In an actual test, a content expert would be able to look at the items grouped by the dendrogram and define or classify the common content measured by each cluster (see Martineau et al., 2005).

Table 2.1: Item parameters for a 20-item test (Reckase, 2006)

  Item   a1     a2     d       Item   a1     a2     d
  1      1.81   .86    1.46    11     .24    1.14   -.95
  2      1.22   .02    .17     12     .51    1.21   -1.00
  3      1.57   .36    .67     13     .76    .59    -.96
  4      .71    .53    .44     14     .01    1.94   -1.92
  5      .86    .19    .10     15     .39    1.77   -1.57
  6      1.72   .18    .44     16     .76    .99    -1.36
  7      1.86   .29    .38     17     .49    1.10   -.81
  8      1.33   .34    .69     18     .29    1.10   -.99
  9      1.19   1.57   .17     19     .48    1.00   -1.56
  10     2.00   .00    .38     20     .42    .75    -1.61

[Figure 2.4: Dendrogram of item content clusters based on the item parameters in Table 2.1 (vertical axis: item number; horizontal axis: average angle between items/clusters).]
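A minimal sketch of this cluster-analysis step is given below: pairwise angles between item vectors (Equation 2.4) are treated as proximities and fed to an agglomerative clustering routine. Average linkage is used purely for illustration; the optimal method identified by Kim (2001) may differ, and the six items are borrowed from Table 2.1 for convenience.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

# Items 1, 2, 10 and 11, 14, 15 from Table 2.1: the first three point mostly
# along theta1, the last three mostly along theta2.
a = np.array([[1.81, 0.86], [1.22, 0.02], [2.00, 0.00],
              [0.24, 1.14], [0.01, 1.94], [0.39, 1.77]])

norms = np.linalg.norm(a, axis=1)                        # MDISC of each item
cosines = (a @ a.T) / np.outer(norms, norms)             # cos(alpha_ij), Eq. (2.4)
angles = np.degrees(np.arccos(np.clip(cosines, -1, 1)))  # pairwise angle matrix
np.fill_diagonal(angles, 0.0)

tree = linkage(squareform(angles), method="average")     # dendrogram structure
print(fcluster(tree, t=2, criterion="maxclust"))         # e.g., [1 1 1 2 2 2]
```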
Hence "ln L" in Equation (2.5) simply represents the natural log of this likelihood function, and "-E" is the negative expected value of the second derivative of this likelihood function. The test information is the sum of the item information functions (Lord, 1980; Reckase & McKinley, 1991) and can be represented as follows:

$$I_T(\theta) = \sum_{i=1}^{n} I_i(\theta) \qquad (2.6)$$

where n is the number of items in the test. However, when there are two dimensions, Fisher's information becomes a matrix, which for two θs (θ1, θ2) is obtained by computing the negative expected value of the second derivative of the log of the likelihood function (Kendall & Stuart, 1967; van der Linden, 1996) as follows:

$$I(\theta_1, \theta_2) = -E\begin{pmatrix} \dfrac{\partial^2 \ln L}{\partial \theta_1^2} & \dfrac{\partial^2 \ln L}{\partial \theta_1 \partial \theta_2} \\[2mm] \dfrac{\partial^2 \ln L}{\partial \theta_2 \partial \theta_1} & \dfrac{\partial^2 \ln L}{\partial \theta_2^2} \end{pmatrix} \qquad (2.7)$$

where the likelihood function, L, of Equation (2.1) can be written as (Ackerman, 1994):

$$L(\mathbf{u} \mid \boldsymbol{\theta}) = \prod_{i=1}^{n} L(u_i \mid \theta_1, \theta_2) = \prod_{i=1}^{n} P_i(\boldsymbol{\theta})^{u_i} Q_i(\boldsymbol{\theta})^{1-u_i} \qquad (2.8)$$

The multidimensional information (called Ackerman information in this paper) in Equation (2.7) can be written as:

$$I(\theta_1, \theta_2) = \begin{pmatrix} \sum_{i=1}^{n} a_{1i}^2 P_i(\boldsymbol{\theta}) Q_i(\boldsymbol{\theta}) & \sum_{i=1}^{n} a_{1i} a_{2i} P_i(\boldsymbol{\theta}) Q_i(\boldsymbol{\theta}) \\[2mm] \sum_{i=1}^{n} a_{1i} a_{2i} P_i(\boldsymbol{\theta}) Q_i(\boldsymbol{\theta}) & \sum_{i=1}^{n} a_{2i}^2 P_i(\boldsymbol{\theta}) Q_i(\boldsymbol{\theta}) \end{pmatrix} \qquad (2.9)$$

Also note a property of the maximum likelihood estimator of θ: the estimate is asymptotically normal with mean θ and variance

$$V(\hat{\theta}_1, \hat{\theta}_2 \mid \theta_1, \theta_2) = \left[I(\theta_1, \theta_2)\right]^{-1} \qquad (2.10)$$

In the multidimensional case, this is a variance-covariance matrix. Therefore, to optimize measurement precision, a function of Fisher's information matrix has to be maximized or the variance-covariance matrix has to be minimized (Veldkamp, 2002a). The latter fact holds since IRT models are special forms of nonlinear regression models (Lord, 1980).

2.4.2 Reckase Information

As noted previously, measurement precision is evaluated using the concept of information in IRT. The reciprocal of the information function is the asymptotic variance of the maximum likelihood estimate of ability. Hence, the larger the amount of information measured, the smaller the asymptotic variance and therefore the higher the measurement precision (Ackerman, Gierl & Walker, 2003; Lord, 1980).

Using the concept of multidimensional information (MINF), the measurement precision of a test can be assessed. MINF is related to MDISC because if an item has a high value of MDISC, it will give a large amount of information. However, MINF and MDISC differ since MINF indexes the capability of the item to discriminate at each point in the θ-space, rather than just at the steepest point of the IRS (Reckase & McKinley, 1991). The slope in a particular direction is defined using a directional derivative as follows (Reckase & McKinley, 1991):

$$\nabla_\alpha P(\boldsymbol{\theta}) = \frac{\partial P(\boldsymbol{\theta})}{\partial \theta_1} \cos\alpha_1 + \frac{\partial P(\boldsymbol{\theta})}{\partial \theta_2} \cos\alpha_2 + \cdots + \frac{\partial P(\boldsymbol{\theta})}{\partial \theta_n} \cos\alpha_n \qquad (2.11)$$

where α is the vector of angles with the coordinate axes in the θ-space, αi (i = 1, ..., n) is an element of that vector, θ is the vector of abilities defining a point in the space, and θi (i = 1, ..., n) is an element of that vector. For a single item, the directional derivative of the item response surface (IRS) is:

$$\nabla_\alpha P_i(\boldsymbol{\theta}) = a_{i1} P_i(\boldsymbol{\theta}) Q_i(\boldsymbol{\theta}) \cos\alpha_1 + a_{i2} P_i(\boldsymbol{\theta}) Q_i(\boldsymbol{\theta}) \cos\alpha_2 + \cdots + a_{in} P_i(\boldsymbol{\theta}) Q_i(\boldsymbol{\theta}) \cos\alpha_n = P_i(\boldsymbol{\theta}) Q_i(\boldsymbol{\theta}) \sum_{k=1}^{n} a_{ik} \cos\alpha_k \qquad (2.12)$$

Since MINF is a direct generalization of the UIRT concept of information (Lord, 1980), it can be expressed as follows (Reckase & McKinley, 1991):
$$I_{\alpha i}(\boldsymbol{\theta}) = \frac{\left[P_i(\boldsymbol{\theta}) Q_i(\boldsymbol{\theta}) \sum_{k=1}^{n} a_{ik} \cos\alpha_k\right]^2}{P_i(\boldsymbol{\theta}) Q_i(\boldsymbol{\theta})} = P_i(\boldsymbol{\theta}) Q_i(\boldsymbol{\theta}) \left[\sum_{k=1}^{n} a_{ik} \cos\alpha_k\right]^2 \qquad (2.13)$$

To obtain MINF for an entire test, the item information of Equation (2.13) is summed across the number of items in the test, such that $I_T(\boldsymbol{\theta})$ is the sum of item information functions, similar to the UIRT form given in Equation (2.6). When test information is computed using Reckase information, the same direction must be used for all items.

2.4.3 Difference between Reckase and Ackerman Information

As noted previously, in this study Ackerman information was used for mathematical derivations involving multidimensional information, while Reckase information was used for evaluating the quality of assembled tests. Although these two indices both index multidimensional information, they are fundamentally different. Reckase information is basically a multidimensional critical ratio, comparable to Lord's (1980, p. 69) derivation of the score information function (Ackerman, 1994). For the M2PLM, Ackerman (1994) explicates this critical ratio as "a measure of how effective test score x is at discriminating between a trait level (θ1, θ2) and a trait level 'close by' along a line through (θ1, θ2) at an angle α."

Reckase information cannot be used as a substitute for Ackerman information when calculating information for more than one test item. This substitution is not possible because Equation (2.12) of Reckase information does not account for the lack of local independence when a particular direction is specified (Ackerman, 1994). This implies that the resulting covariance among the traits is improperly accounted for. Ackerman information, on the other hand, is formulated from a variance-covariance matrix and can properly account for the resulting covariance when a particular direction is specified. Since IRT models can be viewed as nonlinear regression models and experimental design methods can be used in optimization, Ackerman information facilitates mathematical derivations that involve multidimensional information when the purpose is to explain the variance-covariance structure of the IRT parameters.

2.5 D-optimality Criterion

Having noted that IRT models are special types of regression models, as described in Section 2.4.1, statistical criteria from optimal design theory can be used for optimizing the selection of IRT parameters in ATA. This approach works similarly to the manner in which parameters are estimated in experiments that have multiple parameters and can be modeled as regression models. In prior IRT studies that involved the optimal design of tests and samples, some function of Fisher's information matrix was maximized (Berger, 1991; de Gruijter, 1985, 1988; Stocking, 1990; Thissen & Wainer, 1982; Vale, 1986; Wingersky & Lord, 1984). When the determinant of Fisher's information matrix is maximized for the purpose of optimizing the design of a test, this criterion is called D-optimality.
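To make the criterion concrete before reviewing its properties, the following is a minimal numpy sketch of the Ackerman information matrix of Equation (2.9) for the M2PLM at a single ability point, together with its determinant, the D-optimality value of Equation (2.14) below. Function names and the three example items are illustrative only, and any scaling constant in the model is omitted.

import numpy as np

def m2pl_prob(a, d, theta):
    # Probability of a correct response under the compensatory M2PLM.
    return 1.0 / (1.0 + np.exp(-(np.dot(a, theta) + d)))

def information_matrix(a_items, d_items, theta):
    # Ackerman information matrix (Eq. 2.9): sum of P*Q*a*a' over items.
    info = np.zeros((2, 2))
    for a, d in zip(a_items, d_items):
        p = m2pl_prob(a, d, theta)
        info += p * (1.0 - p) * np.outer(a, a)
    return info

def d_optimality(a_items, d_items, theta):
    # D-optimality (Eq. 2.14): determinant of the information matrix.
    return np.linalg.det(information_matrix(a_items, d_items, theta))

# Hypothetical three-item test evaluated at the origin of the ability space.
a_items = np.array([[1.2, 0.3], [0.4, 1.1], [0.8, 0.8]])
d_items = np.array([0.1, -0.2, 0.0])
print(d_optimality(a_items, d_items, np.zeros(2)))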
D-optimality is the preferred optimality criterion in IRT for the following reasons: a) it can be used to formulate a confidence interval around the parameter estimates and hence has an easy and natural interpretation, b) it is invariant to linear transformations of the logit scale, and c) it is equivalent to other criteria (e.g., G-optimality). However, D-optimality has some disadvantages: i) it is generally not sensitive to misspecifications of the model, and ii) models with a different number of parameters cannot be compared (as is the case with other optimality criteria) (Berger, 1991, 1992; Berger & Veerkamp, 1994). Overall, this optimality criterion has consistently given good results in prior studies (e.g., Veldkamp, 2002a).

From the explanation above, it follows that D-optimality can be used for the computation of optimal values in ATA. This criterion can be used to optimally select items for a test by maximizing the objective function associated with D-optimality and thus minimizing the generalized variance of the parameter estimates. Hence, it maximizes measurement precision when it is used to compute the objective function. In multidimensional ATA, D-optimality can be represented as

$$\text{Maximize } |I(\boldsymbol{\theta})|, \qquad (2.14)$$

where $|I(\boldsymbol{\theta})|$ is the determinant of Fisher's information matrix for the vector of abilities θ. As will be explained in greater detail in Section 2.6.1, the expression of Equation (2.14) is not only a function of x, but also a function of the two θs (θ1, θ2) (e.g., Veldkamp, 2002a). Therefore, in ATA, this objective function can be optimized for a grid of points instead of the entire θ-region. That is, the problem of maximizing the information function at certain θ-points can be applied to the multidimensional θ-space if the two-dimensional grid is defined by (s, t), where s = 1, ..., S and t = 1, ..., T. Hence, updating Equation (2.14) to reflect this addition, it becomes

$$\text{Maximize } |I(\theta_{st})|. \qquad (2.15)$$

2.6 0-1 Linear Programming

An important goal in educational and psychological assessment is to construct tests of minimal length that will yield scores with the necessary degree of reliability and validity for the intended uses (Berger, et al., 1994; Crocker & Algina, 1986). In the last few decades, a form of mathematical programming called 0-1 linear programming (LP) has been used to construct tests that best meet these requirements using various ATA approaches. A central assumption of LP is that all its functions, both objective and constraint, are linear (Hillier & Lieberman, 1995).

Examples of LP methods that have been used for optimal test assembly include the branch-and-bound method (Adema, 1992); 0-1 linear programming (Adema, Boekkooi-Timminga & van der Linden, 1991); a maximin approach (van der Linden & Boekkooi-Timminga, 1989; Veldkamp, 2002a); and binary programming (Theunissen, 1985). In these examples, LP facilitated optimal test assembly so that test construction goals and specifications such as maximum reliability, target information, minimal length, and maximum item-total correlation were met to the best degree possible.

The standard format of the LP problem includes only equality constraints. It is set up as either a minimization or maximization problem of the following form (Venkataraman, 2002):

$$\text{Maximize/Minimize } f(\mathbf{x}) = \mathbf{c}^T\mathbf{x} \qquad (2.16)$$

$$\text{Subject to } g(\mathbf{x}): A\mathbf{x} = \mathbf{b} \qquad (2.17)$$

$$\text{Decision variables } x_i \in \{0, 1\}, \quad i = 1, \ldots, I \qquad (2.18)$$

where x represents a column of decision variables, $[x_1, x_2, \ldots, x_n]^T$; c is a column vector of utility (cost) coefficients,
$[c_1, c_2, \ldots, c_n]^T$; b is a column vector of constraint limits, $[b_1, b_2, \ldots, b_m]^T$; and A is an m × n matrix of constraint coefficients, which can alternatively be represented as:

$$A = \begin{pmatrix} a_{11} & a_{12} & \cdots & a_{1n} \\ a_{21} & a_{22} & \cdots & a_{2n} \\ \vdots & & & \vdots \\ a_{m1} & a_{m2} & \cdots & a_{mn} \end{pmatrix}; \quad \mathbf{b} = \begin{pmatrix} b_1 \\ b_2 \\ \vdots \\ b_m \end{pmatrix}; \quad \mathbf{c} = \begin{pmatrix} c_1 \\ c_2 \\ \vdots \\ c_n \end{pmatrix}; \quad \mathbf{x} = \begin{pmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{pmatrix}$$

Research on ATA using MIRT has been quite limited. The earliest suggestion of automated multidimensional test assembly (AMTA) is by van der Linden (1996). Veldkamp has done most of the documented research on AMTA, using both LP methods (Veldkamp, 2002a) and Lagrangian relaxation techniques (Veldkamp, 1998). While both these methods are interesting, the former appears to be the most promising because the majority of other ATA work uses LP or closely related models. A detailed description of Veldkamp's approach is given in Section 2.6.1.

2.6.1 Veldkamp's Automated Multidimensional Test Assembly Approach

In his approach to AMTA, Veldkamp (2002a) imposed a linear approximation on the objective function obtained using D-optimality through a Taylor series approximation. Note that solving the expression in (2.15) for the M2PLM gives:

$$f(\boldsymbol{\theta}, \mathbf{x}) = \sum_{i=1}^{N} a_{1i}^2 P_i(\boldsymbol{\theta}) Q_i(\boldsymbol{\theta}) x_i \sum_{i=1}^{N} a_{2i}^2 P_i(\boldsymbol{\theta}) Q_i(\boldsymbol{\theta}) x_i - \left[\sum_{i=1}^{N} a_{1i} a_{2i} P_i(\boldsymbol{\theta}) Q_i(\boldsymbol{\theta}) x_i\right]^2 \qquad (2.19)$$

Hence, when computing a linear approximation for Equation (2.19), a grid of points (s, t) in the θ-space, (θ1, θ2) ∈ {-3, 3} × {-3, 3}, is chosen so the objective function is maximized. It follows that a linear approximation of Equation (2.19) can be expressed in terms of x, y, and z so that it becomes:

$$\text{Maximize } xy - z^2, \qquad (2.20)$$

where $x = \sum_{i=1}^{N} a_{1i}^2 P_i(\boldsymbol{\theta}) Q_i(\boldsymbol{\theta}) x_i$, $y = \sum_{i=1}^{N} a_{2i}^2 P_i(\boldsymbol{\theta}) Q_i(\boldsymbol{\theta}) x_i$, and $z = \sum_{i=1}^{N} a_{1i} a_{2i} P_i(\boldsymbol{\theta}) Q_i(\boldsymbol{\theta}) x_i$.

Thus, for any given ability point (s, t) and a given test, the functions x, y, z can be calculated for a result denoted $(\bar{x}, \bar{y}, \bar{z})$. In turn, (2.20) can be rewritten as:

$$\text{Maximize } f(x, y, z), \qquad (2.21)$$

with a linear approximation of the objective function at the point $(\bar{x}, \bar{y}, \bar{z})$ being equal to

$$\text{Maximize } \frac{\partial f}{\partial x}(\bar{x}_{st}, \bar{y}_{st}, \bar{z}_{st}) \cdot x + \frac{\partial f}{\partial y}(\bar{x}_{st}, \bar{y}_{st}, \bar{z}_{st}) \cdot y + \frac{\partial f}{\partial z}(\bar{x}_{st}, \bar{y}_{st}, \bar{z}_{st}) \cdot z + c \qquad (2.22)$$

where $c = f(\bar{x}_{st}, \bar{y}_{st}, \bar{z}_{st}) - \nabla f(\bar{x}_{st}, \bar{y}_{st}, \bar{z}_{st})^T (\bar{x}_{st}, \bar{y}_{st}, \bar{z}_{st})$ and the partial derivatives that define the linear approximation are:

$$k_{1st} = \frac{\partial f}{\partial x}(\bar{x}_{st}, \bar{y}_{st}, \bar{z}_{st}) = \bar{y}_{st}, \quad k_{2st} = \frac{\partial f}{\partial y}(\bar{x}_{st}, \bar{y}_{st}, \bar{z}_{st}) = \bar{x}_{st}, \quad k_{3st} = \frac{\partial f}{\partial z}(\bar{x}_{st}, \bar{y}_{st}, \bar{z}_{st}) = -2\bar{z}_{st} \qquad (2.23)$$

Note that the coefficients $k_{jst}$ represent the partial derivatives taken at the point (s, t) for a given test, thus completing the linearization and giving the complete AMTA model as follows:

$$\text{Maximize } k_{1st} \sum_{i=1}^{N} a_{1i}^2 P_i(\boldsymbol{\theta}) Q_i(\boldsymbol{\theta}) x_i + k_{2st} \sum_{i=1}^{N} a_{2i}^2 P_i(\boldsymbol{\theta}) Q_i(\boldsymbol{\theta}) x_i + k_{3st} \sum_{i=1}^{N} a_{1i} a_{2i} P_i(\boldsymbol{\theta}) Q_i(\boldsymbol{\theta}) x_i \qquad (2.24)$$

subject to

$$\sum_{i \in C} x_i \le n_C \ \text{(categorical constraints)}; \quad \sum_{i \in Q} q_i x_i \le n_Q \ \text{(quantitative constraints)}; \quad \sum_{i \in E} x_i \le 1 \ \text{(enemy sets)}; \quad \sum_{i \in I} x_i = n \ \text{(test length)} \qquad (2.25)$$

$$x_i \in \{0, 1\}, \quad i = 1, \ldots, I \quad \text{(decision variables)} \qquad (2.26)$$

Essentially, Veldkamp expressed the objective function in Equation (2.19) as a linear function of the decision variables $x_i$ and was able to use a linear programming algorithm to solve an AMTA problem.

2.7 Quadratic Knapsack Programming

The knapsack model of mathematical programming has been studied for several years as the simplest prototype of a maximization problem. This model invokes the image of a backpacker who is constrained by a fixed-size knapsack. The backpacker must fill it only with the most useful combination of items from a list of possible items, so that the value of the items packed into the knapsack is maximized (Gallo, et al., 1980; Hoffman & Padberg, 1996; Kellerer, et al., 2004).
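As a toy illustration of the basic 0-1 knapsack model behind this metaphor, the following sketch solves a tiny instance by dynamic programming; the values, weights, and capacity are invented for the example.

def knapsack_max_value(values, weights, capacity):
    # Classic 0-1 knapsack solved by dynamic programming over capacities.
    best = [0] * (capacity + 1)
    for v, w in zip(values, weights):
        # Traverse capacities in reverse so each item is packed at most once.
        for c in range(capacity, w - 1, -1):
            best[c] = max(best[c], best[c - w] + v)
    return best[capacity]

# Three items, weight limit 5: packing the second and third gives value 22.
print(knapsack_max_value(values=[6, 10, 12], weights=[1, 2, 3], capacity=5))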
This model belongs to the combinatorial optimization family described in Chapter 1. In the literature, the knapsack model was first used for test construction by Feuerman and Weiss (1973). Research on the knapsack model was motivated by the facts that a) Veldkamp's (2002a) AMTA approach may introduce mathematical and computational error, and b) the non-linearized form of the objective function has a quadratic structure. Moreover, the method introduced here seemed intuitively suitable. Therefore, this study introduces quadratic knapsack programming (QKP) as an alternative to the use of LP in AMTA.

QKP extends the knapsack problem by assigning values not only to individual objects, but to pairs of objects. Additionally, QKP is a generalization of the knapsack problem obtained when the objective function is allowed to be quadratic. In other words, extending the QKP metaphor to AMTA where the goal is to maximize MINF, the knapsack is a two-dimensional test and only items that make the test most informative in both dimensions can be selected. Note that here a two-dimensional test is defined based on the item response matrix. That is, a two-dimensional test is an item response matrix that requires two dimensions to accurately model the data when item responses are dichotomously scored (i.e., correct responses are coded as one and incorrect responses are coded as zero).

As noted previously, the purpose of optimal test assembly is to choose the best sets of test items so that the measurement precision of a test is maximized. Specifically, when using QKP this corresponds to the problem of choosing the best n test items out of N test items. Moreover, in choosing these items, it is natural to assume that the utility of this choice should reflect how well the items fit together. Therefore, assume that n items are given, with item j having a positive integer weight $w_j$. A limit on the total weight chosen is given by a positive integer knapsack capacity c. Additionally, the model has an (n × n) nonnegative integer symmetric profit matrix

$$S = (s_{ij}), \qquad (2.27)$$

where $s_{jj}$ is the profit achieved if item j is selected and $s_{ij} + s_{ji}$ is the profit achieved if both items i and j are selected. Profit simply refers to the utility with which selected items function in unison. Hence, the QKP model has the following mathematical formulation:

$$\text{Maximize } f(\mathbf{x}) = \mathbf{x}^T S \mathbf{x} = \sum_{i=1}^{N} \sum_{j=1}^{N} x_i x_j s_{ij} \qquad (2.28)$$

$$\text{Subject to } \sum_{j=1}^{N} w_j x_j \le c \qquad (2.29)$$

$$x_j \in \{0, 1\}, \quad j \in N, \qquad (2.30)$$

where $\mathbf{x}^T S \mathbf{x}$ is the quadratic form of the QKP model (Hillier & Lieberman, 1995; Kellerer, et al., 2004).

In mathematical programming terminology, Equation (2.28) is called the objective function. Note that in AMTA the objective function determines the measurement precision of the test. The latter is formulated using Fisher's information matrix and describes the information structure of the test. This information structure is quantified as the test information function (Birnbaum, 1968), as described in Section 2.4.1. The knapsack constraints which govern the selection of items are shown in (2.29). (2.30) denotes the decision variables of the QKP model, such that $x_i = 1$ if the item is included in the test and $x_i = 0$ if the item is excluded from the test. Additionally, in the test assembly context, N is the number of items in the test bank (or item pool), and all test specifications have to be expressed in terms of the $x_i$s (Caprara, et al., 2000; Kellerer, et al., 2004).
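A brief sketch may help fix the QKP notation of Equations (2.28)-(2.30): evaluating the quadratic objective and checking the weight constraint for one candidate selection. The 4-item profit matrix and weights below are invented for the example.

import numpy as np

def qkp_objective(S, x):
    # Quadratic knapsack objective of Equation (2.28): f(x) = x'Sx for 0-1 x.
    return x @ S @ x

def feasible(weights, x, capacity):
    # Knapsack constraint of Equation (2.29): total weight within capacity.
    return weights @ x <= capacity

# Diagonal entries are single-item profits; off-diagonals are pairwise profits.
S = np.array([[3, 1, 0, 2],
              [1, 4, 1, 0],
              [0, 1, 2, 1],
              [2, 0, 1, 5]])
x = np.array([1, 0, 1, 1])                      # select items 1, 3 and 4
w = np.array([2, 3, 1, 2])
print(qkp_objective(S, x), feasible(w, x, 6))   # 16, True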
CHAPTER 3
METHODOLOGY

The purpose of this study was to develop and evaluate a new approach to AMTA when the goal of assembling the test is to maximize information. Development involved mathematical derivation of the objective function and formulation of a heuristic algorithm for solving the problem with a computer. The new approach was evaluated through a series of computational experiments.

When evaluating computer algorithms or computational techniques, it is common to use computational experiments (e.g., Caprara et al., 1999; Luecht, 1998; Swanson & Stocking, 1993). These typically entail the comparison of a focal method to similar methods of interest so that conclusions about efficiency and performance can be drawn. Through the series of computational experiments, the new approach was evaluated under the following conditions: 1) item pool quality, 2) item pool size, 3) test length, and 4) AMTA method. Additionally, in this study, computational outcomes (i.e., computations of D-optimality) were linked to psychometric criteria and indices that inform test assemblers and psychometricians about the quality of assembled tests.

The first two factors were manipulated with the intent to vary the item pools available for assembly. The first factor was quality of the item pool, as measured by mean MDISC levels, and the second factor was item pool size. Crossing these factors (three levels of item pool quality and two levels of item pool size) produced six item pool variations. Varying levels of these two factors allowed the reliability and classification accuracy of tests to be altered. The third factor varied was test length. Lengths of 21 to 99 items were considered, since test length is a primary factor that affects reliability and decision consistency. Finally, to evaluate the new method by comparison to other assembly methods, the AMTA method was varied. The three AMTA methods considered were QKP, LP, and random item selection (RND). While QKP and LP were used for effectiveness comparison, RND was used as a baseline measure to evaluate and compare the performance of the other two methods.

The remainder of the chapter is divided into several sections. Two sections cover modeling the test assembly problem, deriving the QKP objective function, and describing the QKP heuristic. In addition, sections describing the varying factors, such as the AMTA method, item pools, test length, simulees, the test assembly model, UIRT simulations, data analyses, and evaluation criteria are included. These sections explain the manipulation of each factor, data generation, and the evaluation criteria used in the study.

3.1 Modeling the Test Assembly Problem

For the three AMTA methods considered, the following process of modeling the test assembly problem was followed (van der Linden, 2005):

1. Identify the decision variables. Items are coded as 1 when included in the assembled test and 0 when excluded. This qualifies the assembly problem as an integer programming problem.

2. Model the constraints. Sets of item content clusters are specified to serve as assembly constraints.

3. Model the objective function. An objective function that can be used to maximize MINF was formulated based on a derivation of Fisher's information matrix using D-optimality for the M2PLM.

4. Solve the model for an optimal solution. For each method, a unique computer algorithm was used for solving the test assembly problem.

3.2 The QKP Objective Function

Recall that D-optimality maximizes the determinant of Fisher's information matrix.
Veldkamp (2002a) derived this function for the two-dimensional case. As described in the previous chapter, the objective function to be maximized over θ from Equation (2.15) becomes:

$$f(\boldsymbol{\theta}, \mathbf{x}) = \sum_{i=1}^{N} a_{1i}^2 P_i(\boldsymbol{\theta}) Q_i(\boldsymbol{\theta}) x_i \sum_{i=1}^{N} a_{2i}^2 P_i(\boldsymbol{\theta}) Q_i(\boldsymbol{\theta}) x_i - \left[\sum_{i=1}^{N} a_{1i} a_{2i} P_i(\boldsymbol{\theta}) Q_i(\boldsymbol{\theta}) x_i\right]^2 \qquad (3.1)$$

This becomes a question of maximizing the integral of f(θ, x) with respect to θ, as follows:

$$\text{Max}\left(\int f(\boldsymbol{\theta}, \mathbf{x}) \, d\boldsymbol{\theta}\right) \qquad (3.2)$$

Since no explicit maximizing solution can be found, the integral should first be approximated. The integral in (3.2) can be approximated numerically. Computationally, this is done by approximating the continuous distribution of θ with a discrete point-mass distribution with support at θ = t1, t2, ..., tT, such that $\sum_{j=1}^{T} P(\boldsymbol{\theta} = t_j) = 1$. Then the function f(x) to be maximized is:

$$f(\mathbf{x}) = \sum_{j} P(\boldsymbol{\theta} = t_j) f(t_j, \mathbf{x}) \qquad (3.3)$$

where θ = tj is a generic quadrature point and P(θ = tj) is the associated weight. It follows that the summation over j distributes over the summations over i, so that (3.1) becomes:

$$f(\mathbf{x}) = \sum_{i=1}^{N} a_{1i}^2 R_i x_i \sum_{i=1}^{N} a_{2i}^2 R_i x_i - \left[\sum_{i=1}^{N} a_{1i} a_{2i} R_i x_i\right]^2 \qquad (3.4)$$

where $R_i = \sum_{j=1}^{T} P(\boldsymbol{\theta} = t_j) P_i(t_j) Q_i(t_j)$.

The N-dimensional vectors u, v, and w can be defined as follows: $u_i = a_{1i}^2 R_i$, $v_i = a_{2i}^2 R_i$, $w_i = a_{1i} a_{2i} R_i$. Subsequently, f(x) can be written in vector notation (remembering that $\mathbf{x}^T\mathbf{y} = \mathbf{y}^T\mathbf{x}$), so that Equation (3.4) becomes:

$$f(\mathbf{x}) = (\mathbf{u}^T\mathbf{x})(\mathbf{v}^T\mathbf{x}) - (\mathbf{w}^T\mathbf{x})^2 = \mathbf{x}^T\mathbf{u}\mathbf{v}^T\mathbf{x} - \mathbf{x}^T\mathbf{w}\mathbf{w}^T\mathbf{x} = \frac{\mathbf{x}^T\mathbf{u}\mathbf{v}^T\mathbf{x} + \mathbf{x}^T\mathbf{v}\mathbf{u}^T\mathbf{x}}{2} - \mathbf{x}^T\mathbf{w}\mathbf{w}^T\mathbf{x} = \mathbf{x}^T S \mathbf{x} \qquad (3.5)$$

where $S = \dfrac{\mathbf{u}\mathbf{v}^T + \mathbf{v}\mathbf{u}^T}{2} - \mathbf{w}\mathbf{w}^T$ is a symmetric N × N matrix. The third equality in (3.5) above follows from the fact that $\mathbf{x}^T\mathbf{u}\mathbf{v}^T\mathbf{x} = (\mathbf{u}^T\mathbf{x})(\mathbf{v}^T\mathbf{x}) = (\mathbf{v}^T\mathbf{x})(\mathbf{u}^T\mathbf{x}) = \mathbf{x}^T\mathbf{v}\mathbf{u}^T\mathbf{x}$. Now the unconstrained objective function $f(\mathbf{x}) = \mathbf{x}^T S \mathbf{x}$ has to be maximized. However, in a practical testing situation, it would have to be maximized while taking into account relevant test constraints.

3.3 The Quadratic Knapsack Programming Heuristic

As noted previously, to solve an AMTA problem for an optimal solution, a suitable computer algorithm has to be used. For this express purpose, the QKP heuristic algorithm is introduced. It belongs to a class of heuristics that are defined as sequential because they assemble a test by selecting one item at a time (van der Linden, 2005). This heuristic uses the mathematical structure of the QKP model as the basis for item selection. The heuristic is a simplified version of the QUADNAP heuristic algorithm (Caprara, Pisinger & Toth, 1999) and is formulated as

$$f(\mathbf{x}) = \mathbf{x}^T S \mathbf{x}, \qquad (3.6)$$

where S is a real-valued, symmetric, non-negative N × N matrix, x is a binary N-dimensional column vector, and f(x) needs to be optimized over all x's that have n 1's. This corresponds to the problem of choosing the best n questions out of N questions such that:

$$f(\mathbf{x}) = \sum_{i=1}^{N} \sum_{j=1}^{N} x_i x_j s_{ij} = \sum_{i: x_i = 1} \ \sum_{j: x_j = 1} s_{ij} \qquad (3.7)$$

Therefore, f(x) is the sum of all entries $s_{ij}$ where $x_i$ and $x_j$ each equal 1. Put another way, if A is the set of indices (a subset of {1, 2, 3, ..., N}) such that those components of x are 1, then A defines a submatrix $S_A$, and f(x) is the sum of the entries of that submatrix. Additionally, note that the ij-th entry of $S_A$ is the $(A_i, A_j)$-th entry of S, where $A_i$ is the i-th element of set A. Thus, the QKP heuristic works in a series of iterative steps whereby it keeps trying to improve the choice of A, as will be described in detail in Section 3.3.2.

3.3.1 Solving Nonlinear Problems

Nonlinear problems are generally solved through numerical analysis. Using computer code, numerical analysis can be converted into numerical techniques.
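Before turning to those techniques, the construction in Equation (3.5) can be made concrete. The following is a sketch of assembling S from M2PLM item parameters and a discrete quadrature over the two-dimensional ability space; the function and variable names are illustrative, and the uniform 7 × 7 grid is an assumption for the example rather than the dissertation's actual quadrature.

import numpy as np

def build_S(a, d, points, weights):
    # Assemble the N x N QKP matrix S of Equation (3.5) from M2PLM item
    # parameters and a quadrature over the two-dimensional theta grid.
    R = np.zeros(len(a))
    for t, p_t in zip(points, weights):
        prob = 1.0 / (1.0 + np.exp(-(a @ t + d)))   # P_i(t_j) for every item
        R += p_t * prob * (1.0 - prob)              # R_i of Equation (3.4)
    u = a[:, 0] ** 2 * R
    v = a[:, 1] ** 2 * R
    w = a[:, 0] * a[:, 1] * R
    return (np.outer(u, v) + np.outer(v, u)) / 2.0 - np.outer(w, w)

# Illustrative use: a uniform 7 x 7 grid of quadrature points on [-3, 3]^2.
grid = np.array([[s, t] for s in np.linspace(-3, 3, 7)
                        for t in np.linspace(-3, 3, 7)])
wts = np.full(len(grid), 1.0 / len(grid))
# S = build_S(a_params, d_params, grid, wts)   # a_params: (N, 2); d_params: (N,)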
The set of methods and techniques used for solving optimization problems are called search methods. In their implementation, several progressively better attempts are required before a solution can be obtained. For this reason, they are called iterative techniques. This implies that each attempt or search is conducted in a well-formulated and consistent manner, so that information from previous iterations is used to update current computations until an acceptable solution is found. The process by which this search is conducted is the algorithm. Another meaning of the term "algorithm" is the translation of an ordered sequence of search procedures for finding a solution into a set of specific actions that can be executed on a computer (Venkataraman, 2002). The development of an algorithm for solving the conceptualization of the AMTA problem as a QKP problem is presented in Section 3.3.2. This algorithm is called the QKP heuristic.

3.3.2 Phases of the QKP Heuristic

The algorithm used to implement the QKP heuristic consists of three phases. In the first phase, the algorithm is initialized by summing all the columns (or rows, since the matrix is symmetric) of S (an N × N matrix), as described in Section 3.3. In the second phase, the n columns (or rows) with the highest sums are selected for a submatrix of S called A. Here n corresponds to the target test length. In the third phase, elements outside A that do a better job of estimating D-optimality are selected and replace elements of A one by one.

Based on the phases described above, the QKP heuristic is defined as a greedy algorithm, since local improvements are made each time. Essentially, it is called "greedy" because it selects items in decreasing order of the efficiency with which the items under consideration are added to the "knapsack," provided the capacity constraints are not violated (Kellerer, et al., 2004). This is a common approach to formulating algorithms. For other uses of such algorithms, see WDM (Swanson & Stocking, 1988; 1993) and NWADH (Luecht, 1998).

3.3.3 Details of the QKP Heuristic

In the initialization phase, the following steps are performed (a code sketch of the full heuristic follows the step list):

Step 1. Sum all columns (or rows, since the matrix is symmetric) of S.
Step 2. Create a submatrix of S called A that has the n indices (the number of required test items) with the highest values of the sums obtained in Step 1.
Step 3. Check that the submatrix A contains the specified number of elements, n.
Step 4. If A has fewer than n elements, then select the best element from those that remain in S.
Step 5. Repeat Steps 2, 3 and 4 until the correct number of test items is contained in submatrix A.

The submatrix A is improved in two heuristic steps, the second of which is more important. In both cases a variable called 'CONTINUE' is used in order to ensure the algorithm stops when no improvements are made to A. (The technical term for a variable like CONTINUE is a 'flag'.)

Step 6. Initialize A by summing columns with indices in the existing set A.
Step 7. Stop summing when no improvements can be made in A.

The second heuristic replaces elements of A one by one with elements outside A that do a better job of maximizing the measurement precision of the test based on the objective function.

Step 8. For each index i in A, find the best candidate j (not in A) to replace i.
Step 9. See if the best candidate is actually better than i.
Step 10. If it is better, replace i with it and repeat Steps 3 and 4.
Step 11. Continue Steps 8, 9 and 10 until there are no improvements in A.
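The following is a compact sketch of the unconstrained heuristic described in Steps 1-11 above. It is a simplified reading of the steps, not the dissertation's exact implementation: the initialization of Steps 1-5 reduces to taking the n largest column sums, n < N is assumed, and all names are illustrative.

import numpy as np

def qkp_heuristic(S, n):
    # Greedy QKP heuristic following Steps 1-11 (unconstrained version).
    N = S.shape[0]
    def f(idx):
        ids = list(idx)
        return S[np.ix_(ids, ids)].sum()          # submatrix sum, Eq. (3.7)
    A = set(np.argsort(S.sum(axis=0))[-n:])       # Steps 1-2: best column sums
    improved = True                               # the CONTINUE flag
    while improved:                               # Steps 8-11: one-for-one swaps
        improved = False
        for i in list(A):
            outside = [j for j in range(N) if j not in A]
            # Step 8: best candidate j outside A to replace i.
            best_val, best_j = max((f(A - {i} | {j}), j) for j in outside)
            if best_val > f(A):                   # Step 9: is the swap better?
                A = A - {i} | {best_j}            # Step 10: replace i with j
                improved = True
    return sorted(A)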
The basic logic of the heuristic can be maintained for different types of constraints (i.e., quantitative, categorical, and item set constraints); however, computational alterations may be required to ensure that desired outcomes are obtained. The main advantage of this heuristic is that its logic can be easily extended to different types of problems, as long as they can be formulated in the form of the QKP model.

3.4 Random Item Selection Method

Apart from the QKP and LP methods, a random item selection method (RND) was formulated in order to serve as a baseline measure of the performance of the two main methods being compared. This algorithm uses the QKP formulation described in Section 3.3.3 as the basis for how it functions. Specifically, RND selects the set of items that comprise the best possible test that can be constructed from 100 guesses using computations of D-optimality. The algorithm is explained in greater detail as follows (a code sketch is given below):

Step 1. Sum all columns (or rows, since the matrix is symmetric) of S.
Step 2. Randomly select n indices (the number of required test items) to create a submatrix of S called A.
Step 3. Compute D-optimality for these n indices and set this value to be the largest value of D-optimality.
Step 4. Randomly select another n indices and compute D-optimality.
Step 5. If the value of D-optimality obtained in Step 4 is larger than that obtained in Step 3, then update it to be the new largest value. Otherwise, keep the value in Step 3 as the largest.
Step 6. Repeat Steps 4 and 5 100 times, then select the n items which give the largest D-optimality from these repetitions to be the best test.

From the description above, it is clear that RND is not truly random; its selections are guided by the computation of D-optimality at each iterative step. Therefore, RND is comparable to LP and QKP and provides a good way of estimating how much better the other two methods are than an unstructured method of selecting test items. Also note that RND only works for unconstrained problems.

3.5 Item Pool Simulations

Item pools were simulated for the purpose of providing a "proof of concept". They also formed a basis for the computational experiments. The expression "proof of concept" can be thought of as an incomplete realization (or synopsis) of a method or idea in order to demonstrate its feasibility. The proof of concept is commonly considered a milestone on the way to a fully functional prototype. Hence, the simulations are simply used as a way of showing that the core ideas are workable and feasible, and they will help establish viability (Wikipedia, n.d.).

Two factors were manipulated to create six item pools with characteristics typical of MIRT item pools. The first factor was mean item pool MDISC and the second was item pool size. Item pools were simulated assuming only simple structure, because all ability dimensions were intentional and it would be counterintuitive to impose a more complex dimensional structure. Recall that simple structure implies that all items are pure because they measure knowledge or skill in only one content strand (Kim, 2003; Roussos, et al., 1998; Zhang, 2005). Typical item pools are about 400 items in length. Therefore, the item pool size factor had two levels: a) normal, with 300 items, and b) large, with 900 items. Item quality was determined by varying values of mean MDISC.
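Returning briefly to the RND baseline of Section 3.4, the following is a minimal sketch of Steps 2-6, reusing the d_optimality routine sketched in the Section 2.5 discussion; n_guesses and all other names are illustrative assumptions.

import numpy as np

def rnd_baseline(a_items, d_items, n, theta_grid, n_guesses=100, seed=None):
    # RND baseline: keep the best of n_guesses random n-item tests,
    # scored by D-optimality accumulated over a grid of ability points.
    rng = np.random.default_rng(seed)
    best_value, best_items = -np.inf, None
    for _ in range(n_guesses):                     # Steps 4-6
        items = rng.choice(len(a_items), size=n, replace=False)
        value = sum(d_optimality(a_items[items], d_items[items], t)
                    for t in theta_grid)
        if value > best_value:                     # Step 5: keep the larger value
            best_value, best_items = value, items
    return best_items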
Mean MDISC is an important factor to consider since higher levels of MDISC are generally associated with increased reliability and classification accuracy, as is the case with a-parameters in UIRT. Using the M2PLM of Equation (2.1), three levels of item pool quality were simulated. Low, Moderate, and High discrimination item pools were simulated with MDISC and MDIFF generated from lognormal and normal distributions, respectively. Standard deviations and correlations among these indices were held constant, so the only difference among the item pools was the value of mean MDISC, as shown in Table 3.1.

The conceptual framework of linear regression was used for the simulation of item parameters in each item pool. Specifically, from examining several MIRT studies, it was found that the linear relationships among parameters were quite similar. These similarities included correlations between MDISC and MDIFF, as well as variances and means of MDISC and MDIFF. The aforementioned explorations were conducted on MIRT parameters provided by Kim (2001), Martineau, et al. (2005), Min (2003), Reckase (1997), and Reckase (2006). Although the specifications used in the item pool simulations were somewhat arbitrary, they did closely resemble item characteristics from the aforementioned papers.

Recall that when there is a dependent variable Y and an independent variable X, the linear relationship between these two variables is given by the following formula:

$$\hat{Y}_i = a + bX_i \qquad (3.8)$$

In Equation (3.8), a is the intercept of the line and b is the slope of the line. It follows that the least squares solution for the slope b is given by:

$$b = \frac{\text{cov}(X, Y)}{\text{var}(X)} = r\left(\frac{s_y}{s_x}\right) \qquad (3.9)$$

where r is the correlation between X and Y, $s_y$ is the standard deviation of Y, and $s_x$ is the standard deviation of X. Once the slope has been computed, the least squares solution for a is:

$$a = \bar{Y} - b\bar{X} \qquad (3.10)$$

where $\bar{Y}$ is the mean of Y and $\bar{X}$ is the mean of X.

Using the linear regression framework described above, MDIFF and MDISC values were obtained by first simulating MDIFF parameters as random vectors from a multivariate normal distribution. After this was accomplished, MDISC parameters were regressed on MDIFF parameters and the variance of MDISC was computed. Additional variance needed for MDISC was then simulated based on these prior values, with values drawn from a lognormal distribution with mean and variance as specified in Table 3.1. Item difficulties were computed using the relationship d = -MDIFF × MDISC, and direction cosines based on Equation (2.4) were simulated for the associated clusters and dimensions to which each item belonged. A description of the item clusters is provided below. Lastly, discrimination parameters were computed using the relationship a = (direction cosine × MDISC) for each item. Hence, the resulting discrimination parameters (a1 and a2) had a lognormal distribution and the difficulty parameters (d) had a normal distribution.

As mentioned above, comprehensive sets of parameter specifications were used in simulating the item pools. Table 3.1 provides these values. Specifically, MDISC and MDIFF means and standard deviations were specified, as were correlations between MDISC and MDIFF values. All these values are similar to those found from analyzing the previous research papers mentioned above. In Table 3.1, SD(MDISC) and SD(MDIFF) are the standard deviations of MDISC and MDIFF, respectively. Additionally, recall that item pool quality was defined by the mean value of MDISC.
That is, when an item pool was defined as being of Low quality, it had a small mean MDISC value of 0.40; when an item pool was of Moderate quality, it had a medium mean MDISC value of 1.40; and when an item pool was of High quality, it had a high mean MDISC value of 2.40. Lastly, the correlations between MDISC and MDIFF values allowed the linear regression formulas of Equations (3.8) to (3.10) to be used. The correlation between MDISC and MDIFF was r = .29.

Table 3.1. Item characteristics for Low, Moderate and High Discrimination item pools

                          Item Pool Quality
Item Characteristic     Low    Moderate   High
Mean(MDISC)             0.40     1.40     2.40
SD(MDISC)               0.30     0.30     0.30
Mean(MDIFF)             0.00     0.00     0.00
SD(MDIFF)               0.50     0.50     0.50

From the discussion of item content clusters in Section 2.3, it was ascertained that some multidimensional tests consist of clusters with common content. These are called item content clusters. A realistic example of item content clusters was provided from one of the NAEP assessments. Therefore, item clusters can be considered constraints in an assessment, and such an example is illustrated here by simulating content clusters in each of the aforementioned item pools.

For simplicity, only three item content clusters were simulated for the two-dimensional item pools. Each of the three content clusters contained an equal number of items. Hence, there were 100 and 300 items per cluster in the 300- and 900-length item pools, respectively. Several studies showed that it is reasonable to have such a small number of content clusters (e.g., Martineau et al., 2006; Miller & Hirsch, 1992; Reckase, 1997). In simulating these content clusters, each of the three was systematically aligned with the θ1- and θ2-axes, as illustrated in Figure 2.2. That is, a matrix of direction cosines of item direction with each dimension's axis (i.e., the θ1- and θ2-axes) in the multidimensional space was defined. These direction cosines for items in each content cluster were defined as follows:

i. C0 = cosine of no alignment with the specified dimension (i.e., none of the item variance is attributable to the specified dimension)
ii. C1 = cosine of perfect alignment with the specified dimension (all of the item variance is attributable to the specified dimension)
iii. C2 = cosine of equal alignment with both dimensions (i.e., half of the item variance is attributable to the specified dimension)

The clusters were created by adding random noise to the direction cosines to keep all items in the same cluster from pointing in exactly the same direction in the multidimensional space. As noted above, content clusters were aligned to the two dimensions systematically. That is, cosines of alignment were defined as C0 = 0, C1 = 1, C2 = √0.5 and arranged to cover the possible combinations as shown in the following matrix:

             θ1      θ2
Cluster 1     1       0
Cluster 2   √0.5    √0.5
Cluster 3     0       1

In other words, C0 = 0, C1 = 1, and C2 = √0.5 represent 90°, 0°, and 45° alignment of the direction cosines with each of the two dimensions, respectively. Therefore, these are direction cosines for the two axes (i.e., the θ1- and θ2-axes), where for each row (i.e., cluster) the sum of the squares of the direction cosines must equal one.

This is a realistic simulation of item content clusters because these situations actually exist, although in a more complex form, in an actual test. Recall that in an actual test, these item content clusters could represent subject matter content based on a pre-specified test framework. For example, Martineau et al.
(2006) verified content clusters in a Grade 4 mathematics assessment and were able to successfully identify meaningful content clusters which were clearly related to the test framework.

3.6 Test Length

Test length is one of the primary factors that affect test reliability and decision accuracy. Longer tests are more reliable and have higher decision accuracy (Hambleton & Swaminathan, 1984). Nevertheless, in testing, two important factors limit the length of tests. First, a test has to be of reasonable length so that it can be completed in a reasonable amount of time. Longer tests also increase the cost of administration, and these costs are invariably passed along to candidates. Second, longer tests require more items to be developed, and item development is one of the primary costs for testing programs. Therefore, test length can be limited by economic constraints on a testing program.

Typical test lengths were chosen for these evaluations. Test lengths of 21 to 100 items were considered. However, in some cases, for brevity, results for only 21- and 63-item tests were reported. These values were chosen since they most closely approximate sections of certification and licensure tests (e.g., the Graduate Management Admission Test and the United States Medical Licensing Examination).

3.7 Simulees

In MIRT calibrations, 2,000 or more examinees are usually recommended (Ackerman, 1994; Reckase, 1997). Hence, to ensure the stability of calibrated item parameters and simulated ability parameters, a single sample of 5,000 simulees was selected. These abilities were randomly selected from a multivariate normal distribution, MVN(μ, Σ), where

$$\Sigma = \begin{pmatrix} 1 & 0.6 \\ 0.6 & 1 \end{pmatrix}$$

is the population variance-covariance matrix and $\mu = \begin{pmatrix} 0 \\ 0 \end{pmatrix}$ is the mean vector. The specified distribution of the two abilities thus had a correlation of 0.6. This was the only type of ability distribution assumed, because it produces the least estimation error and was a good way to keep comparisons among the assembled tests meaningful and manageable. It is also the ability distribution assumed in MIRT calibration software (e.g., NOHARM and TESTFACT).

3.8 Test Assembly Model

As noted above, the QKP method uses D-optimality for its objective function, but does not invoke a Taylor approximation. Recall that LP uses D-optimality with a Taylor series approximation, which may introduce mathematical and computational error. Hence, in the QKP and LP conditions, test specifications were set so that one-third of the items come from each content cluster. No content constraints or objective function can be used for the RND method, because of the way it was formulated.

Only equality constraints were used in this test assembly model. It should be noted that even though equality constraints are convenient and easy to handle, they require more numerical effort in order to be satisfied. They have the disadvantage that they are restrictive of the model design and limit the region from which the solution can be obtained (Venkataraman, 2002). Hence, the test assembly model to be solved has the following form:

$$\text{Maximize } f(\boldsymbol{\theta}, \mathbf{x}) \qquad (3.11)$$

subject to

$$\sum_{i \in C_1} x_i = n_{\text{cluster}1} \ \text{(Content Cluster 1)}; \quad \sum_{i \in C_2} x_i = n_{\text{cluster}2} \ \text{(Content Cluster 2)}; \quad \sum_{i \in C_3} x_i = n_{\text{cluster}3} \ \text{(Content Cluster 3)} \qquad (3.12)$$

$$x_i \in \{0, 1\}, \quad i = 1, \ldots, I \quad \text{(decision variables)} \qquad (3.13)$$

where f(θ, x) can represent either the objective function for LP in Equation (2.21) or the one for QKP in Equation (3.5).

Formulation of this AMTA model entails the specification of content clusters as constraints.
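One way the equality constraints of (3.12) might enter the heuristic's initialization phase is sketched below; the swap phase would then be restricted to within-cluster exchanges so the quotas stay satisfied. This is a sketch under that assumption, not the dissertation's exact implementation, and all names are illustrative.

import numpy as np

def initialize_with_quotas(S, cluster_ids, quotas):
    # Within each content cluster, take the required number of items with
    # the largest column sums of S (quotas: cluster label -> items required).
    col_sums = S.sum(axis=0)
    selected = []
    for cluster, n_c in quotas.items():
        members = np.where(cluster_ids == cluster)[0]
        ranked = members[np.argsort(col_sums[members])]
        selected.extend(ranked[-n_c:].tolist())   # best n_c items in the cluster
    return sorted(selected)

# e.g., a 21-item test with 7 items from each of three clusters:
# items = initialize_with_quotas(S, cluster_ids, {1: 7, 2: 7, 3: 7})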
Tests using this model were also assembled over the typical range of abilities, such that (θ1, θ2) ∈ {-3, 3} × {-3, 3}. Although MIRT is rarely used for item banking or the assembly of tests, the model presented here is a realistic approach to AMTA, since studies have confirmed that tests consist of identifiable content clusters (e.g., Martineau, et al., 2006).

3.9 Unidimensional Item Response Theory Simulations

Using MIRT parameters from the assembled tests, a secondary simulation was conducted. This simulation draws on previous research, which has shown that MIRT items that measure more than one trait can be used to construct tests that meet UIRT assumptions (Reckase, et al., 1988). This situation may arise in the construction of decision-making tests (e.g., certification, achievement, or criterion-referenced tests) because, although the item pool consists of items that are sensitive to more than one ability, the items may be required to produce only a single score (e.g., selection tests where a pass/fail decision needs to be made).

There is another way to view this simulation study. Suppose that a test reports examinee performance on a single score scale. If a test has the property of being unidimensional (i.e., the data that come from the interaction of persons with the test can be modeled using a unidimensional model), then the latent trait underlying this score scale will be a unidimensional construct. However, if a test has the property of being multidimensional (i.e., the data that come from the interaction of persons with the test can be modeled using a multidimensional model), the trait underlying the score scale will be a unidimensional composite of the multiple constructs being measured by the test. Mathematically, this composite ability is realized as (van der Linden, 2005):

$$\lambda\theta_1 + (1 - \lambda)\theta_2, \quad 0 \le \lambda \le 1 \qquad (3.14)$$

where λ is the weight assigned to each ability. Therefore, the test would be scored using an estimate of this combination.

Prior research has shown that analyzing multidimensional data with unidimensional IRT models gives meaningful results (Ackerman, 1987; Bogan & Yen, 1983; Reckase, 1979, 1985; Reckase, et al., 1988). Additionally, research has shown that unidimensional ability estimates are comparable to the average of the multidimensional traits (Ansley & Forsyth, 1985) and have different interpretations at different points on the unidimensional scale (Reckase, et al., 1986). Therefore, to evaluate the measurement precision of a single composite score that may be created from a weighted linear combination of the multidimensional scores, the multidimensional parameters for each assembled test were used to simulate dichotomous item responses. For each examinee's two ability scores, simulated item response data were generated using Equation (1.1). The following standard IRT method (see Li & Schafer, 2003; Zhang, 2005) was used (a code sketch of these steps appears below):

1) Given ability score θj = (θ1j, θ2j), calculate the probability of answering item i correctly by examinee j, $p_{ij} = P_i(\theta_j)$, using the item parameter estimates from the assembled test.
2) Generate a random number $r_{ij}$ from the (0, 1) uniform distribution.
3) If $r_{ij} < p_{ij}$, then a correct response is obtained for examinee j on item i and assigned a value of 1; otherwise, an incorrect response is obtained and assigned a value of 0.

BILOG-MG 3 (Zimowski, et al., 2003) was then used to fit a 3PLM model to each set of dichotomous data.
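A compact sketch of the three response-generation steps above, assuming the compensatory M2PLM probabilities (a lower-asymptote term would be added under the full 3PLM of Equation (1.1)); the function and variable names are illustrative.

import numpy as np

def simulate_responses(a_items, d_items, thetas, seed=None):
    # Steps 1-3 above, vectorized: P_ij under the M2PLM, then score 1
    # whenever the uniform draw r_ij falls below P_ij.
    rng = np.random.default_rng(seed)
    p = 1.0 / (1.0 + np.exp(-(thetas @ a_items.T + d_items)))  # examinees x items
    return (rng.uniform(size=p.shape) < p).astype(int)

# e.g., responses = simulate_responses(a_items, d_items, thetas)  # thetas: (5000, 2)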
In conducting the BILOG-MG calibrations, the marginal maximum likelihood method was chosen, and the program's default prior distributions for the discrimination, difficulty, and lower asymptote were used, along with a convergence criterion of 0.0001. The 3PLM was selected because of its frequent use in similar studies that showed the relationship between unidimensional and multidimensional IRT analyses (e.g., McKinley & Reckase, 1983; Reckase, 1979; Reckase & Ackerman, 1986). Finally, the parameters obtained from this calibration were used to compute unidimensional test information using Equation (2.6). Functionally, this created the best unidimensional approximation to the multidimensional space spanned by the test items.

3.10 Data Analyses and Evaluation Criteria

The ultimate goal of ATA is to produce tests of high psychometric quality with the most efficient computer implementation. Hence, the computational experiment conditions were used to make comparisons among QKP-, LP-, and RND-assembled tests for the purpose of determining which methods produce tests of higher quality. Additionally, the computational performance of the AMTA methods was compared using the following evaluation criteria: a) Real- and CPU-time performance, b) D-optimality computation, c) maximum information, d) MINF across the ability continua, and e) relative efficiency of assembled tests. These criteria were chosen because they help to compare the efficiency of QKP and provide an important connection to both MIRT and UIRT.

3.10.1 Computational Efficiency

An important goal in developing and studying knapsack problems is to determine their computational behavior (Kellerer, et al., 2004). Knapsack problems are commonly evaluated by computing their Real- and CPU-time performance. CPU-time is the amount of time the CPU is actually processing instructions during the execution of a computer program. Recall that the CPU (central processing unit) is the brains of the computer, where most calculations take place. Real-time refers to events occurring in the computer at the same rate as they occur in real life. In this study, the CPU- and Real-time performance of QKP, LP, and RND were compared in order to gauge each procedure's efficiency. Therefore, these analyses give an indication of the speed and processing capability of each procedure.

3.10.2 Maximum Information

In order to compare the three procedures, maximum information (MAXINFO) was used as an additional criterion for assessing the quality of assembled tests. MAXINFO is the maximum amount of information that is given by an item or a test when checking the direction of the composite in the multidimensional space. This composite exists because, for any set of multidimensional items, MIRT statistical information is obtained for different linear composites of abilities in the multidimensional space. Based on Equation (2.13), MAXINFO is computed as follows:

$$\text{MAXINFO} = \max_{\alpha}\left[I_\alpha(\boldsymbol{\theta})\right] \qquad (3.15)$$

where α is the vector of angles with the coordinate axes in the θ-space and αi is an element of that vector. Equation (3.15) simply indicates that, depending on the direction of maximum discrimination of the particular items, one composite of abilities is measured better than the rest. The formula follows from the formula for computing MINF in Equation (2.13). In general, this measurement of composite abilities occurs because of the manner in which the angles of the composite are aligned with the coordinate axes.
MAXINFO tells how much information is provided for that composite.

3.10.3 Pairwise Comparisons Using Clamshell Plots

Reckase and McKinley (1991) developed a way to graphically display multidimensional information (MINF) using clamshell plots (so named because of their shape). These give the amount of information in different directions. Clamshell plots measure the amount of information provided by a test (or test item) for 10 different composites, or measurement directions, from 0° to 90° in 10° increments. They are created by computing the amount of information at (θ1, θ2) points along each corresponding ability continuum, represented by vectors originating from the points where the two θ-points intersect (Ackerman, 1996; Reckase & McKinley, 1991).

The measurement precision of assembled tests was compared using clamshell plots. Comparisons were made through a pairwise comparison of areas in the multidimensional space where tests assembled with different methods provided the most information. These comparisons provide information not only about measurement efficiency, but also about the regions of the multidimensional ability space where each method is superior to the other in providing MINF. Therefore, these comparisons can be considered a form of multidimensional relative efficiency, similar to the unidimensional relative efficiency described below.

3.10.4 Relative Efficiency

The ratio of the information functions for two tests can be computed when the measurement efficiency of two tests is being compared. This ratio is called relative efficiency (Lord, 1980) and, as noted above, when measuring multidimensional tests in a unidimensional manner, it amounts to providing a measure of the relative efficiency of the composite test scores created with each method. Therefore, when comparing the relative efficiency (RE) of LP- and QKP-assembled tests, the following formula is used:

$$RE(\text{LP}, \text{QKP}) = \frac{I(\theta, \text{LP})}{I(\theta, \text{QKP})} \qquad (3.16)$$

Similarly, the comparison of RE for RND- and QKP-assembled tests becomes:

$$RE(\text{RND}, \text{QKP}) = \frac{I(\theta, \text{RND})}{I(\theta, \text{QKP})} \qquad (3.17)$$

RE gives the relative merits of the amount of information for each test and facilitates making decisions about the quality of tests. It is represented graphically as a comparison of percentiles of measurement efficiency based on UIRT test information. Hence, a test that provides the most information in the region of the ability scale of interest is preferred. As noted above, this evaluation criterion serves as a relative measure of test quality.

CHAPTER 4
RESULTS

This chapter presents results from the study. Four factors were examined: 1) item pool quality, measured by mean item pool MDISC, 2) item pool size, 3) length of assembled test, and 4) AMTA method. The first two factors were completely crossed to form six item pools: Low Discrimination with 300 items, Moderate Discrimination with 300 items, High Discrimination with 300 items, Low Discrimination with 900 items, Moderate Discrimination with 900 items, and High Discrimination with 900 items. The remaining factors were incompletely crossed, as permitted by the item pool available for test assembly.

Section 4.1 reports results of the item pool and ability simulations. Sections 4.2 and 4.3 summarize findings from the computation of D-optimality and the characteristics of assembled tests. Section 4.4 presents findings on computational efficiency, and Section 4.5 presents comparisons of MAXINFO for the computational experiment conditions.
Additional MIRT comparisons are presented in Section 4.6 using pairwise comparisons of clamshell plots. Section 4.7 reports findings on the relative efficiency of assembled tests.

4.1 Simulated Data

Descriptive statistics of the simulated item pools and ability parameters are presented in this section. Recall that three levels of item pool quality, based on mean MDISC values, and two levels of item pool size were two conditions in this study. These two conditions were fully crossed to produce six item pools that were used for assembling tests of different test lengths.

Table 4.1 summarizes properties of the 300-length item pools. All the discrimination parameters are positive, as is required both practically and for the mathematical derivations to work. Generally, means of the a1 parameters were higher than means of the a2 parameters, as expected in a compensatory model. These mean discrimination parameters are also consistent with the MDISC levels stipulated in the simulation. Mean difficulty was centered at approximately zero for all three item pools. Standard deviations for all item parameters were also generally reasonable and showed a good range of variability. Moreover, MDISC and MDIFF values were consistent with the proposed definitions of item pool quality. Therefore, the lower the mean MDISC value for an item pool, the smaller the number of highly discriminating items it contained.

Table 4.1. Descriptive Statistics of Parameters for 300-Length Item Pools

Item Pool      Statistical     a1       a2       d      MDISC    MDIFF
Low            Mean          0.3737   0.1474  -0.0441   0.4062   0.0117
               SD            0.2602   0.0966   0.2623   0.2662   0.5162
Moderate       Mean          1.0659   0.6750  -0.0411   1.4096  -0.0021
               SD            0.4851   0.5108   0.7089   0.3158   0.4974
High           Mean          2.2612   0.8104  -0.0103   2.4109  -0.0141
               SD            0.2863   0.2447   0.0688   0.3144   0.5052

Extended descriptive statistics for the 300-length item pools are provided in Table 4.2. Here, a further breakdown of item parameter descriptive statistics allowed their reasonableness to be determined. Descriptive statistics were computed for M2PLM parameters and further analyzed by content cluster for each item pool. The three content clusters were used to specify AMTA model constraints.

Overall, parameter values for the 300-length item pools are reasonable. The means and standard deviations indicate that an appreciable amount of variation existed within each cluster. Additionally, minimum and maximum values are within a meaningful range of typical MIRT parameters. However, in a few cases the mean a-parameters were quite small, with values of 0.04 and 0.01 for Cluster 2 of the High Discrimination pool and Cluster 2 of the Low Discrimination pool, respectively. This did not pose a major concern, since the mean MDISC values for the entire pool were reasonable, as reported in Table 4.1.

Similar summary statistics were computed for the 900-length item pools. Table 4.3 summarizes properties for the Low, Moderate, and High Discrimination item pools. Notably, mean MDISC values were consistent with those stipulated in Chapter 3 and were also comparable to those of the 300-length item pools. Interestingly, d values for the Low Discrimination pool were lower than those for the Moderate and High Discrimination pools. This implies that Low Discrimination item pools generally consist of harder items. As was the case with the summary statistics reported in Table 4.1, mean MDIFF values for the Moderate and High Discrimination item pools tended to be lower than the stipulated value of 0.0.
Although this is a deviation from the values originally stipulated for mean MDIFF, it did not pose a problem for the intended comparisons, since the resulting 300- and 900-length item pools had almost-identical statistical properties.

[Table 4.2. Descriptive Statistics of Clusters for Low, Moderate and High Discrimination Item Pools of Length 300.]

Table 4.3. Descriptive Statistics of Parameters for 900-Length Item Pools

Item Pool      Statistical     a1       a2       d      MDISC    MDIFF
Low            Mean          0.4063   0.0254  -0.0536   0.4075   0.0210
               SD            0.2974   0.0298   0.2573   0.2984   0.4999
Moderate       Mean          1.3779   0.1106  -0.0604   1.3868   0.0162
               SD            0.2882   0.1168   0.6875   0.2903   0.4878
High           Mean          2.3867   0.1516  -0.0590   2.3959   0.0044
               SD            0.3061   0.1512   1.2457   0.3090   0.5115

An extended summary of item pool statistical properties stratified by cluster for the 900-length item pools is given in Table 4.4. Similar to the findings of Table 4.2, a-parameter values are small in a few of the content clusters. Namely, values of 0.01 and 0.04 were found in Cluster 1 of the Low Discrimination pool and Cluster 2 of the High Discrimination pool, respectively. Again, these small means are not a concern, since the overall MDISC values displayed in Table 4.3 are reasonable.

[Table 4.4. Descriptive Statistics of Clusters for Low, Moderate and High Discrimination Item Pools of Length 900.]

Lastly, recall that a sample of ability parameters was simulated. This sample was assumed to have been randomly selected from a multivariate normal distribution, MVN(μ, Σ), with

$$\Sigma = \begin{pmatrix} 1 & 0.6 \\ 0.6 & 1 \end{pmatrix}, \quad \mu = \begin{pmatrix} 0 \\ 0 \end{pmatrix}.$$

Analysis of the simulated data produced the following estimated ability distribution:

$$\hat{\Sigma} = \begin{pmatrix} 1.008 & 0.604 \\ 0.604 & 1.004 \end{pmatrix}, \quad \hat{\mu} = \begin{pmatrix} 0.0087 \\ -0.0001 \end{pmatrix}.$$

This sample had statistical characteristics very close to those that were stipulated.

4.2 D-optimality

The multidimensional optimality of each AMTA procedure was compared using D-optimality. Recall that D-optimality may be considered a summary index of the MINF, or measurement precision, provided by each test. It is computed from the determinant of Fisher's information matrix and is a single numeric value for each assembled test. Therefore, the higher the value of D-optimality computed, the higher the resulting MINF for that test. An AMTA procedure which selects items that add the most utility to D-optimality is preferred.

Using the six item pools described in the previous section, D-optimality was compared for each assembled test. Tests varied from short tests of length 21 to long tests of length 100. Additionally, these tests were assembled while accounting for the content constraints described previously. Remember that the goal of assembling each test was to maximize MINF.
D-optimality results from assembling these tests are presented in Figures 4.1 to 4.6. For example, Figure 4.1 displays D-optimality results for tests that ranged from 21 to 100 items. This figure shows that at every test length, the QKP method assembled tests with higher D-optimality values than the LP method. Additionally, RND was used as a baseline measure of the effectiveness of LP and QKP. Therefore, since the values for LP and QKP are distinctly higher than those for RND, it can be concluded that using optimization procedures is a valuable endeavor. Moreover, since the values of D-optimality for QKP were higher than those for LP, this indicates that for a 300-length item pool of Low Discrimination, QKP assembles tests with higher measurement precision.

[Figure 4.1. D-Optimality Criterion Results for the 300-Length Low Discrimination Item Pool]
[Figure 4.2. D-Optimality Criterion Results for the 300-Length Moderate Discrimination Item Pool]
[Figure 4.3. D-Optimality Criterion Results for the 300-Length High Discrimination Item Pool]
[Figure 4.4. D-Optimality Criterion Results for the 900-Length Low Discrimination Item Pool]
[Figure 4.5. D-Optimality Criterion Results for the 900-Length Moderate Discrimination Item Pool]
[Figure 4.6. D-Optimality Criterion Results for the 900-Length High Discrimination Item Pool]

It is also noteworthy that QKP D-optimality values begin to diverge from those of LP as the length of the assembled tests increases. This means that QKP is more efficient at assembling longer tests than LP.

For item pools of varying discrimination, there are noticeable differences in D-optimality values. In Figures 4.1 and 4.4 the differences in D-optimality values are distinctly greater than those displayed in Figures 4.2, 4.3, 4.5, and 4.6. The difference lies in the item pools that were used. The first two figures display values for tests that were assembled using Low Discrimination item pools, while the latter four figures display results from Moderate and High Discrimination item pools. This implies that in comparison to LP, the QKP procedure has less of an advantage as the items in the pool become increasingly discriminating.

Overall, these findings imply that for a wide range of test lengths and varying levels of item pool quality, QKP assembles tests of higher measurement precision. Unfortunately, D-optimality has no direct links to common psychometric measures of test quality, and its evaluation will have to continue in the succeeding sections. However, before linking D-optimality to any psychometric indices, an example and comparison of assembled tests will be provided in the next section.

4.3 Comparison of Constructed Tests

Recall that D-optimality was derived mathematically and used in the objective function to facilitate maximizing the multidimensional information of the assembled test.
Therefore, test items that maximize the value of D-optimality are selected from the item pools using each of the three procedures. For illustrative purposes, this section compares assembled tests. Only 21-item tests assembled from Low Discrimination item pools of length 300 and 900 will be described here.

Tables 4.5 and 4.6 summarize the properties of these tests based on items selected from each cluster, type of AMTA method used, and M2PLM parameters. Additionally, the numbers of items selected in common between QKP and the other two methods are identified. In Table 4.5, of the 21 items selected, 16 items were the same between QKP and LP and 6 were the same between QKP and RND. Similarly, in Table 4.6, of the 21 items selected, 18 items were common between QKP and LP and 4 items were common between QKP and RND. Clearly, QKP and LP are similar in the way they select items from the pool. However, a closer look reveals that QKP tends to select items that have higher a-parameter values, as analyzed in greater detail below.

The superiority of QKP can be determined by comparing the multidimensional parameters of items selected by each AMTA procedure. For example, in Table 4.5, among the items not common between LP and QKP, the mean values of a1 and a2 for LP are 1.13 and 0.28, while those for QKP are 2.04 and 0.51. Similarly, in Table 4.6 these mean values for LP are 0.97 and 0.46, while those for QKP are 1.54 and 0.66. Consequently, higher measurement precision would result from QKP-assembled tests since items of higher discrimination contribute more to the computation of MINF based on Equation (2.13). Although this breakdown of findings is only reported for two tests, the same conclusions would be true for other comparisons of QKP- and LP-assembled tests since computations of D-optimality (as shown in Figures 4.1 to 4.6) are consistently higher for QKP-assembled tests.

[Table 4.5. Items Selected by QKP, LP and RND for Tests Assembled Using the 300-Length Low Discrimination Item Pool]

[Table 4.6. Items Selected by QKP, LP and RND for Tests Assembled Using the 900-Length Low Discrimination Item Pool]
4.4 Computational Efficiency

As noted previously, the CPU- and Real-time performance of computer programs helps in the evaluation of their efficiency. Although today's computers have much faster processing speeds than ten years ago, when CPU- and Real-time performance were of greater concern, in this study it is still important to compare the computational speeds of QKP and LP as a way of evaluating the efficiency of both methods. If QKP were mathematically more accurate than LP, but significantly slower at computing a solution, then the mathematical advantage would be less important. This is because in operational settings, the speed at which a computer program computes a solution is far more important than minute deviations in accuracy.

Figures 4.7 to 4.18 present results of the CPU- and Real-time computations. These results were computed for 21- to 100-length assembled tests using each of the six item pools. Recall that item pools were varied in terms of their length and mean MDISC. Generally, LP assembles tests using the least amount of Real and CPU-time. However, the difference between LP and QKP is small enough to conclude that the two procedures perform comparably.

[Figure 4.7. Real Time Performance of Each Assembly Method for the 300-Length Low Discrimination Item Pool]
[Figure 4.8. CPU Time Performance of Each Assembly Method for the 300-Length Low Discrimination Item Pool]
[Figure 4.9. Real Time Performance of Each Assembly Method for the 300-Length Moderate Discrimination Item Pool]
[Figure 4.10. CPU Time Performance of Each Assembly Method for the 300-Length Moderate Discrimination Item Pool]
[Figure 4.11. Real Time Performance of Each Assembly Method for the 300-Length High Discrimination Item Pool]
[Figure 4.12. CPU Time Performance of Each Assembly Method for the 300-Length High Discrimination Item Pool]
[Figure 4.13. Real Time Performance of Each Assembly Method for the 900-Length Low Discrimination Item Pool]
[Figure 4.14. CPU Time Performance of Each Assembly Method for the 900-Length Low Discrimination Item Pool]
[Figure 4.15. Real Time Performance of Each Assembly Method for the 900-Length Moderate Discrimination Item Pool]
[Figure 4.16. CPU Time Performance of Each Assembly Method for the 900-Length Moderate Discrimination Item Pool]
[Figure 4.17. Real Time Performance of Each Assembly Method for the 900-Length High Discrimination Item Pool]
[Figure 4.18. CPU Time Performance of Each Assembly Method for the 900-Length High Discrimination Item Pool]
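The two clocks reported in these figures can be distinguished with a short sketch (illustrative only; the stand-in workload below replaces an actual assembly run): perf_counter measures elapsed wall-clock ("Real") time, while process_time accumulates only the CPU time charged to the process.

# Illustrative sketch of the Real- vs CPU-time distinction; not the
# benchmarking code used in this study.
import time

def timed(assemble, *args):
    """Run an assembly routine and return (result, real_seconds, cpu_seconds)."""
    t0_real, t0_cpu = time.perf_counter(), time.process_time()
    result = assemble(*args)
    real = time.perf_counter() - t0_real
    cpu = time.process_time() - t0_cpu
    return result, real, cpu

# Example with a stand-in workload in place of a real QKP or LP solver:
_, real_s, cpu_s = timed(lambda n: sum(i * i for i in range(n)), 1_000_000)
print(f"Real: {real_s:.4f}s  CPU: {cpu_s:.4f}s")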
4.5 Maximum Information

MAXINFO is a multidimensional index which measures the composite of abilities in the direction of maximum information for a particular test or set of items. This index facilitates the comparison of assembled tests since it was shown previously that the AMTA procedures do not select identical items from the item pool.

Comparisons of MAXINFO are presented graphically in Figures 4.19 to 4.24. These results show that the differences in MAXINFO between QKP and LP are uniform across tests of length 21 to 99, favoring QKP. If these lines are extrapolated, QKP would always assemble tests with higher MAXINFO than LP. Additionally, it appears this would hold true for item pools of any length. Moreover, both QKP and LP have higher values of MAXINFO than RND. This implies it is indeed worth the effort to use optimization methods for assembling multidimensional tests.

It is interesting to note, however, that the higher the item quality, the smaller the difference in MAXINFO. Recall that item quality was imposed on the different item pools by specifying mean MDISC values, with Low Discrimination = 1.20, Moderate Discrimination = 2.20, and High Discrimination = 3.80. What this finding suggests is that QKP is less effective at assembling tests when the multidimensional discrimination of items is higher. However, this conclusion may not be entirely correct.
A more logical explanation is that differences between AMTA procedures are nullified when the coefficient of variation becomes smaller. The coefficient of variation represents the ratio of the standard deviation to the mean, and it is a useful statistic for comparing the degree of variation from one data set to another, even if the means are drastically different from each other. The formula for calculating the coefficient of variation is:

CV = SD / Mean                                                    (4.1)

where SD is the standard deviation. CV computations for the MDISC values in Table 4.1 gave the following values: a) Low Discrimination = 0.66, b) Moderate Discrimination = 0.22, and c) High Discrimination = 0.15. Similarly, from Table 4.3, the CVs have the following values: a) Low Discrimination = 0.73, b) Moderate Discrimination = 0.21, and c) High Discrimination = 0.13. Hence, the efficiency of QKP is lessened when the coefficient of variation becomes smaller.

[Figure 4.19. MAXINFO Results for a 300-Length Low Discrimination Item Pool]
[Figure 4.20. MAXINFO Results for a 300-Length Moderate Discrimination Item Pool]
[Figure 4.21. MAXINFO Results for a 300-Length High Discrimination Item Pool]
[Figure 4.22. MAXINFO Results for a 900-Length Low Discrimination Item Pool]
[Figure 4.23. MAXINFO Results for a 900-Length Moderate Discrimination Item Pool]
[Figure 4.24. MAXINFO Results for a 900-Length High Discrimination Item Pool]

4.6 Pairwise Comparisons Using Clamshell Plots

Pairwise comparisons of MINF were conducted using clamshell plots. The comparisons are presented pairwise because every assembled test was compared for QKP versus LP and for QKP versus RND. Specifically, the comparisons focused on differentiating the measurement precision of QKP in comparison to LP, and of QKP in comparison to RND. The purpose of the first comparison was to determine which procedure was better. The second comparison was used as a baseline measure to assess the magnitude of the difference in the QKP and LP comparisons.

In interpreting these diagrams, blue is positive (i.e., where QKP gives more MINF than the method to which it was being compared) and red is negative (i.e., where QKP gives less MINF than the method to which it was being compared). Additionally, the red and blue clamshells are plotted so that red is inverted, while blue is upright. The density of color and length of the clamshells are also related to the magnitude of MINF represented at those grid points. These comparisons are made by visually inspecting the areas of red and blue clamshell plots while the corresponding item pool quality and test length plots are juxtaposed.

Out of all the tests assembled, only two test lengths were selected for illustrative purposes. The 21-length test is commonly considered a short test, while a 63-length test can be regarded as a long test. Recall that, generally, test length is an important factor in test assembly because the longer the test, the more reliable it becomes, but at significant cost in terms of other resources such as item exposure and size of item pools. Therefore, a 63-item test is expected to be more reliable than a 21-item test.
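As a rough sketch of the quantity each clamshell encodes (my own illustration, not the plotting code used here), the signed MINF difference between two assembled tests can be evaluated over a grid of ability points in a fixed direction, assuming Reckase-style directional information I(θ, u) = Σ_i (a_i'u)² P_i(1 - P_i) for the M2PLM.

# Minimal sketch of the signed MINF difference behind a clamshell comparison;
# the item parameters below are invented for illustration.
import numpy as np

def minf(a, d, theta, angle_deg):
    u = np.array([np.cos(np.radians(angle_deg)), np.sin(np.radians(angle_deg))])
    p = 1.0 / (1.0 + np.exp(-(a @ theta + d)))
    return np.sum((a @ u) ** 2 * p * (1.0 - p))

def minf_difference_grid(a1, d1, a2, d2, angle_deg=45.0):
    grid = np.arange(-4.0, 4.01, 0.4)          # theta1 x theta2 grid as plotted
    diff = np.empty((grid.size, grid.size))
    for i, t1 in enumerate(grid):
        for j, t2 in enumerate(grid):
            theta = np.array([t1, t2])
            # positive -> test 1 (e.g., QKP) gives more MINF here ("blue")
            diff[i, j] = minf(a1, d1, theta, angle_deg) - minf(a2, d2, theta, angle_deg)
    return grid, diff

# Usage with two invented 21-item tests:
rng = np.random.default_rng(3)
a_qkp, d_qkp = rng.uniform(0.1, 0.7, (21, 2)), rng.normal(0, 0.5, 21)
a_lp, d_lp = rng.uniform(0.1, 0.6, (21, 2)), rng.normal(0, 0.5, 21)
grid, diff = minf_difference_grid(a_qkp, d_qkp, a_lp, d_lp)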
The findings from these pairwise comparisons are presented in Figures 4.25 to 4.48. At each test length, the level of MINF was compared. Looking at Figures 4.25 and 4.26, for example, there is a difference between the clamshell plots. Figure 4.25 compares QKP and RND, while Figure 4.26 compares QKP and LP. The length and darkness of the clamshells are slightly less distinct in the former than in the latter. This means that QKP-assembled tests give higher MINF when compared to RND-assembled tests than when compared to LP-assembled tests. This was the expected outcome based on the findings from the D-optimality and MAXINFO comparisons. In these figures, the presence of red, inverted clamshells indicates the areas of the two-dimensional space where the comparison method gives higher MINF than QKP. However, since these areas of red inverted clamshells are few and indistinct, LP and RND do not give that much more MINF in these regions of the multidimensional space than QKP.

Additional conclusions can be drawn by looking at the remaining pairwise comparison plots. QKP assembles tests with higher MINF than LP and RND when the item pool consists of items with low discrimination. Evidence for this conclusion is corroborated by Figures 4.25 to 4.28 and 4.37 to 4.40. Furthermore, QKP is more efficient for shorter tests. Evidence for the latter observation comes from the pairwise clamshell plot comparisons for the 21- and 63-length tests. Another conclusion highlighted by these findings is that item pools of high multidimensional discrimination do not provide much difference in the computation of MINF among all three methods. Support for this observation comes from looking at tests assembled from High Discrimination item pools: these plots do not show much difference among the three methods. However, item pool length does not seem to have an impact on the measurement precision of assembled tests.

Some overall conclusions can be drawn from these pairwise comparisons. Generally, these plots indicate that QKP provides better measurement precision in the center of the two-dimensional space, that is, where θ1 and θ2 values are either negatively related or one value is considerably larger than the other. These conclusions are consistent with the use of the M2PLM, which is compensatory in the sense that a low value on one dimension is compensated for by a high value on the corresponding dimension and vice versa. Lastly, LP generally provides more MINF in the extreme portions of the θ1- and θ2-space.

[Figure 4.25. Pairwise Comparison of MINF for 21-Item QKP- and RND-Assembled Tests Using the Low Discrimination 300-Length Item Pool]
[Figure 4.26. Pairwise Comparison of MINF for 21-Item QKP- and LP-Assembled Tests Using the Low Discrimination 300-Length Item Pool]
[Figure 4.27. Pairwise Comparison of MINF for 63-Item QKP- and RND-Assembled Tests Using the Low Discrimination 300-Length Item Pool]
[Figure 4.28. Pairwise Comparison of MINF for 63-Item QKP- and LP-Assembled Tests Using the Low Discrimination 300-Length Item Pool]
[Figure 4.29. Pairwise Comparison of MINF for 21-Item QKP- and RND-Assembled Tests Using the Moderate Discrimination 300-Length Item Pool]
[Figure 4.30. Pairwise Comparison of MINF for 21-Item QKP- and LP-Assembled Tests Using the Moderate Discrimination 300-Length Item Pool]
[Figure 4.31. Pairwise Comparison of MINF for 63-Item QKP- and RND-Assembled Tests Using the Moderate Discrimination 300-Length Item Pool]
[Figure 4.32. Pairwise Comparison of MINF for 63-Item QKP- and LP-Assembled Tests Using the Moderate Discrimination 300-Length Item Pool]
[Figure 4.33. Pairwise Comparison of MINF for 21-Item QKP- and RND-Assembled Tests Using the High Discrimination 300-Length Item Pool]
[Figure 4.34. Pairwise Comparison of MINF for 21-Item QKP- and LP-Assembled Tests Using the High Discrimination 300-Length Item Pool]
[Figure 4.35. Pairwise Comparison of MINF for 63-Item QKP- and RND-Assembled Tests Using the High Discrimination 300-Length Item Pool]
[Figure 4.36. Pairwise Comparison of MINF for 63-Item QKP- and LP-Assembled Tests Using the High Discrimination 300-Length Item Pool]
[Figure 4.37. Pairwise Comparison of MINF for 21-Item QKP- and RND-Assembled Tests Using the Low Discrimination 900-Length Item Pool]
[Figure 4.38. Pairwise Comparison of MINF for 21-Item QKP- and LP-Assembled Tests Using the Low Discrimination 900-Length Item Pool]
[Figure 4.39. Pairwise Comparison of MINF for 63-Item QKP- and RND-Assembled Tests Using the Low Discrimination 900-Length Item Pool]
[Figure 4.40. Pairwise Comparison of MINF for 63-Item QKP- and LP-Assembled Tests Using the Low Discrimination 900-Length Item Pool]
[Figure 4.41. Pairwise Comparison of MINF for 21-Item QKP- and RND-Assembled Tests Using the Moderate Discrimination 900-Length Item Pool]
[Figure 4.42. Pairwise Comparison of MINF for 21-Item QKP- and LP-Assembled Tests Using the Moderate Discrimination 900-Length Item Pool]
[Figure 4.43. Pairwise Comparison of MINF for 63-Item QKP- and RND-Assembled Tests Using the Moderate Discrimination 900-Length Item Pool]
[Figure 4.44. Pairwise Comparison of MINF for 63-Item QKP- and LP-Assembled Tests Using the Moderate Discrimination 900-Length Item Pool]
[Figure 4.45. Pairwise Comparison of MINF for 21-Item QKP- and RND-Assembled Tests Using the High Discrimination 900-Length Item Pool]
[Figure 4.46. Pairwise Comparison of MINF for 21-Item QKP- and LP-Assembled Tests Using the High Discrimination 900-Length Item Pool]
[Figure 4.47. Pairwise Comparison of MINF for 63-Item QKP- and RND-Assembled Tests Using the High Discrimination 900-Length Item Pool]
[Figure 4.48. Pairwise Comparison of MINF for 63-Item QKP- and LP-Assembled Tests Using the High Discrimination 900-Length Item Pool]

4.7 Relative Efficiency

Recall that Relative Efficiency (RE) is a method used for comparing the measurement efficiency of two tests. Specifically, this is accomplished by comparing their information functions graphically. In this section, the RE of QKP-, LP-, and RND-assembled tests will be compared. Remember that after tests were constructed using MIRT parameters, dichotomous responses were simulated for each assembled test using a fixed sample of 5,000 simulated abilities. These dichotomous responses were then calibrated unidimensionally for the 21- and 63-length tests using BILOG-MG. Essentially, this section repeats the comparison conducted with the pairwise clamshell plots. Therefore, RE and UIRT information comparisons were conducted for item pools of varying discrimination and length using the three AMTA procedures for 21- and 63-length tests. Results from computations of RE and UIRT information are presented in Figures 4.49 to 4.72.
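As a sketch of how such RE curves arise (illustrative; the 3PL parameters below are invented stand-ins for a BILOG-MG calibration, and the usual D = 1.7 scaling is assumed), test information is computed for each calibrated test and the pointwise ratio is taken.

# Hedged sketch of the RE computation, not the study's code.
import numpy as np

def info_3pl(a, b, c, theta):
    """Unidimensional 3PL test information at each theta value."""
    theta = np.asarray(theta)[:, None]                  # column of abilities
    p = c + (1 - c) / (1 + np.exp(-1.7 * a * (theta - b)))
    # Standard 3PL item information, summed over items:
    item_info = (1.7 * a) ** 2 * ((1 - p) / p) * ((p - c) / (1 - c)) ** 2
    return item_info.sum(axis=1)

theta = np.linspace(-4, 4, 161)
# Invented parameters standing in for two calibrated 21-item tests:
rng = np.random.default_rng(1)
a_qkp, a_lp = rng.uniform(1.2, 2.2, 21), rng.uniform(0.8, 1.8, 21)
b, c = rng.normal(0, 1, 21), np.full(21, 0.2)
re_lp_vs_qkp = info_3pl(a_lp, b, c, theta) / info_3pl(a_qkp, b, c, theta)
# RE < 1 means the LP-assembled test is less efficient than QKP at that theta.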
For example, Figure 4.49 shows that QKP-assembled tests provide the most information at about 0 on the ability scale. As expected, LP-assembled tests provide the second highest amount of information. The findings from Figure 4.49 are then translated to RE in Figure 4.50. The horizontal line at RE = 1 indicates the position where tests assembled by all three methods are equally efficient at providing UIRT information. Around the 50th percentile, LP-assembled tests are about 0.7 times as efficient as QKP-assembled tests. Similarly, around the 50th percentile, RND-assembled tests are about 0.3 times as efficient as QKP-assembled tests at estimating ability.

Additionally, the relative merits of each assembled test can be compared in terms of efficiency along the ability continuum. Figure 4.50 shows that QKP-assembled tests are more efficient between the 35th and 72nd percentiles, while LP-assembled tests are more efficient over the remainder of the continuum. This supports the conclusions drawn from the clamshell plot pairwise comparisons that LP-assembled tests are more efficient at extreme portions of the ability continuum.

The remaining plots of UIRT information show that QKP-assembled tests consistently provide the highest information. This finding holds true regardless of the length of the item pool or its level of discrimination. The RE plots are not as easy to interpret. They seem reasonable for the tests assembled using the Low Discrimination item pools, but become erratic and uneven for tests assembled from the High Discrimination item pools. One reason could be that the differential effect of content clusters is becoming realized in the modeling. However, given that tests assembled from High Discrimination item pools had no recognizable difference in the clamshell plot comparisons, these RE findings may not be reliable or meaningful. Therefore, these findings may simply be a case of modeling noise.

[Figure 4.49. Test Information Functions for 21-Item Tests Assembled from the 300-Length, Low Discrimination Item Pool with the QKP, LP, and RND Methods]
[Figure 4.50. Efficiency Functions for 21-Item Tests Assembled from the 300-Length, Low Discrimination Item Pool with the QKP, LP, and RND Methods]
[Figure 4.51. Test Information Functions for 63-Item Tests Assembled from the 300-Length, Low Discrimination Item Pool with the QKP, LP, and RND Methods]
[Figure 4.52. Efficiency Functions for 63-Item Tests Assembled from the 300-Length, Low Discrimination Item Pool with the QKP, LP, and RND Methods]
[Figure 4.53. Test Information Functions for 21-Item Tests Assembled from the 300-Length, Moderate Discrimination Item Pool with the QKP, LP, and RND Methods]
[Figure 4.54. Efficiency Functions for 21-Item Tests Assembled from the 300-Length, Moderate Discrimination Item Pool with the QKP, LP, and RND Methods]
[Figure 4.55. Test Information Functions for 63-Item Tests Assembled from the 300-Length, Moderate Discrimination Item Pool with the QKP, LP, and RND Methods]
[Figure 4.56. Efficiency Functions for 63-Item Tests Assembled from the 300-Length, Moderate Discrimination Item Pool with the QKP, LP, and RND Methods]
[Figure 4.57. Test Information Functions for 21-Item Tests Assembled from the 300-Length, High Discrimination Item Pool with the QKP, LP, and RND Methods]
[Figure 4.58. Efficiency Functions for 21-Item Tests Assembled from the 300-Length, High Discrimination Item Pool with the QKP, LP, and RND Methods]
[Figure 4.59. Test Information Functions for 63-Item Tests Assembled from the 300-Length, High Discrimination Item Pool with the QKP, LP, and RND Methods]
[Figure 4.60. Efficiency Functions for 63-Item Tests Assembled from the 300-Length, High Discrimination Item Pool with the QKP, LP, and RND Methods]
[Figure 4.61. Test Information Functions for 21-Item Tests Assembled from the 900-Length, Low Discrimination Item Pool with the QKP, LP, and RND Methods]
[Figure 4.62. Efficiency Functions for 21-Item Tests Assembled from the 900-Length, Low Discrimination Item Pool with the QKP, LP, and RND Methods]
[Figure 4.63. Test Information Functions for 63-Item Tests Assembled from the 900-Length, Low Discrimination Item Pool with the QKP, LP, and RND Methods]
[Figure 4.64. Efficiency Functions for 63-Item Tests Assembled from the 900-Length, Low Discrimination Item Pool with the QKP, LP, and RND Methods]
[Figure 4.65. Test Information Functions for 21-Item Tests Assembled from the 900-Length, Moderate Discrimination Item Pool with the QKP, LP, and RND Methods]
[Figure 4.66. Efficiency Functions for 21-Item Tests Assembled from the 900-Length, Moderate Discrimination Item Pool with the QKP, LP, and RND Methods]
[Figure 4.67. Test Information Functions for 63-Item Tests Assembled from the 900-Length, Moderate Discrimination Item Pool with the QKP, LP, and RND Methods]
[Figure 4.68. Efficiency Functions for 63-Item Tests Assembled from the 900-Length, Moderate Discrimination Item Pool with the QKP, LP, and RND Methods]
[Figure 4.69. Test Information Functions for 21-Item Tests Assembled from the 900-Length, High Discrimination Item Pool with the QKP, LP, and RND Methods]
[Figure 4.70. Efficiency Functions for 21-Item Tests Assembled from the 900-Length, High Discrimination Item Pool with the QKP, LP, and RND Methods]
[Figure 4.71. Test Information Functions for 63-Item Tests Assembled from the 900-Length, High Discrimination Item Pool with the QKP, LP, and RND Methods]
[Figure 4.72. Efficiency Functions for 63-Item Tests Assembled from the 900-Length, High Discrimination Item Pool with the QKP, LP, and RND Methods]

CHAPTER 5
DISCUSSION AND CONCLUSIONS

The first purpose of this study was to formulate a new method for maximizing multidimensional information in automated multidimensional test assembly. To accomplish this purpose, QKP was adapted to AMTA in conjunction with a new computer algorithm called the QKP heuristic. The second purpose was to evaluate the QKP procedure. This was accomplished through a series of computational experiments based on realistic test scenarios. Comparisons of QKP to LP were conducted to determine whether the mathematical formulation reduced mathematical error. Another reason for this comparison was to determine the computational efficiency of QKP. A baseline measure was used to determine whether it was at all meaningful to use QKP and LP. This baseline measure used a random item selection method comparable to the other two methods and was called RND.

Specifically, four conditions were manipulated in the computational experiments. There were three levels of item pool quality, two levels of item pool size, several levels of test length (although in appropriate cases only two levels were reported), and three AMTA methods. These conditions were all evaluated for QKP in comparison to the LP and RND methods. Evaluations were mainly of two types. The first type used multidimensional data and indices while the second type used unidimensional data and indices. Hence, both multidimensional and unidimensional data were simulated. The multidimensional data were simulated using the M2PLM for creating the various item pools. The QKP, LP and RND procedures were then used to assemble tests that ranged from 21 to 99 items in length. After assembling these multidimensional tests, a fixed sample of 5,000 simulated abilities was then used to simulate dichotomous responses for each assembled test. These dichotomous responses were then calibrated with BILOG-MG using a 3PLM. Evaluations of the multidimensional and unidimensional tests compared QKP-, LP-, and RND-assembled tests for all conditions.

5.1 Unique Contributions

This dissertation made several new contributions. This was the first time quadratic knapsack programming has been used in AMTA. Prior to this dissertation, authors have simply mentioned the possibility of using it (Veldkamp, 2002a) or have used it in non-IRT contexts (Feuerman & Weiss, 1973). The successful use of QKP in MIRT allows numerous extensions to be made to the model proposed herein. The QKP heuristic was also another new contribution to the psychometric literature.
This heuristic is flexible enough that it can be extended to assemble tests with: a) more complex constraints such as inequalities, b) multiple test forms, c) complex multiple-section or multi-stage designs, and d) fitting target information functions.

The last contribution has more to do with MIRT than AMTA. Specifically, a new MIRT index called MAXINFO was formulated. This index can be used to measure the largest composite of information in a particular direction for any multidimensional test. Hence, this index can be used to assess the overall quality of assembled tests and is applicable to multidimensional tests in general.

5.2 Computational Experiments and Evaluation Criteria

Overall, findings from the computational experiments indicate that QKP is a better method for assembling multidimensional tests. Recall that the purpose of assembling these tests was to maximize multidimensional information. Therefore, QKP assembles tests with higher measurement precision than LP. Moreover, the incorporation of RND into these computational experiments seems to indicate that it is worthwhile to use optimization methods (i.e., QKP and LP) for assembling tests. Hence, QKP seems to be an improvement on LP. The advantage of QKP may be due to its ability to reduce the error produced by linearization. Another advantage may be due to the manner in which the QKP heuristic selects items. Discussion of the computational experiments and evaluation criteria used in this study follows in the succeeding sections.

The random method performs comparably only when it comes to the MAXINFO index. However, from looking at the other five evaluation criteria (i.e., D-optimality, CPU-time, Real-time, pairwise clamshell plots and relative efficiency), we see that the random method is far less efficient. Also recall that the random method is actually "pseudo-random" because it is based on the D-optimality criterion and uses 100 iterations in order to select items from the item pool. This design of the random method was chosen so that it was a more realistic comparison to QKP and LP.

5.2.1 Item Pool and Ability Simulations

The simulated item pools fit the pre-specified conditions fairly well. Summary statistics given in Tables 4.1 and 4.3 showed that target mean MDISC and MDIFF levels were close to pre-specified values. Additionally, standard deviations were similar to those stipulated for the simulation. A breakdown of item pool statistics by content cluster in Tables 4.2 and 4.4 also showed that simulated M2PLM values were reasonable and had appreciable levels of variability.

The mean and covariance of ability parameters for the simulated abilities closely followed a multivariate normal distribution, as stipulated. The correlation between the two simulated ability traits was 0.604, which was close to the specified correlation of 0.60. Hence, the distribution of simulee abilities had a moderate correlation and the default ability distribution (standardized bivariate normal distribution), which produces the least estimation error and is assumed in standard calibration software such as NOHARM and TESTFACT.

Generally, the variances of the a2 values were much smaller than those of the a1 values. This finding, along with the low means for the a2 values, suggests that the simulated item pools were mainly sensitive to differences on the first dimension. This was not necessarily the intended outcome of the simulation. However, all methods were evaluated uniformly using the same item pools, and this provided an adequate comparison of the three methods.
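A check of the kind described above for the ability simulation can be sketched as follows (a minimal illustration, not the study's code): draw 5,000 abilities from the stipulated bivariate normal distribution and verify that the sample mean vector and correlation are close to their targets of 0 and 0.60.

# Illustrative ability-simulation check.
import numpy as np

rng = np.random.default_rng(42)
mu = np.zeros(2)
sigma = np.array([[1.00, 0.60],
                  [0.60, 1.00]])
theta = rng.multivariate_normal(mu, sigma, size=5000)

print("sample means:", theta.mean(axis=0))                 # expected near (0, 0)
print("sample correlation:", np.corrcoef(theta.T)[0, 1])   # expected near 0.60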
Additionally, the assembled 21-item tests in Tables 4.5 and 4.6 indicate that all three methods generally chose the highly discriminating items that were available in each of the item pools. Therefore, even though mean values generally indicated anomalies in the item pools, this did not have an adverse impact on the veracity of the computational experiments.

Most of the information provided is found along the θ1 dimension. This finding is consistent with the finding that the simulated item pools were mainly sensitive to differences on the first dimension, as noted above. Although this was not the intended purpose of the simulations, since the assembly methods were compared using the same item pools, the computational experiments proved to be adequate for evaluating the three methods.

5.2.2 Computations of D-optimality and Characteristics of Assembled Tests

D-optimality may be considered a summary index of the multidimensional information or measurement precision provided by each test. Computations of D-optimality were provided for each of the six item pools for test lengths ranging from 21 to 99 items. These results are presented in Figures 4.1 to 4.6. The diagrams showed that, in general, QKP-assembled tests produced the highest values of D-optimality, indicating they had the highest measurement precision among tests assembled with the three assembly methods (i.e., QKP, LP, and RND). Additionally, QKP-assembled tests had higher measurement precision than LP- and RND-assembled tests for the low-discrimination item pools. This finding is reflected in Figures 4.1 and 4.4.

The smallest difference in D-optimality among the three assembly methods existed for high-discrimination item pools. This finding seems to indicate that QKP is most efficient when item pools have low discrimination, but less efficient for item pools of high discrimination. As extended discussions will reinforce below, measurement precision was not measured at a single point, but was compared for different regions of the multidimensional and unidimensional space. Further simulation studies would have to be conducted in order to determine the reason for this occurrence.

Sample test characteristics for assembled tests are provided in Tables 4.5 and 4.6. Only 21-item tests assembled from Low Discrimination item pools of length 300 and 900 were shown, for illustrative purposes. The properties of these tests were summarized based on items selected from each cluster, type of AMTA method used, and M2PLM parameters. Additionally, the numbers of items selected in common between QKP and the other two methods are identified. As expected, there were higher numbers of items selected in common between QKP- and LP-assembled tests than between QKP- and RND-assembled tests. In Table 4.5, of the 21 items selected, 16 were the same between QKP and LP and 6 items were the same between QKP and RND. Similarly, in Table 4.6, of the 21 items selected, 12 were common between QKP and LP and 4 items were common between QKP and RND.

QKP and LP performed similarly in the way they select items from the pool. However, a closer look revealed that QKP tended to select items with higher a-parameter values. In Table 4.5, among the items not common between LP and QKP, the mean values of a1 and a2 for LP are 1.13 and 0.28, while those for QKP are 2.04 and 0.51.
Similarly, in Table 4.6 these mean values for LP are 0.97 and 0.46, while those for QKP are 1.54 and 0.66. Consequently, higher measurement precision would result from QKP-assembled tests since items of higher discrimination contribute more to the computation of MINF based on Equation (2.13).

Generally, a2 parameters in the simulated item pools had means near zero. This was an unanticipated outcome of the simulations and would imply that the item pools were generally most sensitive to one dimension. However, this outcome did not appear to affect the evaluations that were done using the computational experiments because assembled tests rarely contained items with a2 parameter values that were very close to zero. Hence, the optimization methods closely followed the stipulation of the objective function to select items with a1 and a2 parameters that were as highly discriminating as possible so that D-optimality was maximized. This finding seems to imply that even when poor item pools exist, optimization methods are still a very good way of assembling tests that meet both content and statistical specifications.

5.2.3 Computational Efficiency

An analysis of the CPU- and Real-time performance of QKP, LP and RND helped with the evaluation of each method's efficiency in formulating a solution. Figures 4.7 to 4.18 presented results of the CPU- and Real-time computations. These results were computed for 21- to 99-length assembled tests using each of the six item pools. Recall that item pools were varied in terms of their length and mean MDISC. Generally, LP assembled tests using the least amount of Real and CPU-time. However, the difference between LP and QKP is small enough to conclude that the two procedures performed equally well.

5.2.4 MAXINFO and Pairwise Comparisons with Clamshell Plots

It was shown in Tables 4.5 and 4.6 that QKP, LP and RND do not select identical items from the item pool. A multidimensional index called MAXINFO was formulated to measure the composite of abilities in the direction of maximum information for each assembled test. Comparisons of MAXINFO were presented graphically in Figures 4.19 to 4.24. These results showed that the differences in MAXINFO between QKP and LP were uniform for tests of length 21 to 99, favoring QKP. QKP and LP had higher values of MAXINFO than RND. This implied that it is indeed worth the effort to use optimization methods for assembling multidimensional tests.

Comparisons of MINF were conducted using clamshell plots. These comparisons were presented pairwise because every assembled test was compared for QKP versus LP and for QKP versus RND. The comparisons focused on differentiating the measurement precision of QKP in comparison to LP, and of QKP in comparison to RND. The purpose of the first comparison was to determine which procedure was better. The second comparison was used as a baseline measure to assess the magnitude of the difference in the QKP and LP comparisons. In interpreting these diagrams, blue is positive (i.e., where QKP gives more MINF than the method to which it was being compared) and red is negative (i.e., where QKP gives less MINF than the method to which it was being compared). Additionally, the red and blue clamshells are plotted so that red is inverted, while blue is upright. The density of color and length of the clamshells are also related to the magnitude of MINF represented at those grid points.
These comparisons are made by visually inspecting the areas of red and blue clamshell plots while the corresponding item pool quality and test length plots are juxtaposed. Only two test lengths were selected for illustrative purposes. Recall that, generally, test length is an important factor in test assembly because the longer the test, the more reliable it becomes, but at significant cost in terms of other resources such as item exposure and size of item pools. The 21-length test was considered a short test while a 63-length test was considered a long test. Hence, a 63-item test is expected to be more reliable than a 21-item test.

Findings from these pairwise comparisons were presented in Figures 4.25 to 4.48. QKP-assembled tests give higher MINF when compared to RND-assembled tests than when compared to LP-assembled tests. This outcome was expected based on the D-optimality and MAXINFO findings. Overall, the pairwise comparisons showed that QKP-assembled tests were most efficient in the middle portions of the ability continuum. Furthermore, findings showed that QKP-assembled tests had higher MINF than LP and RND when the item pool consisted of items with low discrimination. Evidence for this conclusion is corroborated by Figures 4.25 to 4.28 and 4.37 to 4.40. QKP also proved to be more efficient for assembling shorter tests, as evidenced by comparisons using pairwise clamshell plots of the 21- and 63-length tests.

This conclusion was supported by the observation that item pools of high multidimensional discrimination do not assemble tests which show much difference in the computation of MINF when QKP-, LP- and RND-assembled tests were compared. Recall that the quality of the item pool (low, medium and high discrimination) was defined by mean MDISC for each of the simulated item pools. Figures 4.33 to 4.36 are pairwise comparisons of QKP to the linear (LP) and random (RND) methods. These comparisons indicate that there is virtually no distinction between QKP and the other two methods for 21- and 63-length assembled tests. Similar results were found for Figures 4.45 to 4.48. What these sets of figures have in common is the use of the high discrimination item pools. That is, when item pools contain highly discriminating test items, the QKP method does not assemble tests of higher multidimensional information.

Additional conclusions can be drawn from these pairwise comparisons. These plots indicated that QKP provides better measurement precision in the center of the two-dimensional space, that is, where θ1 and θ2 values are either negatively related or one value is considerably larger than the other. These conclusions are consistent with the use of the M2PLM, which is compensatory in the sense that a low value on one dimension is compensated for by a high value on the corresponding dimension and vice versa. The last conclusion drawn was that LP generally provided more MINF in the extreme portions of the two-dimensional space.

5.2.5 Unidimensional Relative Efficiency of Assembled Tests

Additionally, in order to evaluate the measurement precision of a single composite score that may be created from a weighted linear combination of the multidimensional scores, the multidimensional parameters for each assembled test were used to simulate dichotomous item responses. BILOG-MG was used to fit a three-parameter logistic model to each set of data, and the parameters were then used to compute unidimensional information.
Functionally, this procedure created the best unidimensional approximation to the multidimensional space spanned by the test items. The ratio of the information functions was then computed to provide a measure of the relative efficiency of composite test scores created with each method. Consequently, this evaluation criterion serves as a relative measure of the measurement precision that might be expected in composite scores, which may be more meaningful to practitioners and psychometricians less familiar with multidimensional IRT indices.

Results showed that when assembling tests to report a single score from item pools composed of multidimensional items, QKP-assembled tests provide the most UIRT information. The latter conclusion is supported by findings from the computations of UIRT information, which showed that QKP-assembled tests consistently gave the highest amount of information. The information of each test was also evaluated as relative efficiency by computing the ratio of information for tests assembled with the LP and RND methods to that of tests assembled with the QKP method. Computations of relative efficiency showed that at the targeted ability level of θ = 0, QKP-assembled tests were the most efficient. Although the highest amount of UIRT information was generally obtained at θ = 0 on the ability scale, the plots of information did not make it clear which tests gave the most information at other points on the ability scale. Recall that the 5,000 simulees had a mean ability of 0; hence we expected the highest information to occur at this point. Plots of relative efficiency (RE) showed that QKP did not assemble the most efficient test over the entire ability continuum. For example, in Figure 4.50, QKP is only more efficient than LP between the 30th and 80th percentiles of ability, while LP is more efficient than QKP on the remainder of the ability distribution. Also, the overall level of discrimination of the item pool had an influence on the RE of assembled tests. QKP-assembled tests were more efficient than both LP- and RND-assembled tests when they were assembled from Low Discrimination item pools. The RE advantage of QKP-assembled tests was less distinct for Moderate and almost imperceptible for High Discrimination item pools.
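For clarity, the relative-efficiency criterion amounts to a pointwise ratio of test information functions. The following is a minimal sketch, assuming the standard three-parameter logistic item information; the (a, b, c) triples would come from the BILOG-MG calibrations, which are not reproduced here, and the scaling constant D is an assumption.

```python
import numpy as np

D = 1.702  # normal-metric scaling constant (an assumption; use 1.0 for the logistic metric)

def info_3pl(theta, a, b, c):
    # Fisher information of a single 3PL item at ability theta
    p = c + (1.0 - c) / (1.0 + np.exp(-D * a * (theta - b)))
    return (D * a) ** 2 * ((1.0 - p) / p) * ((p - c) / (1.0 - c)) ** 2

def test_info(theta, items):
    # items: iterable of (a, b, c) triples from a unidimensional calibration
    return sum(info_3pl(theta, a, b, c) for a, b, c in items)

def relative_efficiency(theta, items_other, items_qkp):
    # RE(theta) < 1 means the comparison test yields less information
    # at theta than the QKP-assembled test
    return test_info(theta, items_other) / test_info(theta, items_qkp)
```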
5.3 Limitations

As with any simulation study, there were questions about the generalizability of findings to real testing situations. In this study, care was taken to ensure that the simulated conditions were realistic. The IRT model used to generate and analyze data is commonly used in the literature for item calibration and scoring. The chosen item parameters were congruent with those typically used and reported in many studies dealing with real and simulated data. However, no matter how realistic the generated data may be, they can never substitute for real data. Therefore, the absence of real data should be considered a major limitation of this study. The fact that simulated data generally behave too well should not weaken the value of the findings obtained, however, since the focus of this study was mainly methodological. Nonetheless, test developers should expect some decrease in performance between the development phase, where simulations are conducted, and the operational phase, where testing is implemented.

In practice, certification tests are constructed in order to make pass/fail decisions at a specific point on the ability continuum. However, the evaluation conducted herein did not assemble tests to achieve such a purpose. Rather, this evaluation was conducted simply to show which of the three AMTA methods assembled tests most efficiently and accurately. Therefore, until further research is conducted on using QKP for creating cut-scores, this AMTA method remains severely limited in its practical application to real testing contexts. Moreover, the conclusions drawn about the performance of QKP in comparison to LP may not be completely accurate due to the simplicity of the test specifications used here. It is expected that as more constraints are included in the QKP model, its applicability and efficiency may differ from the findings presented in the current study.

5.4 Future Research

The success of optimization algorithms hinges on the choice of a good initial solution. In many cases a finite limit on the number of iterations has to be imposed so that the algorithm does not enter an infinite loop. In this study this was not a problem, since the computational experiments were carefully formulated and hence the QKP heuristic was fairly well behaved. However, with the extension of the QKP model to solve larger and more complex AMTA problems, these issues may become a concern. Additionally, there are times when an algorithm can give different solutions even when applied repeatedly to the same problem. Several factors may cause these inconsistencies: the degree and type of nonlinearity, the number of design variables, the number of constraints, and the accuracy of the initial guess (Venkataraman, 2002). Therefore, it is important for further evaluations of QKP to include and vary these conditions.

Future studies that consider more variations in test length, item pool characteristics, and test specifications will be important to enable the study's results to be further generalized. Moreover, additional variations in simulees' characteristics will also allow better evaluations of the QKP method in comparison to the other methods. The most important extension of this work, however, will be to the kinds of problems currently handled by LP approaches (such as content constraints, enemy sets, etc.). The process of ATA is quite complex, and AMTA is even more complex. Issues of dimensionality and computation made the development of this method quite difficult. The solution presented here pertains only to the two-dimensional case, and extensions to higher dimensions require more derivations. Therefore, more efficient and suitable mathematical and computational extensions are required for this procedure to be applicable to a larger number of testing situations.

Solving combinatorial optimization problems like QKP can be a difficult task. This difficulty arises because, unlike in LP, where the feasible region is a convex set, in combinatorial problems a discrete set of feasible points must be searched in order to find an optimal solution. Hence, in LP, any local solution is a global optimum. In the case of QKP, however, there can be situations in which many local optima exist. Therefore, finding a global optimum requires one to prove that a particular solution dominates all feasible points by arguments other than the calculus-based derivative approaches of convex programming (Hoffman & Padberg, 1996). Hence, it is important for some form of quality control to be developed for the QKP procedure.
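To make the combinatorial structure concrete, the following is a minimal sketch of a generic greedy construction for the QKP (maximize x'Qx subject to a knapsack constraint on binary x). It illustrates why such procedures can stop at local optima; it is not the heuristic evaluated in this study, and the profit matrix Q, weights w and capacity are placeholders.

```python
import numpy as np

def qkp_greedy(Q, w, capacity):
    # Greedy construction for: maximize x'Qx subject to w'x <= capacity, x binary.
    # Q is a symmetric, nonnegative profit matrix; adding item j to the current
    # selection S changes the objective by Q[j, j] + 2 * sum(Q[j, i] for i in S).
    n = len(w)
    x = np.zeros(n, dtype=bool)
    used = 0.0
    while True:
        best, best_gain = None, 0.0
        for j in np.flatnonzero(~x):
            if used + w[j] > capacity:
                continue
            gain = Q[j, j] + 2.0 * Q[j, x].sum()  # marginal change in x'Qx
            if gain > best_gain:
                best, best_gain = j, gain
        if best is None:  # no feasible item improves the objective
            return x
        x[best] = True
        used += w[best]

# Tiny placeholder instance
rng = np.random.default_rng(2)
Q = rng.uniform(0.0, 1.0, (10, 10))
Q = (Q + Q.T) / 2.0
print(qkp_greedy(Q, w=np.ones(10), capacity=4))
```

Because each step commits to the locally best item, two runs that break ties differently, or start from different partial solutions, can end at different local optima. This is precisely the kind of behavior a quality-control procedure would need to flag.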
One such quality control would be a method for determining situations in which the QKP-derived solution is infeasible. Generally, an ATA or AMTA solution is considered infeasible when there is no solution that satisfies all constraints. Methods for analyzing (Huitzing et al., 2005) and solving (Timminga, 1998) infeasibility problems in test assembly exist. However, thus far these have only been developed for unidimensional tests. Therefore, they should also be adapted to the AMTA context.

The adaptation of QKP to commercially available software such as CPLEX would provide a standard form for sharing this software with other researchers. Not only is software like CPLEX easier to use, but it is already commonly used for ATA and would provide some standardization of the software used. Another advantage of using CPLEX is that it has built-in sensitivity analyses. Sensitivity analysis refers to determining the range of parameters for which the optimal solution still has the same variables in the basis, even though values at the solution may change (Venkataraman, 2002). Another likely sensitivity test would use the likelihood ratio test to compute regions of arbitrary confidence based on the maximum likelihood estimates that arise from using D-optimality (see Neyer, 1994).

REFERENCES

Ackerman, T. A. (1987). The use of unidimensional item parameter estimates of multidimensional items in adaptive testing. Paper presented at the annual meeting of the American Educational Research Association, Washington, DC.

Ackerman, T. A. (1994). Creating a test information profile for a two-dimensional latent space. Applied Psychological Measurement, 18(3), 257-275.

Ackerman, T. A. (1996). Graphical representation of multidimensional item response theory analyses. Applied Psychological Measurement, 20(4), 248-258.

Ackerman, T. A., Gierl, M. J., & Walker, C. (2003). Using multidimensional item response theory to evaluate educational and psychological tests. Educational Measurement: Issues and Practice, 22(3), 37-53.

Adema, J. J. (1992). Implementation of the branch-and-bound method for test construction. Methodika, 6, 99-117.

Anderson, T. W. (1984). An introduction to multivariate statistical analysis (2nd ed.). New York: Wiley.

Ansley, T. N. & Forsyth, R. A. (1985). An examination of the characteristics of unidimensional IRT parameter estimates derived from two-dimensional data. Applied Psychological Measurement, 9, 37-48.

Baumol, W. J. & Bushnell, R. C. (1967). Error produced by linearization in mathematical programming. Econometrica, 35, 447-471.

Berger, M. P. F. (1991). On the efficiency of IRT models when applied to different sampling designs. Applied Psychological Measurement, 15, 293-306.

Berger, M. P. F. (1992). Sequential sampling designs for the two-parameter item response theory model. Psychometrika, 57, 521-538.

Berger, M. P. F. & Veerkamp, W. J. J. (1994). A review of selection methods for optimal test design (Report No. RR-94-4). Enschede, Netherlands: Twente University, Faculty of Educational Science and Technology.

Berger, M. P. F. (1998). Optimal design of tests with dichotomous and polytomous items. Applied Psychological Measurement, 22(3), 248-258.

Birnbaum, A. (1968). Some latent trait models and their use in inferring an examinee's ability. In F. M. Lord & M. R. Novick, Statistical theories of mental test scores (pp. 397-479). Reading, MA: Addison-Wesley.

Bogan, E. D. & Yen, W. M. (1983). Detecting multidimensionality and examining its effects on vertical equating with the three-parameter logistic model. Paper presented at the annual meeting of the American Educational Research Association, Montreal, Canada.
Burden, R. L. & Faires, J. D. (1997). Numerical analysis (6th ed.). New York: Brooks/Cole.

Caprara, A., Pisinger, D., & Toth, P. (1999). Exact solution of the quadratic knapsack problem. INFORMS Journal on Computing, 11, 125-137.

Caprara, A., Kellerer, H., Pferschy, U., & Pisinger, D. (2000). Approximation algorithms for knapsack problems with cardinality constraints. European Journal of Operational Research, 123, 333-345.

Crocker, L. & Algina, J. (1986). Introduction to classical and modern test theory. Fort Worth, TX: Harcourt Brace Jovanovich.

Davey, T. (2005). Improving ETS's psychometric infrastructure. R & D seminar presentation, Educational Testing Service, Princeton, NJ.

de Gruijter, D. N. M. (1985). A note on the asymptotic variance-covariance matrix of item parameter estimates in the Rasch model. Psychometrika, 50, 247-249.

de Gruijter, D. N. M. (1988). Standard errors of item parameter estimates in incomplete designs. Applied Psychological Measurement, 12, 109-116.

Federation of State Medical Boards & National Board of Medical Examiners. (1996). General instructions, content description, and sample items. Philadelphia, PA: National Board of Medical Examiners.

Feuerman, F. & Weiss, H. (1973). A mathematical programming model for test construction and scoring. Management Science, 19, 961-966.

Gallo, G., Hammer, P. L., & Simeone, B. (1980). Quadratic knapsack problems. Mathematical Programming, 12, 132-149.

Glover, F. & Woolsey, R. E. (1974). Converting the 0-1 polynomial programming problem to a 0-1 linear program. Operations Research, 22, 180-182.

Glover, F. (1975). Improved linear integer formulations of the nonlinear integer problems. Management Science, 23, 445-460.

Green, B. F. (1990). Notes on the item information function in the multidimensional compensatory IRT model (Report No. 88-10). Baltimore, MD: Johns Hopkins University, Psychometric Laboratory.

Hambleton, R. & Swaminathan, H. (1984). Item response theory: Principles and applications. New York: Springer-Verlag.

Hillier, F. S. & Lieberman, G. J. (1995). Introduction to operations research. New York: McGraw-Hill.

Hoffman, K. & Padberg, M. (1996). Integer and combinatorial programming. In S. I. Gass & C. M. Harris (Eds.), Encyclopedia of operations research and management science. Boston: Kluwer Academic.

Huitzing, H. A., Veldkamp, B. P., & Verschoor, A. J. (2005). Infeasibility in automated test assembly models: A comparison study of different methods. Journal of Educational Measurement, 42(3), 223-243.

Kellerer, H., Pferschy, U., & Pisinger, D. (2004). Knapsack problems. New York: Springer.

Kendall, M. G. & Stuart, A. (1967). The advanced theory of statistics (Vol. 2, 2nd ed.). New York: Hafner.

Kim, J.-P. (2001). Proximity measures and cluster analyses in multidimensional item response theory. Unpublished doctoral dissertation, Michigan State University, East Lansing, MI.

Li, Y. H. & Schafer, W. D. (2003). The effect of item selection methods on the accuracy of CATs' ability estimates when item parameters are contaminated with measurement errors. Paper presented at the annual meeting of the American Educational Research Association, Chicago, IL.

Lord, F. M. (1980). Applications of item response theory to practical testing problems. Hillsdale, NJ: Lawrence Erlbaum.
Luecht, R. M. (1998). Computer-assisted test assembly using optimization heuristics. Applied Psychological Measurement, 22, 224-236.

Luecht, R. M. (1996). Multidimensional computerized adaptive testing in a certification or licensure context. Applied Psychological Measurement, 20(4), 389-404.

Martineau, J. A., Mapuranga, R., & Ward, K. (2006, April). Confirming content structure in standardized state assessments using multidimensional item response theory. Paper presented at the annual meeting of the American Educational Research Association, San Francisco, CA.

McKinley, R. L. & Reckase, M. D. (1983). The use of IRT analysis on dichotomous data from multidimensional tests. Paper presented at the annual meeting of the American Educational Research Association, Montreal, Quebec.

Michigan Educational Assessment Program. (2003). Design and validity of the test. Retrieved March 2004 from http://www.meap.org/.

Miller, T. R. & Hirsch, T. M. (1992). Cluster analysis of angular data in applications of multidimensional item response theory. Applied Measurement in Education, 5, 193-212.

Min, K. (2003). The impact of scale dilation on the quality of the linking of multidimensional item response theory calibrations. Unpublished doctoral dissertation, Michigan State University, East Lansing, MI.

Muraki, E. & Engelhard, G. (1985). Full-information item factor analysis: Applications to EAP scores. Applied Psychological Measurement, 9, 417-430.

National Assessment Governing Board, U.S. Department of Education. (2005). Economics framework for the 2006 National Assessment of Educational Progress. Retrieved December 2006 from http://www.nagb.org/pubs/economics_06.pdf.

Nemhauser, G. L. & Wolsey, L. A. (1988). Integer and combinatorial optimization. New York: John Wiley.

Neyer, B. T. (1994). A D-optimality-based sensitivity test. Technometrics, 36(1), 61-70.

Reckase, M. D. (1979). Unifactor latent trait models applied to multifactor tests: Results and implications. Journal of Educational Statistics, 4(3), 207-230.

Reckase, M. D. (1985). The difficulty of test items that measure more than one dimension. Applied Psychological Measurement, 9, 401-412.

Reckase, M. D. & Ackerman, T. A. (1986). Building a test using items that require more than one skill to determine a correct answer. Paper presented at the annual meeting of the American Educational Research Association, San Francisco, CA.

Reckase, M. D., Carlson, J., Ackerman, T. A., & Spray, J. A. (1986). The interpretation of unidimensional IRT parameters when estimated from multidimensional data. Paper presented at the annual meeting of the Psychometric Society, Toronto, Canada.

Reckase, M., Ackerman, T., & Carlson, J. (1988). Building a unidimensional test using multidimensional items. Journal of Educational Measurement, 25(3), 193-203.

Reckase, M. D. & McKinley, R. L. (1991). The discriminating power of test items that measure more than one dimension. Applied Psychological Measurement, 15, 361-373.

Reckase, M. D. (1997). A linear logistic multidimensional model for dichotomous item response data. In W. J. van der Linden & R. K. Hambleton (Eds.), Handbook of modern item response theory (pp. 271-286). New York: Springer-Verlag.

Reckase, M. D., Martineau, J. A., & Kim, J.-P. (2000, June). A vector approach to determining the dimensionality of a data set. Paper presented at the annual meeting of the Psychometric Society, Seattle, WA.

Reckase, M. D. (2006). Multidimensional item response theory. In C. R. Rao & S. Sinharay (Eds.), Handbook of statistics (Vol. 26). Amsterdam: North-Holland/Elsevier.
Roussos, L. A., Stout, W. F., & Marden, J. I. (1998). Using new proximity measures with hierarchical cluster analysis to detect multidimensionality. Journal of Educational Measurement, 35, 1-30.

Segall, D. O. (1996). Multidimensional adaptive testing. Psychometrika, 61, 331-354.

Stocking, M. L. (1990). Specifying optimum examinees for item parameter estimation in item response theory. Psychometrika, 55, 461-475.

Stocking, M. L. & Swanson, L. (1998). Severely constrained adaptive testing with extensions to item pool design. Applied Psychological Measurement, 22, 271-279.

Swanson, L. & Stocking, M. L. (1993). A model and heuristic for solving very large item selection problems. Applied Psychological Measurement, 17(2), 151-166.

Theunissen, T. J. J. M. (1985). Binary programming and test design. Psychometrika, 50, 411-420.

Thissen, D. & Wainer, H. (1982). Some standard errors in item response theory. Psychometrika, 47, 397-412.

Timminga, E. (1998). Solving infeasibility problems in computerized test assembly. Applied Psychological Measurement, 22(3), 280-291.

Vale, C. D. (1986). Linking item parameters onto a common scale. Applied Psychological Measurement, 10, 333-344.

van der Linden, W. J. & Boekkooi-Timminga, E. (1989). A maximin model for test design with practical constraints. Psychometrika, 54, 237-247.

van der Linden, W. J. (1994). Optimum design in item response theory. In G. H. Fischer & D. Laming (Eds.), Contributions to mathematical psychology, psychometrics, and methodology (pp. 305-318). New York: Springer-Verlag.

van der Linden, W. J. (1996). Assembling tests for the measurement of multiple traits. Applied Psychological Measurement, 20, 373-388.

van der Linden, W. J. (1998). Optimal assembly of psychological and educational tests. Applied Psychological Measurement, 22, 195-211.

van der Linden, W. J. (2005). Linear models for optimal test design. New York: Springer-Verlag.

Veldkamp, B. P. (1998). Multidimensional test assembly based on Lagrangian relaxation techniques (Report No. RR-98-08). Enschede, Netherlands: Twente University, Faculty of Educational Science and Technology.

Veldkamp, B. P. (2002a). Constrained multidimensional test assembly. Applied Psychological Measurement, 26(2), 133-145.

Veldkamp, B. P. (2002b). Optimal test construction (Report No. RR-02-08). Enschede, Netherlands: Twente University, Faculty of Educational Science and Technology.

Venkataraman, P. (2002). Applied optimization with MATLAB programming. New York: Wiley.

Votaw, D. F. (1952). Methods of solving some personnel classification problems. Psychometrika, 17, 255-266.

Wald, A. (1943). On the efficient design of statistical investigations. Annals of Mathematical Statistics, 14, 134-140.

Walters, L. G. (1967). Reduction of integer polynomial problems to zero-one linear programming. Operations Research, 15, 1171-1174.

Whittaker, E. T. & Watson, G. N. (1990). Forms of the remainder in Taylor's series. In A course in modern analysis (4th ed.). Cambridge, England: Cambridge University Press.

Wikipedia: The Free Encyclopedia. (n.d.). Proof of concept. Retrieved December 2006 from http://en.wikipedia.org/wiki/Proof_of_concept

Wightman, L. F. (1998). Practical issues in computerized test assembly. Applied Psychological Measurement, 22, 292-302.

Yen, W. M. (1983). Use of the three-parameter model in the development of standardized achievement tests. In R. K. Hambleton (Ed.), Applications of item response theory. Vancouver: Educational Research Institute of British Columbia.
Zangwill, W. I. (1965). Media selection by decision programming. Journal of Advertising Research, 5, 23-27.

Zhang, J. (2005). Estimating multidimensional item response models with mixed structure (ETS Research Report RR-05-04). Princeton, NJ: ETS.

Zimowski, M., Muraki, E., Mislevy, R., & Bock, D. (2003). BILOG-MG (Version 3) [Computer software and manual]. Lincolnwood, IL: Scientific Software International.