PSYCHOMETRIC TOOLS FOR FORMATIVE CLASSROOM ASSESSMENT: TEST CONSTRUCTION AND ITEM POOL DESIGN BASED ON COGNITIVE DIAGNOSTIC MODELS

By Jiahui Zhang

A DISSERTATION Submitted to Michigan State University in partial fulfillment of the requirements for the degree of Measurement and Quantitative Methods, Doctor of Philosophy, 2019

ABSTRACT

PSYCHOMETRIC TOOLS FOR FORMATIVE CLASSROOM ASSESSMENT: TEST CONSTRUCTION AND ITEM POOL DESIGN BASED ON COGNITIVE DIAGNOSTIC MODELS

By Jiahui Zhang

This thesis is concerned with the potential applications of cognitive diagnostic models (CDMs) with hierarchical attributes in supporting formative classroom assessments. The conventional CDM approach, which requires large sample sizes, is impractical in the classroom setting. There are three CDM-based approaches that do not involve item calibration and thus are practical in the classroom setting: 1) CDM classifications using non-adaptive tests assembled from a calibrated item pool, 2) nonparametric classifications using non-adaptive tests based on CDMs, and 3) computerized adaptive testing (CAT) combined with CDMs (i.e., CD-CAT). Since most CDMs and their applications assume independent attributes, relevant model parameterizations and the Q-matrix for hierarchical CDMs were discussed. Three studies were conducted to address the test construction and item pool design issues related to the three CDM-based approaches. Specifically, new indices based on the Kullback-Leibler information are proposed for non-adaptive test construction with a calibrated item pool. Different Q-matrix designs were explored for nonparametric classifications, and recommendations regarding the Q-matrix design were provided for teachers. For CD-CAT, an item pool design method based on simulation was proposed and evaluated. The intended contribution of the thesis consists of psychometric tools that help teachers facilitate formative assessments in the classroom and instrumental guidelines for developers of formative assessment systems.

Copyright by JIAHUI ZHANG 2019

To my grandpa

ACKNOWLEDGEMENTS

I would like to thank my advisor and chair of my dissertation committee, Dr. William Schmidt, for his guidance and support. His great insight into education has illuminated my graduate study and will continue to guide me in my future career. I would also like to thank Dr. Richard Houang, Dr. Tenko Raykov, and Dr. Amelia Gotwals for serving on my committee and offering constructive feedback on my proposal and dissertation draft. The idea for this dissertation was born and developed in the many inspiring conversations I had with my mentors, Dr. Richard Houang and Dr. Leland Cogan. I have benefited tremendously from their knowledge and insight. I am grateful to my family for their unconditional support and trust. Special thanks go to my husband, Qian Xu, who has made great sacrifices to support my pursuit of knowledge. I would also like to acknowledge my adviser and mentor at Beijing Normal University, Dr. Tao Xin, who is like family to me. He led me into the field of educational measurement and has always guided me in the right direction. I would also like to thank many other friends and colleagues from Michigan State University, Beijing Normal University, NWEA, and ACT, who supported me along this arduous journey.

TABLE OF CONTENTS

LIST OF TABLES ................................................................................................
viii LIST OF FIGURES ................................ ................................ ................................ ............................ x Chapter 1 Introduc tion ................................ ................................ ................................ .................... 1 1.1 Psychometric solutions for formative classroom assessment ................................ .................. 5 1.2 Related concepts ................................ ................................ ................................ ......................... 7 1.2.1 External and classroom assessment ................................ ................................ ..... 10 1.2.2 Summative and formative assessment ................................ ................................ . 11 1.2.3 Domain - referenced and norm - referenced testing/interpretations ...................... 12 1.2.4 Curriculum - based assessment ................................ ................................ .............. 14 1.2.5 Next - generation assessment ................................ ................................ ................. 15 Chapter 2 Literature review of CDM - based approaches ................................ ............................ 16 2.1 CDM ................................ ................................ ................................ ................................ .......... 16 2.1.1 Attributes ................................ ................................ ................................ ............... 17 2.1.2 Attribute profile space of hierarchical attributes ................................ ................. 24 2.1.3 Q - matrix ................................ ................................ ................................ ................. 29 2.1.4 Item response models and calibration methods ................................ ................... 32 2.1.5 Classification methods ................................ ................................ .......................... 34 2.1.6 Q - matrix design ................................ ................................ ................................ ..... 36 2.1.7 Criteria for test construction ................................ ................................ ................. 38 2.2 Nonparametric classification based on CDM conception ................................ ...................... 41 2.2.1 The nonparametric (NPC) method ................................ ................................ ....... 41 2.2.2 The general nonparametric classification (GNPC) method ................................ 43 2.3 CD - CAT ................................ ................................ ................................ ................................ .... 44 2.3.1 From IRT - based CAT to CD - CAT ................................ ................................ ...... 44 2.3.2 Item selection methods for CD - CAT ................................ ................................ ... 45 2.3.3 Item pool design ................................ ................................ ................................ .... 47 Chapter 3 CDM parameterization and Q - matrix with hierarchical attributes ........................... 51 3.1 Introduction ................................ ................................ ................................ ............................... 51 3.2 Attribute hierarchies ................................ ................................ ................................ ................. 
51 3.3 Parameterizations of hierarchical CDMs ................................ ................................ ................ 54 3.4 Q - matrix of hierarchical CDMs ................................ ................................ ............................... 57 3.4.1 Reduced or full Q - matrix ................................ ................................ ...................... 57 3.4.2 Complete Q - matrix for hierarchical attributes ................................ .................... 63 3.5 Summary ................................ ................................ ................................ ................................ ... 64 Chapter 4 Conditional KLI - based indexes for hierarchical CDMs ................................ ............ 66 4.1 Introduction ................................ ................................ ................................ ............................... 66 4.2 Conditional KL indices for test construction ................................ ................................ .......... 68 4.3 Simulation design ................................ ................................ ................................ ..................... 71 vii 4.4 Simulati on results ................................ ................................ ................................ ..................... 72 4.5 Discussion ................................ ................................ ................................ ................................ . 88 Chapter 5 Q - matrix design for nonparametric classifications with hierarchical attributes ...... 92 5.1 Introd uction ................................ ................................ ................................ ............................... 92 5.2 Ties in NPC ................................ ................................ ................................ ............................... 93 5.3 Simulation design ................................ ................................ ................................ ..................... 94 5.4 Simulation results ................................ ................................ ................................ ..................... 96 5.5 Discus sion ................................ ................................ ................................ ............................... 111 Chapter 6 Item pool design for CD - CAT ................................ ................................ .................. 113 6.1 Introduction ................................ ................................ ................................ ............................. 113 6.2 Method for CD - CAT item pool design ................................ ................................ ................. 114 6.2.1 The minimum optimal pool ................................ ................................ ................ 114 6.2.2 The minimum p - optimal pool ................................ ................................ ............. 116 6.3 Simulation design ................................ ................................ ................................ ................... 117 6.4 Simulation results ................................ ................................ ................................ ................... 118 6.5 Discussion ................................ ................................ ................................ ............................... 120 APPENDIX ................................ ................................ ................................ 
................................ ..... 122 REFERENCES ................................ ................................ ................................ ................................ . 128 viii LIST OF TABLES Table 1: Subsets of attribute hierarchies for 3 - attribute, 4 - attribute, or 5 - attribute conditions ..... 52 Table 2: Expected responses on two items with two independent attributes ................................ .. 55 Table 3: Expected responses on two items with two linear attributes ( ) ........................... 56 Table 4: Expected responses on under an inverted pyramid hierarchy (H3.3) ...... 56 Table 5: Expected responses on under a pyramid hierarchy (H3.4) ...................... 57 Table 6: The expected responses of two groups of attribute p rofiles on and under the DINA model ................................ ................................ ................................ ................................ ................... 59 Table 7: The q - vectors in and their equivalent q - vectors under the DINA model wit h three linear attributes (H3.2) ................................ ................................ ................................ ........................ 59 Table 8: The q - vectors in and their equivalent q - vectors under the DINA model with three inve rted pyramid attributes (H3.3) ................................ ................................ ................................ .... 60 Table 9: The q - vectors in and their equivalent q - vectors under the DINA model with three pyramid attributes (H3.4) ................................ ................................ ................................ ................... 60 Table 10: The q - vectors in and their equivalent q - vectors under the DINA model with four or five attributes ................................ ................................ ................................ ................................ ....... 61 Table 11: The q - vectors in and their equivalent q - vecto rs under the ACDM with three linear attributes (H3.2) ................................ ................................ ................................ ................................ .. 62 Table 12: Distinct q - vectors in a mixed item pool under DINA and ACDM for H3 .2 using the reduced Q - matrix approach ................................ ................................ ................................ ................ 62 Table 13: Expected response vectors given of two Q - matrices ( and ) for the inver ted pyramid (H3.3) under the DINA model ................................ ................................ ............................ 63 Table 14: Expected response vectors given of five q - vectors for independent attributes under ACDM ................................ ................................ ................................ ................................ ................. 64 Table 15: KLI indices and the CCRs for two Q - matrices ................................ ................................ 69 Table 16: Regression estimates a nd for each attribute hierarchy ................................ .............. 73 Table 17: The overall correlation and the correlations for different test lengths between cKLI and the CCR ................................ ................................ ................................ ................................ ............... 73 ix Table 18: Item parameters of five items for H3.2 ................................ ................................ ............ 
91 Table 19: Comparison between two three - item tests in terms of the two indices .......................... 91 Table 20: Hamming distances for with (H3.1) ................................ ......................... 94 Table 21: Hamming distan ces for with (H3.1) ................................ ........ 94 Table 22: Q - matrix designs for the simulation study of nonparametric classifications ................. 95 Table 23: NPC results for H3.1 ................................ ................................ ................................ ......... 98 Table 24: NPC results for H3.2 ................................ ................................ ................................ ......... 99 Table 25: NPC results for H3.3 ................................ ................................ ................................ ....... 100 Table 26: NPC results for H3.4 ................................ ................................ ................................ ....... 101 Table 27: NPC results for H4.1 ................................ ................................ ................................ ....... 102 Table 28: NPC results for H4.2 ................................ ................................ ................................ ....... 102 Table 29: NPC results for H4.3 ................................ ................................ ................................ ....... 103 Table 30: NPC results for H4.4 ................................ ................................ ................................ ....... 103 Table 31: NPC results for H4.5 ................................ ................................ ................................ ....... 104 Table 32: NPC results for H5.1 ................................ ................................ ................................ ....... 105 Table 33: NPC results for H5.2 ................................ ................................ ................................ ....... 106 Table 34: NPC results for H5.3 ................................ ................................ ................................ ....... 107 Table 35: NPC results for H5.4 ................................ ................................ ................................ ....... 108 Table 36: NPC results for H5.5 ................................ ................................ ................................ ....... 109 Table 37: NPC results for H5.6 ................................ ................................ ................................ ....... 110 Table 38: Item distribution for two hypothetical examinees with true attribute profiles of and and the union of the two sets of items ................................ ................. 115 Table 39: Q - vectors for the first item ................................ ................................ .............................. 117 Table 40: The minimum 95 - optimal pools ................................ ................................ ...................... 119 Table 41: Comparis on between the random and designed item pools ................................ .......... 119 x LIST OF FIGURES Figure 1: A complex example of attribute hierarchy in Köhn and Chiu (2018) ............................. 20 Figure 2: Three types of standard relationships in the Common Core Graph (a: the upper panel, b: left bottom panel, c: right bottom panel) ................................ ................................ ........................... 
21 Figure 3: Four hierarchical structures using six attributes (Leighton, Gierl, & Hunka, 2004) ...... 22 Figure 4: Linear, pyramid, inverted pyramid and diamond structures using five attributes (Liu & Huggins - Manley, 2016) ................................ ................................ ................................ ...................... 22 Figure 5: Four types of attribute hierarchies and an independent structure (Tu, Wang, Cai, Douglas, & Chang, 2018) ................................ ................................ ................................ ................................ ... 23 Figure 6: A subset of attribute hierarchies with 3 attributes ................................ ............................ 52 Figure 7: A subset of attribute hierarchies with 4 attributes ................................ ............................ 53 Figure 8: A subset of attribute hierarchies with 5 attributes ................................ ............................ 53 Figure 9: Correct classification rates under two conditions ................................ ............................. 67 Figure 10: A plot for tests with three independent attributes (H3.1) of the combined index with CCRs ................................ ................................ ................................ ................................ .................... 74 Figure 11: A plot for tests with three linear attributes (H3.2) of the combined index with CCRs 75 Figure 12: A plot for tests with three inverted pyramid attributes (H3.3) of the combined index with CCRs ................................ ................................ ................................ ................................ ........... 76 Figure 13: A plot for tests with three pyramid attributes (H3.4) of the combined index with CCRs ................................ ................................ ................................ ................................ .............................. 77 Figure 14: A plot for tests with four independent attributes (H4.1) of the combined index with CCRs ................................ ................................ ................................ ................................ .................... 78 Figure 15: A plot for tests with four linear attributes (H4.2) of the combined index with CCRs . 79 Figure 16: A plot for tests with t hree linear attributes + one single attribute (H4.3) of the combined index with CCRs ................................ ................................ ................................ ................................ . 80 Figure 17: A plot for tests with four invert ed pyramid attributes (H4.4) of the combined index with CCRs ................................ ................................ ................................ ................................ .................... 81 Figure 18: A plot for tests with four pyramid attributes (H4.5) of t he combined index with CCRs ................................ ................................ ................................ ................................ .............................. 82 xi Figure 19: A plot for tests with five independent attributes (H5.1) of the combined index with CCRs ................................ ................................ ................................ ................................ .................... 83 Figure 20: A plot for tests with five linear attributes (H5.2) of the combined index with CCRs . 
84 Figure 21: A plot for tests with five inverted pyramid attributes (H5.3) of the combined index with CCRs ................................ ................................ ................................ ................................ .................... 85 Figure 22: A plot for tests with five inverted pyramid attributes (H5.4) of the combined index with CCRs ................................ ................................ ................................ ................................ .................... 86 Figure 23: A plot for tests with five pyramid attributes (H5.5) of the combined index with CCRs ................................ ................................ ................................ ................................ .............................. 87 Figure 24: A plot for tests with five pyramid attributes (H5.6) of the combined index with CCRs ................................ ................................ ................................ ................................ .............................. 88 Figure 25: The conditional CCRs from four random tests in H4.2 ................................ ................. 90 Figu re 26: Distribution of the number of items for in an example ................................ .. 116 1 Chapter 1 Introduction Assessments are ubiquitous in most education systems. E ducational assessments have the potential to provide feedback . T he positive effect of feedback on learning has long been established in numerous studies in educational psychology, cognitive science, and learning science (e.g., Fyfe & Rittle - J ohns on, 2015; Hattie & Timperley, 2007; Moreno, 2004). Therefore, various types of assessment s have been widely used in schools to improve learning and teaching , which can be classified into summative assessment ( providing a summary evaluation at the end of an educational program ) and formative assessment ( providing timely diagnostic information for learning and teaching during an educational program ) . Despite its potential usefulness in learning, assessment or testing is among the most debated issues in pub lic education. There have been concerns from teachers and parents that tests take up too much time from teaching and learning (Hefling, 2015; Walsh, 2017) . A survey by the Council of the Great City Schools (CGCS) on large urban districts revealed that t he average amount of testing time spent on required assessments among eighth - grade students in the 2014 - 15 school year was 4.22 days or 2.34 % of school time ( Hart et al., 2015 ) . Examples of required assessments in the CGCS report are (i) st ate summative assessments for accountability (e.g., the Partnership for Assessment of Readiness for College and Careers [PARCC] assessments) , (ii) state and local formative assessments, (iii) local end - of - course exams , and (iv) SAT, ACT, and Advanced Placement (AP) tests (optional in some places) . Specific categories of students (including students with disabilities and English language learners) take (v) special assessments in addition to the required and optional tests . Many of the required tests mentioned above are external , high - stake s , and summative measures for accountability purposes , fueled by important educational polic y questions (Baker, 2 Chung, & Cai, 2016) . These tests are not designed for assisting daily classroom learning and teac hing . Even if diagnostic information can be extracted, it would be too late to be useful in the classroom (Hart et al., 2015) . 
Too many such tests would inevitably disrupt the learning process and may lead to problems such as teaching to the test (e.g., Copp, 2018) and test anxiety (e.g., Schutz & Pekrun, 2007, p. 3), both of which result from the misuse and abuse of educational assessments. To address this issue, the U.S. Department of Education called on states to make assessments fewer and smarter in the Testing Action Plan (U.S. Department of Education, 2015). The plan calls for more classroom, low-stakes, and formative tests that are smart enough to provide timely feedback for learning and teaching, and for fewer external, high-stakes, and summative tests. We are entering a new era of K-12 assessments in which both accountability and instructional improvement are emphasized (Chang, 2012) and, correspondingly, both summative and formative educational assessments are required. Research topics in the psychometric community echo the change in educational policies: assessment for learning has become popular as researchers emphasize making assessment truly useful for learning (e.g., Bennett, 2011; Wilson, 2018). If tests are designed to produce feedback for learning and teaching and eventually to integrate with the learning process, some problems of educational tests, including disrupting the learning process and teaching to the test, may be solved. Renewed attention has been brought to the old concepts of classroom assessment and formative assessment (e.g., Bennett, 2015; Black & Wiliam, 2008; Gotwals, 2018; Shepard, 2018). Classroom assessment refers to assessment taking place in the classroom and initiated by the teacher (Shepard, 2006; Wilson, 2018). Formative assessment is designed to provide timely and constructive feedback that is closely connected to a curriculum and is based on students' learning history. It should be a thoughtful integration of the process to provide feedback and the appropriate measurement instrument or methodology (Bennett, 2011). This thesis concerns formative assessment in the classroom, henceforth referred to as formative classroom assessment. A huge responsibility for implementing formative classroom assessments lies on the shoulders of the teachers. Specifically, teachers need to take two iterated actions that are at the core of formative assessment: one is the identification of the gap between the desired goal and the student's current status, and the other is the action taken to close the gap (Black & Wiliam, 1998). Identifying the gap is a measurement issue per se because the gap is the difference between two quantities that must be measured: the desired goal and the student's current status. However, many teachers do not feel adequately prepared for this assessment task (Mertler, 2003). Despite the increasing emphasis on educational measurement in policies and research, in some states preservice teachers are not required to take specific coursework in classroom assessment or educational assessment in general (Campbell, 2013). As a result, teachers' formative assessment practices are not without struggles (Black & Wiliam, 1998; Gotwals, 2018). There is a gap between policy and research on one side and teachers' classroom practice on the other. Although formative assessment is an attractive concept, the effectiveness of formative assessment hinges on its quality, not on its mere existence in the classroom (Black & Wiliam, 1998). As it takes time and resources to improve teacher preparation and professional development in assessment, there is an urgent need now to provide teachers with psychometric tools to facilitate formative assessment in the classroom.
Teachers especially need assistance in constructing and delivering formative assessments as well as interpreting the results (Bennett, 2015; Campbell, 2013; Gotwals, 2018). Psychometric tools, which have guided and supported most standardized testing programs, can also help with constructing, delivering, and interpreting formative assessments if used appropriately (Bennett, 2011; Bennett, 2015). Note that the use of psychometric tools, especially item response models, inevitably introduces some degree of standardization. Ideally, the teacher would develop his or her own formative assessment because it is the teacher who knows best the learning history of each student and the learning goals. A teacher-developed assessment is the exact opposite of standardization. With limited educational resources, therefore, we need to strike a balance between individualization and standardization when thinking of psychometric tools for formative classroom assessment. In choosing appropriate psychometric tools (e.g., item response models) for formative classroom assessment, the best place to start is validity, which mainly depends on the usefulness of the feedback for formative purposes. Therefore, the first question we should ask is: What kind of feedback do teachers need? The needs of teachers were reflected in a survey conducted on a nationally representative sample of 400 elementary and secondary mathematics and English language arts teachers in the U.S. about a decade ago (Goodman & Huff, 2006; Huff & Goodman, 2007). The survey shows that norm-referenced information, standards-based information, and performance information at the item level from large-scale standardized assessments are of comparatively little interest to teachers because such information cannot be used directly in instruction; what teachers need is detailed information about the strengths and weaknesses of individual students regarding specific knowledge, skills, and competencies. Various methods have been proposed for providing diagnostic feedback. Some approaches involve extracting information from summative tests based on and calibrated with unidimensional item response theory (IRT) models (e.g., subscores; see Haberman, 2008). However, some researchers caution that each purpose can be compromised if a single assessment is expected to serve multiple purposes (Pellegrino, Chudowsky, & Glaser, 2001, p. 2; Reckase, 2017). Although unidimensional IRT models have been successfully applied in summative tests aimed at selecting and differentiating examinees, they might not be the most appropriate ones for formative purposes because the diagnostic nature of formative assessment usually suggests multidimensionality.

1.1 Psychometric solutions for formative classroom assessment

A family of measurement models, cognitive diagnostic models (CDMs; e.g., Rupp, Templin, & Henson, 2010), which were developed for modeling diagnostic assessment data, is chosen for formative classroom assessment in this thesis. These models target multiple fine-grained latent constructs (referred to as attributes) that are typical in interim or formative assessments. With categorical latent variables, they are less affected by high dimensionality than multidimensional IRT (MIRT) models and are more appropriate for finer-grained constructs than MIRT models (Templin & Bradshaw, 2013).
The identification of these finer-grained constructs as well as their relationships is often based on cognitive or learning theories and requires collaboration between psychometricians and content experts. This construct space is similar to the concept of a domain in domain-referenced testing (Hively, 1974; Houang, 1980). The assessment developed based on CDMs can be integrated with the learning process through these constructs. Therefore, CDMs have the potential to be an essential part of the solution for formative classroom assessment. Specifically, this thesis concerns formative classroom assessment that (i) can be linked to an instructional program lasting for several weeks and (ii) can provide formative information for learning and instruction. The underlying measurement models are CDMs. Note that the assessment of interest does not intend to measure relatively stable traits such as ability or aptitude. Instead, the targeted construct is the internalized knowledge or skills that the student acquires after a particular period of instruction. Although current CDM methods (i.e., calibration and classification) work well in large-scale assessments with hundreds or thousands of examinees and long tests, the application of CDMs in small-scale test settings in the classroom would be problematic due to limited testing time and the lack of response data required for reliable estimation (Chiu, Sun, & Bian, 2018). There are three alternatives to conventional CDM analysis that do not require item calibration and are therefore practical in the classroom setting: 1) parametric classifications using non-adaptive tests assembled from a calibrated item pool (e.g., Henson & Douglas, 2005), 2) nonparametric classifications using non-adaptive tests based on CDMs (e.g., Chiu, Sun, & Bian, 2018), and 3) cognitive diagnostic computerized adaptive testing (CD-CAT; e.g., Chen, 2009). The first two approaches use non-adaptive tests, which means the same test is given to all students in a classroom, so test construction is a critical question. The CD-CAT approach uses adaptive tests that are tailored to the state of individual students, the success of which depends on a well-designed item pool. How to design an appropriate item pool for a CD-CAT program remains a research question. Responding to practical needs and gaps in the literature, this thesis addresses the test construction and item pool design issues for these three approaches. These CDM-based approaches are intended to facilitate formative classroom assessment, which is related to domain-referenced testing and curriculum-based assessment. Therefore, the rest of Chapter 1 reviews these related concepts as well as the broader concept of educational assessment and the so-called next-generation assessment. The next chapter reviews the fundamentals and previous studies of the three CDM-based approaches with a focus on CDMs with hierarchical attributes. Chapter 3 deals with parameterizations and Q-matrices of CDMs with hierarchical attributes, followed by three chapters addressing three research questions related to the test construction or item pool design issues.

1.2 Related concepts

Formative classroom assessments belong to the broader concept of educational assessment or achievement assessment. The terms educational assessment and achievement assessment have been used interchangeably in the literature.
More specifically, Mislevy, Steinberg, and Almond (2003) in their seminal work on assessment design defined an educational assessment to be " a machine for r easoning about what students know, can do, or have accomplished, based on a handful of things they say, do, or make in particular settings. " Baker, Chung, and Cai (2016) offered a broader construction : A test or an assessment consists of a systematic meth od of gaining knowledge, characteristics, or propensities. The definition of Mislevy et al. (2003) focuses on the types of inferences made from the assessment , and the definition of Baker et al. (2016) also highlights the process of making inference s (i.e., via sampling ) in educational assessment . The history of educational assessment has been inter t wined with t hat of psychological assessment . Their connection can be seen from the title of the Standards for Educational and Psychological Testing (AERA, American Psychological Association [APA], & National Council on Measurement in Education [NCME], 1985, 1999, 2014 ) as well as journals and books (e.g., Educational and Psychological Measurement ). The first generation of standardized achievement tests w as developed in the same period and by the same researchers as IQ tests were ( Sheperd, 2006 ). As a result, educational assessments and psychological assessments tend to have the same 8 item formats and often utilize the same statistical models (e.g. , item response theory models), with both hav ing roots in individual difference s psychology . In this section, the di scussion is limited to IRT - based assessment because most large - scale or commercial achievement tests (e.g., PARCC, NAEP, PISA, SAT, ACT) use IRT models. M ore and mo r e researchers in the educational assessment field , however, hav e realized the critical differences between educational and psychological assessments despite their entwined histories . Among the most discussed issues is t he definition of the measured domain , the stability of the unobserved construct s , the dimensionality of the construct space , the normal ity assumption , and the purpose of assessment . The unobserved construct s measured in psychological assessments are usually not well - defined. As noted by Brody (2000, p.39) , researchers know how to measu re the construct called intelligence, but they still do not know what has been measured ; w hat the IQ test does , as a result, is merely trying to differentiate people along a hypothetical scale. In some sense, t he test that is supposed to measure intelligen ce defines what intelligence is . T his i s not true in education where domains could be well defined according to the instructional goals of a specific instruction al program . However, the measured domains are not well delineated for some educational tests (B aker, 2009). In such cases, it can be said that we know how to measure achievement, but we do not know what has been measured , particularly, if and when educational assessments follow the tradition of psychological measurement . The unobserved construct s in psychological assessment s are usually stable traits, such as intelligence, self - efficacy , or personality . These traits are assumed , or believed, to remain stable for a n extended period. The purpose of psychological assessments is to reflect t he relative location of a person regarding this latent trait , and improvement or change within a short period is not 9 expected (Baird, Andrich, Hopfenbeck, & Stobart , 2017). 
However, examinees in educational assessments are expected to show change s in their educational attributes and accomplishments within a short period, which is the primary purpose of any educational program. The existence of content blueprints complicates the definition of the unobserved constructs in educational assessment. Unlike a psyc hological test, an educational test is usually developed based on a content blueprint (Luecht, 2013; Reckase, 2017). A content blueprint is usually constructed as a set of test specifications that is independent of the psychometric modeling of test responses (Luecht, 2013). However, a test blueprint with multiple content domains may suggest , and be consistent with, a multidimensional space ( Reckase, 2017 ). Besides content dimensions, cognitive dimensions have also been considered for educational asse ssments , which further complicate s the dimensionality issue ( George & Robitzsch , 2018 ; Harks, Klieme, Hartig, & Leiss, 2014 ). In an analysis of TIMSS data, content dimensions are number, geometry , and data, and cognitive dimension are knowing, reasoning, a nd applying ( George & Robitzsch , 2018) . For most of the commercial achievement tests, the interpretation of a test score is directly based on the assumed normal distribution of underlying stable psychological characteristics (Baker, 2009). Th is normal ity assumption is another inheritance e ducational measurement inherited from the psychological measurement under the general framework of latent variable modeling ( Baker & Kim, 2004 ) . Consistent with the interpretation of scores , a normal distribution is usually assumed in IRT modeling for the unobserved construct . Specifically , the normal distribution is used (i) in the integration step in item calibration and (ii) as a prior distribution in Bayesian IRT - based scoring (Baker & Kim, 2004) . While the normal ity assumption may work well for a variety of stable psychological traits (e.g., intelligence, self - efficacy), whether it is suitable for the 10 measurement of learning or mastery of educational attributes is questionable ( Bloom, 1968; Baker, 2009). Educational assessment designers, following the guidelines developed for psychological assessments, tend to optimize the test for detecting differences among examinees . It would work well if the goal is selecti on . However, the test development guidelines may need some adaptations whe n we consider the purpose of improving student learnin g because the differences between different test scores could be trivial regarding the subject matter (Bloom, 1968). One characteristic of ed ucational assessments that is different from psychological assessments , however, is the existence of many dichotomies , such as classroom assessment versus external tests, formative versus summative assessment, domain - referenced (or criterion - referenced) ve rsus norm - referenced testing (assessment) . 1 . 2.1 External and c lassroom assessment External assessments are constructed outside of the classroom by measurement and subject experts and are often fueled by educational policies (Baker, Chung & Cai, 2016 ) , also referred to as the large - scale standardized assessment s . There is a rich literature on the theories and practices of external assessments. They have served well the purpose of selection and ac countability over the past decades. However, t he effects of external assessments on learning are difficult to establish (Wilson, 2018) . 
Educational assessments can be divided into classroom assessments and external assessments, depending on the administration of the assessments. Teachers usually create and grade classroom assessments based on particular instructional goals, and they make short-term decisions based on the assessment results (Hanna & Dettmer, 2004, p. 8). Classroom assessments may also be developed outside of the classroom but initiated by teachers or students in the classroom. Classroom assessments, when used in a constructive way by teachers, can send the message to students about what is important (Nitko, 2001), and they have been shown to have a substantial impact on student success (Shepard, 2006; Wilson, 2018). Some researchers believe that we can make measurement truly important for education through classroom assessments (Wilson, 2018).

1.2.2 Summative and formative assessment

The dichotomy of formative versus summative assessment was proposed decades ago. While great improvement has been seen in the practices and research of summative assessment over the past few decades, formative assessment mostly appears as the subject of theoretical discussion (Scriven, 1967; Bloom, 1968; Bloom, Hastings, & Madaus, 1971). Scriven (1967) and Bloom (1968) were among the first to use the terms formative evaluation and summative evaluation. A summative evaluation judges what students have mastered at the end of an educational program (Bloom, 1968). Defining formative assessment, however, can be much more complicated: there has been debate over the conceptualization of formative assessment as a test or as a process (Bennett, 2011). For Bennett (2011), neither side of the argument can provide a full picture of formative assessment: he defined formative assessment to be a thoughtful integration of process, on the one hand, and methodology or instrumentation, on the other hand. Other researchers put more emphasis on the process part (e.g., Furtak, Circi, & Heredia, 2018; Gotwals, 2018). Recently, formative assessment has been receiving renewed attention (Bennett, 2011, p. 5). Since formative assessments generally take place in the classroom as a type of classroom assessment, teachers need to take on many responsibilities. However, it remains a challenging task for teachers to learn how to do formative assessment (Bennett, 2011; Furtak, Circi, & Heredia, 2018; Gotwals, 2018; Shavelson, 2008). Teachers need guidance and assistance in various aspects of assessment, including goal setting, extracting information, providing feedback, and using feedback to modify instruction (Gotwals, 2018, p. 157). Bennett (2011, p. 18) argued that teachers need deep cognitive-domain understanding and knowledge of measurement fundamentals in addition to pedagogical knowledge in order to realize effective formative assessment. However, even if teachers can acquire all the knowledge, understanding, and skills needed for formative assessment, they still need a substantial amount of time to put them into practice (Bennett, 2011).

1.2.3 Domain-referenced and norm-referenced testing/interpretations

Another well-known contrast in educational measurement is between domain-referenced (or criterion-referenced) testing and norm-referenced testing (Hively, 1974). Norm-referenced testing (NRT) has its roots in the psychological measurement of individual differences. NRT goes hand in hand with latent trait modeling (Hively, 1974; Houang, 1980).
Test construction for NRT based on latent trait modeling places great emphasis on correlation, or the so-called internal consistency, among a set of items, which plays a significant role in decisions about including or excluding certain items (Hively, 1974; Houang, 1980). However, this test construction procedure may pose a danger to the validity of measurement because 1) variables that are conceptually disconnected can be correlated (Baird et al., 2017), and 2) the obtained set of items may not be a representative sample from the targeted domain (Houang, 1980). Domain-referenced testing (DRT), in contrast, is grounded more firmly in educational considerations. More emphasis is placed on validity than on reliability. Much research is devoted to the discussion of the domain and of item sampling within the domain (Baker, 1974; Hively, 1974; Millman, 1974). A domain can be defined by an explicitly specified set of items (Hively, 1974) or by a set of rules according to which a large number of test items could be generated (Baker, 1974). A complex domain can be divided into sub-domains. The measurement of principal interest in DRT is the examinee's score over all items in a domain or sub-domain (Brennan, 1981; Hively, 1974). This score, referred to as the domain score (or the sub-domain score), cannot be obtained directly because it is impossible to administer all the items in the domain (or sub-domain). It can be estimated by the examinee's observed percentage of correct responses on a set of items if the set is a representative sample (Brennan, 1981). Estimates for large domains may be obtained by stratified sampling over their constituent sub-domains, and diagnostic profiles may be gathered by sampling within sub-domains (Hively, 1974). IRT-based estimators are available for domain or sub-domain scores, given a large set of calibrated items (Bock, Thissen, & Zimowski, 1997). For a complicated domain, the set of sub-domain scores serves as a diagnostic profile (Hively, 1974); alternatively, one can assign weights to the sub-domain scores to calculate a single domain score (Millman, 1974). The estimated domain or sub-domain scores are then compared to some criterion to decide whether mastery has been achieved. In contrast to these two-stage methods, Houang (1980) took a latent class approach to estimate the mastery of a simple domain. The concept of DRT as an assessment type lost its popularity after the 1970s. Since the 1974 Standards for Educational and Psychological Tests, the distinction between two types of test score interpretations, domain-referenced (or criterion-referenced) and norm-referenced interpretations, has received more attention. Instead of differentiating two types of assessments (i.e., NRT and DRT), test developers draw from both test development perspectives to ensure the reliability and validity of measurement (Brennan, 2006). Although most standardized testing programs are designed primarily to provide norm-referenced interpretations, there has been an increasing need for domain-referenced or criterion-referenced interpretations.

1.2.4 Curriculum-based assessment

Educational assessments may or may not be based on a specific curriculum. To be useful for learning, however, assessment needs to be integrated into a coherent process of assessment, instruction, and curriculum based on learning theories (Black, Wilson, & Yao, 2011; Shepard, Penuel, & Pellegrino, 2018).
This is especially true for formative classroom assessment. If the assessment is not aligned with the curriculum that students are learning, the validity of the formative feedback will be in doubt. A link between curriculum and achievement assessment has been well established in the international assessments led by the International Association for the Evaluation of Educational Achievement (IEA). The curriculum-achievement alignment constitutes a vital part of the validity evidence for the subject achievement tests. The validity check (comparing assessment items with the curriculum students have experienced) has been carried out in some form in all IEA studies (Cogan & Schmidt, 2019). For example, teachers provided validity checks on the test items in the pilot study and the First International Mathematics Study (FIMS) and in the second studies, SIMS and SISS (Husén, 1967a; Keeves, 1974; Travers & Westbury, 1989). The 1995 Third International Mathematics and Science Study (TIMSS-95) conducted a more extensive curriculum analysis and provided evidence for the relationship between assessment, instruction, and curriculum (Schmidt & McKnight, 1995; Schmidt, Jorde, et al., 1996; Schmidt, McKnight, Valverde, Houang, & Wiley, 1996). A curriculum is structured around subject content. Taking the subject of mathematics as an example: mathematics, even circumscribed by what is taught in school, encompasses a very large content domain. The question is then how to model curriculum-sensitive content in the psychometric model for curriculum-based assessment. Under the typical unidimensional IRT modeling framework, content exists in the form of content constraints, independent of the measured construct (e.g., Kingsbury & Zara, 1991; van der Linden, 2005a). The separation of the measured construct and the curriculum-sensitive contents makes it difficult, if not impossible, to extract formative feedback from the test data regarding the contents.

1.2.5 Next-generation assessment

Since the start of the new millennium, there has been increasing discussion of the so-called next-generation assessment in the educational measurement community. With next-generation assessment, researchers and measurement practitioners attempt to respond to the critiques of educational measurement mentioned earlier and to the needs of learners, parents, and teachers (e.g., Bennett, 2011; Conley, 2018; Embretson, 2003; Heritage, 2010). A lengthy, but not exhaustive, list of next-generation assessment topics includes formative assessment (e.g., Gorin & Mislevy, 2013; Heritage, 2010), assessment of new constructs such as critical thinking (e.g., Liu, Frankel, & Roohr, 2014), technology-based assessment (e.g., Beatty & Gerace, 2009; Bennett, 2015; Mislevy, 2016), classroom assessment (e.g., Shepard et al., 2018), personalized testing and learning (e.g., Chen, Li, Liu, & Ying, 2018; Clark, 2016), integration of learning and assessment (e.g., Baird et al., 2017), and automatic item generation and scoring (e.g., Bennett, 2015; Gierl & Lai, 2012).

Chapter 2 Literature review of CDM-based approaches

This chapter provides brief literature reviews of the basics of CDMs, nonparametric classifications based on CDMs, and CD-CAT, which form the foundations of the three CDM-based approaches for formative classroom assessment proposed in Chapter 1.
The CDM-based test construction begins with the identification of the attribute profile space and of the Q-matrix characterizing the relationship between items and attributes (described in detail later in this chapter). The attribute profile space defines the domain in the language of domain-referenced testing. Test construction based on CDMs has many similarities with domain-referenced testing (Hively, 1974; Houang, 1980). The identification of the relationships between attributes and items usually depends on cognitive theories and learning theories. In this way, the assessment can be integrated with the learning process.

2.1 CDM

CDMs (cognitive diagnostic models), also known as diagnostic classification models, belong to the confirmatory or constrained latent class modeling framework, in which individuals are classified into groups defined by combinations of categorical (usually binary) latent variables (Rupp, Templin, & Henson, 2010). The categorical unobserved variables that define the measurement constructs underlying a CDM are often referred to as attributes (Tatsuoka, 1983, 1990), elsewhere called finer-grained proficiencies (de la Torre & Karelitz, 2009) or facets (Henson, DiBello, & Stout, 2018). Macready and Dayton (1977) and Houang (1980) were among the first to apply latent class models using only one dichotomous trait to measure mastery of a simple domain. Later, the works of Tatsuoka (1983) and Leighton, Gierl, and Hunka (2004) involved more complex domains with multiple attributes, and they introduced the concepts of the Q-matrix and the attribute hierarchy. In the past three decades, a large number of CDMs that employ item response functions (IRFs) and explicit Q-matrices have been proposed and studied intensely (Rupp, Templin, & Henson, 2010; Templin & Bradshaw, 2014) in response to the pressing demand for individualized diagnostic information in education (Center for K-12 Assessment and Performance Management at ETS, 2014; U.S. Department of Education, 2014).

2.1.1 Attributes

Since the introduction of attributes to diagnostic assessment by Tatsuoka (1983, 1990), the terminology of attributes has been used in the CDM literature to refer to the unobserved variables that the test aims to measure. Long before the time of diagnostic assessment, Guttman used the term attribute to refer to a qualitative (i.e., categorical) variable. In the diagnostic assessment literature, attributes have been described in terms of cognitive operations or item types, or, more generally, viewed as "sources of cognitive complexity" in test performance, which may consist of both cognitive and content components. Leighton, Gierl, and Hunka (1999) defined attributes as the procedural or declarative knowledge needed to perform a task in a specific domain. Most of the above definitions include both cognitive and content components. In an educational setting, possessing an attribute is often referred to as mastery of the attribute, and lacking an attribute is referred to as non-mastery (Templin & Bradshaw, 2014). Like most CDM research, we restrict the scope of this thesis to attributes with two levels, so that α_k = 1 indicates mastery of attribute k and α_k = 0 indicates non-mastery of this attribute. An attribute profile (Templin & Bradshaw, 2014), which is also referred to as an attribute pattern (Ma, Iaconangelo, & de la Torre, 2015) or an attribute mastery pattern (Henson & Douglas, 2005), is a specific combination of attribute mastery and non-mastery, with each combination representing a unique latent class of examinees.
Attribute profiles are denoted by column vectors α = (α_1, α_2, ..., α_K)′, where α_k = 0 or 1 indicates the absence or presence, respectively, of the k-th attribute (non-mastery vs. mastery), and the superscript ′ denotes transpose.

2.1.1.1 Interaction among attributes in an item

CDMs can be categorized as noncompensatory or compensatory models based on their assumptions about how attributes interact with each other to affect the probability of an item response. According to DiBello, Roussos, and Stout (2006), a noncompensatory (or conjunctive) model assumes that lacking competency on any required attribute poses a severe obstacle to successful performance on the task. In other words, successful performance on a task requires mastery of all the required attributes; mastery of some of the required attributes does not compensate for non-mastery of the other required attributes. The terms conjunctive model and noncompensatory model are often used interchangeably. In contrast to the noncompensatory case, a compensatory interaction of attributes means that mastering one required attribute can compensate for non-mastery of other required attributes. An extreme case of compensatory models is a disjunctive model, in which mastering any subset of the required attributes leads to an equally high probability of a correct response (DiBello, Roussos, & Stout, 2006).

2.1.1.2 Interdependencies among attributes

Most CDMs assume independent attributes (Rupp et al., 2010). Nevertheless, there are cases in which data analysis suggested the presence of interdependencies among attributes (Templin & Bradshaw, 2014). To account for the relationships between attributes, de la Torre and Douglas (2004) proposed a higher-order model linking the categorical attributes to an underlying multivariate normal distribution. The interdependencies among attributes are reflected in the correlated dimensions of the multivariate normal distribution. Another approach to modeling the attribute relationships is to impose a hierarchical structure, in which mastering one attribute can be a prerequisite to mastering another attribute (Leighton et al., 2004; Tatsuoka, 2009; Templin & Bradshaw, 2014). This thesis adopts the hierarchical approach, which is reviewed in more detail below.

A hierarchy of attributes specifies the relationship between each pair of attributes. For attribute a and attribute b, if mastering attribute b requires having mastered attribute a (that is, α_b = 1 implies α_a = 1), attribute a is called a prerequisite of attribute b. Suppose there are three attributes in a linear relationship. We then have that α_2 = 1 implies α_1 = 1, that α_3 = 1 implies α_2 = 1, and, by transitivity, that α_3 = 1 implies α_1 = 1. Attribute hierarchies are often visualized by a tree graph with a set of attributes connected by arrows. An arrow that points from attribute a to attribute b means that mastering attribute a is a prerequisite to mastering attribute b (Gierl, Leighton, & Hunka, 2000; Köhn & Chiu, 2018; Leighton et al., 2004). Attribute a is the lower-level attribute and attribute b is the higher-level attribute in this case. These pair-wise prerequisite relationships can be formally defined by a K-by-K binary matrix called the adjacency matrix (A-matrix), in which K is the number of attributes (Tatsuoka, 1983, 2009; Gierl et al., 2000). The A-matrix represents the direct relationships among attributes, which are usually illustrated by one-way arrows. The (k, k′)-th element of the A-matrix indicates whether attribute k is directly connected, as a prerequisite, to attribute k′. The diagonal elements of the A-matrix are zeros.
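To make the A-matrix concrete, the following sketch (Python with NumPy; the three-attribute linear hierarchy and all variable names are illustrative assumptions, not taken from the studies cited above) encodes a linear hierarchy as an A-matrix and derives from it the direct and indirect prerequisite relations discussed next.

import numpy as np

# Adjacency (A) matrix for three attributes in a linear hierarchy:
# attribute 1 -> attribute 2 -> attribute 3.
# A[j, k] = 1 means attribute j is a direct prerequisite of attribute k;
# the diagonal is zero.
A = np.array([[0, 1, 0],
              [0, 0, 1],
              [0, 0, 0]])

def reachability(A):
    # R[j, k] = 1 if attribute j is a direct or indirect prerequisite
    # of attribute k (a transitive closure of A).
    K = A.shape[0]
    R = (A > 0).astype(int)
    for _ in range(K):                      # enough passes to absorb all paths
        R = ((R + R @ A) > 0).astype(int)   # add paths that are one arrow longer
    return R

print(reachability(A))
# [[0 1 1]
#  [0 0 1]
#  [0 0 0]]  attribute 1 is a (direct or indirect) prerequisite of attributes 2 and 3

The printed matrix collects both the direct relations stored in the A-matrix and the indirect relations that arise from chaining arrows.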
The following is an example of a complex hierarchy from Köhn and Chiu (2018), together with its 11-by-11 A-matrix (Figure 1).

Figure 1: A complex example of an attribute hierarchy in Köhn and Chiu (2018)

In an attribute hierarchy, there are direct and indirect relationships. A direct relationship is characterized by a one-way arrow pointing from one attribute to another. An indirect relationship holds between two attributes that are connected through one or more intermediate attributes and the arrows between them. If compared to a road map, an attribute hierarchy consists of at least one path of attributes, where a path is defined as a subset of attributes connected by one-way arrows. A complex attribute hierarchy such as the one in Figure 1 has more than one path. For any hierarchy of $K$ attributes, the longest path involves at most $K$ attributes and has at most $K - 1$ arrows; the maximum is reached when the attributes form a linear hierarchy. Note that some attributes appear in the same path while others do not share a common path; in the hierarchy of Figure 1, several pairs of attributes do not share any common path.

Prerequisite relationships between attributes are quite common in content standards for mathematics. As shown in the map of the College- and Career-Ready Standards (CCRS, formerly called the Common Core State Standards), content standards do not stand alone but form a complicated network (Zimba, 2011, 2015). Some standards form a linear structure, with one standard being the prerequisite of another (Figure 2a). Some standards serve as prerequisites for several other standards (Figure 2b). There are also standards that build on several other standards (Figure 2c).

Figure 2: Three types of standard relationships in the Common Core graph (a: upper panel, b: bottom left panel, c: bottom right panel)

However, attribute hierarchies have long been underrepresented in the CD literature, and related studies have begun only recently (e.g., Templin & Bradshaw, 2014). Research on hierarchical attributes has focused on hypothesis testing of an assumed attribute hierarchy (Templin & Bradshaw, 2014) and on model estimation (Tu et al., 2018). When an attribute hierarchy is shown to be present, it is recommended to incorporate this information in the modeling process by reparameterizing the original model and excluding certain attribute profiles (Templin & Bradshaw, 2014; Tu et al., 2018). Hierarchies that have been used in simulation studies are summarized below. Leighton et al. (2004) proposed four types of attribute hierarchies, which have been adopted in many studies: linear, divergent, convergent, and unstructured hierarchies, as illustrated in Figure 3. Liu and Huggins-Manley (2016) renamed the unstructured hierarchy and the convergent hierarchy of Leighton et al. (2004) as the inverted pyramid and the diamond hierarchy, respectively, and replaced the divergent hierarchy with the pyramid hierarchy (Figure 4). Tu et al. (2018) added a mixed type to the list, which is a combination of two hierarchies (Figure 5).
Figure 3 : Four hierarchical structures using six attributes (Leighton, Gierl, & Hunka, 2004) Figure 4 : Linear, pyramid, inverted pyramid and diamond s tructures using five attributes (Liu & Huggins - Manley, 2016 ) 23 Figure 5 : Four types of attribute hierarchies and an independent structure (Tu, Wang, Cai, Douglas, & Chang, 2018) Note that a pyramid (e.g., Liu & Huggins - Manley, 201 6 ) or a convergent (e.g., Tu, et al., 2018) hierarchy comes with an implicit assumption that all prerequisite attributes must be mastered so that the mastery of the higher - level attribute can be possible. In application studies of CDMs with hierarchical attributes, the most commonly seen hierarchy is the linear hierarchy ( Gierl, Wang, & Zhou, 2008; Gierl, Alves & Majeau, 2010 ) . To get an idea of the hierarchical relationships in real classroom instruction, two CCRS - aligned textbooks for Grade 4 math , Eureka Math (2015) and Engaged NY (2014) , were analyzed . The content structures of the textbooks may shed some light on classroom instruction because textbooks provide an essential source of information and guid ance for teachers, especially when new standards are introduced. The content analysis results can be found in Appendix A . Generally, three to five attributes (standards) are involved in a period of one to four weeks . Pyramid and invert pyramid structures f ollowing the definitions of Liu and Huggins - Manley ( 2016 ) are observed besides the linear structure. 24 2 .1. 2 Attribute profile space of hierarchical attributes For a test involving K attributes, the set of all possible attribute profiles, subject to the relationship between attributes, is called the attribute profile space ( also called latent attribute space or latent space ; e.g., Köhn & Chiu, 2018 ; Tatsuoka, 2009). The attribute prof ile space , denoted by , is defined by a matrix with K columns representing K attributes and each row vector representing a n attribute profile . Identifying the attribute profile space for independent attributes is straightforward. Assuming independe nt attributes, the attribute profile space is a - by - K matrix , representing different classes into which the examinees would be classified . The hierarchical relationship s between attributes constrain the latent attribute space because some attribute profiles become impossible . Specifically, it is not allowed to master an attribute without mastering its prerequisite. Researchers have reached a consensus on restricting the attribute profile space at the presence of hierarchical a ttributes (e.g., Templin & Bradshaw, 2014; Tu et al., 2018). However, the identification of the attribute profile space is not straightforward, especially when the number of attributes is large (Köhn & Chiu, 2018). Köhn and Chiu (2018) proposed the lattice - theoretical approach to obtain the latent space . Th e first step is to derive the K from the tree graph of the attribute hierarchy . Each basic proficiency class is a K - element vector characterizing a possible path from the lowest - level attribute to a higher - level attribute . The next step is to reconstruct the attribute space as a set of linear combinations of the basic proficiency clas ses. However, t he inspection becomes more difficult as the number of attributes increases and the process is prone to mistakes. 25 An alternative way to derive the attribute profile space begins with the A - matrix. 
The first step is to derive the basic proficiency classes, as defined in Köhn and Chiu (2018), in the form of the column vectors of a matrix called the reachability matrix (R-matrix; Tatsuoka, 1983, 2009; Gierl et al., 2000). This approach is, therefore, referred to as the R-matrix approach.

2.1.2.1 R-matrix approach

We define some Boolean operations before elaborating the R-matrix approach. A Boolean vector or matrix is one for which all entries are either 0 or 1. The Boolean addition of two Boolean vectors $\mathbf{u}$ and $\mathbf{v}$ of $K$ elements is defined as

$\mathbf{u} \oplus \mathbf{v} = (u_1 \vee v_1,\ u_2 \vee v_2,\ \ldots,\ u_K \vee v_K),$

where $\vee$ is the logical "or" operator. The Boolean product of the $I$-by-$K$ Boolean matrix $\mathbf{U}$ and the $K$-by-$J$ Boolean matrix $\mathbf{V}$ is the $I$-by-$J$ matrix whose $(i, j)$th element is

$\bigvee_{k=1}^{K} \left( u_{ik} \wedge v_{kj} \right),$

where $\wedge$ is the logical "and" operator, $1 \le i \le I$, and $1 \le j \le J$. For a square Boolean matrix $\mathbf{M}$ and any positive integer $n$, the $n$th Boolean power of $\mathbf{M}$ is the Boolean product of $n$ copies of $\mathbf{M}$.

The derivation of the R-matrix from the A-matrix and the derivation of the attribute profile space from the R-matrix are elaborated below. The R-matrix can be calculated as the $n$th Boolean power of the matrix $\mathbf{A} + \mathbf{I}$ (Leighton et al., 2004):

$\mathbf{R} = (\mathbf{A} + \mathbf{I})^{n},$

where $n$ is the smallest integer for which $\mathbf{R}$ reaches invariance; $n$ is determined by the number of arrows in the longest path of the hierarchy.

The next step of the R-matrix approach derives the attribute profile space from the R-matrix. Note that the A-matrix and the R-matrix are of order $K \times K$; the attribute profile space, with columns indicating different attributes, however, may have more than $K$ rows. The following algorithm produces the transpose of the attribute profile space (Ding, Luo, Cai, Lin, & Wang, 2008). 1) For the $k$th column of the R-matrix, take the Boolean addition of the $k$th column and each column to its right. 2) Whenever a new column vector is obtained, add it to the right of the R-matrix. 3) Repeat the first two steps for each column of the original R-matrix, including the last one; note that the column vectors entering the Boolean additions include the newly added columns. The resulting matrix is called the expanded R-matrix, denoted $\tilde{\mathbf{R}}$, because it expands the $K$-by-$K$ R-matrix by adding columns. This algorithm is referred to as the expanding algorithm. The attribute profile space is the transpose of the expanded R-matrix ($\tilde{\mathbf{R}}'$) with an additional row of 0s. The space contains at most $2^K$ rows, representing $2^K$ attribute profiles, denoted as $\boldsymbol{\alpha}$'s; the maximum is reached when the attributes are independent, and the number of attribute profiles decreases with hierarchical attributes. The R-matrix approach is equivalent to the lattice-theoretical approach (Köhn & Chiu, 2018) but is easier to apply in practice. Appendix B provides R code for the expanding algorithm.

2.1.2.2 Interpretations of the Boolean operations

The interpretations of the Boolean operations involved in the R-matrix approach are provided below. Note that the A-matrix captures only the direct relationship between two attributes: each 1-entry stands for a one-way arrow connecting two attributes. The R-matrix should also capture indirect relationships. Therefore, the first step is to add the identity matrix to the A-matrix to account for the relationship of an attribute with itself. The next step multiplies $\mathbf{A} + \mathbf{I}$ by itself until invariance is achieved. The $(k, l)$th element of $(\mathbf{A} + \mathbf{I})^{2}$ is $\bigvee_{m=1}^{K} \left( a^{*}_{km} \wedge a^{*}_{ml} \right)$, where $a^{*}_{km}$ denotes the $(k, m)$th element of $\mathbf{A} + \mathbf{I}$; the term $a^{*}_{km} \wedge a^{*}_{ml}$ equals 1 if $a^{*}_{km} = 1$ and $a^{*}_{ml} = 1$, which means that attribute $k$ and attribute $l$ have an indirect relationship through attribute $m$; otherwise it equals 0.
The disjunction among attribute takes the value of 1 if attribute and attribute has an indirect relationship through any attribute. Consequently, the elements in capture all indirect relationships between attribute and attribute in the form of . Similarly, it can be shown that the th element of the matrix takes the value of 1 i f attribute and attribute has an indirect relationship through two attributes in the form of . Since the longest possible path in an attribute hierarchy has arrows, the largest number would take in equation is . Take the th column of the R - matrix. The th element of the th column takes the value of 1 if there a path from attribute to attribute . If the th attribute is at the lowest level in any path, then the th column has only one non - zero entry ; otherwise, the th column describes a path which 28 ends at attribute . As a result, t he columns in the R - matrix correspond to different paths as shown in the tree graph , equivalent to t he basic proficiency classes defined in Köhn and Chiu (2018). We use a linear hierarchy with four attributes to demonstrate the derivation of the R - matrix . The four columns in the R - matrix in equation describe four paths that start from attribute 1 (i.e., the lowest - level attribute) and end with each attribute, respectively. Invariance is achieved at b ecause the longest path (i.e., ) has three arrows. Th e columns of the R - matrix can be seen as attribute mastery profile s . If the attributes form a single linear hierarchy, then the R - matrix contains all the possible attribute mastery profiles. However, if there exist two attributes th at do not appear in the same path, the R - matrix fails to account for all the possible combinations of states of two such attributes. Consider the following attribute hierarchy . The first path (column) is nested within the other three paths (columns). The second path is nested within the two paths on the right. However, the last two paths are not nested within each other because A3 and A4 are not connected directly or indirectly in any path. The four columns in the R - matrix also correspond to four profiles. 29 Another possible profile , which is not included in the R - matrix, can be obtained by adding the last two columns of the R - matrix. The expanding algorithm involves the Boolean addition of two columns in the R - matrix shown in equation and . Addition of two nested paths as defined in equation does not produce a new column . Addition of two independent path s , however, produces a new column, which expands the original R - ma trix . Continuing with the complex hierarchy example in Köhn and Chiu (2018) , the attribute profile space derived from the expanding algorithm contains 31 attribute profiles. 2 .1.3 Q - matrix The relationship between the items and the attributes is described in an indicator matrix, called the Q matrix, which has r ows corresponding to items, columns corresponding to attributes, and binary elements indicating whether an attribute is measured by an item (that is, whether mastery of an attribute is required to succeed on an item). The Q - matrix was initially proposed by Tatsuoka (1983) and has been employed in most of the commonly used CDMs. The Q - matrix reflects the test blue print (Leighton, Gierl, & Hunka, 2004) . Specifically, t he Q - matrix operationalizes the substantive and cognitive theories based on which the test h as been 30 developed and provid es evidence for the construct and content aspects of validit y (Rupp, Templin, & Henson, 2010). 
It is often considered an analog to the specified factor structure in a confirmatory factor analysis ( Henson, DiBello, & Stout, 2018 ) . The row vectors of the Q - matrix are also referred to as q - vectors. Items with a q - vector with only one non - zero entry are called single - attribute items. Others are multiple - attribute items. An example of Q - matrix is which shows that the test measures three attributes with three items, the first item probes the second attribute, the second item targets the first and the third attributes, and the last item requires all three attributes. In other words, an examinee needs to master the second attribute to succeed on item 1 without guessing or slipping. The specification of the Q - matrix precedes any model fitting and classifying. The Q - matrix is part of the model assumption that can be falsified (e.g., Wang et al., 2018). W hile most theoretical and empirical studies assume that the Q - matrix is corre ctly specified (e.g., Henson et al., 2018) , recent efforts on Q - matrix construction and validation have pointed out the negative effects of incorrectly identified Q - matrices and proposed solutions ( e.g., de la T orre , 2008 ; Liu, Xu, & Ying, 2012 ). 2 .1. 3 . 1 Reduced versus full Q - matrix With hierarchical attributes, researchers have reached a consensus on restricting the attribute prof ile space ( e.g., Templin & Bradshaw, 2014; Tu et al., 2018). However, there has not been a consensus on the Q - matrix. T wo types of Q - matrices are being used : the full (or unrestricted) Q - matri ces (Liu et al., 2016; Templin & Bradshaw, 2014) and the reduced (or restricted) Q - matri ces ( Köhn & Chiu, 2018 ; Leighton et al., 2004; Tu et al., 2018) , which are defined below . 31 Consider a test with three independent attributes. The expanded R - matrix below has seven columns and e ach column represe nts an item type : If we randomly sample from the columns of in equation as the q - vectors , regardless of the attribute hierarchy, the Q - matrix is called a full Q - matrix . With any attribute hierarchy, a full Q - matrix could have all seven types of q - vectors or a random subset of them . In a test of three linear attributes, for instance, although the attribute profile is not allowed, the - vector is possible in the full - Q - matrix approach. Considering that s ome attributes profiles become illegitimate under a certain hie rarchy; particularly , it is im possible to master a n attr ibute without mastering all prerequisite attributes . Therefore, i n another line of research, it is assumed that an item probing a higher - level attribute also requires its prerequisite. This assumption would lead to the rem oval of some q - vectors . For example, under a linear hierarchy ( ) would be unreasonable because the item requires the mastery of the second attribute without requiring its prerequisite. A reduced Q - matrix can only have columns of as q - vectors. A special reduced Q - matrix is the transpose of , denoted as For three linear attributes, for example, and are defined in equation and . T he only difference between and the attribute profile space is the exclusion or inclusion of the vector of all 0s . Therefore, can also be derived using the R - matrix approach . 32 While studies using full Q - matrices tend not to discuss the necessity to make any change in the Q - matrix, researchers using reduced Q - matrices believe that the items should reflect the attribute hierarchy ( Köhn & Chiu, 2018; Tu et al., 2018). 
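The logic of the R-matrix approach, the expanding algorithm, and the reduced Q-matrix can be made concrete in a few lines of R. The sketch below is a simplified illustration written for this section (it is not the Appendix B code); the three-attribute inverted pyramid hierarchy H3.3 and all function and object names are assumptions made for the example.

```r
# Illustrative sketch: from A-matrix to R-matrix, expanded R-matrix, attribute
# profile space, and the q-vectors allowed in a reduced Q-matrix.
bool_prod <- function(U, V) (U %*% V > 0) * 1L      # Boolean matrix product

reach_matrix <- function(A) {                       # R = (A + I)^n (Boolean power)
  R <- (A + diag(nrow(A)) > 0) * 1L
  repeat {
    R_next <- bool_prod(R, R)
    if (identical(R_next, R)) return(R)             # invariance reached
    R <- R_next
  }
}

expand_R <- function(R) {                           # the expanding algorithm
  expanded <- R
  for (k in seq_len(ncol(R))) {
    for (j in seq_len(ncol(expanded))) {
      new_col <- ((R[, k] + expanded[, j]) > 0) * 1L   # Boolean addition of columns
      if (!any(colSums(expanded != new_col) == 0)) {   # keep only genuinely new columns
        expanded <- cbind(expanded, new_col, deparse.level = 0)
      }
    }
  }
  expanded
}

# Inverted pyramid hierarchy H3.3: A1 is the prerequisite of both A2 and A3
A <- matrix(0, 3, 3)
A[1, 2] <- 1
A[1, 3] <- 1

R           <- reach_matrix(A)            # columns = basic proficiency classes
R_expanded  <- expand_R(R)                # adds the column (1, 1, 1)'
alpha_space <- rbind(0, t(R_expanded))    # 5 profiles: 000, 100, 110, 101, 111
reduced_Q_pool <- t(R_expanded)           # q-vectors allowed in a reduced Q-matrix
```

With independent attributes the same code returns all $2^K - 1$ nonzero q-vectors; with a hierarchy it returns only the subset consistent with the prerequisite structure.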
The choice between the full Q-matrix and the restricted one has not been formally addressed in the literature.

2.1.3.2 Complete Q-matrix

A complete Q-matrix is needed to identify all possible attribute profiles (Chiu, Douglas, & Li, 2009; Chiu & Köhn, 2015). With a complete Q-matrix, we have $\mathbf{S}(\boldsymbol{\alpha}) \neq \mathbf{S}(\boldsymbol{\alpha}')$ whenever $\boldsymbol{\alpha} \neq \boldsymbol{\alpha}'$, where $\mathbf{S}(\boldsymbol{\alpha})$ denotes the expected response vector given attribute profile $\boldsymbol{\alpha}$. Completeness of the Q-matrix is evaluated by checking this condition for each pair $(\boldsymbol{\alpha}, \boldsymbol{\alpha}')$ in the attribute profile space. It was proved in Chiu et al. (2009) that a Q-matrix containing the identity matrix (i.e., single-attribute items for every attribute) is complete for the DINA model with independent attributes. Köhn and Chiu (2018) later showed that any Q-matrix that contains the transpose of the R-matrix is complete for the DINA model, given any attribute hierarchy. This rule, however, does not apply to more complicated CDMs such as the ACDM and GDINA (Köhn & Chiu, 2018).

2.1.4 Item response models and calibration methods

The relationship between each attribute profile and the probability of a correct response is expressed in terms of an IRF (de la Torre, 2011; Rupp, Templin, & Henson, 2010). A variety of models with different IRFs for multiple-attribute items have been proposed; most of them are equivalent to one another in the parameterization of a single-attribute item. Some CDMs are general models that subsume most other specific models. The general frameworks include the general diagnostic model (GDM; von Davier, 2005), the log-linear cognitive diagnosis model (LCDM; Henson, Templin, & Willse, 2009), and the generalized DINA model (GDINA; de la Torre, 2011). The rest of this section introduces the GDINA framework and two reduced models derived from it. The following notation is used: $K_j^*$ is the number of attributes required for item $j$ (i.e., $K_j^* = \sum_{k=1}^{K} q_{jk}$); $\boldsymbol{\alpha}_{lj}^*$ is the reduced attribute vector consisting of the columns of the required attributes, where $l = 1, \ldots, 2^{K_j^*}$; and the probability of a correct response on item $j$ by examinees with reduced attribute pattern $\boldsymbol{\alpha}_{lj}^*$ is denoted $P(\boldsymbol{\alpha}_{lj}^*)$. The IRF of the GDINA model (de la Torre, 2011) is given by

$g\!\left[P(\boldsymbol{\alpha}_{lj}^*)\right] = \delta_{j0} + \sum_{k=1}^{K_j^*} \delta_{jk}\alpha_{lk} + \sum_{k'=k+1}^{K_j^*}\sum_{k=1}^{K_j^*-1} \delta_{jkk'}\alpha_{lk}\alpha_{lk'} + \cdots + \delta_{j12 \cdots K_j^*}\prod_{k=1}^{K_j^*}\alpha_{lk},$

where $g[\cdot]$ is $P(\boldsymbol{\alpha}_{lj}^*)$, $\log P(\boldsymbol{\alpha}_{lj}^*)$, and $\operatorname{logit} P(\boldsymbol{\alpha}_{lj}^*)$ in the identity, log, and logit links, respectively; $\delta_{j0}$ is the intercept for item $j$; $\delta_{jk}$ is the main effect due to $\alpha_{lk}$; $\delta_{jkk'}$ is the two-way interaction effect due to $\alpha_{lk}$ and $\alpha_{lk'}$; and $\delta_{j12 \cdots K_j^*}$ is the interaction effect due to $\alpha_{l1}, \ldots, \alpha_{lK_j^*}$. The GDINA model is a saturated model and subsumes several widely used reduced CDMs, including the DINA model (Haertel, 1989; Junker & Sijtsma, 2001; Macready & Dayton, 1977) and the A-CDM (de la Torre, 2011). To obtain the DINA model, all terms of the identity-link GDINA model except $\delta_{j0}$ and $\delta_{j12 \cdots K_j^*}$ are constrained to zero, that is,

$P(\boldsymbol{\alpha}_{lj}^*) = \delta_{j0} + \delta_{j12 \cdots K_j^*}\prod_{k=1}^{K_j^*}\alpha_{lk}.$

The A-CDM is the constrained identity-link GDINA model without the interaction terms. It can be formulated as

$P(\boldsymbol{\alpha}_{lj}^*) = \delta_{j0} + \sum_{k=1}^{K_j^*} \delta_{jk}\alpha_{lk}.$

Current methods for fitting CDMs use either marginal maximum likelihood estimation relying on the expectation-maximization algorithm (MMLE-EM) or Markov chain Monte Carlo (MCMC) techniques (Rupp et al., 2010).

2.1.5 Classification methods

The prime objective of CDM data analysis is to classify examinees into one of the attribute profiles. The estimated attribute profile for examinee $i$, denoted $\hat{\boldsymbol{\alpha}}_i$, takes the value of one of the possible attribute profiles $\boldsymbol{\alpha}_l$, $l = 1, \ldots, L$, where $L$ is the size of the attribute profile space. When dichotomous attributes are involved and assumed to be independent, the attribute profile space consists of $2^K$ latent classes. If an attribute hierarchy exists, the number of attribute profiles decreases, with some attribute profiles becoming impossible.
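The following R sketch illustrates the identity-link DINA parameterization above and previews, for a single examinee, the maximum likelihood and EAP classifications formalized in the next paragraphs. The three-attribute test, the parameter values, and all function names are hypothetical choices made for illustration only; operational analyses would typically rely on a package such as GDINA or CDM.

```r
# DINA correct-response probability for one item: intercept plus the
# interaction term if and only if all required attributes are mastered.
p_dina <- function(alpha, q, delta0, delta_all) {
  delta0 + delta_all * as.numeric(all(alpha[q == 1] == 1))
}

# A hypothetical 3-attribute, 4-item test with a restricted (linear) profile space
alpha_space <- rbind(c(0,0,0), c(1,0,0), c(1,1,0), c(1,1,1))
Q <- rbind(c(1,0,0),
           c(1,1,0),
           c(1,1,1),
           c(1,1,0))

# Item-by-profile matrix of correct-response probabilities (delta0 = .2, delta_all = .7)
P <- t(apply(Q, 1, function(q)
  sapply(seq_len(nrow(alpha_space)), function(l)
    p_dina(alpha_space[l, ], q, delta0 = .2, delta_all = .7))))

# Likelihood of an observed response vector for each profile, and the (restricted) MLE
x   <- c(1, 1, 0, 1)                             # observed responses to the 4 items
lik <- apply(P, 2, function(p) prod(p^x * (1 - p)^(1 - x)))
alpha_hat_mle <- alpha_space[which.max(lik), ]

# EAP-style marginal mastery probabilities under a flat prior over the profiles
post      <- lik / sum(lik)
marginal  <- colSums(post * alpha_space)         # P(mastery) for each attribute
alpha_hat_eap <- as.numeric(marginal >= .5)
```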
Exam ine es are often classified via maximum likelihood e stimation (MLE; de la Torre, 2008), maximum a pos teriori (MAP ; Rupp et al., 2010), or expected a posteriori (EAP ; de la Torre, 2008; Rupp et al., 2010 ) , which are applicable to any CDM that is a special case of a restricted latent class model . Huebner and Wang (2011) conducted a simulation study comparing the accuracy of the three methods under different testing conditions. The likelihood function of the responses given the attribute profile is given by 35 The MLE estimator is the attribute profile for that maximizes the likelihood , and is formally denoted as If prior probabilities denoted as for , are available from previous test administrations, the posterior probability for each can be calculated: The MAP estimator is then denoted as It is generally true that MLE and MAP estimates are equivalent if flat priors are used in MAP estimation (Huebner & Wang, 2011). For the EAP approac h , the probabilities of mastery for each attribute (the marginal skill probabilities) , for , are calculated for an examinee and rounded at .50 to obtain binary mastery classifications. The posterior probabilities are aggregated to ob tain t he marginal probabilities for : where The marginal probability is usually rounded at .50 to obtain a binary classification for attribute ( ) . 36 With hierarchical attributes, researchers have reached a conse nsus on restricting the attribute profile space (e.g., Templin & Bradshaw, 2014; Tu et al., 2018). The MLE estimator maximizes the likelihood function over the set of all possible attribute profiles when the item parameters are assumed to be known, which i s referred to as unrestricted MLE (Tu et al., 2018). When hierarchical attributes are involved, a restricted MLE is recommended in which the probability of some attribute profiles are fixed to zero due to the hierarchy (Templin & Bradshaw, 2014; Tu et al., 2018). The only difference between unrestricted and restricted MLE is in the attribute profile space. Similarly, restricted MAP and EAP estimators should be used for hierarchical attributes. 2 . 1 . 6 Q - matrix design The CDM s provide guidance for test construction . C ognitive theories could have a real impact on testing practice through CDM model assumptions about relationships between attribute as well as the relationship between attributes and item responses . Given a set of a ttributes, i nstead of relying heavily on post hoc item analysis surrounding internal consistency, test development in the CDM context begins with a set of possible item types that are characteriz ed by their q - vectors. For example, a test with three indepen dent attributes can have at most seven different item types. The Q - matrix for a particular test can be obtained by sampling with replacement from the column vectors of the corresponding . The Q - matrix is a core element of the CDM - based test design. Madison and Bradshaw (2015) defined the Q - matrix design as "the deliberate arrangement of a set of test items according to the specific subset of attributes measured by each individual item. " The Q - matrix plays a significant role in the stati stica l identification of the model ( Köhn & Chiu, 2018 ; Xu & Zhang, 2016 ) . However, Q - matrices that lead to identification may provide varying classification accuracy rates . 37 Three studies have been done with the effects of Q - matrix design on classification accu racy with independent attributes. 
Chiu, Douglas, and Li (2009) showed that each attribute needs to be measured by at least one single-attribute item in order to obtain acceptable classification accuracy under both the DINA (Haertel, 1989; Junker & Sijtsma, 2001; Macready & Dayton, 1977) and DINO (Templin & Henson, 2006) models. Similarly, DeCarlo (2011), in his investigation of the DINA model, found that if an attribute is always measured through interaction terms and never measured in isolation, the resulting classification only reflects the prior probabilities. The finding of DeCarlo (2011) was echoed in Madison and Bradshaw (2015), who concluded, based on the log-linear cognitive diagnosis model (LCDM; Henson, Templin, & Willse, 2009), that attributes measured in isolation help increase classification accuracy when the number of times an attribute is measured on a test is held constant. Recent efforts have expanded the research on Q-matrix design to testing situations with hierarchical attributes (Liu & Huggins-Manley, 2016; Liu, Huggins-Manley, & Bradshaw, 2017). In Liu, Huggins-Manley, and Bradshaw (2017), different Q-matrix designs were generated using the so-called independent, adjacent, or reachable approach when the attribute hierarchy was linear, divergent, convergent, or unstructured; the CDM was the hierarchical diagnostic classification model (HDCM; Templin & Bradshaw, 2014). The independent approach allows only single-attribute items. In the adjacent approach, each item measures at most two attributes that have a direct relationship. In the reachable approach, each item can measure any combination of attributes that are directly or indirectly connected. Their simulations found that the adjacent approach leads to higher classification accuracy with a shorter test, and they recommended using the adjacent approach to design the Q-matrix when a hierarchy is present (Liu et al., 2017). Using the adjacent approach of Liu et al. (2017), Liu and Huggins-Manley (2016) found that "higher-level attributes were often associated with higher classification accuracy than lower-level attributes" as a result of the additional information about higher-level attributes contributed by the hierarchical structure.

2.1.7 Criteria for test construction

A research area closely related to Q-matrix design is the development of item and test indices. When estimated item parameters are available for a pool of items, an index based on those parameters can be calculated to identify good items, that is, items that achieve high classification rates with a minimal number of items (Henson, DiBello, & Stout, 2018). This type of index is referred to as item discrimination in Henson et al. (2018). The Fisher information is an example of such an index in the IRT context. For CDMs, a counterpart of the Fisher information is the Kullback-Leibler information (KLI; also called KL divergence or KL distance). Much of the work on item-level and test-level indices for CDMs has been based on the KLI.

2.1.7.1 Kullback-Leibler information

The KLI measures how far a distribution is from the actual distribution (Gray, 2011; Chang & Ying, 1996; Xu, Chang, & Douglas, 2003). Given a probability measure $P$ on a finite space $\Omega$ and another measure $Q$ on the same space, the KL information of $P$ with respect to $Q$ (Gray, 2011) is defined as

$D(P \,\|\, Q) = \sum_{\omega \in \Omega} P(\omega)\,\log\frac{P(\omega)}{Q(\omega)},$

which ranges from 0 to $\infty$.
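Applied to a dichotomous item, this definition yields the item-level KL information between pairs of attribute profiles that underlies the indices reviewed next. The R sketch below is an illustrative computation; the DINA-type probabilities and all object names are hypothetical.

```r
# KL information of an item for distinguishing profile a from profile b:
# D_j(a, b) = P_j(a) log[P_j(a)/P_j(b)] + (1 - P_j(a)) log[(1 - P_j(a))/(1 - P_j(b))]
item_kl <- function(p_a, p_b) {
  p_a * log(p_a / p_b) + (1 - p_a) * log((1 - p_a) / (1 - p_b))
}

# Correct-response probabilities of one hypothetical DINA item (q-vector 110)
# for the profiles of a linear 3-attribute hierarchy: 000, 100, 110, 111
p_item <- c(.2, .2, .9, .9)

L   <- length(p_item)
D_j <- outer(seq_len(L), seq_len(L),
             Vectorize(function(a, b) item_kl(p_item[a], p_item[b])))
round(D_j, 2)   # rows: true profile a; columns: competing profile b
                # diagonal entries are zero: no information for identical profiles
```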
The Fisher information can be used in test construction because the test information is the sum of the item information, and the variability of the maximum likelihood estimate decreases as the information increases. Test construction criteria for CDMs should have similar properties (Henson & Douglas, 2005). The KL information of item $j$ for differentiating $\boldsymbol{\alpha}_a$ from $\boldsymbol{\alpha}_b$ is defined as

$D_j(\boldsymbol{\alpha}_a, \boldsymbol{\alpha}_b) = P_j(\boldsymbol{\alpha}_a)\log\frac{P_j(\boldsymbol{\alpha}_a)}{P_j(\boldsymbol{\alpha}_b)} + \left[1 - P_j(\boldsymbol{\alpha}_a)\right]\log\frac{1 - P_j(\boldsymbol{\alpha}_a)}{1 - P_j(\boldsymbol{\alpha}_b)}.$

Note that $D_j(\boldsymbol{\alpha}_a, \boldsymbol{\alpha}_b) \ge 0$, and $D_j(\boldsymbol{\alpha}_a, \boldsymbol{\alpha}_b) = 0$ when $\boldsymbol{\alpha}_a = \boldsymbol{\alpha}_b$. An item is most useful in determining the difference between two attribute profiles $\boldsymbol{\alpha}_a$ and $\boldsymbol{\alpha}_b$ if $D_j(\boldsymbol{\alpha}_a, \boldsymbol{\alpha}_b)$ and $D_j(\boldsymbol{\alpha}_b, \boldsymbol{\alpha}_a)$ are both large. All the $D_j$ values for item $j$ can be recorded in a matrix with $L$ rows and $L$ columns, where $L$ is the size of the attribute profile space. The KL information for a test is defined as

$D(\boldsymbol{\alpha}_a, \boldsymbol{\alpha}_b) = E_{\boldsymbol{\alpha}_a}\!\left[\log\frac{P(\mathbf{X} \mid \boldsymbol{\alpha}_a)}{P(\mathbf{X} \mid \boldsymbol{\alpha}_b)}\right],$

where $\mathbf{X}$ represents the response pattern for the $J$ items. The test-level KL information thus compares the probability distribution of the item response vector $\mathbf{X}$ given $\boldsymbol{\alpha}_a$ with the probability distribution of $\mathbf{X}$ given an alternative attribute pattern $\boldsymbol{\alpha}_b$. Because of the assumption of local independence among items conditional on $\boldsymbol{\alpha}$, it can be shown that the test information is the sum of the KL information across all items in the exam. The test KL information for all pairs $(\boldsymbol{\alpha}_a, \boldsymbol{\alpha}_b)$ in the attribute profile space forms an $L \times L$ matrix, where $L$ is the size of the space; the matrix contains $L(L-1)$ distinct off-diagonal comparisons, because the KL information is not symmetric, and its diagonal elements are zero. The KL information provides a general method that applies to all CDMs (Henson & Douglas, 2005), and researchers have built on it to propose attribute-, item-, and test-level indices for test construction.

2.1.7.2 Cognitive diagnostic index (Henson & Douglas, 2005)

The cognitive diagnostic index (CDI) for an item was proposed as a weighted average of the off-diagonal elements of the item's KL information matrix, since the matrix expands exponentially with the number of attributes and is difficult to evaluate element by element (Henson & Douglas, 2005). With weights proportional to the reciprocal of the Hamming distance $h(\boldsymbol{\alpha}_a, \boldsymbol{\alpha}_b)$ between profiles, the CDI for item $j$ can be written as

$CDI_j = \frac{\sum_{a \neq b} h(\boldsymbol{\alpha}_a, \boldsymbol{\alpha}_b)^{-1}\, d_{j(a,b)}}{\sum_{a \neq b} h(\boldsymbol{\alpha}_a, \boldsymbol{\alpha}_b)^{-1}},$

where $d_{j(a,b)}$ stands for the $(a, b)$ element of the item KL matrix. The CDI for a test is defined analogously, with the $(a, b)$ element of the test KL matrix in place of $d_{j(a,b)}$, and it can be shown that the CDI for a test is the sum of $CDI_j$ over all the items in the test. Henson and Douglas (2005) showed that the CDI is strongly related to the average correct classification rate across attributes and examinees for a test, and they suggested using the index to select items when assembling a diagnostic test. Other indices based on the KL information include the attribute discrimination index (ADI), which is intended to relate to the correct classification rate of the masters of the $k$th attribute (Henson, Roussos, Douglas, & He, 2008), and the modified CDI and modified ADI (Kuo, Pai, & de la Torre, 2016). Note that all the indices mentioned above are overall indices that are not conditional on $\boldsymbol{\alpha}$.

2.1.7.3 A unified item and test discrimination approach (Henson, DiBello, & Stout, 2018)

Henson et al. (2018) proposed a probability-based, attribute-specific index for items with multiple options. For dichotomous items, the index reduces to a maximization, taken over all attribute patterns $\boldsymbol{\alpha}$, of a contrast between $P_j(\boldsymbol{\alpha})$ and $P_j(\tilde{\boldsymbol{\alpha}})$, where $\tilde{\boldsymbol{\alpha}}$ denotes an attribute pattern that differs from $\boldsymbol{\alpha}$ only on the $k$th attribute. The index describes the discrimination power of item $j$ in measuring attribute $k$ and takes values between 0 and 1.

2.2 Nonparametric classification based on CDM conception

An alternative, for situations in which calibrating a parametric CDM is not practical or even possible, is the nonparametric approach to classification.
The nonparametric approach shares with the conventional CDM approach the conceptions of a Q - matrix, a set of attributes, and different attribute interaction effects on correct responses . The test is still constructed based on a CDM , but a probabilistic model is not used to characterize the correct response probabilities of different attribute profiles . Instead, the examinees are classified into dif ferent attribute profiles using a nonparametric method . Barnes (2010) developed a nonparametric exploratory approach to build the Q - matrix and classify examinees . Some researchers employ cluster analysis for nonparametric classifications ( Ayers, Nugent, & Dean, 2008; Chiu, Douglas, & Li, 2009; Willse, Henson, & Templin, 2007 ). Another stream of research is based on the idea of minimizing the distance between observed item response patterns and the ideal response patterns according to the Q - matrix (Chiu & D ouglas, 2013 ; Chiu, Sun, & Bian, 2018 ; Wang & Douglas, 2015 ) . The rest of the section reviews the third type of nonparametric methods that minimize distance measure s. 2 .2.1 The nonparametric (NPC) method Chiu and Douglas (2013) proposed a simple method to classify examinees by matching observed item response patterns to the nearest ideal response pattern, henceforth referred to as the nonparametric (NPC) method. The ideal response of examinee on item is denoted as , a nd the vector containing ideal responses of examinee on all the items in a test is denoted as . 42 The ideal response patterns are derived from the Q - matrix and the assumption on attribute interactions. Consider a q - vector and four possible attr ibute profiles . If we assume a conjunctive model underlying the responses, the ideal responses for the four attribute profiles would be respectively . For a test with more than one item, each possible attribute profile is associated with an ideal response pattern. The ob served response pattern of an examinee is compared with the ideal response patterns. The attribute profile of the closest ideal response pattern is the estimate for the examinee. Three distance measures were proposed by Chiu and Douglas (2013) . Denote the observe response pattern as . T he hamming distance between and is given by where J stands for the test length. A weighted Hamming distance is defined as where denotes the proportion correct on the th item . They also proposed the penalized Hamming distance for the special cases where the slipping parameter is much less than the guessing parameter or vice versa (Chiu & Douglas, 2013). Chiu and Douglas (2013) f ound that accurate classification can be achieved when the true model is DINA and NIDA with slip and guess parameters considerably high er than 0. The estimator of would be perfect without any slip ping or guess ing but still performs with good relative efficiency even when this is not the case (Chiu & Douglas, 2013) . A formal justification 43 for the NPC methods was provided in Wang and Douglas (2015) , show ing that the nonparametric method yields consistent classificat ions under a variety of underlying conjunctive models. 2 .2.2 The general nonparametric classification (GNPC) method The general nonparametric classification (GNPC) method ( Chiu, Sun, & Bian , 2018 ) was proposed as an extension of t he NPC methods ( Chiu & Do uglas , 2013) . T he example in 3.2.1 is revisited t o illustrate the need for this extension . T he ideal responses for the four attribute profiles are respectively , assuming an underlying conjunctive model. 
The ideal responses would instead become $(0, 1, 1, 1)$ if the underlying model is a disjunctive one. In the NPC method, either the conjunctive ideal response patterns (denoted $\boldsymbol{\eta}^{(c)}$) or the disjunctive ideal response patterns (denoted $\boldsymbol{\eta}^{(d)}$) are used, according to the assumptions about the cognitive process. However, using $\boldsymbol{\eta}^{(c)}$ or $\boldsymbol{\eta}^{(d)}$ may not be adequate if the underlying CDM is a complex one, such as a saturated GDINA model. Consider a set of GDINA parameters for this item under which the four possible attribute profiles have four distinct correct-response probabilities. Obviously, neither the ideal response pattern (0, 0, 0, 1) nor (0, 1, 1, 1) would be appropriate, and before any analysis of the response data we cannot decide which of the two ideal response patterns is more suitable. Therefore, the GNPC method defines the weighted ideal response on item $j$ for the $l$th attribute profile in the attribute profile space as

$\eta^{(w)}_{lj} = w_{lj}\,\eta^{(c)}_{lj} + (1 - w_{lj})\,\eta^{(d)}_{lj},$

in which $w_{lj}$ is a weight calculated from the data in an iterative procedure. Conceptually, the weight is found by minimizing the total distance between the observed responses and the weighted ideal responses. Denote the attribute profiles as $\boldsymbol{\alpha}_l$ for $l = 1, \ldots, L$, and let $C_l$ be the set of examinees classified into $\boldsymbol{\alpha}_l$. The total distance for item $j$ and profile $l$ can be written as

$d_{lj} = \sum_{i \in C_l} \left(x_{ij} - \eta^{(w)}_{lj}\right)^2.$

The weight $\hat{w}_{lj}$ is obtained by minimizing $d_{lj}$; at the minimum, the weighted ideal response equals the average observed response to item $j$ among the $N_l$ examinees classified into attribute profile $\boldsymbol{\alpha}_l$, where $N_l$ is the number of such examinees. The $\hat{w}_{lj}$ can be computed via the iterative procedure described in Chiu et al. (2018), with the NPC method providing the initial classifications from which the initial weights are calculated. The NPC (Chiu & Douglas, 2013; Wang & Douglas, 2015) and GNPC (Chiu et al., 2018) methods do not have the limitations of conventional CDMs regarding the number of attributes, the sample size, or the test length, which makes them a practical option for small-scale classroom assessments.

2.3 CD-CAT

2.3.1 From IRT-based CAT to CD-CAT

Computerized adaptive testing (CAT), built on the idea of tailored testing, can adapt both the items administered and the test length to an individual examinee. The maximum information criterion is usually adopted in IRT-based CAT, which gains efficiency, compared with linear testing, in terms of a shorter test length, higher measurement precision, or both. There have been many operational CAT programs since the 1980s and a rich literature over the past decades (Reckase, 2010). CAT algorithms based on CDMs (denoted CD-CAT) have been developed with the same motivation as IRT-based CAT, that is, to increase testing efficiency (Cheng, 2009; McGlohen & Chang, 2008; Xu, Chang, & Douglas, 2003). When cognitive diagnosis is combined with CAT, testing can move toward a new stage of individualized diagnosis and learning. As technology becomes more available in the classroom, CD-CAT can play a more important role in learning and teaching. Chang (2015) reported on an experimental CD-CAT program implemented in Zhengzhou, observing that "CD-CAT encourages critical thinking, making students more independent in problem solving, and offers easy to follow individualized remedy, making learning more interesting" (p. 15). Similar to CATs based on other measurement models, a CD-CAT algorithm consists of a measurement model (e.g., the DINA model), a method for selecting the first item(s) to administer, a scoring method, a rule for selecting the next item conditional on the examinee's responses to the previous item(s), and a termination rule to end the test. An item pool of calibrated items is needed for the implementation of the CAT algorithm.

2.3.2 Item selection methods for CD-CAT

Item selection is a core element of CAT algorithms.
T hree item selection indices based on the KL information are reviewed in this section because they will be used in the simulation study . There are i tem selection methods based on other criteria such as the Shannon entropy ( Wang, 2013; Xu et al., 2003) and mutual information ( Huebner, Finkelman, & Weissman, 2018 ) . The following notations are used for the CD - CAT context : denotes the attribute profile estimate for examinee after items have been administered; denotes the observed response pattern for examinee when items have been administe red; denotes the size of the attribute profile space; ( ) denotes the th attribute profile in the attribute profile space; 46 denotes the available items in the item pool when items have been administered; and denotes the response of examinee to item from . The KL algorithm . Xu , Chang, and Douglas (2003) proposed using the straight sum of the KL distances bet ween and all the for . Note that when there are independent attributes . The KL index is defined as w here Then the th item for the th examinee is the item in that maximizes . The KL index is referred to as the global discrimi nation index (GDI) in Xu et al. (2003). This item selection meth od is referred to as the KL algorithm in Cheng (2009). The KL algorithm selects items that are the most powerful in distinguishing the current attribute profile estimate from all other possible attribute profiles on average (Cheng, 2009). Cheng (2010) points out that the KL algorithm does not consider attribute coverage. Another drawback is that this algorithm may not be effective at the early stage with inaccurate . T he posterior - weighted KL (PWKL) index . The PWKL index weights the KL index by the posterior distribution ( Cheng, 2009 ) . If informative priors are available for each attribute profile, posterior distributions can be obtained at each step : Denote by for simplicity in notation. The PWKL index is defined as 47 Assuming local independence, the likelihood function can be written as w here is the IRF defined by a CDM . Then the th item for the th examinee is the item in that maximizes . If the prior is discrete uniform, the PWKL index is reduced to the likelihood - weighted KL (LWKL) index : T he modified posterior - weighted Kullback - Leibler (MPWKL) ind ex . The KL and PWKL index use the current estimate with an implicit assumption that the point estimate is a good summary of the current information. However, the point estimate may be inaccurate especially at the early stage s of a test. To solve this problem, Kaplan , de la Torre, and Barrada (2015) used the entire posterior distribution instead of a point estimate. The MPWKL index is given as 2 . 3 . 3 Item pool design The potential benefits of CAT cannot be realized without a well - constructed item pool (Reckase, 2010). There are some studies on item pool design for CAT based on IRT models (e.g., Reckase, 2010; Thissen, Reeve, Bjorner, & Chang, 2007) , and more research is needed in this area . Considering the d ifference between items based on IRT and CD M, the findings from IRT - based 48 CAT cannot be directly applied to CD - CAT. However, t he item pool design for CD - CAT has not been addressed in the literature despite its importance. Simulation findings on item usage in CD - CAT might inform the item pool design process (Kaplan et al., 2015) . 
For example, a CD-CAT based on the DINA model tends to select items whose q-vector matches the examinee's true attribute profile and single-attribute items measuring attributes the examinee has not mastered, which implies that it is important to include sufficient single-attribute items in the item pool. Since there is no published research on item pool design for CD-CAT, studies on IRT-based CAT are reviewed below. There is a body of literature on selecting operational pools from a larger master pool of items (Swanson & Stocking, 1998; van der Linden, Ariel, & Veldkamp, 2006; Way, Steffen, & Anderson, 1998). The problem these studies address is related to item pool design but is more appropriately described as item pool assembly (van der Linden et al., 2006). van der Linden et al. (2006) argue that an item pool design problem occurs before actual items are available and that its output is a blueprint for an item pool, that is, a specification of the distribution of the numbers of items over the space of all possible combinations of statistical and nonstatistical item attributes (e.g., the item difficulty parameter and word count). The goal of item pool design is to guide the item writing and pool maintenance process (Reckase, 2010; Veldkamp & van der Linden, 2000).

Item pool design studies for IRT-based CAT focus on different aspects of an item pool. Veldkamp and van der Linden (2000) propose a method for item pool design that minimizes item-writing costs subject to test constraints. The test constraints are represented in a classification table that contains all possible combinations of item attributes, such as word counts, difficulty parameters, and discrimination indices (Veldkamp & van der Linden, 2000). Quantitative attributes are transformed into categorical variables represented by intervals of, for example, the difficulty parameter. The goal of the item pool design process is then to find the number of items needed in each cell of the classification table. The number of items in each cell of a previous item pool, however, is needed to define the item-writing cost as the inverse of that number, based on the idea that items written more frequently tend to be less costly. Another stream of research, based on the bin-and-union method (Reckase, 2010), explores item pool design without using information from an existing item pool as a starting point (He & Reckase, 2014; Mao, 2014). This family of research focuses on the psychometric performance of item pools rather than on item-writing costs. Reckase (2010) argues that an optimal item pool should provide the desired item for every item selection. An optimal item pool for a CAT procedure based on the 1PL model, for example, contains an item whose b-parameter exactly matches the provisional ability estimate at every step of the test; the size of such an item pool is $2^J - 1$, where $J$ is the test length, which is too large to be practical. If the latent scale is divided into bins and the items with b-parameters within a bin are treated as equivalent, the item pool size decreases to a reasonable level. This is similar to the categorization of the difficulty parameter in Veldkamp and van der Linden (2000). The item pool design methods of Veldkamp and van der Linden (2000) and Reckase (2010; see also He & Reckase, 2014; Mao, 2014) are based on different definitions of an optimal item pool, but a common feature they share is the use of computer simulation.
The simulations in Veldkamp and van der Linden (2000) are carried out using integer programming and the shadow test approach ( van der Linden, 2005a, 2005b; va n der Linden & Diao, 2014; van der Linden & Reese, 1998 ) and 50 sampling examinees from a hypothetical examinee distribution . The goal is to record the counts of the number of times items from each cell in the classification table are used , and t he final blue print is calculated from these counts (Veldkamp & van der Linden, 2000) . The bin - and - union method (Reckase, 2010) takes a more direct approach by simulating an operational CAT and sampling from an examinee population. 51 Chapter 3 CDM parameterization and Q - matrix with hierarchical attributes 3.1 Introduction The CDMs with a restricted attribute profile space due to the attribute hierarchy is henceforth referred to as hierarchical CDMs . This section addresses parameterizations and the Q - matrix of hierarchical CDMs. P arameterizations for hierarchical CDMs have not been formally discussed except for the HDCM (Liu et al., 2017; Templin & Bradshaw, 2014) and the DINA model. When it comes to the Q - matrix, t wo types of Q - matrices are being use d by two groups of researchers, respectively: the full (or unrestricted) Q - matri ces (Liu et al., 2016; Templin & Bradshaw, 2014) and the reduced (or restricted) Q - matri ces ( Köhn & Chiu, 2018; Leighton et al., 2004; Tu et al., 2018). The choice between the full Q - matrix and the restricted one has not been formally addressed. Therefore, the first set of research questions is about the parametrization of hierarchical CDMs and the difference between reduced and full Q - matrix . These questions are important because the test constructions and item pool designs all depend on correctly - defined CDMs and Q - matrices. In this thesis, it is assumed that the hierarchical relationship and the Q - matrix ha ve been established and validated, and we focus on test construction or item pool design for different types of attribute hierarchies. 3 .2 Attribute hierarchies Before discussing parameterizations and Q - matrices, we define the attribute hierarchies studied in this thesis. The formative assessment is designed for a period of two to four weeks. Therefore, we consider situations with three, four, or five attributes in this study . The subsets of attribute hierarchies chosen for 3 - attribute, 4 - attribute, or 5 - att ribute conditions, respectively, are 52 listed in Table 1 and illustrated in Figure 6 - Figure 8 . Most of the selected attribute hierarchies can be found in the textbook analysis , as well as previous empirical and simulation studies. Table 1 : Subsets of attribute hierarchies for 3 - attribute, 4 - attribute, or 5 - attribute conditions ID Number of attributes Size of attribute profile space Attribute hierarchy H3.1 3 8 Independent H3.2 3 4 Linear H3.3 3 5 Inverted pyramid H3.4 3 5 Pyramid H4.1 4 16 Independent H4.2 4 5 Linear H4.3 4 8 Linear + single H4.4 4 6 Inverted pyramid H4.5 4 6 Pyramid H5.1 5 32 Independent H5.2 5 6 Linear H5.3 5 10 Inverted pyramid I H5.4 5 11 Inverted pyramid II H5.5 5 10 Pyramid I H5.6 5 11 Pyramid II Figure 6 : A subset of attribute hierarchies with 3 attributes 53 Figure 7 : A subset of attribute hierarchies with 4 attributes Figure 8 : A subset of attribute hierarchies with 5 attributes 54 3. 
3 Parameterizations of hierarchical CDMs We discuss the parameterizations for the DINA ( Junker & Sijtsma, 2001), ACDM ( de la Torre, 2011), and GDINA model with the identity link function ( de la Torre, 2011) when the attributes are hierarchical . An item requiring attributes can classify students into at most classes. A hierarch ical relationship among attributes leads to fewer than classes. A saturated model for an item requiring ind ependent attributes can have item parameters including an intercept, main effect terms, and interaction terms. The number of item parameters can n ot exceed the number of classes. The parameterizations for DINA and ACDM do not change with hie rarchical attributes. The DINA model has two parameters for each item disregarding the q - vector of the item: an intercept and an interaction term (or a guessing parameter and a slipping parameter in an alternative parameterization). Under the A - CDM, an item requiring independent attributes has item parameters (i.e., one intercept and main effect terms). For GDINA, some item parameters ( i.e. , the main effects of nested attributes and some interaction terms) need to be fixed at zero , which is parallel to the parameterizations of the Hierarchical Diagnostic Classification Model (HDCM ; Templin & Bradshaw, 2014 ). Before demonstrat ing the parameterizations of hierarchical models, w e present the parameterizations of three models D INA, ACDM, and GDINA for a single - attribute item and a two - independent - attribute item . The three models are equivalent regarding a single - attribute item but have different parameterizations for an item requiring two independent attributes , which are shown in Table 2 in the form of expected response . 55 Table 2 : Expected responses on two items with two independent attributes Any model DINA A CDM GDINA (00) (10) (01) (11) Note : I tem involves two independent attributes and ; all models the identity link; DIN A = gate; ACDM = additive cognitive diagnosis modeling; GDINA = generalized DINA ; = intercept; = main effect of the k th attribute ( ); = two - way interaction. Su ppose is the prerequisite of (i.e., ) . The item , under each model (DINA, A - CAM, or GDINA) , classifies examinees into two groups: those who master both and its prerequisite and those who have not mastered . The parameterizations of the three hierarchical models are in Table 3 . Under the DINA model, the item has the same parameterizations as . For the parameterizations of the item unde r GDINA , the main effect of the higher - level attribute ( i.e ., ) needs to be fixed at zero . Both ACDM and GDINA have three item parameters. ACDM has an intercept and two main effects. GDINA has an intercept, a main effect, and an interaction effect. Alth ough parameterized differently , the two models become mathematically equivalent for an item measuring two linear attributes. 56 Table 3 : Expected responses on two items with two linear attributes ( ) Any model DINA A CDM GDINA (00) (10) (11) Note : I tem involves two attributes and under a linear hierarchy; all models the identity link; DIN gate; ACDM = additive cognitive diagnosis modeling; GDINA = generalized DINA ; = intercept; = main effect of the k th attribute ( ); = two - way interaction. Next, we consider a situation involving three attributes with one attribute being the prerequisite of the other two as in an inverted pyramid hierarchy (H3.3) . Table 4 presents the parameterizations of three models for . 
For this item, the three models have different parameterizations. The difference between ACDM and GDINA lies in the interaction effect between and . Table 4 : Expected responses on under an inverted pyramid hi erarchy (H3.3) DINA ACDM GDINA (000) (100) (110) (101) (111) Note : The inverted pyramid hierarchy defines , . and do not share a common path . 57 W e then consider a situation involving three attributes with two attributes being the prerequisite of the third one as in a pyramid hierarchy (H3.4) . Table 5 presents the parameterizations of three models for . For this item, the three models have different parameterizations. The difference betwe en ACDM and GDINA lies in the interaction effect between and . Table 5 : Expected responses on under a pyramid hierarchy (H3.4) DINA A CDM GDINA (000) (100) (010) (110) (111) Note : The pyramid hierarchy defines , . and do not share a common path. 3 . 4 Q - matrix of hierarchical CDMs 3.4.1 Reduced or full Q - matrix In previous studies, either a reduced Q - matrix or a full Q - matrix is used. With hierarchical attributes , the argument is around whether it is possible for an ite m to measure a higher - level attribute witho ut measuring its prerequisite (s) . A full Q - matrix allows all types of q - vectors as in an independent - attribute situation. A reduced Q - matrix requires that items that measu r e a higher - level attribute also require a ll its prerequisite(s). In other words, a reduced Q - matrix can only 58 contain q - vectors in (the transpose of the expanded R - matrix ) . We will demonstrate that the reduced Q - matrix approach is equivalent to the full Q - matrix approach under the DINA model . It can be shown that , under the DINA model, a multiple - attribute item is equivalent to the single - attribute item , in which takes the value 1 or 0 if the previous attributes are the direct or indirect prerequisites of the th attribute , or takes the value 0 if the th attribute is not connected wi th the th attribute in any path . The multiple - attribute item and the single - attribute item are equivalent because they classify attribute profiles into the same two groups (i.e., s mastering the th attribute or not) , and they have the same expected response for each group as shown in Table 6 . Therefore, u nder the DINA model with a linear hierarchy, the reduced Q - matrix is equivalent to an identity matrix consisting of single - attribute q - vectors. Table 7 presents the equivalent q - vectors for each row of in the case of three linear attributes . Under the DINA model and any attribute hierarchy, each q - vector in represents a unique type of items ( Table 7 - Table 10 ). Other q - vectors can find their equiva lent one in . Consequently, there would be no difference between the reduced Q - matrix approach and the full Q - matrix approach under the DINA model . However, it is noteworthy that there are less than distinctive q - vectors with hierarchical attribute s. Note that all the single - attribute items are included in under the DINA model. Under the ACDM or GDINA, however, each q - vector is distinctive, and consequently does not include all the single - attribute items. We use H3.2 under the ACDM to dem onstrate this in Table 11 . If the reduced Q - matrix approach is used with ACDM or GDINA, it means that some single - attribute q - vectors will be excluded from the Q - matrix. 
59 Table 6 : The expected responses of two groups of attribute profiles on and under the DINA model Note : stands for the th attribute ; takes the value 1 or 0 if the previous attributes are the direct or indirect prerequisites of the th attribute, or takes the value 0 if the th attribute is not connected with the th attribute in any path ; = intercept ; = interaction. Table 7 : The q - vectors in and their equivalent q - vectors under the DINA model with three linear attributes (H3.2) Equivalent s Attribute Profiles (1 0 0) (0 0 0) (1 0 0) (1 1 0) (1 1 1) (1 1 0) (0 1 0) (0 0 0) (1 0 0) (1 1 0) (1 1 1) (1 1 1) (0 0 1) (1 0 1) (0 1 1) (0 0 0) (1 0 0) (1 1 0) (1 1 1) Note : Single - attribute items are bolded ; = intercept; = interaction. 60 Table 8 : The q - vectors in and their equivalent q - vectors under the DINA model with three inverted pyramid attributes (H3.3) Equivalent s Attribute Profiles (1 0 0) (0 0 0) (1 0 0) (1 1 0) (1 0 1) (1 1 1) (1 1 0) (0 1 0) (0 0 0) (1 0 0) (1 1 0) (1 0 1) (1 1 1) (1 0 1) (0 0 1) (0 0 0) (1 0 0) (1 1 0) (1 0 1) (1 1 1) (1 1 1) * (0 1 1) (0 0 0) (1 0 0) (1 1 0) (1 0 1) (1 1 1) Note : Single - attribute items are bolded ; = intercept; = interaction ; * = q - vector that is not in the R - matrix. Table 9 : The q - vectors in and their equivalent q - vectors under the DINA model with three pyramid attributes (H3.4) Equivalent s Attribute Profiles (1 0 0) (0 0 0) (0 1 0) (1 0 0) (1 1 0) (1 1 1) (0 1 0) (0 0 0) (1 0 0) (0 1 0) (1 1 0) (1 1 1) (1 1 0) * (0 0 0) (1 0 0) (0 1 0) (1 1 0) (1 1 1) (1 1 1) (0 0 1) (1 0 1) (0 1 1) (0 0 0) (1 0 0) (0 1 0) (1 1 0) (1 1 1) Note : Single - attribute items are bolded ; = intercept; = interaction ; * = q - vector that is not in the R - matrix. 61 Table 10 : The q - vectors in and their equivalent q - vectors under the DINA model with four or five attributes Hierarchy Equivalent s Hierarchy Equivalent s H4.2 (1000) H5.4 (10000) (110 0 ) (0100) (11000) (01000) (111 0 ) ( qq 1 0 ) , e.g., (0010) (10100) (00100) (1111) (qqq1), e.g., (0001) (11100) (01100) H4.3 (1000) (11010) (qq010), e.g., (00010) (0001) (11001) (qq001), e.g., (000 0 1) (110 0 ) (0100) (11110) (qq110) (1001) (11101) (qq101) (1 1 1 0 ) ( qq 10) , e.g., (0010) (11011) (qq011) (11 0 1) (0101) (11111) (qq111) (1111) (0111) (1011) (0011) H5.5 (10000) H4.4 (1000) (01000) (110 0 ) (0100) (00100) (1 1 1 0 ) ( qq 10) , e.g., (0010) (11000) (11 0 1) (qq01), e.g., (0001) (10100) (1111) (qq11) (01100) H4.5 (1000) (11100) (0100) (11110) (qqq10), e.g., (00010) (1100) (11111) (qqqq1), e.g., (000 0 1) (1 1 1 0 ) ( qq 10) , e.g., (0010) H5.6 (10000) (1111) (qqq1), e.g., (0001) (01000) H5.2 (10000) (00010) (11000) (01000) (11000) (11100) (qq100), e.g., (00100) (10010) (11110) (qqq10), e.g., (00010) (01010) (11111) (qqqq1) e.g., (00001) (11100) (qq100), e.g., (00100) H5.3 (10000) (11010) (11000) (01000) (11110) (qq110) (11100) (qq100), e.g., (00100) (11111) (qqqq1), e.g., (00001) (11010) (qq010), e.g., (00010) (11001) (qq001), e.g., (00001) (11110) (qq110) (11101) (qq101) (11011) (qq011) (11111) (qq111) Note : q takes the value of 0 or 1. Single - attribute items are bolded. 
62 Table 11 : The q - vectors in and their equivalent q - vectors under the ACDM with three linear attributes (H3.2) Other Attribute Profiles (100) (000) (100) (110) (111) (110) (000) (100) (110) (111) (010) (000) (100) (110) (111) (00 1 ) (000) (100) (110) (111) (101) (000) (100) (110) (111) (011) (000) (100) (110) (111) (111) (000) (100) (110) (111) Note : Single - attribute items are bolded; = intercept; = main effect of attribute . If the reduced Q - matrix approach is taken, there are only three q - vectors under ACDM. However, if the model selection is made at the item level and a n item pool of mixed models can be constructed (Ma et al . , 2015), items calibrated with the DINA model can be included in this item pool. For the linear hierarchy H3.2, for example, the mixed item pool has five distinct item types in Table 12 . If the full Q - matrix approach is taken instead, the mixed item pool can have two more item types: and calibrated by the ACDM. Table 12 : Distinct q - vectors in a mixed item pool under DINA and ACDM for H3.2 using the reduced Q - matrix approach Model Attribute Profiles (100) - (000) (100) (110) (111) (110) ACDM (000) (100) (110) (111) (110) DINA (000) (100) (110) (111) (111) DINA (000) (100) (110) (111) (111) ACDM (000) (100) (110) (111) Note : Single - attribute items are bolded; = intercept; = mai n effect of attribute . 63 3.4.2 Complete Q - matrix for hierarchical attributes A Q - matrix containing the identity matrix is complete for the DINA model with independent attributes , according to Chiu et al. (2009). Since the completeness of a Q - matrix is evaluated by checking whether it holds that for each pair of in the attribute profile space, the completeness will not change if some s are excluded from the attribute profile space. Since there is only one way to define single - attribute items under different models , it is safe to conclude that the identity matrix is complete for any attribute hierarchy under any model. Under the DINA model, is complete since equals to or contains the identity matrix ; another type of complete matrix is the transpose of the R - matrix that equals to the identi ty matrix , consistent with the conclusion of Köhn and Chiu (2018 ) . The expected response vectors given are presented in Table 13 . Table 13 : Expected response vectors given of two Q - matrices ( and ) for the i nve rted pyr amid ( H3.3) under the DINA model (000) (100) (110) (101) (111) Note : Single - attribute items are bolded; = intercept; = main effect of attribute . 64 Under ACDM, one type of items alone would be sufficient for completeness by definition as long as the three main effects ( , , and ) are different from each other ( Table 14 ). Without assuming the differences between , , and , an inspection of Table 14 shows that of each attribute hierarchy is a complete Q - matrix disregarding the attribute hierarchy. Table 14 : Expected response vectors given of five q - vectors for independent attributes under ACDM (000) (100) (010) (001) (110) (101) (011) (111) Note : , , and form the for the linear hierarchy (H3. 2 ); , , , and form the for the inverted pyramid hierarchy (H3.3) ; , , , and form the for the pyramid hierarchy (H3.4). 3.5 Summary In discussing the parameterizations of hierarchical CDMs, we identif ied equivalent models when an attribute hierarchy is present . The three models in the GDINA family parameterize single - attribute items in the same way regardless of the attribute hierarchy . 
The hierarchical ACDM and hierarchical GDINA model are equivalent to each other but different from the hierarchical DINA model when two linearly related attributes are involved in an item. The hierarchical ACDM and GDINA model have different parameterizations when the two attributes involved in an item are independent, where independence means that the two attributes are not on the same path in the tree graph. Under the hierarchical DINA model, the q-vectors in the reduced Q-matrix represent distinct item types. Since the number of distinct q-vectors is smaller with hierarchical attributes than with independent attributes, a full Q-matrix may contain two seemingly different q-vectors that are in fact equivalent. By equivalence, we mean that the items have the same parameterizations and would thus lead to the same classifications of examinees given the same item parameters. For example, under the hierarchical DINA model a multiple-attribute q-vector is equivalent to the single-attribute q-vector for its highest-level attribute when the other required attributes are prerequisites of that attribute. As a result, the choice between the reduced and the full Q-matrix approaches does not make a difference under the hierarchical DINA model.

Under the ACDM or GDINA model, any combination of attributes is a distinct q-vector, so in theory every non-zero q-vector defines a different item type. A reduced Q-matrix under the hierarchical ACDM or GDINA model inevitably excludes the single-attribute items for the higher-level attributes. For example, a reduced Q-matrix in H3.4 (pyramid hierarchy) includes only the two single-attribute q-vectors corresponding to the two lower-level attributes; the single-attribute q-vector for the remaining attribute is excluded. The absence of single-attribute q-vectors in the reduced Q-matrices may have a serious impact on the classifications, which is discussed in the next chapter.

Chapter 4 Conditional KLI-based indices for hierarchical CDMs

4.1 Introduction

In the previous chapter, we discussed two approaches to constructing Q-matrices with hierarchical attributes, focusing on equivalent q-vectors and complete Q-matrices. There are, however, numerous ways to construct the Q-matrix for a test from all the available q-vectors. Previous studies in Q-matrix design simulate tests with different Q-matrices to compare the classification results (Chiu et al., 2009; Liu & Huggins-Manley, 2016; Liu et al., 2017; Madison & Bradshaw, 2015). We address the issue of Q-matrix design from the perspective of item-level and test-level indices. The indices can be used to automate test assembly with a calibrated item pool, and they also provide a basis for comparing different Q-matrix designs.

The existing item-level and test-level indices based on the KL information are overall indices for a population of examinees, and they have been found to be positively correlated with the overall classification rates (Henson & Douglas, 2005; Kuo et al., 2016). However, the correct classification rates (CCRs) can vary substantially across attribute profiles within the same test, whether the attributes are independent or hierarchical. The CCRs conditional on the attribute profile are usually not reported, as most studies only calculate an overall CCR for the population of examinees. With independent attributes, the conditional CCRs differ across attribute profiles when the attributes are measured by different numbers of items. In this situation, attribute-level indices can compensate for overall indices for items or tests (Henson et al., 2008; Kuo et al., 2016).
However, the attribute-level index ADI fails to account for the dependency between attributes created by attribute hierarchies. To address this problem, the modified ADI proposed by Kuo and colleagues (2016) adds weights to the original ADI but remains an overall index for a population of examinees.

The following examples show the need for conditional indices instead of an overall index. Suppose the items are calibrated with the DINA model, with a common intercept and interaction effect for all items. The first Q-matrix contains a multiple-attribute item in addition to three single-attribute items. When this Q-matrix is used for three independent attributes, different attribute profiles have substantially different conditional CCRs. The second example is the identity matrix used to measure three linear attributes. Here the CCRs for complete mastery and complete non-mastery are higher than those for the other profiles.

Figure 9: Correct classification rates under two conditions

Since the goal is to estimate the attribute profile of every examinee accurately, it is necessary to develop an index conditional on the attribute profile, especially when hierarchical attributes are present. This thesis proposes two conditional indices based on the KL information that can be used for non-adaptive test construction and Q-matrix design.

In this chapter, it is assumed that a large number of items have been developed for a well-defined domain and that the Q-matrix, as well as the relationship between attributes, is correctly specified. We take the full Q-matrix approach and allow all types of q-vectors. It is also assumed that item parameter estimates have been obtained from previous calibrations.

4.2 Conditional KL indices for test construction

A set of two indices conditional on the attribute profile is proposed. The two conditional indices summarize the information in the L-by-L test KLI matrix defined in Section 2.1.7, where L is the size of the attribute profile space. The first index is the average of the elements in the row and the column of the test KLI matrix corresponding to the attribute profile of interest. The second index is the range of those elements. The two KLI-based indices were log-transformed to obtain a linear relationship with the CCR (Henson et al., 2008; Henson et al., 2018).

The first index describes the average power of a test to discriminate a given attribute profile from the other attribute profiles. It is expected to be positively correlated with the conditional CCR for that profile. However, this index alone is not sufficient for predicting the CCR because of the multidimensional nature of CDMs. With the first index fixed, if the test does not differentiate well between two particular attribute profiles, the CCR for either of them suffers (Cheng, 2010). This phenomenon was noted in Cheng's (2010) CD-CAT study and compared to the law of the minimum. Therefore, a second index was defined to characterize the weakest point of a test. One particularly low KLI between two attribute profiles leads to a relatively large range given the same mean index. A range measure was used instead of a minimum measure to control for the effect of the mean index. The range index is negatively correlated with the conditional CCR for a profile but has a low or insignificant correlation with the mean index.
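A small sketch may make the two indices concrete. The fragment below is one plausible implementation rather than the definitive one: the test KLI matrix entry for a pair of profiles is taken as the sum over items of the KL divergence between the Bernoulli response distributions the two profiles imply, the first index is the log-transformed mean of the off-diagonal entries in the row and column for the profile of interest, and the second index is computed as the maximum minus the minimum of those entries. The DINA parameterization and the guessing and slipping values are assumptions made for the example.

```python
import numpy as np

def dina_prob(alpha, q, guess, slip):
    """P(correct | alpha) for a DINA item with q-vector q."""
    eta = int(all(a >= b for a, b in zip(alpha, q)))
    return (1 - slip) if eta == 1 else guess

def test_kli_matrix(profiles, Q, guess, slip):
    """L-by-L matrix of test-level KL information; entry (u, v) sums the
    item-level KL divergences between the response distributions implied
    by profile u and profile v."""
    L = len(profiles)
    kli = np.zeros((L, L))
    for u in range(L):
        for v in range(L):
            if u == v:
                continue
            for j, q in enumerate(Q):
                pu = dina_prob(profiles[u], q, guess[j], slip[j])
                pv = dina_prob(profiles[v], q, guess[j], slip[j])
                kli[u, v] += pu * np.log(pu / pv) + (1 - pu) * np.log((1 - pu) / (1 - pv))
    return kli

def conditional_indices(kli, u):
    """Mean (logged) and range of the off-diagonal KLI entries in the row
    and column for profile u.  The dissertation log-transforms the indices;
    here only the mean is logged, as one plausible reading."""
    vals = np.concatenate([np.delete(kli[u, :], u), np.delete(kli[:, u], u)])
    return np.log(vals.mean()), vals.max() - vals.min()

# Three independent attributes, identity Q-matrix repeated three times
profiles = [(a, b, c) for a in (0, 1) for b in (0, 1) for c in (0, 1)]
Q = [(1, 0, 0), (0, 1, 0), (0, 0, 1)] * 3
g = [0.2] * len(Q)
s = [0.2] * len(Q)
kli = test_kli_matrix(profiles, Q, g, s)
for u, alpha in enumerate(profiles):
    print(alpha, conditional_indices(kli, u))
```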
Th e need for the second index is best illustrated by comparing the following two Q - matrices under the DINA model. Three independent attributes are measured with nine items . w here is the identity matrix . Assuming the intercept and the interaction effect for all items, the two indices were calculated for the two tests. The CCRs were also obtained from the simulation. In the T able 15 , the two tests have the same for each attribute profile but the second te st has substantially lower CCRs. Table 15 : KLI indices and the CCRs for two Q - matrices CCR CCR 000 2.20 1.10 0.92 2.20 2.20 0.81 100 2.20 1.10 0.92 2.20 2.20 0.81 010 2.20 1.10 0.92 2.20 2.20 0.81 001 2.20 1.10 0.92 2.20 2.20 0.81 110 2.20 1.10 0.92 2.20 2.20 0.80 101 2.20 1.10 0.91 2.20 2.20 0.80 011 2.20 1.10 0.91 2.20 2.20 0.81 111 2.20 1.10 0.91 2.20 2.20 0.81 70 The difference between the two Q - matrices in Table 15 was referred to as an issue of content balancing in Cheng (2010) since the number of items for each attribute is not balanced in . Given the same , the second index is needed in this case to account for the different CCR s. A larger range index corresponds to a lower CCR. The two conditional KL indices would be good predictor s of the conditional CCR of with a fixed test length . To make them useful for between various test lengths , the following two conditions need to be satisfied: 1. For each , there are no zero off - diagonal entries in the test KLI matrix because is not defined ; 2. There is an o dd number of items in each item type (i.e., a distinct q - vector ) . The first condition is satisfied when the Q - matrix is complete. The second condition is necessary for the indices to be useful because when the examinee correctly respond to half of the items, the examinee is likely to be misclassified. For example, if the test has two items with , for examinees who master attribute , the likelihood function is ; for examinees who do not possess attribute , the l ikelihood function is . It is possible that an examinee correctly answers item 1 but fails at item 2. Then , . W hen the item s are homogenous in quality , the difference between and would be very small. I n an extreme case when and for all items , . KLI - based i tem selection in CD - CAT uses ind ices similar to and ignores the minimum effect. As a result, researchers found it necessary to add extra constr aints to the item selection algorithm in order to improve the CCR (Cheng, 2010) . Such constraints intend to balance attribute coverage , and t his pro cess is also referred to as content balancing (Cheng, 2010). The 71 result of content balancing is a smaller KLI range given the same . When attribute hierarchies are present, content balancing becomes tricky. Using the two indices together in test construction becomes more practical with hierarchical attributes than content balancing . 4.3 Simu lation design A s imulation s tudy was conducted to assess the p erformance of the t wo i ndices . Random tests were generated as described below with items calibrated using DINA or A - CDM . The hierarchical GDINA model is equivalent to A - CDM in most cases, so the GDINA model is excluded from the simulations. The attribute hierarchies shown in 3.2 were used to simulate the examinee responses . The assessment tasks may be embedded in the classroom instruction and scattered in multiple class sessions. As a result, the assessment is not necessarily a concise one. 
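The basic machinery of such a simulation (generating DINA responses for a given Q-matrix and estimating the conditional CCR by maximum-likelihood classification) can be sketched as follows. The fragment is illustrative: the guessing and slipping values and the unbalanced Q-matrix Q2 are assumptions made for the example (the second Q-matrix of Table 15 is not reproduced here), and likelihood ties are resolved in favor of the first candidate profile.

```python
import numpy as np
rng = np.random.default_rng(1)

def dina_p(alpha, q, g, s):
    """P(correct | alpha) under the DINA model with guessing g and slipping s."""
    eta = all(a >= b for a, b in zip(alpha, q))
    return 1 - s if eta else g

def conditional_ccr(Q, profiles, true_alpha, g=0.2, s=0.2, n=5000):
    """Simulate n examinees with the given true profile, classify each by
    maximum likelihood over the candidate profiles, and return the CCR."""
    p_true = np.array([dina_p(true_alpha, q, g, s) for q in Q])
    correct = 0
    for _ in range(n):
        x = (rng.random(len(Q)) < p_true).astype(int)
        # log-likelihood of the response vector under each candidate profile
        ll = []
        for alpha in profiles:
            p = np.array([dina_p(alpha, q, g, s) for q in Q])
            ll.append(np.sum(x * np.log(p) + (1 - x) * np.log(1 - p)))
        # ties go to the first candidate profile in this sketch
        if profiles[int(np.argmax(ll))] == true_alpha:
            correct += 1
    return correct / n

profiles = [(a, b, c) for a in (0, 1) for b in (0, 1) for c in (0, 1)]
I3 = [(1, 0, 0), (0, 1, 0), (0, 0, 1)]
Q1 = I3 * 3                                   # balanced: each attribute measured 3 times
Q2 = I3 + [(1, 0, 0)] * 4 + [(0, 1, 0)] * 2   # unbalanced, same test length
for alpha in [(0, 0, 0), (1, 1, 1)]:
    print(alpha, conditional_ccr(Q1, profiles, alpha), conditional_ccr(Q2, profiles, alpha))
```

With the same test length, the unbalanced matrix Q2 yields visibly lower conditional CCRs, which is the qualitative pattern shown in Table 15.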
We consider test lengths of three, five, and seven times the number of attributes, respectively . For each combination of test length (e.g., nine items) and attribute hierarchy (e.g., H3.2), three sets of tests were simulated. T h e first set of 25 tests consists of single - attribute items, the second set of 25 tests consists of q - vectors from the full Q - matrix calibrated by the DINA model, and the third set of 5 0 tests consists of q - vectors from the full Q - matrix calibrated by the ACDM. The actual Q - matrix for each random test was constructed by randomly sampling from all the possible q - vectors with replacement if the full Q - matrix approach is used or from the id entity matrix if only single - attribute items are wanted. Each Q - matrix contained the identity matrix to e nsure completeness. The re was an odd number of items in each item type (i.e., a distinct q - vector). For all items, t he intercept parameter ( ) was generated from the uniform distribution and was generated from . A total of 5 ,000 examinees are simulated for each true attribute profile for each random test . Given each examinee's attribute profile, item scores are gene rated based on the chosen model. 72 A random variable is generated. The correct response probability is compared with to decide the response of examinee to item : The two conditional indices were calculated for each attribute profile for each random test. C lassification s were accomplished via MLE for independent attributes or restricted MLE for hierarchical attributes b ecause the item parameters are known. C onditional profile - wise CCR were recorded for each . The index of means is supposed to be positively correlated with the CCR , and the index of range is supposed to be negatively correlated with the CCR. For each attribute hierarchy, a linear regression mode l with normal errors was fit using the two indices to predict the CCR : The regression estimates were used to produce a linear combination of the two indices as a combined index , cKLI : The combined index cKLI is expected to be highly correlated with the CCR. 4.4 Simulation results The regression estimates and the for each attribute hierarchy were summarized in Table 1 6 . A combined index was calculated as a linear combination of the two indices using the regression estimates as weights. This combined index (cKLI) was plotted against the CCR conditional on in the following scatter plots to visualize th e relationship s ( see Figure 10 - Figure 2 4 ) . For brevity, we only present the scatter plots for a subset of s when there are more than a ttribute profiles in the space. 73 Table 16 : Regression estimates and for each attribute hierarchy Attribute hierarchy H3.1 Independent - 0.07 0.2 4 0.7 6 H3.2 Linear - 0.07 0.19 0. 78 H3.3 Inverted pyramid - 0.07 0.2 0 0.74 H3.4 Pyramid - 0.06 0.21 0.79 H4.1 Independent - 0.08 0.27 0.82 H4.2 Linear - 0.08 0.21 0.80 H4.3 Linear+single - 0.08 0.24 0.80 H4.4 Inverted pyramid - 0.08 0.22 0.81 H4.5 Pyramid - 0.07 0.22 0.81 H5.1 Independent - 0.07 0.28 0.81 H5.2 Linear - 0.09 0.21 0.82 H5.3 Inverted pyramid I - 0.08 0.26 0.82 H5.4 Inverted pyramid II - 0.08 0.26 0.81 H5.5 Pyramid I - 0.08 0.25 0.80 H5.6 Pyramid II - 0.08 0.25 0.81 Table 17 : The overall correlation and the correlations for different test lengths between cKLI and the CCR Attribute hierarchy All Test length H3.1 Independent 0 .8 7 0.60 0.76 0.85 H3.2 Linear 0 . 
88 0.83 0.87 0.87 H3.3 Inverted pyramid 0 .8 6 0.79 0.82 0.84 H3.4 Pyramid 0 .89 0.81 0.88 0.86 H4.1 Independent 0 .90 0.77 0.82 0.88 H4.2 Linear 0 . 89 0.86 0.86 0.86 H4.3 Linear+single 0 .90 0.81 0.84 0.85 H4.4 Inverted pyramid 0 .90 0.83 0.87 0.89 H4.5 Pyramid 0 .90 0.84 0.87 0. 8 7 H5.1 Independent 0 .90 0.77 0.87 0.85 H5.2 Linear 0 .9 0 0.85 0.91 0.87 H5.3 Inverted pyramid I 0 .91 0.84 0.88 0.91 H5.4 Inverted pyramid II 0.90 0.81 0.88 0.89 H5.5 Pyramid I 0.90 0.80 0.87 0.90 H5.6 Pyramid II 0.90 0.84 0.87 0.88 Note : is the number of attributes . The overall correlation between cKLI and the CCR is presented in Table 17. All the overall correlations are around . The correlations for different test lengths are also calculated (Table 74 17). The correlation gener ally increases substantially as the test length goes up from three times of to five or seven times of where is the number of attributes. This trend can also be seen in the scatter plots. Figure 10 : A plot for tests with three independent attributes (H3.1) of the combined index with CCRs 75 Figure 11 : A plot for tests with three linear attributes (H3.2) of the combined index with CCRs 76 Figure 12 : A plot for tests with three inverted pyramid attributes (H3. 3 ) of the combined index with CCRs 77 Figure 13 : A plot for tests with three pyramid attributes (H3. 4 ) of the combined index with CCRs 78 Figure 14 : A plot for tests with four independent attributes (H 4 .1) of the combined index w i th CCRs 79 Figure 15 : A plot for tests with four linear attributes (H4.2) of the combined index with CCRs 80 Figure 16 : A plot for tests with three linear attributes + one single attribute (H4.3) of the combined index with CC Rs 81 Figure 17 : A plot for tests with four inverted pyramid attributes (H4. 4 ) of the combined index with CCRs 82 Figure 18 : A plot for tests with four pyramid attributes (H4. 5 ) of the combined index with CCRs 83 Figure 19 : A plot for tests with five independent attributes (H5 .1 ) of the combined index with CCRs 84 Figure 20 : A plot for tests with five linear attributes (H5. 2 ) of the combined index with CCRs 85 Figure 21 : A plot for tests with five in v erted pyramid attributes (H5. 3 ) of the combined index with CCRs 86 Figure 22 : A plot for tests with five in v erted pyramid attributes (H5. 4 ) of the combined index with CCRs 87 Figure 23 : A plot for tests with five pyramid attributes (H5. 5 ) of the combined index with CCRs 88 Figure 24 : A plot for tests with five pyramid attributes (H5. 6 ) of the combined index with CCRs 4.5 Discussion The two indices can predict the CCR well according to the linear regression results showing that abut 80% of the total variance was explained . The prediction of two indices was a substantial improve from the prediction of ei t her index alone. This relationship was also reflected by the high correlation between t he combined index and the CCR. These results suggest t hat using an averaged KLI may not be sufficient for predicting CCR. Therefore, any single index based on the maximum or the mean of KLI would have serious limitations as a test construction index. As mentioned earlier, it has been found necessary to add extra constraints to the item selection algorithm based on a single KLI index, in order to improve the CCR in CD - CAT research (Cheng, 2010). Such constraints would lead to a decreased range index , and balance d attribute coverage 89 would be observed with independent attributes . 
In other words, content balancing could have the same effect as using a range index when attributes are independent. With hierarchical attributes, however, there is no clear way to define content balancing. Therefore, using the two indices together in test construction is more appropriate with hierarchical attributes than content balancing. This applies to both non-adaptive and adaptive test construction.

It is important to note that the relationship between the indices and the CCR does not depend on the model selected (DINA or ACDM) or on the test length. However, the relationship may depend on the attribute hierarchy, more specifically on the number of attribute profiles, as suggested by the different regression estimates in Table 16. Moreover, the indices lead to better predictions of the CCR as the test length increases.

The proposed indices can be used to assemble tests from an item pool by setting an information target or a fixed test length. Setting an appropriate information target may not be easy because, on the one hand, a target needs to be set for each attribute profile and, on the other hand, as also noted in Henson et al. (2018), the threshold value that would ensure a certain CCR may depend on the number of attributes and the attribute hierarchy. If the test length is fixed, the test assembly algorithm could take two steps: a set of tests with the largest mean KLI is identified first, and then the one with the largest minimum KLI or the smallest range index is chosen. Alternatively, the regression estimates in Table 16 can be used to calculate the combined index. The test assembly can be automated in various ways. With the two information indices, we can ensure that each attribute "is measured by an adequate number of items" (Cheng, 2010, p. 903).

The relationship between the combined index and the CCR is visualized for each attribute profile in Figure 11 - Figure 24 because the CCR can vary substantially between attribute profiles. We chose four random tests in condition H4.2 to demonstrate the variation of the CCR in Figure 25.

Figure 25: The conditional CCRs from four random tests in H4.2

With a linear hierarchy and an identity matrix as the Q-matrix, the attribute profiles that reflect mastery of some but not all attributes are more likely to be misclassified than the two attribute profiles at the two ends (i.e., the one with all 0s and the one with all 1s). This pattern can also be explained in terms of the KL indices (Table 18). Another way to see the varying CCRs for a linear hierarchy is that the single-attribute item for the lowest-level attribute differentiates the all-zero profile from every other profile, and the single-attribute item for the highest-level attribute differentiates the all-one profile from every other profile; as a result, these two profiles have higher classification rates than the profiles in the middle.

We use the two KLI-based indices to compare the full and reduced Q-matrix approaches under the ACDM. As mentioned earlier, the two approaches are not equivalent under the ACDM: the major difference between a full Q-matrix and a reduced Q-matrix under the ACDM is the exclusion of some single-attribute items from the reduced Q-matrix. Therefore, we compare the identity matrix with the reduced Q-matrix under the ACDM in terms of the two indices for a linear hierarchy of three attributes (H3.2). The item parameters are presented in Table 18, and the indices for the two three-item tests are shown in Table 19.
If the reduced Q - matrix approach is adopted and all the items are calibrated with the ACDM, classifications for the attribute profiles , , and become much more difficu lt. As suggested by the combined index, much l onger tests are required to achieve comparable classification rates for most of the attribute profiles if two types of single - attribute items, and , are excluded from the candi date pool . In addition to the consideration of classification efficiency, t he choice between a reduced Q - matrix and a full Q - matrix should depend on answers to questions such as whether it is possible to a mixed item pool, whether it is possible to develop a certain ite m type, and the model - data fit at the item level . Table 18 : Item parameters of five items for H3.2 Model ( 010 ) - 0.1 0.8 ( 001 ) - 0.1 0.8 ( 100 ) - 0.1 0.8 ( 110 ) ACDM 0.1 0.4 0.4 ( 111 ) ACDM 0.1 0.27 0.27 0.27 Table 19 : Comparison between two three - item tests in terms of the two indices cKLI cKLI 000 1.26 1.10 0.05 1.38 0.82 0.12 100 0.85 0.69 0.04 0.33 1.35 - 0.17 110 0.85 0.69 0.04 0.52 2.84 - 0.38 111 1.26 1.10 0.05 0.80 3.34 - 0.42 92 Chapter 5 Q - matrix design for nonparametric classifications with hierarchical attributes 5.1 Introduction W ithout a calibrated item pool, the nonparametric classification (NPC) method (Chiu, Sun, & Bian, 2018) provide an alternative approach f or classifications . The NPC method allows the t eachers to develop the ir own items based on CDMs if they can identify the attribute hierarchy and the Q - matrix. There is no need for item calibratio n , and s tudents are classified based on their response data without the need to estimate item parameters. T he Q - matrix design plays an even more important role in nonpa rametric classifications than in parametric classifications , but it has not been formally addressed in the literature . Related studies explore different Q - matrix designs with hierarchical attributes in the context of parametric classifications (Liu, Huggins - Manley, & Bradshaw 2017; Tu, Wang, Cai, Douglas, & Chang, 2018). There is a consensus on the effect of single - structured items on accurate classifications regardless of the attribute hierarchy (Chiu et al., 2009; DeCarlo, 2011; Madison & Bradshaw, 2015). However, t he role of items with multiple attributes is not clear. Other factors in Q - matrix design that receive less attention in existing research include test length and the number of items in each item type. In this study , the NPC method (Chiu & Douglas, 2013) was used Because it is assumed that the teacher develops a CDM - based test for a particular classroom . Prior data are not expected to be available. Therefore, the general nonparametric classification method (Chiu, Sun, & Bian, 2018) that requ ires some prior response data is not considered . 93 5.2 Ties in NPC There is a tie w hen the observed response pattern of an examinee is at an equal distance to more than one ideal response pattern. Some Q - matrices lead to more ties than others. With an ideal Q - matrix, the item responses of high probabilities are always closest to the ideal response pattern of the true and there would be no ties in the hamming distance . In this study, if a tie occurs, the examinee would be randomly classified into one attribute profile with the minimal hamming distance. We present a comparison between two Q - matrices as an example . The underlying model is the DINA model. The item quality is assumed to be high : . Three independent attri butes are involved . 
We focus on an examinee with . The hamming distance s between several likely response patterns of and each of the ideal response patterns are shown in the cell s of Table 20 and Table 2 1 . With an identity matrix as the Q - matrix, there are no ties in the hamming distance and the probability of correctly classifying the examinee with equals to the probability of observing the response pattern of , which is . When t he Q - matrix contains the identity matr ix and an item probing all three attributes , ties are observed when the examinee slips on one of the items ( Table 2 1 ) . The probability of a tie is the probability of observing such a response patter, which is . It is still possible to clarify the examinee with a tie in the hamming distance. T he CCR for can be calculated as a weighted sum of probabilities : . Comparing the two Q - matrices reveals that adding a n item with to the identity matrix leads to a slight increase in the CCR for from to . The second Q - matrix leads to a probability of 0.29 to obtain a tie. 94 Table 20 : Hamming distance s for with (H3.1) Response pattern Probability : (Ideal response pattern) : (000) : (100) : (010) : (001) : (110) : (101) : (011) : (111) (111) 3 2 2 2 1 1 1 0 (110) 2 1 1 3 0 2 2 1 (101) 2 1 3 1 2 0 2 1 (011) 2 3 1 1 2 2 0 1 Table 21 : Hamming distance s for with (H3.1) Response pattern Probability : (Ideal response pattern) : (000) : (100) : (010) : (001) : (110) : (101) : (011) : (111) (1111) 4 3 3 3 2 2 2 0 (1110) 3 2 2 2 1 1 1 1 (1101) 3 2 2 4 1 1 3 1 (1011) 3 2 4 2 3 1 3 1 (0111) 3 4 2 2 3 3 1 1 5. 3 Simulation design T he identity matrix serve d as the baseline Q - matri x . We consider ed the following situations : 1) a dding one or two simple - attribute items to the baseline, 2) adding one or two multiple - attribute items to be baseline, and 3) adding an identity matrix. A total of 15, 19, and 23 Q - matrices are obtained for 3, 4, and 5, respectively , presented in Table 2 2 . The computations of CCRs and the probability of a tie become more complicated with a longer test or more attributes. Therefore, a simulation study was conducted to compare Q - matrices. Item parameters were simulated based on . A total of 5 ,000 examinees are simulated for each true attribute profile for each Q - matrix . Given each examinee's attribute profile, item scores are gene rated based on the DINA . A random variable is generated. The correct response probability is compared with to decide the response of examinee to item : 95 Examinee responses were classified using the nonparametric classification method (Chiu & Douglas, 2013). C onditional profile - wise CCR were recorded for each . The percent of ties was recorded for each simulation condition as an estimate of the probabilit y of getting a tie . Table 22 : Q - matrix designs for th e simulation study of nonparametric classifications Q - matrix Q - matrix Q - matrix 3 - 1 4 - 1 5 - 1 3 - 2 4 - 2 5 - 2 3 - 3 4 - 3 5 - 3 3 - 4 4 - 4 5 - 4 3 - 5 4 - 5 5 - 5 3 - 6 4 - 6 5 - 6 3 - 7 4 - 7 5 - 7 3 - 8 4 - 8 5 - 8 3 - 9 4 - 9 5 - 9 3 - 10 4 - 10 5 - 10 3 - 11 4 - 11 5 - 11 3 - 12 4 - 12 5 - 12 3 - 13 4 - 13 5 - 13 3 - 14 4 - 14 5 - 14 3 - 15 4 - 15 5 - 15 4 - 16 5 - 16 4 - 17 5 - 17 4 - 18 5 - 18 4 - 19 5 - 19 5 - 20 5 - 21 5 - 22 5 - 23 96 5. 4 Simulation results Simulation results for the conditions with three attributes are summarized in Table 22 - Table 25 . For brevity, we only present the results for four attribute profiles. 
Comparing each Q-matrix to the baseline (Q3-1), we found that a very high probability of obtaining a tie usually indicates no increase in the CCR, whereas a lack of ties indicates an increased CCR for at least some attribute profiles. A longer test does not necessarily lead to a higher CCR for each attribute profile. As shown in Table 23, adding a single-attribute item to the baseline Q-matrix does not lead to an increased CCR with three independent attributes. The lack of change can be explained by the ties in Hamming distances, which cancel the effect of adding one more item. A tie is more likely when the Q-matrix contains an even number of items with a given q-vector. Adding one multiple-attribute item slightly increases the CCR of the profiles it helps distinguish. In the above conditions, ties are likely to occur for all or some attribute profiles. However, when two items of each q-vector are added to the baseline, as in Q3-5, Q3-6, and Q3-7, the CCRs of all or some attribute profiles increase substantially, and almost no ties are observed.

With a linear hierarchy, all q-vectors have equivalent single-attribute q-vectors. Therefore, all the Q-matrices contain single-attribute q-vectors. The comparisons between Q3-2 and Q3-5, between Q3-3 and Q3-6, and between Q3-4 and Q3-7 in Table 24 suggest that a large probability of obtaining ties hurts the classifications. For example, the CCR for one profile increases slightly after a single item is added (Q3-2), but the classifications for the other attribute profiles do not benefit. When two such items are added (Q3-5), the CCRs for several profiles increase substantially. The probability of a tie decreases from 0.23 (Q3-2) to 0.08 (Q3-5) when another item with the same q-vector is added to the Q-matrix. Similar patterns can be found for the inverted pyramid and pyramid hierarchies in Table 25 and Table 26.

The negative effect of having an even number of items of an item type is highlighted by the comparison between Q3-1, Q3-8, and Q3-15 in Table 23 - Table 26. When the Q-matrix consists of two identity matrices, the CCR for each attribute profile does not change or increases only slightly compared to the baseline. However, when the Q-matrix consists of three identity matrices, the CCR for each attribute profile increases substantially.

Summarizing the simulation results for three attributes, we conclude that tests in which some q-vector appears an even number of times are less efficient than tests in which each q-vector appears an odd number of times. When a q-vector appears an even number of times and the item quality is homogeneous, ties are more likely than in the baseline situation of each attribute hierarchy, and consequently the effect of the extra test length is partially or completely canceled out. This conclusion also applies to the conditions with four or five attributes, shown in Table 27 - Table 37.
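The NPC procedure and the tie-handling rule used in these simulations can be summarized in a short sketch. The fragment below is illustrative rather than a reproduction of the simulation code: it builds the ideal DINA response pattern for each candidate profile, classifies a simulated response by minimum Hamming distance with random tie-breaking, and reports the conditional CCR and the tie rate. With guessing and slipping of 0.1, the baseline CCR for the all-mastery profile is about 0.9^3, roughly 0.73, and adding one item measuring all three attributes raises it slightly, consistent with the worked example in Section 5.2.

```python
import numpy as np
rng = np.random.default_rng(7)

def ideal_pattern(alpha, Q):
    """Ideal DINA response pattern: 1 where all required attributes are mastered."""
    return np.array([int(all(a >= b for a, b in zip(alpha, q))) for q in Q])

def npc_classify(x, profiles, Q):
    """Return the profile whose ideal pattern is closest to x in Hamming
    distance, breaking ties at random, plus a flag for whether a tie occurred."""
    d = np.array([np.sum(np.abs(x - ideal_pattern(a, Q))) for a in profiles])
    winners = np.flatnonzero(d == d.min())
    tie = len(winners) > 1
    return profiles[rng.choice(winners)], tie

def npc_ccr(Q, profiles, true_alpha, g=0.1, s=0.1, n=5000):
    """Simulate n examinees with the true profile and return (CCR, tie rate)."""
    eta = ideal_pattern(true_alpha, Q)
    p = np.where(eta == 1, 1 - s, g)          # response probabilities
    hits = ties = 0
    for _ in range(n):
        x = (rng.random(len(Q)) < p).astype(int)
        est, tie = npc_classify(x, profiles, Q)
        hits += est == true_alpha
        ties += tie
    return hits / n, ties / n

profiles = [(a, b, c) for a in (0, 1) for b in (0, 1) for c in (0, 1)]
baseline = [(1, 0, 0), (0, 1, 0), (0, 0, 1)]      # Q3-1, the identity matrix
with_111 = baseline + [(1, 1, 1)]                  # identity plus one (111) item
print(npc_ccr(baseline, profiles, (1, 1, 1)))
print(npc_ccr(with_111, profiles, (1, 1, 1)))
```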
98 Table 23 : NPC results for H3.1 Q CCR Pr(tie) 3 - 1 3 1 1 1 0 0 0.73 0.74 0.74 0.74 0.00 0.00 0.00 0.00 3 - 2 4 2 1 1 0 0 0.73 0.74 0.73 0.73 0.18 0.18 0.18 0.18 3 - 3 4 1 1 1 1 0 0.72 0.71 0.76 0.76 0.03 0.16 0.24 0.25 3 - 4 4 1 1 1 0 1 0.73 0.72 0.72 0.77 0.00 0.02 0.15 0.28 3 - 5 5 3 1 1 0 0 0.78 0.79 0.77 0.79 0.00 0.00 0.00 0.00 3 - 6 5 1 1 1 2 0 0.73 0.75 0.86 0.86 0.01 0.07 0.02 0.02 3 - 7 5 1 1 1 0 2 0.74 0.72 0.75 0.93 0.00 0.02 0.07 0.03 3 - 8 6 2 2 2 0 0 0.74 0.73 0.74 0.71 0.46 0.45 0.46 0.45 3 - 9 7 3 2 2 0 0 0.79 0.79 0.78 0.78 0.32 0.33 0.33 0.33 3 - 10 7 2 2 2 1 0 0.74 0.78 0.85 0.85 0.44 0.32 0.18 0.18 3 - 11 7 2 2 2 0 1 0.73 0.73 0.78 0.93 0.46 0.44 0.33 0.02 3 - 12 8 4 2 2 0 0 0.78 0.79 0.79 0.79 0.36 0.37 0.36 0.36 3 - 13 8 2 2 2 2 0 0.73 0.79 0.86 0.85 0.45 0.36 0.25 0.25 3 - 14 8 2 2 2 0 2 0.73 0.73 0.78 0.93 0.44 0.44 0.36 0.11 3 - 15 9 3 3 3 0 0 0.92 0.92 0.91 0.92 0.00 0.00 0.00 0.00 Note : J = test length ; = number of items with a certain q - vector; CCR = correct classification rate . 99 Table 24 : NPC results for H3. 2 Q CCR Pr(tie) 3 - 1 3 1 1 1 0.85 0.77 0.77 0.86 0.09 0.10 0.08 0.09 3 - 2 4 2 1 1 0.90 0.77 0.80 0.85 0.17 0.23 0.03 0.09 3 - 3 4 1 2 1 0.89 0.81 0.80 0.89 0.04 0.15 0.15 0.03 3 - 4 4 1 1 2 0.85 0.80 0.77 0.89 0.10 0.03 0.23 0.17 3 - 5 5 3 1 1 0.95 0.84 0.79 0.86 0.03 0.08 0.03 0.08 3 - 6 5 1 3 1 0.89 0.87 0.87 0.89 0.02 0.02 0.03 0.03 3 - 7 5 1 1 3 0.85 0.80 0.83 0.96 0.08 0.03 0.09 0.03 3 - 8 6 2 2 2 0.89 0.82 0.81 0.89 0.18 0.33 0.33 0.19 3 - 9 7 3 2 2 0.97 0.86 0.81 0.89 0.01 0.18 0.32 0.18 3 - 10 7 2 3 2 0.89 0.88 0.88 0.90 0.18 0.18 0.17 0.17 3 - 11 7 2 2 3 0.89 0.81 0.86 0.97 0.19 0.31 0.19 0.01 3 - 12 8 4 2 2 0.97 0.86 0.82 0.89 0.05 0.23 0.33 0.19 3 - 13 8 2 4 2 0.90 0.87 0.88 0.90 0.18 0.22 0.22 0.19 3 - 14 8 2 2 4 0.89 0.81 0.87 0.97 0.20 0.31 0.22 0.05 3 - 15 9 3 3 3 0.97 0.94 0.94 0.97 0.01 0.01 0.01 0.01 Note : J = test length; = number of items with a certain q - vector; CCR = correct classification rate. 100 Table 25 : NPC results for H3. 3 Q CCR Pr(tie) 3 - 1 3 1 1 1 0 0.81 0.72 0.78 0.81 0.17 0.02 0.08 0.02 3 - 2 4 2 1 1 0 0.88 0.75 0.80 0.80 0.16 0.14 0.02 0.01 3 - 3 4 1 2 1 0 0.85 0.72 0.80 0.81 0.11 0.18 0.17 0.18 3 - 4 4 1 1 1 1 0.81 0.72 0.76 0.83 0.17 0.05 0.24 0.24 3 - 5 5 3 1 1 0 0.95 0.78 0.80 0.81 0.04 0.00 0.02 0.00 3 - 6 5 1 3 1 0 0.85 0.79 0.87 0.87 0.11 0.01 0.02 0.00 3 - 7 5 1 1 1 2 0.81 0.72 0.80 0.95 0.17 0.04 0.15 0.02 3 - 8 6 2 2 2 0 0.88 0.75 0.80 0.80 0.19 0.44 0.33 0.33 3 - 9 7 3 2 2 0 0.97 0.78 0.81 0.81 0.01 0.31 0.33 0.32 3 - 10 7 2 3 2 0 0.89 0.79 0.87 0.87 0.19 0.32 0.18 0.18 3 - 11 7 2 2 2 1 0.88 0.74 0.86 0.95 0.19 0.44 0.18 0.01 3 - 12 8 4 2 2 0 0.96 0.78 0.80 0.81 0.05 0.35 0.34 0.33 3 - 13 8 2 4 2 0 0.89 0.79 0.87 0.87 0.19 0.36 0.22 0.23 3 - 14 8 2 2 2 2 0.88 0.73 0.86 0.95 0.20 0.43 0.23 0.08 3 - 15 9 3 3 3 0 0.96 0.91 0.94 0.94 0.01 0.00 0.01 0.00 Note : J = test length; = number of items with a certain q - vector; CCR = correct classification rate. 
101 Table 26 : NPC results for H3.4 Q CCR Pr(tie) 3 - 1 3 1 1 0 1 0.81 0.76 0.73 0.81 0.02 0.08 0.02 0.17 3 - 2 4 2 1 0 1 0.81 0.78 0.73 0.84 0.19 0.25 0.17 0.11 3 - 3 4 1 1 1 1 0.80 0.78 0.76 0.88 0.03 0.15 0.22 0.04 3 - 4 4 1 1 0 2 0.80 0.81 0.73 0.88 0.01 0.02 0.14 0.16 3 - 5 5 3 1 0 1 0.87 0.83 0.79 0.85 0.00 0.08 0.01 0.11 3 - 6 5 1 1 2 1 0.81 0.82 0.86 0.88 0.02 0.10 0.02 0.04 3 - 7 5 1 1 0 3 0.81 0.80 0.78 0.95 0.00 0.02 0.01 0.04 3 - 8 6 2 2 0 2 0.81 0.80 0.73 0.88 0.33 0.34 0.45 0.20 3 - 9 7 3 2 0 2 0.87 0.86 0.80 0.90 0.18 0.19 0.32 0.18 3 - 10 7 2 2 1 2 0.81 0.86 0.86 0.89 0.32 0.18 0.19 0.17 3 - 11 7 2 2 0 3 0.81 0.81 0.80 0.97 0.32 0.31 0.31 0.01 3 - 12 8 4 2 0 2 0.87 0.87 0.79 0.89 0.22 0.24 0.36 0.18 3 - 13 8 2 2 2 2 0.80 0.86 0.86 0.89 0.33 0.23 0.25 0.19 3 - 14 8 2 2 0 4 0.81 0.81 0.78 0.96 0.34 0.32 0.36 0.06 3 - 15 9 3 3 0 3 0.95 0.94 0.92 0.96 0.00 0.01 0.00 0.01 Note : J = test length; = number of items with a certain q - vector; CCR = correct classification rate. 102 Table 27 : NPC results for H4.1 Q CCR Pr(tie) 4 - 1 4 0.65 0.65 0.65 0.65 0.66 0.00 0.00 0.00 0.00 0.00 4 - 2 5 0.66 0.64 0.66 0.65 0.66 0.17 0.18 0.18 0.19 0.18 4 - 3 5 0.65 0.64 0.68 0.68 0.68 0.03 0.17 0.25 0.24 0.25 4 - 4 5 0.67 0.67 0.64 0.72 0.70 0.00 0.02 0.14 0.28 0.29 4 - 5 5 0.66 0.65 0.64 0.64 0.73 0.00 0.00 0.02 0.13 0.33 4 - 6 6 0.71 0.72 0.71 0.70 0.71 0.00 0.00 0.00 0.00 0.00 4 - 7 6 0.67 0.68 0.76 0.77 0.77 0.01 0.08 0.02 0.02 0.02 4 - 8 6 0.66 0.65 0.67 0.84 0.84 0.00 0.02 0.06 0.03 0.03 4 - 9 6 0.65 0.65 0.65 0.67 0.90 0.00 0.00 0.01 0.06 0.04 4 - 10 8 0.66 0.66 0.66 0.65 0.66 0.54 0.55 0.54 0.56 0.55 4 - 11 9 0.71 0.71 0.71 0.70 0.71 0.44 0.45 0.46 0.45 0.45 4 - 12 9 0.64 0.71 0.77 0.77 0.76 0.55 0.44 0.32 0.33 0.34 4 - 13 9 0.66 0.64 0.70 0.83 0.84 0.54 0.55 0.45 0.20 0.19 4 - 14 9 0.66 0.65 0.65 0.69 0.91 0.54 0.54 0.55 0.45 0.03 4 - 15 10 0.71 0.71 0.72 0.71 0.71 0.46 0.48 0.47 0.47 0.47 4 - 16 10 0.66 0.70 0.76 0.77 0.77 0.54 0.48 0.38 0.39 0.38 4 - 17 10 0.64 0.66 0.70 0.83 0.84 0.56 0.55 0.48 0.27 0.27 4 - 18 10 0.65 0.66 0.65 0.71 0.91 0.54 0.54 0.55 0.47 0.14 4 - 19 12 0.90 0.89 0.89 0.89 0.89 0.00 0.00 0.00 0.00 0.00 Table 28 : NPC results for H4.2 Q CCR Pr(tie) 4 - 1 4 0.85 0.76 0.74 0.77 0.86 0.10 0.10 0.16 0.09 0.10 4 - 2 5 0.89 0.76 0.76 0.77 0.84 0.17 0.23 0.11 0.09 0.09 4 - 3 5 0.88 0.79 0.76 0.79 0.85 0.03 0.16 0.22 0.04 0.09 4 - 4 5 0.85 0.79 0.76 0.79 0.88 0.10 0.04 0.22 0.16 0.03 4 - 5 5 0.85 0.78 0.76 0.78 0.88 0.09 0.10 0.10 0.23 0.18 4 - 6 6 0.96 0.82 0.77 0.77 0.84 0.03 0.10 0.11 0.09 0.11 4 - 7 6 0.88 0.86 0.81 0.80 0.85 0.03 0.02 0.11 0.03 0.09 4 - 8 6 0.86 0.80 0.82 0.87 0.89 0.10 0.04 0.12 0.02 0.03 4 - 9 6 0.85 0.76 0.76 0.82 0.96 0.09 0.09 0.12 0.09 0.02 4 - 10 8 0.89 0.80 0.79 0.80 0.89 0.18 0.34 0.34 0.33 0.19 4 - 11 9 0.97 0.87 0.80 0.82 0.89 0.01 0.17 0.32 0.32 0.19 4 - 12 9 0.89 0.88 0.86 0.81 0.89 0.17 0.17 0.20 0.32 0.18 4 - 13 9 0.89 0.81 0.86 0.87 0.90 0.19 0.31 0.19 0.18 0.17 4 - 14 9 0.89 0.79 0.80 0.87 0.97 0.19 0.33 0.34 0.18 0.01 4 - 15 10 0.97 0.88 0.80 0.80 0.89 0.05 0.22 0.34 0.33 0.19 4 - 16 10 0.90 0.87 0.87 0.82 0.89 0.18 0.22 0.23 0.32 0.19 4 - 17 10 0.89 0.81 0.86 0.87 0.89 0.20 0.33 0.23 0.22 0.19 4 - 18 10 0.89 0.82 0.80 0.87 0.97 0.19 0.32 0.34 0.23 0.06 4 - 19 12 0.97 0.95 0.94 0.95 0.97 0.01 0.01 0.02 0.01 0.01 103 Table 29 : NPC results for H4.3 Q CCR Pr(tie) 4 - 1 4 0.77 0.70 0.70 0.76 0.76 0.09 0.09 0.10 0.09 0.09 4 - 2 5 0.79 0.69 0.72 0.76 0.76 0.18 0.24 0.03 0.09 0.10 4 - 3 5 0.80 0.73 0.73 0.79 0.80 0.03 0.15 0.14 0.03 0.03 
4 - 4 5 0.76 0.71 0.69 0.79 0.79 0.10 0.03 0.23 0.17 0.16 4 - 5 5 0.77 0.69 0.70 0.76 0.82 0.09 0.09 0.11 0.22 0.24 4 - 6 6 0.86 0.76 0.73 0.76 0.76 0.02 0.09 0.03 0.09 0.09 4 - 7 6 0.81 0.77 0.78 0.80 0.79 0.02 0.03 0.03 0.02 0.03 4 - 8 6 0.75 0.72 0.76 0.87 0.86 0.10 0.03 0.09 0.02 0.03 4 - 9 6 0.76 0.69 0.69 0.79 0.94 0.10 0.09 0.10 0.15 0.04 4 - 10 8 0.80 0.72 0.73 0.81 0.81 0.34 0.46 0.45 0.34 0.33 4 - 11 9 0.87 0.77 0.74 0.81 0.79 0.19 0.34 0.44 0.32 0.35 4 - 12 9 0.81 0.78 0.80 0.81 0.80 0.33 0.33 0.32 0.31 0.32 4 - 13 9 0.80 0.72 0.78 0.86 0.87 0.34 0.45 0.33 0.19 0.18 4 - 14 9 0.80 0.73 0.73 0.85 0.95 0.34 0.46 0.44 0.18 0.01 4 - 15 10 0.88 0.78 0.73 0.80 0.80 0.22 0.38 0.46 0.34 0.34 4 - 16 10 0.80 0.78 0.78 0.81 0.80 0.33 0.36 0.37 0.32 0.33 4 - 17 10 0.80 0.72 0.78 0.87 0.88 0.33 0.46 0.37 0.23 0.22 4 - 18 10 0.80 0.73 0.72 0.86 0.94 0.33 0.44 0.46 0.23 0.09 4 - 19 12 0.94 0.91 0.92 0.94 0.95 0.01 0.01 0.01 0.01 0.00 Table 30 : NPC results for H4.4 Q CCR Pr(tie) 4 - 1 4 0.84 0.73 0.69 0.78 0.76 0.10 0.16 0.09 0.09 0.09 4 - 2 5 0.87 0.74 0.73 0.77 0.76 0.18 0.29 0.03 0.09 0.10 4 - 3 5 0.86 0.79 0.73 0.80 0.81 0.05 0.14 0.13 0.03 0.03 4 - 4 5 0.83 0.76 0.70 0.77 0.79 0.10 0.11 0.24 0.25 0.17 4 - 5 5 0.86 0.73 0.69 0.76 0.75 0.09 0.16 0.11 0.23 0.24 4 - 6 6 0.96 0.78 0.74 0.79 0.76 0.03 0.17 0.03 0.08 0.09 4 - 7 6 0.89 0.86 0.78 0.80 0.78 0.03 0.04 0.02 0.02 0.02 4 - 8 6 0.85 0.76 0.75 0.83 0.86 0.09 0.11 0.09 0.08 0.02 4 - 9 6 0.85 0.72 0.69 0.79 0.80 0.09 0.16 0.10 0.15 0.15 4 - 10 8 0.88 0.80 0.73 0.80 0.81 0.19 0.33 0.44 0.34 0.33 4 - 11 9 0.97 0.85 0.73 0.80 0.81 0.01 0.20 0.44 0.34 0.33 4 - 12 9 0.90 0.87 0.79 0.81 0.81 0.18 0.19 0.30 0.31 0.32 4 - 13 9 0.89 0.80 0.78 0.87 0.87 0.18 0.32 0.33 0.19 0.18 4 - 14 9 0.88 0.79 0.73 0.85 0.86 0.19 0.34 0.44 0.19 0.18 4 - 15 10 0.97 0.86 0.73 0.80 0.79 0.05 0.23 0.43 0.34 0.34 4 - 16 10 0.89 0.87 0.79 0.81 0.80 0.19 0.22 0.36 0.33 0.33 4 - 17 10 0.88 0.81 0.77 0.86 0.88 0.19 0.34 0.37 0.23 0.22 4 - 18 10 0.89 0.80 0.71 0.86 0.86 0.19 0.33 0.45 0.22 0.23 4 - 19 12 0.97 0.94 0.92 0.94 0.94 0.01 0.01 0.01 0.01 0.01 104 Table 31 : NPC results for H4.5 Q CCR Pr(tie) 4 - 1 4 0.81 0.76 0.77 0.70 0.73 0.03 0.08 0.08 0.09 0.16 4 - 2 5 0.81 0.79 0.77 0.70 0.75 0.18 0.18 0.26 0.23 0.10 4 - 3 5 0.81 0.78 0.79 0.72 0.80 0.03 0.16 0.16 0.27 0.04 4 - 4 5 0.80 0.80 0.80 0.72 0.79 0.01 0.03 0.03 0.14 0.14 4 - 5 5 0.80 0.76 0.75 0.73 0.73 0.02 0.09 0.10 0.03 0.30 4 - 6 6 0.88 0.86 0.83 0.75 0.76 0.01 0.02 0.09 0.09 0.11 4 - 7 6 0.81 0.84 0.83 0.81 0.78 0.02 0.08 0.08 0.11 0.04 4 - 8 6 0.82 0.80 0.80 0.78 0.85 0.00 0.02 0.02 0.02 0.04 4 - 9 6 0.80 0.77 0.77 0.73 0.79 0.02 0.09 0.09 0.04 0.16 4 - 10 8 0.80 0.79 0.80 0.71 0.80 0.33 0.34 0.34 0.46 0.34 4 - 11 9 0.87 0.87 0.86 0.78 0.80 0.18 0.18 0.19 0.33 0.33 4 - 12 9 0.81 0.87 0.87 0.84 0.82 0.32 0.18 0.17 0.19 0.31 4 - 13 9 0.81 0.81 0.80 0.78 0.87 0.33 0.31 0.32 0.32 0.19 4 - 14 9 0.81 0.80 0.81 0.73 0.85 0.33 0.33 0.34 0.42 0.20 4 - 15 10 0.87 0.88 0.86 0.77 0.81 0.22 0.22 0.23 0.37 0.33 4 - 16 10 0.81 0.87 0.86 0.84 0.81 0.34 0.23 0.22 0.26 0.33 4 - 17 10 0.81 0.80 0.81 0.79 0.87 0.33 0.33 0.33 0.36 0.23 4 - 18 10 0.81 0.80 0.80 0.73 0.85 0.33 0.35 0.34 0.44 0.23 4 - 19 12 0.94 0.94 0.94 0.91 0.94 0.00 0.01 0.01 0.01 0.01 105 Table 32 : NPC results for H5.1 Q CCR Pr(tie) 5 - 1 5 0.58 0.58 0.60 0.59 0.59 0.59 0.00 0.00 0.00 0.00 0.00 0.00 5 - 2 6 0.58 0.59 0.58 0.59 0.59 0.60 0.18 0.17 0.18 0.18 0.19 0.18 5 - 3 6 0.60 0.59 0.62 0.60 0.61 0.61 0.03 0.16 0.23 0.24 0.25 0.23 5 - 4 6 0.58 0.59 0.57 0.64 
0.62 0.64 0.00 0.02 0.16 0.29 0.30 0.29 5 - 5 6 0.60 0.60 0.59 0.58 0.66 0.66 0.00 0.00 0.02 0.13 0.34 0.33 5 - 6 6 0.59 0.58 0.59 0.59 0.56 0.69 0.00 0.00 0.00 0.02 0.12 0.35 5 - 7 7 0.63 0.62 0.63 0.64 0.65 0.63 0.00 0.00 0.00 0.00 0.00 0.00 5 - 8 7 0.60 0.62 0.69 0.70 0.68 0.69 0.01 0.08 0.02 0.02 0.02 0.02 5 - 9 7 0.58 0.60 0.60 0.76 0.75 0.76 0.00 0.02 0.07 0.03 0.03 0.03 5 - 10 7 0.59 0.58 0.59 0.60 0.81 0.81 0.00 0.00 0.01 0.06 0.04 0.04 5 - 11 7 0.58 0.57 0.59 0.59 0.61 0.88 0.00 0.00 0.00 0.01 0.06 0.06 5 - 12 10 0.59 0.59 0.60 0.59 0.59 0.60 0.63 0.63 0.63 0.63 0.63 0.63 5 - 13 11 0.65 0.64 0.64 0.63 0.64 0.64 0.54 0.56 0.55 0.54 0.55 0.55 5 - 14 11 0.59 0.62 0.69 0.68 0.69 0.69 0.63 0.55 0.44 0.46 0.45 0.45 5 - 15 11 0.58 0.60 0.64 0.76 0.75 0.76 0.64 0.63 0.53 0.33 0.34 0.33 5 - 16 11 0.60 0.59 0.60 0.62 0.81 0.81 0.64 0.62 0.62 0.56 0.21 0.21 5 - 17 11 0.58 0.59 0.59 0.59 0.63 0.89 0.64 0.64 0.63 0.63 0.54 0.05 5 - 18 12 0.63 0.64 0.63 0.64 0.64 0.63 0.58 0.57 0.56 0.57 0.58 0.57 5 - 19 12 0.59 0.63 0.69 0.69 0.68 0.68 0.64 0.57 0.49 0.50 0.51 0.51 5 - 20 12 0.59 0.60 0.64 0.75 0.76 0.75 0.63 0.63 0.56 0.40 0.40 0.40 5 - 21 12 0.59 0.59 0.58 0.63 0.83 0.83 0.63 0.62 0.63 0.57 0.29 0.28 5 - 22 12 0.58 0.61 0.60 0.59 0.62 0.89 0.64 0.61 0.63 0.63 0.57 0.17 5 - 23 15 0.87 0.88 0.87 0.86 0.87 0.88 0.00 0.00 0.00 0.00 0.00 0.00 106 Table 33 : NPC results for H5.2 Q CCR Pr(tie) 5 - 1 5 0.85 0.76 0.72 0.73 0.76 0.85 0.10 0.10 0.17 0.16 0.10 0.11 5 - 2 6 0.89 0.77 0.75 0.72 0.76 0.84 0.17 0.24 0.11 0.17 0.09 0.10 5 - 3 6 0.87 0.79 0.77 0.76 0.76 0.85 0.04 0.16 0.22 0.11 0.10 0.09 5 - 4 6 0.84 0.79 0.75 0.76 0.78 0.85 0.11 0.04 0.22 0.22 0.04 0.10 5 - 5 6 0.85 0.76 0.76 0.76 0.78 0.88 0.09 0.10 0.12 0.23 0.16 0.04 5 - 6 6 0.84 0.76 0.72 0.76 0.76 0.89 0.11 0.09 0.17 0.11 0.25 0.17 5 - 7 7 0.96 0.82 0.77 0.74 0.77 0.84 0.03 0.11 0.11 0.16 0.09 0.10 5 - 8 7 0.89 0.86 0.82 0.76 0.76 0.85 0.03 0.02 0.10 0.11 0.09 0.09 5 - 9 7 0.84 0.80 0.82 0.82 0.80 0.85 0.09 0.03 0.12 0.11 0.03 0.09 5 - 10 7 0.85 0.77 0.76 0.83 0.86 0.89 0.09 0.09 0.11 0.11 0.02 0.02 5 - 11 7 0.85 0.76 0.73 0.77 0.83 0.96 0.09 0.09 0.16 0.10 0.09 0.03 5 - 12 10 0.89 0.79 0.79 0.80 0.80 0.89 0.19 0.34 0.35 0.35 0.33 0.19 5 - 13 11 0.97 0.87 0.80 0.79 0.81 0.89 0.01 0.18 0.33 0.33 0.32 0.19 5 - 14 11 0.89 0.87 0.86 0.80 0.80 0.89 0.17 0.19 0.19 0.33 0.34 0.20 5 - 15 11 0.89 0.82 0.87 0.86 0.81 0.88 0.18 0.32 0.19 0.19 0.32 0.20 5 - 16 11 0.88 0.81 0.80 0.86 0.87 0.89 0.20 0.32 0.32 0.19 0.18 0.17 5 - 17 11 0.89 0.80 0.79 0.80 0.86 0.97 0.19 0.33 0.34 0.33 0.19 0.01 5 - 18 12 0.97 0.86 0.80 0.79 0.80 0.89 0.05 0.23 0.33 0.36 0.33 0.19 5 - 19 12 0.90 0.86 0.86 0.81 0.80 0.89 0.19 0.23 0.23 0.33 0.32 0.19 5 - 20 12 0.89 0.81 0.86 0.87 0.80 0.89 0.18 0.33 0.24 0.22 0.35 0.19 5 - 21 12 0.88 0.80 0.79 0.86 0.88 0.90 0.20 0.33 0.34 0.23 0.22 0.18 5 - 22 12 0.89 0.79 0.80 0.79 0.86 0.97 0.18 0.34 0.33 0.34 0.23 0.05 5 - 23 15 0.97 0.95 0.93 0.94 0.94 0.97 0.01 0.01 0.01 0.02 0.01 0.01 107 Table 34 : NPC results for H5.3 Q CCR Pr(tie) 5 - 1 5 0.84 0.68 0.62 0.70 0.73 0.73 0.10 0.20 0.09 0.07 0.02 0.01 5 - 2 6 0.86 0.69 0.64 0.69 0.73 0.73 0.18 0.34 0.04 0.09 0.02 0.00 5 - 3 6 0.87 0.77 0.65 0.71 0.72 0.71 0.06 0.15 0.13 0.03 0.01 0.00 5 - 4 6 0.84 0.72 0.62 0.72 0.74 0.74 0.11 0.16 0.23 0.17 0.17 0.17 5 - 5 6 0.83 0.68 0.63 0.68 0.75 0.76 0.11 0.22 0.11 0.23 0.25 0.24 5 - 6 6 0.84 0.68 0.62 0.69 0.71 0.78 0.11 0.21 0.09 0.11 0.17 0.31 5 - 7 7 0.95 0.74 0.64 0.70 0.73 0.72 0.03 0.22 0.04 0.07 0.02 0.00 5 - 8 7 0.88 0.84 0.70 0.72 
0.72 0.72 0.03 0.06 0.02 0.02 0.01 0.00 5 - 9 7 0.84 0.71 0.67 0.79 0.79 0.79 0.10 0.18 0.08 0.02 0.01 0.00 5 - 10 7 0.83 0.68 0.62 0.72 0.86 0.86 0.11 0.22 0.10 0.14 0.02 0.02 5 - 11 7 0.83 0.69 0.62 0.69 0.74 0.93 0.12 0.22 0.10 0.09 0.09 0.03 5 - 12 10 0.88 0.80 0.64 0.72 0.73 0.73 0.19 0.34 0.55 0.45 0.45 0.44 5 - 13 11 0.97 0.84 0.66 0.71 0.73 0.73 0.01 0.20 0.53 0.46 0.45 0.45 5 - 14 11 0.90 0.87 0.70 0.73 0.72 0.74 0.18 0.18 0.45 0.45 0.46 0.43 5 - 15 11 0.89 0.78 0.70 0.78 0.79 0.79 0.18 0.34 0.46 0.34 0.33 0.33 5 - 16 11 0.89 0.77 0.66 0.77 0.86 0.85 0.19 0.36 0.55 0.34 0.19 0.19 5 - 17 11 0.88 0.78 0.66 0.71 0.78 0.93 0.20 0.35 0.54 0.46 0.31 0.02 5 - 18 12 0.97 0.85 0.65 0.72 0.72 0.73 0.05 0.24 0.55 0.47 0.46 0.45 5 - 19 12 0.90 0.87 0.71 0.73 0.73 0.73 0.18 0.23 0.47 0.46 0.44 0.45 5 - 20 12 0.89 0.78 0.69 0.78 0.79 0.80 0.19 0.35 0.49 0.37 0.37 0.35 5 - 21 12 0.88 0.78 0.67 0.77 0.85 0.85 0.19 0.35 0.54 0.38 0.25 0.24 5 - 22 12 0.88 0.79 0.66 0.72 0.78 0.93 0.19 0.34 0.54 0.46 0.37 0.11 5 - 23 15 0.97 0.93 0.89 0.91 0.91 0.92 0.00 0.02 0.01 0.01 0.00 0.00 108 Table 35 : NPC results for H5.4 Q CCR Pr(tie) 5 - 1 5 0.80 0.66 0.61 0.66 0.69 0.74 0.17 0.15 0.09 0.03 0.10 0.02 5 - 2 6 0.86 0.66 0.64 0.64 0.70 0.72 0.17 0.27 0.04 0.03 0.08 0.02 5 - 3 6 0.83 0.72 0.65 0.66 0.72 0.73 0.12 0.15 0.14 0.14 0.02 0.01 5 - 4 6 0.80 0.66 0.61 0.66 0.71 0.74 0.18 0.17 0.24 0.23 0.19 0.19 5 - 5 6 0.79 0.65 0.62 0.64 0.75 0.74 0.17 0.16 0.11 0.16 0.23 0.25 5 - 6 6 0.80 0.65 0.63 0.66 0.66 0.79 0.17 0.16 0.08 0.05 0.23 0.29 5 - 7 7 0.94 0.70 0.65 0.66 0.69 0.73 0.05 0.16 0.03 0.02 0.08 0.02 5 - 8 7 0.83 0.76 0.70 0.70 0.72 0.72 0.12 0.05 0.03 0.01 0.02 0.00 5 - 9 7 0.80 0.64 0.64 0.76 0.78 0.78 0.18 0.17 0.14 0.02 0.02 0.01 5 - 10 7 0.79 0.66 0.61 0.68 0.86 0.85 0.18 0.16 0.10 0.10 0.03 0.02 5 - 11 7 0.80 0.65 0.63 0.65 0.70 0.93 0.17 0.16 0.08 0.05 0.14 0.04 5 - 12 10 0.88 0.72 0.65 0.66 0.73 0.72 0.19 0.46 0.56 0.54 0.44 0.45 5 - 13 11 0.97 0.77 0.67 0.67 0.71 0.73 0.01 0.33 0.53 0.54 0.46 0.44 5 - 14 11 0.89 0.79 0.72 0.71 0.72 0.73 0.18 0.33 0.44 0.45 0.44 0.45 5 - 15 11 0.87 0.71 0.71 0.77 0.78 0.79 0.21 0.45 0.44 0.32 0.32 0.32 5 - 16 11 0.88 0.72 0.66 0.71 0.85 0.85 0.19 0.45 0.55 0.43 0.20 0.19 5 - 17 11 0.87 0.71 0.65 0.66 0.77 0.92 0.20 0.46 0.53 0.54 0.33 0.02 5 - 18 12 0.97 0.77 0.66 0.66 0.72 0.73 0.06 0.37 0.54 0.54 0.46 0.44 5 - 19 12 0.89 0.78 0.69 0.71 0.72 0.73 0.19 0.37 0.49 0.48 0.45 0.45 5 - 20 12 0.87 0.73 0.69 0.78 0.79 0.79 0.21 0.44 0.48 0.36 0.36 0.36 5 - 21 12 0.88 0.72 0.65 0.72 0.86 0.86 0.20 0.45 0.54 0.46 0.25 0.24 5 - 22 12 0.87 0.71 0.65 0.67 0.77 0.93 0.20 0.46 0.55 0.53 0.37 0.11 5 - 23 15 0.96 0.91 0.89 0.90 0.92 0.93 0.01 0.02 0.01 0.00 0.01 0.00 109 Table 36 : NPC results for H5.5 Q CCR Pr(tie) 5 - 1 5 0.73 0.73 0.70 0.62 0.68 0.83 0.00 0.02 0.08 0.09 0.20 0.11 5 - 2 6 0.73 0.73 0.69 0.63 0.72 0.84 0.19 0.19 0.24 0.24 0.18 0.11 5 - 3 6 0.73 0.71 0.71 0.65 0.76 0.84 0.03 0.16 0.30 0.30 0.12 0.11 5 - 4 6 0.72 0.73 0.69 0.67 0.78 0.83 0.01 0.03 0.15 0.33 0.06 0.11 5 - 5 6 0.73 0.73 0.72 0.66 0.78 0.87 0.00 0.01 0.03 0.13 0.15 0.06 5 - 6 6 0.71 0.73 0.69 0.64 0.70 0.88 0.00 0.02 0.08 0.04 0.32 0.17 5 - 7 7 0.79 0.78 0.74 0.67 0.72 0.84 0.00 0.03 0.08 0.09 0.17 0.11 5 - 8 7 0.73 0.75 0.81 0.74 0.74 0.84 0.02 0.07 0.10 0.10 0.12 0.10 5 - 9 7 0.72 0.72 0.73 0.79 0.78 0.85 0.00 0.02 0.08 0.11 0.06 0.09 5 - 10 7 0.73 0.73 0.72 0.69 0.85 0.88 0.00 0.01 0.02 0.03 0.05 0.03 5 - 11 7 0.73 0.73 0.69 0.64 0.74 0.95 0.00 0.02 0.08 0.04 0.22 0.03 5 - 12 10 0.73 0.74 0.71 
0.65 0.79 0.88 0.44 0.44 0.46 0.55 0.34 0.20 5 - 13 11 0.79 0.79 0.77 0.70 0.79 0.89 0.32 0.32 0.34 0.45 0.34 0.19 5 - 14 11 0.74 0.79 0.84 0.76 0.79 0.89 0.44 0.32 0.19 0.33 0.32 0.19 5 - 15 11 0.73 0.73 0.78 0.82 0.82 0.89 0.44 0.44 0.32 0.19 0.31 0.18 5 - 16 11 0.73 0.72 0.72 0.71 0.87 0.89 0.44 0.44 0.44 0.44 0.19 0.18 5 - 17 11 0.73 0.74 0.73 0.66 0.85 0.97 0.44 0.44 0.44 0.54 0.20 0.01 5 - 18 12 0.79 0.79 0.77 0.70 0.80 0.88 0.35 0.35 0.37 0.46 0.32 0.20 5 - 19 12 0.73 0.79 0.85 0.77 0.80 0.89 0.44 0.36 0.26 0.37 0.33 0.18 5 - 20 12 0.73 0.73 0.77 0.82 0.80 0.89 0.45 0.45 0.36 0.28 0.34 0.19 5 - 21 12 0.73 0.72 0.73 0.70 0.86 0.90 0.44 0.45 0.45 0.48 0.22 0.18 5 - 22 12 0.73 0.72 0.72 0.67 0.85 0.97 0.45 0.45 0.46 0.53 0.25 0.05 5 - 23 15 0.92 0.93 0.92 0.89 0.93 0.97 0.00 0.00 0.01 0.01 0.02 0.01 110 Table 37 : NPC results for H5.6 Q CCR Pr(tie) 5 - 1 5 0.72 0.69 0.65 0.69 0.65 0.79 0.02 0.08 0.03 0.23 0.17 0.17 5 - 2 6 0.73 0.69 0.65 0.73 0.67 0.79 0.18 0.25 0.20 0.17 0.11 0.18 5 - 3 6 0.72 0.71 0.68 0.75 0.71 0.79 0.03 0.15 0.23 0.12 0.05 0.18 5 - 4 6 0.73 0.72 0.66 0.75 0.71 0.83 0.01 0.02 0.15 0.22 0.15 0.13 5 - 5 6 0.73 0.69 0.66 0.71 0.75 0.86 0.02 0.08 0.04 0.26 0.20 0.06 5 - 6 6 0.73 0.69 0.65 0.73 0.66 0.86 0.02 0.08 0.03 0.17 0.28 0.16 5 - 7 7 0.79 0.76 0.71 0.73 0.69 0.80 0.01 0.09 0.02 0.18 0.11 0.17 5 - 8 7 0.73 0.75 0.78 0.75 0.70 0.81 0.02 0.09 0.03 0.12 0.05 0.17 5 - 9 7 0.74 0.72 0.72 0.82 0.78 0.84 0.01 0.03 0.01 0.13 0.05 0.12 5 - 10 7 0.72 0.70 0.66 0.74 0.84 0.87 0.01 0.07 0.03 0.22 0.05 0.04 5 - 11 7 0.73 0.69 0.65 0.72 0.69 0.94 0.02 0.08 0.02 0.18 0.17 0.05 5 - 12 10 0.72 0.72 0.66 0.78 0.71 0.87 0.46 0.46 0.53 0.34 0.47 0.20 5 - 13 11 0.79 0.78 0.70 0.80 0.72 0.88 0.33 0.33 0.46 0.34 0.45 0.20 5 - 14 11 0.73 0.77 0.77 0.80 0.73 0.88 0.45 0.33 0.32 0.32 0.44 0.20 5 - 15 11 0.74 0.71 0.71 0.85 0.79 0.89 0.44 0.45 0.45 0.21 0.33 0.18 5 - 16 11 0.72 0.72 0.66 0.84 0.85 0.89 0.46 0.46 0.54 0.20 0.17 0.18 5 - 17 11 0.73 0.72 0.65 0.80 0.77 0.97 0.46 0.47 0.55 0.34 0.33 0.01 5 - 18 12 0.79 0.76 0.71 0.79 0.72 0.87 0.35 0.38 0.46 0.34 0.44 0.21 5 - 19 12 0.74 0.77 0.76 0.79 0.72 0.89 0.44 0.38 0.38 0.35 0.47 0.19 5 - 20 12 0.73 0.72 0.72 0.85 0.80 0.88 0.46 0.46 0.47 0.24 0.34 0.20 5 - 21 12 0.73 0.73 0.66 0.85 0.86 0.89 0.44 0.44 0.54 0.23 0.25 0.19 5 - 22 12 0.73 0.72 0.67 0.78 0.76 0.97 0.43 0.46 0.54 0.35 0.37 0.06 5 - 23 15 0.92 0.91 0.90 0.93 0.91 0.96 0.00 0.01 0.00 0.02 0.01 0.02 111 5. 5 Discussion Nonparametric classifications could play an important role in formative classroom assessment. Tests developed by the teachers constitute a large part of classroom assessments. With the guidance of psychometric theory, teachers may be able to extract more f ormative feedback. Nonparametric classifications based on CDMs offer solutions to both test construction and result interpretations. The teachers may develop the items under the guidance of CDM - based assessment (Rupp et al., 2010) . However, it is not likel y to collect enough response data in the classroom setting for model estimation (including calibration and classification) . Besides, there are concerns about the invariance properties of model parameters. In response to these limitations, r esearchers have proposed different nonparametric classification methods to produce student results without having to estimat e item parameters ( Chiu & Douglas, 2013; Chiu, Sun, & Bian, 2018 ; Wang & Douglas, 2015 ) . This study adds to the literature by providing insight s into how to construct such a test. 
Q-matrix design is at the center of test construction for both parametric and nonparametric CDM-based tests. Test construction involves practical questions, including how long the test should be and how many items are needed of each type. Note that the discussion in Chapter 3 about equivalent q-vectors and different types of Q-matrices also applies to the nonparametric situation. Generally, Q-matrix designs that work well for MLE classifications also work well for nonparametric classifications; ties in the Hamming distance are the counterpart of equal or similar likelihoods between attribute profiles.

The simulation study compared Q-matrix designs with K to 3K items, where K is the number of attributes. Longer tests were not considered because the intended setting is teacher-developed classroom assessment. It is important to include the single-attribute items for nonparametric classifications. Adding an odd number of multiple-attribute items can increase the CCR for a subset of attribute profiles, whereas adding an odd number of single-attribute items leads to an increased CCR for every attribute profile. It is recommended that a Q-matrix contain an odd number of items of each q-vector. A test with an even number of items of a certain q-vector is generally not substantially better than a test with one fewer item of that q-vector, especially when the item quality is homogeneous. An important implication for teachers is that more items do not necessarily mean more accurate classifications. A single-attribute item is generally more useful than a multiple-attribute one. However, if the classification of certain attribute profiles is of particular interest, then including the corresponding multiple-attribute item in the Q-matrix becomes meaningful in terms of the CCR.

A classroom assessment network can be built in which teachers develop their own items based on CDMs, with the q-vectors and the corresponding curriculum identified. Such items can be collected from teachers to form various item pools, which can later be used for CD-CAT or nonparametric CD-CAT. Finally, this study assumes the DINA model as the underlying CDM. Future research could explore different Q-matrix designs for NPC with other underlying CDMs.

Chapter 6 Item pool design for CD-CAT

6.1 Introduction

Item pool design is an important but often neglected area for CD-CAT. Since item pool design for CD-CAT has not been addressed in the literature, we draw from studies on item pool design for CAT based on IRT models (e.g., Reckase, 2010; Thissen, Reeve, Bjorner, & Chang, 2007; Veldkamp & van der Linden, 2000). The findings for IRT-based CAT are informative because CD-CATs pose essentially the same sequential optimization problem, using CDMs instead of IRT models as the item response model. However, the categorical nature of the latent constructs in CDMs means that new studies are needed for the CD-CAT context. In addition, CD-CAT has different priorities from those of IRT-based CAT. Classroom formative assessments are generally low-stakes tests, so test security issues are not of primary concern; it is acceptable for tests to overlap between students. What matters more is assigning new items to a student each time he or she takes a test during the instructional period. Therefore, different requirements are imposed on item pool design for classroom formative assessments than for high-stakes standardized tests. When a series of formative assessments is needed to support learning, multiple item pools should be constructed.
For example, each unit addresses different attributes, so a new item pool may be needed to support the formative assessment for each unit. Considering the large number of item pools required for one school year and the high cost of item development, it is important to know the minimal size of an item pool that satisfies the purposes of a test. This study proposes an item pool design method for CD-CAT so that the item pool can fully support a test. The proposed method is applied to explore the number of items and item types needed in an item pool for classroom formative assessments under various conditions, and the resulting item pools are evaluated in terms of their performance with a CD-CAT algorithm.

6.2 Method for CD-CAT item pool design

The proposed method for item pool design borrows ideas from Veldkamp and van der Linden (2000) and Reckase (2010) for the item pool design of IRT-based CAT. The core of the method is computer simulation.

6.2.1 The minimum optimal pool

The minimum optimal pool is defined as the smallest item pool that can provide the ideal item at each item-selection step, given the CD-CAT algorithm and the test constraints. The potential item pool in the case of IRT-based CAT has an infinite number of items. A CDM-based item pool, however, has a limited number of item types defined by the q-vectors. For example, an item pool for three independent attributes (H3.1) can have seven item types. For the three attribute hierarchies H3.2, H3.3, and H3.4 there are three, four, and four item types, respectively, under the DINA model, as listed in Tables 7-9. Items within an item type differ only in their item parameters. The output of the item pool design process is the number of items needed for each item type.

In the item writing process it is difficult, if not impossible, to control the level of the item parameters, especially for complicated item response models. Therefore, we start from an idealized situation in item pool design, assuming all items are of equally high or equally low quality; a high-quality condition and a low-quality condition together yield a range of item numbers. The proposed method can be used with any CD-CAT algorithm and any test requirements.

Below is a brief illustration of the proposed method applied to a variable-length CD-CAT. Suppose an examinee with a given true attribute profile is taking a CAT measuring three linear attributes, and the items are calibrated with the DINA model. We further assume that, for all items, the probability of a correct response for examinees who have mastered none of the required attributes lies in one fixed interval and the probability of a correct response for examinees who have mastered all of the required attributes lies in another. The first item is fixed to a particular q-vector. A simulation of the CAT process, using the KL algorithm to select items, administers items until the desired accuracy level is achieved, that is, until the largest posterior probability among the candidate attribute profiles reaches the criterion. The items administered to this examinee are summarized by item type in Table 38. Suppose another examinee with a different true attribute profile takes the test; the items used are also summarized by item type in Table 38. Since the two examinees can use the same items, a union of the two sets of items yields an item pool sufficient for two such examinees. In other words, the maximum number of items from each item type across the examinees constitutes the number of items required for two such examinees. If a third examinee is simulated, the union (maximization) is taken between that examinee's set of items and the union obtained earlier.

Table 38: Item distribution for two hypothetical examinees with different true attribute profiles and the union of the two sets of items

Item type   Examinee 1   Examinee 2   Union/Maximum
(type 1)    2            1            2
(type 2)    0            4            4
(type 3)    0            3            3
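To make the bookkeeping concrete, here is a minimal Python sketch (not the simulation code used in this study) of the union-or-maximum step, using the hypothetical per-examinee counts from Table 38; the item-type labels are placeholders for the q-vectors that define the item types. For the minimum p-optimal pool described below, the maximum would be replaced by the pth percentile of each item type's count distribution across many simulated examinees.

```python
from collections import Counter

def union_of_pools(per_examinee_counts):
    """Element-wise maximum of item-type counts across simulated examinees:
    the smallest pool that could have supplied every one of the simulated tests."""
    pool = Counter()
    for counts in per_examinee_counts:
        for item_type, n in counts.items():
            pool[item_type] = max(pool[item_type], n)
    return dict(pool)

# Item usage by item type for the two hypothetical examinees in Table 38
# ("type 1" etc. are placeholders for the q-vectors defining the item types)
examinee_1 = {"type 1": 2, "type 2": 0, "type 3": 0}
examinee_2 = {"type 1": 1, "type 2": 4, "type 3": 3}

print(union_of_pools([examinee_1, examinee_2]))
# -> {'type 1': 2, 'type 2': 4, 'type 3': 3}, the Union/Maximum column of Table 38
```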
6.2.2 The minimum p-optimal pool

After the test is administered to more examinees, the maximum number of items selected from each item type among all examinees eventually becomes stable, except for a few outliers for whom the test is extremely long. Suppose an item pool is designed for measuring three linear attributes with a given CD-CAT algorithm, and suppose further that all candidate items are of low quality (large guessing and slipping parameters). Simulating 1,000 examinees per attribute profile produces a distribution of the number of items used for each item type. The distribution for one item type is shown in Figure 26. In an extreme case an examinee used 44 items of this type, but 95% of the simulated examinees needed 12 such items or fewer. The maximum numbers of items for the other two item types were 54 and 44, respectively. Therefore, the minimum optimal pool as defined earlier would consist of 44, 54, and 44 items of the three item types. However, considering the need to construct a large number of item pools and the high cost of item development, such an optimal item pool is impractical. If we instead take the pth percentile of each distribution rather than the maximum, the size of the item pool becomes substantially smaller. Such an item pool is called the minimum p-optimal pool.

Figure 26: Distribution of the number of items of one item type in an example

6.3 Simulation design

Two sets of simulations were conducted. The first set applies the proposed item pool design method to construct minimum 95-optimal pools; the second set evaluates the performance of the resulting item pools. We consider item pools involving three attributes, using the attribute hierarchies in Figure 6. Item pools are designed for the following variable-length CD-CAT. All items are calibrated with the DINA model. Following the termination rule in Hsu et al. (2013), the variable-length test is terminated at the stage at which the largest posterior probability among the candidate attribute profiles is greater than or equal to 0.90. The item selection criterion is the posterior-weighted KL index (PWKL) proposed by Cheng (2009); PWKL was chosen because of its popularity and high attribute profile recovery rate (Xu, Wang, & Shang, 2016). The first item in a test was randomly selected from the subset of q-vectors listed for each attribute hierarchy in Table 39.

Table 39: Q-vectors for the first item

Hierarchy   First item
H3.1        (three candidate q-vectors)
H3.2        (one candidate q-vector)
H3.3        (one candidate q-vector)
H3.4        (two candidate q-vectors)

In the simulations for item pool design, item quality was held constant within an item pool, and two item quality levels were simulated; a high-quality pool has smaller guessing and slipping parameters than a low-quality pool. The minimum 95-optimal pools were constructed for both item quality levels. For both sets of simulations, a total of 1,000 examinees were simulated for each true attribute profile, and the CD-CAT algorithm described above was applied to each simulated examinee. Item responses were generated under the DINA model: for each administered item, a uniform random number was drawn and compared with the examinee's correct-response probability on that item to decide whether the response of the examinee to the item is correct.
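As a minimal illustration of this response-generation step (a generic sketch with hypothetical parameter values, not the simulation code used here), the DINA correct-response probability is 1 - s for an examinee who has mastered all required attributes and g otherwise, and the response is obtained by comparing that probability with a U(0,1) draw:

```python
import numpy as np

rng = np.random.default_rng(2019)

def dina_response(profile, q_vector, guess, slip):
    """Generate one DINA item response: an examinee who has mastered every
    required attribute succeeds with probability 1 - slip, anyone else with
    probability guess; the response is 1 if a U(0,1) draw falls below that value."""
    mastered_all = np.all(profile >= q_vector)
    p_correct = (1.0 - slip) if mastered_all else guess
    return int(rng.uniform() < p_correct)

# Hypothetical example: a three-attribute profile and a two-attribute item
alpha = np.array([1, 1, 0])
q = np.array([1, 1, 0])
print(dina_response(alpha, q, guess=0.2, slip=0.2))
```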
To evaluate the performance of the item pool design method, we constructed ten minimum 95-optimal pools for each hierarchy, assuming low item quality. Under each attribute hierarchy, the ten designed item pools were compared with ten random item pools in terms of test length, the percentage of times the precision criterion was met, and the CCR. The random item pools have the same size as the corresponding designed pools, but their Q-matrices were randomly selected from all available q-vectors. For both designed and random item pools, the guessing and slipping parameters were generated from uniform distributions.

6.4 Simulation results

The number of items needed for the minimum 95-optimal pools is shown in Table 40 for the two item quality levels; the Total column gives the size of each pool. The first row of Table 40 describes the item pool designed for three independent attributes (H3.1) assuming low item quality; for example, fifteen items of one item type are required. The second row shows that only four items of that item type are required if item quality is high. To test the performance of the proposed item pool design method, the designed item pools were compared with the random pools, and the statistics are summarized in Table 41. The pools designed for low item quality were used in this comparison because the item parameters for this set of simulations were generated from a uniform distribution with the low item quality as a lower bound.

Table 40: The minimum 95-optimal pools

Item quality   H     Items per item type            Total
Low            3.1   15  15  15  10  10  10   9     84
High           3.1    4   4   4   2   2   2   2     20
Low            3.2   12  18  16                      46
High           3.2    4   4   4                      12
Low            3.3   13  16  17  10                  56
High           3.3    4   4   4   2                  14
Low            3.4   15  15  11  14                  55
High           3.4    4   4   2   4                  14

Table 41: Comparison between the random and designed item pools

Pool       H     Test length   Modified test length   % criterion met   CCR (by attribute profile)
Random     3.1   12.05         9.60                    96.65            0.88 0.91 0.91 0.91 0.91 0.91
Designed   3.1    9.92         9.24                    99.10            0.90 0.92 0.89 0.91 0.93 0.92
Random     3.2    6.40         5.96                    98.89            0.95 0.92 0.92 0.92
Designed   3.2    6.27         5.87                    99.01            0.94 0.91 0.91 0.91
Random     3.3    8.06         7.03                    97.88            0.96 0.89 0.92 0.91 0.92
Designed   3.3    7.52         7.07                    99.11            0.94 0.92 0.90 0.92 0.91
Random     3.4    8.02         7.11                    98.07            0.91 0.92 0.92 0.91 0.91
Designed   3.4    7.45         6.97                    99.00            0.93 0.92 0.93 0.90 0.91
Note: CCRs for two of the attribute profiles under H3.1 are not presented for brevity.

Take H3.1 as an example. The average test length using the random item pools was 12.05, longer than the average test length of 9.92 using the designed pools. The difference in test length is partly due to the percentage of times the precision criterion was met. With random pools, the precision criterion was met in an average of 96.65% of the replications, which means 3.35% of the examinees would have had to take all the items in the pool; with the designed pools, the criterion was met in 99.10% of the cases on average. The modified test length was calculated by excluding the cases in which the precision criterion was never met; after excluding these extreme cases, the designed pools were still associated with slightly shorter tests than the random pools. With either random or designed item pools, the average CCR for each attribute profile was close to or higher than 0.90, the precision criterion. The same conclusion can be drawn for the other attribute hierarchies, except for H3.3, where the modified test length with the designed pools was not lower than that with the random pools.
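For readers who want to reproduce this kind of comparison, the following small Python sketch (hypothetical records, not the code used in this study) computes the summary statistics reported in Table 41; CCR is pooled over examinees here, whereas the table reports it separately by true attribute profile.

```python
import numpy as np

def summarize_runs(records):
    """Summarize simulated CD-CAT runs. Each record is a tuple
    (test_length, criterion_met, classified_correctly)."""
    lengths = np.array([r[0] for r in records], dtype=float)
    met = np.array([r[1] for r in records], dtype=bool)
    correct = np.array([r[2] for r in records], dtype=bool)
    return {
        "test length": lengths.mean(),
        "modified test length": lengths[met].mean(),  # drop runs that never met the criterion
        "% criterion met": 100.0 * met.mean(),
        "CCR": correct.mean(),
    }

# Hypothetical records for three simulated examinees
print(summarize_runs([(9, True, True), (12, True, True), (20, False, False)]))
```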
6.5 Discussion

An important practical question is how many items are needed in a CD-CAT item pool. This type of question belongs to the research area of item pool design. Although numerous item selection methods have been proposed, item pool design has received limited attention. This study aims to guide practitioners who are implementing CD-CAT. The proposed method for item pool design is based on simulation. As Reckase (2010) noted, there is no single correct answer to the question of how big a CAT item pool should be; the proposed method leads to an item pool designed for a specific CD-CAT program. The concept of the minimum optimal pool was introduced but deemed impractical, and the minimum p-optimal pool was defined as a practical item pool design for a formative assessment system. We then demonstrated the construction of minimum p-optimal pools for a variable-length CD-CAT with two item quality levels and four attribute hierarchies. With designed item pools, the precision criterion should be met with shorter tests than with random item pools, which was supported by the simulation results.

Future research may consider item pool design for fixed-length CD-CAT. Another situation worth exploring is one in which a student takes the test multiple times (M = 1, 2, 3, 4) during an instructional period (a couple of weeks) and new items must be administered each time. The value of p in the minimum p-optimal pool was 0.95 in this study, but other values could be used. Another variable that can be manipulated is item quality. We currently assume homogeneous item quality across item types, which is a common setting in simulation studies. However, it is possible that single-attribute and multiple-attribute items tend to have different levels of quality, or that items involving a certain attribute have lower or higher quality than others. Future research may take heterogeneous item quality into consideration, and practical evidence is needed regarding the item quality of different item types.

Most previous studies are built on item pools calibrated with a single CDM; this study used the DINA model. In practice, however, different items are likely to require different processes, which suggests that an item pool may be made up of items following various CDMs (Kaplan, de la Torre, & Barrada, 2015). Recent progress in item-level model selection indices provides a theoretical basis for such item pools (Liu, Andersson, Xin, Zhang, & Wang, 2018; Ma et al., 2015). Suppose multiple-attribute items calibrated with the ACDM are also included as candidate items. Item selection methods based on KL information, such as the PWKL index, would always prefer a single-attribute item to a multiple-attribute item under the ACDM. The current item pool design method would therefore produce an item pool without any ACDM-based multiple-attribute items. The optimal pool needs to be redefined when models are mixed.

APPENDIX

Hierarchies in Two Textbooks

Eureka Math Grade 4 (2015)
Unit 1 (4 weeks): 4.OA.A.1¹, 4.NBT.A.1, 4.NBT.A.2 >* 4.NBT.A.3 >* 4.NBT.B.4, 4.OA.A.3
Unit 2 (1 week): 4.MD.A.1, 4.MD.A.2 >* 4.OA.A.3
Unit 3 (8 weeks): 4.MD.A.2 >* 4.MD.A.3, 4.NBT.A.1, 4.NBT.B.5, 4.NBT.B.6, 4.OA.A.1, 4.OA.A.2, 4.OA.A.3, 4.OA.A.4
Unit 4 (3.3 weeks): 4.G.A.1, 4.G.A.2, 4.G.A.3, 4.MD.C.5, 4.MD.C.6, 4.MD.C.7
Unit 5 (8.4 weeks): 4.MD.A.2, 4.MD.B.4, 4.NBT.A.3 >* 4.NF.A.1, 4.NF.A.2, 4.NF.B.3, 4.NF.B.4, 4.OA.A.2, 4.OA.C.5²
Unit 6 (3.3 weeks): 3.NF.A.3, 4.NF.A.1, 4.NF.A.2, 4.NF.C.5, 4.NF.C.6, 4.NF.C.7, 4.MD.A.1, 4.MD.A.2 >* 4.NBT.A.1
Unit 7 (3.8 weeks): 3.NF.A.1, 4.NF.A.1, 4.NF.B.3, 3.OA.A.1, 3.OA.A.2, 4.OA.A.2, 4.OA.A.3, 4.OA.B.4, 4.MD.A.1, 4.MD.A.2 >* 4.MD.A.3³ >* 4.NBT.A.2, 4.NBT.B.4, 4.NBT.B.5, 4.NBT.B.6

¹ 4.OA.1 is not connected with any other Grade 4 standards in the Coherence Map.
² 4.OA.C.5 is not connected with any other Grade 4 standards in the Coherence Map.
³ 4.MD.A.3 is not connected with any other Grade 4 standards in the Coherence Map.

Engage NY Grade 4 (2014)

Unit 1 (4 days): 4.OA.A.1⁴, 4.NBT.A.1, 4.NBT.A.2
Unit 2 (2 days): 4.NBT.A.2
Unit 3 (4 days): 4.NBT.A.3
Unit 4 (2 days): 4.NBT.B.4, 4.OA.A.3
Unit 5 (4 days): 4.NBT.A.2, 4.NBT.B.4, 4.OA.A.3
Unit 6 (3 days): 4.NBT.A.1, 4.NBT.A.2, 4.NBT.B.4, 4.OA.A.3
Unit 7 (3 days): 4.MD.A.1, 4.MD.A.2
Unit 8 (2 days): 4.MD.A.1, 4.MD.A.2
Unit 9 (3 days): 4.MD.A.3, 4.OA.A.1, 4.OA.A.2, 4.NBT.B.5
Unit 10 (3 days): 4.NBT.B.5
Unit 11 (5 days): 4.NBT.B.5
Unit 12 (2 days): 4.NBT.B.5, 4.OA.A.1, 4.OA.A.2, 4.OA.A.3
Unit 13 (9 days): 4.NBT.B.6, 4.OA.A.3
Unit 14 (4 days): 4.OA.A.4
Unit 15 (9 days): 4.NBT.B.6, 4.OA.A.3, 4.NBT.B.4, 4.NBT.B.6, 4.NBT.A.1
Unit 16 (5 days): 4.NBT.B.5
Unit 17 (4 days): 4.G.A.1
Unit 18 (4 days): 4.MD.C.5, 4.MD.C.6
Unit 19 (3 days): 4.MD.C.7
Unit 20 (5 days): 4.G.A.1, 4.G.A.2, 4.G.A.3
Unit 21 (6 days): 3.NF.A.3, 4.NF.B.4
Unit 22 (5 days): 4.NF.A.1
Unit 23 (4 days): 4.NF.A.2
Unit 24 (6 days): 4.NF.B.3
Unit 25 (8 days): 4.NF.B.3, 4.NF.B.4, 4.NF.A.2, 4.MD.B.4
Unit 26 (6 days): 4.NF.B.3
Unit 27 (6 days): 4.NF.B.4, 4.OA.A.2, 4.MD.B.4
Unit 28 (1 day): 4.OA.C.5⁵
Unit 29 (3 days): 4.NF.C.6
Unit 30 (5 days): 4.NF.C.5, 4.NF.C.6
Unit 31 (3 days): 4.NF.C.7
Unit 32 (3 days): 4.NF.C.5, 4.NF.C.6
Unit 33 (2 days): 4.MD.A.2
Unit 34 (5 days): 4.MD.A.1, 4.OA.A.1, 4.MD.A.2
Unit 35 (3 days): 4.MD.A.2, 4.OA.A.2, 4.MD.A.1, 4.NBT.B.5, 4.NBT.B.6, 4.OA.A.3

⁴ 4.OA.1 is not connected with any other Grade 4 standards in the Coherence Map.
⁵ 4.OA.1 is not connected with any other Grade 4 standards in the Coherence Map.

REFERENCES

American Psychological Association, American Educational Research Association, & National Council on Measurement in Education. (1974). Standards for educational and psychological tests. Washington, DC: American Psychological Association.

Ayers, E., Nugent, R., & Dean, N. (2008). Skill set profile clustering based on student capability vectors computed from online tutoring data. In R. S. J. de Baker, T. Barnes, & J. E. Beck (Eds.), Educational data mining 2008: Proceedings of the 1st international conference on educational data mining, Montreal, Quebec, Canada (pp. 210-217). Retrieved from http://www.educationaldatamining.org/EDM2008/uploads/proc/full%20proceedings.pdf

Barnes, T. (2010). Novel derivation and application of skill matrices: The q-matrix method. In C. Romero, S. Ventura, M. Pechenizkiy, & R. S. J. de Baker (Eds.), Handbook of educational data mining (pp. 159-172). Boca Raton, FL: Chapman & Hall.

Beatty, I. D., & Gerace, W. J. (2009). Technology-Enhanced Formative Assessment: A Research-Based Pedagogy for Teaching Science with Classroom Response Technology.
Journal of Science Education and Technology, 18( 2), 146 - 162. Belov, D. I., & Armstrong, R. D. (2009). Direct and inverse problems of item pool design for computerized adaptive testing. Educational and Psychological Measurement, 69 (4), 533 - 547. Bennett, R. E. (2011). Formative assessment: a critical review. Assessment in Education: Principles, Policy & Practice , 18 (1), 5 - 25. Bennett, R. E. (2015). The Changing Nature of Educational Assessment. Review of Research in Education, 39 ( 1), 370 - 407. Brennan, R. L. (2006). Perspectives on the evolution and f uture of educational measurement. In R. Brennan (Ed.), Educational measurement (4th ed., pp. 1 - 16). Westport, CT: American Council on Education and Praeger. Black, P., & Wiliam, D. (1998). Assessment and classroom learning. Assessment in Education: Princip les, Policy & Practice, 5 (1), 7 74. Black, P., Wilson, M., & Yao, S. (2011). Road maps for learning: A guide to the navigation of learning progressions. Measurement: Interdisciplinary Research and Perspectives, 9 , 71 123. Bloom, B. S. (1968). Learning for Mastery. Instruction and Curriculum . Regional Education Laboratory for the Carolinas and Virginia, Topical Papers and Reprints, Number 1. Evaluation comment, 1(2), n2. 130 Bloom, B. S., Hastings, J. T., & Madaus, G. F. (1971 ). Handbook on formative and summative evaluation of student learning . New York: McGraw - Hill. Bock, R. D., Thissen, D., & Zimowski, M. F. (1997). IRT estimation of domain scores. Journal of Educational Measurement, 34 (3), 197 - 211. Brennan, R. L. (1981). Some statistical procedures for domain - referenced testing: a handbook for practitioners . Iowa City, Iowa : Research and Development Division, American College Testing Program . Retrieved from https://searchworks.stanford.edu/view/1312930 Campbell, C. (2013 ). Research on teacher competence in classroom assessment. In J.H. McMillan (Ed.), Sage handbook of research on classroom assessment (pp. 71 - 84) . SAGE, Los Angeles. Center f or K - 12 Assessment and Performance Management at ETS. (2014, March). Coming together to raise achievement: New assessments for the common core state standards . Retrieved from http://www.k12center.org Chang, H. - H. (2 012). Making computerized adaptive testing diagnostic tools for schools. In R. W. Lissitz & H. Jiao (Eds.), Computers and their impact on state assessment: Recent history and predictions for the future (pp. 195 - 226). Charlotte, NC: Information Age Publishi ng. Chang, H. H. (2015). Psychometri cs behind computerized adaptive testing. Psychometrika, 80 , 1 - 20. Chang, H. H., & Ying, Z. (1996). A global information approach to computerized adaptive testing. Applied Psychological Measurement, 20 (3), 213 - 229. Chen, Y., Li, X., Liu, J., & Ying, Z. (2018). Recommendation System for Adaptive Learning. Applied Psychological Measurement, 42 (1), 24 - 41. Cheng, Y. (2009). When cognitive diagnosis meets computerized adaptive testing: CD - CAT. Psychometrika, 74 (4), 619 - 632. Cheng, Y. (2010). Improving cognitive diagnostic computerized adaptive testing by balancing attribute coverage: The modified maximum global discrimination index method. Educational and Psychological Measurement, 70 (6), 902 - 913. Chiu, C.Y., & Köhn , H.F. (2015), Consistency of Cluster Analysis for Cognitive Diagnosis: The DINO Model and the DINA Model Revisited. Applied Psychological Measurement, 39 , 465 - 479. Chiu, C. Y., & Douglas, J. (2013). 
A Nonparametric Approach to Cognitive Diagnosis by Proxi mity to Ideal Response Patterns. Journal of Classification, 30 (2), 225 - 250. Chiu, C. - Y., Douglas, J. A., & Li, X. (2009). Cluster analysis for cognitive diagnosis: Theory and applica tions. Psychometrika, 74 , 633 - 665. 131 Chiu, C. Y., Sun, Y., & Bian, Y. (2018 ). Cognitive Diagnosis for Small Educational Programs: The General Nonparametric Classific ation Method. Psychometrika, 83 , 355 - 375. Clark, I. (2016). Formative assessment: assessment is for self - regulated learning . Educational Psychology Review, 24 (2), 20 5 - 249. Conley, T. D. (2018). The Promise and Practice of Next Generation Assessment . Cambridge, MA: Harvard Education Press. Copp, D. T. (2018). Teaching to the test: a mixed methods study of instructional change from large - scale testing in Canadian schoo ls. Assessment in Education: Principles, Policy & Practice, 25 (5), 468 - 487. de la Torre, J. (2011). The generalized DINA model framework. Psychometrika, 76 , 179 - 199. de La Torre, J., & Karelitz, T. M. (2009). Impact of diagnosticity on the adequacy of models for cognitive diagnosis under a linear attribute structure: A simulation study. Journal of Educational Measurement, 46 (4), 450 - 469. Ding, S. L., Luo, F., Cai, Y., Lin, H. J., & Wang, X. B. (2008). Compleme nt matrix theory. In K. Shigemasu , A. Okada, T. Imaizumi, & T. Hoshino (Eds .), New Trends in Psychometrics (pp. 417 - 423). Tokyo: Universal Academy. Embretson, S. E. (1995). Developments toward a cognitive design system for psychological tes ting. In D. Lupinsky & R. Dawis (Eds.), Assessing individual differences in human behavior (pp. 17 - 48). Palo Alto, CA: Davies - Black Publishing Embretson, S . E . (2003) . The Second Century of Ability Testing: Some Predictions and Speculations. Princeton, NJ: Educational Testing Service. Retrievable at http :// www.ets.org/Media/Research/pdf/PICANG7.pdf . Furtak, E. M., Circi, R., & Heredia, S. C. (2018). Exploring alignment among learning progres sions, teacher - designed formative assessment tasks, and student growth: Results of a four - year study. Applied Measurement in Education, 31 (2), 143 - 156. Fyfe, E. R., & Rittle - johnson, B. (2015). Feedback Both Helps and Hinders Learning: The Causal Role of Prior Knowledge Feedback. Journal of Educational Psychology, 108 (1), 82 - 97. A New Perspective on Gender Differences in Mathematical Sub - Comp etencies. Applied Measurement in Education, 31 (1), 79 - 97. - space model for test development and analysis. Educational Measurement: Issues and Practice, 19 , 34 - 44. Gierl, M. J., & Lai, H. (2012). The role o f item models in automatic item generation. International journal of testing, 12 (3), 273 - 298. 132 Gray, R. M. (2011). Entropy and information theory ( 6 th ed. ) . New York: Springer. Gorin, J. S., & Mislevy, R. J. (2013). Inherent Measurement Challenges in the Next Generation Science Standards for Both Formative and Summative Assessment. Invitational Assessment Symposium, (September), 2 - 39. Retrieved from http://citeseerx.ist.psu.edu/v iewdoc/download?doi=10.1.1.800.5350&rep=rep1&type=pdf Gotwals, A. W. (2018). Where are we now? Learning progressions and formative assessment. Applied Measurement in Education, 31( 2), 157 - 164. Haberman, S. J. (2008). When Can Subscores Have Value? Journal of Educational and Behavioral Statistics, 33 (2), 204 229. Haertel, E. H. (1989). Using restricted latent class models to map the skill structure of achievement items. 
Journal of Educational Measurement, 26 , 333 - 352. Hanna, G. S., & Dettmer, P. (2004). Assessment for effective teaching: Using context - adaptive planning. Boston: Pearson A and B. Harks, B., Klieme, E., Hartig, J., & Leiss, D. (2014). Separating Cognitive and Content Domains in Mathematical Competence. Edu cational Assessment, 19 (4), 243 - 266. Hattie, J., & Timperley, H. (2007). The power of feedback. Review of Educational Research, 77 , 81 - 112. Hefling, K. (January 7, 2015). Do students take too many tests? Congress to weigh question. Associated Press . Retri eved from http://www.pbs.org/newshour/rundown/congressdecide - testing - schools Henson, R., & Douglas, J. (2005). Test construction for cognitive diagnosis. Applied Psychological Measurement, 29 , 262 277. Henson, R., Roussos, L., Douglas, J., & He, X. (2008). Cognitive diagnostic attribute - level discrimination indices. Applied Psychological Measurement, 32 (4), 275 288. Henson, R., DiBello, L., & Stout, B. (2018). A Generalized Approach to Defining Item Discrimination for DCMs. Measurement: Interdisciplinary Re search and Perspectives, 16 (1), 18 - 29. Heritage, M. (2010). Formative assessment and next - generation assessment systems: Are we losing an opportunity? National Center for Research on Evaluation, Standards, and Student Testing (CRESST) and the Council of Ch ief State School Officers (CCSSO). CCSSO: Washington. Hively, W. (1974). Introduction to Domain - referenced Testing. In W. Hively (Ed.), Domain - referenced testing (pp. 16 - 30). Englewood Cliffs, N.J.: Education al Technology Publications . 133 Houang, R. T. (1980) . Estimation of parameters for a latent class model applied to the study of achievement test items (Unpublished doctoral dissertation) . University of California, Santa Barbara, CA. Junker, B. W., & Sijtsma, K. (2001). Cognitive assessment models with few a ssumptions, and connections with nonparametric item response theory. Applied Psychological Measurement, 25 , 258 - 272. Kaplan, M., de la Torre, J., & Barrada, J. R. (2015). New Item Selection Methods for Cognitive Diagnosis Computerized Adaptive Testing. Applied Psychological Measurement, 39 (3), 167 - 188. Kingsbury, C. G., & Zara, A. R. (1991). A Comparison of Procedures for Content - Sensitive Item Selection in Computerized Ada ptive Tests. Applied Measurement in Education, 4 (3), 241 - 261. Köhn, H. - F., & Chiu, C. - Y. (2018). How to Build a Complete Q - Matrix for a Cognitively Diagnostic Test. Journal of Classification, 35 (2), 273 - 299. Kuo, B. C., Pai, H. S., & de la Torre, J. (201 6). Modified Cognitive Diagnostic Index and Modified Attribute - Level Discrimination Index for Test Construction. Applied Psychological Measurement, 40 (5), 315 - 330. Leighton, J. P., Gierl, M. J., & Hunka, S. M. (2004). The Attribute Hierarchy Method for Cog nitive Assessment: A Variation on Tatsuoka s Rule - Space Approach. Journal of Educational Measurement, 41 (3), 205 - 237. Luecht, R. M. (2013). Test Specifications under Assessment Engineering. Journal of Applied Testing Technology, 14 , 1 - 38. Liu, J., Xu, G., & Ying, Z. (2012). Data - Driven Learning of Q - Matrix. Applied Psychological Measurement, 36 (7), 548 - 564. Liu, O. L., Frankel, L., & Roohr, K. C. (2014). Assessing critical thinking in higher education: Current state and directions for next - generation asses sment. RR - 14 - 10 . Princeton, NJ: Educational Testing Service. Liu, R., Huggins - Manley, A. C., & Bradshaw, L. (2017). 
The Impact of Q - Matrix Designs on Diagnostic Classification Accuracy in the Presence of Attribute Hierarchies. Educational and Psychological Measurement, 77 (2), 220 - 240. Liu, Y., Andersson, B., Xin, T., Zhang, H., & Wang, L. (2018). Improved Wald Statistics for Item - Level Model Comparison in Diagnostic Classification Models. Applied Psychological Measurement . https://doi.org/10.1177/0146621618798664 Ma, W., Iaconangelo, C., & de la Torre, J. (2015). Model Similarity, Model Selection, and Attribute Classification. Applied Psychological Measurement, 40 (3), 20 0 - 217. 134 Macready, G. B., & Dayton, C. M. (1977). The use of probabilistic models in the assessment of mastery. Journal of Educational Statistics, 33 , 379 - 416. Mislevy, R. J. (2016). How Developments in Psychology and Technology Challenge Validity Argumentation. Journal of Educational Measurement, 53 (3), 265 - 292. Moreno, R. (2004). Decreasing cognitive load for novice students: Effects of explanatory versus corrective feedback in discovery - based multi media. Instructional Science, 32 , 99 113. Nitko, A.J. (2001). Educational assessment of students (3rd ed.). Upper Saddle River, NJ: Merrill. Pellegrino, J. W., Chudowsky, N., & Glaser, R. (2001). Knowing what students know: The science and design of educat ional assessment . Washington, DC: National Academy Press. Baker, F. B., & Kim, S. H. (2004). Item response theory: Parameter estimation techniques . CRC Press. Reckase, M. D. (2010). Designing Item Pools to Optimize the Functioning of a Computerized Adaptiv e Test. Psychological Test and Assessment Modeling, 52 (2), 127 - 141. Rupp, A., Templin, J., & Henson, R. (2010). Diagnostic measurement: Theory, methods, and applications . New York, NY: Guilford Press. Schmidt, W.H., & McKnight, C. C. (1995). Surveying educational opportunity in mathematics and science: An international perspective. Educational Evaluation and Policy Analysis, 17 (3), 337 - 353. Schmidt, W., Jorde, D., Cogan, L., Barrier, E., Gonzalo, I., Moser, U., Shimizu, K., Sawada, T., Valver de, G., McKnight, C., Prawat, R., Wiley, D., Raizen, S., Britton, E. & Wolfe, R. (1996). Characterizing pedagogical flow . Boston MA: Kluwer Academic Publishers. Schmidt, W.H., McKnight, C.C., Valverde, G.A., Houang, R. T., & Wil ey, D. E. (1996). Many visions , many aims: A cross - national investigation of curricular intentions in school mathematics . Boston: Kluwer Academic. Schmidt, W. H., McKnight, C. C., Valverde, G. A., Houang, R. T. and Wiley, D. E. (1997) . Many Visions, Many Aims: A Cross - National Investig ation of Curricular Intentions in School Mathematics (Dordrecht, The Netherlands: Kluwer). Schutz, P. A., & Pekrun, R. (Eds.). (2007). Emotion in education . Burlington, MA: Academic Press. Scriven, M. (1967). The methodology of evaluation. In R. W. Tyler, R. M. Gagné, & M. Scriven (Eds.), Perspectives of curriculum evaluation (Vol. 1, pp. 39 83). Chicago, IL: Rand McNally Applied Measurement in Education, 21 (4), 293 - 294. 135 Shepard, L. A. (2006). Classroom a ssessment. In R. L. Brennan (Ed.), Educational Measurement (4th ed., pp. 623 - 646). Westport, CT: ACE/Praeger. Shepard, L. A., Penuel, W. R., & Pellegrino, J. W. (2018). Classroom Assessment Principles to Support Learning and Avoid the Harms of Testing. Educational Measurement: Issues and Practice, 37 (1), 52 - 57. Swanson, L. & Stocking, M. L. (1993). A m odel and h euristic for s olving very large i tem s election p roblems . Applied Psychological Measurement, 17 , 151 - 166. Tatsuoka, K. K. 
(1983). Rule space: An a pproach for dealing with misconceptions based on item response theory. Journal of Educational Measurement, 20 , 345 - 354. Templin, J., & Bradshaw, L. (2014). Hierarchical Diagnostic Classification Models: A Family of Models for Estimating and Testing Attribu te Hierarchies. Psychometrika, 79 (2), 317 - 339. Thissen, D., Reeve, B. B., Bjorner, J. B., & Chang, C. H. (2007). Methodological issues for building item banks and computerized adaptive scales. Quality of Life Research, 16 (SUPPL. 1), 109 - 119. Tu, D., Wang, S., Cai, Y., Douglas, J., & Chang, H. ( 2018 ). Cognitive Diagnostic Models With Attribute Hierarchies: Model Estimation With a Restricted Q - Matrix Design. Applied Psychological Measurement. https://d oi.org/10.1177/0146621618765721 U.S. Department of Education. (2014). Secretary's final supplemental priorities and definitions for discretionary grant programs . Retrieved from https://www.fede ralregister.gov/articles/2014/ 12/10/2014 - 28911/secretarys - final - supplemental - priorities - and - definitions - for - discretionary - grant - programs #h - 28. U.S. Department of Education. (2015). Fact Sheet: Testing Action Plan , Washington, D.C. van Der Linden, W. J. (2005a). A Comparison of Item - Selection Methods for Adaptive Tests with Content Constraints. Journal of Educational Measurement, 42 (3), 283 - 302. van der Linden, W. J. (2005b). Linear models for optimal test design . New York: Springer. van der Linden, W. J., & Diao, Q. (2014). Using a universal shadow - test assembler with multistage testing. In D. Yan, A. A. von Davier, & C. Lewis (Eds.), Computerized multistage testing: Theory and applications (pp. 101 - 118). New York, NY: CRC Press. van der Linden, W. J., & Reese, L. (1998). A model for optimal constrained adaptive testing. Applied Psychological Measurement, 22 , 259 - 270. von Davier, M. (2005). A general diagnostic model applied to language testing data, ETS Research Report RR - 05 - 16 . Prin ceton, NJ: Educational Testing Service. Retrieved from http://www.ets. org/Media/Research/pdf/RR - 05 - 16.pdf 136 Walsh, B. (November 3, 2017). When Testing Takes Over: An expert's lens on the failure of high - stakes accountability tests and what we can do to ch ange course . Usable Knowledge. Retrieved from https://www.gse.harvard.edu/news/uk/17/11/when - testing - takes - over Wang, S., & Douglas, J. (2015). Consistency of nonparametric classification in cognitive diagnosis. Psychometrika, 80 (1), 85 - 100. Wang, W., Song , L., Ding, S., Meng, Y., Cao, C., & Jie, Y. (2018). An EM - Based Method for Q - Matrix Validation. Applied Psychological Measurement , 42(6), 446 459 . Way, WD., Steffen, M., & Anderson, G.S. (1998). Developing, maintaining, and renewing the item inventory to support computer - based testing . Paper presented at the colloquium, Computer - Based Testing: Building the Foundation of Ou r Future Assessments, Philadelphia, PA, September 25 - 26, 1998. Willse, J., Henson, R., & Templin, J. (2007). Using sum scor es or IRT in place of cognitive diagnosis models: Can existing or more familiar models do the job? Paper presented at the Annual Meeting of the National Council on Measurement in Education, Chicago, IL. Wilson, M. (2018). Making measurement important for e ducation: The crucial role of classroom assessment. Educational Measurement, 37 (1), 1 37. Xu, G., & Zhang, S. (2016). Identifiability of Diagnostic Classification Models. Psychometrika, 81 (3), 625 - 649. Zimba, J. (2011). 
Examples of structure in the Common mathematical content . Retrieved from http://commoncoretools.me/wpcontent/uploads/2011/07/ccssatlas_2011_07_06_0956_p 1 p2.pdf Zimba, J. (2015, October 29). Coherence Map . Retrieved from www.achievethecore.org/coherence - map