PSYCHOMETRIC TOOLS FOR FORMATIVE CLASSROOM ASSESSMENT: TEST CONSTRUCTION AND ITEM POOL DESIGN BASED ON COGNITIVE DIAGNOSTIC MODELS

By Jiahui Zhang

A DISSERTATION Submitted to Michigan State University in partial fulfillment of the requirements for the degree of Measurement and Quantitative Methods, Doctor of Philosophy, 2019

ABSTRACT

PSYCHOMETRIC TOOLS FOR FORMATIVE CLASSROOM ASSESSMENT: TEST CONSTRUCTION AND ITEM POOL DESIGN BASED ON COGNITIVE DIAGNOSTIC MODELS

By Jiahui Zhang

This thesis is concerned with the potential applications of cognitive diagnostic models (CDMs) with hierarchical attributes in supporting formative classroom assessments. The conventional CDM approach, which requires large sample sizes, is impractical in the classroom setting. There are three CDM-based approaches that do not involve item calibration and thus are practical in the classroom setting: 1) CDM classifications using non-adaptive tests assembled from a calibrated item pool, 2) nonparametric classifications using non-adaptive tests based on CDMs, and 3) computerized adaptive testing (CAT) combined with CDMs (i.e., CD-CAT). Since most CDMs and their applications assume independent attributes, relevant model parameterizations and the Q-matrix for hierarchical CDMs were discussed. Three studies were conducted to address the test construction and item pool design issues related to the three CDM-based approaches. Specifically, new indices based on the Kullback-Leibler information are proposed for non-adaptive test construction with a calibrated item pool. Different Q-matrix designs were explored for nonparametric classifications, and recommendations regarding the Q-matrix design were provided for teachers. For CD-CAT, an item pool design method based on simulation was proposed and evaluated. The intended contribution of the thesis consists of psychometric tools that help teachers facilitate formative assessments in the classroom and instrumental guidelines for developers of formative assessment systems.

Copyright by JIAHUI ZHANG 2019

To my grandpa

ACKNOWLEDGEMENTS

I would like to thank my advisor and chair of my dissertation committee, Dr. William Schmidt, for his guidance and support. His great insight into education has illuminated my graduate study and will continue to guide me in my future career. I would also like to thank Dr. Richard Houang, Dr. Tenko Raykov, and Dr. Amelia Gotwals for serving on my committee and offering constructive feedback on my proposal and dissertation draft. The idea for this dissertation was born and developed in the many inspiring conversations I had with my mentors, Dr. Richard Houang and Dr. Leland Cogan. I have benefited tremendously from their knowledge and insight. I am grateful to my family for their unconditional support and trust. Special thanks go to my husband, Qian Xu, who has made great sacrifices to support my pursuit of knowledge. I would also like to acknowledge my adviser and mentor at Beijing Normal University, Dr. Tao Xin, who is like family to me. He led me into the field of educational measurement and has always guided me in the right direction. I would also like to thank many other friends and colleagues from Michigan State University, Beijing Normal University, NWEA, and ACT, who supported me along this arduous journey.

TABLE OF CONTENTS

LIST OF TABLES ................................................................................................
viii LIST OF FIGURES ................................ ................................ ................................ ............................ x Chapter 1 Introduc tion ................................ ................................ ................................ .................... 1 1.1 Psychometric solutions for formative classroom assessment ................................ .................. 5 1.2 Related concepts ................................ ................................ ................................ ......................... 7 1.2.1 External and classroom assessment ................................ ................................ ..... 10 1.2.2 Summative and formative assessment ................................ ................................ . 11 1.2.3 Domain - referenced and norm - referenced testing/interpretations ...................... 12 1.2.4 Curriculum - based assessment ................................ ................................ .............. 14 1.2.5 Next - generation assessment ................................ ................................ ................. 15 Chapter 2 Literature review of CDM - based approaches ................................ ............................ 16 2.1 CDM ................................ ................................ ................................ ................................ .......... 16 2.1.1 Attributes ................................ ................................ ................................ ............... 17 2.1.2 Attribute profile space of hierarchical attributes ................................ ................. 24 2.1.3 Q - matrix ................................ ................................ ................................ ................. 29 2.1.4 Item response models and calibration methods ................................ ................... 32 2.1.5 Classification methods ................................ ................................ .......................... 34 2.1.6 Q - matrix design ................................ ................................ ................................ ..... 36 2.1.7 Criteria for test construction ................................ ................................ ................. 38 2.2 Nonparametric classification based on CDM conception ................................ ...................... 41 2.2.1 The nonparametric (NPC) method ................................ ................................ ....... 41 2.2.2 The general nonparametric classification (GNPC) method ................................ 43 2.3 CD - CAT ................................ ................................ ................................ ................................ .... 44 2.3.1 From IRT - based CAT to CD - CAT ................................ ................................ ...... 44 2.3.2 Item selection methods for CD - CAT ................................ ................................ ... 45 2.3.3 Item pool design ................................ ................................ ................................ .... 47 Chapter 3 CDM parameterization and Q - matrix with hierarchical attributes ........................... 51 3.1 Introduction ................................ ................................ ................................ ............................... 51 3.2 Attribute hierarchies ................................ ................................ ................................ ................. 
51 3.3 Parameterizations of hierarchical CDMs ................................ ................................ ................ 54 3.4 Q - matrix of hierarchical CDMs ................................ ................................ ............................... 57 3.4.1 Reduced or full Q - matrix ................................ ................................ ...................... 57 3.4.2 Complete Q - matrix for hierarchical attributes ................................ .................... 63 3.5 Summary ................................ ................................ ................................ ................................ ... 64 Chapter 4 Conditional KLI - based indexes for hierarchical CDMs ................................ ............ 66 4.1 Introduction ................................ ................................ ................................ ............................... 66 4.2 Conditional KL indices for test construction ................................ ................................ .......... 68 4.3 Simulation design ................................ ................................ ................................ ..................... 71 vii 4.4 Simulati on results ................................ ................................ ................................ ..................... 72 4.5 Discussion ................................ ................................ ................................ ................................ . 88 Chapter 5 Q - matrix design for nonparametric classifications with hierarchical attributes ...... 92 5.1 Introd uction ................................ ................................ ................................ ............................... 92 5.2 Ties in NPC ................................ ................................ ................................ ............................... 93 5.3 Simulation design ................................ ................................ ................................ ..................... 94 5.4 Simulation results ................................ ................................ ................................ ..................... 96 5.5 Discus sion ................................ ................................ ................................ ............................... 111 Chapter 6 Item pool design for CD - CAT ................................ ................................ .................. 113 6.1 Introduction ................................ ................................ ................................ ............................. 113 6.2 Method for CD - CAT item pool design ................................ ................................ ................. 114 6.2.1 The minimum optimal pool ................................ ................................ ................ 114 6.2.2 The minimum p - optimal pool ................................ ................................ ............. 116 6.3 Simulation design ................................ ................................ ................................ ................... 117 6.4 Simulation results ................................ ................................ ................................ ................... 118 6.5 Discussion ................................ ................................ ................................ ............................... 120 APPENDIX ................................ ................................ ................................ 
................................ ..... 122 REFERENCES ................................ ................................ ................................ ................................ . 128 viii LIST OF TABLES Table 1: Subsets of attribute hierarchies for 3 - attribute, 4 - attribute, or 5 - attribute conditions ..... 52 Table 2: Expected responses on two items with two independent attributes ................................ .. 55 Table 3: Expected responses on two items with two linear attributes ( ) ........................... 56 Table 4: Expected responses on under an inverted pyramid hierarchy (H3.3) ...... 56 Table 5: Expected responses on under a pyramid hierarchy (H3.4) ...................... 57 Table 6: The expected responses of two groups of attribute p rofiles on and under the DINA model ................................ ................................ ................................ ................................ ................... 59 Table 7: The q - vectors in and their equivalent q - vectors under the DINA model wit h three linear attributes (H3.2) ................................ ................................ ................................ ........................ 59 Table 8: The q - vectors in and their equivalent q - vectors under the DINA model with three inve rted pyramid attributes (H3.3) ................................ ................................ ................................ .... 60 Table 9: The q - vectors in and their equivalent q - vectors under the DINA model with three pyramid attributes (H3.4) ................................ ................................ ................................ ................... 60 Table 10: The q - vectors in and their equivalent q - vectors under the DINA model with four or five attributes ................................ ................................ ................................ ................................ ....... 61 Table 11: The q - vectors in and their equivalent q - vecto rs under the ACDM with three linear attributes (H3.2) ................................ ................................ ................................ ................................ .. 62 Table 12: Distinct q - vectors in a mixed item pool under DINA and ACDM for H3 .2 using the reduced Q - matrix approach ................................ ................................ ................................ ................ 62 Table 13: Expected response vectors given of two Q - matrices ( and ) for the inver ted pyramid (H3.3) under the DINA model ................................ ................................ ............................ 63 Table 14: Expected response vectors given of five q - vectors for independent attributes under ACDM ................................ ................................ ................................ ................................ ................. 64 Table 15: KLI indices and the CCRs for two Q - matrices ................................ ................................ 69 Table 16: Regression estimates a nd for each attribute hierarchy ................................ .............. 73 Table 17: The overall correlation and the correlations for different test lengths between cKLI and the CCR ................................ ................................ ................................ ................................ ............... 73 ix Table 18: Item parameters of five items for H3.2 ................................ ................................ ............ 
91 Table 19: Comparison between two three - item tests in terms of the two indices .......................... 91 Table 20: Hamming distances for with (H3.1) ................................ ......................... 94 Table 21: Hamming distan ces for with (H3.1) ................................ ........ 94 Table 22: Q - matrix designs for the simulation study of nonparametric classifications ................. 95 Table 23: NPC results for H3.1 ................................ ................................ ................................ ......... 98 Table 24: NPC results for H3.2 ................................ ................................ ................................ ......... 99 Table 25: NPC results for H3.3 ................................ ................................ ................................ ....... 100 Table 26: NPC results for H3.4 ................................ ................................ ................................ ....... 101 Table 27: NPC results for H4.1 ................................ ................................ ................................ ....... 102 Table 28: NPC results for H4.2 ................................ ................................ ................................ ....... 102 Table 29: NPC results for H4.3 ................................ ................................ ................................ ....... 103 Table 30: NPC results for H4.4 ................................ ................................ ................................ ....... 103 Table 31: NPC results for H4.5 ................................ ................................ ................................ ....... 104 Table 32: NPC results for H5.1 ................................ ................................ ................................ ....... 105 Table 33: NPC results for H5.2 ................................ ................................ ................................ ....... 106 Table 34: NPC results for H5.3 ................................ ................................ ................................ ....... 107 Table 35: NPC results for H5.4 ................................ ................................ ................................ ....... 108 Table 36: NPC results for H5.5 ................................ ................................ ................................ ....... 109 Table 37: NPC results for H5.6 ................................ ................................ ................................ ....... 110 Table 38: Item distribution for two hypothetical examinees with true attribute profiles of and and the union of the two sets of items ................................ ................. 115 Table 39: Q - vectors for the first item ................................ ................................ .............................. 117 Table 40: The minimum 95 - optimal pools ................................ ................................ ...................... 119 Table 41: Comparis on between the random and designed item pools ................................ .......... 119 x LIST OF FIGURES Figure 1: A complex example of attribute hierarchy in Köhn and Chiu (2018) ............................. 20 Figure 2: Three types of standard relationships in the Common Core Graph (a: the upper panel, b: left bottom panel, c: right bottom panel) ................................ ................................ ........................... 
21 Figure 3: Four hierarchical structures using six attributes (Leighton, Gierl, & Hunka, 2004) ...... 22 Figure 4: Linear, pyramid, inverted pyramid and diamond structures using five attributes (Liu & Huggins - Manley, 2016) ................................ ................................ ................................ ...................... 22 Figure 5: Four types of attribute hierarchies and an independent structure (Tu, Wang, Cai, Douglas, & Chang, 2018) ................................ ................................ ................................ ................................ ... 23 Figure 6: A subset of attribute hierarchies with 3 attributes ................................ ............................ 52 Figure 7: A subset of attribute hierarchies with 4 attributes ................................ ............................ 53 Figure 8: A subset of attribute hierarchies with 5 attributes ................................ ............................ 53 Figure 9: Correct classification rates under two conditions ................................ ............................. 67 Figure 10: A plot for tests with three independent attributes (H3.1) of the combined index with CCRs ................................ ................................ ................................ ................................ .................... 74 Figure 11: A plot for tests with three linear attributes (H3.2) of the combined index with CCRs 75 Figure 12: A plot for tests with three inverted pyramid attributes (H3.3) of the combined index with CCRs ................................ ................................ ................................ ................................ ........... 76 Figure 13: A plot for tests with three pyramid attributes (H3.4) of the combined index with CCRs ................................ ................................ ................................ ................................ .............................. 77 Figure 14: A plot for tests with four independent attributes (H4.1) of the combined index with CCRs ................................ ................................ ................................ ................................ .................... 78 Figure 15: A plot for tests with four linear attributes (H4.2) of the combined index with CCRs . 79 Figure 16: A plot for tests with t hree linear attributes + one single attribute (H4.3) of the combined index with CCRs ................................ ................................ ................................ ................................ . 80 Figure 17: A plot for tests with four invert ed pyramid attributes (H4.4) of the combined index with CCRs ................................ ................................ ................................ ................................ .................... 81 Figure 18: A plot for tests with four pyramid attributes (H4.5) of t he combined index with CCRs ................................ ................................ ................................ ................................ .............................. 82 xi Figure 19: A plot for tests with five independent attributes (H5.1) of the combined index with CCRs ................................ ................................ ................................ ................................ .................... 83 Figure 20: A plot for tests with five linear attributes (H5.2) of the combined index with CCRs . 
84 Figure 21: A plot for tests with five inverted pyramid attributes (H5.3) of the combined index with CCRs ................................ ................................ ................................ ................................ .................... 85 Figure 22: A plot for tests with five inverted pyramid attributes (H5.4) of the combined index with CCRs ................................ ................................ ................................ ................................ .................... 86 Figure 23: A plot for tests with five pyramid attributes (H5.5) of the combined index with CCRs ................................ ................................ ................................ ................................ .............................. 87 Figure 24: A plot for tests with five pyramid attributes (H5.6) of the combined index with CCRs ................................ ................................ ................................ ................................ .............................. 88 Figure 25: The conditional CCRs from four random tests in H4.2 ................................ ................. 90 Figu re 26: Distribution of the number of items for in an example ................................ .. 116 1 Chapter 1 Introduction Assessments are ubiquitous in most education systems. E ducational assessments have the potential to provide feedback . T he positive effect of feedback on learning has long been established in numerous studies in educational psychology, cognitive science, and learning science (e.g., Fyfe & Rittle - J ohns on, 2015; Hattie & Timperley, 2007; Moreno, 2004). Therefore, various types of assessment s have been widely used in schools to improve learning and teaching , which can be classified into summative assessment ( providing a summary evaluation at the end of an educational program ) and formative assessment ( providing timely diagnostic information for learning and teaching during an educational program ) . Despite its potential usefulness in learning, assessment or testing is among the most debated issues in pub lic education. There have been concerns from teachers and parents that tests take up too much time from teaching and learning (Hefling, 2015; Walsh, 2017) . A survey by the Council of the Great City Schools (CGCS) on large urban districts revealed that t he average amount of testing time spent on required assessments among eighth - grade students in the 2014 - 15 school year was 4.22 days or 2.34 % of school time ( Hart et al., 2015 ) . Examples of required assessments in the CGCS report are (i) st ate summative assessments for accountability (e.g., the Partnership for Assessment of Readiness for College and Careers [PARCC] assessments) , (ii) state and local formative assessments, (iii) local end - of - course exams , and (iv) SAT, ACT, and Advanced Placement (AP) tests (optional in some places) . Specific categories of students (including students with disabilities and English language learners) take (v) special assessments in addition to the required and optional tests . Many of the required tests mentioned above are external , high - stake s , and summative measures for accountability purposes , fueled by important educational polic y questions (Baker, 2 Chung, & Cai, 2016) . These tests are not designed for assisting daily classroom learning and teac hing . Even if diagnostic information can be extracted, it would be too late to be useful in the classroom (Hart et al., 2015) . 
Too many such tests would inevitably disrupt the learning process and may lead to problems such as teaching to the test (e.g., Copp, 2018) and test anxiety (e.g., Schutz & Pekrun, 2007, p. 3), both of which result from the misuse and abuse of educational assessments. To address this issue, the U.S. Department of Education called on states to make assessments fewer and smarter in the Testing Action Plan (U.S. Department of Education, 2015). The plan calls for more classroom, low-stakes, and formative tests that are smart enough to provide timely feedback for learning and teaching, and for fewer external, high-stakes, and summative tests. We are entering a new era of K-12 assessments in which both accountability and instructional improvement are emphasized (Chang, 2012) and, correspondingly, both summative and formative educational assessments are required. Research topics in the psychometric community echo the change in educational policies: assessment for learning has become popular as researchers emphasize making assessment truly useful for learning (e.g., Bennett, 2011; Wilson, 2018). If tests are designed to produce feedback for learning and teaching and eventually to integrate with the learning process, some problems of educational tests, including disrupting the learning process and teaching to the test, may be solved. Renewed attention has been brought to the old concepts of classroom assessment and formative assessment (e.g., Bennett, 2015; Black & Wiliam, 2008; Gotwals, 2018; Shepard, 2018). Classroom assessment refers to assessment taking place in the classroom and initiated by the teacher (Shepard, 2006; Wilson, 2018). Formative assessment is designed to provide timely and constructive feedback that is closely connected to a curriculum and is based on students' learning history. It should be a thoughtful integration of the process to provide feedback and the appropriate measurement instrument or methodology (Bennett, 2011). This thesis concerns formative assessment in the classroom, henceforth referred to as formative classroom assessment. A huge responsibility for implementing formative classroom assessments lies on the shoulders of the teachers. Specifically, teachers need to take two iterated actions that are at the core of formative assessment: one is the identification of the gap between the desired goal and the student's current status, and the other is the action taken to close the gap (Black & Wiliam, 1998). Identifying the gap is a measurement issue per se because the gap is the difference between two quantities that must be measured: the desired goal and the student's current status. However, many teachers do not feel adequately prepared for this assessment task (Mertler, 2003). Despite the increasing emphasis on educational measurement in policies and research, in some states preservice teachers are not required to take specific coursework in classroom assessment or educational assessment in general (Campbell, 2013). As a result, teachers' formative assessment practices are not without struggles (Black & Wiliam, 1998; Gotwals, 2018). There is a gap between policy and research on one side and teachers' classroom practice on the other. Although formative assessment is an attractive concept, the effectiveness of formative assessment hinges on its quality, not on its mere existence in the classroom (Black & Wiliam, 1998). As it takes time and resources to improve teacher preparation and professional development in assessment, there is an urgent need now to provide teachers with psychometric tools to facilitate formative assessment in the classroom.
Teachers especially need assistance in constructing and delivering formative assessments as well as interpreting the results (Bennett, 2015; Campbell, 2013; Gotwals, 2018). Psychometric tools, which have guided and supported most standardized testing programs, can also help with constructing, delivering, and interpreting formative assessments if used appropriately (Bennett, 2011; Bennett, 2015). Note that the use of psychometric tools, especially item response models, inevitably introduces some degree of standardization. Ideally, the teacher would develop his or her own formative assessment because it is the teacher who knows best the learning history of each student and the learning goals. A teacher-developed assessment is the exact opposite of standardization. With limited educational resources, therefore, we need to strike a balance between individualization and standardization when thinking of psychometric tools for formative classroom assessment. In choosing appropriate psychometric tools (e.g., item response models) for formative classroom assessment, the best place to start is validity, which mainly depends on the usefulness of the feedback for formative purposes. Therefore, the first question we should ask is: What kind of feedback do teachers need? The needs of teachers were reflected in a survey conducted on a nationally representative sample of 400 elementary and secondary mathematics and English language arts teachers in the U.S. about a decade ago (Goodman & Huff, 2006; Huff & Goodman, 2007). The survey shows that norm-referenced information, standards-based information, and performance information at the item level from large-scale standardized assessments are of comparatively little interest to teachers because such information cannot be used directly in instruction; what teachers need is detailed information about the strengths and weaknesses of individual students regarding specific knowledge, skills, and competencies. Various methods have been proposed for providing diagnostic feedback. Some approaches involve extracting information from summative tests based on and calibrated with unidimensional item response theory (IRT) models (e.g., subscores; see Haberman, 2008). However, some researchers caution that each purpose can be compromised if a single assessment is expected to serve multiple purposes (Pellegrino, Chudowsky, & Glaser, 2001, p. 2; Reckase, 2017). Although unidimensional IRT models have been successfully applied in summative tests aimed at selecting and differentiating examinees, they might not be the most appropriate ones for formative purposes because the diagnostic nature of formative assessment usually suggests multidimensionality.

1.1 Psychometric solutions for formative classroom assessment

A family of measurement models, cognitive diagnostic models (CDMs; e.g., Rupp, Templin, & Henson, 2010), which were developed for modeling diagnostic assessment data, is chosen for formative classroom assessment in this thesis. These models target multiple fine-grained latent constructs (referred to as attributes) that are typical in interim or formative assessments. With categorical latent variables, they are less affected by high dimensionality than multidimensional IRT (MIRT) models and are more appropriate for finer-grained constructs than MIRT models (Templin & Bradshaw, 2013).
The identification of these finer-grained constructs as well as their relationships is often based on cognitive or learning theories and requires collaboration between psychometricians and content experts. This construct space is similar to the concept of a domain in domain-referenced testing (Hively, 1974; Houang, 1980). The assessment developed based on CDMs can be integrated with the learning process through these constructs. Therefore, CDMs have the potential to be an essential part of the solution for formative classroom assessment. Specifically, this thesis concerns formative classroom assessment that (i) can be linked to an instructional program lasting for several weeks and (ii) can provide formative information for learning and instruction. The underlying measurement models are CDMs. Note that the assessment of interest does not intend to measure relatively stable traits such as ability or aptitude. Instead, the targeted construct is the internalized knowledge or skills that the student acquires after a particular period of instruction. Although current CDM methods (i.e., calibration and classification) work well in large-scale assessments with hundreds or thousands of examinees and long tests, the application of CDMs in small-scale test settings in the classroom would be problematic due to limited testing time and the lack of response data required for reliable estimation (Chiu, Sun, & Bian, 2018). There are three alternatives to conventional CDM analysis that do not require item calibration and are therefore practical in the classroom setting: 1) parametric classifications using non-adaptive tests assembled from a calibrated item pool (e.g., Henson & Douglas, 2005), 2) nonparametric classifications using non-adaptive tests based on CDMs (e.g., Chiu, Sun, & Bian, 2018), and 3) cognitive diagnostic computerized adaptive testing (CD-CAT; e.g., Chen, 2009). The first two approaches use non-adaptive tests, which means the same test is given to all students in a classroom, so test construction is a critical question. The CD-CAT approach uses adaptive tests that are tailored to the state of individual students, the success of which depends on a well-designed item pool. How to design an appropriate item pool for a CD-CAT program remains a research question. Responding to practical needs and gaps in the literature, this thesis addresses the test construction and item pool design issues for these three approaches. These CDM-based approaches are intended to facilitate formative classroom assessment, which is related to domain-referenced testing and curriculum-based assessment. Therefore, the rest of Chapter 1 reviews these related concepts as well as the broader concept of educational assessment and the so-called next-generation assessment. The next chapter reviews the fundamentals and previous studies of the three CDM-based approaches with a focus on CDMs with hierarchical attributes. Chapter 3 deals with parameterizations and Q-matrices of CDMs with hierarchical attributes, followed by three chapters addressing three research questions related to the test construction or item pool design issues.

1.2 Related concepts

Formative classroom assessments belong to the broader concept of educational assessment or achievement assessment. The terms educational assessment and achievement assessment have been used interchangeably in the literature.
More specifically, Mislevy, Steinberg, and Almond (2003) in their seminal work on assessment design defined an educational assessment to be " a machine for r easoning about what students know, can do, or have accomplished, based on a handful of things they say, do, or make in particular settings. " Baker, Chung, and Cai (2016) offered a broader construction : A test or an assessment consists of a systematic meth od of gaining knowledge, characteristics, or propensities. The definition of Mislevy et al. (2003) focuses on the types of inferences made from the assessment , and the definition of Baker et al. (2016) also highlights the process of making inference s (i.e., via sampling ) in educational assessment . The history of educational assessment has been inter t wined with t hat of psychological assessment . Their connection can be seen from the title of the Standards for Educational and Psychological Testing (AERA, American Psychological Association [APA], & National Council on Measurement in Education [NCME], 1985, 1999, 2014 ) as well as journals and books (e.g., Educational and Psychological Measurement ). The first generation of standardized achievement tests w as developed in the same period and by the same researchers as IQ tests were ( Sheperd, 2006 ). As a result, educational assessments and psychological assessments tend to have the same 8 item formats and often utilize the same statistical models (e.g. , item response theory models), with both hav ing roots in individual difference s psychology . In this section, the di scussion is limited to IRT - based assessment because most large - scale or commercial achievement tests (e.g., PARCC, NAEP, PISA, SAT, ACT) use IRT models. M ore and mo r e researchers in the educational assessment field , however, hav e realized the critical differences between educational and psychological assessments despite their entwined histories . Among the most discussed issues is t he definition of the measured domain , the stability of the unobserved construct s , the dimensionality of the construct space , the normal ity assumption , and the purpose of assessment . The unobserved construct s measured in psychological assessments are usually not well - defined. As noted by Brody (2000, p.39) , researchers know how to measu re the construct called intelligence, but they still do not know what has been measured ; w hat the IQ test does , as a result, is merely trying to differentiate people along a hypothetical scale. In some sense, t he test that is supposed to measure intelligen ce defines what intelligence is . T his i s not true in education where domains could be well defined according to the instructional goals of a specific instruction al program . However, the measured domains are not well delineated for some educational tests (B aker, 2009). In such cases, it can be said that we know how to measure achievement, but we do not know what has been measured , particularly, if and when educational assessments follow the tradition of psychological measurement . The unobserved construct s in psychological assessment s are usually stable traits, such as intelligence, self - efficacy , or personality . These traits are assumed , or believed, to remain stable for a n extended period. The purpose of psychological assessments is to reflect t he relative location of a person regarding this latent trait , and improvement or change within a short period is not 9 expected (Baird, Andrich, Hopfenbeck, & Stobart , 2017). 
However, examinees in educational assessments are expected to show change s in their educational attributes and accomplishments within a short period, which is the primary purpose of any educational program. The existence of content blueprints complicates the definition of the unobserved constructs in educational assessment. Unlike a psyc hological test, an educational test is usually developed based on a content blueprint (Luecht, 2013; Reckase, 2017). A content blueprint is usually constructed as a set of test specifications that is independent of the psychometric modeling of test responses (Luecht, 2013). However, a test blueprint with multiple content domains may suggest , and be consistent with, a multidimensional space ( Reckase, 2017 ). Besides content dimensions, cognitive dimensions have also been considered for educational asse ssments , which further complicate s the dimensionality issue ( George & Robitzsch , 2018 ; Harks, Klieme, Hartig, & Leiss, 2014 ). In an analysis of TIMSS data, content dimensions are number, geometry , and data, and cognitive dimension are knowing, reasoning, a nd applying ( George & Robitzsch , 2018) . For most of the commercial achievement tests, the interpretation of a test score is directly based on the assumed normal distribution of underlying stable psychological characteristics (Baker, 2009). Th is normal ity assumption is another inheritance e ducational measurement inherited from the psychological measurement under the general framework of latent variable modeling ( Baker & Kim, 2004 ) . Consistent with the interpretation of scores , a normal distribution is usually assumed in IRT modeling for the unobserved construct . Specifically , the normal distribution is used (i) in the integration step in item calibration and (ii) as a prior distribution in Bayesian IRT - based scoring (Baker & Kim, 2004) . While the normal ity assumption may work well for a variety of stable psychological traits (e.g., intelligence, self - efficacy), whether it is suitable for the 10 measurement of learning or mastery of educational attributes is questionable ( Bloom, 1968; Baker, 2009). Educational assessment designers, following the guidelines developed for psychological assessments, tend to optimize the test for detecting differences among examinees . It would work well if the goal is selecti on . However, the test development guidelines may need some adaptations whe n we consider the purpose of improving student learnin g because the differences between different test scores could be trivial regarding the subject matter (Bloom, 1968). One characteristic of ed ucational assessments that is different from psychological assessments , however, is the existence of many dichotomies , such as classroom assessment versus external tests, formative versus summative assessment, domain - referenced (or criterion - referenced) ve rsus norm - referenced testing (assessment) . 1 . 2.1 External and c lassroom assessment External assessments are constructed outside of the classroom by measurement and subject experts and are often fueled by educational policies (Baker, Chung & Cai, 2016 ) , also referred to as the large - scale standardized assessment s . There is a rich literature on the theories and practices of external assessments. They have served well the purpose of selection and ac countability over the past decades. However, t he effects of external assessments on learning are difficult to establish (Wilson, 2018) . 
Educational assessments can be divided into classroom assessments and external assessments, depending on the administration of the assessments. Teachers usually create and grade classroom assessments based on particular instructional goals, and they make short-term decisions based on the assessment results (Hanna & Dettmer, 2004, p. 8). Classroom assessments may also be developed outside of the classroom but initiated by teachers or students in the classroom. Classroom assessments, when used in a constructive way by teachers, can send the message to students about what is important (Nitko, 2001), and they have been shown to have a substantial impact on student success (Shepard, 2006; Wilson, 2018). Some researchers believe that we can make measurement truly important for education through classroom assessments (Wilson, 2018).

1.2.2 Summative and formative assessment

The dichotomy of formative versus summative assessment was proposed decades ago. While great improvement has been seen in the practices and research of summative assessment over the past few decades, formative assessment mostly appears as the subject of theoretical discussion (Scriven, 1967; Bloom, 1968; Bloom, Hastings, & Madaus, 1971). Scriven (1967) and Bloom (1968) were among the first to use the terms formative evaluation and summative evaluation. A summative evaluation judges what students have mastered at the end of an educational program (Bloom, 1968). Defining formative assessment, however, can be much more complicated: there has been debate over the conceptualization of formative assessment as a test or as a process (Bennett, 2011). For Bennett (2011), neither side of the argument can provide a full picture of formative assessment: he defined formative assessment to be a thoughtful integration of process, on the one hand, and methodology or instrumentation, on the other hand. Other researchers put more emphasis on the process part (e.g., Furtak, Circi, & Heredia, 2018; Gotwals, 2018). Recently, formative assessment has been receiving renewed attention (Bennett, 2011, p. 5). Since formative assessments generally take place in the classroom as a type of classroom assessment, teachers need to take on many responsibilities. However, it remains a challenging task for teachers to learn how to do formative assessment (Bennett, 2011; Furtak, Circi, & Heredia, 2018; Gotwals, 2018; Shavelson, 2008). Teachers need guidance and assistance in various aspects of assessment, including goal setting, extracting information, providing feedback, and using feedback to modify instruction (Gotwals, 2018, p. 157). Bennett (2011, p. 18) argued that teachers need deep cognitive-domain understanding and knowledge of measurement fundamentals in addition to pedagogical knowledge in order to realize effective formative assessment. However, even if teachers can acquire all the knowledge, understanding, and skills needed for formative assessment, they still need a substantial amount of time to put them into practice (Bennett, 2011).

1.2.3 Domain-referenced and norm-referenced testing/interpretations

Another well-known contrast in educational measurement is between domain-referenced (or criterion-referenced) testing and norm-referenced testing (Hively, 1974). Norm-referenced testing (NRT) has its roots in the psychological measurement of individual differences. NRT goes hand in hand with latent trait modeling (Hively, 1974; Houang, 1980).
Test construction for NRT based on latent trait modeling places great emphasis on correlation, or the so-called internal consistency, among a set of items, which plays a significant role in decisions about including or excluding certain items (Hively, 1974; Houang, 1980). However, this test construction procedure may pose a danger to the validity of measurement because 1) variables that are conceptually disconnected can be correlated (Baird et al., 2017), and 2) the obtained set of items may not be a representative sample from the targeted domain (Houang, 1980). Domain-referenced testing (DRT), in contrast, is grounded more firmly in educational considerations. More emphasis is placed on validity than on reliability. Much research is devoted to the discussion of the domain and of item sampling within the domain (Baker, 1974; Hively, 1974; Millman, 1974). A domain can be defined by an explicitly specified set of items (Hively, 1974) or by a set of rules according to which a large number of test items could be generated (Baker, 1974). A complex domain can be divided into sub-domains. The measurement of principal interest in DRT is the examinee's score over all items in a domain or sub-domain (Brennan, 1981; Hively, 1974). This score, referred to as the domain score (or the sub-domain score), cannot be obtained directly because it is impossible to administer all the items in the domain (or sub-domain). It can be estimated by the examinee's observed percentage of correct responses on a set of items if the set is a representative sample (Brennan, 1981). Estimates for large domains may be obtained by stratified sampling over their constituent sub-domains, and diagnostic profiles may be gathered by sampling within sub-domains (Hively, 1974). IRT-based estimators are available for domain or sub-domain scores, given a large set of calibrated items (Bock, Thissen, & Zimowski, 1997). For a complicated domain, the set of sub-domain scores serves as a diagnostic profile (Hively, 1974); alternatively, one can assign weights to the sub-domain scores to calculate a single domain score (Millman, 1974). The estimated domain or sub-domain scores are then compared to some criterion to decide whether mastery has been achieved. In contrast to these two-stage methods, Houang (1980) took a latent class approach to estimate the mastery of a simple domain. The concept of DRT as an assessment type lost its popularity after the 1970s. Since the 1974 Standards for Educational and Psychological Tests, the distinction between two types of test score interpretations, domain-referenced (or criterion-referenced) and norm-referenced interpretations, has received more attention. Instead of differentiating two types of assessments (i.e., NRT and DRT), test developers draw from both test development perspectives to ensure the reliability and validity of measurement (Brennan, 2006). Although most standardized testing programs are designed primarily to provide norm-referenced interpretations, there has been an increasing need for domain-referenced or criterion-referenced interpretations.

1.2.4 Curriculum-based assessment

Educational assessments may or may not be based on a specific curriculum. To be useful for learning, however, assessment needs to be integrated into a coherent process of assessment, instruction, and curriculum based on learning theories (Black, Wilson, & Yao, 2011; Shepard, Penuel, & Pellegrino, 2018).
This is especially true for formative classroom assessment. If the assessment is not aligned with the curriculum that students are learning, the validity of the formative feedback will be in doubt. A link between curriculum and achievement assessment has been well established in the international assessments led by the International Association for the Evaluation of Educational Achievement (IEA). The curriculum-achievement alignment constitutes a vital part of the validity evidence for the subject achievement tests. The validity check (comparing assessment items with the curriculum students have experienced) has been carried out in some form in all IEA studies (Cogan & Schmidt, 2019). For example, teachers provided validity checks on the test items in the pilot study and the First International Mathematics Study (FIMS) and in the second studies, SIMS and SISS (Husén, 1967a; Keeves, 1974; Travers & Westbury, 1989). The 1995 Third International Mathematics and Science Study (TIMSS-95) conducted a more extensive curriculum analysis and provided evidence for the relationship between assessment, instruction, and curriculum (Schmidt & McKnight, 1995; Schmidt, Jorde, et al., 1996; Schmidt, McKnight, Valverde, Houang, & Wiley, 1996). A curriculum is structured around subject content. Taking the subject of mathematics as an example: mathematics, even circumscribed by what is taught in school, encompasses a very large content domain. The question is then how to model curriculum-sensitive content in the psychometric model for curriculum-based assessment. Under the typical unidimensional IRT modeling framework, content exists in the form of content constraints, independent of the measured construct (e.g., Kingsbury & Zara, 1991; van der Linden, 2005a). The separation of the measured construct and the curriculum-sensitive contents makes it difficult, if not impossible, to extract formative feedback from the test data regarding the contents.

1.2.5 Next-generation assessment

Since the start of the new millennium, there has been increasing discussion of the so-called next-generation assessment in the educational measurement community. With next-generation assessment, researchers and measurement practitioners attempt to respond to the critiques of educational measurement mentioned earlier and to the needs of learners, parents, and teachers (e.g., Bennett, 2011; Conley, 2018; Embretson, 2003; Heritage, 2010). A lengthy, but not exhaustive, list of next-generation assessment topics includes formative assessment (e.g., Gorin & Mislevy, 2013; Heritage, 2010), assessment of new constructs such as critical thinking (e.g., Liu, Frankel, & Roohr, 2014), technology-based assessment (e.g., Beatty & Gerace, 2009; Bennett, 2015; Mislevy, 2016), classroom assessment (e.g., Shepard et al., 2018), personalized testing and learning (e.g., Chen, Li, Liu, & Ying, 2018; Clark, 2016), integration of learning and assessment (e.g., Baird et al., 2017), and automatic item generation and scoring (e.g., Bennett, 2015; Gierl & Lai, 2012).

Chapter 2 Literature review of CDM-based approaches

This chapter provides brief literature reviews of the basics of CDMs, nonparametric classifications based on CDMs, and CD-CAT, which form the foundations of the three CDM-based approaches for formative classroom assessment proposed in Chapter 1.
The CDM-based test construction begins with the identification of the attribute profile space and of the Q-matrix characterizing the relationship between items and attributes (described in detail later in this chapter). The attribute profile space defines the domain in the language of domain-referenced testing. Test construction based on CDMs has many similarities with domain-referenced testing (Hively, 1974; Houang, 1980). The identification of the relationships between attributes and items usually depends on cognitive theories and learning theories. In this way, the assessment can be integrated with the learning process.

2.1 CDM

CDMs (cognitive diagnostic models), also known as diagnostic classification models, belong to the confirmatory or constrained latent class modeling framework, in which individuals are classified into groups defined by combinations of categorical (usually binary) latent variables (Rupp, Templin, & Henson, 2010). The categorical unobserved variables that define the measurement constructs underlying a CDM are often referred to as attributes (Tatsuoka, 1983, 1990), elsewhere called finer-grained proficiencies (de la Torre & Karelitz, 2009) or facets (Henson, DiBello, & Stout, 2018). Macready and Dayton (1977) and Houang (1980) were among the first to apply latent class models using only one dichotomous trait to measure mastery of a simple domain. Later, the works of Tatsuoka (1983) and Leighton, Gierl, and Hunka (2004) involved more complex domains with multiple attributes, and they introduced the concepts of the Q-matrix and the attribute hierarchy. In the past three decades, a large number of CDMs that employ item response functions (IRFs) and explicit Q-matrices have been proposed and studied intensely (Rupp, Templin, & Henson, 2010; Templin & Bradshaw, 2014) in response to the pressing demand for individualized diagnostic information in education (Center for K-12 Assessment and Performance Management at ETS, 2014; U.S. Department of Education, 2014).

2.1.1 Attributes

Since the introduction of attributes to diagnostic assessment by Tatsuoka (1983, 1990), the terminology of attributes has been used in the CDM literature to refer to the unobserved variables that the test aims to measure. Long before the time of diagnostic assessment, Guttman used the term attribute to refer to a qualitative (i.e., categorical) variable. In the diagnostic assessment literature, attributes have been described in terms of cognitive operations or item types, or, more generally, viewed as "sources of cognitive complexity" in test performance, which may consist of both cognitive and content components. Leighton, Gierl, and Hunka (1999) defined attributes as the procedural or declarative knowledge needed to perform a task in a specific domain. Most of the above definitions include both cognitive and content components. In an educational setting, possessing an attribute is often referred to as mastery of the attribute, and lacking an attribute is referred to as non-mastery (Templin & Bradshaw, 2014). Like most CDM research, we restrict the scope of this thesis to attributes with two levels, so that α_k = 1 indicates mastery of attribute k and α_k = 0 indicates non-mastery of this attribute. An attribute profile (Templin & Bradshaw, 2014), which is also referred to as an attribute pattern (Ma, Iaconangelo, & de la Torre, 2015) or an attribute mastery pattern (Henson & Douglas, 2005), is a specific combination of attribute mastery and non-mastery, with each combination representing a unique latent class of examinees.
Attribute profiles are denoted by column vectors α = (α_1, α_2, ..., α_K)′, where α_k = 0 or 1 indicates the absence or presence, respectively, of the k-th attribute (non-mastery vs. mastery), and the superscript ′ denotes transpose.

2.1.1.1 Interaction among attributes in an item

CDMs can be categorized as noncompensatory or compensatory models based on their assumptions about how attributes interact with each other to affect the probability of an item response. According to DiBello, Roussos, and Stout (2006), a noncompensatory (or conjunctive) model assumes that lacking competency on any required attribute poses a severe obstacle to successful performance on the task. In other words, successful performance on a task requires mastery of all the required attributes; mastery of some of the required attributes does not compensate for non-mastery of the other required attributes. The terms conjunctive model and noncompensatory model are often used interchangeably. In contrast to the noncompensatory case, a compensatory interaction of attributes means that mastering one required attribute can compensate for non-mastery of other required attributes. An extreme case of compensatory models is a disjunctive model, in which mastering any subset of the required attributes leads to an equally high probability of a correct response (DiBello, Roussos, & Stout, 2006).

2.1.1.2 Interdependencies among attributes

Most CDMs assume independent attributes (Rupp et al., 2010). Nevertheless, there are cases in which data analysis suggested the presence of interdependencies among attributes (Templin & Bradshaw, 2014). To account for the relationships between attributes, de la Torre and Douglas (2004) proposed a higher-order model linking the categorical attributes to an underlying multivariate normal distribution. The interdependencies among attributes are reflected in the correlated dimensions of the multivariate normal distribution. Another approach to modeling the attribute relationships is to impose a hierarchical structure, in which mastering one attribute can be a prerequisite to mastering another attribute (Leighton et al., 2004; Tatsuoka, 2009; Templin & Bradshaw, 2014). This thesis adopts the hierarchical approach, which is reviewed in more detail below.

A hierarchy of attributes specifies the relationship between each pair of attributes. For attribute a and attribute b, if mastering attribute b requires having mastered attribute a (that is, α_b = 1 implies α_a = 1), attribute a is called a prerequisite of attribute b. Suppose there are three attributes in a linear relationship. We then have that α_2 = 1 implies α_1 = 1, that α_3 = 1 implies α_2 = 1, and, by transitivity, that α_3 = 1 implies α_1 = 1. Attribute hierarchies are often visualized by a tree graph with a set of attributes connected by arrows. An arrow that points from attribute a to attribute b means that mastering attribute a is a prerequisite to mastering attribute b (Gierl, Leighton, & Hunka, 2000; Köhn & Chiu, 2018; Leighton et al., 2004). Attribute a is the lower-level attribute and attribute b is the higher-level attribute in this case. These pair-wise prerequisite relationships can be formally defined by a K-by-K binary matrix called the adjacency matrix (A-matrix), in which K is the number of attributes (Tatsuoka, 1983, 2009; Gierl et al., 2000). The A-matrix represents the direct relationships among attributes, which are usually illustrated by one-way arrows. The (k, k′)-th element of the A-matrix indicates whether attribute k is directly connected, as a prerequisite, to attribute k′. The diagonal elements of the A-matrix are zeros.
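To make the A-matrix concrete, the following sketch (Python with NumPy; the three-attribute linear hierarchy and all variable names are illustrative assumptions, not taken from the studies cited above) encodes a linear hierarchy as an A-matrix and derives from it the direct and indirect prerequisite relations discussed next.

import numpy as np

# Adjacency (A) matrix for three attributes in a linear hierarchy:
# attribute 1 -> attribute 2 -> attribute 3.
# A[j, k] = 1 means attribute j is a direct prerequisite of attribute k;
# the diagonal is zero.
A = np.array([[0, 1, 0],
              [0, 0, 1],
              [0, 0, 0]])

def reachability(A):
    # R[j, k] = 1 if attribute j is a direct or indirect prerequisite
    # of attribute k (a transitive closure of A).
    K = A.shape[0]
    R = (A > 0).astype(int)
    for _ in range(K):                      # enough passes to absorb all paths
        R = ((R + R @ A) > 0).astype(int)   # add paths that are one arrow longer
    return R

print(reachability(A))
# [[0 1 1]
#  [0 0 1]
#  [0 0 0]]  attribute 1 is a (direct or indirect) prerequisite of attributes 2 and 3

The printed matrix collects both the direct relations stored in the A-matrix and the indirect relations that arise from chaining arrows.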
The following is an example of a complex hierarchy from Köhn and Chiu (2018), together with its 11-by-11 A-matrix (Figure 1).

Figure 1: A complex example of an attribute hierarchy in Köhn and Chiu (2018)

In an attribute hierarchy, there are direct and indirect relationships. A direct relationship is characterized by a one-way arrow pointing from one attribute to another. An indirect relationship holds between two attributes that are connected through one or more intermediate attributes and the arrows between them. If compared to a road map, an attribute hierarchy consists of at least one path of attributes, where a path is defined as a subset of attributes connected by one-way arrows. A complex attribute hierarchy such as the one in Figure 1 has more than one path. For any hierarchy of $K$ attributes, the longest path involves at most $K$ attributes and has at most $K - 1$ arrows; the maximum is reached when the attributes form a linear hierarchy. Note that some attributes appear in the same path while others do not share a common path; in the hierarchy of Figure 1, several pairs of attributes do not share any common path.

Prerequisite relationships between attributes are quite common in content standards for mathematics. As shown in the map of the College- and Career-Ready Standards (CCRS, formerly called the Common Core State Standards), content standards do not stand alone but form a complicated network (Zimba, 2011, 2015). Some standards form a linear structure, with one standard being the prerequisite of another (Figure 2a). Some standards serve as prerequisites for several other standards (Figure 2b). There are also standards that build on several other standards (Figure 2c).

Figure 2: Three types of standard relationships in the Common Core graph (a: upper panel, b: bottom left panel, c: bottom right panel)

However, attribute hierarchies have long been underrepresented in the CD literature, and related studies have begun only recently (e.g., Templin & Bradshaw, 2014). Research on hierarchical attributes has focused on hypothesis testing of an assumed attribute hierarchy (Templin & Bradshaw, 2014) and on model estimation (Tu et al., 2018). When an attribute hierarchy is shown to be present, it is recommended to incorporate this information in the modeling process by reparameterizing the original model and excluding certain attribute profiles (Templin & Bradshaw, 2014; Tu et al., 2018). Hierarchies that have been used in simulation studies are summarized below. Leighton et al. (2004) proposed four types of attribute hierarchies, which have been adopted in many studies: linear, divergent, convergent, and unstructured hierarchies, as illustrated in Figure 3. Liu and Huggins-Manley (2016) renamed the unstructured hierarchy and the convergent hierarchy of Leighton et al. (2004) as the inverted pyramid and the diamond hierarchy, respectively, and replaced the divergent hierarchy with the pyramid hierarchy (Figure 4). Tu et al. (2018) added a mixed type to the list, which is a combination of two hierarchies (Figure 5).
Figure 3 : Four hierarchical structures using six attributes (Leighton, Gierl, & Hunka, 2004) Figure 4 : Linear, pyramid, inverted pyramid and diamond s tructures using five attributes (Liu & Huggins - Manley, 2016 ) 23 Figure 5 : Four types of attribute hierarchies and an independent structure (Tu, Wang, Cai, Douglas, & Chang, 2018) Note that a pyramid (e.g., Liu & Huggins - Manley, 201 6 ) or a convergent (e.g., Tu, et al., 2018) hierarchy comes with an implicit assumption that all prerequisite attributes must be mastered so that the mastery of the higher - level attribute can be possible. In application studies of CDMs with hierarchical attributes, the most commonly seen hierarchy is the linear hierarchy ( Gierl, Wang, & Zhou, 2008; Gierl, Alves & Majeau, 2010 ) . To get an idea of the hierarchical relationships in real classroom instruction, two CCRS - aligned textbooks for Grade 4 math , Eureka Math (2015) and Engaged NY (2014) , were analyzed . The content structures of the textbooks may shed some light on classroom instruction because textbooks provide an essential source of information and guid ance for teachers, especially when new standards are introduced. The content analysis results can be found in Appendix A . Generally, three to five attributes (standards) are involved in a period of one to four weeks . Pyramid and invert pyramid structures f ollowing the definitions of Liu and Huggins - Manley ( 2016 ) are observed besides the linear structure. 24 2 .1. 2 Attribute profile space of hierarchical attributes For a test involving K attributes, the set of all possible attribute profiles, subject to the relationship between attributes, is called the attribute profile space ( also called latent attribute space or latent space ; e.g., Köhn & Chiu, 2018 ; Tatsuoka, 2009). The attribute prof ile space , denoted by , is defined by a matrix with K columns representing K attributes and each row vector representing a n attribute profile . Identifying the attribute profile space for independent attributes is straightforward. Assuming independe nt attributes, the attribute profile space is a - by - K matrix , representing different classes into which the examinees would be classified . The hierarchical relationship s between attributes constrain the latent attribute space because some attribute profiles become impossible . Specifically, it is not allowed to master an attribute without mastering its prerequisite. Researchers have reached a consensus on restricting the attribute profile space at the presence of hierarchical a ttributes (e.g., Templin & Bradshaw, 2014; Tu et al., 2018). However, the identification of the attribute profile space is not straightforward, especially when the number of attributes is large (Köhn & Chiu, 2018). Köhn and Chiu (2018) proposed the lattice - theoretical approach to obtain the latent space . Th e first step is to derive the K from the tree graph of the attribute hierarchy . Each basic proficiency class is a K - element vector characterizing a possible path from the lowest - level attribute to a higher - level attribute . The next step is to reconstruct the attribute space as a set of linear combinations of the basic proficiency clas ses. However, t he inspection becomes more difficult as the number of attributes increases and the process is prone to mistakes. 25 An alternative way to derive the attribute profile space begins with the A - matrix. 
The first step is to derive the basic proficiency classes, as defined in Köhn and Chiu (2018), in the form of the column vectors of a matrix called the reachability matrix (R-matrix; Tatsuoka, 1983, 2009; Gierl et al., 2000). This approach is, therefore, referred to as the R-matrix approach.

2.1.2.1 R-matrix approach

We define some Boolean operations before elaborating the R-matrix approach. A Boolean vector or matrix is one for which all entries are either 0 or 1. The Boolean addition of two Boolean vectors $\mathbf{u}$ and $\mathbf{v}$ of $K$ elements is defined as

$\mathbf{u} \oplus \mathbf{v} = (u_1 \vee v_1,\ u_2 \vee v_2,\ \ldots,\ u_K \vee v_K),$

where $\vee$ is the logical "or" operator. The Boolean product of the $I$-by-$K$ Boolean matrix $\mathbf{U}$ and the $K$-by-$J$ Boolean matrix $\mathbf{V}$ is the $I$-by-$J$ matrix whose $(i, j)$th element is

$\bigvee_{k=1}^{K} \left( u_{ik} \wedge v_{kj} \right),$

where $\wedge$ is the logical "and" operator, $1 \le i \le I$, and $1 \le j \le J$. For a square Boolean matrix $\mathbf{M}$ and any positive integer $n$, the $n$th Boolean power of $\mathbf{M}$ is the Boolean product of $n$ copies of $\mathbf{M}$.

The derivation of the R-matrix from the A-matrix and the derivation of the attribute profile space from the R-matrix are elaborated below. The R-matrix can be calculated as the $n$th Boolean power of the matrix $\mathbf{A} + \mathbf{I}$ (Leighton et al., 2004):

$\mathbf{R} = (\mathbf{A} + \mathbf{I})^{n},$

where $n$ is the smallest integer for which $\mathbf{R}$ reaches invariance; $n$ is determined by the number of arrows in the longest path of the hierarchy.

The next step of the R-matrix approach derives the attribute profile space from the R-matrix. Note that the A-matrix and the R-matrix are of order $K \times K$; the attribute profile space, with columns indicating different attributes, however, may have more than $K$ rows. The following algorithm produces the transpose of the attribute profile space (Ding, Luo, Cai, Lin, & Wang, 2008). 1) For the $k$th column of the R-matrix, take the Boolean addition of the $k$th column and each column to its right. 2) Whenever a new column vector is obtained, add it to the right of the R-matrix. 3) Repeat the first two steps for each column of the original R-matrix, including the last one; note that the column vectors entering the Boolean additions include the newly added columns. The resulting matrix is called the expanded R-matrix, denoted $\tilde{\mathbf{R}}$, because it expands the $K$-by-$K$ R-matrix by adding columns. This algorithm is referred to as the expanding algorithm. The attribute profile space is the transpose of the expanded R-matrix ($\tilde{\mathbf{R}}'$) with an additional row of 0s. The space contains at most $2^K$ rows, representing $2^K$ attribute profiles, denoted as $\boldsymbol{\alpha}$'s; the maximum is reached when the attributes are independent, and the number of attribute profiles decreases with hierarchical attributes. The R-matrix approach is equivalent to the lattice-theoretical approach (Köhn & Chiu, 2018) but is easier to apply in practice. Appendix B provides R code for the expanding algorithm.

2.1.2.2 Interpretations of the Boolean operations

The interpretations of the Boolean operations involved in the R-matrix approach are provided below. Note that the A-matrix captures only the direct relationship between two attributes: each 1-entry stands for a one-way arrow connecting two attributes. The R-matrix should also capture indirect relationships. Therefore, the first step is to add the identity matrix to the A-matrix to account for the relationship of an attribute with itself. The next step multiplies $\mathbf{A} + \mathbf{I}$ by itself until invariance is achieved. The $(k, l)$th element of $(\mathbf{A} + \mathbf{I})^{2}$ is $\bigvee_{m=1}^{K} \left( a^{*}_{km} \wedge a^{*}_{ml} \right)$, where $a^{*}_{km}$ denotes the $(k, m)$th element of $\mathbf{A} + \mathbf{I}$; the term $a^{*}_{km} \wedge a^{*}_{ml}$ equals 1 if $a^{*}_{km} = 1$ and $a^{*}_{ml} = 1$, which means that attribute $k$ and attribute $l$ have an indirect relationship through attribute $m$; otherwise it equals 0.
The disjunction among attribute takes the value of 1 if attribute and attribute has an indirect relationship through any attribute. Consequently, the elements in capture all indirect relationships between attribute and attribute in the form of . Similarly, it can be shown that the th element of the matrix takes the value of 1 i f attribute and attribute has an indirect relationship through two attributes in the form of . Since the longest possible path in an attribute hierarchy has arrows, the largest number would take in equation is . Take the th column of the R - matrix. The th element of the th column takes the value of 1 if there a path from attribute to attribute . If the th attribute is at the lowest level in any path, then the th column has only one non - zero entry ; otherwise, the th column describes a path which 28 ends at attribute . As a result, t he columns in the R - matrix correspond to different paths as shown in the tree graph , equivalent to t he basic proficiency classes defined in Köhn and Chiu (2018). We use a linear hierarchy with four attributes to demonstrate the derivation of the R - matrix . The four columns in the R - matrix in equation describe four paths that start from attribute 1 (i.e., the lowest - level attribute) and end with each attribute, respectively. Invariance is achieved at b ecause the longest path (i.e., ) has three arrows. Th e columns of the R - matrix can be seen as attribute mastery profile s . If the attributes form a single linear hierarchy, then the R - matrix contains all the possible attribute mastery profiles. However, if there exist two attributes th at do not appear in the same path, the R - matrix fails to account for all the possible combinations of states of two such attributes. Consider the following attribute hierarchy . The first path (column) is nested within the other three paths (columns). The second path is nested within the two paths on the right. However, the last two paths are not nested within each other because A3 and A4 are not connected directly or indirectly in any path. The four columns in the R - matrix also correspond to four profiles. 29 Another possible profile , which is not included in the R - matrix, can be obtained by adding the last two columns of the R - matrix. The expanding algorithm involves the Boolean addition of two columns in the R - matrix shown in equation and . Addition of two nested paths as defined in equation does not produce a new column . Addition of two independent path s , however, produces a new column, which expands the original R - ma trix . Continuing with the complex hierarchy example in Köhn and Chiu (2018) , the attribute profile space derived from the expanding algorithm contains 31 attribute profiles. 2 .1.3 Q - matrix The relationship between the items and the attributes is described in an indicator matrix, called the Q matrix, which has r ows corresponding to items, columns corresponding to attributes, and binary elements indicating whether an attribute is measured by an item (that is, whether mastery of an attribute is required to succeed on an item). The Q - matrix was initially proposed by Tatsuoka (1983) and has been employed in most of the commonly used CDMs. The Q - matrix reflects the test blue print (Leighton, Gierl, & Hunka, 2004) . Specifically, t he Q - matrix operationalizes the substantive and cognitive theories based on which the test h as been 30 developed and provid es evidence for the construct and content aspects of validit y (Rupp, Templin, & Henson, 2010). 
It is often considered an analog to the specified factor structure in a confirmatory factor analysis ( Henson, DiBello, & Stout, 2018 ) . The row vectors of the Q - matrix are also referred to as q - vectors. Items with a q - vector with only one non - zero entry are called single - attribute items. Others are multiple - attribute items. An example of Q - matrix is which shows that the test measures three attributes with three items, the first item probes the second attribute, the second item targets the first and the third attributes, and the last item requires all three attributes. In other words, an examinee needs to master the second attribute to succeed on item 1 without guessing or slipping. The specification of the Q - matrix precedes any model fitting and classifying. The Q - matrix is part of the model assumption that can be falsified (e.g., Wang et al., 2018). W hile most theoretical and empirical studies assume that the Q - matrix is corre ctly specified (e.g., Henson et al., 2018) , recent efforts on Q - matrix construction and validation have pointed out the negative effects of incorrectly identified Q - matrices and proposed solutions ( e.g., de la T orre , 2008 ; Liu, Xu, & Ying, 2012 ). 2 .1. 3 . 1 Reduced versus full Q - matrix With hierarchical attributes, researchers have reached a consensus on restricting the attribute prof ile space ( e.g., Templin & Bradshaw, 2014; Tu et al., 2018). However, there has not been a consensus on the Q - matrix. T wo types of Q - matrices are being used : the full (or unrestricted) Q - matri ces (Liu et al., 2016; Templin & Bradshaw, 2014) and the reduced (or restricted) Q - matri ces ( Köhn & Chiu, 2018 ; Leighton et al., 2004; Tu et al., 2018) , which are defined below . 31 Consider a test with three independent attributes. The expanded R - matrix below has seven columns and e ach column represe nts an item type : If we randomly sample from the columns of in equation as the q - vectors , regardless of the attribute hierarchy, the Q - matrix is called a full Q - matrix . With any attribute hierarchy, a full Q - matrix could have all seven types of q - vectors or a random subset of them . In a test of three linear attributes, for instance, although the attribute profile is not allowed, the - vector is possible in the full - Q - matrix approach. Considering that s ome attributes profiles become illegitimate under a certain hie rarchy; particularly , it is im possible to master a n attr ibute without mastering all prerequisite attributes . Therefore, i n another line of research, it is assumed that an item probing a higher - level attribute also requires its prerequisite. This assumption would lead to the rem oval of some q - vectors . For example, under a linear hierarchy ( ) would be unreasonable because the item requires the mastery of the second attribute without requiring its prerequisite. A reduced Q - matrix can only have columns of as q - vectors. A special reduced Q - matrix is the transpose of , denoted as For three linear attributes, for example, and are defined in equation and . T he only difference between and the attribute profile space is the exclusion or inclusion of the vector of all 0s . Therefore, can also be derived using the R - matrix approach . 32 While studies using full Q - matrices tend not to discuss the necessity to make any change in the Q - matrix, researchers using reduced Q - matrices believe that the items should reflect the attribute hierarchy ( Köhn & Chiu, 2018; Tu et al., 2018). 
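The logic of the R-matrix approach, the expanding algorithm, and the reduced Q-matrix can be made concrete in a few lines of R. The sketch below is a simplified illustration written for this section (it is not the Appendix B code); the three-attribute inverted pyramid hierarchy H3.3 and all function and object names are assumptions made for the example.

```r
# Illustrative sketch: from A-matrix to R-matrix, expanded R-matrix, attribute
# profile space, and the q-vectors allowed in a reduced Q-matrix.
bool_prod <- function(U, V) (U %*% V > 0) * 1L      # Boolean matrix product

reach_matrix <- function(A) {                       # R = (A + I)^n (Boolean power)
  R <- (A + diag(nrow(A)) > 0) * 1L
  repeat {
    R_next <- bool_prod(R, R)
    if (identical(R_next, R)) return(R)             # invariance reached
    R <- R_next
  }
}

expand_R <- function(R) {                           # the expanding algorithm
  expanded <- R
  for (k in seq_len(ncol(R))) {
    for (j in seq_len(ncol(expanded))) {
      new_col <- ((R[, k] + expanded[, j]) > 0) * 1L   # Boolean addition of columns
      if (!any(colSums(expanded != new_col) == 0)) {   # keep only genuinely new columns
        expanded <- cbind(expanded, new_col, deparse.level = 0)
      }
    }
  }
  expanded
}

# Inverted pyramid hierarchy H3.3: A1 is the prerequisite of both A2 and A3
A <- matrix(0, 3, 3)
A[1, 2] <- 1
A[1, 3] <- 1

R           <- reach_matrix(A)            # columns = basic proficiency classes
R_expanded  <- expand_R(R)                # adds the column (1, 1, 1)'
alpha_space <- rbind(0, t(R_expanded))    # 5 profiles: 000, 100, 110, 101, 111
reduced_Q_pool <- t(R_expanded)           # q-vectors allowed in a reduced Q-matrix
```

With independent attributes the same code returns all $2^K - 1$ nonzero q-vectors; with a hierarchy it returns only the subset consistent with the prerequisite structure.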
The choice between the full Q-matrix and the restricted one has not been formally addressed in the literature.

2.1.3.2 Complete Q-matrix

A complete Q-matrix is needed to identify all possible attribute profiles (Chiu, Douglas, & Li, 2009; Chiu & Köhn, 2015). With a complete Q-matrix, we have $\mathbf{S}(\boldsymbol{\alpha}) \neq \mathbf{S}(\boldsymbol{\alpha}')$ whenever $\boldsymbol{\alpha} \neq \boldsymbol{\alpha}'$, where $\mathbf{S}(\boldsymbol{\alpha})$ denotes the expected response vector given attribute profile $\boldsymbol{\alpha}$. Completeness of the Q-matrix is evaluated by checking this condition for each pair $(\boldsymbol{\alpha}, \boldsymbol{\alpha}')$ in the attribute profile space. It was proved in Chiu et al. (2009) that a Q-matrix containing the identity matrix (i.e., single-attribute items for every attribute) is complete for the DINA model with independent attributes. Köhn and Chiu (2018) later showed that any Q-matrix that contains the transpose of the R-matrix is complete for the DINA model, given any attribute hierarchy. This rule, however, does not apply to more complicated CDMs such as the ACDM and GDINA (Köhn & Chiu, 2018).

2.1.4 Item response models and calibration methods

The relationship between each attribute profile and the probability of a correct response is expressed in terms of an IRF (de la Torre, 2011; Rupp, Templin, & Henson, 2010). A variety of models with different IRFs for multiple-attribute items have been proposed; most of them are equivalent to one another in the parameterization of a single-attribute item. Some CDMs are general models that subsume most other specific models. The general frameworks include the general diagnostic model (GDM; von Davier, 2005), the log-linear cognitive diagnosis model (LCDM; Henson, Templin, & Willse, 2009), and the generalized DINA model (GDINA; de la Torre, 2011). The rest of this section introduces the GDINA framework and two reduced models derived from it. The following notation is used: $K_j^*$ is the number of attributes required for item $j$ (i.e., $K_j^* = \sum_{k=1}^{K} q_{jk}$); $\boldsymbol{\alpha}_{lj}^*$ is the reduced attribute vector consisting of the columns of the required attributes, where $l = 1, \ldots, 2^{K_j^*}$; and the probability of a correct response on item $j$ by examinees with reduced attribute pattern $\boldsymbol{\alpha}_{lj}^*$ is denoted $P(\boldsymbol{\alpha}_{lj}^*)$. The IRF of the GDINA model (de la Torre, 2011) is given by

$g\!\left[P(\boldsymbol{\alpha}_{lj}^*)\right] = \delta_{j0} + \sum_{k=1}^{K_j^*} \delta_{jk}\alpha_{lk} + \sum_{k'=k+1}^{K_j^*}\sum_{k=1}^{K_j^*-1} \delta_{jkk'}\alpha_{lk}\alpha_{lk'} + \cdots + \delta_{j12 \cdots K_j^*}\prod_{k=1}^{K_j^*}\alpha_{lk},$

where $g[\cdot]$ is $P(\boldsymbol{\alpha}_{lj}^*)$, $\log P(\boldsymbol{\alpha}_{lj}^*)$, and $\operatorname{logit} P(\boldsymbol{\alpha}_{lj}^*)$ in the identity, log, and logit links, respectively; $\delta_{j0}$ is the intercept for item $j$; $\delta_{jk}$ is the main effect due to $\alpha_{lk}$; $\delta_{jkk'}$ is the two-way interaction effect due to $\alpha_{lk}$ and $\alpha_{lk'}$; and $\delta_{j12 \cdots K_j^*}$ is the interaction effect due to $\alpha_{l1}, \ldots, \alpha_{lK_j^*}$. The GDINA model is a saturated model and subsumes several widely used reduced CDMs, including the DINA model (Haertel, 1989; Junker & Sijtsma, 2001; Macready & Dayton, 1977) and the A-CDM (de la Torre, 2011). To obtain the DINA model, all terms of the identity-link GDINA model except $\delta_{j0}$ and $\delta_{j12 \cdots K_j^*}$ are constrained to zero, that is,

$P(\boldsymbol{\alpha}_{lj}^*) = \delta_{j0} + \delta_{j12 \cdots K_j^*}\prod_{k=1}^{K_j^*}\alpha_{lk}.$

The A-CDM is the constrained identity-link GDINA model without the interaction terms. It can be formulated as

$P(\boldsymbol{\alpha}_{lj}^*) = \delta_{j0} + \sum_{k=1}^{K_j^*} \delta_{jk}\alpha_{lk}.$

Current methods for fitting CDMs use either marginal maximum likelihood estimation relying on the expectation-maximization algorithm (MMLE-EM) or Markov chain Monte Carlo (MCMC) techniques (Rupp et al., 2010).

2.1.5 Classification methods

The prime objective of CDM data analysis is to classify examinees into one of the attribute profiles. The estimated attribute profile for examinee $i$, denoted $\hat{\boldsymbol{\alpha}}_i$, takes the value of one of the possible attribute profiles $\boldsymbol{\alpha}_l$, $l = 1, \ldots, L$, where $L$ is the size of the attribute profile space. When dichotomous attributes are involved and assumed to be independent, the attribute profile space consists of $2^K$ latent classes. If an attribute hierarchy exists, the number of attribute profiles decreases, with some attribute profiles becoming impossible.
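The following R sketch illustrates the identity-link DINA parameterization above and previews, for a single examinee, the maximum likelihood and EAP classifications formalized in the next paragraphs. The three-attribute test, the parameter values, and all function names are hypothetical choices made for illustration only; operational analyses would typically rely on a package such as GDINA or CDM.

```r
# DINA correct-response probability for one item: intercept plus the
# interaction term if and only if all required attributes are mastered.
p_dina <- function(alpha, q, delta0, delta_all) {
  delta0 + delta_all * as.numeric(all(alpha[q == 1] == 1))
}

# A hypothetical 3-attribute, 4-item test with a restricted (linear) profile space
alpha_space <- rbind(c(0,0,0), c(1,0,0), c(1,1,0), c(1,1,1))
Q <- rbind(c(1,0,0),
           c(1,1,0),
           c(1,1,1),
           c(1,1,0))

# Item-by-profile matrix of correct-response probabilities (delta0 = .2, delta_all = .7)
P <- t(apply(Q, 1, function(q)
  sapply(seq_len(nrow(alpha_space)), function(l)
    p_dina(alpha_space[l, ], q, delta0 = .2, delta_all = .7))))

# Likelihood of an observed response vector for each profile, and the (restricted) MLE
x   <- c(1, 1, 0, 1)                             # observed responses to the 4 items
lik <- apply(P, 2, function(p) prod(p^x * (1 - p)^(1 - x)))
alpha_hat_mle <- alpha_space[which.max(lik), ]

# EAP-style marginal mastery probabilities under a flat prior over the profiles
post      <- lik / sum(lik)
marginal  <- colSums(post * alpha_space)         # P(mastery) for each attribute
alpha_hat_eap <- as.numeric(marginal >= .5)
```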
Exam ine es are often classified via maximum likelihood e stimation (MLE; de la Torre, 2008), maximum a pos teriori (MAP ; Rupp et al., 2010), or expected a posteriori (EAP ; de la Torre, 2008; Rupp et al., 2010 ) , which are applicable to any CDM that is a special case of a restricted latent class model . Huebner and Wang (2011) conducted a simulation study comparing the accuracy of the three methods under different testing conditions. The likelihood function of the responses given the attribute profile is given by 35 The MLE estimator is the attribute profile for that maximizes the likelihood , and is formally denoted as If prior probabilities denoted as for , are available from previous test administrations, the posterior probability for each can be calculated: The MAP estimator is then denoted as It is generally true that MLE and MAP estimates are equivalent if flat priors are used in MAP estimation (Huebner & Wang, 2011). For the EAP approac h , the probabilities of mastery for each attribute (the marginal skill probabilities) , for , are calculated for an examinee and rounded at .50 to obtain binary mastery classifications. The posterior probabilities are aggregated to ob tain t he marginal probabilities for : where The marginal probability is usually rounded at .50 to obtain a binary classification for attribute ( ) . 36 With hierarchical attributes, researchers have reached a conse nsus on restricting the attribute profile space (e.g., Templin & Bradshaw, 2014; Tu et al., 2018). The MLE estimator maximizes the likelihood function over the set of all possible attribute profiles when the item parameters are assumed to be known, which i s referred to as unrestricted MLE (Tu et al., 2018). When hierarchical attributes are involved, a restricted MLE is recommended in which the probability of some attribute profiles are fixed to zero due to the hierarchy (Templin & Bradshaw, 2014; Tu et al., 2018). The only difference between unrestricted and restricted MLE is in the attribute profile space. Similarly, restricted MAP and EAP estimators should be used for hierarchical attributes. 2 . 1 . 6 Q - matrix design The CDM s provide guidance for test construction . C ognitive theories could have a real impact on testing practice through CDM model assumptions about relationships between attribute as well as the relationship between attributes and item responses . Given a set of a ttributes, i nstead of relying heavily on post hoc item analysis surrounding internal consistency, test development in the CDM context begins with a set of possible item types that are characteriz ed by their q - vectors. For example, a test with three indepen dent attributes can have at most seven different item types. The Q - matrix for a particular test can be obtained by sampling with replacement from the column vectors of the corresponding . The Q - matrix is a core element of the CDM - based test design. Madison and Bradshaw (2015) defined the Q - matrix design as "the deliberate arrangement of a set of test items according to the specific subset of attributes measured by each individual item. " The Q - matrix plays a significant role in the stati stica l identification of the model ( Köhn & Chiu, 2018 ; Xu & Zhang, 2016 ) . However, Q - matrices that lead to identification may provide varying classification accuracy rates . 37 Three studies have been done with the effects of Q - matrix design on classification accu racy with independent attributes. 
Chiu, Douglas, and Li (2009) showed that each attribute needs to be measured by at least one single-attribute item in order to obtain acceptable classification accuracy under both the DINA (Haertel, 1989; Junker & Sijtsma, 2001; Macready & Dayton, 1977) and DINO (Templin & Henson, 2006) models. Similarly, DeCarlo (2011), in his investigation of the DINA model, found that if an attribute is always measured through interaction terms and never measured in isolation, the resulting classification only reflects the prior probabilities. The finding of DeCarlo (2011) was echoed in Madison and Bradshaw (2015), who concluded, based on the log-linear cognitive diagnosis model (LCDM; Henson, Templin, & Willse, 2009), that attributes measured in isolation help increase classification accuracy when the number of times an attribute is measured on a test is held constant. Recent efforts have expanded the research on Q-matrix design to testing situations with hierarchical attributes (Liu & Huggins-Manley, 2016; Liu, Huggins-Manley, & Bradshaw, 2017). In Liu, Huggins-Manley, and Bradshaw (2017), different Q-matrix designs were generated using the so-called independent, adjacent, or reachable approach when the attribute hierarchy was linear, divergent, convergent, or unstructured; the CDM was the hierarchical diagnostic classification model (HDCM; Templin & Bradshaw, 2014). The independent approach allows only single-attribute items. In the adjacent approach, each item measures at most two attributes that have a direct relationship. In the reachable approach, each item can measure any combination of attributes that are directly or indirectly connected. Their simulations found that the adjacent approach leads to higher classification accuracy with a shorter test, and they recommended using the adjacent approach to design the Q-matrix when a hierarchy is present (Liu et al., 2017). Using the adjacent approach of Liu et al. (2017), Liu and Huggins-Manley (2016) found that "higher-level attributes were often associated with higher classification accuracy than lower-level attributes" as a result of the additional information about higher-level attributes contributed by the hierarchical structure.

2.1.7 Criteria for test construction

A research area closely related to Q-matrix design is the development of item and test indices. When estimated item parameters are available for a pool of items, an index based on those parameters can be calculated to identify good items, that is, items that achieve high classification rates with a minimal number of items (Henson, DiBello, & Stout, 2018). This type of index is referred to as item discrimination in Henson et al. (2018). The Fisher information is an example of such an index in the IRT context. For CDMs, a counterpart of the Fisher information is the Kullback-Leibler information (KLI; also called KL divergence or KL distance). Much of the work on item-level and test-level indices for CDMs has been based on the KLI.

2.1.7.1 Kullback-Leibler information

The KLI measures how far a distribution is from the actual distribution (Gray, 2011; Chang & Ying, 1996; Xu, Chang, & Douglas, 2003). Given a probability measure $P$ on a finite space $\Omega$ and another measure $Q$ on the same space, the KL information of $P$ with respect to $Q$ (Gray, 2011) is defined as

$D(P \,\|\, Q) = \sum_{\omega \in \Omega} P(\omega)\,\log\frac{P(\omega)}{Q(\omega)},$

which ranges from 0 to $\infty$.
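Applied to a dichotomous item, this definition yields the item-level KL information between pairs of attribute profiles that underlies the indices reviewed next. The R sketch below is an illustrative computation; the DINA-type probabilities and all object names are hypothetical.

```r
# KL information of an item for distinguishing profile a from profile b:
# D_j(a, b) = P_j(a) log[P_j(a)/P_j(b)] + (1 - P_j(a)) log[(1 - P_j(a))/(1 - P_j(b))]
item_kl <- function(p_a, p_b) {
  p_a * log(p_a / p_b) + (1 - p_a) * log((1 - p_a) / (1 - p_b))
}

# Correct-response probabilities of one hypothetical DINA item (q-vector 110)
# for the profiles of a linear 3-attribute hierarchy: 000, 100, 110, 111
p_item <- c(.2, .2, .9, .9)

L   <- length(p_item)
D_j <- outer(seq_len(L), seq_len(L),
             Vectorize(function(a, b) item_kl(p_item[a], p_item[b])))
round(D_j, 2)   # rows: true profile a; columns: competing profile b
                # diagonal entries are zero: no information for identical profiles
```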
The Fisher information can be used in test construction because the test information is the sum of the item information, and the variability of the maximum likelihood estimate decreases as the information increases. Test construction criteria for CDMs should have similar properties (Henson & Douglas, 2005). The KL information of item $j$ for differentiating $\boldsymbol{\alpha}_a$ from $\boldsymbol{\alpha}_b$ is defined as

$D_j(\boldsymbol{\alpha}_a, \boldsymbol{\alpha}_b) = P_j(\boldsymbol{\alpha}_a)\log\frac{P_j(\boldsymbol{\alpha}_a)}{P_j(\boldsymbol{\alpha}_b)} + \left[1 - P_j(\boldsymbol{\alpha}_a)\right]\log\frac{1 - P_j(\boldsymbol{\alpha}_a)}{1 - P_j(\boldsymbol{\alpha}_b)}.$

Note that $D_j(\boldsymbol{\alpha}_a, \boldsymbol{\alpha}_b) \ge 0$, and $D_j(\boldsymbol{\alpha}_a, \boldsymbol{\alpha}_b) = 0$ when $\boldsymbol{\alpha}_a = \boldsymbol{\alpha}_b$. An item is most useful in determining the difference between two attribute profiles $\boldsymbol{\alpha}_a$ and $\boldsymbol{\alpha}_b$ if $D_j(\boldsymbol{\alpha}_a, \boldsymbol{\alpha}_b)$ and $D_j(\boldsymbol{\alpha}_b, \boldsymbol{\alpha}_a)$ are both large. All the $D_j$ values for item $j$ can be recorded in a matrix with $L$ rows and $L$ columns, where $L$ is the size of the attribute profile space. The KL information for a test is defined as

$D(\boldsymbol{\alpha}_a, \boldsymbol{\alpha}_b) = E_{\boldsymbol{\alpha}_a}\!\left[\log\frac{P(\mathbf{X} \mid \boldsymbol{\alpha}_a)}{P(\mathbf{X} \mid \boldsymbol{\alpha}_b)}\right],$

where $\mathbf{X}$ represents the response pattern for the $J$ items. The test-level KL information thus compares the probability distribution of the item response vector $\mathbf{X}$ given $\boldsymbol{\alpha}_a$ with the probability distribution of $\mathbf{X}$ given an alternative attribute pattern $\boldsymbol{\alpha}_b$. Because of the assumption of local independence among items conditional on $\boldsymbol{\alpha}$, it can be shown that the test information is the sum of the KL information across all items in the exam. The test KL information for all pairs $(\boldsymbol{\alpha}_a, \boldsymbol{\alpha}_b)$ in the attribute profile space forms an $L \times L$ matrix, where $L$ is the size of the space; the matrix contains $L(L-1)$ distinct off-diagonal comparisons, because the KL information is not symmetric, and its diagonal elements are zero. The KL information provides a general method that applies to all CDMs (Henson & Douglas, 2005), and researchers have built on it to propose attribute-, item-, and test-level indices for test construction.

2.1.7.2 Cognitive diagnostic index (Henson & Douglas, 2005)

The cognitive diagnostic index (CDI) for an item was proposed as a weighted average of the off-diagonal elements of the item's KL information matrix, since the matrix expands exponentially with the number of attributes and is difficult to evaluate element by element (Henson & Douglas, 2005). With weights proportional to the reciprocal of the Hamming distance $h(\boldsymbol{\alpha}_a, \boldsymbol{\alpha}_b)$ between profiles, the CDI for item $j$ can be written as

$CDI_j = \frac{\sum_{a \neq b} h(\boldsymbol{\alpha}_a, \boldsymbol{\alpha}_b)^{-1}\, d_{j(a,b)}}{\sum_{a \neq b} h(\boldsymbol{\alpha}_a, \boldsymbol{\alpha}_b)^{-1}},$

where $d_{j(a,b)}$ stands for the $(a, b)$ element of the item KL matrix. The CDI for a test is defined analogously, with the $(a, b)$ element of the test KL matrix in place of $d_{j(a,b)}$, and it can be shown that the CDI for a test is the sum of $CDI_j$ over all the items in the test. Henson and Douglas (2005) showed that the CDI is strongly related to the average correct classification rate across attributes and examinees for a test, and they suggested using the index to select items when assembling a diagnostic test. Other indices based on the KL information include the attribute discrimination index (ADI), which is intended to relate to the correct classification rate of the masters of the $k$th attribute (Henson, Roussos, Douglas, & He, 2008), and the modified CDI and modified ADI (Kuo, Pai, & de la Torre, 2016). Note that all the indices mentioned above are overall indices that are not conditional on $\boldsymbol{\alpha}$.

2.1.7.3 A unified item and test discrimination approach (Henson, DiBello, & Stout, 2018)

Henson et al. (2018) proposed a probability-based, attribute-specific index for items with multiple options. For dichotomous items, the index reduces to a maximization, taken over all attribute patterns $\boldsymbol{\alpha}$, of a contrast between $P_j(\boldsymbol{\alpha})$ and $P_j(\tilde{\boldsymbol{\alpha}})$, where $\tilde{\boldsymbol{\alpha}}$ denotes an attribute pattern that differs from $\boldsymbol{\alpha}$ only on the $k$th attribute. The index describes the discrimination power of item $j$ in measuring attribute $k$ and takes values between 0 and 1.

2.2 Nonparametric classification based on CDM conception

An alternative, for situations in which calibrating a parametric CDM is not practical or even possible, is the nonparametric approach to classification.
The nonparametric approach shares with the conventional CDM approach the conceptions of a Q - matrix, a set of attributes, and different attribute interaction effects on correct responses . The test is still constructed based on a CDM , but a probabilistic model is not used to characterize the correct response probabilities of different attribute profiles . Instead, the examinees are classified into dif ferent attribute profiles using a nonparametric method . Barnes (2010) developed a nonparametric exploratory approach to build the Q - matrix and classify examinees . Some researchers employ cluster analysis for nonparametric classifications ( Ayers, Nugent, & Dean, 2008; Chiu, Douglas, & Li, 2009; Willse, Henson, & Templin, 2007 ). Another stream of research is based on the idea of minimizing the distance between observed item response patterns and the ideal response patterns according to the Q - matrix (Chiu & D ouglas, 2013 ; Chiu, Sun, & Bian, 2018 ; Wang & Douglas, 2015 ) . The rest of the section reviews the third type of nonparametric methods that minimize distance measure s. 2 .2.1 The nonparametric (NPC) method Chiu and Douglas (2013) proposed a simple method to classify examinees by matching observed item response patterns to the nearest ideal response pattern, henceforth referred to as the nonparametric (NPC) method. The ideal response of examinee on item is denoted as , a nd the vector containing ideal responses of examinee on all the items in a test is denoted as . 42 The ideal response patterns are derived from the Q - matrix and the assumption on attribute interactions. Consider a q - vector and four possible attr ibute profiles . If we assume a conjunctive model underlying the responses, the ideal responses for the four attribute profiles would be respectively . For a test with more than one item, each possible attribute profile is associated with an ideal response pattern. The ob served response pattern of an examinee is compared with the ideal response patterns. The attribute profile of the closest ideal response pattern is the estimate for the examinee. Three distance measures were proposed by Chiu and Douglas (2013) . Denote the observe response pattern as . T he hamming distance between and is given by where J stands for the test length. A weighted Hamming distance is defined as where denotes the proportion correct on the th item . They also proposed the penalized Hamming distance for the special cases where the slipping parameter is much less than the guessing parameter or vice versa (Chiu & Douglas, 2013). Chiu and Douglas (2013) f ound that accurate classification can be achieved when the true model is DINA and NIDA with slip and guess parameters considerably high er than 0. The estimator of would be perfect without any slip ping or guess ing but still performs with good relative efficiency even when this is not the case (Chiu & Douglas, 2013) . A formal justification 43 for the NPC methods was provided in Wang and Douglas (2015) , show ing that the nonparametric method yields consistent classificat ions under a variety of underlying conjunctive models. 2 .2.2 The general nonparametric classification (GNPC) method The general nonparametric classification (GNPC) method ( Chiu, Sun, & Bian , 2018 ) was proposed as an extension of t he NPC methods ( Chiu & Do uglas , 2013) . T he example in 3.2.1 is revisited t o illustrate the need for this extension . T he ideal responses for the four attribute profiles are respectively , assuming an underlying conjunctive model. 
The ideal responses would instead become $(0, 1, 1, 1)$ if the underlying model is a disjunctive one. In the NPC method, either the conjunctive ideal response patterns (denoted $\boldsymbol{\eta}^{(c)}$) or the disjunctive ideal response patterns (denoted $\boldsymbol{\eta}^{(d)}$) are used, according to the assumptions about the cognitive process. However, using $\boldsymbol{\eta}^{(c)}$ or $\boldsymbol{\eta}^{(d)}$ may not be adequate if the underlying CDM is a complex one, such as a saturated GDINA model. Consider a set of GDINA parameters for this item under which the four possible attribute profiles have four distinct correct-response probabilities. Obviously, neither the ideal response pattern (0, 0, 0, 1) nor (0, 1, 1, 1) would be appropriate, and before any analysis of the response data we cannot decide which of the two ideal response patterns is more suitable. Therefore, the GNPC method defines the weighted ideal response on item $j$ for the $l$th attribute profile in the attribute profile space as

$\eta^{(w)}_{lj} = w_{lj}\,\eta^{(c)}_{lj} + (1 - w_{lj})\,\eta^{(d)}_{lj},$

in which $w_{lj}$ is a weight calculated from the data in an iterative procedure. Conceptually, the weight is found by minimizing the total distance between the observed responses and the weighted ideal responses. Denote the attribute profiles as $\boldsymbol{\alpha}_l$ for $l = 1, \ldots, L$, and let $C_l$ be the set of examinees classified into $\boldsymbol{\alpha}_l$. The total distance for item $j$ and profile $l$ can be written as

$d_{lj} = \sum_{i \in C_l} \left(x_{ij} - \eta^{(w)}_{lj}\right)^2.$

The weight $\hat{w}_{lj}$ is obtained by minimizing $d_{lj}$; at the minimum, the weighted ideal response equals the average observed response to item $j$ among the $N_l$ examinees classified into attribute profile $\boldsymbol{\alpha}_l$, where $N_l$ is the number of such examinees. The $\hat{w}_{lj}$ can be computed via the iterative procedure described in Chiu et al. (2018), with the NPC method providing the initial classifications from which the initial weights are calculated. The NPC (Chiu & Douglas, 2013; Wang & Douglas, 2015) and GNPC (Chiu et al., 2018) methods do not have the limitations of conventional CDMs regarding the number of attributes, the sample size, or the test length, which makes them a practical option for small-scale classroom assessments.

2.3 CD-CAT

2.3.1 From IRT-based CAT to CD-CAT

Computerized adaptive testing (CAT), built on the idea of tailored testing, can adapt both the items administered and the test length to an individual examinee. The maximum information criterion is usually adopted in IRT-based CAT, which gains efficiency, compared with linear testing, in terms of a shorter test length, higher measurement precision, or both. There have been many operational CAT programs since the 1980s and a rich literature over the past decades (Reckase, 2010). CAT algorithms based on CDMs (denoted CD-CAT) have been developed with the same motivation as IRT-based CAT, that is, to increase testing efficiency (Cheng, 2009; McGlohen & Chang, 2008; Xu, Chang, & Douglas, 2003). When cognitive diagnosis is combined with CAT, testing can move toward a new stage of individualized diagnosis and learning. As technology becomes more available in the classroom, CD-CAT can play a more important role in learning and teaching. Chang (2015) reported on an experimental CD-CAT program implemented in Zhengzhou, observing that "CD-CAT encourages critical thinking, making students more independent in problem solving, and offers easy to follow individualized remedy, making learning more interesting" (p. 15). Similar to CATs based on other measurement models, a CD-CAT algorithm consists of a measurement model (e.g., the DINA model), a method for selecting the first item(s) to administer, a scoring method, a rule for selecting the next item conditional on the examinee's responses to the previous item(s), and a termination rule to end the test. An item pool of calibrated items is needed for the implementation of the CAT algorithm.

2.3.2 Item selection methods for CD-CAT

Item selection is a core element of CAT algorithms.
T hree item selection indices based on the KL information are reviewed in this section because they will be used in the simulation study . There are i tem selection methods based on other criteria such as the Shannon entropy ( Wang, 2013; Xu et al., 2003) and mutual information ( Huebner, Finkelman, & Weissman, 2018 ) . The following notations are used for the CD - CAT context : denotes the attribute profile estimate for examinee after items have been administered; denotes the observed response pattern for examinee when items have been administe red; denotes the size of the attribute profile space; ( ) denotes the th attribute profile in the attribute profile space; 46 denotes the available items in the item pool when items have been administered; and denotes the response of examinee to item from . The KL algorithm . Xu , Chang, and Douglas (2003) proposed using the straight sum of the KL distances bet ween and all the for . Note that when there are independent attributes . The KL index is defined as w here Then the th item for the th examinee is the item in that maximizes . The KL index is referred to as the global discrimi nation index (GDI) in Xu et al. (2003). This item selection meth od is referred to as the KL algorithm in Cheng (2009). The KL algorithm selects items that are the most powerful in distinguishing the current attribute profile estimate from all other possible attribute profiles on average (Cheng, 2009). Cheng (2010) points out that the KL algorithm does not consider attribute coverage. Another drawback is that this algorithm may not be effective at the early stage with inaccurate . T he posterior - weighted KL (PWKL) index . The PWKL index weights the KL index by the posterior distribution ( Cheng, 2009 ) . If informative priors are available for each attribute profile, posterior distributions can be obtained at each step : Denote by for simplicity in notation. The PWKL index is defined as 47 Assuming local independence, the likelihood function can be written as w here is the IRF defined by a CDM . Then the th item for the th examinee is the item in that maximizes . If the prior is discrete uniform, the PWKL index is reduced to the likelihood - weighted KL (LWKL) index : T he modified posterior - weighted Kullback - Leibler (MPWKL) ind ex . The KL and PWKL index use the current estimate with an implicit assumption that the point estimate is a good summary of the current information. However, the point estimate may be inaccurate especially at the early stage s of a test. To solve this problem, Kaplan , de la Torre, and Barrada (2015) used the entire posterior distribution instead of a point estimate. The MPWKL index is given as 2 . 3 . 3 Item pool design The potential benefits of CAT cannot be realized without a well - constructed item pool (Reckase, 2010). There are some studies on item pool design for CAT based on IRT models (e.g., Reckase, 2010; Thissen, Reeve, Bjorner, & Chang, 2007) , and more research is needed in this area . Considering the d ifference between items based on IRT and CD M, the findings from IRT - based 48 CAT cannot be directly applied to CD - CAT. However, t he item pool design for CD - CAT has not been addressed in the literature despite its importance. Simulation findings on item usage in CD - CAT might inform the item pool design process (Kaplan et al., 2015) . 
For example, a CD-CAT based on the DINA model tends to select items whose q-vector matches the examinee's true attribute profile and single-attribute items measuring attributes the examinee has not mastered, which implies that it is important to include sufficient single-attribute items in the item pool. Since there is no published research on item pool design for CD-CAT, studies on IRT-based CAT are reviewed below. There is a body of literature on selecting operational pools from a larger master pool of items (Swanson & Stocking, 1998; van der Linden, Ariel, & Veldkamp, 2006; Way, Steffen, & Anderson, 1998). The problem these studies address is related to item pool design but is more appropriately described as item pool assembly (van der Linden et al., 2006). van der Linden et al. (2006) argue that an item pool design problem occurs before actual items are available and that its output is a blueprint for an item pool, that is, a specification of the distribution of the numbers of items over the space of all possible combinations of statistical and nonstatistical item attributes (e.g., the item difficulty parameter and word count). The goal of item pool design is to guide the item writing and pool maintenance process (Reckase, 2010; Veldkamp & van der Linden, 2000).

Item pool design studies for IRT-based CAT focus on different aspects of an item pool. Veldkamp and van der Linden (2000) propose a method for item pool design that minimizes item-writing costs subject to test constraints. The test constraints are represented in a classification table that contains all possible combinations of item attributes, such as word counts, difficulty parameters, and discrimination indices (Veldkamp & van der Linden, 2000). Quantitative attributes are transformed into categorical variables represented by intervals of, for example, the difficulty parameter. The goal of the item pool design process is then to find the number of items needed in each cell of the classification table. The number of items in each cell of a previous item pool, however, is needed to define the item-writing cost as the inverse of that number, based on the idea that items written more frequently tend to be less costly. Another stream of research, based on the bin-and-union method (Reckase, 2010), explores item pool design without using information from an existing item pool as a starting point (He & Reckase, 2014; Mao, 2014). This family of research focuses on the psychometric performance of item pools rather than on item-writing costs. Reckase (2010) argues that an optimal item pool should provide the desired item for every item selection. An optimal item pool for a CAT procedure based on the 1PL model, for example, contains an item whose b-parameter exactly matches the provisional ability estimate at every step of the test; the size of such an item pool is $2^J - 1$, where $J$ is the test length, which is too large to be practical. If the latent scale is divided into bins and the items with b-parameters within a bin are treated as equivalent, the item pool size decreases to a reasonable level. This is similar to the categorization of the difficulty parameter in Veldkamp and van der Linden (2000). The item pool design methods of Veldkamp and van der Linden (2000) and Reckase (2010; see also He & Reckase, 2014; Mao, 2014) are based on different definitions of an optimal item pool, but a common feature they share is the use of computer simulation.
The simulations in Veldkamp and van der Linden (2000) are carried out using integer programming and the shadow test approach ( van der Linden, 2005a, 2005b; va n der Linden & Diao, 2014; van der Linden & Reese, 1998 ) and 50 sampling examinees from a hypothetical examinee distribution . The goal is to record the counts of the number of times items from each cell in the classification table are used , and t he final blue print is calculated from these counts (Veldkamp & van der Linden, 2000) . The bin - and - union method (Reckase, 2010) takes a more direct approach by simulating an operational CAT and sampling from an examinee population. 51 Chapter 3 CDM parameterization and Q - matrix with hierarchical attributes 3.1 Introduction The CDMs with a restricted attribute profile space due to the attribute hierarchy is henceforth referred to as hierarchical CDMs . This section addresses parameterizations and the Q - matrix of hierarchical CDMs. P arameterizations for hierarchical CDMs have not been formally discussed except for the HDCM (Liu et al., 2017; Templin & Bradshaw, 2014) and the DINA model. When it comes to the Q - matrix, t wo types of Q - matrices are being use d by two groups of researchers, respectively: the full (or unrestricted) Q - matri ces (Liu et al., 2016; Templin & Bradshaw, 2014) and the reduced (or restricted) Q - matri ces ( Köhn & Chiu, 2018; Leighton et al., 2004; Tu et al., 2018). The choice between the full Q - matrix and the restricted one has not been formally addressed. Therefore, the first set of research questions is about the parametrization of hierarchical CDMs and the difference between reduced and full Q - matrix . These questions are important because the test constructions and item pool designs all depend on correctly - defined CDMs and Q - matrices. In this thesis, it is assumed that the hierarchical relationship and the Q - matrix ha ve been established and validated, and we focus on test construction or item pool design for different types of attribute hierarchies. 3 .2 Attribute hierarchies Before discussing parameterizations and Q - matrices, we define the attribute hierarchies studied in this thesis. The formative assessment is designed for a period of two to four weeks. Therefore, we consider situations with three, four, or five attributes in this study . The subsets of attribute hierarchies chosen for 3 - attribute, 4 - attribute, or 5 - att ribute conditions, respectively, are 52 listed in Table 1 and illustrated in Figure 6 - Figure 8 . Most of the selected attribute hierarchies can be found in the textbook analysis , as well as previous empirical and simulation studies. Table 1 : Subsets of attribute hierarchies for 3 - attribute, 4 - attribute, or 5 - attribute conditions ID Number of attributes Size of attribute profile space Attribute hierarchy H3.1 3 8 Independent H3.2 3 4 Linear H3.3 3 5 Inverted pyramid H3.4 3 5 Pyramid H4.1 4 16 Independent H4.2 4 5 Linear H4.3 4 8 Linear + single H4.4 4 6 Inverted pyramid H4.5 4 6 Pyramid H5.1 5 32 Independent H5.2 5 6 Linear H5.3 5 10 Inverted pyramid I H5.4 5 11 Inverted pyramid II H5.5 5 10 Pyramid I H5.6 5 11 Pyramid II Figure 6 : A subset of attribute hierarchies with 3 attributes 53 Figure 7 : A subset of attribute hierarchies with 4 attributes Figure 8 : A subset of attribute hierarchies with 5 attributes 54 3. 
3 Parameterizations of hierarchical CDMs We discuss the parameterizations for the DINA ( Junker & Sijtsma, 2001), ACDM ( de la Torre, 2011), and GDINA model with the identity link function ( de la Torre, 2011) when the attributes are hierarchical . An item requiring attributes can classify students into at most classes. A hierarch ical relationship among attributes leads to fewer than classes. A saturated model for an item requiring ind ependent attributes can have item parameters including an intercept, main effect terms, and interaction terms. The number of item parameters can n ot exceed the number of classes. The parameterizations for DINA and ACDM do not change with hie rarchical attributes. The DINA model has two parameters for each item disregarding the q - vector of the item: an intercept and an interaction term (or a guessing parameter and a slipping parameter in an alternative parameterization). Under the A - CDM, an item requiring independent attributes has item parameters (i.e., one intercept and main effect terms). For GDINA, some item parameters ( i.e. , the main effects of nested attributes and some interaction terms) need to be fixed at zero , which is parallel to the parameterizations of the Hierarchical Diagnostic Classification Model (HDCM ; Templin & Bradshaw, 2014 ). Before demonstrat ing the parameterizations of hierarchical models, w e present the parameterizations of three models D INA, ACDM, and GDINA for a single - attribute item and a two - independent - attribute item . The three models are equivalent regarding a single - attribute item but have different parameterizations for an item requiring two independent attributes , which are shown in Table 2 in the form of expected response . 55 Table 2 : Expected responses on two items with two independent attributes Any model DINA A CDM GDINA (00) (10) (01) (11) Note : I tem involves two independent attributes and ; all models the identity link; DIN A = gate; ACDM = additive cognitive diagnosis modeling; GDINA = generalized DINA ; = intercept; = main effect of the k th attribute ( ); = two - way interaction. Su ppose is the prerequisite of (i.e., ) . The item , under each model (DINA, A - CAM, or GDINA) , classifies examinees into two groups: those who master both and its prerequisite and those who have not mastered . The parameterizations of the three hierarchical models are in Table 3 . Under the DINA model, the item has the same parameterizations as . For the parameterizations of the item unde r GDINA , the main effect of the higher - level attribute ( i.e ., ) needs to be fixed at zero . Both ACDM and GDINA have three item parameters. ACDM has an intercept and two main effects. GDINA has an intercept, a main effect, and an interaction effect. Alth ough parameterized differently , the two models become mathematically equivalent for an item measuring two linear attributes. 56 Table 3 : Expected responses on two items with two linear attributes ( ) Any model DINA A CDM GDINA (00) (10) (11) Note : I tem involves two attributes and under a linear hierarchy; all models the identity link; DIN gate; ACDM = additive cognitive diagnosis modeling; GDINA = generalized DINA ; = intercept; = main effect of the k th attribute ( ); = two - way interaction. Next, we consider a situation involving three attributes with one attribute being the prerequisite of the other two as in an inverted pyramid hierarchy (H3.3) . Table 4 presents the parameterizations of three models for . 
For this item, the three models have different parameterizations. The difference between ACDM and GDINA lies in the interaction effect between and . Table 4 : Expected responses on under an inverted pyramid hi erarchy (H3.3) DINA ACDM GDINA (000) (100) (110) (101) (111) Note : The inverted pyramid hierarchy defines , . and do not share a common path . 57 W e then consider a situation involving three attributes with two attributes being the prerequisite of the third one as in a pyramid hierarchy (H3.4) . Table 5 presents the parameterizations of three models for . For this item, the three models have different parameterizations. The difference betwe en ACDM and GDINA lies in the interaction effect between and . Table 5 : Expected responses on under a pyramid hierarchy (H3.4) DINA A CDM GDINA (000) (100) (010) (110) (111) Note : The pyramid hierarchy defines , . and do not share a common path. 3 . 4 Q - matrix of hierarchical CDMs 3.4.1 Reduced or full Q - matrix In previous studies, either a reduced Q - matrix or a full Q - matrix is used. With hierarchical attributes , the argument is around whether it is possible for an ite m to measure a higher - level attribute witho ut measuring its prerequisite (s) . A full Q - matrix allows all types of q - vectors as in an independent - attribute situation. A reduced Q - matrix requires that items that measu r e a higher - level attribute also require a ll its prerequisite(s). In other words, a reduced Q - matrix can only 58 contain q - vectors in (the transpose of the expanded R - matrix ) . We will demonstrate that the reduced Q - matrix approach is equivalent to the full Q - matrix approach under the DINA model . It can be shown that , under the DINA model, a multiple - attribute item is equivalent to the single - attribute item , in which takes the value 1 or 0 if the previous attributes are the direct or indirect prerequisites of the th attribute , or takes the value 0 if the th attribute is not connected wi th the th attribute in any path . The multiple - attribute item and the single - attribute item are equivalent because they classify attribute profiles into the same two groups (i.e., s mastering the th attribute or not) , and they have the same expected response for each group as shown in Table 6 . Therefore, u nder the DINA model with a linear hierarchy, the reduced Q - matrix is equivalent to an identity matrix consisting of single - attribute q - vectors. Table 7 presents the equivalent q - vectors for each row of in the case of three linear attributes . Under the DINA model and any attribute hierarchy, each q - vector in represents a unique type of items ( Table 7 - Table 10 ). Other q - vectors can find their equiva lent one in . Consequently, there would be no difference between the reduced Q - matrix approach and the full Q - matrix approach under the DINA model . However, it is noteworthy that there are less than distinctive q - vectors with hierarchical attribute s. Note that all the single - attribute items are included in under the DINA model. Under the ACDM or GDINA, however, each q - vector is distinctive, and consequently does not include all the single - attribute items. We use H3.2 under the ACDM to dem onstrate this in Table 11 . If the reduced Q - matrix approach is used with ACDM or GDINA, it means that some single - attribute q - vectors will be excluded from the Q - matrix. 
59 Table 6 : The expected responses of two groups of attribute profiles on and under the DINA model Note : stands for the th attribute ; takes the value 1 or 0 if the previous attributes are the direct or indirect prerequisites of the th attribute, or takes the value 0 if the th attribute is not connected with the th attribute in any path ; = intercept ; = interaction. Table 7 : The q - vectors in and their equivalent q - vectors under the DINA model with three linear attributes (H3.2) Equivalent s Attribute Profiles (1 0 0) (0 0 0) (1 0 0) (1 1 0) (1 1 1) (1 1 0) (0 1 0) (0 0 0) (1 0 0) (1 1 0) (1 1 1) (1 1 1) (0 0 1) (1 0 1) (0 1 1) (0 0 0) (1 0 0) (1 1 0) (1 1 1) Note : Single - attribute items are bolded ; = intercept; = interaction. 60 Table 8 : The q - vectors in and their equivalent q - vectors under the DINA model with three inverted pyramid attributes (H3.3) Equivalent s Attribute Profiles (1 0 0) (0 0 0) (1 0 0) (1 1 0) (1 0 1) (1 1 1) (1 1 0) (0 1 0) (0 0 0) (1 0 0) (1 1 0) (1 0 1) (1 1 1) (1 0 1) (0 0 1) (0 0 0) (1 0 0) (1 1 0) (1 0 1) (1 1 1) (1 1 1) * (0 1 1) (0 0 0) (1 0 0) (1 1 0) (1 0 1) (1 1 1) Note : Single - attribute items are bolded ; = intercept; = interaction ; * = q - vector that is not in the R - matrix. Table 9 : The q - vectors in and their equivalent q - vectors under the DINA model with three pyramid attributes (H3.4) Equivalent s Attribute Profiles (1 0 0) (0 0 0) (0 1 0) (1 0 0) (1 1 0) (1 1 1) (0 1 0) (0 0 0) (1 0 0) (0 1 0) (1 1 0) (1 1 1) (1 1 0) * (0 0 0) (1 0 0) (0 1 0) (1 1 0) (1 1 1) (1 1 1) (0 0 1) (1 0 1) (0 1 1) (0 0 0) (1 0 0) (0 1 0) (1 1 0) (1 1 1) Note : Single - attribute items are bolded ; = intercept; = interaction ; * = q - vector that is not in the R - matrix. 61 Table 10 : The q - vectors in and their equivalent q - vectors under the DINA model with four or five attributes Hierarchy Equivalent s Hierarchy Equivalent s H4.2 (1000) H5.4 (10000) (110 0 ) (0100) (11000) (01000) (111 0 ) ( qq 1 0 ) , e.g., (0010) (10100) (00100) (1111) (qqq1), e.g., (0001) (11100) (01100) H4.3 (1000) (11010) (qq010), e.g., (00010) (0001) (11001) (qq001), e.g., (000 0 1) (110 0 ) (0100) (11110) (qq110) (1001) (11101) (qq101) (1 1 1 0 ) ( qq 10) , e.g., (0010) (11011) (qq011) (11 0 1) (0101) (11111) (qq111) (1111) (0111) (1011) (0011) H5.5 (10000) H4.4 (1000) (01000) (110 0 ) (0100) (00100) (1 1 1 0 ) ( qq 10) , e.g., (0010) (11000) (11 0 1) (qq01), e.g., (0001) (10100) (1111) (qq11) (01100) H4.5 (1000) (11100) (0100) (11110) (qqq10), e.g., (00010) (1100) (11111) (qqqq1), e.g., (000 0 1) (1 1 1 0 ) ( qq 10) , e.g., (0010) H5.6 (10000) (1111) (qqq1), e.g., (0001) (01000) H5.2 (10000) (00010) (11000) (01000) (11000) (11100) (qq100), e.g., (00100) (10010) (11110) (qqq10), e.g., (00010) (01010) (11111) (qqqq1) e.g., (00001) (11100) (qq100), e.g., (00100) H5.3 (10000) (11010) (11000) (01000) (11110) (qq110) (11100) (qq100), e.g., (00100) (11111) (qqqq1), e.g., (00001) (11010) (qq010), e.g., (00010) (11001) (qq001), e.g., (00001) (11110) (qq110) (11101) (qq101) (11011) (qq011) (11111) (qq111) Note : q takes the value of 0 or 1. Single - attribute items are bolded. 
62 Table 11 : The q - vectors in and their equivalent q - vectors under the ACDM with three linear attributes (H3.2) Other Attribute Profiles (100) (000) (100) (110) (111) (110) (000) (100) (110) (111) (010) (000) (100) (110) (111) (00 1 ) (000) (100) (110) (111) (101) (000) (100) (110) (111) (011) (000) (100) (110) (111) (111) (000) (100) (110) (111) Note : Single - attribute items are bolded; = intercept; = main effect of attribute . If the reduced Q - matrix approach is taken, there are only three q - vectors under ACDM. However, if the model selection is made at the item level and a n item pool of mixed models can be constructed (Ma et al . , 2015), items calibrated with the DINA model can be included in this item pool. For the linear hierarchy H3.2, for example, the mixed item pool has five distinct item types in Table 12 . If the full Q - matrix approach is taken instead, the mixed item pool can have two more item types: and calibrated by the ACDM. Table 12 : Distinct q - vectors in a mixed item pool under DINA and ACDM for H3.2 using the reduced Q - matrix approach Model Attribute Profiles (100) - (000) (100) (110) (111) (110) ACDM (000) (100) (110) (111) (110) DINA (000) (100) (110) (111) (111) DINA (000) (100) (110) (111) (111) ACDM (000) (100) (110) (111) Note : Single - attribute items are bolded; = intercept; = mai n effect of attribute . 63 3.4.2 Complete Q - matrix for hierarchical attributes A Q - matrix containing the identity matrix is complete for the DINA model with independent attributes , according to Chiu et al. (2009). Since the completeness of a Q - matrix is evaluated by checking whether it holds that for each pair of in the attribute profile space, the completeness will not change if some s are excluded from the attribute profile space. Since there is only one way to define single - attribute items under different models , it is safe to conclude that the identity matrix is complete for any attribute hierarchy under any model. Under the DINA model, is complete since equals to or contains the identity matrix ; another type of complete matrix is the transpose of the R - matrix that equals to the identi ty matrix , consistent with the conclusion of Köhn and Chiu (2018 ) . The expected response vectors given are presented in Table 13 . Table 13 : Expected response vectors given of two Q - matrices ( and ) for the i nve rted pyr amid ( H3.3) under the DINA model (000) (100) (110) (101) (111) Note : Single - attribute items are bolded; = intercept; = main effect of attribute . 64 Under ACDM, one type of items alone would be sufficient for completeness by definition as long as the three main effects ( , , and ) are different from each other ( Table 14 ). Without assuming the differences between , , and , an inspection of Table 14 shows that of each attribute hierarchy is a complete Q - matrix disregarding the attribute hierarchy. Table 14 : Expected response vectors given of five q - vectors for independent attributes under ACDM (000) (100) (010) (001) (110) (101) (011) (111) Note : , , and form the for the linear hierarchy (H3. 2 ); , , , and form the for the inverted pyramid hierarchy (H3.3) ; , , , and form the for the pyramid hierarchy (H3.4). 3.5 Summary In discussing the parameterizations of hierarchical CDMs, we identif ied equivalent models when an attribute hierarchy is present . The three models in the GDINA family parameterize single - attribute items in the same way regardless of the attribute hierarchy . 
The hierarchical ACDM and hierarchical GDINA model are equivalent to each other but different from the hierarchical DINA model when two linearly related attributes are involved in an item. The hierarchical ACDM and GDINA model have different parameterizations when the two attributes involved in an item are independent, where independence means that the two attributes are not on the same path in the tree graph. Under the hierarchical DINA model, the q-vectors in the reduced Q-matrix represent distinct item types. Since the number of distinct q-vectors is smaller with hierarchical attributes than with independent attributes, a full Q-matrix may contain two seemingly different q-vectors that are in fact equivalent. By equivalence, we mean that the items have the same parameterizations and would thus lead to the same classifications of examinees given the same item parameters. For example, under the hierarchical DINA model a multiple-attribute q-vector is equivalent to the single-attribute q-vector for its highest-level attribute when the other required attributes are prerequisites of that attribute. As a result, the choice between the reduced and the full Q-matrix approaches does not make a difference under the hierarchical DINA model.

Under the ACDM or GDINA model, any combination of attributes is a distinct q-vector, so in theory every non-zero q-vector defines a different item type. A reduced Q-matrix under the hierarchical ACDM or GDINA model inevitably excludes the single-attribute items for the higher-level attributes. For example, a reduced Q-matrix in H3.4 (pyramid hierarchy) includes only the two single-attribute q-vectors corresponding to the two lower-level attributes; the single-attribute q-vector for the remaining attribute is excluded. The absence of single-attribute q-vectors in the reduced Q-matrices may have a serious impact on the classifications, which is discussed in the next chapter.

Chapter 4 Conditional KLI-based indices for hierarchical CDMs

4.1 Introduction

In the previous chapter, we discussed two approaches to constructing Q-matrices with hierarchical attributes, focusing on equivalent q-vectors and complete Q-matrices. There are, however, numerous ways to construct the Q-matrix for a test from all the available q-vectors. Previous studies in Q-matrix design simulate tests with different Q-matrices to compare the classification results (Chiu et al., 2009; Liu & Huggins-Manley, 2016; Liu et al., 2017; Madison & Bradshaw, 2015). We address the issue of Q-matrix design from the perspective of item-level and test-level indices. The indices can be used to automate test assembly with a calibrated item pool, and they also provide a basis for comparing different Q-matrix designs.

The existing item-level and test-level indices based on the KL information are overall indices for a population of examinees, and they have been found to be positively correlated with the overall classification rates (Henson & Douglas, 2005; Kuo et al., 2016). However, the correct classification rates (CCRs) can vary substantially across attribute profiles within the same test, whether the attributes are independent or hierarchical. The CCRs conditional on the attribute profile are usually not reported, as most studies only calculate an overall CCR for the population of examinees. With independent attributes, the conditional CCRs differ across attribute profiles when the attributes are measured by different numbers of items. In this situation, attribute-level indices can compensate for overall indices for items or tests (Henson et al., 2008; Kuo et al., 2016).
However, the attribute-level index ADI fails to account for the dependency between attributes created by attribute hierarchies. To address this problem, the modified ADI proposed by Kuo and colleagues (2016) adds weights to the original ADI but remains an overall index for a population of examinees.

The following examples show the need for conditional indices instead of an overall index. Suppose the items are calibrated with the DINA model, with a common intercept and interaction effect for all items. The first Q-matrix contains a multiple-attribute item in addition to three single-attribute items. When this Q-matrix is used for three independent attributes, different attribute profiles have substantially different conditional CCRs. The second example is the identity matrix used to measure three linear attributes. Here the CCRs for complete mastery and complete non-mastery are higher than those for the other profiles.

Figure 9: Correct classification rates under two conditions

Since the goal is to estimate the attribute profile of every examinee accurately, it is necessary to develop an index conditional on the attribute profile, especially when hierarchical attributes are present. This thesis proposes two conditional indices based on the KL information that can be used for non-adaptive test construction and Q-matrix design.

In this chapter, it is assumed that a large number of items have been developed for a well-defined domain and that the Q-matrix, as well as the relationship between attributes, is correctly specified. We take the full Q-matrix approach and allow all types of q-vectors. It is also assumed that item parameter estimates have been obtained from previous calibrations.

4.2 Conditional KL indices for test construction

A set of two indices conditional on the attribute profile is proposed. The two conditional indices summarize the information in the L-by-L test KLI matrix defined in Section 2.1.7, where L is the size of the attribute profile space. The first index is the average of the elements in the row and the column of the test KLI matrix corresponding to the attribute profile of interest. The second index is the range of those elements. The two KLI-based indices were log-transformed to obtain a linear relationship with the CCR (Henson et al., 2008; Henson et al., 2018).

The first index describes the average power of a test to discriminate a given attribute profile from the other attribute profiles. It is expected to be positively correlated with the conditional CCR for that profile. However, this index alone is not sufficient for predicting the CCR because of the multidimensional nature of CDMs. With the first index fixed, if the test does not differentiate well between two particular attribute profiles, the CCR for either of them suffers (Cheng, 2010). This phenomenon was noted in Cheng's (2010) CD-CAT study and compared to the law of the minimum. Therefore, a second index was defined to characterize the weakest point of a test. One particularly low KLI between two attribute profiles leads to a relatively large range given the same mean index. A range measure was used instead of a minimum measure to control for the effect of the mean index. The range index is negatively correlated with the conditional CCR for a profile but has a low or insignificant correlation with the mean index.
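A small sketch may make the two indices concrete. The fragment below is one plausible implementation rather than the definitive one: the test KLI matrix entry for a pair of profiles is taken as the sum over items of the KL divergence between the Bernoulli response distributions the two profiles imply, the first index is the log-transformed mean of the off-diagonal entries in the row and column for the profile of interest, and the second index is computed as the maximum minus the minimum of those entries. The DINA parameterization and the guessing and slipping values are assumptions made for the example.

```python
import numpy as np

def dina_prob(alpha, q, guess, slip):
    """P(correct | alpha) for a DINA item with q-vector q."""
    eta = int(all(a >= b for a, b in zip(alpha, q)))
    return (1 - slip) if eta == 1 else guess

def test_kli_matrix(profiles, Q, guess, slip):
    """L-by-L matrix of test-level KL information; entry (u, v) sums the
    item-level KL divergences between the response distributions implied
    by profile u and profile v."""
    L = len(profiles)
    kli = np.zeros((L, L))
    for u in range(L):
        for v in range(L):
            if u == v:
                continue
            for j, q in enumerate(Q):
                pu = dina_prob(profiles[u], q, guess[j], slip[j])
                pv = dina_prob(profiles[v], q, guess[j], slip[j])
                kli[u, v] += pu * np.log(pu / pv) + (1 - pu) * np.log((1 - pu) / (1 - pv))
    return kli

def conditional_indices(kli, u):
    """Mean (logged) and range of the off-diagonal KLI entries in the row
    and column for profile u.  The dissertation log-transforms the indices;
    here only the mean is logged, as one plausible reading."""
    vals = np.concatenate([np.delete(kli[u, :], u), np.delete(kli[:, u], u)])
    return np.log(vals.mean()), vals.max() - vals.min()

# Three independent attributes, identity Q-matrix repeated three times
profiles = [(a, b, c) for a in (0, 1) for b in (0, 1) for c in (0, 1)]
Q = [(1, 0, 0), (0, 1, 0), (0, 0, 1)] * 3
g = [0.2] * len(Q)
s = [0.2] * len(Q)
kli = test_kli_matrix(profiles, Q, g, s)
for u, alpha in enumerate(profiles):
    print(alpha, conditional_indices(kli, u))
```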
Th e need for the second index is best illustrated by comparing the following two Q - matrices under the DINA model. Three independent attributes are measured with nine items . w here is the identity matrix . Assuming the intercept and the interaction effect for all items, the two indices were calculated for the two tests. The CCRs were also obtained from the simulation. In the T able 15 , the two tests have the same for each attribute profile but the second te st has substantially lower CCRs. Table 15 : KLI indices and the CCRs for two Q - matrices CCR CCR 000 2.20 1.10 0.92 2.20 2.20 0.81 100 2.20 1.10 0.92 2.20 2.20 0.81 010 2.20 1.10 0.92 2.20 2.20 0.81 001 2.20 1.10 0.92 2.20 2.20 0.81 110 2.20 1.10 0.92 2.20 2.20 0.80 101 2.20 1.10 0.91 2.20 2.20 0.80 011 2.20 1.10 0.91 2.20 2.20 0.81 111 2.20 1.10 0.91 2.20 2.20 0.81 70 The difference between the two Q - matrices in Table 15 was referred to as an issue of content balancing in Cheng (2010) since the number of items for each attribute is not balanced in . Given the same , the second index is needed in this case to account for the different CCR s. A larger range index corresponds to a lower CCR. The two conditional KL indices would be good predictor s of the conditional CCR of with a fixed test length . To make them useful for between various test lengths , the following two conditions need to be satisfied: 1. For each , there are no zero off - diagonal entries in the test KLI matrix because is not defined ; 2. There is an o dd number of items in each item type (i.e., a distinct q - vector ) . The first condition is satisfied when the Q - matrix is complete. The second condition is necessary for the indices to be useful because when the examinee correctly respond to half of the items, the examinee is likely to be misclassified. For example, if the test has two items with , for examinees who master attribute , the likelihood function is ; for examinees who do not possess attribute , the l ikelihood function is . It is possible that an examinee correctly answers item 1 but fails at item 2. Then , . W hen the item s are homogenous in quality , the difference between and would be very small. I n an extreme case when and for all items , . KLI - based i tem selection in CD - CAT uses ind ices similar to and ignores the minimum effect. As a result, researchers found it necessary to add extra constr aints to the item selection algorithm in order to improve the CCR (Cheng, 2010) . Such constraints intend to balance attribute coverage , and t his pro cess is also referred to as content balancing (Cheng, 2010). The 71 result of content balancing is a smaller KLI range given the same . When attribute hierarchies are present, content balancing becomes tricky. Using the two indices together in test construction becomes more practical with hierarchical attributes than content balancing . 4.3 Simu lation design A s imulation s tudy was conducted to assess the p erformance of the t wo i ndices . Random tests were generated as described below with items calibrated using DINA or A - CDM . The hierarchical GDINA model is equivalent to A - CDM in most cases, so the GDINA model is excluded from the simulations. The attribute hierarchies shown in 3.2 were used to simulate the examinee responses . The assessment tasks may be embedded in the classroom instruction and scattered in multiple class sessions. As a result, the assessment is not necessarily a concise one. 
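The basic machinery of such a simulation (generating DINA responses for a given Q-matrix and estimating the conditional CCR by maximum-likelihood classification) can be sketched as follows. The fragment is illustrative: the guessing and slipping values and the unbalanced Q-matrix Q2 are assumptions made for the example (the second Q-matrix of Table 15 is not reproduced here), and likelihood ties are resolved in favor of the first candidate profile.

```python
import numpy as np
rng = np.random.default_rng(1)

def dina_p(alpha, q, g, s):
    """P(correct | alpha) under the DINA model with guessing g and slipping s."""
    eta = all(a >= b for a, b in zip(alpha, q))
    return 1 - s if eta else g

def conditional_ccr(Q, profiles, true_alpha, g=0.2, s=0.2, n=5000):
    """Simulate n examinees with the given true profile, classify each by
    maximum likelihood over the candidate profiles, and return the CCR."""
    p_true = np.array([dina_p(true_alpha, q, g, s) for q in Q])
    correct = 0
    for _ in range(n):
        x = (rng.random(len(Q)) < p_true).astype(int)
        # log-likelihood of the response vector under each candidate profile
        ll = []
        for alpha in profiles:
            p = np.array([dina_p(alpha, q, g, s) for q in Q])
            ll.append(np.sum(x * np.log(p) + (1 - x) * np.log(1 - p)))
        # ties go to the first candidate profile in this sketch
        if profiles[int(np.argmax(ll))] == true_alpha:
            correct += 1
    return correct / n

profiles = [(a, b, c) for a in (0, 1) for b in (0, 1) for c in (0, 1)]
I3 = [(1, 0, 0), (0, 1, 0), (0, 0, 1)]
Q1 = I3 * 3                                   # balanced: each attribute measured 3 times
Q2 = I3 + [(1, 0, 0)] * 4 + [(0, 1, 0)] * 2   # unbalanced, same test length
for alpha in [(0, 0, 0), (1, 1, 1)]:
    print(alpha, conditional_ccr(Q1, profiles, alpha), conditional_ccr(Q2, profiles, alpha))
```

With the same test length, the unbalanced matrix Q2 yields visibly lower conditional CCRs, which is the qualitative pattern shown in Table 15.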
We consider test lengths of three, five, and seven times the number of attributes, respectively . For each combination of test length (e.g., nine items) and attribute hierarchy (e.g., H3.2), three sets of tests were simulated. T h e first set of 25 tests consists of single - attribute items, the second set of 25 tests consists of q - vectors from the full Q - matrix calibrated by the DINA model, and the third set of 5 0 tests consists of q - vectors from the full Q - matrix calibrated by the ACDM. The actual Q - matrix for each random test was constructed by randomly sampling from all the possible q - vectors with replacement if the full Q - matrix approach is used or from the id entity matrix if only single - attribute items are wanted. Each Q - matrix contained the identity matrix to e nsure completeness. The re was an odd number of items in each item type (i.e., a distinct q - vector). For all items, t he intercept parameter ( ) was generated from the uniform distribution and was generated from . A total of 5 ,000 examinees are simulated for each true attribute profile for each random test . Given each examinee's attribute profile, item scores are gene rated based on the chosen model. 72 A random variable is generated. The correct response probability is compared with to decide the response of examinee to item : The two conditional indices were calculated for each attribute profile for each random test. C lassification s were accomplished via MLE for independent attributes or restricted MLE for hierarchical attributes b ecause the item parameters are known. C onditional profile - wise CCR were recorded for each . The index of means is supposed to be positively correlated with the CCR , and the index of range is supposed to be negatively correlated with the CCR. For each attribute hierarchy, a linear regression mode l with normal errors was fit using the two indices to predict the CCR : The regression estimates were used to produce a linear combination of the two indices as a combined index , cKLI : The combined index cKLI is expected to be highly correlated with the CCR. 4.4 Simulation results The regression estimates and the for each attribute hierarchy were summarized in Table 1 6 . A combined index was calculated as a linear combination of the two indices using the regression estimates as weights. This combined index (cKLI) was plotted against the CCR conditional on in the following scatter plots to visualize th e relationship s ( see Figure 10 - Figure 2 4 ) . For brevity, we only present the scatter plots for a subset of s when there are more than a ttribute profiles in the space. 73 Table 16 : Regression estimates and for each attribute hierarchy Attribute hierarchy H3.1 Independent - 0.07 0.2 4 0.7 6 H3.2 Linear - 0.07 0.19 0. 78 H3.3 Inverted pyramid - 0.07 0.2 0 0.74 H3.4 Pyramid - 0.06 0.21 0.79 H4.1 Independent - 0.08 0.27 0.82 H4.2 Linear - 0.08 0.21 0.80 H4.3 Linear+single - 0.08 0.24 0.80 H4.4 Inverted pyramid - 0.08 0.22 0.81 H4.5 Pyramid - 0.07 0.22 0.81 H5.1 Independent - 0.07 0.28 0.81 H5.2 Linear - 0.09 0.21 0.82 H5.3 Inverted pyramid I - 0.08 0.26 0.82 H5.4 Inverted pyramid II - 0.08 0.26 0.81 H5.5 Pyramid I - 0.08 0.25 0.80 H5.6 Pyramid II - 0.08 0.25 0.81 Table 17 : The overall correlation and the correlations for different test lengths between cKLI and the CCR Attribute hierarchy All Test length H3.1 Independent 0 .8 7 0.60 0.76 0.85 H3.2 Linear 0 . 
88 0.83 0.87 0.87 H3.3 Inverted pyramid 0 .8 6 0.79 0.82 0.84 H3.4 Pyramid 0 .89 0.81 0.88 0.86 H4.1 Independent 0 .90 0.77 0.82 0.88 H4.2 Linear 0 . 89 0.86 0.86 0.86 H4.3 Linear+single 0 .90 0.81 0.84 0.85 H4.4 Inverted pyramid 0 .90 0.83 0.87 0.89 H4.5 Pyramid 0 .90 0.84 0.87 0. 8 7 H5.1 Independent 0 .90 0.77 0.87 0.85 H5.2 Linear 0 .9 0 0.85 0.91 0.87 H5.3 Inverted pyramid I 0 .91 0.84 0.88 0.91 H5.4 Inverted pyramid II 0.90 0.81 0.88 0.89 H5.5 Pyramid I 0.90 0.80 0.87 0.90 H5.6 Pyramid II 0.90 0.84 0.87 0.88 Note : is the number of attributes . The overall correlation between cKLI and the CCR is presented in Table 17. All the overall correlations are around . The correlations for different test lengths are also calculated (Table 74 17). The correlation gener ally increases substantially as the test length goes up from three times of to five or seven times of where is the number of attributes. This trend can also be seen in the scatter plots. Figure 10 : A plot for tests with three independent attributes (H3.1) of the combined index with CCRs 75 Figure 11 : A plot for tests with three linear attributes (H3.2) of the combined index with CCRs 76 Figure 12 : A plot for tests with three inverted pyramid attributes (H3. 3 ) of the combined index with CCRs 77 Figure 13 : A plot for tests with three pyramid attributes (H3. 4 ) of the combined index with CCRs 78 Figure 14 : A plot for tests with four independent attributes (H 4 .1) of the combined index w i th CCRs 79 Figure 15 : A plot for tests with four linear attributes (H4.2) of the combined index with CCRs 80 Figure 16 : A plot for tests with three linear attributes + one single attribute (H4.3) of the combined index with CC Rs 81 Figure 17 : A plot for tests with four inverted pyramid attributes (H4. 4 ) of the combined index with CCRs 82 Figure 18 : A plot for tests with four pyramid attributes (H4. 5 ) of the combined index with CCRs 83 Figure 19 : A plot for tests with five independent attributes (H5 .1 ) of the combined index with CCRs 84 Figure 20 : A plot for tests with five linear attributes (H5. 2 ) of the combined index with CCRs 85 Figure 21 : A plot for tests with five in v erted pyramid attributes (H5. 3 ) of the combined index with CCRs 86 Figure 22 : A plot for tests with five in v erted pyramid attributes (H5. 4 ) of the combined index with CCRs 87 Figure 23 : A plot for tests with five pyramid attributes (H5. 5 ) of the combined index with CCRs 88 Figure 24 : A plot for tests with five pyramid attributes (H5. 6 ) of the combined index with CCRs 4.5 Discussion The two indices can predict the CCR well according to the linear regression results showing that abut 80% of the total variance was explained . The prediction of two indices was a substantial improve from the prediction of ei t her index alone. This relationship was also reflected by the high correlation between t he combined index and the CCR. These results suggest t hat using an averaged KLI may not be sufficient for predicting CCR. Therefore, any single index based on the maximum or the mean of KLI would have serious limitations as a test construction index. As mentioned earlier, it has been found necessary to add extra constraints to the item selection algorithm based on a single KLI index, in order to improve the CCR in CD - CAT research (Cheng, 2010). Such constraints would lead to a decreased range index , and balance d attribute coverage 89 would be observed with independent attributes . 
In other words, content balancing could have the same effect as using a range index when attributes are independent. With hierarchical attributes, however, there is no clear way to define content balancing. Therefore, using the two indices together in test construction is more appropriate with hierarchical attributes than content balancing. This applies to both non-adaptive and adaptive test construction.

It is important to note that the relationship between the indices and the CCR does not depend on the model selected (DINA or ACDM) or on the test length. However, the relationship may depend on the attribute hierarchy, more specifically on the number of attribute profiles, as suggested by the different regression estimates in Table 16. Moreover, the indices lead to better predictions of the CCR as the test length increases.

The proposed indices can be used to assemble tests from an item pool by setting an information target or a fixed test length. Setting an appropriate information target may not be easy because, on the one hand, a target needs to be set for each attribute profile and, on the other hand, as also noted in Henson et al. (2018), the threshold value that would ensure a certain CCR may depend on the number of attributes and the attribute hierarchy. If the test length is fixed, the test assembly algorithm could take two steps: a set of tests with the largest mean KLI is identified first, and then the one with the largest minimum KLI or the smallest range index is chosen. Alternatively, the regression estimates in Table 16 can be used to calculate the combined index. The test assembly can be automated in various ways. With the two information indices, we can ensure that each attribute "is measured by an adequate number of items" (Cheng, 2010, p. 903).

The relationship between the combined index and the CCR is visualized for each attribute profile in Figure 11 - Figure 24 because the CCR can vary substantially between attribute profiles. We chose four random tests in condition H4.2 to demonstrate the variation of the CCR in Figure 25.

Figure 25: The conditional CCRs from four random tests in H4.2

With a linear hierarchy and an identity matrix as the Q-matrix, the attribute profiles that reflect mastery of some but not all attributes are more likely to be misclassified than the two attribute profiles at the two ends (i.e., the one with all 0s and the one with all 1s). This pattern can also be explained in terms of the KL indices (Table 18). Another way to see the varying CCRs for a linear hierarchy is that the single-attribute item for the lowest-level attribute differentiates the all-zero profile from every other profile, and the single-attribute item for the highest-level attribute differentiates the all-one profile from every other profile; as a result, these two profiles have higher classification rates than the profiles in the middle.

We use the two KLI-based indices to compare the full and reduced Q-matrix approaches under the ACDM. As mentioned earlier, the two approaches are not equivalent under the ACDM: the major difference between a full Q-matrix and a reduced Q-matrix under the ACDM is the exclusion of some single-attribute items from the reduced Q-matrix. Therefore, we compare the identity matrix with the reduced Q-matrix under the ACDM in terms of the two indices for a linear hierarchy of three attributes (H3.2). The item parameters are presented in Table 18, and the indices for the two three-item tests are shown in Table 19.
If the reduced Q - matrix approach is adopted and all the items are calibrated with the ACDM, classifications for the attribute profiles , , and become much more difficu lt. As suggested by the combined index, much l onger tests are required to achieve comparable classification rates for most of the attribute profiles if two types of single - attribute items, and , are excluded from the candi date pool . In addition to the consideration of classification efficiency, t he choice between a reduced Q - matrix and a full Q - matrix should depend on answers to questions such as whether it is possible to a mixed item pool, whether it is possible to develop a certain ite m type, and the model - data fit at the item level . Table 18 : Item parameters of five items for H3.2 Model ( 010 ) - 0.1 0.8 ( 001 ) - 0.1 0.8 ( 100 ) - 0.1 0.8 ( 110 ) ACDM 0.1 0.4 0.4 ( 111 ) ACDM 0.1 0.27 0.27 0.27 Table 19 : Comparison between two three - item tests in terms of the two indices cKLI cKLI 000 1.26 1.10 0.05 1.38 0.82 0.12 100 0.85 0.69 0.04 0.33 1.35 - 0.17 110 0.85 0.69 0.04 0.52 2.84 - 0.38 111 1.26 1.10 0.05 0.80 3.34 - 0.42 92 Chapter 5 Q - matrix design for nonparametric classifications with hierarchical attributes 5.1 Introduction W ithout a calibrated item pool, the nonparametric classification (NPC) method (Chiu, Sun, & Bian, 2018) provide an alternative approach f or classifications . The NPC method allows the t eachers to develop the ir own items based on CDMs if they can identify the attribute hierarchy and the Q - matrix. There is no need for item calibratio n , and s tudents are classified based on their response data without the need to estimate item parameters. T he Q - matrix design plays an even more important role in nonpa rametric classifications than in parametric classifications , but it has not been formally addressed in the literature . Related studies explore different Q - matrix designs with hierarchical attributes in the context of parametric classifications (Liu, Huggins - Manley, & Bradshaw 2017; Tu, Wang, Cai, Douglas, & Chang, 2018). There is a consensus on the effect of single - structured items on accurate classifications regardless of the attribute hierarchy (Chiu et al., 2009; DeCarlo, 2011; Madison & Bradshaw, 2015). However, t he role of items with multiple attributes is not clear. Other factors in Q - matrix design that receive less attention in existing research include test length and the number of items in each item type. In this study , the NPC method (Chiu & Douglas, 2013) was used Because it is assumed that the teacher develops a CDM - based test for a particular classroom . Prior data are not expected to be available. Therefore, the general nonparametric classification method (Chiu, Sun, & Bian, 2018) that requ ires some prior response data is not considered . 93 5.2 Ties in NPC There is a tie w hen the observed response pattern of an examinee is at an equal distance to more than one ideal response pattern. Some Q - matrices lead to more ties than others. With an ideal Q - matrix, the item responses of high probabilities are always closest to the ideal response pattern of the true and there would be no ties in the hamming distance . In this study, if a tie occurs, the examinee would be randomly classified into one attribute profile with the minimal hamming distance. We present a comparison between two Q - matrices as an example . The underlying model is the DINA model. The item quality is assumed to be high : . Three independent attri butes are involved . 
We focus on an examinee with . The hamming distance s between several likely response patterns of and each of the ideal response patterns are shown in the cell s of Table 20 and Table 2 1 . With an identity matrix as the Q - matrix, there are no ties in the hamming distance and the probability of correctly classifying the examinee with equals to the probability of observing the response pattern of , which is . When t he Q - matrix contains the identity matr ix and an item probing all three attributes , ties are observed when the examinee slips on one of the items ( Table 2 1 ) . The probability of a tie is the probability of observing such a response patter, which is . It is still possible to clarify the examinee with a tie in the hamming distance. T he CCR for can be calculated as a weighted sum of probabilities : . Comparing the two Q - matrices reveals that adding a n item with to the identity matrix leads to a slight increase in the CCR for from to . The second Q - matrix leads to a probability of 0.29 to obtain a tie. 94 Table 20 : Hamming distance s for with (H3.1) Response pattern Probability : (Ideal response pattern) : (000) : (100) : (010) : (001) : (110) : (101) : (011) : (111) (111) 3 2 2 2 1 1 1 0 (110) 2 1 1 3 0 2 2 1 (101) 2 1 3 1 2 0 2 1 (011) 2 3 1 1 2 2 0 1 Table 21 : Hamming distance s for with (H3.1) Response pattern Probability : (Ideal response pattern) : (000) : (100) : (010) : (001) : (110) : (101) : (011) : (111) (1111) 4 3 3 3 2 2 2 0 (1110) 3 2 2 2 1 1 1 1 (1101) 3 2 2 4 1 1 3 1 (1011) 3 2 4 2 3 1 3 1 (0111) 3 4 2 2 3 3 1 1 5. 3 Simulation design T he identity matrix serve d as the baseline Q - matri x . We consider ed the following situations : 1) a dding one or two simple - attribute items to the baseline, 2) adding one or two multiple - attribute items to be baseline, and 3) adding an identity matrix. A total of 15, 19, and 23 Q - matrices are obtained for 3, 4, and 5, respectively , presented in Table 2 2 . The computations of CCRs and the probability of a tie become more complicated with a longer test or more attributes. Therefore, a simulation study was conducted to compare Q - matrices. Item parameters were simulated based on . A total of 5 ,000 examinees are simulated for each true attribute profile for each Q - matrix . Given each examinee's attribute profile, item scores are gene rated based on the DINA . A random variable is generated. The correct response probability is compared with to decide the response of examinee to item : 95 Examinee responses were classified using the nonparametric classification method (Chiu & Douglas, 2013). C onditional profile - wise CCR were recorded for each . The percent of ties was recorded for each simulation condition as an estimate of the probabilit y of getting a tie . Table 22 : Q - matrix designs for th e simulation study of nonparametric classifications Q - matrix Q - matrix Q - matrix 3 - 1 4 - 1 5 - 1 3 - 2 4 - 2 5 - 2 3 - 3 4 - 3 5 - 3 3 - 4 4 - 4 5 - 4 3 - 5 4 - 5 5 - 5 3 - 6 4 - 6 5 - 6 3 - 7 4 - 7 5 - 7 3 - 8 4 - 8 5 - 8 3 - 9 4 - 9 5 - 9 3 - 10 4 - 10 5 - 10 3 - 11 4 - 11 5 - 11 3 - 12 4 - 12 5 - 12 3 - 13 4 - 13 5 - 13 3 - 14 4 - 14 5 - 14 3 - 15 4 - 15 5 - 15 4 - 16 5 - 16 4 - 17 5 - 17 4 - 18 5 - 18 4 - 19 5 - 19 5 - 20 5 - 21 5 - 22 5 - 23 96 5. 4 Simulation results Simulation results for the conditions with three attributes are summarized in Table 22 - Table 25 . For brevity, we only present the results for four attribute profiles. 
Comparing each Q-matrix to the baseline (Q3-1), we found that a very high probability of obtaining a tie usually indicates no increase in the CCR, whereas a lack of ties indicates an increased CCR for at least some attribute profiles. A longer test does not necessarily lead to a higher CCR for each attribute profile. As shown in Table 23, adding a single-attribute item to the baseline Q-matrix does not lead to an increased CCR with three independent attributes. The lack of change can be explained by the ties in Hamming distances, which cancel the effect of adding one more item. A tie is more likely when the Q-matrix contains an even number of items with a given q-vector. Adding one multiple-attribute item slightly increases the CCR of the profiles it helps distinguish. In the above conditions, ties are likely to occur for all or some attribute profiles. However, when two items of each q-vector are added to the baseline, as in Q3-5, Q3-6, and Q3-7, the CCRs of all or some attribute profiles increase substantially, and almost no ties are observed.

With a linear hierarchy, all q-vectors have equivalent single-attribute q-vectors. Therefore, all the Q-matrices contain single-attribute q-vectors. The comparisons between Q3-2 and Q3-5, between Q3-3 and Q3-6, and between Q3-4 and Q3-7 in Table 24 suggest that a large probability of obtaining ties hurts the classifications. For example, the CCR for one profile increases slightly after a single item is added (Q3-2), but the classifications for the other attribute profiles do not benefit. When two such items are added (Q3-5), the CCRs for several profiles increase substantially. The probability of a tie decreases from 0.23 (Q3-2) to 0.08 (Q3-5) when another item with the same q-vector is added to the Q-matrix. Similar patterns can be found for the inverted pyramid and pyramid hierarchies in Table 25 and Table 26.

The negative effect of having an even number of items of an item type is highlighted by the comparison between Q3-1, Q3-8, and Q3-15 in Table 23 - Table 26. When the Q-matrix consists of two identity matrices, the CCR for each attribute profile does not change or increases only slightly compared to the baseline. However, when the Q-matrix consists of three identity matrices, the CCR for each attribute profile increases substantially.

Summarizing the simulation results for three attributes, we conclude that tests in which some q-vector appears an even number of times are less efficient than tests in which each q-vector appears an odd number of times. When a q-vector appears an even number of times and the item quality is homogeneous, ties are more likely than in the baseline situation of each attribute hierarchy, and consequently the effect of the extra test length is partially or completely canceled out. This conclusion also applies to the conditions with four or five attributes, shown in Table 27 - Table 37.
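The NPC procedure and the tie-handling rule used in these simulations can be summarized in a short sketch. The fragment below is illustrative rather than a reproduction of the simulation code: it builds the ideal DINA response pattern for each candidate profile, classifies a simulated response by minimum Hamming distance with random tie-breaking, and reports the conditional CCR and the tie rate. With guessing and slipping of 0.1, the baseline CCR for the all-mastery profile is about 0.9^3, roughly 0.73, and adding one item measuring all three attributes raises it slightly, consistent with the worked example in Section 5.2.

```python
import numpy as np
rng = np.random.default_rng(7)

def ideal_pattern(alpha, Q):
    """Ideal DINA response pattern: 1 where all required attributes are mastered."""
    return np.array([int(all(a >= b for a, b in zip(alpha, q))) for q in Q])

def npc_classify(x, profiles, Q):
    """Return the profile whose ideal pattern is closest to x in Hamming
    distance, breaking ties at random, plus a flag for whether a tie occurred."""
    d = np.array([np.sum(np.abs(x - ideal_pattern(a, Q))) for a in profiles])
    winners = np.flatnonzero(d == d.min())
    tie = len(winners) > 1
    return profiles[rng.choice(winners)], tie

def npc_ccr(Q, profiles, true_alpha, g=0.1, s=0.1, n=5000):
    """Simulate n examinees with the true profile and return (CCR, tie rate)."""
    eta = ideal_pattern(true_alpha, Q)
    p = np.where(eta == 1, 1 - s, g)          # response probabilities
    hits = ties = 0
    for _ in range(n):
        x = (rng.random(len(Q)) < p).astype(int)
        est, tie = npc_classify(x, profiles, Q)
        hits += est == true_alpha
        ties += tie
    return hits / n, ties / n

profiles = [(a, b, c) for a in (0, 1) for b in (0, 1) for c in (0, 1)]
baseline = [(1, 0, 0), (0, 1, 0), (0, 0, 1)]      # Q3-1, the identity matrix
with_111 = baseline + [(1, 1, 1)]                  # identity plus one (111) item
print(npc_ccr(baseline, profiles, (1, 1, 1)))
print(npc_ccr(with_111, profiles, (1, 1, 1)))
```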
98 Table 23 : NPC results for H3.1 Q CCR Pr(tie) 3 - 1 3 1 1 1 0 0 0.73 0.74 0.74 0.74 0.00 0.00 0.00 0.00 3 - 2 4 2 1 1 0 0 0.73 0.74 0.73 0.73 0.18 0.18 0.18 0.18 3 - 3 4 1 1 1 1 0 0.72 0.71 0.76 0.76 0.03 0.16 0.24 0.25 3 - 4 4 1 1 1 0 1 0.73 0.72 0.72 0.77 0.00 0.02 0.15 0.28 3 - 5 5 3 1 1 0 0 0.78 0.79 0.77 0.79 0.00 0.00 0.00 0.00 3 - 6 5 1 1 1 2 0 0.73 0.75 0.86 0.86 0.01 0.07 0.02 0.02 3 - 7 5 1 1 1 0 2 0.74 0.72 0.75 0.93 0.00 0.02 0.07 0.03 3 - 8 6 2 2 2 0 0 0.74 0.73 0.74 0.71 0.46 0.45 0.46 0.45 3 - 9 7 3 2 2 0 0 0.79 0.79 0.78 0.78 0.32 0.33 0.33 0.33 3 - 10 7 2 2 2 1 0 0.74 0.78 0.85 0.85 0.44 0.32 0.18 0.18 3 - 11 7 2 2 2 0 1 0.73 0.73 0.78 0.93 0.46 0.44 0.33 0.02 3 - 12 8 4 2 2 0 0 0.78 0.79 0.79 0.79 0.36 0.37 0.36 0.36 3 - 13 8 2 2 2 2 0 0.73 0.79 0.86 0.85 0.45 0.36 0.25 0.25 3 - 14 8 2 2 2 0 2 0.73 0.73 0.78 0.93 0.44 0.44 0.36 0.11 3 - 15 9 3 3 3 0 0 0.92 0.92 0.91 0.92 0.00 0.00 0.00 0.00 Note : J = test length ; = number of items with a certain q - vector; CCR = correct classification rate . 99 Table 24 : NPC results for H3. 2 Q CCR Pr(tie) 3 - 1 3 1 1 1 0.85 0.77 0.77 0.86 0.09 0.10 0.08 0.09 3 - 2 4 2 1 1 0.90 0.77 0.80 0.85 0.17 0.23 0.03 0.09 3 - 3 4 1 2 1 0.89 0.81 0.80 0.89 0.04 0.15 0.15 0.03 3 - 4 4 1 1 2 0.85 0.80 0.77 0.89 0.10 0.03 0.23 0.17 3 - 5 5 3 1 1 0.95 0.84 0.79 0.86 0.03 0.08 0.03 0.08 3 - 6 5 1 3 1 0.89 0.87 0.87 0.89 0.02 0.02 0.03 0.03 3 - 7 5 1 1 3 0.85 0.80 0.83 0.96 0.08 0.03 0.09 0.03 3 - 8 6 2 2 2 0.89 0.82 0.81 0.89 0.18 0.33 0.33 0.19 3 - 9 7 3 2 2 0.97 0.86 0.81 0.89 0.01 0.18 0.32 0.18 3 - 10 7 2 3 2 0.89 0.88 0.88 0.90 0.18 0.18 0.17 0.17 3 - 11 7 2 2 3 0.89 0.81 0.86 0.97 0.19 0.31 0.19 0.01 3 - 12 8 4 2 2 0.97 0.86 0.82 0.89 0.05 0.23 0.33 0.19 3 - 13 8 2 4 2 0.90 0.87 0.88 0.90 0.18 0.22 0.22 0.19 3 - 14 8 2 2 4 0.89 0.81 0.87 0.97 0.20 0.31 0.22 0.05 3 - 15 9 3 3 3 0.97 0.94 0.94 0.97 0.01 0.01 0.01 0.01 Note : J = test length; = number of items with a certain q - vector; CCR = correct classification rate. 100 Table 25 : NPC results for H3. 3 Q CCR Pr(tie) 3 - 1 3 1 1 1 0 0.81 0.72 0.78 0.81 0.17 0.02 0.08 0.02 3 - 2 4 2 1 1 0 0.88 0.75 0.80 0.80 0.16 0.14 0.02 0.01 3 - 3 4 1 2 1 0 0.85 0.72 0.80 0.81 0.11 0.18 0.17 0.18 3 - 4 4 1 1 1 1 0.81 0.72 0.76 0.83 0.17 0.05 0.24 0.24 3 - 5 5 3 1 1 0 0.95 0.78 0.80 0.81 0.04 0.00 0.02 0.00 3 - 6 5 1 3 1 0 0.85 0.79 0.87 0.87 0.11 0.01 0.02 0.00 3 - 7 5 1 1 1 2 0.81 0.72 0.80 0.95 0.17 0.04 0.15 0.02 3 - 8 6 2 2 2 0 0.88 0.75 0.80 0.80 0.19 0.44 0.33 0.33 3 - 9 7 3 2 2 0 0.97 0.78 0.81 0.81 0.01 0.31 0.33 0.32 3 - 10 7 2 3 2 0 0.89 0.79 0.87 0.87 0.19 0.32 0.18 0.18 3 - 11 7 2 2 2 1 0.88 0.74 0.86 0.95 0.19 0.44 0.18 0.01 3 - 12 8 4 2 2 0 0.96 0.78 0.80 0.81 0.05 0.35 0.34 0.33 3 - 13 8 2 4 2 0 0.89 0.79 0.87 0.87 0.19 0.36 0.22 0.23 3 - 14 8 2 2 2 2 0.88 0.73 0.86 0.95 0.20 0.43 0.23 0.08 3 - 15 9 3 3 3 0 0.96 0.91 0.94 0.94 0.01 0.00 0.01 0.00 Note : J = test length; = number of items with a certain q - vector; CCR = correct classification rate. 
101 Table 26 : NPC results for H3.4 Q CCR Pr(tie) 3 - 1 3 1 1 0 1 0.81 0.76 0.73 0.81 0.02 0.08 0.02 0.17 3 - 2 4 2 1 0 1 0.81 0.78 0.73 0.84 0.19 0.25 0.17 0.11 3 - 3 4 1 1 1 1 0.80 0.78 0.76 0.88 0.03 0.15 0.22 0.04 3 - 4 4 1 1 0 2 0.80 0.81 0.73 0.88 0.01 0.02 0.14 0.16 3 - 5 5 3 1 0 1 0.87 0.83 0.79 0.85 0.00 0.08 0.01 0.11 3 - 6 5 1 1 2 1 0.81 0.82 0.86 0.88 0.02 0.10 0.02 0.04 3 - 7 5 1 1 0 3 0.81 0.80 0.78 0.95 0.00 0.02 0.01 0.04 3 - 8 6 2 2 0 2 0.81 0.80 0.73 0.88 0.33 0.34 0.45 0.20 3 - 9 7 3 2 0 2 0.87 0.86 0.80 0.90 0.18 0.19 0.32 0.18 3 - 10 7 2 2 1 2 0.81 0.86 0.86 0.89 0.32 0.18 0.19 0.17 3 - 11 7 2 2 0 3 0.81 0.81 0.80 0.97 0.32 0.31 0.31 0.01 3 - 12 8 4 2 0 2 0.87 0.87 0.79 0.89 0.22 0.24 0.36 0.18 3 - 13 8 2 2 2 2 0.80 0.86 0.86 0.89 0.33 0.23 0.25 0.19 3 - 14 8 2 2 0 4 0.81 0.81 0.78 0.96 0.34 0.32 0.36 0.06 3 - 15 9 3 3 0 3 0.95 0.94 0.92 0.96 0.00 0.01 0.00 0.01 Note : J = test length; = number of items with a certain q - vector; CCR = correct classification rate. 102 Table 27 : NPC results for H4.1 Q CCR Pr(tie) 4 - 1 4 0.65 0.65 0.65 0.65 0.66 0.00 0.00 0.00 0.00 0.00 4 - 2 5 0.66 0.64 0.66 0.65 0.66 0.17 0.18 0.18 0.19 0.18 4 - 3 5 0.65 0.64 0.68 0.68 0.68 0.03 0.17 0.25 0.24 0.25 4 - 4 5 0.67 0.67 0.64 0.72 0.70 0.00 0.02 0.14 0.28 0.29 4 - 5 5 0.66 0.65 0.64 0.64 0.73 0.00 0.00 0.02 0.13 0.33 4 - 6 6 0.71 0.72 0.71 0.70 0.71 0.00 0.00 0.00 0.00 0.00 4 - 7 6 0.67 0.68 0.76 0.77 0.77 0.01 0.08 0.02 0.02 0.02 4 - 8 6 0.66 0.65 0.67 0.84 0.84 0.00 0.02 0.06 0.03 0.03 4 - 9 6 0.65 0.65 0.65 0.67 0.90 0.00 0.00 0.01 0.06 0.04 4 - 10 8 0.66 0.66 0.66 0.65 0.66 0.54 0.55 0.54 0.56 0.55 4 - 11 9 0.71 0.71 0.71 0.70 0.71 0.44 0.45 0.46 0.45 0.45 4 - 12 9 0.64 0.71 0.77 0.77 0.76 0.55 0.44 0.32 0.33 0.34 4 - 13 9 0.66 0.64 0.70 0.83 0.84 0.54 0.55 0.45 0.20 0.19 4 - 14 9 0.66 0.65 0.65 0.69 0.91 0.54 0.54 0.55 0.45 0.03 4 - 15 10 0.71 0.71 0.72 0.71 0.71 0.46 0.48 0.47 0.47 0.47 4 - 16 10 0.66 0.70 0.76 0.77 0.77 0.54 0.48 0.38 0.39 0.38 4 - 17 10 0.64 0.66 0.70 0.83 0.84 0.56 0.55 0.48 0.27 0.27 4 - 18 10 0.65 0.66 0.65 0.71 0.91 0.54 0.54 0.55 0.47 0.14 4 - 19 12 0.90 0.89 0.89 0.89 0.89 0.00 0.00 0.00 0.00 0.00 Table 28 : NPC results for H4.2 Q CCR Pr(tie) 4 - 1 4 0.85 0.76 0.74 0.77 0.86 0.10 0.10 0.16 0.09 0.10 4 - 2 5 0.89 0.76 0.76 0.77 0.84 0.17 0.23 0.11 0.09 0.09 4 - 3 5 0.88 0.79 0.76 0.79 0.85 0.03 0.16 0.22 0.04 0.09 4 - 4 5 0.85 0.79 0.76 0.79 0.88 0.10 0.04 0.22 0.16 0.03 4 - 5 5 0.85 0.78 0.76 0.78 0.88 0.09 0.10 0.10 0.23 0.18 4 - 6 6 0.96 0.82 0.77 0.77 0.84 0.03 0.10 0.11 0.09 0.11 4 - 7 6 0.88 0.86 0.81 0.80 0.85 0.03 0.02 0.11 0.03 0.09 4 - 8 6 0.86 0.80 0.82 0.87 0.89 0.10 0.04 0.12 0.02 0.03 4 - 9 6 0.85 0.76 0.76 0.82 0.96 0.09 0.09 0.12 0.09 0.02 4 - 10 8 0.89 0.80 0.79 0.80 0.89 0.18 0.34 0.34 0.33 0.19 4 - 11 9 0.97 0.87 0.80 0.82 0.89 0.01 0.17 0.32 0.32 0.19 4 - 12 9 0.89 0.88 0.86 0.81 0.89 0.17 0.17 0.20 0.32 0.18 4 - 13 9 0.89 0.81 0.86 0.87 0.90 0.19 0.31 0.19 0.18 0.17 4 - 14 9 0.89 0.79 0.80 0.87 0.97 0.19 0.33 0.34 0.18 0.01 4 - 15 10 0.97 0.88 0.80 0.80 0.89 0.05 0.22 0.34 0.33 0.19 4 - 16 10 0.90 0.87 0.87 0.82 0.89 0.18 0.22 0.23 0.32 0.19 4 - 17 10 0.89 0.81 0.86 0.87 0.89 0.20 0.33 0.23 0.22 0.19 4 - 18 10 0.89 0.82 0.80 0.87 0.97 0.19 0.32 0.34 0.23 0.06 4 - 19 12 0.97 0.95 0.94 0.95 0.97 0.01 0.01 0.02 0.01 0.01 103 Table 29 : NPC results for H4.3 Q CCR Pr(tie) 4 - 1 4 0.77 0.70 0.70 0.76 0.76 0.09 0.09 0.10 0.09 0.09 4 - 2 5 0.79 0.69 0.72 0.76 0.76 0.18 0.24 0.03 0.09 0.10 4 - 3 5 0.80 0.73 0.73 0.79 0.80 0.03 0.15 0.14 0.03 0.03 
4 - 4 5 0.76 0.71 0.69 0.79 0.79 0.10 0.03 0.23 0.17 0.16 4 - 5 5 0.77 0.69 0.70 0.76 0.82 0.09 0.09 0.11 0.22 0.24 4 - 6 6 0.86 0.76 0.73 0.76 0.76 0.02 0.09 0.03 0.09 0.09 4 - 7 6 0.81 0.77 0.78 0.80 0.79 0.02 0.03 0.03 0.02 0.03 4 - 8 6 0.75 0.72 0.76 0.87 0.86 0.10 0.03 0.09 0.02 0.03 4 - 9 6 0.76 0.69 0.69 0.79 0.94 0.10 0.09 0.10 0.15 0.04 4 - 10 8 0.80 0.72 0.73 0.81 0.81 0.34 0.46 0.45 0.34 0.33 4 - 11 9 0.87 0.77 0.74 0.81 0.79 0.19 0.34 0.44 0.32 0.35 4 - 12 9 0.81 0.78 0.80 0.81 0.80 0.33 0.33 0.32 0.31 0.32 4 - 13 9 0.80 0.72 0.78 0.86 0.87 0.34 0.45 0.33 0.19 0.18 4 - 14 9 0.80 0.73 0.73 0.85 0.95 0.34 0.46 0.44 0.18 0.01 4 - 15 10 0.88 0.78 0.73 0.80 0.80 0.22 0.38 0.46 0.34 0.34 4 - 16 10 0.80 0.78 0.78 0.81 0.80 0.33 0.36 0.37 0.32 0.33 4 - 17 10 0.80 0.72 0.78 0.87 0.88 0.33 0.46 0.37 0.23 0.22 4 - 18 10 0.80 0.73 0.72 0.86 0.94 0.33 0.44 0.46 0.23 0.09 4 - 19 12 0.94 0.91 0.92 0.94 0.95 0.01 0.01 0.01 0.01 0.00 Table 30 : NPC results for H4.4 Q CCR Pr(tie) 4 - 1 4 0.84 0.73 0.69 0.78 0.76 0.10 0.16 0.09 0.09 0.09 4 - 2 5 0.87 0.74 0.73 0.77 0.76 0.18 0.29 0.03 0.09 0.10 4 - 3 5 0.86 0.79 0.73 0.80 0.81 0.05 0.14 0.13 0.03 0.03 4 - 4 5 0.83 0.76 0.70 0.77 0.79 0.10 0.11 0.24 0.25 0.17 4 - 5 5 0.86 0.73 0.69 0.76 0.75 0.09 0.16 0.11 0.23 0.24 4 - 6 6 0.96 0.78 0.74 0.79 0.76 0.03 0.17 0.03 0.08 0.09 4 - 7 6 0.89 0.86 0.78 0.80 0.78 0.03 0.04 0.02 0.02 0.02 4 - 8 6 0.85 0.76 0.75 0.83 0.86 0.09 0.11 0.09 0.08 0.02 4 - 9 6 0.85 0.72 0.69 0.79 0.80 0.09 0.16 0.10 0.15 0.15 4 - 10 8 0.88 0.80 0.73 0.80 0.81 0.19 0.33 0.44 0.34 0.33 4 - 11 9 0.97 0.85 0.73 0.80 0.81 0.01 0.20 0.44 0.34 0.33 4 - 12 9 0.90 0.87 0.79 0.81 0.81 0.18 0.19 0.30 0.31 0.32 4 - 13 9 0.89 0.80 0.78 0.87 0.87 0.18 0.32 0.33 0.19 0.18 4 - 14 9 0.88 0.79 0.73 0.85 0.86 0.19 0.34 0.44 0.19 0.18 4 - 15 10 0.97 0.86 0.73 0.80 0.79 0.05 0.23 0.43 0.34 0.34 4 - 16 10 0.89 0.87 0.79 0.81 0.80 0.19 0.22 0.36 0.33 0.33 4 - 17 10 0.88 0.81 0.77 0.86 0.88 0.19 0.34 0.37 0.23 0.22 4 - 18 10 0.89 0.80 0.71 0.86 0.86 0.19 0.33 0.45 0.22 0.23 4 - 19 12 0.97 0.94 0.92 0.94 0.94 0.01 0.01 0.01 0.01 0.01 104 Table 31 : NPC results for H4.5 Q CCR Pr(tie) 4 - 1 4 0.81 0.76 0.77 0.70 0.73 0.03 0.08 0.08 0.09 0.16 4 - 2 5 0.81 0.79 0.77 0.70 0.75 0.18 0.18 0.26 0.23 0.10 4 - 3 5 0.81 0.78 0.79 0.72 0.80 0.03 0.16 0.16 0.27 0.04 4 - 4 5 0.80 0.80 0.80 0.72 0.79 0.01 0.03 0.03 0.14 0.14 4 - 5 5 0.80 0.76 0.75 0.73 0.73 0.02 0.09 0.10 0.03 0.30 4 - 6 6 0.88 0.86 0.83 0.75 0.76 0.01 0.02 0.09 0.09 0.11 4 - 7 6 0.81 0.84 0.83 0.81 0.78 0.02 0.08 0.08 0.11 0.04 4 - 8 6 0.82 0.80 0.80 0.78 0.85 0.00 0.02 0.02 0.02 0.04 4 - 9 6 0.80 0.77 0.77 0.73 0.79 0.02 0.09 0.09 0.04 0.16 4 - 10 8 0.80 0.79 0.80 0.71 0.80 0.33 0.34 0.34 0.46 0.34 4 - 11 9 0.87 0.87 0.86 0.78 0.80 0.18 0.18 0.19 0.33 0.33 4 - 12 9 0.81 0.87 0.87 0.84 0.82 0.32 0.18 0.17 0.19 0.31 4 - 13 9 0.81 0.81 0.80 0.78 0.87 0.33 0.31 0.32 0.32 0.19 4 - 14 9 0.81 0.80 0.81 0.73 0.85 0.33 0.33 0.34 0.42 0.20 4 - 15 10 0.87 0.88 0.86 0.77 0.81 0.22 0.22 0.23 0.37 0.33 4 - 16 10 0.81 0.87 0.86 0.84 0.81 0.34 0.23 0.22 0.26 0.33 4 - 17 10 0.81 0.80 0.81 0.79 0.87 0.33 0.33 0.33 0.36 0.23 4 - 18 10 0.81 0.80 0.80 0.73 0.85 0.33 0.35 0.34 0.44 0.23 4 - 19 12 0.94 0.94 0.94 0.91 0.94 0.00 0.01 0.01 0.01 0.01 105 Table 32 : NPC results for H5.1 Q CCR Pr(tie) 5 - 1 5 0.58 0.58 0.60 0.59 0.59 0.59 0.00 0.00 0.00 0.00 0.00 0.00 5 - 2 6 0.58 0.59 0.58 0.59 0.59 0.60 0.18 0.17 0.18 0.18 0.19 0.18 5 - 3 6 0.60 0.59 0.62 0.60 0.61 0.61 0.03 0.16 0.23 0.24 0.25 0.23 5 - 4 6 0.58 0.59 0.57 0.64 
0.62 0.64 0.00 0.02 0.16 0.29 0.30 0.29 5 - 5 6 0.60 0.60 0.59 0.58 0.66 0.66 0.00 0.00 0.02 0.13 0.34 0.33 5 - 6 6 0.59 0.58 0.59 0.59 0.56 0.69 0.00 0.00 0.00 0.02 0.12 0.35 5 - 7 7 0.63 0.62 0.63 0.64 0.65 0.63 0.00 0.00 0.00 0.00 0.00 0.00 5 - 8 7 0.60 0.62 0.69 0.70 0.68 0.69 0.01 0.08 0.02 0.02 0.02 0.02 5 - 9 7 0.58 0.60 0.60 0.76 0.75 0.76 0.00 0.02 0.07 0.03 0.03 0.03 5 - 10 7 0.59 0.58 0.59 0.60 0.81 0.81 0.00 0.00 0.01 0.06 0.04 0.04 5 - 11 7 0.58 0.57 0.59 0.59 0.61 0.88 0.00 0.00 0.00 0.01 0.06 0.06 5 - 12 10 0.59 0.59 0.60 0.59 0.59 0.60 0.63 0.63 0.63 0.63 0.63 0.63 5 - 13 11 0.65 0.64 0.64 0.63 0.64 0.64 0.54 0.56 0.55 0.54 0.55 0.55 5 - 14 11 0.59 0.62 0.69 0.68 0.69 0.69 0.63 0.55 0.44 0.46 0.45 0.45 5 - 15 11 0.58 0.60 0.64 0.76 0.75 0.76 0.64 0.63 0.53 0.33 0.34 0.33 5 - 16 11 0.60 0.59 0.60 0.62 0.81 0.81 0.64 0.62 0.62 0.56 0.21 0.21 5 - 17 11 0.58 0.59 0.59 0.59 0.63 0.89 0.64 0.64 0.63 0.63 0.54 0.05 5 - 18 12 0.63 0.64 0.63 0.64 0.64 0.63 0.58 0.57 0.56 0.57 0.58 0.57 5 - 19 12 0.59 0.63 0.69 0.69 0.68 0.68 0.64 0.57 0.49 0.50 0.51 0.51 5 - 20 12 0.59 0.60 0.64 0.75 0.76 0.75 0.63 0.63 0.56 0.40 0.40 0.40 5 - 21 12 0.59 0.59 0.58 0.63 0.83 0.83 0.63 0.62 0.63 0.57 0.29 0.28 5 - 22 12 0.58 0.61 0.60 0.59 0.62 0.89 0.64 0.61 0.63 0.63 0.57 0.17 5 - 23 15 0.87 0.88 0.87 0.86 0.87 0.88 0.00 0.00 0.00 0.00 0.00 0.00 106 Table 33 : NPC results for H5.2 Q CCR Pr(tie) 5 - 1 5 0.85 0.76 0.72 0.73 0.76 0.85 0.10 0.10 0.17 0.16 0.10 0.11 5 - 2 6 0.89 0.77 0.75 0.72 0.76 0.84 0.17 0.24 0.11 0.17 0.09 0.10 5 - 3 6 0.87 0.79 0.77 0.76 0.76 0.85 0.04 0.16 0.22 0.11 0.10 0.09 5 - 4 6 0.84 0.79 0.75 0.76 0.78 0.85 0.11 0.04 0.22 0.22 0.04 0.10 5 - 5 6 0.85 0.76 0.76 0.76 0.78 0.88 0.09 0.10 0.12 0.23 0.16 0.04 5 - 6 6 0.84 0.76 0.72 0.76 0.76 0.89 0.11 0.09 0.17 0.11 0.25 0.17 5 - 7 7 0.96 0.82 0.77 0.74 0.77 0.84 0.03 0.11 0.11 0.16 0.09 0.10 5 - 8 7 0.89 0.86 0.82 0.76 0.76 0.85 0.03 0.02 0.10 0.11 0.09 0.09 5 - 9 7 0.84 0.80 0.82 0.82 0.80 0.85 0.09 0.03 0.12 0.11 0.03 0.09 5 - 10 7 0.85 0.77 0.76 0.83 0.86 0.89 0.09 0.09 0.11 0.11 0.02 0.02 5 - 11 7 0.85 0.76 0.73 0.77 0.83 0.96 0.09 0.09 0.16 0.10 0.09 0.03 5 - 12 10 0.89 0.79 0.79 0.80 0.80 0.89 0.19 0.34 0.35 0.35 0.33 0.19 5 - 13 11 0.97 0.87 0.80 0.79 0.81 0.89 0.01 0.18 0.33 0.33 0.32 0.19 5 - 14 11 0.89 0.87 0.86 0.80 0.80 0.89 0.17 0.19 0.19 0.33 0.34 0.20 5 - 15 11 0.89 0.82 0.87 0.86 0.81 0.88 0.18 0.32 0.19 0.19 0.32 0.20 5 - 16 11 0.88 0.81 0.80 0.86 0.87 0.89 0.20 0.32 0.32 0.19 0.18 0.17 5 - 17 11 0.89 0.80 0.79 0.80 0.86 0.97 0.19 0.33 0.34 0.33 0.19 0.01 5 - 18 12 0.97 0.86 0.80 0.79 0.80 0.89 0.05 0.23 0.33 0.36 0.33 0.19 5 - 19 12 0.90 0.86 0.86 0.81 0.80 0.89 0.19 0.23 0.23 0.33 0.32 0.19 5 - 20 12 0.89 0.81 0.86 0.87 0.80 0.89 0.18 0.33 0.24 0.22 0.35 0.19 5 - 21 12 0.88 0.80 0.79 0.86 0.88 0.90 0.20 0.33 0.34 0.23 0.22 0.18 5 - 22 12 0.89 0.79 0.80 0.79 0.86 0.97 0.18 0.34 0.33 0.34 0.23 0.05 5 - 23 15 0.97 0.95 0.93 0.94 0.94 0.97 0.01 0.01 0.01 0.02 0.01 0.01 107 Table 34 : NPC results for H5.3 Q CCR Pr(tie) 5 - 1 5 0.84 0.68 0.62 0.70 0.73 0.73 0.10 0.20 0.09 0.07 0.02 0.01 5 - 2 6 0.86 0.69 0.64 0.69 0.73 0.73 0.18 0.34 0.04 0.09 0.02 0.00 5 - 3 6 0.87 0.77 0.65 0.71 0.72 0.71 0.06 0.15 0.13 0.03 0.01 0.00 5 - 4 6 0.84 0.72 0.62 0.72 0.74 0.74 0.11 0.16 0.23 0.17 0.17 0.17 5 - 5 6 0.83 0.68 0.63 0.68 0.75 0.76 0.11 0.22 0.11 0.23 0.25 0.24 5 - 6 6 0.84 0.68 0.62 0.69 0.71 0.78 0.11 0.21 0.09 0.11 0.17 0.31 5 - 7 7 0.95 0.74 0.64 0.70 0.73 0.72 0.03 0.22 0.04 0.07 0.02 0.00 5 - 8 7 0.88 0.84 0.70 0.72 
0.72 0.72 0.03 0.06 0.02 0.02 0.01 0.00 5 - 9 7 0.84 0.71 0.67 0.79 0.79 0.79 0.10 0.18 0.08 0.02 0.01 0.00 5 - 10 7 0.83 0.68 0.62 0.72 0.86 0.86 0.11 0.22 0.10 0.14 0.02 0.02 5 - 11 7 0.83 0.69 0.62 0.69 0.74 0.93 0.12 0.22 0.10 0.09 0.09 0.03 5 - 12 10 0.88 0.80 0.64 0.72 0.73 0.73 0.19 0.34 0.55 0.45 0.45 0.44 5 - 13 11 0.97 0.84 0.66 0.71 0.73 0.73 0.01 0.20 0.53 0.46 0.45 0.45 5 - 14 11 0.90 0.87 0.70 0.73 0.72 0.74 0.18 0.18 0.45 0.45 0.46 0.43 5 - 15 11 0.89 0.78 0.70 0.78 0.79 0.79 0.18 0.34 0.46 0.34 0.33 0.33 5 - 16 11 0.89 0.77 0.66 0.77 0.86 0.85 0.19 0.36 0.55 0.34 0.19 0.19 5 - 17 11 0.88 0.78 0.66 0.71 0.78 0.93 0.20 0.35 0.54 0.46 0.31 0.02 5 - 18 12 0.97 0.85 0.65 0.72 0.72 0.73 0.05 0.24 0.55 0.47 0.46 0.45 5 - 19 12 0.90 0.87 0.71 0.73 0.73 0.73 0.18 0.23 0.47 0.46 0.44 0.45 5 - 20 12 0.89 0.78 0.69 0.78 0.79 0.80 0.19 0.35 0.49 0.37 0.37 0.35 5 - 21 12 0.88 0.78 0.67 0.77 0.85 0.85 0.19 0.35 0.54 0.38 0.25 0.24 5 - 22 12 0.88 0.79 0.66 0.72 0.78 0.93 0.19 0.34 0.54 0.46 0.37 0.11 5 - 23 15 0.97 0.93 0.89 0.91 0.91 0.92 0.00 0.02 0.01 0.01 0.00 0.00 108 Table 35 : NPC results for H5.4 Q CCR Pr(tie) 5 - 1 5 0.80 0.66 0.61 0.66 0.69 0.74 0.17 0.15 0.09 0.03 0.10 0.02 5 - 2 6 0.86 0.66 0.64 0.64 0.70 0.72 0.17 0.27 0.04 0.03 0.08 0.02 5 - 3 6 0.83 0.72 0.65 0.66 0.72 0.73 0.12 0.15 0.14 0.14 0.02 0.01 5 - 4 6 0.80 0.66 0.61 0.66 0.71 0.74 0.18 0.17 0.24 0.23 0.19 0.19 5 - 5 6 0.79 0.65 0.62 0.64 0.75 0.74 0.17 0.16 0.11 0.16 0.23 0.25 5 - 6 6 0.80 0.65 0.63 0.66 0.66 0.79 0.17 0.16 0.08 0.05 0.23 0.29 5 - 7 7 0.94 0.70 0.65 0.66 0.69 0.73 0.05 0.16 0.03 0.02 0.08 0.02 5 - 8 7 0.83 0.76 0.70 0.70 0.72 0.72 0.12 0.05 0.03 0.01 0.02 0.00 5 - 9 7 0.80 0.64 0.64 0.76 0.78 0.78 0.18 0.17 0.14 0.02 0.02 0.01 5 - 10 7 0.79 0.66 0.61 0.68 0.86 0.85 0.18 0.16 0.10 0.10 0.03 0.02 5 - 11 7 0.80 0.65 0.63 0.65 0.70 0.93 0.17 0.16 0.08 0.05 0.14 0.04 5 - 12 10 0.88 0.72 0.65 0.66 0.73 0.72 0.19 0.46 0.56 0.54 0.44 0.45 5 - 13 11 0.97 0.77 0.67 0.67 0.71 0.73 0.01 0.33 0.53 0.54 0.46 0.44 5 - 14 11 0.89 0.79 0.72 0.71 0.72 0.73 0.18 0.33 0.44 0.45 0.44 0.45 5 - 15 11 0.87 0.71 0.71 0.77 0.78 0.79 0.21 0.45 0.44 0.32 0.32 0.32 5 - 16 11 0.88 0.72 0.66 0.71 0.85 0.85 0.19 0.45 0.55 0.43 0.20 0.19 5 - 17 11 0.87 0.71 0.65 0.66 0.77 0.92 0.20 0.46 0.53 0.54 0.33 0.02 5 - 18 12 0.97 0.77 0.66 0.66 0.72 0.73 0.06 0.37 0.54 0.54 0.46 0.44 5 - 19 12 0.89 0.78 0.69 0.71 0.72 0.73 0.19 0.37 0.49 0.48 0.45 0.45 5 - 20 12 0.87 0.73 0.69 0.78 0.79 0.79 0.21 0.44 0.48 0.36 0.36 0.36 5 - 21 12 0.88 0.72 0.65 0.72 0.86 0.86 0.20 0.45 0.54 0.46 0.25 0.24 5 - 22 12 0.87 0.71 0.65 0.67 0.77 0.93 0.20 0.46 0.55 0.53 0.37 0.11 5 - 23 15 0.96 0.91 0.89 0.90 0.92 0.93 0.01 0.02 0.01 0.00 0.01 0.00 109 Table 36 : NPC results for H5.5 Q CCR Pr(tie) 5 - 1 5 0.73 0.73 0.70 0.62 0.68 0.83 0.00 0.02 0.08 0.09 0.20 0.11 5 - 2 6 0.73 0.73 0.69 0.63 0.72 0.84 0.19 0.19 0.24 0.24 0.18 0.11 5 - 3 6 0.73 0.71 0.71 0.65 0.76 0.84 0.03 0.16 0.30 0.30 0.12 0.11 5 - 4 6 0.72 0.73 0.69 0.67 0.78 0.83 0.01 0.03 0.15 0.33 0.06 0.11 5 - 5 6 0.73 0.73 0.72 0.66 0.78 0.87 0.00 0.01 0.03 0.13 0.15 0.06 5 - 6 6 0.71 0.73 0.69 0.64 0.70 0.88 0.00 0.02 0.08 0.04 0.32 0.17 5 - 7 7 0.79 0.78 0.74 0.67 0.72 0.84 0.00 0.03 0.08 0.09 0.17 0.11 5 - 8 7 0.73 0.75 0.81 0.74 0.74 0.84 0.02 0.07 0.10 0.10 0.12 0.10 5 - 9 7 0.72 0.72 0.73 0.79 0.78 0.85 0.00 0.02 0.08 0.11 0.06 0.09 5 - 10 7 0.73 0.73 0.72 0.69 0.85 0.88 0.00 0.01 0.02 0.03 0.05 0.03 5 - 11 7 0.73 0.73 0.69 0.64 0.74 0.95 0.00 0.02 0.08 0.04 0.22 0.03 5 - 12 10 0.73 0.74 0.71 
0.65 0.79 0.88 0.44 0.44 0.46 0.55 0.34 0.20 5 - 13 11 0.79 0.79 0.77 0.70 0.79 0.89 0.32 0.32 0.34 0.45 0.34 0.19 5 - 14 11 0.74 0.79 0.84 0.76 0.79 0.89 0.44 0.32 0.19 0.33 0.32 0.19 5 - 15 11 0.73 0.73 0.78 0.82 0.82 0.89 0.44 0.44 0.32 0.19 0.31 0.18 5 - 16 11 0.73 0.72 0.72 0.71 0.87 0.89 0.44 0.44 0.44 0.44 0.19 0.18 5 - 17 11 0.73 0.74 0.73 0.66 0.85 0.97 0.44 0.44 0.44 0.54 0.20 0.01 5 - 18 12 0.79 0.79 0.77 0.70 0.80 0.88 0.35 0.35 0.37 0.46 0.32 0.20 5 - 19 12 0.73 0.79 0.85 0.77 0.80 0.89 0.44 0.36 0.26 0.37 0.33 0.18 5 - 20 12 0.73 0.73 0.77 0.82 0.80 0.89 0.45 0.45 0.36 0.28 0.34 0.19 5 - 21 12 0.73 0.72 0.73 0.70 0.86 0.90 0.44 0.45 0.45 0.48 0.22 0.18 5 - 22 12 0.73 0.72 0.72 0.67 0.85 0.97 0.45 0.45 0.46 0.53 0.25 0.05 5 - 23 15 0.92 0.93 0.92 0.89 0.93 0.97 0.00 0.00 0.01 0.01 0.02 0.01 110 Table 37 : NPC results for H5.6 Q CCR Pr(tie) 5 - 1 5 0.72 0.69 0.65 0.69 0.65 0.79 0.02 0.08 0.03 0.23 0.17 0.17 5 - 2 6 0.73 0.69 0.65 0.73 0.67 0.79 0.18 0.25 0.20 0.17 0.11 0.18 5 - 3 6 0.72 0.71 0.68 0.75 0.71 0.79 0.03 0.15 0.23 0.12 0.05 0.18 5 - 4 6 0.73 0.72 0.66 0.75 0.71 0.83 0.01 0.02 0.15 0.22 0.15 0.13 5 - 5 6 0.73 0.69 0.66 0.71 0.75 0.86 0.02 0.08 0.04 0.26 0.20 0.06 5 - 6 6 0.73 0.69 0.65 0.73 0.66 0.86 0.02 0.08 0.03 0.17 0.28 0.16 5 - 7 7 0.79 0.76 0.71 0.73 0.69 0.80 0.01 0.09 0.02 0.18 0.11 0.17 5 - 8 7 0.73 0.75 0.78 0.75 0.70 0.81 0.02 0.09 0.03 0.12 0.05 0.17 5 - 9 7 0.74 0.72 0.72 0.82 0.78 0.84 0.01 0.03 0.01 0.13 0.05 0.12 5 - 10 7 0.72 0.70 0.66 0.74 0.84 0.87 0.01 0.07 0.03 0.22 0.05 0.04 5 - 11 7 0.73 0.69 0.65 0.72 0.69 0.94 0.02 0.08 0.02 0.18 0.17 0.05 5 - 12 10 0.72 0.72 0.66 0.78 0.71 0.87 0.46 0.46 0.53 0.34 0.47 0.20 5 - 13 11 0.79 0.78 0.70 0.80 0.72 0.88 0.33 0.33 0.46 0.34 0.45 0.20 5 - 14 11 0.73 0.77 0.77 0.80 0.73 0.88 0.45 0.33 0.32 0.32 0.44 0.20 5 - 15 11 0.74 0.71 0.71 0.85 0.79 0.89 0.44 0.45 0.45 0.21 0.33 0.18 5 - 16 11 0.72 0.72 0.66 0.84 0.85 0.89 0.46 0.46 0.54 0.20 0.17 0.18 5 - 17 11 0.73 0.72 0.65 0.80 0.77 0.97 0.46 0.47 0.55 0.34 0.33 0.01 5 - 18 12 0.79 0.76 0.71 0.79 0.72 0.87 0.35 0.38 0.46 0.34 0.44 0.21 5 - 19 12 0.74 0.77 0.76 0.79 0.72 0.89 0.44 0.38 0.38 0.35 0.47 0.19 5 - 20 12 0.73 0.72 0.72 0.85 0.80 0.88 0.46 0.46 0.47 0.24 0.34 0.20 5 - 21 12 0.73 0.73 0.66 0.85 0.86 0.89 0.44 0.44 0.54 0.23 0.25 0.19 5 - 22 12 0.73 0.72 0.67 0.78 0.76 0.97 0.43 0.46 0.54 0.35 0.37 0.06 5 - 23 15 0.92 0.91 0.90 0.93 0.91 0.96 0.00 0.01 0.00 0.02 0.01 0.02 111 5. 5 Discussion Nonparametric classifications could play an important role in formative classroom assessment. Tests developed by the teachers constitute a large part of classroom assessments. With the guidance of psychometric theory, teachers may be able to extract more f ormative feedback. Nonparametric classifications based on CDMs offer solutions to both test construction and result interpretations. The teachers may develop the items under the guidance of CDM - based assessment (Rupp et al., 2010) . However, it is not likel y to collect enough response data in the classroom setting for model estimation (including calibration and classification) . Besides, there are concerns about the invariance properties of model parameters. In response to these limitations, r esearchers have proposed different nonparametric classification methods to produce student results without having to estimat e item parameters ( Chiu & Douglas, 2013; Chiu, Sun, & Bian, 2018 ; Wang & Douglas, 2015 ) . This study adds to the literature by providing insight s into how to construct such a test. 
Q-matrix design is at the center of test construction for both parametric and nonparametric CDM-based tests. Test construction involves practical questions, including how long the test should be and how many items are needed of each type. Note that the discussion in Chapter 3 about equivalent q-vectors and different types of Q-matrices also applies to the nonparametric situation. Generally, Q-matrix designs that work well for MLE classifications also work well for nonparametric classifications; ties in the Hamming distance are the counterpart of equal or similar likelihoods between attribute profiles.

The simulation study compared Q-matrix designs with K to 3K items, where K is the number of attributes. Longer tests were not considered because the intended setting is teacher-developed classroom assessment. It is important to include the single-attribute items for nonparametric classifications. Adding an odd number of multiple-attribute items can increase the CCR for a subset of attribute profiles, whereas adding an odd number of single-attribute items leads to an increased CCR for every attribute profile. It is recommended that a Q-matrix contain an odd number of items of each q-vector. A test with an even number of items of a certain q-vector is generally not substantially better than a test with one fewer item of that q-vector, especially when the item quality is homogeneous. An important implication for teachers is that more items do not necessarily mean more accurate classifications. A single-attribute item is generally more useful than a multiple-attribute one. However, if the classification of certain attribute profiles is of particular interest, then including the corresponding multiple-attribute item in the Q-matrix becomes meaningful in terms of the CCR.

A classroom assessment network can be built in which teachers develop their own items based on CDMs, with the q-vectors and the corresponding curriculum identified. Such items can be collected from teachers to form various item pools, which can later be used for CD-CAT or nonparametric CD-CAT. Finally, this study assumes the DINA model as the underlying CDM. Future research could explore different Q-matrix designs for NPC with other underlying CDMs.

Chapter 6 Item pool design for CD-CAT

6.1 Introduction

Item pool design is an important but often neglected area for CD-CAT. Since item pool design for CD-CAT has not been addressed in the literature, we draw from studies on item pool design for CAT based on IRT models (e.g., Reckase, 2010; Thissen, Reeve, Bjorner, & Chang, 2007; Veldkamp & van der Linden, 2000). The findings for IRT-based CAT are informative because CD-CATs pose essentially the same sequential optimization problem, using CDMs instead of IRT models as the item response model. However, the categorical nature of the latent constructs in CDMs means that new studies are needed for the CD-CAT context. In addition, CD-CAT has different priorities from those of IRT-based CAT. Classroom formative assessments are generally low-stakes tests, so test security issues are not of primary concern; it is acceptable for tests to overlap between students. What matters more is assigning new items to a student each time he or she takes a test during the instructional period. Therefore, different requirements are imposed on item pool design for classroom formative assessments than for high-stakes standardized tests. When a series of formative assessments is needed to support learning, multiple item pools should be constructed.
For example, each unit addresses different attributes, so a new item pool may be needed to support the formative assessment for each unit. Considering the large number of item pools required for one school year and the high cost of item development, it is important to know the minimal size of an item pool that satisfies the purposes of a test. This study proposes an item pool design method for CD-CAT so that the item pool can fully support a test. The proposed method is applied to explore the number of items and item types needed in an item pool for classroom formative assessments under various conditions, and the resulting item pools are evaluated in terms of their performance with a CD-CAT algorithm.

6.2 Method for CD-CAT item pool design

The proposed method for item pool design borrows ideas from Veldkamp and van der Linden (2000) and Reckase (2010) for the item pool design of IRT-based CAT. The core of the method is computer simulation.

6.2.1 The minimum optimal pool

The minimum optimal pool is defined as the smallest item pool that can provide the ideal item at each item-selection step, given the CD-CAT algorithm and the test constraints. The potential item pool in the case of IRT-based CAT has an infinite number of items. A CDM-based item pool, however, has a limited number of item types defined by the q-vectors. For example, an item pool for three independent attributes (H3.1) can have seven item types. For the three attribute hierarchies H3.2, H3.3, and H3.4 there are three, four, and four item types, respectively, under the DINA model, as listed in Tables 7-9. Items within an item type differ only in their item parameters. The output of the item pool design process is the number of items needed for each item type.

In the item writing process it is difficult, if not impossible, to control the level of the item parameters, especially for complicated item response models. Therefore, we start from an idealized situation in item pool design, assuming all items are of equally high or equally low quality; a high-quality condition and a low-quality condition together yield a range of item numbers. The proposed method can be used with any CD-CAT algorithm and any test requirements.

Below is a brief illustration of the proposed method applied to a variable-length CD-CAT. Suppose an examinee with a given true attribute profile is taking a CAT measuring three linear attributes, and the items are calibrated with the DINA model. We further assume that, for all items, the probability of a correct response for examinees who have mastered none of the required attributes lies in one fixed interval and the probability of a correct response for examinees who have mastered all of the required attributes lies in another. The first item is fixed to a particular q-vector. A simulation of the CAT process, using the KL algorithm to select items, administers items until the desired accuracy level is achieved, that is, until the largest posterior probability among the candidate attribute profiles reaches the criterion. The items administered to this examinee are summarized by item type in Table 38. Suppose another examinee with a different true attribute profile takes the test; the items used are also summarized by item type in Table 38. Since the two examinees can use the same items, a union of the two sets of items yields an item pool sufficient for two such examinees. In other words, the maximum number of items from each item type across the examinees constitutes the number of items required for two such examinees. If a third examinee is simulated, the union (maximization) is taken between that examinee's set of items and the union obtained earlier.

Table 38: Item distribution for two hypothetical examinees with different true attribute profiles and the union of the two sets of items

Item type   Examinee 1   Examinee 2   Union/Maximum
(type 1)    2            1            2
(type 2)    0            4            4
(type 3)    0            3            3
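To make the bookkeeping concrete, here is a minimal Python sketch (not the simulation code used in this study) of the union-or-maximum step, using the hypothetical per-examinee counts from Table 38; the item-type labels are placeholders for the q-vectors that define the item types. For the minimum p-optimal pool described below, the maximum would be replaced by the pth percentile of each item type's count distribution across many simulated examinees.

```python
from collections import Counter

def union_of_pools(per_examinee_counts):
    """Element-wise maximum of item-type counts across simulated examinees:
    the smallest pool that could have supplied every one of the simulated tests."""
    pool = Counter()
    for counts in per_examinee_counts:
        for item_type, n in counts.items():
            pool[item_type] = max(pool[item_type], n)
    return dict(pool)

# Item usage by item type for the two hypothetical examinees in Table 38
# ("type 1" etc. are placeholders for the q-vectors defining the item types)
examinee_1 = {"type 1": 2, "type 2": 0, "type 3": 0}
examinee_2 = {"type 1": 1, "type 2": 4, "type 3": 3}

print(union_of_pools([examinee_1, examinee_2]))
# -> {'type 1': 2, 'type 2': 4, 'type 3': 3}, the Union/Maximum column of Table 38
```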
6.2.2 The minimum p-optimal pool

After the test is administered to more examinees, the maximum number of items selected from each item type among all examinees eventually becomes stable, except for a few outliers for whom the test is extremely long. Suppose an item pool is designed for measuring three linear attributes with a given CD-CAT algorithm, and suppose further that all candidate items are of low quality (large guessing and slipping parameters). Simulating 1,000 examinees per attribute profile produces a distribution of the number of items used for each item type. The distribution for one item type is shown in Figure 26. In an extreme case an examinee used 44 items of this type, but 95% of the simulated examinees needed 12 such items or fewer. The maximum numbers of items for the other two item types were 54 and 44, respectively. Therefore, the minimum optimal pool as defined earlier would consist of 44, 54, and 44 items of the three item types. However, considering the need to construct a large number of item pools and the high cost of item development, such an optimal item pool is impractical. If we instead take the pth percentile of each distribution rather than the maximum, the size of the item pool becomes substantially smaller. Such an item pool is called the minimum p-optimal pool.

Figure 26: Distribution of the number of items of one item type in an example

6.3 Simulation design

Two sets of simulations were conducted. The first set applies the proposed item pool design method to construct minimum 95-optimal pools; the second set evaluates the performance of the resulting item pools. We consider item pools involving three attributes, using the attribute hierarchies in Figure 6. Item pools are designed for the following variable-length CD-CAT. All items are calibrated with the DINA model. Following the termination rule in Hsu et al. (2013), the variable-length test is terminated at the stage at which the largest posterior probability among the candidate attribute profiles is greater than or equal to 0.90. The item selection criterion is the posterior-weighted KL index (PWKL) proposed by Cheng (2009); PWKL was chosen because of its popularity and high attribute profile recovery rate (Xu, Wang, & Shang, 2016). The first item in a test was randomly selected from the subset of q-vectors listed for each attribute hierarchy in Table 39.

Table 39: Q-vectors for the first item

Hierarchy   First item
H3.1        (three candidate q-vectors)
H3.2        (one candidate q-vector)
H3.3        (one candidate q-vector)
H3.4        (two candidate q-vectors)

In the simulations for item pool design, item quality was held constant within an item pool, and two item quality levels were simulated; a high-quality pool has smaller guessing and slipping parameters than a low-quality pool. The minimum 95-optimal pools were constructed for both item quality levels. For both sets of simulations, a total of 1,000 examinees were simulated for each true attribute profile, and the CD-CAT algorithm described above was applied to each simulated examinee. Item responses were generated under the DINA model: for each administered item, a uniform random number was drawn and compared with the examinee's correct-response probability on that item to decide whether the response of the examinee to the item is correct.
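As a minimal illustration of this response-generation step (a generic sketch with hypothetical parameter values, not the simulation code used here), the DINA correct-response probability is 1 - s for an examinee who has mastered all required attributes and g otherwise, and the response is obtained by comparing that probability with a U(0,1) draw:

```python
import numpy as np

rng = np.random.default_rng(2019)

def dina_response(profile, q_vector, guess, slip):
    """Generate one DINA item response: an examinee who has mastered every
    required attribute succeeds with probability 1 - slip, anyone else with
    probability guess; the response is 1 if a U(0,1) draw falls below that value."""
    mastered_all = np.all(profile >= q_vector)
    p_correct = (1.0 - slip) if mastered_all else guess
    return int(rng.uniform() < p_correct)

# Hypothetical example: a three-attribute profile and a two-attribute item
alpha = np.array([1, 1, 0])
q = np.array([1, 1, 0])
print(dina_response(alpha, q, guess=0.2, slip=0.2))
```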
To evaluate the performance of the item pool design method, we constructed ten minimum 95-optimal pools for each hierarchy, assuming low item quality. Under each attribute hierarchy, the ten designed item pools were compared with ten random item pools in terms of test length, the percentage of times the precision criterion was met, and the CCR. The random item pools have the same size as the corresponding designed pools, but their Q-matrices were randomly selected from all available q-vectors. For both designed and random item pools, the guessing and slipping parameters were generated from uniform distributions.

6.4 Simulation results

The number of items needed for the minimum 95-optimal pools is shown in Table 40 for the two item quality levels; the Total column gives the size of each pool. The first row of Table 40 describes the item pool designed for three independent attributes (H3.1) assuming low item quality; for example, fifteen items of one item type are required. The second row shows that only four items of that item type are required if item quality is high. To test the performance of the proposed item pool design method, the designed item pools were compared with the random pools, and the statistics are summarized in Table 41. The pools designed for low item quality were used in this comparison because the item parameters for this set of simulations were generated from a uniform distribution with the low item quality as a lower bound.

Table 40: The minimum 95-optimal pools

Item quality   H     Items per item type            Total
Low            3.1   15  15  15  10  10  10   9     84
High           3.1    4   4   4   2   2   2   2     20
Low            3.2   12  18  16                      46
High           3.2    4   4   4                      12
Low            3.3   13  16  17  10                  56
High           3.3    4   4   4   2                  14
Low            3.4   15  15  11  14                  55
High           3.4    4   4   2   4                  14

Table 41: Comparison between the random and designed item pools

Pool       H     Test length   Modified test length   % criterion met   CCR (by attribute profile)
Random     3.1   12.05         9.60                    96.65            0.88 0.91 0.91 0.91 0.91 0.91
Designed   3.1    9.92         9.24                    99.10            0.90 0.92 0.89 0.91 0.93 0.92
Random     3.2    6.40         5.96                    98.89            0.95 0.92 0.92 0.92
Designed   3.2    6.27         5.87                    99.01            0.94 0.91 0.91 0.91
Random     3.3    8.06         7.03                    97.88            0.96 0.89 0.92 0.91 0.92
Designed   3.3    7.52         7.07                    99.11            0.94 0.92 0.90 0.92 0.91
Random     3.4    8.02         7.11                    98.07            0.91 0.92 0.92 0.91 0.91
Designed   3.4    7.45         6.97                    99.00            0.93 0.92 0.93 0.90 0.91
Note: CCRs for two of the attribute profiles under H3.1 are not presented for brevity.

Take H3.1 as an example. The average test length using the random item pools was 12.05, longer than the average test length of 9.92 using the designed pools. The difference in test length is partly due to the percentage of times the precision criterion was met. With random pools, the precision criterion was met in an average of 96.65% of the replications, which means 3.35% of the examinees would have had to take all the items in the pool; with the designed pools, the criterion was met in 99.10% of the cases on average. The modified test length was calculated by excluding the cases in which the precision criterion was never met; after excluding these extreme cases, the designed pools were still associated with slightly shorter tests than the random pools. With either random or designed item pools, the average CCR for each attribute profile was close to or higher than 0.90, the precision criterion. The same conclusion can be drawn for the other attribute hierarchies, except for H3.3, where the modified test length with the designed pools was not lower than that with the random pools.
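For readers who want to reproduce this kind of comparison, the following small Python sketch (hypothetical records, not the code used in this study) computes the summary statistics reported in Table 41; CCR is pooled over examinees here, whereas the table reports it separately by true attribute profile.

```python
import numpy as np

def summarize_runs(records):
    """Summarize simulated CD-CAT runs. Each record is a tuple
    (test_length, criterion_met, classified_correctly)."""
    lengths = np.array([r[0] for r in records], dtype=float)
    met = np.array([r[1] for r in records], dtype=bool)
    correct = np.array([r[2] for r in records], dtype=bool)
    return {
        "test length": lengths.mean(),
        "modified test length": lengths[met].mean(),  # drop runs that never met the criterion
        "% criterion met": 100.0 * met.mean(),
        "CCR": correct.mean(),
    }

# Hypothetical records for three simulated examinees
print(summarize_runs([(9, True, True), (12, True, True), (20, False, False)]))
```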
6.5 Discussion

An important practical question is how many items are needed in a CD-CAT item pool. This type of question belongs to the research area of item pool design. Although numerous item selection methods have been proposed, item pool design has received limited attention. This study aims to guide practitioners who are implementing CD-CAT. The proposed method for item pool design is based on simulation. As Reckase (2010) noted, there is no single correct answer to the question of how big a CAT item pool should be; the proposed method leads to an item pool designed for a specific CD-CAT program. The concept of the minimum optimal pool was introduced but deemed impractical, and the minimum p-optimal pool was defined as a practical item pool design for a formative assessment system. We then demonstrated the construction of minimum p-optimal pools for a variable-length CD-CAT with two item quality levels and four attribute hierarchies. With designed item pools, the precision criterion should be met with shorter tests than with random item pools, which was supported by the simulation results.

Future research may consider item pool design for fixed-length CD-CAT. Another situation worth exploring is one in which a student takes the test multiple times (M = 1, 2, 3, 4) during an instructional period (a couple of weeks) and new items must be administered each time. The value of p in the minimum p-optimal pool was 0.95 in this study, but other values could be used. Another variable that can be manipulated is item quality. We currently assume homogeneous item quality across item types, which is a common setting in simulation studies. However, it is possible that single-attribute and multiple-attribute items tend to have different levels of quality, or that items involving a certain attribute have lower or higher quality than others. Future research may take heterogeneous item quality into consideration, and practical evidence is needed regarding the item quality of different item types.

Most previous studies are built on item pools calibrated with a single CDM; this study used the DINA model. In practice, however, different items are likely to require different processes, which suggests that an item pool may be made up of items following various CDMs (Kaplan, de la Torre, & Barrada, 2015). Recent progress in item-level model selection indices provides a theoretical basis for such item pools (Liu, Andersson, Xin, Zhang, & Wang, 2018; Ma et al., 2015). Suppose multiple-attribute items calibrated with the ACDM are also included as candidate items. Item selection methods based on KL information, such as the PWKL index, would always prefer a single-attribute item to a multiple-attribute item under the ACDM. The current item pool design method would therefore produce an item pool without any ACDM-based multiple-attribute items. The optimal pool needs to be redefined when models are mixed.

APPENDIX

Hierarchies in Two Textbooks

Eureka Math Grade 4 (2015)
Unit 1 (4 weeks): 4.OA.A.1¹, 4.NBT.A.1, 4.NBT.A.2 >* 4.NBT.A.3 >* 4.NBT.B.4, 4.OA.A.3
Unit 2 (1 week): 4.MD.A.1, 4.MD.A.2 >* 4.OA.A.3
Unit 3 (8 weeks): 4.MD.A.2 >* 4.MD.A.3, 4.NBT.A.1, 4.NBT.B.5, 4.NBT.B.6, 4.OA.A.1, 4.OA.A.2, 4.OA.A.3, 4.OA.A.4
Unit 4 (3.3 weeks): 4.G.A.1, 4.G.A.2, 4.G.A.3, 4.MD.C.5, 4.MD.C.6, 4.MD.C.7
Unit 5 (8.4 weeks): 4.MD.A.2, 4.MD.B.4, 4.NBT.A.3 >* 4.NF.A.1, 4.NF.A.2, 4.NF.B.3, 4.NF.B.4, 4.OA.A.2, 4.OA.C.5²
Unit 6 (3.3 weeks): 3.NF.A.3, 4.NF.A.1, 4.NF.A.2, 4.NF.C.5, 4.NF.C.6, 4.NF.C.7, 4.MD.A.1, 4.MD.A.2 >* 4.NBT.A.1
Unit 7 (3.8 weeks): 3.NF.A.1, 4.NF.A.1, 4.NF.B.3, 3.OA.A.1, 3.OA.A.2, 4.OA.A.2, 4.OA.A.3, 4.OA.B.4, 4.MD.A.1, 4.MD.A.2 >* 4.MD.A.3³ >* 4.NBT.A.2, 4.NBT.B.4, 4.NBT.B.5, 4.NBT.B.6

¹ 4.OA.1 is not connected with any other Grade 4 standards in the Coherence Map.
² 4.OA.C.5 is not connected with any other Grade 4 standards in the Coherence Map.
³ 4.MD.A.3 is not connected with any other Grade 4 standards in the Coherence Map.

Engage NY Grade 4 (2014)

Unit 1 (4 days): 4.OA.A.1⁴, 4.NBT.A.1, 4.NBT.A.2
Unit 2 (2 days): 4.NBT.A.2
Unit 3 (4 days): 4.NBT.A.3
Unit 4 (2 days): 4.NBT.B.4, 4.OA.A.3
Unit 5 (4 days): 4.NBT.A.2, 4.NBT.B.4, 4.OA.A.3
Unit 6 (3 days): 4.NBT.A.1, 4.NBT.A.2, 4.NBT.B.4, 4.OA.A.3
Unit 7 (3 days): 4.MD.A.1, 4.MD.A.2
Unit 8 (2 days): 4.MD.A.1, 4.MD.A.2
Unit 9 (3 days): 4.MD.A.3, 4.OA.A.1, 4.OA.A.2, 4.NBT.B.5
Unit 10 (3 days): 4.NBT.B.5
Unit 11 (5 days): 4.NBT.B.5
Unit 12 (2 days): 4.NBT.B.5, 4.OA.A.1, 4.OA.A.2, 4.OA.A.3
Unit 13 (9 days): 4.NBT.B.6, 4.OA.A.3
Unit 14 (4 days): 4.OA.A.4
Unit 15 (9 days): 4.NBT.B.6, 4.OA.A.3, 4.NBT.B.4, 4.NBT.B.6, 4.NBT.A.1
Unit 16 (5 days): 4.NBT.B.5
Unit 17 (4 days): 4.G.A.1
Unit 18 (4 days): 4.MD.C.5, 4.MD.C.6
Unit 19 (3 days): 4.MD.C.7
Unit 20 (5 days): 4.G.A.1, 4.G.A.2, 4.G.A.3
Unit 21 (6 days): 3.NF.A.3, 4.NF.B.4
Unit 22 (5 days): 4.NF.A.1
Unit 23 (4 days): 4.NF.A.2
Unit 24 (6 days): 4.NF.B.3
Unit 25 (8 days): 4.NF.B.3, 4.NF.B.4, 4.NF.A.2, 4.MD.B.4
Unit 26 (6 days): 4.NF.B.3
Unit 27 (6 days): 4.NF.B.4, 4.OA.A.2, 4.MD.B.4
Unit 28 (1 day): 4.OA.C.5⁵
Unit 29 (3 days): 4.NF.C.6
Unit 30 (5 days): 4.NF.C.5, 4.NF.C.6
Unit 31 (3 days): 4.NF.C.7
Unit 32 (3 days): 4.NF.C.5, 4.NF.C.6
Unit 33 (2 days): 4.MD.A.2
Unit 34 (5 days): 4.MD.A.1, 4.OA.A.1, 4.MD.A.2
Unit 35 (3 days): 4.MD.A.2, 4.OA.A.2, 4.MD.A.1, 4.NBT.B.5, 4.NBT.B.6, 4.OA.A.3

⁴ 4.OA.1 is not connected with any other Grade 4 standards in the Coherence Map.
⁵ 4.OA.1 is not connected with any other Grade 4 standards in the Coherence Map.

REFERENCES

American Psychological Association, American Educational Research Association, & National Council on Measurement in Education. (1974). Standards for educational and psychological tests. Washington, DC: American Psychological Association.

Ayers, E., Nugent, R., & Dean, N. (2008). Skill set profile clustering based on student capability vectors computed from online tutoring data. In R. S. J. de Baker, T. Barnes, & J. E. Beck (Eds.), Educational data mining 2008: Proceedings of the 1st international conference on educational data mining, Montreal, Quebec, Canada (pp. 210-217). Retrieved from http://www.educationaldatamining.org/EDM2008/uploads/proc/full%20proceedings.pdf

Barnes, T. (2010). Novel derivation and application of skill matrices: The q-matrix method. In C. Romero, S. Ventura, M. Pechenizkiy, & R. S. J. de Baker (Eds.), Handbook of educational data mining (pp. 159-172). Boca Raton, FL: Chapman & Hall.

Beatty, I. D., & Gerace, W. J. (2009). Technology-Enhanced Formative Assessment: A Research-Based Pedagogy for Teaching Science with Classroom Response Technology.
Journal of Science Education and Technology, 18( 2), 146 - 162. Belov, D. I., & Armstrong, R. D. (2009). Direct and inverse problems of item pool design for computerized adaptive testing. Educational and Psychological Measurement, 69 (4), 533 - 547. Bennett, R. E. (2011). Formative assessment: a critical review. Assessment in Education: Principles, Policy & Practice , 18 (1), 5 - 25. Bennett, R. E. (2015). The Changing Nature of Educational Assessment. Review of Research in Education, 39 ( 1), 370 - 407. Brennan, R. L. (2006). Perspectives on the evolution and f uture of educational measurement. In R. Brennan (Ed.), Educational measurement (4th ed., pp. 1 - 16). Westport, CT: American Council on Education and Praeger. Black, P., & Wiliam, D. (1998). Assessment and classroom learning. Assessment in Education: Princip les, Policy & Practice, 5 (1), 7 74. Black, P., Wilson, M., & Yao, S. (2011). Road maps for learning: A guide to the navigation of learning progressions. Measurement: Interdisciplinary Research and Perspectives, 9 , 71 123. Bloom, B. S. (1968). Learning for Mastery. Instruction and Curriculum . Regional Education Laboratory for the Carolinas and Virginia, Topical Papers and Reprints, Number 1. Evaluation comment, 1(2), n2. 130 Bloom, B. S., Hastings, J. T., & Madaus, G. F. (1971 ). Handbook on formative and summative evaluation of student learning . New York: McGraw - Hill. Bock, R. D., Thissen, D., & Zimowski, M. F. (1997). IRT estimation of domain scores. Journal of Educational Measurement, 34 (3), 197 - 211. Brennan, R. L. (1981). Some statistical procedures for domain - referenced testing: a handbook for practitioners . Iowa City, Iowa : Research and Development Division, American College Testing Program . Retrieved from https://searchworks.stanford.edu/view/1312930 Campbell, C. (2013 ). Research on teacher competence in classroom assessment. In J.H. McMillan (Ed.), Sage handbook of research on classroom assessment (pp. 71 - 84) . SAGE, Los Angeles. Center f or K - 12 Assessment and Performance Management at ETS. (2014, March). Coming together to raise achievement: New assessments for the common core state standards . Retrieved from http://www.k12center.org Chang, H. - H. (2 012). Making computerized adaptive testing diagnostic tools for schools. In R. W. Lissitz & H. Jiao (Eds.), Computers and their impact on state assessment: Recent history and predictions for the future (pp. 195 - 226). Charlotte, NC: Information Age Publishi ng. Chang, H. H. (2015). Psychometri cs behind computerized adaptive testing. Psychometrika, 80 , 1 - 20. Chang, H. H., & Ying, Z. (1996). A global information approach to computerized adaptive testing. Applied Psychological Measurement, 20 (3), 213 - 229. Chen, Y., Li, X., Liu, J., & Ying, Z. (2018). Recommendation System for Adaptive Learning. Applied Psychological Measurement, 42 (1), 24 - 41. Cheng, Y. (2009). When cognitive diagnosis meets computerized adaptive testing: CD - CAT. Psychometrika, 74 (4), 619 - 632. Cheng, Y. (2010). Improving cognitive diagnostic computerized adaptive testing by balancing attribute coverage: The modified maximum global discrimination index method. Educational and Psychological Measurement, 70 (6), 902 - 913. Chiu, C.Y., & Köhn , H.F. (2015), Consistency of Cluster Analysis for Cognitive Diagnosis: The DINO Model and the DINA Model Revisited. Applied Psychological Measurement, 39 , 465 - 479. Chiu, C. Y., & Douglas, J. (2013). 
A Nonparametric Approach to Cognitive Diagnosis by Proxi mity to Ideal Response Patterns. Journal of Classification, 30 (2), 225 - 250. Chiu, C. - Y., Douglas, J. A., & Li, X. (2009). Cluster analysis for cognitive diagnosis: Theory and applica tions. Psychometrika, 74 , 633 - 665. 131 Chiu, C. Y., Sun, Y., & Bian, Y. (2018 ). Cognitive Diagnosis for Small Educational Programs: The General Nonparametric Classific ation Method. Psychometrika, 83 , 355 - 375. Clark, I. (2016). Formative assessment: assessment is for self - regulated learning . Educational Psychology Review, 24 (2), 20 5 - 249. Conley, T. D. (2018). The Promise and Practice of Next Generation Assessment . Cambridge, MA: Harvard Education Press. Copp, D. T. (2018). Teaching to the test: a mixed methods study of instructional change from large - scale testing in Canadian schoo ls. Assessment in Education: Principles, Policy & Practice, 25 (5), 468 - 487. de la Torre, J. (2011). The generalized DINA model framework. Psychometrika, 76 , 179 - 199. de La Torre, J., & Karelitz, T. M. (2009). Impact of diagnosticity on the adequacy of models for cognitive diagnosis under a linear attribute structure: A simulation study. Journal of Educational Measurement, 46 (4), 450 - 469. Ding, S. L., Luo, F., Cai, Y., Lin, H. J., & Wang, X. B. (2008). Compleme nt matrix theory. In K. Shigemasu , A. Okada, T. Imaizumi, & T. Hoshino (Eds .), New Trends in Psychometrics (pp. 417 - 423). Tokyo: Universal Academy. Embretson, S. E. (1995). Developments toward a cognitive design system for psychological tes ting. In D. Lupinsky & R. Dawis (Eds.), Assessing individual differences in human behavior (pp. 17 - 48). Palo Alto, CA: Davies - Black Publishing Embretson, S . E . (2003) . The Second Century of Ability Testing: Some Predictions and Speculations. Princeton, NJ: Educational Testing Service. Retrievable at http :// www.ets.org/Media/Research/pdf/PICANG7.pdf . Furtak, E. M., Circi, R., & Heredia, S. C. (2018). Exploring alignment among learning progres sions, teacher - designed formative assessment tasks, and student growth: Results of a four - year study. Applied Measurement in Education, 31 (2), 143 - 156. Fyfe, E. R., & Rittle - johnson, B. (2015). Feedback Both Helps and Hinders Learning: The Causal Role of Prior Knowledge Feedback. Journal of Educational Psychology, 108 (1), 82 - 97. A New Perspective on Gender Differences in Mathematical Sub - Comp etencies. Applied Measurement in Education, 31 (1), 79 - 97. - space model for test development and analysis. Educational Measurement: Issues and Practice, 19 , 34 - 44. Gierl, M. J., & Lai, H. (2012). The role o f item models in automatic item generation. International journal of testing, 12 (3), 273 - 298. 132 Gray, R. M. (2011). Entropy and information theory ( 6 th ed. ) . New York: Springer. Gorin, J. S., & Mislevy, R. J. (2013). Inherent Measurement Challenges in the Next Generation Science Standards for Both Formative and Summative Assessment. Invitational Assessment Symposium, (September), 2 - 39. Retrieved from http://citeseerx.ist.psu.edu/v iewdoc/download?doi=10.1.1.800.5350&rep=rep1&type=pdf Gotwals, A. W. (2018). Where are we now? Learning progressions and formative assessment. Applied Measurement in Education, 31( 2), 157 - 164. Haberman, S. J. (2008). When Can Subscores Have Value? Journal of Educational and Behavioral Statistics, 33 (2), 204 229. Haertel, E. H. (1989). Using restricted latent class models to map the skill structure of achievement items. 
Journal of Educational Measurement, 26 , 333 - 352. Hanna, G. S., & Dettmer, P. (2004). Assessment for effective teaching: Using context - adaptive planning. Boston: Pearson A and B. Harks, B., Klieme, E., Hartig, J., & Leiss, D. (2014). Separating Cognitive and Content Domains in Mathematical Competence. Edu cational Assessment, 19 (4), 243 - 266. Hattie, J., & Timperley, H. (2007). The power of feedback. Review of Educational Research, 77 , 81 - 112. Hefling, K. (January 7, 2015). Do students take too many tests? Congress to weigh question. Associated Press . Retri eved from http://www.pbs.org/newshour/rundown/congressdecide - testing - schools Henson, R., & Douglas, J. (2005). Test construction for cognitive diagnosis. Applied Psychological Measurement, 29 , 262 277. Henson, R., Roussos, L., Douglas, J., & He, X. (2008). Cognitive diagnostic attribute - level discrimination indices. Applied Psychological Measurement, 32 (4), 275 288. Henson, R., DiBello, L., & Stout, B. (2018). A Generalized Approach to Defining Item Discrimination for DCMs. Measurement: Interdisciplinary Re search and Perspectives, 16 (1), 18 - 29. Heritage, M. (2010). Formative assessment and next - generation assessment systems: Are we losing an opportunity? National Center for Research on Evaluation, Standards, and Student Testing (CRESST) and the Council of Ch ief State School Officers (CCSSO). CCSSO: Washington. Hively, W. (1974). Introduction to Domain - referenced Testing. In W. Hively (Ed.), Domain - referenced testing (pp. 16 - 30). Englewood Cliffs, N.J.: Education al Technology Publications . 133 Houang, R. T. (1980) . Estimation of parameters for a latent class model applied to the study of achievement test items (Unpublished doctoral dissertation) . University of California, Santa Barbara, CA. Junker, B. W., & Sijtsma, K. (2001). Cognitive assessment models with few a ssumptions, and connections with nonparametric item response theory. Applied Psychological Measurement, 25 , 258 - 272. Kaplan, M., de la Torre, J., & Barrada, J. R. (2015). New Item Selection Methods for Cognitive Diagnosis Computerized Adaptive Testing. Applied Psychological Measurement, 39 (3), 167 - 188. Kingsbury, C. G., & Zara, A. R. (1991). A Comparison of Procedures for Content - Sensitive Item Selection in Computerized Ada ptive Tests. Applied Measurement in Education, 4 (3), 241 - 261. Köhn, H. - F., & Chiu, C. - Y. (2018). How to Build a Complete Q - Matrix for a Cognitively Diagnostic Test. Journal of Classification, 35 (2), 273 - 299. Kuo, B. C., Pai, H. S., & de la Torre, J. (201 6). Modified Cognitive Diagnostic Index and Modified Attribute - Level Discrimination Index for Test Construction. Applied Psychological Measurement, 40 (5), 315 - 330. Leighton, J. P., Gierl, M. J., & Hunka, S. M. (2004). The Attribute Hierarchy Method for Cog nitive Assessment: A Variation on Tatsuoka s Rule - Space Approach. Journal of Educational Measurement, 41 (3), 205 - 237. Luecht, R. M. (2013). Test Specifications under Assessment Engineering. Journal of Applied Testing Technology, 14 , 1 - 38. Liu, J., Xu, G., & Ying, Z. (2012). Data - Driven Learning of Q - Matrix. Applied Psychological Measurement, 36 (7), 548 - 564. Liu, O. L., Frankel, L., & Roohr, K. C. (2014). Assessing critical thinking in higher education: Current state and directions for next - generation asses sment. RR - 14 - 10 . Princeton, NJ: Educational Testing Service. Liu, R., Huggins - Manley, A. C., & Bradshaw, L. (2017). 
The Impact of Q - Matrix Designs on Diagnostic Classification Accuracy in the Presence of Attribute Hierarchies. Educational and Psychological Measurement, 77 (2), 220 - 240. Liu, Y., Andersson, B., Xin, T., Zhang, H., & Wang, L. (2018). Improved Wald Statistics for Item - Level Model Comparison in Diagnostic Classification Models. Applied Psychological Measurement . https://doi.org/10.1177/0146621618798664 Ma, W., Iaconangelo, C., & de la Torre, J. (2015). Model Similarity, Model Selection, and Attribute Classification. Applied Psychological Measurement, 40 (3), 20 0 - 217. 134 Macready, G. B., & Dayton, C. M. (1977). The use of probabilistic models in the assessment of mastery. Journal of Educational Statistics, 33 , 379 - 416. Mislevy, R. J. (2016). How Developments in Psychology and Technology Challenge Validity Argumentation. Journal of Educational Measurement, 53 (3), 265 - 292. Moreno, R. (2004). Decreasing cognitive load for novice students: Effects of explanatory versus corrective feedback in discovery - based multi media. Instructional Science, 32 , 99 113. Nitko, A.J. (2001). Educational assessment of students (3rd ed.). Upper Saddle River, NJ: Merrill. Pellegrino, J. W., Chudowsky, N., & Glaser, R. (2001). Knowing what students know: The science and design of educat ional assessment . Washington, DC: National Academy Press. Baker, F. B., & Kim, S. H. (2004). Item response theory: Parameter estimation techniques . CRC Press. Reckase, M. D. (2010). Designing Item Pools to Optimize the Functioning of a Computerized Adaptiv e Test. Psychological Test and Assessment Modeling, 52 (2), 127 - 141. Rupp, A., Templin, J., & Henson, R. (2010). Diagnostic measurement: Theory, methods, and applications . New York, NY: Guilford Press. Schmidt, W.H., & McKnight, C. C. (1995). Surveying educational opportunity in mathematics and science: An international perspective. Educational Evaluation and Policy Analysis, 17 (3), 337 - 353. Schmidt, W., Jorde, D., Cogan, L., Barrier, E., Gonzalo, I., Moser, U., Shimizu, K., Sawada, T., Valver de, G., McKnight, C., Prawat, R., Wiley, D., Raizen, S., Britton, E. & Wolfe, R. (1996). Characterizing pedagogical flow . Boston MA: Kluwer Academic Publishers. Schmidt, W.H., McKnight, C.C., Valverde, G.A., Houang, R. T., & Wil ey, D. E. (1996). Many visions , many aims: A cross - national investigation of curricular intentions in school mathematics . Boston: Kluwer Academic. Schmidt, W. H., McKnight, C. C., Valverde, G. A., Houang, R. T. and Wiley, D. E. (1997) . Many Visions, Many Aims: A Cross - National Investig ation of Curricular Intentions in School Mathematics (Dordrecht, The Netherlands: Kluwer). Schutz, P. A., & Pekrun, R. (Eds.). (2007). Emotion in education . Burlington, MA: Academic Press. Scriven, M. (1967). The methodology of evaluation. In R. W. Tyler, R. M. Gagné, & M. Scriven (Eds.), Perspectives of curriculum evaluation (Vol. 1, pp. 39 83). Chicago, IL: Rand McNally Applied Measurement in Education, 21 (4), 293 - 294. 135 Shepard, L. A. (2006). Classroom a ssessment. In R. L. Brennan (Ed.), Educational Measurement (4th ed., pp. 623 - 646). Westport, CT: ACE/Praeger. Shepard, L. A., Penuel, W. R., & Pellegrino, J. W. (2018). Classroom Assessment Principles to Support Learning and Avoid the Harms of Testing. Educational Measurement: Issues and Practice, 37 (1), 52 - 57. Swanson, L. & Stocking, M. L. (1993). A m odel and h euristic for s olving very large i tem s election p roblems . Applied Psychological Measurement, 17 , 151 - 166. Tatsuoka, K. K. 
(1983). Rule space: An a pproach for dealing with misconceptions based on item response theory. Journal of Educational Measurement, 20 , 345 - 354. Templin, J., & Bradshaw, L. (2014). Hierarchical Diagnostic Classification Models: A Family of Models for Estimating and Testing Attribu te Hierarchies. Psychometrika, 79 (2), 317 - 339. Thissen, D., Reeve, B. B., Bjorner, J. B., & Chang, C. H. (2007). Methodological issues for building item banks and computerized adaptive scales. Quality of Life Research, 16 (SUPPL. 1), 109 - 119. Tu, D., Wang, S., Cai, Y., Douglas, J., & Chang, H. ( 2018 ). Cognitive Diagnostic Models With Attribute Hierarchies: Model Estimation With a Restricted Q - Matrix Design. Applied Psychological Measurement. https://d oi.org/10.1177/0146621618765721 U.S. Department of Education. (2014). Secretary's final supplemental priorities and definitions for discretionary grant programs . Retrieved from https://www.fede ralregister.gov/articles/2014/ 12/10/2014 - 28911/secretarys - final - supplemental - priorities - and - definitions - for - discretionary - grant - programs #h - 28. U.S. Department of Education. (2015). Fact Sheet: Testing Action Plan , Washington, D.C. van Der Linden, W. J. (2005a). A Comparison of Item - Selection Methods for Adaptive Tests with Content Constraints. Journal of Educational Measurement, 42 (3), 283 - 302. van der Linden, W. J. (2005b). Linear models for optimal test design . New York: Springer. van der Linden, W. J., & Diao, Q. (2014). Using a universal shadow - test assembler with multistage testing. In D. Yan, A. A. von Davier, & C. Lewis (Eds.), Computerized multistage testing: Theory and applications (pp. 101 - 118). New York, NY: CRC Press. van der Linden, W. J., & Reese, L. (1998). A model for optimal constrained adaptive testing. Applied Psychological Measurement, 22 , 259 - 270. von Davier, M. (2005). A general diagnostic model applied to language testing data, ETS Research Report RR - 05 - 16 . Prin ceton, NJ: Educational Testing Service. Retrieved from http://www.ets. org/Media/Research/pdf/RR - 05 - 16.pdf 136 Walsh, B. (November 3, 2017). When Testing Takes Over: An expert's lens on the failure of high - stakes accountability tests and what we can do to ch ange course . Usable Knowledge. Retrieved from https://www.gse.harvard.edu/news/uk/17/11/when - testing - takes - over Wang, S., & Douglas, J. (2015). Consistency of nonparametric classification in cognitive diagnosis. Psychometrika, 80 (1), 85 - 100. Wang, W., Song , L., Ding, S., Meng, Y., Cao, C., & Jie, Y. (2018). An EM - Based Method for Q - Matrix Validation. Applied Psychological Measurement , 42(6), 446 459 . Way, WD., Steffen, M., & Anderson, G.S. (1998). Developing, maintaining, and renewing the item inventory to support computer - based testing . Paper presented at the colloquium, Computer - Based Testing: Building the Foundation of Ou r Future Assessments, Philadelphia, PA, September 25 - 26, 1998. Willse, J., Henson, R., & Templin, J. (2007). Using sum scor es or IRT in place of cognitive diagnosis models: Can existing or more familiar models do the job? Paper presented at the Annual Meeting of the National Council on Measurement in Education, Chicago, IL. Wilson, M. (2018). Making measurement important for e ducation: The crucial role of classroom assessment. Educational Measurement, 37 (1), 1 37. Xu, G., & Zhang, S. (2016). Identifiability of Diagnostic Classification Models. Psychometrika, 81 (3), 625 - 649. Zimba, J. (2011). 
Examples of structure in the Common mathematical content . Retrieved from http://commoncoretools.me/wpcontent/uploads/2011/07/ccssatlas_2011_07_06_0956_p 1 p2.pdf Zimba, J. (2015, October 29). Coherence Map . Retrieved from www.achievethecore.org/coherence - map