EXAMINING ALIGNMENT INDICES’ VALIDITY AS MEASURES OF TEST CONTENT REPRESENTATIVENESS By Anne Traynor A DISSERTATION Submitted to Michigan State University in partial fulfillment of the requirements for the degree of Measurement and Quantitative Methods—Doctor of Philosophy 2014 ABSTRACT EXAMINING ALIGNMENT INDICES’ VALIDITY AS MEASURES OF TEST CONTENT REPRESENTATIVENESS By Anne Traynor Alignment index values are often presented as evidence that test content is representative of the performance domain defined by a written curriculum. Alignment measures bear on the validity of state achievement test score interpretations, and on test fairness. While alignment reports have been used to document test content distribution, and to generate recommendations for test form improvement that appear sensible to assessment professionals (Schafer, Wang, & Wang, 2009), there is little external evidence that alignment index values are valid quantitative measures of tests’ content representativeness. Using eleven states’ mathematics achievement test item and Surveys of Enacted Curriculum (SEC; Porter, 2002) content analysis data, I examine external validation evidence for the curriculum emphasis measures underlying the coarse-grained SEC test-curriculum alignment index. I then use fractional logit regression models (Papke & Wooldridge, 1996) to assess the relationship between state-level test-curriculum alignment and proportion-correct item difficulty on corresponding test items, controlling for other state and test item characteristics that may affect achievement. This study focuses on evaluating the validity of alignment indices as measures of test-curriculum correspondence, rather than on establishing cutoff criteria on the measures or comparing alignment data collection protocols, although these are also important issues. I find that the content analysis proportions that summarize SEC alignment panelists’ judgments about curriculum objectives’ topics and cognitive demand requirements seem to relate in expected ways to other measures of state curricular emphasis in Grade 4, providing weak external validation support for the mathematics curriculum content analysis data. However, there is no evidence of a statistically or substantively significant relationship between an alignment measure based on SEC coarse-grained content analysis data, and test item difficulty in Grade 8, regardless of the extent to which a particular item’s content type is emphasized by the curriculum. I conclude that although external validation studies for the SEC alignment index have tended to focus on the relationship between curricular alignment and test performance, other types of evidence may yield clearer conclusions, and perhaps be more crucial for demonstrating these indices’ validity. Specifically, I suggest future research to evaluate the content classification schemes implemented by popular alignment methods, and to establish the reproducibility of overall and item-level alignment results across independent panels of qualified experts trained by different facilitators. Copyright by ANNE TRAYNOR 2014 ACKNOWLEDGEMENTS I was fortunate to have the support of my dissertation advisor, Dr. Mark Reckase, and my committee members, Drs. Robert Floden, Richard Houang, and Raven McCrory. Their willingness to entertain my proposal, and to ask questions and offer suggestions contributed markedly to the finished product. I appreciate the assistance of Dr. 
Barbara Schneider, and Michelle Chester in the Office of the Hannah Chair, who maintained and provided access to the NAEP data used in this dissertation. I also wish to acknowledge others who contributed to my development as a scholar during my studies. I am particularly appreciative of the support and involvement of my academic advisor, Dr. Tenko Raykov, in providing feedback and advice as I progressed through my degree program. Dr. Alexander von Eye offered constructive criticism on drafts of some of my earliest work—I am grateful for his encouragement. Dr. Spyros Konstantopoulos provided thoughtful advice as I was completing this dissertation and searching for employment. My fellow students, especially Cheng-Hsien Li and Hyesuk Jang, contributed to my learning throughout my studies, and I am glad to have had their company. Finally, I am thankful for the good counsel of my family and friends, who have supported me as I have completed this dissertation, and in my life. v TABLE OF CONTENTS LIST OF TABLES ..................................................................................................................... viii LIST OF FIGURES ...................................................................................................................... x KEY TO ABBREVIATIONS ...................................................................................................... xi CHAPTER 1: INTRODUCTION OF THE RESEARCH QUESTIONS ..................................... 1 1.1 Test-Curriculum Alignment Evidence as a Federal School Accountability Testing System Requirement ............................................................................................ 4 1.2 Integration of Alignment Review into the Test Development Process............................. 5 1.3 The Surveys of Enacted Curriculum (SEC) and Other Processes for Judging Alignment ......................................................................................................................... 8 1.3.1 The SEC Alignment Method ................................................................................ 9 1.3.2 Other Methods .................................................................................................... 13 1.3.3 Rationale for Studying the SEC Test-Curriculum Alignment Index .................. 16 1.4 Purpose of Study and Research Questions...................................................................... 18 CHAPTER 2: LITERATURE REVIEW .................................................................................... 22 2.1 Alignment Indices as Validity Evidence for Achievement Test Score Interpretations .. 23 2.1.1 State Achievement Test Scores Are Intended to Measure Attainment of Curriculum Goals ................................................................................................ 24 2.1.2 Alignment Evidence is Necessary for Validation of Intended Achievement Test Score Interpretations ................................................................................... 27 2.2 Measuring Test-Curriculum Correspondence: Traditional and Modern Alignment Methods........................................................................................................................... 29 2.2.1 Traditional Evidence of Test-to-Specifications Alignment ................................ 29 2.2.2 Comparison of Traditional and Modern Alignment Methods ............................ 
31 2.3 Connecting Test Items to Curriculum Objectives ........................................................... 32 2.4 Defining Cognitive Demand Categories ......................................................................... 34 2.5 Establishing Alignment Criteria ..................................................................................... 38 2.6 The Validity of Alignment Indices as Evidence of Test Content Representativeness: Previous Empirical Findings .......................................................... 40 2.6.1 Alignment Index Reliability ............................................................................... 44 2.6.2 Rater Agreement ................................................................................................. 45 2.6.3 Rater Interpretation of Curriculum Objective and Test Item Content ................ 50 2.6.4 Rater Interpretation of Test Item and Curriculum Objective Cognitive Demand ............................................................................................................... 53 2.7 The Relationship Between Test-Curriculum Alignment and Student Achievement Test Scores: Previous Empirical Findings ...................................................................... 55 2.7.1 Instruction-Curriculum Alignment and Achievement Test Scores ..................... 56 2.7.2 Instruction-Test Alignment and Achievement Test Scores ................................ 59 2.7.3 Test-Curriculum Alignment and Achievement Test Scores ............................... 62 vi 2.8 2.9 Impact of Federal School Accountability Testing on Alignment ................................... 64 Summary of the Literature and Contribution of This Study ........................................... 69 CHAPTER 3: METHOD ............................................................................................................ 72 3.1 Data ................................................................................................................................. 73 3.1.1 SEC Data............................................................................................................. 75 3.1.2 Third International Mathematics and Science Study 2007 U.S. Benchmarking and National Assessment of Educational Progress 2007 Samples ............................................................................................................... 78 3.1.3 Comparison of TIMSS and NAEP Assessment Frameworks, and SEC Content Coding Categories ................................................................................. 80 3.2 Models............................................................................................................................. 82 3.2.1 Models for Research Question 1 ......................................................................... 87 3.2.2 Models for Research Question 2 ......................................................................... 94 3.2.3 Models for Research Question 3 ......................................................................... 95 3.3 Assumptions about the US Elementary Education System ............................................ 98 3.4 Assumptions of the Statistical Models .......................................................................... 100 3.5 Interpretation ................................................................................................................. 
104 CHAPTER 4: RESULTS .......................................................................................................... 108 4.1 Research Question 1: Are Counts of Curriculum Objectives a Valid Measure of Curricular Emphasis? .................................................................................................... 109 4.1.1 NAEP Grade 4 .................................................................................................. 111 4.1.2 TIMSS Grade 4 ................................................................................................. 120 4.2 Research Question 2: Can the Cognitive Demand Categories of the SEC Content Classification Matrix be Treated as Partially Ordered? ................................................ 124 4.3 Research Question 3: To What Extent are Item-Level Alignment Measures Related to Achievement? .............................................................................................. 125 4.3.1 NAEP Grade 8 .................................................................................................. 126 4.3.2 TIMSS Grade 8 ................................................................................................. 134 4.4 Robustness Check ......................................................................................................... 139 CHAPTER 5: CONCLUSIONS AND DISCUSSION ............................................................. 141 5.1 Research Question 1 ..................................................................................................... 142 5.2 Research Question 2 ..................................................................................................... 143 5.3 Research Question 3 ..................................................................................................... 145 5.4 Accuracy of the Results in the Mathematics Achievement Test Item-State Population ..................................................................................................................... 148 5.5 Defensibility of Assumptions about the US Elementary Education System ................ 152 5.6 Generalizability of the Results to Other Alignment Indices and State Curriculum Documents ................................................................................................. 155 5.7 Suggestions for Future Validation Research................................................................. 158 APPENDIX ............................................................................................................................... 163 REFERENCES ......................................................................................................................... 172 vii LIST OF TABLES TABLE 1: Distributions of NAEP 2007 Test Items by Content Category and Grade Level ..... 79 TABLE 2: Distribution of TIMSS 2007 Grade 4 Test Items by Content Category ................... 79 TABLE 3: Distribution of TIMSS 2007 Grade 8 Test Items by Content Category ................... 80 TABLE 4: Intraclass Correlations of Item Difficulty Values within States and Items, by Data Set .................................................................................................................... 103 TABLE 5: Fractional Logit Regression Predicting State-Specific NAEP Grade 4 Classical Item Difficulty ............................................................................................... 
115 TABLE 6: Fractional Logit Regression Predicting State-Specific NAEP Grade 4 Classical Item Difficulty, by Content Topic ................................................................. 116 TABLE 7: Fractional Logit Regression Predicting NAEP Grade 4 Classical Item Difficulty, by State ........................................................................................................ 117 TABLE 8: Fractional Logit Regression Predicting State-Specific TIMSS Grade 4 Classical Item Difficulty ............................................................................................... 121 TABLE 9: Fractional Logit Regression Predicting State-Specific TIMSS Grade 4 Classical Item Difficulty, by Content Topic ................................................................. 122 TABLE 10: Fractional Logit Regression Predicting TIMSS Grade 4 Classical Item Difficulty, by State ........................................................................................................ 123 TABLE 11: Fractional Logit Regression Predicting State-Specific NAEP Grade 8 Classical Item Difficulty ............................................................................................... 128 TABLE 12: Fractional Logit Regression Predicting State-Specific NAEP Grade 8 Classical Item Difficulty, by Content Topic ................................................................. 130 TABLE 13: Fractional Logit Regression Predicting NAEP Grade 8 Classical Item Difficulty, by State ........................................................................................................ 132 TABLE 14: Fractional Logit Regression Predicting State-Specific TIMSS Grade 8 Classical Item Difficulty ............................................................................................... 135 TABLE 15: Fractional Logit Regression Predicting State-Specific TIMSS Grade 8 Classical Item Difficulty, by Topic ............................................................................... 137 viii TABLE 16: Fractional Logit Regression Predicting TIMSS Grade 8 Classical Item Difficulty, by State ........................................................................................................ 138 TABLE A1: SEC Task Cognitive Demand .............................................................................. 164 TABLE A2: NAEP Item Mathematical Complexity ................................................................ 165 TABLE A3: TIMSS Item Cognitive Domain ........................................................................... 167 TABLE A4: Fractional Logit Regression Predicting State-Specific NAEP Grade 4 Classical Item Difficulty, Dropping One Influential Item ................... 169 TABLE A5: Measures of Average Test-taking Effort by NAEP 2007 Examinees, by Grade and State ........................................................................... 170 TABLE A6: Average Means of Selected State Characteristics for Study and All States in 2007, by Grade .................................................................................. 171 ix LIST OF FIGURES FIGURE 1: Scatterplot of Proportions of Curriculum Objectives and Mean Instructional Emphasis by Mathematics Content Topic for Nine Unidentified States, with Ordinary Least-squares Regression Line ...................................................................... 
112 x KEY TO ABBREVIATIONS AME Average marginal effect BIC Bayesian information criterion CI Confidence interval ESEA Elementary and Secondary Education Act of 1965 IEA International Association for the Evaluation of Educational Achievement NAEP National Assessment of Educational Progress NCTM National Council of Teachers of Mathematics OTL Opportunity to learn TIMSS Trends in International Mathematics and Science Study (formerly Third International Mathematics and Science Study) SEC Surveys of Enacted Curriculum USED US Department of Education VIF Variance inflation factor xi CHAPTER 1: INTRODUCTION OF THE RESEARCH QUESTIONS In the context of state achievement testing for Grades K–12, alignment can be defined as the degree of content correspondence between a test instrument used to measure students’ achievement in a specific subject area, and a state’s curriculum documents for that subject at a given grade level (Webb, 1997). There are two sources of imperfect alignment: some content in the curriculum is not tested, or a test includes some material that is not in the curriculum (La Marca, 2001). Like setting passing scores on tests, or scoring performance tasks, all alignment review methods require human judgment (Rothman, 2003; Crocker & Algina, 1986). Alignment may be reported using qualitative descriptions, or quantitative indices, which are the focus of this paper. A variety of indices have been proposed to quantify the content correspondence of particular test-curriculum document pairs, many of which are based on the topic-by-cognitive complexity classification tables often used to guide item writing and test form assembly. Alignment measures indicate the fidelity of a test to representative sampling from a particular curricular domain (McMaken & Porter, 2012; see also Guion, 1977, regarding content validation evidence). Throughout this paper, unless otherwise stated, I apply the narrowest definition of alignment found in the literature, as a characteristic of particular test-curriculum document pairs, although the term alignment has sometimes been used more broadly, to refer to correspondence between an entire assessment system—all data collection instruments for all grade levels, and their administration, scoring and score reporting—and a given curriculum (e.g., Webb, 1997). Judgments about the overall quality of an assessment system, including accessibility (La Marca, Redfield, & Winter, 2000) and coherence (Rabinowitz, Roeber, Schroeder, & Scheinker, 2006), 1 are thus beyond the scope of my alignment definition, as are judgments about the quality or accuracy of item content (Plake, Impara, & Buckendahl, 2004). Establishing evidence of test relevance to, and representativeness of, the intended achievement domain is a crucial first step in validating test score interpretations (La Marca, 2001). Alignment indices are among many types of test content evidence that are relevant for validation (Martone & Sireci, 2009). They have been used as validation evidence for tests that gauge examinees’ accomplishment of formal, written curriculum expectations, primarily stateadministered educational achievement tests, although in principle they could also be used as evidence for licensure or employment exams that sample content domains based on job analysis. 
Alignment measures bear on the validity of interpretation of students’ test scores as measures of curriculum goal attainment (La Marca, 2001) because the true meaning of a test’s score scale is derived from the observed content domain (Martineau, Paek, Keene, & Hirsch, 2007) represented by the test items. If a test is intended to measure curricular achievement, the observed content domain will differ from that intended domain, and the true meaning of scores will differ from their intended interpretation, to the extent that the test is misaligned with the curriculum (Martineau et al., 2007). Because a test’s degree of alignment to a specific curriculum document can alter test score interpretations, following the validation framework described by Kane (2006, 2013), alignment also secondarily affects the soundness of any proposed test score uses that require those particular interpretations. As well as being salient to educational measurement, alignment between state achievement tests and curriculum documents is important for educational practice because teachers rely on both as indicators of the intended curriculum—state policies on instruction (Porter, 2002). Alignment can be viewed more broadly as the extent to which “expectations and 2 assessments are in agreement and serve in conjunction with one another to guide the [educational] system toward students learning what they are expected to know and do” (Webb, 1997, p. 3). If an external test does not correspond to the curriculum, and incentives are attached to the scores, teachers will tend to align their instruction to the content and features of the assessment, rather than to the intended, written curriculum (Koretz, 2008). Developing methods to quantify the degree of match between a formal curriculum document and a test is necessary to foster adequate correspondence so that educators receive a consistent message about the intended curriculum. Although I recognize the importance of alignment between test content and instruction, also, as a fairness issue for examinees (e.g., Resnick, Rothman, Slattery, & Vranek, 2004), and so an ethical and legal issue for test score users (La Marca, 2001; Phillips & Camara, 2006), since this paper focuses on alignment between test content and a curriculum as an accuracy issue for score interpreters, it does not directly address ethical or legal concerns. It will, however, explicate and account for the relationship between test-curriculum and testinstruction alignment when it is pertinent to my argument. This study has two main purposes. First, I use empirical test-curriculum alignment data from 11 states, together with test item difficulty data and teacher content emphasis data for their student populations, to evaluate the external validity of the curriculum content analysis proportions underlying a popular alignment index as measures of curricular emphasis. Second, I use components of the index to test hypotheses about the relationship between test-curriculum alignment, and state mean test item performance. If the implemented curriculum—instruction— follows the written curriculum, test-curriculum alignment, in conjunction with content emphasis (Gamoran, Porter, Smithson, & White, 1997), should be positively associated with student achievement on relevant tests (Crocker, Miller, & Franks, 1989). 
Evidence of such an 3 association after controlling for other factors that affect students’ performance on achievement items could support the validity of existing alignment indices as measures of test-curriculum correspondence. 1.1 Test-Curriculum Alignment Evidence as a Federal School Accountability Testing System Requirement Educational accountability systems consist of the use of test scores or other measures by government agencies to monitor the educational status and progress of, and to determine the distribution of rewards and sanctions to, schools or individual students (Linn, Baker, & Betebenner, 2002). The No Child Left Behind Act of 2001 amended and reauthorized the federal Elementary and Secondary Education Act of 1965 (ESEA). The 2001 emendation of this public law will henceforth be referred to as the “amended ESEA,” or simply as the “ESEA” for brevity. The amended ESEA compelled states to develop detailed curriculum documents outlining learning expectations for reading and mathematics in each grade 3–8, and assessments aligned to those curricula, which were to be administered in public schools receiving federal ESEA education funding. State departments of education were required to use differences in school mean assessment results across years to identify public schools that produced inadequate gains in test scores, to compel these schools to adopt reform strategies, and to evaluate the effectiveness of the selected reforms (Linn et al., 2002). The amended ESEA effectively created a system of rewards and sanctions for public schools, pressuring teachers and administrators to produce high test scores. The full requirements of the amended ESEA were first effective for the 2005–2006 school year. Any alignment method used to evaluate state achievement tests under the ESEA must, at a minimum, evaluate the degree of both content and cognitive complexity correspondence 4 between tests and curriculum documents (Davis-Becker & Buckendahl, 2013; La Marca et al., 2000; US Department of Education [USED], 2004). The three most commonly-used alignment methods (Martone & Sireci, 2009) are those developed by Webb (1997), by Achieve, Inc. (Resnick et al., 2004), and by Porter and colleagues based on the Surveys of Enacted Curriculum (SEC; Porter, 2002). Davis-Becker and Buckendahl (2013) contend that, because some language used in USED’s (2004) initial guidance to states regarding ESEA alignment requirements was terminology specific to the Webb method, some state departments of education interpreted this to suggest that the Webb method was the “correct” or preferred alignment method. They argue that, given the limited research available on alignment methodology, neither policymakers nor measurement practitioners have enough information to allow definitive comparison of methods, or identification of preferred methods. To judge the adequacy of assessment systems proposed by each state for ESEA compliance, USED relied on independent panels of assessment professionals to review each state’s submission and complete an advisory report, summarizing validity evidence collected by each state, and noting any additional evidence that should be required before approval of a particular assessment system. Schafer, Wang, and Wang (2009) found that as a consequence of the review panels’ evaluations, 23 states were asked to submit results of test-curriculum alignment studies—the most common type of required omitted evidence cited in states’ assessment system proposal decision letters from USED. 
States’ peer review reports implied that alignment results were interpreted by the review panels as evidence that item responses would be influenced by the cognitive processes intended at each grade level, and the decision letters indicated that USED required these results as validation evidence to obtain assessment system approval (Schafer et al., 2009).

1.2 Integration of Alignment Review into the Test Development Process

Alignment between a test and a particular curriculum document is different from correspondence of a test to its specifications. In principle, a test could be judged to adequately match its specifications, but to be entirely unrelated to a curriculum that it was intended to measure. In practice, since test specifications for state achievement tests are developed with reference to a particular curriculum document, the alignment between a test and its specifications bears on the alignment between the test and the target curricular domain. Both test blueprints and test instruments should be aligned to the content standards (Martineau et al., 2007). However, judgments about whether a test is sufficiently aligned to its specifications are typically left to the test developer (Buckendahl, Plake, Impara, & Irwin, 2000), while reviews of alignment between the test and its relevant curriculum are conducted by independent panels of subject-matter experts. Generally, sequential development of a written curriculum, test specifications, and test items, in that order, is more likely than concurrent or differently-ordered development to produce aligned test-curriculum combinations (Webb, 1997).

States typically conduct test-curriculum alignment studies only following major curriculum revision, test modification, or changes in test passing scores (La Marca, 2001; Wyse & Viger, 2011). However, ideally, alignment review should occur regularly during the instrument development and revision cycle (La Marca, 2001). Alignment reviews often detect sections of the curriculum that are over- or underrepresented on assembled test forms or in the item pool (Martineau et al., 2007). Although further item development is unlikely to fully address these gaps, repeated formal alignment review of the items for an annual testing cycle is uncommon and would be costly (La Marca, 2001; Martineau et al., 2007). To improve the efficiency and coherence of the test development process for state assessment systems, Martineau et al. (2007) suggested that alignment should be monitored during the early phases of test development, particularly during item writing, and that formal alignment review of assembled test forms or the item pool should be combined with the phase of item quality review by subject-matter experts. Interpretation of other types of validity evidence, including reliability estimates and factor analysis results, should then be informed by the alignment results (La Marca, 2001).

In practice, alignment review is typically conducted for specific test forms after they are assembled (Wyse & Viger, 2011), but often only a single form is analyzed for each grade level even when multiple forms exist (Polikoff, 2012a, p. 361). The practice of judging and reporting alignment for only one of many test forms is problematic because some nominally-parallel forms may be better aligned to a curriculum than others, as demonstrated among New York Regents Exam forms (Liu & Fulmer, 2008). Schafer et al. (2009; see also Porter, 2002, p.
13) contended that alignment results for a single test form should not be judged as sufficient for ESEA compliance by states that utilize multiple test forms, although current interpretations of the law seem to have deemed alignment results from one form to meet the minimum evidence requirement. La Marca (2001) pointed out that the extent to which individual test instrument, and assessment system, alignment to a curriculum are important depends on how information from the system will inform decision-making. If decisions about students, teachers or schools will be made based on single test scores, individual test-curriculum alignment evidence is most critical to the validity of score interpretation and the soundness of the decisions. If decisions will be made based on multiple measures of curricular attainment (e.g., routine classroom assessment 7 scores; large-scale summative test scores), evidence of overall assessment system alignment to the curriculum may be needed. 1.3 The Surveys of Enacted Curriculum (SEC) and Other Processes for Judging Alignment Arguing for more standardized evaluation of the quality of content domain sampling in test instruments, Guion (1977, p. 7) held that “the notion of content relevance is a quantitative one, even if we currently lack the means of measuring it.” Others view alignment as fundamentally a qualitative issue: “evaluating the quality of alignment requires a holistic judgment. The purposes of the assessments and standards, their use in guiding instruction and decision-making, and other contextual information must be considered in judging whether the degree of alignment is sufficient” (La Marca et al., 2000, p. 24; see also Beck, 2007). Modern alignment procedures integrate these perspectives, generating both numeric indicators and narrative depictions, which are combined into an overall evaluation (e.g., Flowers, Wakeman, Browder, & Karvonen, 2009), although the extent to which the summative evaluation relies on the numeric indicators, and treats them as quantitative, depends on the method. Because modern alignment methods were developed largely without reference to previous conceptualizations of test-domain content correspondence in the validation literature, and each reflects different beliefs about what test or item properties constitute “good” alignment—the methods define alignment differently—isolating important dimensions along which they can be compared is difficult. Bhola, Impara, and Buckendahl (2003) classified alignment methods by their complexity. “Low complexity” alignment methods posit a content model based on a simple ordinal rating of content match between document items (p. 22). This simple model underlies all other alignment methods. “Moderate complexity” methods characterize and rate two distinct aspects of each document task’s content: the topic, and the 8 relative cognitive demand of succeeding on that task (p. 22). In accord with methods used to develop test specifications and write items (e.g., Haladyna, Downing, & Rodriguez, 2002), most alignment procedures adhere to, minimally, such a two-dimensional content model (La Marca, 2001). “High complexity” alignment methods consider further dimensions possibly relevant to judging the coherence of a curriculum-assessment system, such as the correspondence between test administration and instructional conditions, or between the types of performances elicited by test items and those actually stated by the curriculum objectives (Bhola et al., 2003). 
Among alignment methods that have been used for ESEA compliance (Martone & Sireci, 2009), which will be described in the sections that follow, Bhola et al. label the SEC method as moderate complexity, and the Webb and Achieve methods as high complexity. 1.3.1 The SEC Alignment Method SEC alignment reviews implement a matching-type alignment method (D’Agostino et al., 2008) in which judges separately match the items composing a particular test, and the curriculum objectives the test is intended to measure, to a two-dimensional content classification matrix (Porter, 2002). Unlike indices of item-objective congruence and some modern alignment methods that pair items directly with objectives, the SEC method matches items and objectives only indirectly, through their assigned positions in the content matrix. The topic-by-cognitive demand matrices underlying the SEC have their “conceptual origin” in the late 1970s to mid1980s work of the Content Determinants Group of the Institute for Research on Teaching at Michigan State University (Porter, 2002, p. 12). An exemplar of Content Determinants Group’s work is a study by Freeman et al. (1983), which used a three-dimensional 1,260-cell topic-bycognitive-demand-by-mathematical-operation taxonomy to classify the content of mathematics textbooks. The topic and cognitive demand classification schemes used during SEC alignment 9 reviews have been further informed by more recent analyses of K–12 textbooks, standardized tests, state and district curriculum documents, and curriculum recommendations of national professional educators’ organizations (McMaken & Porter, 2012). Revisions of the initial content taxonomy occurred in 2004, and between 2006 and 2007 (Polikoff, 2012a, p. 347). Experts in the subject matter of a given educational document (test or curriculum) are recruited as SEC alignment panelists (Polikoff, 2012a). They may or may not have previous knowledge of any specific curricula to be analyzed. (If a state has maintained a consistent curriculum over time, and SEC content analysis results for the curriculum are already available, it would not need to be re-analyzed [Porter, 2006]). Panelists undertake training by a moderator following a standard protocol before beginning operational coding. While the training process “is largely consistent across groups,” as would be expected some “variation in the level of content expertise and experience” across panels does exist (Smithson & Collares, 2007, p. 3). After training, judges code document content independently of one another (Polikoff, 2012a), although they are given the opportunity to discuss any “flagged” items, usually a small fraction of the total, that “cause confusion for the coding process” (Porter, Polikoff, Zeidner, & Smithson, 2008, p. 4). Panelists are permitted to change their initial coding, but are not required to reach consensus or otherwise encouraged to reconcile their judgments. Polikoff (2012a) details the coding procedures that are used by SEC judges. Test items or objectives—the most specific level of curriculum goal—are coded by both their topic and cognitive demand, which locates their positions in the content matrix. Judges are directed to match each objective to between 1 and 6 cells, and each item to between 1 and 3 cells, in the content matrix (p. 347). 
Although both types of task statements can cover multiple topics and cognitive demand levels, because test items are typically narrower in scope than objectives, their classification process is more tightly restricted (Polikoff, 2012a, p. 347). In coding objectives, judges must accurately interpret the meaning of each of these performance statements that was intended by policymakers. Likewise, in coding items, judges must determine the content needed to correctly answer each, and infer “the most likely approach” that students will use in responding, which is a significant challenge (Porter, 2006, p. 147).

Once panelists have classified each task statement to cell(s) of the content matrix, to generate a standardized data matrix for each panelist for each document analyzed, the maximum score value for each test item is divided equally among content cells to which the item was matched. For example, if a two-point constructed response item is matched to three cells, each cell would be assigned two-thirds of a point (Polikoff, 2012a, p. 347). Each curriculum objective is assumed to have unit weight, which, similarly, would be divided across cells to which that objective was assigned. The weighted item or objective counts in each classification matrix are then converted to proportions by dividing each cell by the maximum test score, or total number of objectives in the curriculum document, respectively. These transformations yield a matrix of content proportions for each judge. The proportion in each cell is an estimate of the relative emphasis of that content category on the test, or in the curriculum (Porter, 2002). All panelists’ matrices of content proportions for a given educational document are, finally, averaged across panelists to produce the aggregate content analysis results for that document (Polikoff, 2012a, p. 347). (Alternatively, if measuring test-instruction or instruction-curriculum alignment was the goal, in this step, matrices of content emphasis proportions for particular instructors could be derived from SEC teacher survey data [e.g., Porter, 2002].)

Aggregate summary data from the content coding process takes the same form for both curriculum and test documents, as proportions of total content in each cell of the matrix (Polikoff, 2012a, p. 347). For any test and curriculum that have been rated, an overall test-curriculum alignment index can be computed. Given two content-by-cognitive demand matrices, reasonable measures of alignment would capture the degree of equality, or association, between proportions in the corresponding cells of the two matrices (Porter, 2002). Porter (2002) suggested two potential alignment indices for SEC data, only one of which has been used in published research. The SEC alignment index is

1 - \frac{\sum_{i}\sum_{j}\left|\pi_{X_{i,j}} - \pi_{Y_{i,j}}\right|}{2},

where \pi_{X_{i,j}} denotes the proportion in cell (i, j) of matrix X and \pi_{Y_{i,j}} denotes the corresponding cell proportion in matrix Y. This index is bounded between 0 and 1, inclusive, with higher values indicating better alignment between the coded assessment and curriculum document. The index is the sum of cellwise intersections between the two content matrices (Porter, 2006). It operationally defines a test as “aligned” with a particular curriculum to the extent that the proportion of test items in each content matrix cell is equivalent to the proportion of objectives in that cell (Porter, 2002).
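To make these computations concrete, the sketch below builds each panelist’s matrix of content proportions from coded task statements, averages the matrices across panelists, and computes the index. The data structures and function names are hypothetical illustrations only; the SEC procedure does not prescribe any particular software.

```python
from collections import defaultdict

def proportion_matrix(codings, total_weight):
    """One panelist's matrix of content proportions for one document.

    `codings` is a list of (weight, matched_cells) pairs: an item's maximum
    score (or 1.0 for a curriculum objective) and the (topic, cognitive_demand)
    cells the panelist assigned it to. Each weight is split equally across its
    matched cells, then scaled by the document total (maximum test score, or
    total number of objectives), so the cell proportions sum to one.
    """
    matrix = defaultdict(float)
    for weight, cells in codings:
        for cell in cells:
            matrix[cell] += weight / len(cells)
    return {cell: value / total_weight for cell, value in matrix.items()}

def average_matrices(panelist_matrices):
    """Aggregate content analysis results: mean cell proportion across panelists."""
    cells = {cell for m in panelist_matrices for cell in m}
    n = len(panelist_matrices)
    return {cell: sum(m.get(cell, 0.0) for m in panelist_matrices) / n
            for cell in cells}

def sec_alignment(test_matrix, curriculum_matrix):
    """Porter's (2002) index: one minus half the summed absolute cell differences."""
    cells = set(test_matrix) | set(curriculum_matrix)
    disagreement = sum(abs(test_matrix.get(c, 0.0) - curriculum_matrix.get(c, 0.0))
                       for c in cells)
    return 1.0 - disagreement / 2.0
```

Because each matrix’s proportions sum to one, the value returned by sec_alignment equals the sum of the cell-wise minima of the two matrices, which is consistent with the description of the index as the sum of cellwise intersections.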
Porter (2002, p. 6) also proposed an alternative index: because both the rows and columns of the matrix are treated as nominal rather than ordinal, one can simply compute the correlation between corresponding cells of matrices X and Y. A third possibility, analogous to a method implemented by Polikoff and Porter (2012) for instruction-curriculum alignment, is to compute alignment based on topic coverage proportions only, collapsing over cognitive demand categories, in case the reliability of judges’ item or objective cognitive demand classifications is in doubt. It is also possible to generate other statistics using SEC content analysis data; for example, Polikoff (2012b, p. 285) describes computation of an index for “focus,” the extent to which the content of a state curriculum document is concentrated in certain topic-cognitive demand cells, rather than diffuse. Distributions of topic and cognitive demand classifications for items or objectives, by rater, can be produced, and diagnostic information regarding particular sources of misalignment can be acquired from content emphasis graphics illustrating proportions of test or curriculum material in various content cells (Porter, 2002).

1.3.2 Other Methods

Webb (1997, 2007) developed the first modern alignment method. In the Webb method, following training, panelists rate curriculum objectives on a four-point ordinal scale of cognitive demand, using verbs in the objectives to distinguish cognitive demand levels. After panelists render independent judgments, a facilitator, who also participates in rating, leads them in reaching consensus about each objective’s demand level (Webb, Alt, Ely, Cormier, & Vesperman, 2005). Reviewers then independently match items to objectives based on their content topic, and rate items on cognitive demand, combining rating- and matching-type alignment methods (D’Agostino et al., 2008). Each item can be matched with up to 3 objectives. The Webb (2007, p. 7) method operationally defines alignment as composed of four aspects, each of which is measured by an index: “depth of knowledge,” “balance of representation,” “categorical concurrence,” and “range of knowledge.” (Webb’s original framework [1997] outlined a very high complexity [Bhola et al., 2003] method that appraised additional test and curriculum properties related to assessment system coherence [Rabinowitz et al., 2006]; however, these other characteristics have rarely been measured during applications of the Webb method [Martone, 2007], and are beyond the scope of Webb’s more recent alignment recommendations.)

Webb’s (1997) depth of knowledge index indicates, for each curriculum strand, the proportion of matched items that require cognitive demand at or above the level of their corresponding objectives, averaged across panelists. It measures how well the test matches the intended, or a more intellectually challenging, curriculum. The balance of representation index for each curriculum strand ranges between 0 and 1, and is based on the difference between the proportion of assessed objectives in the strand represented by a particular objective, and the proportion of items assigned to the strand that are matched to that objective. Its computation assumes that curriculum goals have more than one level of detail, and that the most specific statements, objectives, are comprehensive and equally weighted (Bhola et al., 2003, p. 24). To the extent that objectives are measured by equal numbers of items, the value of the balance of representation index will be near 1.
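As a rough illustration of this computation for a single panelist and a single strand, the sketch below assumes the 1 − (Σ|1/O − I(k)/H|)/2 formulation described in Webb et al.’s (2005) training materials, where O is the number of assessed objectives in the strand, I(k) is the number of items matched to objective k, and H is the total number of item hits for the strand; the function name and inputs are hypothetical, and the operational procedure additionally averages over panelists and applies criterion values not shown here.

```python
def balance_of_representation(items_per_objective):
    """Webb-style balance of representation for one strand and one panelist.

    `items_per_objective` maps each objective in the strand to the number of
    items the panelist matched to it; objectives with no matched items are
    excluded before computing the index.
    """
    hits = {obj: n for obj, n in items_per_objective.items() if n > 0}
    if not hits:
        return 0.0
    num_objectives = len(hits)        # O: assessed objectives in the strand
    total_hits = sum(hits.values())   # H: total item-objective matches
    spread = sum(abs(1.0 / num_objectives - n / total_hits) for n in hits.values())
    return 1.0 - spread / 2.0

print(balance_of_representation({"obj1": 2, "obj2": 2, "obj3": 2}))  # 1.0  (items spread evenly)
print(balance_of_representation({"obj1": 6, "obj2": 1, "obj3": 1}))  # ~0.58 (items concentrated)
```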
If the index is near 0, “then either few objectives are being measured, or the distribution of items across objectives is concentrated on only one or two objectives” (Bhola et al., 2003, p. 24). Two further indices, categorical concurrence and range of knowledge, give average counts of items matched to each content strand or specific objective, respectively, each of which is compared with a minimum criterion value. Because all index computations include data from every item-objective match, regardless of whether a panelist deemed the item and objective to require equal cognitive demand, or whether the item distribution for a particular content strand was balanced, interpretations of the four indices are partially confounded (Webb, Alt, Ely, Cormier, & Vesperman, 2005). While the Webb alignment method yields separate index values for the four alignment aspects for each content strand in a curriculum document, which are usually treated as distinct measures in final alignment evaluation reports, a technique to combine the indices into a single measure has been proposed (Brown & Conley, 2007).

Resnick et al. (2004) describe Achieve, Inc.’s alignment evaluation method. Achieve’s alignment evaluations stem from a broader conception of alignment than either the SEC or Webb (2007) methods: “how well all policy elements in a system work together to guide instruction and, ultimately, student learning” (Resnick et al., 2004, p. 4). To minimize the rater training requirement, Achieve alignment reviews are typically conducted by an experienced external panel, rather than by assessment stakeholders. Initial confirmation or revision of the matches between items and objectives indicated by a test developer’s test specifications is typically conducted by a single senior reviewer. These pre-confirmed item-objective matches are assumed as a basis for the panel’s alignment review. Once the test specifications table is confirmed, panelists examine each item and its relation to the objective designated in the (revised) test specifications. The product of an Achieve alignment review is an evaluation report drawing on information collected from several rating scales, one index, and qualitative assessments of test features. Achieve’s “content centrality” rating scale ranks the extent to which each item’s content matches the content of its corresponding objective (Resnick et al., 2004, p. 6). The “performance centrality” rating scale ranks the extent to which the response process required by each item is consistent with the verb in the corresponding objective (p. 6). The “range” index is the proportion of curriculum objectives with content that is reflected by at least one test item (Resnick et al., 2004, p. 7), a traditional alignment index (Crocker et al., 1989). The “source of challenge” factor determines whether item performance is likely to be unduly influenced by item characteristics that are irrelevant to attaining the corresponding behavioral objective (p. 7). The “level of challenge” factor describes the anticipated difficulty of the set of items measuring a particular curriculum strand for the examinee population; its evaluation assumes that items matched to each goal should be distributed across challenge categories in a grade-level-appropriate manner (p. 7). The “balance” aspect of alignment is a qualitative evaluation of the extent to which objectives subsumed by a particular broad goal or content strand are well represented by their corresponding test item set, with appropriate emphasis on content that reviewers judge to be important at that grade level (p. 7). Evaluating the balance and level of challenge alignment aspects requires reviewers to make explicit value judgments about the nature of content that should comprise states’ K–12 curricula and achievement tests (Rothman, 2003); both, overall, should be “sufficiently challenging” for students (Resnick et al., 2004, p. 6) and should emphasize the “more important” content at each grade level (p. 8).

1.3.3 Rationale for Studying the SEC Test-Curriculum Alignment Index

While alignment reports “have been used successfully to document . . . content representation, as well as to generate recommendations for improvement that seem to make sense” to other assessment professionals, and to those associated with a given testing program (Schafer et al., 2009, p. 182), there is little external evidence that alignment index values are valid quantitative measures of tests’ content representativeness. If assumptions underlying computation of the indices (or the choice of their cutoff criteria; Webb, 2007) are not reasonable in practical situations, it is unlikely that the indices can support accurate conclusions about the degree of test-curriculum alignment. Davis-Becker and Buckendahl (2013) recommended seeking external validity evidence to evaluate an alignment panel’s conclusions based on “connections to results of similar studies or other types of information” (p. 30; see also Crocker, Miller, & Franks [1989] for a similar recommendation regarding traditional content validation measures). However, because the alignment criteria applied by various methods differ in number, content, and interpretation, representing considerably different definitions of “alignment,” different methods cannot be expected to yield consistent decisions about alignment for a particular test-curriculum combination. Thus, it is unclear what conclusions could be drawn from an empirical study of multiple alignment methods that could not be inferred from existing thorough comparisons of the methods’ criteria and procedures (e.g., Martone & Sireci, 2009; Vockley, 2009). Instead, I focus on evaluating the assumptions of a single type of alignment index, and then discuss the extent to which the findings are generalizable to other alignment indices.

The SEC test-curriculum alignment method meets the criteria to be used for ESEA school accountability testing alignment reviews (USED, 2012), and is sometimes employed in education research (e.g., Kurz, Elliott, Wehby, & Smithson, 2010), although perhaps less frequently used by state testing programs than the Webb method (Davis-Becker & Buckendahl, 2013). It does not require use of proprietary training materials, and the data produced is open to public scrutiny. The national SEC data repository allows partnering states and local districts to compare their curricula to those of other states or professional organizations, or to the assessment frameworks of nationwide tests (Porter, McMaken, & Blank, 2011).
Previous alignment study results generally suggest that achieving adequate overall representation of curriculum objectives by a test item set is a more serious problem for state achievement test developers than devising unbiased items that measure at least one objective in the curriculum (Resnick et al., 2004; Webb, 1999). The SEC method produces a single measure of overall curriculum domain representation by a particular test, which simultaneously considers each item or objective’s topic and cognitive demand. The SEC alignment index operationalizes a precise alignment definition that is similar to definitions presented by measurement theorists (Guion, 1977, p. 7). It does not reflect judgments about the educational or societal value of particular curriculum objectives or test items, unlike the Achieve alignment components and Webb depth of knowledge index (Rothman, 2003, p. 23; Webb, 2007), and does not penalize tests that sample from the curriculum domain, rather than exhaustively assessing each objective, unlike the Webb alignment criteria (Bhola et al., 2003). Furthermore, the separate coding of test items and curriculum objectives to a common matrix may improve comparability of results across reviews, since curriculum documents’ specificity does not dictate the level of detail at which topic and cognitive demand categories are defined (Porter, 2006, p. 149), and may reduce the bias that would be likely to result if items and objectives were directly matched (Anderson, 2002). Since theory provides some support for interpretation of the SEC alignment index, this paper seeks external empirical evidence to contribute to validation.

1.4 Purpose of Study and Research Questions

Kane (2013, p. 13) emphasized that test score interpretation validation arguments should focus on identifying, detailing, and evaluating inferences and assumptions that are “most questionable a priori.” Alignment index validation arguments likewise should concentrate on testing questionable assumptions. Measurement theorists have characterized alignment as the extent to which tests’ content emphasis matches the emphasis of a relevant curriculum (La Marca et al., 2000; Poggio, Glasnapp, Miller, Tollefson, & Burry, 1986). Quantifying curriculum content emphasis is recognized to be a complicated issue (Crocker et al., 1989), perhaps without a universal best solution. The calculations for both the SEC and Webb balance of representation test-curriculum alignment indices assume that curriculum objectives are intended to receive equal coverage, so that unweighted counts of objectives, with content as classified by expert panelists, indicate content emphasis for a given curriculum (Porter, 2006; Webb, 2007). McMaken and Porter (2012; see also Porter, 2006) indicate that while treating the count of items by content category as a test content emphasis measure seems reasonable, the assumption that counts of objectives are an accurate measure of intended curricular emphasis could be problematic: in computing the SEC alignment index, each objective is weighted equally, “which we acknowledge may not reflect the intent of the [curriculum] authors but no other clear approach is apparent” (McMaken & Porter, 2012, p. 179). I first examine evidence of relationships between the coarse-grained curriculum proportions from SEC content analyses and concurrent measures of state curricular emphasis to address Research Question 1: Are counts of curriculum objectives a valid measure of curricular emphasis?
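Chapter 4 addresses this question in part by relating the SEC objective proportions, collapsed to content topics, to mean teacher-reported instructional emphasis by state (Figure 1 plots these quantities with an ordinary least-squares regression line). A minimal sketch of that kind of check is shown below; the file and column names are hypothetical placeholders, and clustering standard errors by state is offered as one defensible choice rather than the specification documented in Chapter 3.

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical state-by-topic records: the proportion of curriculum objectives
# coded to each topic (SEC content analysis) and the mean teacher-reported
# instructional emphasis for that topic in the same state.
topics = pd.read_csv("state_topic_emphasis.csv")

# A positive association is consistent with objective counts serving as a
# measure of intended curricular emphasis (Research Question 1).
print(topics["objective_prop"].corr(topics["mean_instr_emphasis"]))

ols_fit = smf.ols("mean_instr_emphasis ~ objective_prop", data=topics).fit(
    cov_type="cluster", cov_kwds={"groups": topics["state"]})
print(ols_fit.params, ols_fit.pvalues)
```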
If counts of objectives serve as a valid measure of curricular emphasis, instruction follows the formal written curriculum, and test item performance is sensitive to instruction, then, given sufficient examinee motivation and controlling for prior ability, test item performance should be positively related to the SEC’s measure of proportional curricular emphasis for the corresponding objectives (see Mehrens & Phillips, 1987, for an analogous hypothesis regarding curriculum emphasis measures derived from textbook analyses). A finding of a substantively significant positive relationship between objective proportions and mean item-level achievement would suggest that all three of these conditions hold. A finding of a negative or null relationship would suggest that at least one of these conditions is false.

Models of curricular learning (e.g., Travers & Westbury, 1989) suggest that teachers’ instructional emphasis reports should be a more proximal measure of state curricular emphasis than students’ test item scores. If counts of objectives serve as a valid measure of curricular emphasis, and instruction follows the formal written curriculum, teacher self-reports of content coverage should be positively related to the SEC’s measure of proportional curricular emphasis for broad content subcategories. A finding of a substantively significant positive relationship would suggest that both conditions hold. A finding of a negative or null relationship would suggest that at least one of these conditions does not hold, or that the accuracy of the teacher survey data is poor.

The SEC alignment index assumes cognitive demand categories are nominal—only topic overlap by specific cognitive demand type is accumulated in the index. Other widely used alignment methods characterize cognitive demand as an ordinal property of test items, and there is some evidence that these methods’ cognitive complexity levels are positively correlated with item difficulty (Schneider, Huff, Egan, Gaines, & Ferrara, 2013). It has been asserted that whether this assumption holds is unlikely to affect “the overall substantive nature” of certain findings based on the index (Porter, Polikoff, & Smithson, 2009, p. 265). My next research question, Research Question 2, asks: Can the cognitive demand categories of the SEC content classification matrix be treated as partially ordered, rather than nominal as assumed by SEC alignment indices? The purpose of Question 2 is to check the appropriateness of the model underlying Question 1. If cognitive demand is best modeled as an ordinal property of test items, such that instruction requiring application of certain more-demanding cognitive processes related to a particular content topic also benefits students’ ability to perform other less-demanding types of cognitive tasks related to the same topic (e.g., Ebel, 1956), models of the relationship between curricular emphasis measures (although not necessarily alignment) and achievement should account for the proportion of curricular content at or above a particular cognitive level. An affirmative conclusion regarding Question 2 would suggest that the Research Question 1 analyses should be repeated accounting for proportions of curriculum objectives at or above a particular item’s cognitive level.

“The use of examinee response data to substantiate the apparent fit between a test and curriculum has been a long-running theme” in discussions of content validation (Crocker et al., 1989, p. 188, citing Gulliksen, 1950, Ebel, 1956, and others).
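Concretely, the Chapter 3 analyses pursue this idea with fractional logit regression models (Papke & Wooldridge, 1996) of state-specific proportion-correct item difficulty. The sketch below illustrates the general form of such a model; the variable names and the particular controls are hypothetical stand-ins rather than the specification reported in Chapter 3, and the quasi-maximum-likelihood fit with robust standard errors is one standard way to estimate it.

```python
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Hypothetical item-by-state records: p_correct is the state-specific classical
# item difficulty (a proportion in [0, 1]); emphasis is the SEC curriculum
# proportion for the item's content cell(s); the remaining columns stand in for
# item and state controls.
items = pd.read_csv("item_state_difficulty.csv")

model = smf.glm("p_correct ~ emphasis + item_complexity + prior_achievement",
                data=items, family=sm.families.Binomial())  # logit link by default
fit = model.fit(cov_type="HC1")  # robust standard errors for the fractional outcome
print(fit.summary())

# Average marginal effect (AME) of emphasis on the proportion correct: for a
# logit link, dE[y]/dx = b * mu * (1 - mu), averaged over observations.
mu = fit.fittedvalues
print((fit.params["emphasis"] * mu * (1 - mu)).mean())
```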
Anderson (2002, p. 259) argued that "curriculum alignment enables us to understand the differences in the effects of schooling on student achievement" across, for example, courses or educational tracks. Test-curriculum alignment has been posited to affect school or state mean test scores, particularly in mathematics. Crocker et al. (1989, p. 188; see also Mehrens & Phillips, 1987) asserted that if various schools' math curricula do not match a particular test equally well, "there will be considerable variation in the schools' mean composite scores" due to variability in the degree of test-curriculum correspondence. It has similarly been reasoned that states with periodic mathematics assessment that is similar in content to the NAEP Mathematics test "might be expected to score higher [on NAEP] because of the alignment of curriculum with NAEP items" (Grissmer, Flanagan, Kawata, & Williamson, 2000, p. 112). If counts of objectives are found to be a reasonable measure of curricular emphasis, my final research question, Research Question 3, will attempt to provide empirical support for hypotheses that mean achievement test scores should increase with test-curriculum alignment: To what extent are item-level alignment measures related to achievement? To the extent that the content emphasis values underlying the SEC alignment index are meaningful, this analysis also responds to a methodological recommendation to incorporate opportunity-to-learn measures, as well as item feature indicators, into models of test item difficulty (Ferrara, Svetina, Skucha, & Davidson, 2011).

Before detailing the methods that will be used to pursue these questions, I summarize the theoretical underpinnings of alignment indices, and previous empirical findings regarding their validity as evidence of test content representativeness, and their relationship to student achievement test scores. My survey of the literature suggests that none of my three research questions have previously been addressed.

CHAPTER 2: LITERATURE REVIEW

Before detailing the methods that I will use to conduct the present study, I review the recent and historical research on alignment indices. I establish that alignment indices are a necessary type of validation evidence for particular types of educational tests. I describe elements of the data collection protocols implemented in various traditional and modern alignment methods. After outlining some similarities and differences among alignment methods frequently used for large-scale curricular achievement tests, I summarize the existing validation research that has been conducted to support the use of alignment indices in operational testing programs. While much of the previous research has focused on rating consistency among the subject-matter experts who are engaged to judge test-curriculum alignment, a few studies have reported on the expert raters' behavior. To situate my research questions in the literature, I emphasize open questions regarding the validity of alignment indices as quantitative measures of test-curriculum correspondence. By examining correlational relationships between Grade 4 mathematics curriculum content analysis data and other measures of state curricular emphasis, this study will contribute to validation of the SEC alignment index. Finally, I introduce previous work regarding the relationship between test-curriculum, instruction-curriculum, and instruction-test alignment and student test performance, particularly in mathematics.
Although a positive relationship between test-curriculum alignment and students' test performance following instruction has been theorized to exist, previous empirical results have been mixed, with some apparently supporting, and others contravening this hypothesis. This study will investigate the influence of test-curriculum alignment on mathematics test item performance by Grade 8 students using recent data from ten US states, as further detailed in Chapter 3.

2.1 Alignment Indices as Validity Evidence for Achievement Test Score Interpretations

The amended ESEA requires that "assessments shall . . . be aligned with the State's challenging academic content and student academic achievement standards, and provide coherent information about student attainment of such standards" (No Child Left Behind Act of 2001). The US Department of Education has further ruled that evidence from formal test-curriculum alignment reviews of assessments of grade-level, modified or alternate curriculum objectives must be submitted prior to approval of state assessment systems. Guidance to state education agencies regarding documentation of their assessment systems specifies that the tests must:

"Cover the full range of content specified in the State's academic content standards, meaning that all the standards are represented legitimately in the assessments; and
Measure both the content (what students know) and the process (what students can do) aspects of the academic content standards; and
Reflect the same degree and pattern of emphasis apparent in the academic content standards (e.g., if academic content standards place a lot of emphasis on operations then so should the assessments); and
Reflect the full range of cognitive complexity and level of difficulty of the concepts and processes described, and depth represented, in the State's academic content standards" (USED, 2004, p. 41, emphases in original text).

As well as being both a statutory and regulatory requirement for state assessment systems, test-curriculum alignment evidence is prescribed by theory for validation of the interpretation of achievement test scores as measures of curricular attainment.

2.1.1 State Achievement Test Scores Are Intended to Measure Attainment of Curriculum Goals

Validity is a quality of particular test score interpretations, not scores or instruments (Messick, 1989; Kane, 2006). Messick's work on validation has often been taken to imply that most, or all, educational test scores are intended to be interpreted as measuring examinees' standing on particular latent constructs, unobservable quantitative characteristics of persons (Peak, 1953). However, some educational test scores are not intended to have construct interpretations; rather they are intended as measures or predictors of observable traits, such as reading comprehension or complex mathematics problem solving (e.g., Kane, 2009; 2013; Millman & Greene, 1989)—the accepted meaning of the tests' item and total scores "derives from their action and outcome," not from posited relations among unobservable constructs, although such relations may exist (Guion, 1977, p. 6). Some measurement professionals interpret the state curriculum objectives in a particular subject area as defining an achievement construct (e.g., Davis-Becker & Buckendahl, 2013), or view the objectives as representing only a subset of the content goals actually intended by a state (Koretz, 2008, p. 85).
But others have concluded that typical contemporary state curriculum documents do not define achievement constructs. Haertel (1985) pointed out that educational outcomes tend to be defined "primarily in terms of their behavioral manifestations, and only secondarily in terms of cognitive processes" (p. 28; see also Guion, 1977). He suggested that subject area achievement outcomes are operationally defined by the objectives listed in state curriculum documents, which is consistent with the assumption of alignment indices that the objectives in a particular document "are intended to span the content of the goals . . . under which they fall" (Webb, 2007, p. 9; see also Haertel, 1985, p. 28). Compared to broader item domains evoked by potential alternative construct characterizations of achievement, state curriculum documents tend to suggest relatively unique item domain specifications that may be most appropriate for measuring observable traits (Haertel, 1985). Furthermore, to compose the test specifications for state achievement tests, test developers typically do not refer to any broader complex achievement construct (Ferrara & Duncan, 2011, pp. 143–144), implying that achievement scores should be considered measures of an observable trait—performance on tasks with academic content—possibly measured with systematic or random error.

Although legislators may wish to ascertain general academic performance—that is, they may perceive state tests' target domain to be, for example, academic mathematics achievement—because tests are usually developed and assembled under the stricture that each item match at least one objective in the state curriculum, the scores' potential universe of generalization (Cronbach, Gleser, Nanda, & Rajaratnam, 1972) effectively covers, at most, listed objectives and their components, but not similar objectives that could potentially have been included in the curriculum. In practice, the target domain for a state achievement test might more reasonably be considered to be the union of the potential task sets corresponding to each curriculum objective, including performance tasks accomplished under a variety of administration conditions. Because, in many states, some objectives are consistently excluded from the item development process for practical reasons (Ferrara & Duncan, 2011), the target domain may be further narrowed to include only potential items related to "testable" curriculum objectives (Martineau et al., 2007). However, making inferences about expected student performance on a stated list of curriculum objectives when tasks are presented in novel or real-life contexts may still be an ambitious goal. The universe of generalization, the description of the behavior types to which test scores can reasonably be generalized (Cronbach et al., 1972), for state achievement tests may be limited to performance on tasks from the target domain that are similar to those called for during testing, e.g., constrained-response items, standardized administration conditions. If curricular domain definition is adequate, and items approximate random sampling from the task universe, a valid limited descriptive inference from the test scores immediately follows: an examinee's estimated true score is the mean proportion of items in the item universe that the examinee would be expected to answer correctly on that measurement occasion under the given administration conditions (Linn, 1980).
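Stated in notation (mine, not Linn's): if the n administered items are a random sample from the item universe and X_pi is examinee p's score on item i, the observed proportion correct estimates the examinee's universe score, the expected proportion correct over the full item universe under the given administration conditions,

\[
\hat{\pi}_{p} = \frac{1}{n}\sum_{i=1}^{n} X_{pi}, \qquad E\!\left[\hat{\pi}_{p}\right] = \pi_{p},
\]

where the expectation is taken over random sampling of items. This is the limited descriptive inference referred to above; it does not by itself license extrapolation beyond the sampled task types and conditions.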
While government entities may wish to extrapolate to performance on a broader domain (under additional administration conditions or on tasks that are not feasible to incorporate in large-scale on-demand testing), these types of inferences generally require evidence beyond that which can be gleaned from statistical scoring models (Kane, 2013). If inferences about classroom or school performance, rather than or in addition to individual student performance, are desired, minimally from an item content perspective, the aggregate item set administered at that level must approximate random sampling from the item universe. Currently, state achievement test scores are intended primarily to produce mean proficiency classifications at the school or classroom level, and secondarily to generate individual proficiency classification estimates. Schools' mean proficiency levels are used as one component of states' school ranking system, ratings from which are disseminated to the public and used to identify underperforming schools. Recently, the federal government has also incentivized states to use classroom-level achievement estimates, either mean scores or proficiency levels, to evaluate teacher performance (e.g., Notice of Final Priorities for Race to the Top Fund, 2009).

2.1.2 Alignment Evidence is Necessary for Validation of Intended Achievement Test Score Interpretations

State achievement tests should "either representatively sample or comprehensively measure" the assessable curriculum objectives in the same proportions as they appear in the complete set of objectives (Martineau et al., 2007, p. 30). Content evidence is necessary for validation of any score interpretation because such evidence connects a test to the trait it is intended to measure, whether conceived as a construct or an observed variable (Yalow & Popham, 1983). Yalow and Popham (1983) argued that it was barely conceivable, and certainly undesirable, to use a test that poorly corresponded to a particular target domain to draw inferences about examinee performance in that domain. Information about test content is highly relevant to the meaning that test users can attribute to scores, and becomes more salient to validation as a test's target domain is increasingly "rooted in behavior with a generally accepted meaning" (Guion, 1977, p. 6; see also Kane, 2009). Whether test scores are interpreted as measures of a construct or an observable trait, evidence that test content is representative of an intended curricular domain is required for validation of score interpretations that refer to the curriculum (Crocker & Algina, 1986). If it can be shown that a test constitutes a representative sample from the content domain of interest, an examinee's score on the test can be expected to reflect how the examinee will perform on domain tasks (Yalow & Popham, 1983). Kane's (2006, 2013) argument-based approach to validation suggests evaluating the plausibility of the claims and assumptions justifying each distinct test score interpretation.
He describes major types of inferences that would be contained in most score interpretation or use arguments, including (a) scoring inferences, in which observed scores (e.g., weighted sum scores) are inferred from observed performance based on particular warrants and their backing, (b) generalization inferences, in which universe scores (e.g., item response modeling "theta" ability estimates) are inferred from observed scores, typically based on a statistical model, and (c) extrapolation inferences, in which domain scores (e.g., standards-based classroom grades predicted from curricular achievement test scores [Welsh, D'Agostino, & Kaniskan, 2013]), which represent some performance conditions that were not sampled or observed, are inferred from universe scores. Because generalization inferences rely on sampling theory, the item sample that composes each test must be argued to represent the item universe, although it rarely could be argued to be random. Alignment indices provide evidence regarding the extent to which a test's item set is a representative sample from the relevant curricular domain (McMaken & Porter, 2012), as would be expected under simple random sampling. To the extent that these indices are accurate measures of test content representativeness, they serve as an important warrant for claims in generalization inferences, and extrapolation inferences that build on these generalization inferences.

The validity of a particular score interpretation can serve as a warrant for score use arguments that involve that interpretation (Kane, 2013). Although evidence of items' sampling adequacy may suggest particular test score interpretations, it does not suggest any particular test score uses, and can support particular uses only indirectly as those uses, often decisions based on the scores (Kane, 2013), involve specific score interpretations. For example, test content that is not taught because it does not appear in the curriculum could influence test scores, and "if one infers that instruction was faulty . . . teachers could be inappropriately blamed" (Mehrens & Phillips, 1986, p. 186). Since alignment index values, and alignment judgments more generally, do not imply any particular uses of state achievement test scores (e.g., diploma conferral, teacher evaluation), in this paper I focus on the indices' contribution to validation of score interpretations.

2.2 Measuring Test-Curriculum Correspondence: Traditional and Modern Alignment Methods

Although some traditional methods for collecting evidence of test content relevance or representativeness include measures and evaluation criteria related to alignment, as I have defined it, the development of alignment procedures during the late 1990s and early 2000s (e.g., Rothman, 2003; Webb, 1997) occurred largely without reference to research on these existing methods (Martone & Sireci, 2009). Early theorists defined alignment as "a function of how well the test content matches the curriculum content domain" (Guion, 1977, p. 7), and proposed assessing it by classifying items into "broad areas" of subject matter and behavioral performance type (Ebel, 1956, p. 275).
Subsequently, numerous quantitative techniques for collecting content-related evidence, all of which rely on matching test items to some representation of the target domain, whether, for example, a broad domain definition, a test specifications document apportioning items to various content categories, or a curriculum document detailing specific behavioral objectives in a content area, were developed (Crocker & Algina, 1986).

2.2.1 Traditional Evidence of Test-to-Specifications Alignment

Traditional methods for collecting validity evidence based on test content focus on three distinct types of match: item-objective congruence, test-instruction congruence, and test-test specifications congruence. Indices of item-objective congruence are computed from the proportion of raters matching each item to its item-writer-intended objective (Sireci, 1998) and the quality of the match (Rovinelli and Hambleton, 1978, cited in Hambleton, 1980; Crocker & Algina, 1986, p. 221; Turner & Carlson, 2003; Sireci & Geisinger, 1992). For any index of item-objective congruence, Crocker, Llabre, and Miller (1988, p. 288) suggested taking the average of mean ratings over items as a measure of test-curriculum match. A serious limitation of these congruence indices is that the matching process tends to be "extremely" time consuming because judges must compare each item to every objective (Crocker et al., 1989, p. 185). Also, the potential magnitudes of these indices are influenced by both the number of raters and the number of objectives, so no fixed criterion for an acceptable value of each item index can be set; situation-specific criteria must be utilized (Crocker et al., 1989).

Jones and Szatrowski (1983) proposed three content validity criterion alternatives, all of which were based on the assumption that the validity of test scores is related to the proportion of examinees in the population who have received instruction relevant to correctly answering each test item. The most complex of their proposed criteria required a minimum level of population exposure to relevant instruction, which was estimated from teacher surveys, covering a user-determined minimum proportion of items within each major subtopic appearing on the test. Klein and Kosecoff (1975) suggested use of the correlation between the importance weighting of each curricular objective and the number of items measuring each objective as an index of test-curriculum match. However, the magnitude of this correlation index is affected by variance in the numbers, or importance weightings, of items corresponding to each objective; the correlation would tend to be reduced as objectives were weighted as equally important, or represented by equal numbers of items (Crocker et al., 1989). A simple index proposed by Rovinelli and Hambleton (1978, cited in Hambleton, 1980) was computed from several raters' item-objective match data. A chi-square independence test on the item-by-objective contingency table composed of counts of raters matching each given item with particular objectives has a straightforward interpretation as a test of whether the item set is significantly associated with the curriculum objectives, perhaps a minimal requirement for establishing test content relevance.
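As a small illustration of the chi-square check just described, the sketch below applies a test of independence to a hypothetical item-by-objective table of rater match counts; the counts are invented for illustration.

    import numpy as np
    from scipy.stats import chi2_contingency

    # Hypothetical counts: rows are test items, columns are curriculum objectives;
    # each cell is the number of raters who matched that item to that objective.
    match_counts = np.array([
        [9, 1, 0],   # item 1: most raters match objective A
        [0, 8, 2],   # item 2: most raters match objective B
        [1, 2, 7],   # item 3: most raters match objective C
        [4, 3, 3],   # item 4: raters disagree
    ])

    chi2, p_value, dof, expected = chi2_contingency(match_counts)
    print(f"chi-square = {chi2:.2f}, df = {dof}, p = {p_value:.4f}")

Rejecting independence indicates only that items are differentially associated with objectives; it does not show that the test covers the objectives in the intended proportions.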
2.2.2 Comparison of Traditional and Modern Alignment Methods

Both traditional and modern alignment methods focus on items as the appropriate unit of analysis from the test instrument, and rely on item matching or rating with reference to a defined target domain (Crocker et al., 1989; D'Agostino et al., 2008). Although not a methodological requirement, most applications of traditional content methods have matched test items to the broad content and cognitive demand categories comprising the test specifications (e.g., Sireci, 1998, p. 300; Schmeiser & Welch, 2006; Crocker & Algina, 1986, p. 219) rather than to the more specific content represented in detailed curriculum objectives (Martone & Sireci, 2009, p. 1336). If content matching is conducted with reference to the test specifications, the matching procedure only addresses correspondence of the test to those curriculum objectives that are covered by the test specifications, and does so only indirectly. Recent alignment methods specifically recommend analyzing the most detailed level of behavioral performances listed in a curriculum document (Porter, 2002; Webb, 2007), and published applications have consistently followed this instruction (e.g., Webb, Alt, Ely, Cormier, & Vesperman, 2005; Roach, McGrath, Wixson, & Talapatra, 2010). However, validation studies using traditional methods have also sometimes conducted content matching with respect to a formal curriculum document (e.g., Klein & Kosecoff, 1975). Traditional content-matching methods that focus on quantifying test content representativeness, such as the indices of item-objective congruence described previously, can be considered alignment methods under my definition, so previous research on the underpinnings and limitations of these methods, although very limited (Crocker et al., 1989), is relevant to hypotheses about alignment index performance.

2.3 Connecting Test Items to Curriculum Objectives

Procedures used to link test item and curriculum objective content can differ along several dimensions, including the types of task features that are considered relevant to judging alignment, the reporting of error variability among panelists, whether an intermediate content classification table is used, and whether connecting items to objectives involves binary matching, ordinal rating, or both. Alignment procedures can be classified into two general categories: (a) methods that directly match items to objectives (e.g., Frisbie, 2003), and (b) methods that estimate the proportion of items that match each objective (e.g., SEC; Davis-Becker & Buckendahl, 2013), which indirectly match items to objectives by directly matching both to a set of test content categories. Anderson (2002) recommended use of a generic content taxonomy table to conduct alignment in any subject area. She argued that mapping educational document content to a generic table should be preferred over directly matching document units (e.g., test tasks, curriculum objectives) because the act of classifying content items "focuses quite directly on student learning," while lessening the tendency of political or personal implications of the results to influence judges' ratings (p. 258). Alignment procedures also vary in the types of item and objective features that are considered in determining alignment.
In the simplest methods, alignment judgments are based only on the extent of correspondence between items' and objectives' content topics, making it "more likely that a match will be found" than if judgments also consider correspondence on cognitive demand or other task features (Bhola et al., 2003, p. 24). Because USED (2004) requires alignment evaluations of states' Title I accountability test systems to account, minimally, for their correspondence with curricular content and cognitive demand, the most frequently used alignment methods in K–12 achievement testing all classify items by content and cognitive demand—low complexity alignment models (Bhola et al., 2003) will not suffice to meet the requirement. Well-documented alignment procedures typically include scripted or written instructions that detail the method all raters should use to identify task topics and cognitive demand, for instance by focusing on tasks' verbs and nouns (Anderson, 2002; D'Agostino et al., 2008), discouraging raters from developing their own idiosyncratic rules (Webb, 1999). The instructions typically also include guidance regarding matching items to multiple objectives, but this tends to be general, rather than situation-specific, and not necessarily consistent with instructions given to item writers (Davis-Becker & Buckendahl, 2013). Development of the methods that combine individual judges' ratings to produce an overall alignment index has tended to emphasize the importance of maintaining independence among raters. Some applications of the SEC procedure have encouraged judges to discuss any problems or questions after initial coding before giving a final coding of each objective (McMaken & Porter, 2012). The Webb (1999) procedure requires judges to reach consensus on the cognitive demand codes for objectives, but directs them to conduct item-objective matching and rate item cognitive demand independently.

Most existing alignment procedures can be classified as implementing rating, in which expert judges rate the strength of correspondence between each test item and a pre-assigned objective, typically from the test specifications, or matching, in which judges determine which objective or objectives from a list most closely corresponds to each test item (D'Agostino et al., 2008). D'Agostino et al. (2008) randomly assigned 49 subject-matter experts either to match high school mathematics achievement test items from Arizona to curriculum objectives or to rate the strength of item-objective links. Raters judged content, cognitive demand and overall consistency between item and objective pairs using a three-point scale, while matchers matched each item to up to three corresponding objectives. The authors found that itemwise alignment decisions made using the rating and matching methods agreed moderately, with a correlation of .59 between average alignment indices for each item. Rating was more time efficient than matching, requiring about 25% less time, but, the authors believed, was more likely than matching to encourage acquiescence or, more generally, rater leniency, particularly if objectives were so broad that they could plausibly be measured by a wide range of items.

2.4 Defining Cognitive Demand Categories

Snow and Lohman (1989) proposed that examinees' observed item performance be interpreted as samples of their cognitive processes, which were inherently unstable, rather than as signals of their standing on a well-defined, although unobservable, latent trait.
They held that the cognitive processes used to complete a task would vary among, and possibly within, examinees on a single measurement occasion, depending on examinees' physical and social situations, as well as their perceptions of the task's components. However, others assert that assuming common instruction of the examinees, it may be possible to average over their responses, treating the cognitive process required for correct response to a test item as a fixed property of the item in a given population. Snow (1994) allowed that as people share a common learning history during socialization, schooling, or job training, "common patterns of ability will be seen to develop," although he believed that classifications based on shared instructional experience "may leave out more important information about persons . . . than they capture" (p. 15). Mislevy (2009) argued that to the extent examinees' context includes common instruction and life experiences "students' propensities for actions in . . . task situations can be said to exist" (p. 100; emphasis in original), producing response patterns that can be modeled. Unfortunately, often, not even instructional histories are known, so cognitive processes used by "even a majority of the test takers" must be inferred by test developers and reviewers from highly distal evidence (Schmeiser & Welch, 2006, p. 316).

Modern alignment methods all rely on classification of test items and curriculum objectives into mutually exclusive categories of cognitive demand or complexity. Cognitive complexity coding schemes may reflect item linguistic features, many of which, in the case of mathematics achievement items, would be considered sources of nuisance response variability unrelated to the trait of interest, item structural features, which are central to measuring the trait (Lepik, 1990), or both. Early content validation methods categorized test items based on their content and the type of performance they required of examinees (Ebel, 1956). Ebel (1956) explicitly stated that the performance categories did not assume use of any particular cognitive processes by examinees, only types of observable performance. Some extant cognitive demand classification schemes encode item features, requiring few assumptions about cognition (e.g., Lepik, 1990; Schneider et al., 2013). However, consistent with modern curriculum development efforts' reliance on taxonomies of cognitive performance to categorize statements of each objective, modern alignment methods require judges to make inferences about examinees' cognitive processing. Item cognitive demand ratings can be defined as "the baseline level of cognitive processing required to provide a correct response" (Wyse & Viger, 2011, p. 188), or as intended to reflect the solution process most examinees, or average examinees, use to solve an item (Schmeiser & Welch, 2006). Cognitive demand is invariant to changes in an item's context, and to modifications affecting only item content, but not the solution process (Wyse & Viger, 2011). Item cognitive demand is distinct from the concept of item difficulty, although ordered cognitive complexity ratings would be expected to have a systematic relationship with observed item difficulty, the average probability of correct response (Embretson & Daniel, 2008; Gorin, 2006; Wyse & Viger, 2011), and tend to be related to observed difficulty in practice (Martineau et al., 2007).
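A minimal check of the expected ordinal relationship just described could correlate ordered cognitive complexity ratings with observed proportion-correct difficulty, as in the sketch below; the ratings and difficulty values are hypothetical.

    import numpy as np
    from scipy.stats import spearmanr

    # Hypothetical data: ordinal cognitive demand ratings (1 = lowest) and
    # observed proportion-correct difficulty for the same ten items.
    demand_rating = np.array([1, 1, 2, 2, 2, 3, 3, 4, 4, 4])
    p_correct = np.array([0.88, 0.81, 0.74, 0.79, 0.70, 0.62, 0.66, 0.51, 0.55, 0.48])

    # Higher-demand items are expected to have lower proportion correct,
    # so a systematic relationship appears as a negative rank correlation.
    rho, p_value = spearmanr(demand_rating, p_correct)
    print(f"Spearman rho = {rho:.2f}, p = {p_value:.3f}")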
Cognitive demand may be viewed as a property of test items, not jointly of test item-examinee population combinations, so that the cognitive demand of a test item does not necessarily change across examinee populations and is independent of the specific curriculum to which each examinee has been exposed, but this perspective also requires the assumption that all test takers are familiar with the general approach to each task (NAGB, 2006). Other interpretations suggest that true item cognitive demand is tied to particular examinee populations, who may tend to reach correct solutions in distinct ways (Roach, Niebling, & Kurz, 2008), depending on their instructional background (Embretson & Daniel, 2008; Schmeiser & Welch, 2006). For example, in states where particular well-known number sequences (e.g., the Fibonacci sequence) are part of the curriculum, related test items may tend to require a lower level of cognitive demand from students than in states where these sequences are not explicitly covered, and students will have to reason to reach the solution (Sanford & Fabrizio, 1999). Even when most examinees follow the same instructional sequence, item cognitive demand ratings will depend on the extent to which a given classification scheme accounts for specific characteristics of the examinee population. Consider, for example, a test item that requires examinees to recall an obscure historical fact, which was an element of all examinees' instruction, but was not highlighted. If the classification scheme focuses raters' attention on the generic type of cognitive process indicated by the verb (here, recall), ratings are likely to be different (in this case, lower, if an ordinal scheme is used) than if the classification scheme directs raters to consider the demand of the specific cognitive process typically activated during examinee-test item interactions in this population.

Ebel (1956) indicated that categories for different types of behavioral performance should be considered at least partially ordered by degree of difficulty. Taxonomies of cognitive performance (e.g., Bloom, 1956) similarly prescribe ordered categories. There is some evidence that test items of varying formats require different types of cognitive performance, and that these performance types can be ordered by complexity (Martinez, 1999). Among alignment methods, the Webb (1997) and Achieve (Resnick et al., 2004) methods represent item cognitive demand as a set of ordered categories, while the SEC method (Porter, 2002) utilizes nominal categories for cognitive demand. While each coding scheme may capture unique elements of a hypothesized item response process, many frameworks' demand category definitions describe similar levels or types of processing; these commonalities may be reflected in relationships between the ratings from various classification schemes. For instance, when two raters applied several different coding schemes including reading load, NAEP mathematical complexity, and Webb depth-of-knowledge to characterize math item cognitive demand, their depth-of-knowledge ratings were significantly positively correlated with their mathematical complexity ratings for both Grade 4 and 8 items, and with their reading load ratings for Grade 4 items (Schneider et al., 2013). Similarly, when considered pairwise, some modern alignment methods' cognitive demand categories, whether characterized as ordinal or nominal, appear to overlap in meaning, but these apparent relationships have not been substantiated empirically.
The use of Bloom's (1956) taxonomy of cognitive levels to guide item and test development has been widely criticized (see, e.g., Hattie, Jaeger, & Bond, 1999, p. 405, for a summary), and the need for an empirically-supported taxonomy of cognitive behaviors to guide item writing has been pointed out (Haladyna et al., 2002; Schmeiser & Welch, 2006). Similar questions can be raised about the cognitive demand categories applied during alignment procedures. Many cognitive complexity item coding schemes, including commonly-used schemes like Webb's (1997) depth-of-knowledge scale and the NAEP mathematical complexity scale, have little or no empirical support (Ferrara, Svetina, Skucha, & Davidson, 2011). For most coding schemes, there is limited evidence that cognitive complexity ratings, and corresponding rating category descriptors, accurately portray aspects of a typical examinee's item response process (Embretson & Daniel, 2008; Webb, 2007). In one empirical study (Schneider et al., 2013), the poor prediction of item difficulty provided by five different cognitive complexity rating schemes was partially attributed to the wide distributions of observed item difficulties in the lowest categories of all the rating schemes, suggesting that some important distinctions among item features were not captured by the descriptors for the lowest rating categories, or that the specific cognitive processes applied by the examinee population did not correspond to the category descriptors. A further caution is that even if the rating category definitions are sound, subject-matter experts are seldom able to accurately predict the cognitive processes used by examinees to solve achievement test items (Ferrara et al., 2004).

2.5 Establishing Alignment Criteria

Alignment results "depend critically" on the definitions of the criteria utilized (Bhola et al., 2003, p. 24). Even if two alignment procedures utilized the same model (e.g., the Webb [1999] "Content" model), with component aspects both labeled and defined identically, alignment decisions would clearly depend on the stringency of the criteria applied to values on each component, or to the overall score. Webb (2007) suggests that if a consistent alignment model is utilized across studies, it may be possible to devise experience-based criteria for desirable alignment index magnitudes. Alternatively, given alignment index values from pairings of many states' curricula and assessments, it may be possible to make normative judgments about alignment magnitude. However, no alignment indices have cutoff criteria that have been devised based on empirical research, or are widely agreed upon (Davis-Becker & Buckendahl, 2013). Summary alignment reports generally reflect the conflict between beliefs that alignment cannot be meaningfully quantified (e.g., Beck, 2007), and that alignment indices possess approximately interval properties and contain particular scale points that can be given meaningful interpretation (e.g., Webb, 2007). Criteria for "acceptable" overall alignment index values rely on assumptions that may be difficult to justify in some testing contexts (Webb, 2007).
While reports often present alignment index values, giving them either absolute (e.g., Webb, 2007) or relative (e.g., Polikoff, 2012a) interpretations, they tend to be situated in a broader evaluative narrative that attends to item balance across particular content types, items flagged as irrelevant, and contextual issues, such as the test purpose, and level of resources available (Webb, 1997) for test and curriculum development. A minimum criterion for overall alignment could be that an alignment index value is significantly greater than would be expected due to chance agreement between the ratings of test items and curriculum objectives covering a given subject matter (Fulmer, 2011). Fulmer (2011) demonstrated a method for estimating critical values of an SEC-type alignment index, computed from proportions in a content taxonomy table, for various given statistical significance levels, table sizes, and numbers of items and objectives. He also verified through simulation that the estimated mean index values expected by chance, if judges coded both tasks and objectives randomly without regard for their content, would tend to decrease with increasing table size for a fixed number of coded items or objectives, and increase with the number of coded items or objectives for a fixed table size. Alignment index values expected by chance would also increase with the number of raters, with the number of items or objectives coded to multiple table cells, and with decreasing rater agreement (Polikoff & Fulmer, 2013).

Alignment criteria focused on individual tasks, rather than entire documents, have also been proposed. La Marca et al. (2000) argued that, minimally, all test items should be relevant to the curricular domain. Gulliksen (1950) asserted that educational tests should not contain items that required novel applications of learned content unless examinees have had previous practice with such new applications, because otherwise the tests would likely be perceived as unfair by examinees, which might negatively affect their attitudes toward test-taking, or future learning in the content area. Webb (1997) concurred with Gulliksen: "expectations and assessments are aligned if what is elicited from students on the assessments is as demanding cognitively as what students are expected to know and do" (p. 15). However, the issue of how to deal with planned item- or objective-level misalignment (e.g., Woolard, 2007, p. 11) in computing alignment indices has not been resolved. Some objectives that cannot be feasibly or efficiently tested by large-scale assessment are consistently omitted from states' achievement test specifications. Additionally, items appropriate for students at lower or higher grade levels may be included in a test (e.g., Webb, 1999) to facilitate score scaling. Published alignment studies have typically included all objectives and items in alignment index calculations, although they have occasionally, at the request of particular states, recomputed the indices using only testable objectives (see MECG, 2010).

2.6 The Validity of Alignment Indices as Evidence of Test Content Representativeness: Previous Empirical Findings

Previous research on alignment results' validity as measures of test content representativeness has focused on an issue common to all judgmental alignment procedures: the quality of the ratings generated by alignment panels.
In evaluating test-curriculum correspondence, "the number of judges used, their competence, and the process they use in evaluating the test...and the conscientiousness with which they undertook the task of evaluation" help to determine the quality of their judgments (Ebel, 1956, p. 278). Recognizing that the value of any alignment data collected hinges on judges' expertise and adherence to a consistent rating process, published alignment methods provide guidelines regarding assembling and training the panel, which have been modified over time based on empirical findings.

Judges should be subject-matter experts who are familiar with the abilities of students in the target population, and may include university faculty, state department of education employees or consultants, graduate students with advanced degrees in the content area, or classroom teachers (Porter et al., 2008; Webb, 2007). The most qualified judges of how well test content corresponds to a particular curriculum content domain are those with the greatest degree of "knowledge of the curriculum in a specific school system," rather than those with "abstract, generalized curriculum ideas" (Guion, 1977, p. 7). Alignment panelists must possess both content area expertise and knowledge regarding typical abilities in the student population to be assessed (Davis-Becker & Buckendahl, 2013). Ideally, alignment panelists should have knowledge of the specific curriculum document used in the matching procedure (La Marca et al., 2000). Repeated use of the same panelists may improve comparability of alignment results across different test-curriculum combinations; however, in this situation any rater or panel bias could introduce systematic error into a set of alignment results. Representative sampling of content judges is a fairness issue (Guion, 1977); alignment panelists should be representative of stakeholders in the assessment results (Davis-Becker & Buckendahl, 2013).

The minimum number of judges recommended by each alignment method varies, but is often lower than the minimum number recommended for standard-setting panels (e.g., 15–20; Hambleton, Pitoniak, & Copella, 2012). Webb (1997, 2007) has recommended use of between 3 and 8 subject-matter experts for an alignment panel. SEC content analyses are typically conducted by between 3 and 5 raters, although sometimes as few as two raters have participated (Porter et al., 2008). The Achieve method (Resnick et al., 2004, p. 8) requires at least 6 judges. As during standard-setting studies, the "personality, skill, biases, perspectives, and personnel management abilities" of the facilitator are also important variables mediating the quality of data collected from an alignment panel (Beck, 2007, p. 131). Published alignment methods provide written materials to guide consistent implementation of instruction by facilitators. Variation in phrasing of instructions to panelists may affect their ratings of each item (Poggio et al., 1986; Bhola et al., 2003). Webb (1999) reported that when judges were given little initial guidance in interpreting rating categories, individual reviewers and groups of reviewers developed their own decision rules for coding. If applied consistently by several raters, self-developed coding schemes could lead to systematic error in measuring alignment relative to what would be obtained under the (in this case, unstated) intended coding rules.
Requiring reviewer training on a rubric as the first step of the alignment process encourages reviewers to hold common definitions, for example, of cognitive complexity categories (La Marca, 2001). While developers of alignment methodologies have emphasized the importance of allotting sufficient time for training to permit panelists to practice and thoroughly understand the coding process (e.g., Porter, 2002; Webb, 2007), the time actually expended on training varied markedly across early implementations of the various alignment procedures (Rothman, 2003) depending on the resources of the sponsoring organization. As in standard-setting studies, the amount of time allotted for training, practice, and document content analysis is likely to influence panelists' understanding of a given alignment process and confidence in their judgments (Martone, 2007).

Sireci and Geisinger (1992) argued that content raters' judgments should be independent of information regarding item writers' intent, and even of pre-specified content categories. Given an objective list, Sireci and Geisinger (1992) expressed concern that raters might tend to match items to the "closest" objective, rather than considering potential alternative objectives, not listed, that might more closely correspond to an item. The provided objective list is likely to influence subject-matter experts' perceptions of what each item is measuring (Sireci, 1998, p. 303). "By informing the [subject-matter experts] of what the test is supposed to measure, item congruence . . . ratings can be influenced by response sets such as social desirability and guessing" (p. 303), possibly inflating item-objective congruence index values. To avoid inducing rater response sets, which could bias item ratings, by provision of a content categories list from the test specifications, Sireci and Geisinger (1992) developed an item-similarity matching method based on multidimensional scaling. However, Martineau et al. (2007) recommended that upon identifying mismatched items, alignment panelists should be advised of item writers' intent in writing the item. Advisement of item writer intent should increase the precision, but also possibly the bias, of content ratings.

Sireci (1998) listed several possible threats to the validity of item-objective congruence measures, all of which concern the quality of judges' ratings or matches: (a) poor reliability of ratings due to an insufficiently large rater sample, rater fatigue, or the inherent complexity of the rating task, (b) poor comprehension of the rating task by judges, and (c) bias caused by rater response sets induced by provision of a fixed objectives list. Although no published alignment procedure is supported by a systematic program of research testing for the presence of these confounds, empirical studies of modern alignment indices have addressed the first two threats, meanwhile uncovering suggestive evidence of characteristics associated with rater bias.

2.6.1 Alignment Index Reliability

Crocker et al. (1988) proposed computing a generalizability coefficient at the end of content-based item analysis, or in a preliminary generalizability study, to check that the number of raters utilized produces item-objective correspondence index values that are adequately replicable in repeated independent sampling of rater panels.
They used an analysis-of-variance model to decompose the variance of item ratings, attributing different portions of the total variance to various specified possible sources (e.g., raters, item content), and computed generalizability coefficients, of which traditional alpha reliability coefficients are a special case, to assess the variability in test scores attributable to particular random features of a measurement procedure, in this case, raters, for various potential numbers of panelists. They noted that the projected generalizability coefficients apply only to the given test domain specification and rater population, since the number of raters needed to produce reliable item-objective correspondence indices is likely to depend on the breadth of a test's target domain, as well as the specificity with which objectives are written.

Porter et al. (2008) measured the magnitude of rater effects, cellwise, on the matrices of fine-grained content emphasis proportions from SEC content analyses of English language arts and mathematics achievement tests and curriculum documents from two states for grade levels 3, 6, and 9–12. These matrices of average proportions underlie the alignment indices computed from SEC data. For all state-subject-grade-document type combinations, the value of the generalizability coefficient prediction approached an asymptote above .9 as the number of raters reached approximately eight or nine (p. 4). In one state, rater generalizability was lower in English than in math for all grade levels and document types. Otherwise, projected generalizability estimates for a given number of raters were fairly consistent regardless of grade level, subject matter, and whether a test or curriculum document was analyzed. The authors concluded that future SEC alignment review procedures should recruit at least five raters, although generalizability coefficients were mostly acceptable, exceeding .70, with four raters. Herman, Webb, and Zuniga (2007) also computed a dependability-type generalizability coefficient for the Webb depth-of-knowledge alignment index calculated from 20 judges' content ratings of California's high school mathematics achievement test and a statement of mathematics competencies expected of freshmen entering University of California system institutions, estimating a .90 dependability coefficient. While the number of panelists in that study was much higher than is typical in alignment studies of state achievement tests, existing research, overall, seems to indicate that adequately reliable alignment index values could be obtained by recruiting more panelists than were recommended when these alignment methods were initially developed, perhaps between about 8 and 15 panelists.

2.6.2 Rater Agreement

When a particular item is matched to different objectives by reviewers, these judgments suggest panelists attribute "diverse" meanings to task statements, particularly with regard to their content and cognitive demand (Herman et al., 2007, p. 122), and may represent legitimate differences of opinion that would be expected when applying an inherently judgmental procedure to possibly complex test items (Rothman, 2003; Webb, 1999). However, disagreement could indicate a problem with the clarity of task content, with panelists' interpretation of the alignment matching criteria (Davis-Becker & Buckendahl, 2013), or with panelists' decoding of task content (D'Agostino et al., 2008).
Thus, sources of substantial disagreement among alignment panelists should always be investigated (Davis-Becker & Buckendahl, 2013). In some situations, disagreement may be attributable to characteristics of the documents analyzed, or to the alignment procedures themselves. When a single curriculum document includes descriptions of intended student performance at multiple levels of specificity (e.g., detailed objectives are subsumed under broad content "strands" or subcategories), agreement will tend to be higher as items are matched to broader (e.g., compute basic operations), rather than narrower (e.g., subtract three-digit whole numbers), performance descriptors, simply "because there are fewer opportunities for disagreement" (Davis-Becker & Buckendahl, 2013, p. 27). Similarly, for indices that measure test-curriculum overlap based on the proportional match of items and objectives in a content matrix, such as the SEC alignment index, if the matrix is very large, requiring reviewers to simultaneously attend to many content categories during matching, agreement may tend to be much lower than if the matrix has only a few cells describing broad content categories (Mehrens & Phillips, 1986, p. 186). In other cases, lack of agreement among panelists "might be due to characteristics or behavior of the raters themselves, including insufficient training on the rating process, insufficient depth of understanding of the standards, lack of content knowledge, inappropriate use of secondary objectives, fatigue, and coding errors (mistakes in writing down the appropriate objective number)" (Webb et al., 2007, p. 25). If alignment indices reflect underlying disagreement that suggests some raters may be seriously misinterpreting task statements, or some task statements are too vague to interpret with any confidence, the alignment indices may become "a function of who does the rating rather than a function of a test item's content and cognitive demand" (Webb, Herman, & Webb, 2007, p. 25), compromising the indices' validity as measures of content representativeness.

To examine rater agreement during alignment procedures, Herman et al. (2007) used ratings from a panel of 20 judges who rated high school mathematics behavioral objectives and test items from California, as detailed previously. Training was completed on the same day the ratings were collected, and followed established recommendations. Recognizing that most alignment studies rely on considerably smaller numbers of judges, consistent with recommendations in the literature (e.g., Porter, 2002; Webb, 2007), they simulated a more realistic quantity of alignment data by drawing ratings of all possible 6-judge subsets, each composed of three high school math teachers and three university faculty, from the 20 judges. Proportion agreement reached at least .65 (a criterion set by the authors; they recommended that for an item to be included in an alignment index computation, at least 65% of raters must match it to the same objective, or assign it the same cognitive demand rating) on the general content category for most of the 42 items (between about 75% and 100% of them, depending on the judge subset selected), but agreement on the specific content topic, number of topics, and cognitive demand measured by each test item varied more widely, between 50% and 100%, depending on the particular panel assembled.
If multiple item features, e.g., topic and cognitive demand, were considered simultaneously, proportion agreement among the 6 judges in each subset tended to be even lower, as anticipated. The authors concluded that with only 6 panelists, agreement about item-objective content match was limited.

Webb, Alt, Ely, Cormier, and Vesperman (2005) analyzed rater agreement in 34 selected alignment studies that used the Web Alignment Tool, an online implementation of the Webb method. They found that rater agreement on item cognitive demand levels, measured by intraclass correlation (ICC) and an average pairwise agreement statistic, was usually determined to be acceptable, with ICCs greater than 0.7 and pairwise agreement greater than 0.6. Four of the alignment studies, two of which had low variability in mean assigned cognitive demand levels among items, and two of which had only three raters, were judged to have unacceptably low interrater agreement. Rater agreement on item cognitive demand levels tended to be higher for lower grades' curriculum-assessment pairs, which tended to include more items that could be assigned to the lowest demand category with high certainty (p. 18). Rater agreement in item-goal matching, measured by pairwise agreement, was usually acceptable. However, rater agreement in matching items to specific objectives under each goal, again measured by pairwise agreement, was less than .5 in nearly two-thirds of the studies, including several in which eight or nine raters participated. Rater agreement in item-objective matching was lowest for studies utilizing curriculum documents with the largest number of objectives. Slight improvements in agreement over time were attributed to improvement in the training materials.

The Webb and SEC alignment methods do not require interrater agreement for item-to-objective matches or task-to-content taxonomy table classifications, respectively, even at the level of broad content category, and any disagreement tends to be masked by their indices, which rely on averaging (Davis-Becker & Buckendahl, 2013). Many traditional alignment methods similarly average over all ratings, regardless of the extent to which they agree. Martone (2007) reported that in one alignment study, many items were counted as a "match" to multiple specific objectives, some of which refined different broad content goals. Because failure to resolve or account for this disagreement in any way may be problematic for use of itemwise alignment results in test revision (Davis-Becker & Buckendahl, 2013), and for meaningful interpretation of alignment indices, removing some items' or raters' data from alignment computations has been suggested. Herman et al. (2007) proposed that when raters do not reach some prespecified level of agreement (they recommended 65%) in matching particular items, those items should be excluded from alignment index computations. Using data from three previous Webb-type alignment studies that compared (a) Michigan curriculum objectives to state achievement test forms, (b) Tennessee curriculum objectives to state achievement test forms, and (c) California's high school mathematics exit examination to math standards expected of entering freshmen by the University of California system, Webb et al.
(2007) found that when they recalculated the four Webb alignment indices using only items for which raters reached a minimum level of agreement (either a bare majority, or a clear majority), and applied Webb’s (1999) alignment criteria, conclusions about each aspect of alignment often differed from the original conclusions. Porter et al. (2008) reported that in computing generalizability coefficients for two states’ item and objective content classification tables, the results for one state included two aberrant sets of ratings at different grade levels. Generalizability estimates improved when these ratings were omitted, perhaps implying that any such aberrant judges’ ratings should also be excluded from alignment calculations. The existing research suggests that the amount of disagreement being averaged over to compute alignment indices sometimes has been high enough to warrant concern about the indices’ accuracy. Transparent alignment review results report the level of rater agreement obtained, flagging any discrepant raters or items. Techniques intended to address lack of agreement in alignment ratings include enlisting larger numbers of reviewers, averaging results among reviewers, and improving training (e.g., Webb et al., 2007). Presenting corrected alignment indices that exclude data from particular raters or items if evidence suggests problems with rater comprehension of certain items, or systematic rater bias, has also been proposed (Webb et al., 2007), but is seldom implemented in practice. 50 2.6.3 Rater Interpretation of Curriculum Objective and Test Item Content Judges’ alignment ratings “are highly dependent on a careful parsing of the content standards;” however, the “modal” state curriculum document may not have been developed “with sufficient care to support this level of parsing” (Beck, 2007, p. 130). Particularly if objectives are compound, partially duplicative, or insufficiently precise, rating is likely to be difficult (D’Agostino et al., 2008; Webb et al., 2007). Because objective statements are abstract, they may have “multiple legitimate interpretations,” and potentially be translated into many different instructional practices (Hill, 2001, p. 302). Through interviews, surveys, and classroom observations of 25 Michigan teachers, Spillane (2004) found that even when teachers have similar familiarity with curriculum objectives, motivation to pursue the objectives during instruction, access to aligned curricular materials, and prior mathematics knowledge, they interpret a state’s curriculum and test documents differently, and that these variations in interpretation influence their instructional decisions. During several alignment studies, Webb (1999) found that panelists, who included subject-matter experts and persons familiar with participating states’ curricula and assessments, sometimes recognized that they were seriously uncertain about the intent of a particular objective, and were able to code it only after a state curriculum director provided guidance about its meaning (see also La Marca et al., 2000, p. 15). Even if a task statement has an unambiguous meaning, occasionally individual panelists may misinterpret it. Observing a curriculum development committee of teachers in one urban Northeastern school district, Hill (2001) reported that state curriculum objectives were sometimes misinterpreted by individual teachers. 
In some instances, even committee consensus decisions about district curriculum objectives partially reflected single teachers’ misunderstandings when others either failed to offer a correction, or had the same 51 misunderstanding. Similarly, subject-matter experts have been observed to occasionally misunderstand the behaviors intended to be elicited by test items (D’Agostino et al., 2008). Although careful selection of qualified subject-matter experts who have knowledge of the relevant curriculum documents should reduce the potential for rater misinterpretation of curriculum objectives or test items, classifications made by individual judges or the panel may be influenced by systematic bias. Alignment panelists may be too strict or lenient, tending to find too many or few matches, or to assign higher or lower ratings than warranted by tasks’ content. While panelists should have some preexisting knowledge of the analyzed curriculum, or perhaps similar documents, “they should probably not have been heavily involved” in the development of either the curriculum or the test (La Marca, 2001, Methodological Considerations, para. 3), as such connections can positively bias their alignment judgments (Bhola et al., 2003). Sanford and Fabrizio (1999) observed that alignment panelists who had participated in test development exhibited “feelings of stress, frustration, and defensiveness” when their instruments were under review (p. 13). Curriculum alignment reviews conducted internally by test contractors may be particularly subject to bias. Buckendahl et al. (2000) concluded that employees of two test publishers found considerably larger proportions of test items aligned with Nebraska’s English Language Arts curriculum goals than did review panels of classroom teachers, on average. Even if alignment panelists have not been involved in producing the documents under review, certain types of panelists may exhibit more lenient response sets. Bhola et al. (2003) cautioned that training for teachers participating in alignment needs to clearly define criteria for matching, in order to overcome their tendency to attempt to find objective matches for every item (Bhola et al.), or to match many items to multiple content topics (Herman et al., 2007). It has further been suggested that educator panelists who are, and are not, subject to a particular test-based 52 accountability system might tend to produce different judgments about alignment of tests to a particular curriculum (Roach et al., 2010), but the direction of any differences cannot be easily predicted because panelists’ familiarity with the curriculum would also presumably vary. Monitoring of judge comprehension during alignment review is limited by the goal of generating sets of independent ratings, and may vary across applications of the same alignment method unless there is a consistent procedure for allowing judges to seek clarification of document content. Hambleton (1980, pp. 211–212) recommended inserting known “bad” items, which do not measure any intended objective, into traditional content validation matching processes, rationalizing that the ratings of judges who matched a large proportion of decidedly off-topic items to particular objectives should be eliminated from any data analysis (see also Davis-Becker & Buckendahl, 2013), but recent published alignment methods do not include any such verification step. 
Adding a phase of discussion-based feedback regarding items for which there are serious discrepancies in initial content coding, analogous to the panelist group discussion sometimes facilitated during standard-setting procedures (e.g., Reckase & Chen, 2012), which does not force panelists to reach consensus on judgments about test items, could perhaps prevent gross misinterpretation of document content, as well as allow more monitoring of rater understanding by the moderator. The SEC alignment process includes group discussion of some items, but they are identified by panelists, rather than by the facilitator based on collected data (Porter et al., 2008). Alternatively, a feedback phase after initial coding could provide panelists with information about item writers’ intent in constructing each item (Martineau et al., 2007). Current alignment methods assume rater competence, following training, to make the types of content classifications required, but tend to probe this assumption only through administration of exit surveys inquiring about judges’ experience during the review process (Davis-Becker & Buckendahl, 2013). Consistent, standardized analysis and reporting of participants’ survey responses would permit users of alignment results to gauge the judges’ understanding of the rating or matching task (Wyse & Viger, 2011), providing necessary evidence for validation (Davis-Becker & Buckendahl, 2013).
2.6.4 Rater Interpretation of Test Item and Curriculum Objective Cognitive Demand
Unlike item difficulty prediction, which requires raters to anticipate observable behavior (the response of an average examinee, or the center of the response distribution; e.g., Hambleton & Jirka, 2006), item cognitive complexity classification requires panelists to predict examinees’ cognitive processing—the strategy they will tend to use to solve a problem—and then to judge the complexity of the processing requirements to execute that strategy. The training that occurs before alignment review guides judges to internalize the cognitive demand classification scheme utilized by a particular alignment method. To foster rigorous conceptualization of each item’s response requirements, alignment procedures may instruct judges to complete each item, identifying the correct response prior to matching or rating its content (Ebel, 1956), and perhaps to assign corresponding objectives to each step of the solution process (Martineau et al., 2007). Training also usually includes cognitive demand coding practice using sample objectives or items (Webb, 2007). Raters’ understanding of the concept of item cognitive demand is shaped by the content and delivery of specific instructions defining the concept, delineating its classification categories, and describing item features that should be considered in assessing cognitive demand. Wyse and Viger (2011) used a debriefing survey to probe item writers’ understanding of cognitive demand following training on Webb’s (1999) cognitive complexity rating scheme. The item writers included teachers and other educators, all of whom had at least three years of teaching experience. The researchers interpreted some comments on the debriefing survey as evincing misconceptions about item cognitive demand. In particular, many item writers seemed to conflate cognitive demand with item difficulty. However, most comments reflected an understanding of at least some aspects of cognitive demand that was consistent with the training provided.
After receiving training, and carefully considering task statements’ features, judges may still find classifying tasks’ cognitive demand to be challenging. In one Webb-type alignment review of an adult basic competency test and curriculum, Martone (2007) found that there was some disagreement among panelists’ cognitive demand ratings for about two-thirds of the objectives, and for many of those objectives, initial ratings were nearly evenly split across two adjacent cognitive demand categories. Panelists’ judgments about task cognitive demand are likely to be influenced by their understanding of the “developmental levels and prior instructional experience” of the examinee population (Herman et al., 2007, p. 121). For example, when reviewing high school mathematics test alignment, high school math teachers tend to rate the items’ cognitive demand more highly than do university faculty (Herman et al.). Compounding the difficulty of predicting “average” cognitive demand, as items’ cognitive complexity increases, students are more likely to use diverse processes (e.g., either algebraic or geometric reasoning) to reach the correct solution (Leighton & Gokiert, 2008), producing uncertainty about what objectives the items measure (Webb et al., 2007). Knowledge of the examinees’ instructional experience may be particularly necessary to classify these items. Serious disagreement about many items’ cognitive demand would suggest that more training is necessary to help panelists appreciate the meanings of, and distinctions among, cognitive demand categories (Martone, 2007). 55 2.7 The Relationship Between Test-Curriculum Alignment and Student Achievement Test Scores: Previous Empirical Findings A basic premise of opportunity-to-learn research is that as students receive high-quality instruction following a particular curriculum, they learn, so their scores on test items (Schmidt, McKnight, Cogan, Jakwerth, & Houang, 1999; Wiley & Yoon, 1995) and tests (e.g., Schmidt et al., 2001) covering topics emphasized in the curriculum are expected to increase. Achievement test content validation arguments make the same claim, focusing on the role of the test: if test scores are valid measures of curricular attainment, truly reflecting the degree to which students have mastered the objectives, the scores should increase following relevant instruction (e.g., D’Agostino, Welsh, & Corson, 2007; Gulliksen, 1950). To guide the design of the International Association for the Evaluation of Educational Achievement’s (IEA) cross-national mathematics studies, Travers and Westbury (1989) translated this theory into a model of curricular learning that distinguishes between the formal or informal curriculum intended by stakeholders in an educational system, the intended curriculum, and the instruction that students actually receive, the enacted curriculum. The intended curriculum is the content material that legislative authorities, such as national or state education agencies, intend for students to learn in school. The implemented, or enacted, curriculum is students’ actual content exposure resulting from instruction during school. The attained curriculum is students’ resulting content mastery, or achievement. Schmidt et al. (2001, p. 31) hypothesized that the intended curriculum might have not only an indirect effect on student achievement gains, mediated by instruction, but also a direct effect on gains. 
Because alignment indices are meant to reflect the degree of the correspondence between the intended curriculum, instruction, and the test instruments used to measure student achievement, under certain assumptions, the indices would be expected to be predictive of achievement gains. 56 If instructional quality is sufficient (e.g., La Marca et al., 2000), student motivation is adequate (e.g., McMaken & Porter, 2012), and test item scores (e.g., Muthén, Kao, & Burstein, 1991) or subtest scores (e.g., Schmidt et al., 2001) are sensitive to differences in instructional content, the strength of alignment between the test and curriculum, in conjunction with the amount of instructional time allocated to teaching the curriculum (Gamoran et al., 1997), should be positively related to student test score gains. Presuming these assumptions hold, the correlation between alignment indices and mean student test scores, or test score gains, could provide evidence of the indices’ validity as measures of test-curriculum correspondence (Crocker et al., 1989) and of their potential utility for test developers and teachers (Webb, 2007). Considering the same structural relationships and assumptions from an OTL perspective, it would likewise be expected that if test-curriculum alignment indices are an indicator “of the potential of classroom instruction to influence student achievement” in a particular domain (Roach et al., 2008, p. 169), they should be related to student achievement gains (Schmidt & Maier, 2009). However, the strength of the relationship between alignment measures and achievement will likely be affected by the specific way that alignment is operationalized (Leinhardt & Seewald, 1981), as has been observed for OTL measures (Floden, 2002; Schmidt & Maier, 2009). Although the focus of this study is on alignment indices measuring correspondence between tests and curricula, to assess the extent that any alignment indices have been demonstrated to explain variability in student achievement or achievement gains, in the following sections we describe existing evidence for the impact of alignment between tests and curricula, instruction and tests, or instruction and curricula, on student achievement. Previous studies have variously represented student performance as total scores, subtest scores or item scores; all are reviewed here. We highlight results from mathematics, the subject area in which 57 most alignment-related research studies have been conducted, and which is most relevant to the present study, as well. 2.7.1 Instruction-Curriculum Alignment and Achievement Test Scores Smithson and Collares (2007) studied the relationship between curriculum-instruction alignment indices and student achievement scores in underperforming Ohio schools (i.e., schools not making “adequate yearly progress” in students’ average achievement, according to the state’s ESEA criteria). They found that alignment indices were a statistically significant positive predictor of classroom mean achievement, controlling for grade level, although the effect size was less than one-quarter of a standard deviation in mean achievement scores. The effect remained after controlling for prior mean achievement, but the prior means represented scores from only about one-third of the students in the sample, so the coefficient was not expected to be an unbiased estimate of the population relationship between alignment and mean achievement gains. 
Using only the fraction of the student sample for which prior achievement scores were available, after controlling for economic disadvantage, grade level and prior achievement using a multilevel model, no significant relationship between teachers’ instruction-curriculum alignment and students’ achievement scores was observed. In a random sample from the 10% of Ohio districts participating in the same instructional alignment study, Woolard (2007) found that elementary school buildings in ESEA “School Improvement” status generally reported lower mean teacher alignment scores in both math and language arts than buildings not in School Improvement status, although these differences were not statistically significant. He also found a small, significant positive correlation between schools’ mean alignment and their annual mathematics Performance Index, a state-mandated 58 accountability indicator that was a weighted sum of each school’s proportions of students in each proficiency category, by subject area. Kurz et al. (2010) examined the relationship between instruction-curriculum alignment, calculated from SEC teacher questionnaire data, and classroom achievement averages of 18 volunteer general- or special-education Grade 8 mathematics teachers in an urban school district in Tennessee. Training was conducted according to established SEC protocols. Classroom-level correlations between the curriculum-instruction alignment index and mean achievement on Tennessee’s summative state mathematics test were .64 for alignment of instruction reported at mid-year, and .58 average alignment reported at the end of the school year, relatively high. However, the authors cautioned that the correlation between alignment and achievement at the individual student level was likely to be considerably lower than the correlation at the classroom level. They recommended that future studies “should evaluate alignment alongside other known predictors of student achievement, including prior achievement, engagement, and other academic enablers” (Kurz et al., 2010, p. 142). Polikoff and Porter (2012) studied the effects of instruction-curriculum alignment, as measured by the SEC index, on teacher “value-added” scores in 4th- and 8th- grade English language arts and math. The teachers surveyed were a self-selected subsample of teachers from the Measures of Effective Teaching study, which sampled teachers in six urban school districts. They were significantly more likely to be white, and had lower proportions of Black or American Indian students, than teachers who did not participate. Teacher value-added scores in a particular subject and grade level were calculated as average residuals from models of student achievement test scores that controlled for prior test scores and other individual student characteristics (several different achievement tests, including each student’s state’s achievement test, were 59 administered, and consecutively modeled as alternative outcome variables). Four measures of teachers’ pedagogy based on student surveys or classroom observation protocols were also collected. The correlation between teachers’ instruction-curriculum alignment index scores and the mean residualized achievement scores of their students was significant and positive in math, and larger than the correlations between any of the pedagogical measures and the value-added scores. 
However, after adding fixed effects for district-grade combinations and all the pedagogical measures as additional predictors of the value-added scores, the coefficient on math instruction-curriculum alignment became nonsignificant, although it remained positive. The authors interpreted their results, overall, to indicate that the SEC instruction-curriculum alignment index, or other content coverage measures derived from SEC data, might be predictive of teachers’ average residualized student achievement scores, perhaps even more predictive than pedagogical measures, but suggested caution in interpreting the results due to several possible threats to replicability in the full study population, including possibly insufficient power, inadequate training of teachers prior to their completion of the SEC survey, or other irregularities in the subsample data. 2.7.2 Instruction-Test Alignment and Achievement Test Scores Winfield (1993) surveyed 19 teachers of regular or supplemental 4th-grade mathematics regarding their relative instructional emphasis on the specific content of 68 sample items written to correspond to 12 mathematics objectives covered by an annual state achievement test. Because disadvantaged (i.e., Title I) students’ scores on the achievement test were used by the school district to evaluate the effectiveness of schools’ supplemental instruction for these students, the teachers would have experienced mild-to-moderate pressure to align their instruction to the test objectives. Teachers’ responses to questions about “(1) the number of 60 times a mathematics concept was taught, (2) the frequency of review or re-teaching of the concept, (3) the number of settings in which the particular test format was used to teach the concept, (4) the frequency of usage of the format, (5) the extent to which the concept was emphasized in the school reading curriculum, and (6) the teachers’ perception of student mastery of the concept” were used to produce a content emphasis scale score for each item (p. 292). Students in these teachers’ classrooms who were eligible for Title I services then completed the test items. Analyzing the students’ item scores, Winfield found that average content emphasis scale scores for each item for both the regular and supplemental teacher groups were moderately, positively and significantly correlated with item difficulty (i.e., p) values. That is, students were more likely to respond correctly to test items containing content that was emphasized during instruction. A study by Gamoran, Porter, Smithson, and White (1997) is often cited as demonstrating that, in conjunction with instructional time, the alignment between instruction and a test instrument, as measured by the SEC index, predicts student achievement. Comparing achievement gains in three types of high school mathematics classes: general-track, transition, and college-preparatory, using a multilevel model the investigators found that “more rigorous content coverage accounts for much of the advantage of college-preparatory classes” over transition and general-track classes in math achievement gains (p. 325). The sample of 9th- and 10th-graders, drawn from four urban school districts in California or New York, was characterized as relatively low-achieving. 
For each participating classroom, the study calculated an indicator of content coverage that was a cellwise product of alignment, as computed from an SEC math content-cognitive demand matrix, and proportion of instructional time, as reported in a teacher survey. The model of achievement gains included covariates measured at both the 61 individual and classroom levels, but prior individual achievement was not among the predictors. Results indicated the indicator of content coverage was a marginally significant positive predictor of individual students’ achievement gains over one school year. However, the authors cautioned that student achievement gains during the school year, which averaged 1.7 points on the 26-point test, may have been partially attributable to repeated administration of the same test form, and teachers reported expending, on average, only about 7% of instructional time during the year on content that appeared on the outcome test, raising questions about the test’s suitability as an outcome measure. McMaken and Porter (2012) recommended that the Gamoran et al. (1997) study linking alignment to achievement gains should be replicated. D’Agostino et al. (2007) investigated the impacts of 52 fifth-grade teachers’ content emphasis, instruction-test alignment, and the interaction of these factors, on Arizona state mathematics achievement test scores. Teachers were asked to describe, in writing, how they taught two particular performance objectives from the state math curriculum, and provide sample classroom assessment items, if possible. Two subject-matter experts rated, on a three-point scale, the degree of alignment between teachers’ instruction and items on the state achievement test that matched the two objectives. Teachers were also asked to report, on a four-point scale, the degree of emphasis they placed on each of 21 performance objectives, including 11 Grade 5 objectives. The correlation between teachers’ emphasis and alignment scale scores was only .19, suggesting that these measures captured different aspects of teachers’ practice. Controlling for individual student background variables including two math pretest scores, as well as for their schools’ federal school meal program eligibility proportions, the authors used a multilevel model to predict fifth-graders’ math achievement test scores from classroom level emphasis, alignment, and the emphasis-by-alignment interaction. Finding that both alignment scores and the 62 interaction between alignment and emphasis were significant predictors of math scores, the authors concluded that there was some evidence that students in classrooms where instruction was over-aligned to the test performed better than students in classrooms where instruction plausibly targeted curriculum objectives but not precisely as they were operationalized on the test. Students in both highly- and moderately-aligned classrooms performed better than those whose teachers described instruction that seemed inappropriate to foster achievement of the objectives. The authors cautioned that teachers’ responses may have been influenced by desire to make their instruction appear aligned to curriculum, and that the true effect of instructional alignment on math achievement may have been confounded by positive relationships between alignment, and content and pedagogical knowledge, neither of which had been measured. 
2.7.3 Test-Curriculum Alignment and Achievement Test Scores Using different measures of test-curriculum alignment, or “overlap,” studies in the 1980s yielded mixed results regarding the relationship between alignment and achievement test scores. The “curriculum” in these early studies was usually taken to be represented either by a textbook (Freeman et al., 1983), possibly with its ancillary instructional materials (Leinhardt & Seewald, 1981), or by a curriculum guide, a document outlining—or possibly detailing—content and performance goals for a particular course of instruction. To judge test-curriculum overlap, schools’ degrees of curriculum-test match were rated by external curriculum experts (Mehrens & Phillips, 1986), or textbooks were systematically matched against a content taxonomy (Freeman et al., 1983; Mehrens & Phillips, 1987). Mehrens and Phillips (1986, 1987; see also Phillips & Mehrens, 1988) conducted a series of studies to address the question of whether differences in schools’ mathematics or reading curricula substantially affect student performance on commercial standardized achievement tests, 63 which were intended to assess elements common to school curricula nationwide. The authors (1986) used multivariate analysis of covariance to determine whether any variability in classroom mean subscores on an off-the-shelf standardized achievement test could be attributed to differential curriculum emphases across elementary schools in two Midwestern school districts. For reading and mathematics in Grades 3 and 6, district personnel used a 5-point scale to rate the degree of correspondence between the content emphases in each school’s implemented curriculum, and in the test. The reading and math textbook series used by each school at the two grade levels were also recorded. After controlling for both mean pretest total scores and welfare eligibility rates in each school, neither test-curriculum correspondence rating nor textbook series used was a significant predictor of either mean total scores, or subscores, on the mathematics or reading tests among third- or sixth-graders. Although only 78 schools were included in the analysis, so statistical power was likely to have been low, even the adjusted mean test score differences among textbook series or test-curriculum correspondence rating categories were judged to be within the approximate classroom-level standard error of measurement for the scores. Using data from one of the districts, Phillips and Mehrens (1988) similarly found very small, nonsignificant differences in item p-values and objective-level (narrower) test subscores between curriculum-test content match rating groups, and textbook series groups, for both grade levels in both reading and mathematics. The authors cautioned that the district curriculum officers used as raters may not have been sufficiently knowledgeable regarding the curricula implemented in each school to judge test-curriculum correspondence. Mehrens and Phillips (1987) used a 180-cell, three-dimensional matrix to classify the content of the Grade 5 and 6 math texts from three textbook series, which were used by different buildings in a school district, and the content of an off-the-shelf achievement test that was 64 administered annually in the district. Although the sequencing of topics differed across the textbook series, the cumulative content presented during Grades 5 and 6 was quite similar. 
The authors found that curricular emphasis proportion differences had no detectable relationship to item difficulty (p) differences computed from the scores of about 1,700 district sixth-graders who composed the three textbook groups. The average Rasch item difficulty value orders, and the mean item difficulty values for items covering similar content, also differed little for matched groups of students who used different textbook series. The authors concluded that the differences in curricula within a school district during that time period were not large enough to produce significant differences in standardized test scores. 2.8 Impact of Federal School Accountability Testing on Alignment When most commonly-used alignment methods were developed, prior to the 2001 emendation of the ESEA, state curriculum documents varied widely in organization, level of specificity and grade level span (La Marca et al., 2000). Some curriculum documents were simple lists of content topics or of vague performance goals. Because these curriculum formats tended not to adequately specify cognitive demand, they hindered both the development of aligned tests and the alignment review process (La Marca, 2001). The amended ESEA required states to develop and disseminate written grade-level expectations, statements of relatively specific behavioral objectives for every grade level, reducing variation in curriculum document organization among states (Webb, 2007). While the previous ESEA emendation in 1994 had dictated that state accountability tests must match their curricula (Webb, 1997), few resources were devoted to ensuring, or even encouraging, compliance. Evaluation of proposed state accountability testing systems under the 2001 rendition of the law temporarily denied testing system approval to states that failed to submit alignment evidence (Schafer et al., 2009). 65 During the 1990s and early 2000s, “most states lacked a formal and systematic process” for determining the alignment between curriculum and assessments (Webb, 1997, p. 8). Some states expended little effort on alignment review; others recognized that their state achievement test corresponded poorly to the written curriculum, but lacked the resources to revise the curriculum or develop more appropriate tests (Wixson & Yochum, 2004). Alignment studies often deemed alignment between state-administered achievement tests and the relevant curriculum documents either to be low (Rothman, 2003), with item distributions concentrated on measuring the least cognitively-demanding objectives (Resnick et al., 2004; Webb, 1999), or to be inflated by the generality of many states’ curriculum goal statements (Porter, 2002), each of which appeared to be measurable by a wide, content-diverse range of items. The tested curriculum (administered by states or school districts) was generally believed to have more influence than the written curriculum (developed by states) on the enacted curriculum (Glatthorn, 1999). However, even in the decade before the amended ESEA took effect, activism by policymakers directed at controlling curriculum, instruction and assessment in some states appeared to influence teachers’ instructional alignment, particularly in mathematics. Koretz (2008) describes the possibility of accountability-induced reallocation: Shifting of instructional resources (primarily instructional time, but other resources as well) among substantive parts of the curriculum to target better the particulars of the test. 
To some degree, reallocation is desirable, in that accountability tests are designed in part to signal what is important. Reallocation poses a risk, however, because tests are small and necessarily incomplete samples from the domains of achievement they are intended to represent. Allocating more time to one set of topics requires taking time away from 66 others, and if the material that is dropped or de-emphasized is also important for the intended inferences about achievement, then scores can rise more than gains in achievement warrant . . . Numerous surveys have found that teachers report reallocating in response to testing. (p. 84) In the spring of 2001, compared to teachers in states where student achievement tests had moderate or low stakes for teachers and schools, teachers in states where tests had high stakes reported being more likely to attempt to match the content and format of their classroom assessments to those of the state’s achievement test (Pedulla et al., 2003). Efforts by states to shape classroom instruction may also have encouraged teachers to focus instruction on curriculum expectations. Controlling for an extensive set of state policy and school characteristics, as well as other features of eighth-grade math teachers’ classrooms using a multilevel model, Swanson and Stevenson (2002) found that a state’s level of “standards-based policymaking” (e.g., establishing curriculum objectives, often based on the National Council of Teachers of Mathematics’ [NCTM] recommendations, administering curriculum-aligned assessments) was positively associated with the use of “standards-based instructional practices” (i.e., instructional content and practices recommended by the NCTM) in classrooms, with a “modest but substantively meaningful effect size” (p. 13). The implementation of the amended ESEA, which increased the stakes of student achievement testing for schools in many states, has spurred public school educators to attempt to tailor classroom instruction to reflect state curriculum documents and assessment patterns. Recent research indicates that increased alignment with the curriculum is evident, particularly in elementary school mathematics instruction. Repeated annual surveys of educators from representative samples of California, Georgia, and Pennsylvania elementary and middle schools 67 by Stecher and colleagues (2008) between 2004 and 2006 documented changes to instruction attributed to the amended ESEA’s accountability system. Most math teachers in all three states reported altering the content of their instruction to better reflect state curriculum objectives, although relatively few reported changing their proportional use of specific instructional strategies (e.g., direct instruction) over time. In spring of 2005, the middle year of the survey, about 75% of elementary math teachers, and a slightly smaller proportion of middle school math teachers, in the three states reported that they focused more instruction on tested topics than they would absent the high-stakes state test. Large percentages of elementary and middle school math teachers in all three states reported using item formats similar to those on the state test for classroom assessment more frequently than if the test had lower stakes. 
Many teachers also reported attempting to align their instruction to reflect the content of the state assessment; the lowest proportions of math teachers reporting such behavior were in California, where state policy prohibited public release of any test items from previous assessments. Results from the survey were similar in 2006. As would be anticipated, efforts to increase instructional alignment to state curricula seem to have been concentrated in tested grades and subject areas. The amended ESEA mandated state achievement testing in reading and mathematics in elementary grades 3–8 beginning in 2005–2006. In Ohio, Woolard (2007) reported that school average curriculum-instruction alignment in mathematics, measured by the SEC index, rose markedly in Grades 2 and 3, the grades at or immediately before which accountability testing began, from a low base level in Grades K and 1. Science achievement testing is also required in two state-selected elementary grades, but it was not phased in until 2007–2008. Compared to science teachers in surveyed schools and school districts, math teachers have made more concerted efforts to align the content of their instruction with state curriculum objectives (Stecher et al., 2008), and have achieved higher mean instructional alignment, as measured by the SEC index (Porter et al., 2007). However, there is little evidence of significant changes in instructional alignment among reading or English language arts teachers (Polikoff, 2012a).
Although many public school teachers reported attempting to increase alignment between the content of their instruction and state curriculum objectives, the magnitude of actual change in instructional content emphasis may have been small, and alignment may still be relatively low. Using alignment indices computed from SEC questionnaire responses collected from a selective sample of over 3,000 teachers from 23 states, Polikoff (2012a) concluded that Grade K–8 instruction-curriculum alignment increased slightly under the amended ESEA, with the most pronounced improvement in mathematics. Regression models of instruction-curriculum alignment change for the grade ranges K–2, 3–8 and 9–12 controlled for any time-invariant effects of particular states and grade levels on changes in instructional alignment. Over the six years between 2003 and 2009, the proportion of sampled math teachers’ instruction that aligned to curriculum objectives increased by 3.8% in Grades K–2, and by 3.1% in Grades 3–8. Average instructional alignment over the study period and across the grades was low, however, with only about one-fourth of math instructional time distributed across content-cognitive demand combinations suggested by state curriculum documents. The sample did not depart wildly from national population average classroom characteristics, but was not claimed to be nationally representative, as most of the surveyed math teachers were from Indiana, Montana, Ohio, Oklahoma, or Oregon.
2.9 Summary of the Literature and Contribution of This Study
Alignment evidence is necessary for validation of state achievement test score interpretations. Although traditional methods of computing alignment exist, their application has usually compared test content to a test specifications table.
Modern alignment procedures to compare test item and curriculum objective content differ along several dimensions, including the types of task features that are considered relevant to judging alignment, the reporting of error variability among panelists, whether an intermediate content classification table is used, and whether connecting items to objectives involves binary matching, rating, or both. There is little empirical support for the cognitive demand coding schemes adopted by modern alignment methods (Ferrara et al, 2011). Alignment methods’ various indices and cutoff criteria espouse different definitions of test-curriculum alignment; none of the indices have cutoff values that have been devised based on empirical research, or are widely agreed upon (Davis-Becker & Buckendahl, 2013). Alignment indices’ validity as measures of test content representativeness depends on the conditions under which the rating data is collected. Monitoring of panelists’ comprehension during alignment review tends to be limited by the desire to maintain the independence of their judgments. Although curriculum objectives may have multiple reasonable interpretations, panelists occasionally make clear errors of interpretation when decoding test items or curriculum objectives; however, these gross errors appear to be rare (D’Agostino et al., 2008; Hill, 2001). Panelists’ judgments of task cognitive demand may be influenced by their understanding of the “developmental levels and prior instructional experience” of a given test-taker population (Herman et al., 2007, p. 121). As might be anticipated, panelists’ findings may be biased if they have been involved in developing the tests or curricula under review. For this reason, while it is 70 usually recommended that judges have some previous familiarity with curricula that they will review, test or curriculum developers would not typically be recruited to alignment panels for state achievement tests. Evidence of variability across panelists’ ratings of item and/or objective content indicates that ratings have been fairly or highly consistent during some alignment reviews, but that sometimes their consistency has been poor. Alignment index estimates may have acceptable reliability if reviews enlist 4 to 6 panelists; however, panelist numbers recommended by the Webb and SEC methods have been revised upward toward 5 to 8 to reflect the marked improvements in index reliability expected using data from additional panelists. The amount of rater disagreement averaged over to compute alignment indices in research or practical alignment studies has sometimes been high enough to warrant concern about the indices’ meaning, but on most occasions when rater agreement has been reported, it has been acceptable. Overall, previous research provides some documentation supporting alignment indices’ reliability and validity, but such evidence has not systematically been collected and reported. Because alignment indices are meant to reflect the degree of content overlap between the intended curriculum, instruction, and the test instruments used to measure student achievement, under certain assumptions about instructional quality, student motivation and instructional sensitivity of the test items, the indices would be expected to be predictive of achievement gains. There is some evidence of a positive correlation between instruction-curriculum alignment and classroom or school mean achievement, particularly in mathematics. 
There is also some evidence that instruction-test alignment is a significant positive predictor of classroom mathematics achievement, and classroom and individual achievement gains. In contrast, empirical evidence suggests that test-curriculum alignment is not significantly related to classroom mean math achievement test scores. However, this conclusion reflects results from a single study conducted during the mid-1980s that analyzed an off-the-shelf, non-curriculum-based achievement test (Mehrens & Phillips, 1986, 1987). Further, the authors of the study cautioned that the district curriculum specialists engaged as raters may not have been sufficiently knowledgeable regarding curricula implemented in particular schools to accurately judge the extent of test-curriculum correspondence.
Modern alignment indices are an important warrant for claims in inferences (Kane, 2013) that generalize students’ observed state achievement test scores to their expected performance on a universe of potential test tasks defined by their state’s curriculum. To generalize curricular achievement test scores to performance under measurement conditions other than those observed, the task sample (i.e., item set) composing the test must be claimed to prompt behaviors representative of the activities listed in the relevant curriculum document. Such claims may be warranted by presentation of a particular alignment index or qualitative alignment evaluation as evidence of test content representativeness. The first purpose of this study is to seek external empirical backing for the alignment index warrants underlying some score interpretation validation arguments, as recommended by Davis-Becker and Buckendahl (2013). To investigate alignment indices’ accuracy as measures of test content representativeness, I focus on checking two assumptions of the SEC alignment index formula: that counts of curriculum objectives are indicative of intended curricular emphasis (also an assumption of Webb’s balance-of-representation alignment index), and that the cognitive demand categories adopted are best treated as nominal. The second purpose of this study is to probe the relationship between test-curriculum alignment and state average mathematics achievement.
CHAPTER 3: METHOD
My study uses data reflecting state math curricula from the 2005–2007 school years and student math performance in 2007. This time frame lies several years after passage of the amended ESEA, after the 2005–2006 deadline for states to fully implement its accountability provisions, and five years before any waivers of the accountability requirements were issued in 2012. Over the period from 2001 to 2007, considerable pressure on states led to increased uniformity in the organization, although not content, of state curriculum documents, and on teachers led to increased alignment between mathematics instruction and the written curriculum, providing a suitable context for testing how the test-curriculum alignment index functions. Variability in curriculum topic-by-cognitive-demand coverage among and within states will contribute to the power of statistical tests of the overall relationship between content emphasis or alignment and test item performance, and results from different states should have at least some comparability due to increased similarity in the organization of curriculum documents.
Student mathematics achievement and teacher instructional content emphasis data for this study are drawn from the National Assessment of Educational Progress (NAEP) 2007 and Third International Mathematics and Science Study (TIMSS) 2007, and measures of content emphasis for state mathematics curricula and the two achievement tests are taken from publicly available SEC content analysis data. Research Questions 1 and 2 propose examining the relationship between content emphasis proportions from SEC content matrices that represent state curriculum documents, and two types of external criteria: achievement test item performance and mean teacher-reported instructional content emphasis, across states, as validation evidence for the SEC alignment index. Both relationships are expected to be positive. In this study, I will use zero-order correlations with instructional emphasis, and average marginal effects from regression models of item difficulty (item difficulty models, e.g., Gorin, 2006), as effect size measures to quantify the strength and direction of these relationships, if any (see the illustrative sketch later in this section). To estimate the unique effect of curricular content emphasis on test item performance, which is posited to also be influenced by many other item and examinee characteristics, the item difficulty models will control for item- and state-level characteristics believed to be among the most important. Research Question 3 asks if there is a statistically significant association between test-curriculum alignment, measured at the level of content topic, and mean test item performance (i.e., item difficulty) in a state. Alignment is expected to interact with curricular emphasis, such that its association with item difficulty becomes increasingly positive as emphasis on curricular content relevant to each particular test item increases. Research Question 3, like Research Question 1, will be investigated using an item difficulty model, although for data at a different grade level. While the results of this study will, in any case, have to be interpreted with some caution due to the relatively small group of states with coded curriculum documents for the relevant time frame, they are expected to contribute evidence for validation of the SEC alignment index, and to quantify the relationship between state-level alignment and achievement during a time period when elementary mathematics teachers were under high pressure to target state curriculum objectives during instruction.
3.1 Data
It is reasonable to believe that the more different two compared curricula are, “the more likely those differences will have an impact” on test scores (Mehrens & Phillips, 1987, p. 358). State mathematics curricula show sufficient variation in objectives that it may be reasonable to expect differences in item-level achievement due to differences in opportunity to learn the content. Reys et al. (2007) concluded that alignment of the curriculum objectives (i.e., “grade-level expectations”) across the ten most populous US states was generally poor. Fourth-grade math objectives showed little consistency across the states examined—about one-quarter of grade-level expectations were unique to one state’s curriculum document, while only about a third of objectives appeared in six or more states’ standards.
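Because the item difficulty models described above take proportion-correct values bounded between 0 and 1 as the outcome, a fractional logit specification in the spirit of Papke and Wooldridge (1996), that is, Bernoulli quasi-maximum likelihood with a logit link and robust standard errors, is one natural way to operationalize them. The sketch below is a minimal illustration with hypothetical, simulated variables (p_correct, emphasis, alignment, constructed_response), not the study’s actual model specification or data; the average marginal effect of one predictor is computed directly from the logit link derivative.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

# Hypothetical item-by-state records: proportion correct per item per state, a
# curricular emphasis proportion, an alignment measure, and an item-format flag.
rng = np.random.default_rng(1)
n = 400
df = pd.DataFrame({
    "emphasis": rng.uniform(0, .15, n),          # share of curriculum on the item's topic
    "alignment": rng.uniform(.1, .5, n),         # state test-curriculum alignment measure
    "constructed_response": rng.integers(0, 2, n),
})
eta = -0.2 + 4.0 * df["emphasis"] + 1.0 * df["alignment"] - 0.6 * df["constructed_response"]
df["p_correct"] = np.clip(1 / (1 + np.exp(-eta)) + rng.normal(0, .05, n), .01, .99)

# Fractional logit: Bernoulli quasi-likelihood with a logit link; robust SEs.
X = sm.add_constant(df[["emphasis", "alignment", "constructed_response"]])
result = sm.GLM(df["p_correct"], X, family=sm.families.Binomial()).fit(cov_type="HC1")
print(result.summary())

# Average marginal effect of curricular emphasis on expected proportion correct:
# for a logit link, d(mu)/d(x_k) = beta_k * mu * (1 - mu), averaged over observations.
mu = result.fittedvalues
ame_emphasis = (result.params["emphasis"] * mu * (1 - mu)).mean()
print(f"average marginal effect of emphasis: {ame_emphasis:.3f}")
```

The average marginal effect re-expresses the logit coefficient on the proportion-correct scale, which is the effect size metric referred to above.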
Consistent with Reys et al. (2007), a quantitative alignment analysis that coded the content emphasis of state curriculum documents using the SEC index (Porter et al., 2009) found relatively low alignment among states’ K–8 math standards, particularly within grade, but also when consolidating across grades. The authors determined that there was a small common curriculum recommending instruction on particular number properties and basic operations in the early elementary grades (see also Reys et al., 2007), on estimation at most grade levels, on simple probability at Grade 7, and on providing interpretation of data displays at Grade 8.
I engage public-use SEC data on proportions of content coverage in state math curricula. I restrict my analysis to 11 SEC-participating states that neither adopted, nor made publicly available as drafts, any major curriculum document revisions during 2006, the year immediately prior to NAEP 2007 and TIMSS 2007 testing (with “NAEP” and “TIMSS” henceforth used to refer to the 2007 versions of these tests, unless otherwise noted, for brevity). Because content learned in previous grades is likely to impact performance, I will aggregate curriculum content emphasis matrices for the grade in which each test was administered with those of the previous grade (Mehrens & Phillips, 1987), yielding a matrix of proportions for each state. This unweighted summation assumes that “roughly the same ‘amount’” of total curriculum content was covered in each grade (Porter et al., 2009, p. 264).
Both the NAEP and TIMSS studies assessed curricular mathematics achievement among fourth- and eighth-graders, collecting additional background information from sampled students, their math teachers, and school administrators. All US states participate in NAEP testing, and Massachusetts and Minnesota served as benchmarking participants for TIMSS. Using TIMSS and NAEP item responses, rather than state achievement test item responses, to measure academic mathematics achievement offers the advantages of cross-state comparability and the potential to control for factors, besides content coverage emphasis, hypothesized to affect test item performance. Although the content-cognitive demand categories implemented by the coarse-grained SEC matrix (MECG, 2004) and the two assessments’ frameworks are not identical, the three schemes’ content dimensions overlap heavily—all content categories appearing on NAEP and TIMSS were used during SEC coding—and the demand dimensions overlap partially. The SEC’s content classification scheme will be mapped, separately, onto the two assessments’ content coding categories.
3.1.1 SEC Data
Since 2001, researchers from the Wisconsin Center for Educational Research at the University of Wisconsin-Madison and the Surveys of Enacted Curriculum State Collaborative Project sponsored by the Council of Chief State School Officers have conducted or facilitated content analyses of curriculum documents and/or achievement tests from many states and school districts, as well as a number of national standardized tests (Porter et al., 2011). I engage public-use SEC data on proportions of content coverage in eleven states’ math curricula, which would have been the active curriculum standards at, and prior to, the NAEP and TIMSS 2007 administrations.
I restrict my analyses to states that participated in SEC alignment analyses, and neither adopted, nor made publicly available as drafts, any major revisions of their curriculum documents during 2006: Alabama, California, Indiana, Kansas, Massachusetts, Michigan, Minnesota, New Jersey, Ohio, Oregon, and Vermont. Some of these states had relatively longstanding mathematics curriculum documents, while others’ curriculum documents had been more recently introduced. Identification of states with stable curriculum documents during 2006 was based on consistent information from three nationwide policy reports that listed states’ current curriculum documents, and dated and described any major published curriculum revisions occurring over the time intervals from 2005 to 2008 and 2005 to 2010 (American Federation of Teachers, 2008; Carmichael, Martino, Porter-Magee, & Wilson, 2010; Klein, 2005), with reference to state department of education websites for confirmation. NAEP 2007 and TIMSS 2007 test items have also been content analyzed using the SEC content classification scheme (Blank & Smithson, 2009).
The fine-grained SEC content matrix for mathematics, which is recommended for use in alignment analyses (Porter, 2002), has 915 cells. The coarser-grained version of the mathematics matrix, which consolidates specific content topics but retains the same cognitive demand distinctions as the fine-grained matrix, has 80 cells. Because achievement tests usually contain many fewer than 1,000 items, and test score users generally want to make inferences about performance on a domain broader than the “specific cells that happened to be tested” in a very large, detailed matrix, Mehrens and Phillips (1986, p. 186) cautioned that the coding matrix should not be too large. To allow classification of state curriculum documents’ SEC content proportions according to the NAEP and TIMSS content categories, I will use proportions from the coarse-grained SEC state curriculum content analysis matrices, and NAEP and TIMSS assessment content analysis matrices for Research Question 3, as raw data. Because the accuracy of individual judges’ ratings is likely to decrease as an alignment matching task requires more detailed parsing of content items and knowledge of examinee behavior (Davis-Becker & Buckendahl, 2013), consolidating over detailed content topics may provide the best chance for a favorable assessment of alignment index validity.
Sixteen content topics define the rows of the coarse-grained mathematics content matrix: Number Sense/Properties/Relationships, Operations, Basic Algebra, Advanced Algebra, Consumer Applications, Measurement, Geometric Concepts, Advanced Geometry, Data Display, Statistics, Probability, Analysis, Trigonometry, Special Topics (e.g., sets, logic), Functions, and Instructional Technology. Five cognitive demand types define the columns of the matrix: Memorize, Perform Procedures, Demonstrate Understanding, Conjecture/Generalize/Prove, and Solve Non-routine Problems/Make Connections. Descriptions of example response requirements that correspond to each cognitive demand type are listed in appendix Table A1. The proportion in each cell of the SEC matrix is taken to represent an estimate of the relative emphasis of that cell’s content category by a test or curriculum document, based on panelists’ item or objective content classifications (Porter, 2002).
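As a concrete reference point for how these cell proportions enter the index, the sketch below computes the SEC alignment index described by Porter (2002), one minus half the sum of absolute cellwise differences between two content-emphasis proportion matrices, for a coarse-grained 16-by-5 matrix pair, after averaging a state’s matrices for adjacent grades as described in this chapter. The matrices here are randomly generated placeholders, not actual SEC content analysis data.

```python
import numpy as np

N_TOPICS, N_DEMANDS = 16, 5   # coarse-grained SEC mathematics matrix: 80 cells

def to_proportions(matrix):
    """Normalize a nonnegative content matrix so its cells sum to 1."""
    matrix = np.asarray(matrix, dtype=float)
    return matrix / matrix.sum()

def sec_alignment(curriculum, test):
    """Porter's (2002) alignment index: 1 minus half the sum of absolute
    cellwise differences between two content-emphasis proportion matrices."""
    x = to_proportions(curriculum)
    y = to_proportions(test)
    return 1.0 - 0.5 * np.abs(x - y).sum()

# Placeholder matrices standing in for SEC content analyses
# (rows = content topics, columns = cognitive demand categories).
rng = np.random.default_rng(2)
grade7 = rng.random((N_TOPICS, N_DEMANDS))
grade8 = rng.random((N_TOPICS, N_DEMANDS))
test_matrix = rng.random((N_TOPICS, N_DEMANDS))

# Aggregate adjacent grades by averaging their proportion matrices, mirroring
# the sum-and-divide-by-two procedure described for the curriculum data.
curriculum = (to_proportions(grade7) + to_proportions(grade8)) / 2

print(f"alignment index: {sec_alignment(curriculum, test_matrix):.3f}")
```

Because both arguments are proportions over the same 80 cells, the index equals 1 only when the two emphasis distributions are identical, and it approaches 0 as the documents concentrate their emphasis on disjoint cells.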
Because knowledge is acquired cumulatively, and tests are often designed to measure knowledge that may have been taught in previous grades, investigations of the relationship between curriculum-to-test match and achievement should account for more than one grade’s curriculum (Mehrens & Phillips, 1987). Kurz et al. (2010) interpreted findings from their small alignment study as suggesting that the relation between alignment and mean achievement becomes strong only when “students have been exposed to the instructional curriculum for a sustained period of time—in case of this study, for longer than 6 months” (p. 142). To account for the cumulative nature of knowledge acquisition in mathematics, I will aggregate the curriculum content emphasis matrices (e.g., Porter et al., 2009) for the grade in which each test was administered (Grade 4 or Grade 8) with those of the previous grade (Grade 3 or Grade 7) by summing the matrix pairs and dividing each element by two. In one state, curriculum documents for the grade blocks 3–4 and 7–8, rather than for single grades, were coded; the proportions in these content emphasis matrices were taken to represent coverage in the relevant grade ranges.

3.1.2 Trends in International Mathematics and Science Study 2007 U.S. Benchmarking and National Assessment of Educational Progress 2007 Samples

The NAEP 2007 study assessed curricular mathematics achievement among fourth- and eighth-graders, collecting additional background information from sampled students, their math teachers, and school administrators (NCES, 2009). The survey used a two-stage stratified sampling design, selecting schools with probability proportional to size in the first stage, and about 30 students per sampled school in the second stage. All US states participated in NAEP testing between January and March 2007. In the states for which SEC curriculum content analyses are available, 37,689 fourth graders from 1,713 schools, and 35,182 eighth graders from 1,607 schools, in total, participated. NAEP sampled only public schools; Department of Defense and Bureau of Indian Education schools will be excluded from analysis. NAEP used a balanced incomplete block test booklet series design, so each student was administered only a fraction of the test item set. To fourth-grade students, 164 different items were administered; to eighth-grade students, 167 different items were administered. Test booklets were distributed in a spiraling manner so that the group of students receiving each item should, after accounting for unequal probabilities of selection, approximate a simple random sample from the population. In both grades, items covered five major content topics: Number Sense/Properties/Operations, Measurement, Geometry/Spatial Sense, Data Analysis/Statistics/Probability, and Algebra/Functions. Items were classified by the test developers as requiring one of three levels of cognitive demand: Low Complexity, Moderate Complexity, and High Complexity, the definitions of which are provided in appendix Table A2. Counts of Grade 4 and 8 NAEP items in each content category are displayed in Table 1.

Like NAEP, TIMSS drew grade-based samples of students in their fourth and eighth years of formal schooling. Selected students were tested in mathematics and science. The US states of Massachusetts and Minnesota served as benchmarking participants, sampling large enough numbers of public school students to permit state-level achievement estimates to be obtained.
These states’ TIMSS sampling designs were based on the NAEP sample designs, and were specified to minimize duplicate selection of schools by the two studies at each grade level (see Olson, Martin, & Mullis, 2008, for a detailed description of the sampling procedure). Testing took place between March and June 2007. In total, 3,593 fourth graders representing 97 schools, and 3,674 eighth graders from 97 schools, participated in these two states. Fourth-graders were administered 179 different mathematics items, while eighth-graders completed 215 different items. TIMSS also used a balanced incomplete block test booklet series design. At Grade 4, items covered three major content topics: Numbers, Geometric Shapes and Measures, and Data Display. At Grade 8, items covered four major content topics: Numbers, Algebra, Geometry, and Data and Chance, which included statistics and probability items (the “Chance” subtopic). Items were classified by the test developers as requiring one of three types of cognitive demand: Know, Apply, and Reason, the definitions of which are provided in appendix Table A3. Counts of Grade 4 and 8 TIMSS items in each content category are displayed in Tables 2 and 3, respectively.

TABLE 1
Distributions of NAEP 2007 Test Items by Content Category and Grade Level

                                              Cognitive Complexity
                                          Low           Moderate        High
Topic                                  Gr. 4  Gr. 8   Gr. 4  Gr. 8   Gr. 4  Gr. 8
Algebra                                  10     23       9     19       1      3
Data Analysis, Statistics,
  and Probability                        13     13       6     13       1      0
Geometry                                 13     19       9     12       1      0
Measurement                              24     19      11      9       0      0
Number Properties and Operations         41     23      21     14       2      0

Source. National Assessment of Educational Progress 2007.

TABLE 2
Distribution of TIMSS 2007 Grade 4 Test Items by Content Category

                                     Topic
Cognitive        Data        Geometric Shapes
Domain          Display        and Measures       Numbers
Applying           11               25               32
Knowing             6               23               39
Reasoning           9                9               20

Source. Trends in International Mathematics and Science Study 2007.

TABLE 3
Distribution of TIMSS 2007 Grade 8 Test Items by Content Category

                                     Topic
Cognitive
Domain         Algebra    Chance    Data    Geometry    Numbers
Applying          15         5       13        27          25
Knowing           32         5        9         8          27
Reasoning         17         0        8        12           8

Source. Trends in International Mathematics and Science Study 2007.

3.1.3 Comparison of TIMSS and NAEP Assessment Frameworks, and SEC Content Coding Categories

Although the proportions of items covering each content type and specific objectives to be assessed differ between NAEP and TIMSS at each tested grade level, the tests’ items at each grade level cover an identical set of broad content areas (Neidorf, Binkley, Gattis, & Nohara, 2006). The mathematics achievement conceptualizations and target task domains of NAEP and TIMSS have wider scope or referent generality than most state math achievement tests. However, NAEP, TIMSS, and particular states’ achievement tests would be expected to have some item types in common. Further, research suggests that mathematics curriculum interventions, if implemented with reasonable fidelity, can produce sizable score gains on subject-area achievement tests, even if the tests have not been intentionally aligned with the new curriculum units (e.g., Senk & Thompson, 2003). Although the targets of inference in the TIMSS and NAEP studies are broad constructs, I will treat their item sets as corresponding to a potential state curriculum that defines an observable mathematics achievement trait, and interpret students’ item performance as representing “relative degree of content acquisition” (Haertel, 1985, p. 24).
The content and cognitive demand categories of the SEC content language and the mathematics curriculum frameworks used to classify nations’ curriculum materials in the early TIMSS studies (e.g., Robitaille et al., 1993) are similar. Both document analysis systems use a common classification scheme to describe the academic content of curricula and tests (Webb, 1997), and describe cognitive demand categories representing distinct types of observable behaviors (e.g., communicating, solving routine procedures) that are not assumed to be ordered and do not imply any particular underlying assumptions about examinee cognitive processes. More recent TIMSS assessment frameworks, including that used for TIMSS 2007, have been developed from the TIMSS 1995 curriculum framework (Mullis et al., 2005). Compared to the TIMSS 1995 curriculum framework, the TIMSS 2007 assessment framework uses fewer, more general content categories, and cognitive demand, rather than observable performance, categories (Mullis et al., 2005), making it more similar to the NAEP 2007 coding scheme, but less similar to the SEC coding scheme, than was the TIMSS 1995 framework.

As appropriate based on typical elementary school mathematics curricula and the frameworks of the NAEP and TIMSS assessments, the two tests cover only a subset of the SEC content topics. The SEC content topics of Advanced Algebra, Consumer Applications, Advanced Geometry, Statistics, Probability, Analysis, Trigonometry, Special Topics, and Instructional Technology are not covered on TIMSS at Grade 4. The SEC content topics of Consumer Applications, Analysis, Trigonometry, Special Topics, and Instructional Technology are not covered on TIMSS at Grade 8. The SEC content topics of Consumer Applications, Analysis, and Trigonometry are not covered on NAEP at either Grade 4 or Grade 8. The cognitive demand dimension of the SEC content matrix appears to have some overlap with the cognitive demand categories of NAEP and TIMSS. Specifically, all three classification schemes group together items that require extended reasoning to solve non-routine problems. NAEP explicitly describes its High Mathematical Complexity items as intended to require more demanding cognitive processing than items in other categories (NAGB, 2006), but TIMSS and the SEC do not make this claim about items in their Reasoning, or Conjecture/Generalize/Prove or Solve Non-routine Problems/Make Connections, categories, respectively.

3.2 Models

If test scores are measuring the intended trait, differences among the anticipated response processes used for, and content of, the test’s items should explain some of the variability in average item responses, that is, in proportion-correct item difficulty (Gorin, 2006). Item difficulty models specify particular item characteristics that are believed to affect examinees’ average probability of correct response. Traditional item difficulty analysis models (e.g., Bejar, 1993) regress the classical difficulty parameter for each item on hypothesized important task features. To determine the extent to which the knowledge and skills that affect observed item difficulty match difficulty features intended by the test developer, the proportion of variability in item difficulty explained by the modeled item features (i.e., an R2 value), and effect sizes for each factor, are usually examined (Gorin, 2006).
Particular item features “typically show similar relationships” with classical proportion-correct and item response model item difficulty parameters (Mislevy, Steinberg, & Almond, 2002, p. 122), the latter of which are less commonly modeled (Gorin, 2006). Other approaches to item difficulty modeling analyze individual examinee response data using specialized Rasch item response models (e.g., Fischer, 1997), or latent class models (e.g., Tatsuoka, Corter, & Tatsuoka, 2004). Embretson and Daniel (2008) noted that coefficients estimated from individual data for each hypothesized difficulty factor would be consistent and expected to be unbiased, and would tend to have smaller standard errors than coefficients in analogous models estimated from estimated item parameter values. However, in an empirical study, they found that the magnitude and direction of coefficients for each item feature associated with difficulty were similar, regardless of whether they were estimated from individual data or Rasch item difficulty statistics.

Among item surface features or response process characteristics posited to affect mathematics item difficulty, early studies of Graduate Record Examination mathematical reasoning items suggested that cognitive complexity ratings were the most consistently useful predictor of classical (Chalifour & Powers, 1989) or item response model (Enright & Sheehan, 2002) item difficulty, and that structural features of items, including the number of assignments to position that were fixed for the elements to be manipulated in the problem and the amount of information from the rules and conditions that was actually required by the intended solution process, were also significantly related to item difficulty (Chalifour & Powers; Enright & Sheehan), as was one linguistic feature: verbal load—the number of words in the prompt (Chalifour & Powers). Recent studies of math achievement test items from state testing programs indicate that linguistic features may be an additional important determinant of item difficulty for elementary school students, as Abedi and Lord (2001) contended. Shaftel, Belton-Kocher, Glasnapp, and Poggio (2006) concluded that elementary students’ probabilities of responding correctly to math test items appear to be primarily influenced by structural, trait-relevant problem features, particularly if the test development process has been rigorous. Evidence clearly suggests that classical item difficulty (i.e., item easiness) decreases with increased inclusion of mathematics vocabulary terms (Ferrara, Svetina, Skucha, & Davidson, 2011; Shaftel et al., 2006), a linguistic but trait-relevant item feature. However, certain purely linguistic features, particularly the number of ambiguous words in the item stem or response options, also significantly impede item performance (Ferrara et al., 2011; Shaftel et al., 2006), and there is no evidence that the Webb depth of knowledge or NAEP mathematical complexity cognitive demand coding schemes are predictive of item difficulty in the middle elementary grades (Ferrara et al., 2011). Noting that task models containing only item features seldom explain much of the variability in item difficulty values, Ferrara et al. (2011, p. 13) hypothesized that some of the additional variation in item difficulty is attributable to differences in opportunity to learn (OTL) the item content, and suggested that future item difficulty studies should model OTL.
Ability estimates of students’ standing on the overall mathematics achievement trait measured by each test, at the time of testing, would be anticipated to capture much of the effect of previous, aligned instruction on students’ item performance, suggesting that latent variable modeling using student-level response data would not be ideal to answer my research questions, and that modeling of observed item difficulty values might be preferred. In this study, I will take classical item difficulty (i.e., proportion-correct, “p”) values for each state-item combination as the outcome variable. Examining the item difficulty distributions for evidence of severe non-normality using histograms, skewness and kurtosis measures, and a D’Agostino-Pearson K2 test (D’Agostino, Belanger, & D’Agostino, 1990), I found that all four difficulty value distributions were appreciably non-normal. Those for the Grade 4 NAEP and TIMSS and Grade 8 TIMSS items were slightly negatively skewed and had low kurtosis, and could be rendered approximately normally distributed by a logit transformation (e.g., Cox & Snell, 1989). The Grade 8 NAEP item difficulty values, however, followed a somewhat heavy-tailed distribution that could not be normalized by any power transformation (results of a Box-Cox computation indicated that the optimal transformation exponent to normalize the distribution as nearly as possible was 1.03—essentially, no transformation).

Rather than utilize ordinary least-squares estimation and a linear regression model for the raw or logit-transformed item difficulty values, I will use maximum likelihood estimation to estimate fractional logit regression models at each grade level. The so-named “fractional logit” model (Papke & Wooldridge, 1996) is a generalized linear model with a logit link function and Bernoulli variance function that is often used in econometrics applications when the dependent variable is a proportion. Compared to traditional item difficulty models, fractional logit models are advantageous in that they do not require the distributional assumptions of ordinary least-squares regression (e.g., continuity, normality of the population error distribution) that are unlikely to be met by item difficulty values; they permit observed item difficulty values anywhere on the closed interval between 0 and 1 (including from items that all students within a state answer correctly or incorrectly, which occur occasionally in real item data) and also produce predicted values in the unit interval; and they can yield an interpretable effect size measure (Wooldridge, 2010) under assumptions that are generally more likely to be plausible than those of ordinary least squares. Since the heteroskedasticity-robust sandwich estimator for the standard error of the fractional logit regression coefficients is consistent even when the Bernoulli variance assumption fails, sandwich standard error estimates will be used for inference from these models, as recommended by Papke and Wooldridge (1996); the models will be implemented using the software Stata. Although mathematics test data from two grade levels are available, to reduce the uncertainty in interpreting statistical test results that would be caused by multiple testing, I will use the Grade 4 data from NAEP and TIMSS to investigate Research Questions 1 and 2, and the Grade 8 data from both assessments to pursue Research Question 3.
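To make this estimation strategy concrete, the sketch below shows how a fractional logit model of the kind just described could be fit in Stata. It is a minimal, hypothetical illustration rather than the study’s actual code: the variable names (p_correct for the state-item proportion correct, sec_prop for the SEC emphasis proportion, topic and cogdemand for the item classifications) are placeholders.

    * Minimal sketch (hypothetical variable names): normality check on the raw
    * difficulty values, then a fractional logit model for the state-item
    * proportion-correct outcome (Papke & Wooldridge, 1996).
    sktest p_correct                          // skewness/kurtosis normality test

    glm p_correct c.sec_prop i.topic i.cogdemand, ///
        family(binomial) link(logit) vce(robust)
    * family(binomial) with link(logit) gives the logit link and Bernoulli
    * variance function; vce(robust) requests sandwich standard errors.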
For individual state item difficulty models, predictors will include the relevant SEC curriculum content emphasis proportion, mean teacher-reported instructional emphasis on the item’s topic, and the item’s cognitive category. In overall cross-state models, I will include additional measures of state characteristics that are posited to affect mean item performance and potentially correlated with curricular emphasis proportions. Further, in the Grade 8 cross-state models, I will also add the pertinent state’s mean Grade 4 2003 NAEP scale subscore on each item’s content topic to control for prior achievement in the tested cohort. As additional external validation evidence, I will examine the anticipated positive relationship between alignment index curricular content emphasis proportions and a more proximal measure, mean teacher-reported content emphasis, estimating the correlation between content emphasis proportions and mean teacher emphasis by topic across states.

3.2.1 Models for Research Question 1

To address my first research question, regarding whether unweighted counts of curriculum objectives can be considered indicative of intended content emphasis in a particular curriculum document, I will examine validation evidence from concurrent measures: partial regression coefficients representing the unique relationship between transformed counts—proportions—in each cell of a curriculum content matrix and other variables that measure curricular content emphasis, or are expected to be positively correlated with content emphasis. The NAEP Grade 4 data contain a measure of instructional content emphasis: mathematics teachers’ self-reported ratings of their emphasis of each major content topic tested during instruction of the sampled students, making it possible to determine if there is any relationship between topic emphasis proportions from the SEC and mean reported instructional coverage of that content. For this analysis, because teachers were asked about topic emphasis, I will collapse over cognitive demand categories of the SEC matrix to generate a total emphasis proportion for each content topic. Content topic proportions will then be aggregated as necessary to correspond to the major content topics used for reporting by NAEP. Since NAEP samples students, not teachers, mean teacher ratings for each state will be computed as means of emphasis in instruction received by individual students, accounting for the sampling weights. Using NAEP Grade 4 teacher survey data for all nine states, the Pearson correlation between teachers’ mean instructional emphasis of NAEP Topic i (i = 1, 2, ..., 5) in State k, and the corresponding residualized SEC content emphasis proportion for that state will be computed. Because students’ teachers are not randomly assigned to states, prior to computing the correlation, variability in curriculum content emphasis proportions attributable to state characteristics will be removed from the proportion measure by regressing it on four principal components scores on a set of State k educational characteristics, which are described in more detail subsequently.
The computed correlation between curricular and instructional emphasis will be equivalent to the standardized coefficient α from the regression model depicted in Equation 1:

E(Y_{ik} \mid X_{ik}, \mathbf{Z}_k) = \alpha X_{ik} + \boldsymbol{\gamma} \mathbf{Z}_k, \quad (1)

where Y_{ik} is mean instructional emphasis on Topic i in State k, X_{ik} is the SEC content emphasis proportion corresponding to the Topic i row in State k, and \mathbf{Z}_k is a matrix of four principal components scores on a set of State k educational characteristics, described in detail following the presentation of Equation 2 below. If SEC curriculum content matrices are a reasonable representation of the content topic emphases in the intended curriculum, and instruction follows the curriculum, cellwise emphasis proportions should be positively correlated with mean teacher content emphasis survey responses. This analysis will also contribute to checking one of the assumptions of the third research question—that instruction largely follows the curriculum. Although the TIMSS Grade 4 data also contain a measure of instructional content emphasis (math teachers’ reports of the proportion of instructional time devoted to each of the three major TIMSS content topics), a correlation estimated from mean instructional emphasis by topic in only two states was unlikely to be stable or to have any generalizability to the US population, so it was not computed; instead, the mean instructional content emphasis measure was used as a covariate in additional analyses described subsequently.

After examining correlations between teachers’ mean content topic emphasis and curriculum content topic emphasis, I will examine the relationship between item difficulty and curriculum content emphasis, represented by the proportion of objectives in the corresponding SEC matrix cell. Because item performance is hypothesized to be affected by instruction at previous grade levels, as well as at students’ current grade level, for each state, the SEC curriculum content matrix for each tested grade level will be aggregated with the matrix for the previous grade level. Again, these aggregated curriculum content matrices will be collapsed across some topics so that content emphasis proportions correspond to the content topic categories used by NAEP or TIMSS, as appropriate. For this analysis, to retain the assumption of the SEC alignment index that cognitive demand categories are nominal, representing different types, but not levels, of required cognitive processing for tasks (which admittedly is possible only to a limited extent), while permitting comparability to the NAEP and TIMSS cognitive demand categories, I consolidate some SEC content emphasis proportions within topic, combining the cognitive demand categories Memorize, Perform Procedures, and Demonstrate Understanding, and likewise the categories Conjecture/Generalize/Prove and Solve Non-routine Problems/Make Connections, but maintain the distinction between curriculum objectives that require extended reasoning and those that do not. To estimate each item’s classical difficulty parameter, the mean response on each binary item, I will conduct a subpopulation analysis of the item response data by state that accounts for each assessment’s complex sampling design features.
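As a concrete sketch of the residualized correlation corresponding to Equation 1, the Stata fragment below assumes a topic-by-state data set with hypothetical variable names (teacher_emph for mean teacher-reported emphasis, sec_topic_prop for the SEC topic emphasis proportion, and pc1–pc4 for the state principal components scores); it is an illustration, not the study’s actual code.

    * Remove variance in the SEC topic emphasis proportions attributable to
    * state characteristics, then correlate the residuals with mean
    * teacher-reported instructional emphasis (Equation 1).
    regress sec_topic_prop pc1 pc2 pc3 pc4
    predict double sec_resid, residuals
    pwcorr teacher_emph sec_resid, sig        // correlation corresponding to alpha
    scatter teacher_emph sec_resid            // check for outlying topic-state cases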
Because not every test item is administered to each student, taking the estimated average probability of correct response to each item from the sample, with cases weighted by the sampling weights, as an estimate of statewide probability of correct response relies on systematic random sampling of students within selected classrooms to take each item, produced by the spiraling distribution of the different test booklets within each classroom. However, some students who were randomly administered a particular test item may have failed to respond, or produced an unscoreable response. Omitted items will be scored as wrong; not-reached items will be assumed to be missing completely at random (i.e., MCAR). All items that had identified correct answer(s) were included in my sample, with the exception of a small number of NAEP items that were scored as clusters due to reported high error correlations. From these item groups (2 clusters of 2–3 items each in both Grades 4 and 8), because the fractional logit model assumes item difficulty observations are independently distributed, which is not true for these items, and because the content topic and cognitive complexity were not recorded for the overall cluster response, only the first item in each cluster was included in the sample. Both NAEP and TIMSS include some open-ended items with maximum scores greater than 1. In addition, for some of the NAEP open-ended items, scores from up to three raters are reported. Throughout my analyses, the proportion of students who earn the maximum score for a fully correct response from the majority of the raters (when applicable) will be treated as the item difficulty.

To model variability in the probability of correct item response perhaps attributable to differences in curriculum exposure across the examinee population, I will use fractional logit models with the conditional mean of states’ estimated classical item difficulty values as the outcome, as shown in Equation 2,

E(Y_{jk} \mid X_{jk}, \mathbf{W}_j, \mathbf{Z}_k) = \Lambda(\alpha X_{jk} + \boldsymbol{\beta} \mathbf{W}_j + \boldsymbol{\gamma} \mathbf{Z}_k), \quad (2)

where Y_{jk} is the estimated Grade 4 item difficulty (p; proportion of fully-correct responses) for Item j in State k, X_{jk} is the SEC content emphasis proportion corresponding to the topic-by-cognitive demand cell of Item j in State k, \mathbf{W}_j is a matrix of Item j characteristics (NAEP or TIMSS classification), \mathbf{Z}_k is a matrix of State k educational characteristics, which include mean teacher-reported content emphasis measures and a set of four principal components scores, and \Lambda(\cdot) is the logistic function. In models for the TIMSS data, which are available for only two states, \mathbf{Z}_k will be replaced by a single state indicator dummy. Because the instructional content emphasis measure is hypothesized to be a potential mediator between curricular content emphasis and achievement outcomes (e.g., Travers & Westbury, 1989), it will be introduced into the model last; results both including and excluding this predictor will be reported. Most NAEP and TIMSS items are not publicly released, so item characteristics available as task model variables are limited to those found in published information, which include cognitive demand ratings and content topic codes generated by the test developers, but not linguistic feature codes.
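A hedged Stata sketch of the Equation 2 specification is shown below (NAEP version; for the two-state TIMSS models, the principal components scores would be replaced by a single state indicator). The variable names are hypothetical placeholders, and the instructional emphasis covariate is entered last, as described above.

    * Equation 2, fit without and then with the hypothesized mediator
    * (mean teacher-reported instructional emphasis on the item's topic).
    glm p_correct c.sec_prop i.topic i.complexity pc1 pc2 pc3 pc4, ///
        family(binomial) link(logit) vce(robust)

    glm p_correct c.sec_prop i.topic i.complexity pc1 pc2 pc3 pc4 ///
        c.teacher_emph, family(binomial) link(logit) vce(robust)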
To account for state-specific differences in item performance attributable to variation in state educational characteristics that were potentially correlated with states’ decisions about curricular content emphasis, means of some of these variables for each of the 50 US state populations were either obtained from the Digest of Education Statistics (NCES, 2008) or computed from the Grade 4 (or Grade 8) NAEP student background data. State means drawn from the Digest included the percentages of adults holding bachelor’s or advanced degrees in 2006, of children living in poverty in 2007, and of children suspended or expelled from public schools during school year 2005–2006, as well as median income in 2005 and per-pupil expenditures in school year 2005–2006. Means estimated from the NAEP background data, and specific to the Grade 4 (or Grade 8) student population, included the percentages of minority students, of English Language Learners, of students attending schools in rural areas or small towns, of students who were above the average age for their grade, of students eligible for the federal school meal program, of students who had transferred to their present school within the last school year, and of students who had computers at home, as well as students’ mean number of absences in the past month and score on a scale measuring the frequency with which students talked to their parents about schoolwork. State mean NAEP 2007 grade-level reading scale scores (NCES, 2013) were also tabulated. Unfortunately, some variables possibly related to both state curricular emphasis and students’ test item performance, particularly mean mathematics instructional time, were measured by the NAEP teacher questionnaires but were missing for 15% or more of students at both grade levels, with similar proportions of missing responses across states. This level of missingness, which was unlikely to be completely at random, was deemed too high to yield accurate estimates of state mean math instructional time.

Because this collection of state-level variables was not of main interest in my analysis, to reduce the number of parameters that had to be estimated in the instructional emphasis and item difficulty models, principal component analysis was used to determine the linear combinations of variables in this set that would capture most of their variability. Since the state educational variables were measured on many different scales, prior to conducting PCA, all variables were standardized. Because one state, Alaska, had suppressed student responses to most items in the NAEP background questionnaires, yielding an incomplete raw data matrix of state variable means, rather than conducting PCA of the raw data matrix, I analyzed the EM-estimated covariance matrix (in this case, a correlation matrix) that could otherwise provide starting values for imputation using a multivariate normal regression model, as suggested by Truxillo (2005). The first four eigenvalues of the estimated correlation matrix for the state fourth- (and also eighth-) grade population means were greater than one (e.g., Jolliffe, 2002), the average value of the matrix’s eigenvalues.
(The first four eigenvalues of both observed correlation matrices, omitting Alaska from the dataset, were also greater than one, and principal components scores calculated from weights determined by PCA of these matrices were each correlated in excess of .99 with their corresponding scores obtained from analysis of the EM-estimated correlation matrix in the 49 states with complete data.) The first four principal components explained more than 75% of the variance in the original variable sets for both grade levels. For both grade levels, PCA results showed that the first two eigenvalues of the correlation matrices were greater than two, and the first two principal components had interpretable patterns of weights, while the third and fourth principal components had weights that would result in somewhat less meaningful scores.

The first principal component had similar patterns of weights at both grade levels. At both grade levels, Component 1 had weights with absolute values greater than .3 for five variables: the percentage of children living in poverty, of adults holding at least a bachelor’s degree, of students eligible for the federal school meal program, and of students with a computer at home, and the mean NAEP 2007 reading score; the Component 1 scores could be interpreted as measuring the mean household SES of students in each state, with scores decreasing as mean SES increases. In the fourth-grade data, Component 2 had weights greater than .3 for four variables: the percentage of minority students, of English Language Learners, of students attending schools in rural areas or small towns, and of over-age students; the Component 2 scores could be interpreted as increasing with the heterogeneity of a state’s student population. In fourth grade, Component 3 loaded most highly on four variables: the percentage of transfer students, of children living in poverty, and of children suspended or expelled from school, and the mean number of absences from school; Component 3 scores may capture elements of a state’s school disciplinary climate, and they increase with the percentage of suspensions and expulsions. In fourth grade, Component 4 loaded most heavily on three variables: expenditures per pupil, the percentage of over-age students, and the mean number of school absences; Component 4 scores primarily measure per-pupil expenditures, and increase with expenditures, but also with mean absences.

In the eighth-grade data, the same variables weighted highly on Component 2 as in the fourth-grade data, but their signs were in opposite directions, so the Component 2 scores should be interpreted as increasing with the homogeneity of a state’s eighth-grade student population. Component 3 loaded most heavily on the mean number of absences, the percentage of Grade 8 transfer students, and the mean frequency of discussing school at home; Component 3 scores appeared to primarily measure student mobility, and increased with student mobility. In eighth grade, Component 4 again loaded most heavily on per-pupil expenditures, followed by the percentages of English language learners, transfer students, and students suspended or expelled from school; Component 4 scores gave per-pupil expenditures the largest positive weight, but the percentage of suspended/expelled students also received a sizable positive weight.
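The component scores used as controls could be generated along the lines of the hypothetical sketch below. Stata’s pca command analyzes the correlation matrix by default, so the standardization described above is implicit; the EM estimation of the correlation matrix for the incomplete Alaska case is omitted from this sketch, and all variable names are placeholders for the Digest and NAEP background means listed earlier.

    * State-level educational characteristics reduced to four components.
    pca pct_poverty pct_bachelors pct_schoolmeal pct_computer mean_read   ///
        pct_minority pct_ell pct_rural pct_overage pct_transfer           ///
        mean_absences talk_school exp_per_pupil median_income pct_suspended, ///
        components(4)
    screeplot                                  // eigenvalues-greater-than-one check
    predict pc1 pc2 pc3 pc4, score             // component scores saved as controls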
To retain capacity to account, in the research models, for differences in state educational climates that could affect both curriculum development and students’ academic outcomes, state scores on all of the first four principal components were computed, and those from the 11 states with SEC curriculum content emphasis data were saved for use as control variables in the item difficulty models. Because the magnitude of the relationship between curricular emphasis measures and item difficulty may vary substantially across content topics or across states, given the exploratory nature of this study, after obtaining the cross-state results, I will estimate the model within states, and within content topics, dropping the state-specific principal components scores or content topic indicators, respectively, from the model.

3.2.2 Models for Research Question 2

To address my second research question, I will consider the change in the estimated relationship between the SEC content emphasis proportion variable and Grade 4 NAEP or TIMSS item performance when the emphasis proportions account for content at or above the cognitive demand level of a particular item, rather than only content at the cognitive demand level of the item. That is, items in the consolidated cognitive demand category that is hypothesized to represent a lower level of demand will be assigned the total curriculum content emphasis proportion for their topic, including the quantity from coverage of hypothesized more-demanding curriculum objectives in the Conjecture/Generalize/Prove and Solve Non-routine Problems/Make Connections categories. Then, simple and multiple regression coefficients for the SEC content emphasis proportion variable as a predictor of state-item difficulty values will be re-estimated by topic, and overall. Because these estimates are not independent of (and in fact are highly dependent on) those obtained previously, it will not be possible to perform formal statistical tests of the differences, and so the differences will be inspected and described qualitatively. If cognitive demand categories are ordered, instruction follows the curriculum, and instruction on more demanding content related to the same topic benefits performance on items with less demanding content, I expect that the correlations should increase in the positive direction.

3.2.3 Models for Research Question 3

My third research question asks if differences in test-curriculum alignment can explain any of the cross-state variability in students’ item performance on NAEP or TIMSS at Grade 8. Gamoran et al. (1997) suggested that the effect of instruction-test alignment (the “configuration of coverage”) on achievement gains should depend on the amount of instructional time devoted to tested topics (the “level of coverage;” pp. 330–331). They recommended using the product of content emphasis and alignment, operationalized using an SEC-type measure, to predict achievement gains. D’Agostino et al. (2007) modeled instruction-test alignment relationships with students’ state math test scores using indicators of instruction-test alignment, content emphasis, and the product of these two variables. I will combine these two approaches.
I will use a measure of cellwise topic-by-cognitive demand alignment based on SEC content analyses of state curriculum documents and the NAEP and TIMSS 2007 tests, given in Equation 3,

1 - \left| \pi_{X_{i,j}} - \pi_{Y_{i,j}} \right|, \quad (3)

where \pi_{X_{i,j}} denotes a cell proportion in a state curriculum content matrix X and \pi_{Y_{i,j}} denotes the corresponding cell proportion in the NAEP or TIMSS content matrix Y. This measure will be bounded between 0 and 1, inclusive, with higher values intended to indicate better alignment between the coded assessment and curriculum document for a particular content cell. I will also include as predictors the emphasis proportion for that content cell from the curriculum document, and the interaction between alignment and curriculum content emphasis.

An alternative measure of alignment is potentially available in the TIMSS data. TIMSS conducted a “test-curriculum matching analysis” during which one or more persons familiar with each particular jurisdiction’s intended curriculum determined whether each TIMSS math test item was covered in the curriculum at that grade level (for more than half of the students in the jurisdiction, if the intended curriculum varied), or not (Mullis, Martin, & Foy, 2008, p. 439). The judge(s) from Massachusetts determined that all the math items in the TIMSS Grade 8 test were intended to be taught at that grade level, while the judge(s) from Minnesota found that the content of 2 items would not have been covered at that grade level by instruction that followed their state curriculum. Due to the very limited variability on this binary matching measure, I will not endeavor to compare it to the proportions underlying the SEC alignment measure.

To model the conditional mean of Grade 8 NAEP or TIMSS state item difficulty, predictors will include alignment and content emphasis indicators from the SEC data, mean teacher-reported instructional emphasis on the item’s topic, and a set of item characteristics indicators, as shown in Equation 4,

E(Y_{jk} \mid X_{jk}, U_{jk}, \mathbf{W}_j, \mathbf{Z}_k) = \Lambda(\alpha X_{jk} + \tau U_{jk} + \theta X_{jk} U_{jk} + \boldsymbol{\beta} \mathbf{W}_j + \boldsymbol{\gamma} \mathbf{Z}_k), \quad (4)

where Y_{jk} is the estimated Grade 8 item difficulty (p; proportion of fully-correct responses) for Item j in State k, X_{jk} is the SEC content emphasis proportion corresponding to the topic-by-cognitive demand cell of Item j in State k, U_{jk} is the SEC alignment (to NAEP or TIMSS) measure corresponding to the topic-by-cognitive demand cell of Item j in State k, \mathbf{W}_j is a matrix of Item j characteristics (NAEP or TIMSS classification), \mathbf{Z}_k is a matrix of State k educational characteristics, which include the NAEP 2003 Grade 4 scale subscore corresponding to the topic of Item j in State k and a set of four principal components scores, and \Lambda(\cdot) is again the logistic function. Neither NAEP nor TIMSS collected information from Grade 8 teachers regarding instructional emphasis by topic, so this potential mediator between curricular emphasis and item difficulty cannot be modeled. Analogously to the state educational characteristics data for the Grade 4 population, the matrix of State k mean educational characteristics for the Grade 8 population is reduced to a matrix of principal components scores, as detailed previously.
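For concreteness, the cellwise alignment measure in Equation 3 and the interaction model in Equation 4 could be implemented as sketched below. This is a hypothetical illustration with placeholder variable names (pi_curr and pi_test for the curriculum and assessment cell proportions, naep2003_sub for the prior-achievement subscore), and the mean-centering shown reflects the treatment described for the within-topic and within-state models.

    * Equation 3: cellwise curriculum-test alignment for each topic-by-demand cell.
    gen double align = 1 - abs(pi_curr - pi_test)

    * Mean-center emphasis and alignment before forming the interaction.
    summarize sec_prop, meanonly
    gen double c_emph = sec_prop - r(mean)
    summarize align, meanonly
    gen double c_align = align - r(mean)

    * Equation 4: fractional logit with the emphasis-by-alignment interaction,
    * item characteristics, prior achievement, and state component scores.
    glm p_correct c.c_emph##c.c_align i.topic i.cogdemand ///
        c.naep2003_sub pc1 pc2 pc3 pc4,                   ///
        family(binomial) link(logit) vce(cluster item_id)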
If the written curriculum has little overlap across years (Smithson & Collares, 2007), and instruction corresponds well to the curriculum, so that the product of test-curriculum alignment and content emphasis measures new, recent opportunities to learn the test content, this interaction should be more predictive of achievement gains than of cross-sectional achievement scores. In the cross-state models, I use each state’s mean Grade 4 2003 NAEP score as a pretest math achievement measure for the tested cohort. Because the importance of the alignment-emphasis interaction for explaining variation in test item difficulty may vary across content topics or alignment data collection events, after obtaining cross-state results for both NAEP and TIMSS, I will estimate reduced versions of the model within content topics, and within states, dropping either the topic indicators from the matrix of Item j characteristics, or the State k principal components scores. To reduce collinearity of the predictors within each of these state-item subpopulations, the raw content emphasis and alignment variables were mean-centered before creating each interaction term. If the SEC alignment index is functioning well, jointly with cellwise content emphasis, the cellwise curriculum-test alignment underlying the index should predict student performance on corresponding test items. This analysis bears on the validity of the SEC index as a measure of test content representativeness to the extent that instruction is aligned with a state’s curriculum and other assumptions, detailed in the next section, hold.

3.3 Assumptions about the US Elementary Education System

Measures of test alignment to a particular curriculum do not address student engagement, pedagogical approaches, or instructional quality (McMaken & Porter, 2012), and their interpretation does not require assumptions about these classroom features, but interpreting results of my models as empirical evidence for validity requires these assumptions. My analysis assumes the average tested student is motivated to engage in learning during classroom instruction, and to solve the NAEP or TIMSS items correctly. One experiment that offered students completing NAEP items a monetary incentive, awarding examinees $1 for each correct response, found only a small, although statistically significant, effect on mean scores among eighth-graders (O'Neil, Sugrue, & Baker, 1996). To assess how reasonable this assumption is regarding NAEP 2007 scores, I will report descriptive statistics for two variables, by state: students’ mean self-reported level of effort on the test, and their feelings about the importance of “succeeding” on NAEP.

Instruction must be aligned with the curriculum in order to produce gains in student achievement on tests that measure curricular objectives (e.g., La Marca et al., 2000). Due to federal school accountability testing in Grades 3–8, which required tests to be aligned to state standards, it would be expected that the taught, or implemented, curriculum closely follows the written, or intended, curriculum, at least relative to magnitudes of alignment typical in previous decades. I assume that state curricula were sufficiently stable over the relevant time period so that teachers might have been familiar with the curriculum objectives, understood them in the manner intended by the state, and have been able to provide instruction targeting them.
If instruction targets specific features of state achievement test items (recurring content or formats), rather than curriculum objectives more broadly (e.g., Koretz, 2008), apparently minor differences between the features of state test items and the national assessments’ items could influence state mean item performance on the national assessments. To allow cross-state comparability of the results, I assume not only that instruction has followed the curriculum, but also that the degree of instruction-curriculum alignment is similar across states. For test scores to provide “actionable information” about student learning, test items must be sensitive to instruction; specifically, item difficulty should be a function of exposure to relevant instruction (Mislevy & Zwick, 2012, p. 150). Muthén et al. (1991) showed that the probability of correct response for some eighth-grade mathematics items in the Second International Mathematics Study, particularly items that required definitional knowledge or represented “early stages of learning about selected mathematical topics” (p. 18), depended significantly on whether or not students had received relevant instruction, as reported by their teachers. I will assume that NAEP and TIMSS items are sensitive to differences in content coverage among topics within and across states.

3.4 Assumptions of the Statistical Models

To assess the plausibility of the assumptions of generalized linear models for each item difficulty population (e.g., Breslow, 1996; Gill, 2001), I consider qualitatively a series of diagnostic plots and statistics generated prior to or following estimation of each model. Along with the primary results from regression modeling—model coefficients, effect size measures and model fit statistics—I will report evidence of any serious violation of the models’ assumptions, and, when possible, use alternative model specifications to probe the robustness of the results. Scatterplots graphing model predicted and response residual values will be inspected, with any evident patterns in the scatterplot taken to suggest some form of model misspecification: an inappropriate link function or omitted variables (Gill, 2001). Linearity of the relationships between the logit of the expected difficulty values and each predictor, assumed by all models in this study, will be examined by plotting continuous predictor and response residual values for each model, with any patterns of curvature taken to suggest that higher-order predictor terms should be considered for inclusion in the model (Breslow, 1996). To check for evidence of serious collinearity among item- and/or state-specific predictor variable values, prior to each regression analysis, the variables’ correlation matrix will be checked, and the variance inflation factor (VIF) value for each predictor will be computed; maximum VIF values exceeding 10 will be reported as a sign that the sample size may be insufficient to obtain precise estimates of some of the regression coefficients.

In analyses, I assume that all sample elements are members of the state-item population of interest. States that had very recently altered their curriculum documents, so that no causal relation between their current curriculum content and item difficulty could be posited, were excluded from the sample. With access to only the released NAEP and TIMSS items, I rely on the two assessments’ frameworks to ensure that the items were relevant to achievement in elementary school mathematics.
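The residual and collinearity checks described above might be carried out as in the hypothetical sketch below, following one of the fractional logit fits sketched earlier. The variable names are placeholders, and the auxiliary VIF check uses a linear regression of the logit-transformed difficulties, which is undefined for difficulty values of exactly 0 or 1.

    * Residual diagnostics after a fitted glm (fractional logit) model.
    predict double mu_hat, mu                  // predicted proportion correct
    gen double r_resp = p_correct - mu_hat     // response residuals
    scatter r_resp mu_hat                      // look for systematic patterns
    scatter r_resp sec_prop                    // curvature check for a predictor

    * Collinearity check via an auxiliary linear model and VIFs for the
    * continuous predictors.
    gen double logit_p = logit(p_correct)
    regress logit_p sec_prop teacher_emph pc1 pc2 pc3 pc4
    estat vif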
Observations that have a particularly large influence on the predicted item difficulty values will be identified by computing the Cook’s D statistic, which quantifies the change in the predicted values produced by deletion of an observation, for each case. Further, observations that have an outsized influence on the effect sizes for the “SEC proportion of curriculum objectives” and “SEC cellwise alignment measure” variables of main interest in this study will be detected by computing approximate DFBETA statistics for each case using linear models of logit-transformed item difficulty values. Characteristics of item cases that have D values greater than 4/n and/or DFBETA values greater than 2/√n (Bollen & Jackman, 1990) will be inspected, and those cases will be evaluated for possible exclusion from the sample. The assessment items are assumed to be representative, if not random, samples from the elementary school mathematics content domains of interest. The state sample is also assumed to be a representative, although not random, sample from the population of US states that had well-established curriculum documents, with each state receiving equal weight; the extent to which this assumption is reasonable will affect the generalizability of the results.

Generalized linear models assume that observations are statistically independent. This assumption may not be plausible for the analysis units in this study, item difficulty values, which are nested within, and likely to exhibit some degree of dependence within, items and states. To quantify the extent to which this assumption is violated by the item difficulty values in each of the two assessment datasets for each grade level, I computed the intraclass correlation coefficient “2,1” from Shrout and Fleiss (1979) within state, and within item, for each dataset. Overall, the intraclass correlation estimates shown in Table 4 indicate very high correlation of item difficulty values within items, as would be expected, and fairly minor correlation of item difficulty values within states; all of the intraclass correlations were statistically significant at the .05 level. To obtain regression coefficient standard error estimates that are corrected, usually upward, for the effect of the non-independence of observations, a sandwich standard error estimator can be utilized (e.g., Breslow, 1996). However, these standard errors are unbiased only asymptotically as the number of clusters approaches infinity; with small numbers of clusters, they tend to be biased downward, and yield test statistics that do not follow a known distribution (Donald & Lang, 2007). Noting that the number of clusters that should be viewed as “too small” for large-sample inference using the sandwich standard error estimator depends on data features such as disparity in cluster sizes, Cameron and Miller (in press) suggest that 50 clusters may often be inadequate, and 20 clusters will usually be too few. Multilevel modeling, another strategy to account for lack of independence among sample observations, similarly requires a large number of clusters to produce stable results (Raudenbush & Bryk, 2002). In this study, there are only nine or ten states at each grade level, but there are at least 100 items for both assessments at each grade level. Within content topics, the number of items ranges between about 15 and 20, as shown in Tables 1–3.
In theory, the item cluster standard error estimator should perform well since the total numbers of items are large, and should produce standard error estimates that are more conservative than the heteroskedasticity-robust standard error estimates, because the proportion of variance in item difficulty that is between items is substantial. The number of states in this study is probably far too small for the state cluster standard error estimator to produce unbiased estimates. To address the non-independence of sample observations, I will compute standard error estimates robust to misspecification of the variance function for the fractional logit model, as recommended by Papke and Wooldridge (1996), standard error estimates adjusted for clustering (non-independence) of the item difficulty outcome variable values by item, and standard error estimates adjusted for clustering of the item difficulty values by state. I will comment on any differences among these standard error estimates, and report the most conservative standard errors, and p values for the corresponding test statistics.

TABLE 4
Intraclass Correlations of Item Difficulty Values within States and Items, by Data Set

Data Set         ICC Within State    ICC Within Item
NAEP Grade 4          0.024               0.943
NAEP Grade 8          0.038               0.927
TIMSS Grade 4         0.026               0.925
TIMSS Grade 8         0.016               0.939

Note. ICC = intraclass correlation.
Sources. National Assessment of Educational Progress 2007 (Restricted-Use); Trends in International Mathematics and Science Study 2007.

As an index of the overall explanatory power of each model, the value of an R2 analog for generalized linear models, the squared correlation between observed and predicted response values, the population value of which equals the average proportion of variance explained by the predictors (Zheng & Agresti, 2000), will be reported, along with its 95% confidence interval. Because predicted values from each model depend on the observed sample values, this R2 statistic will tend to be biased slightly upward. Zheng and Agresti suggest resampling techniques to create a confidence interval for the R2, so confidence intervals will be estimated using a nonparametric bias-corrected and accelerated bootstrap method (Efron, 1987). Checking that neither the upper nor the lower bound of the confidence interval changes by more than .01 when computed from bootstrap procedures using five different seed numbers and several of the models with the smallest sample sizes—within-topic models from the TIMSS Grade 8 data—suggests that 1,000 bootstrap replications will produce reasonably stable estimates of the R2 confidence interval. For the cross-state models, which have the largest sample sizes, the confidence interval bounds will generally change by less than .003 across 1,000-replication bootstrap procedures using randomly-selected seed numbers.

3.5 Interpretation

I conceive of each cross-state regression analysis using the NAEP or TIMSS data that is interpreted as a primary result addressing Research Question 1 (Grade 4 data) or 3 (Grade 8 data) as essentially a meta-analysis of a collection of independent, equally-weighted within-state studies, using the raw data rather than an effect size measure for each. I will use similar models to replicate my analyses using the two assessments’ data for each research question, however.
Following analysis, for each of the two sets of results for each research question, I will identify a single preferred model for interpretation, which should typically be the most complete analyzed model—a total of four models for statistical hypothesis testing. For generalized linear models, Stata computes a z test statistic that is the square root of the corresponding Wald chi-square statistic for each predictor. In each model, there will be either one or two alignment-index-related predictors of main interest. For Research Question 1, I will be most interested in the coefficient for the proportion of curriculum objectives relevant to a given item’s topic and at that item’s cognitive demand level. For Research Question 3, I will be primarily interested in the coefficient for the posited interaction between the proportion of the curriculum objectives and the cellwise alignment measure; in the case that the interaction is non-significant, I will plan to judge the significance of the main effects of the proportion of objectives and the alignment measure.

Because I will conduct statistical significance testing for two data sets at each grade level, asking similar research questions, it seems that some adjustment of the significance levels for multiple testing is needed, although it is unclear which tests should be identified as constituting a “family” of interrelated tests. Starting from a desired rate of Type I error of .05 for each test on a coefficient of main interest for each research question, I will adjust the significance level for each pair of tests (NAEP and TIMSS data) by research question. Since I am applying the adjustment within research question, rather than simultaneously across all tests, I will use a Bonferroni correction, which is relatively conservative, as well as easy to implement. Thus, for each of the four tests, I will use a significance level of .025 to judge statistical significance of each variable of main interest as a predictor of test item difficulty. For Research Question 3, if the interaction term in either model is non-significant, I will use a significance level of .0125 to judge the significance of the curriculum objectives proportion and alignment variables. In all models, I will also report whether each predictor reached the three unadjusted benchmark significance levels of .05, .01, and .001, as is conventional, but will not use those results to draw conclusions about the alignment-related predictors of main interest.

Obtaining effect size measures may permit more detailed conclusions about the extent to which results from the four primary models, and the separate within-state and within-topic models described previously, support or fail to support the validity of the SEC alignment measure. Unlike the exponentiated coefficients from a typical logistic regression model with a binary dependent variable, the exponentiated coefficients from a fractional logit model cannot be interpreted as odds ratios. Instead, as an effect size measure, I will report the average marginal effect (AME, or “average partial effect”) of each covariate on the outcome, as suggested by Wooldridge (2010, p. 750).
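For example, after one of the fractional logit fits, the AMEs described here (and defined formally in the next paragraph) can be obtained with Stata’s margins command. The sketch below is a hypothetical illustration with placeholder variable names, not the study’s actual code.

    * Average marginal effects on the proportion-correct scale, following a
    * fractional logit fit with the emphasis-by-alignment interaction.
    glm p_correct c.c_emph##c.c_align i.topic i.cogdemand pc1 pc2 pc3 pc4, ///
        family(binomial) link(logit) vce(cluster item_id)
    margins, dydx(c_emph c_align)              // AMEs averaged over the sample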
The marginal effect of a binary predictor on the expected value of the item difficulty outcome for a particular item-state observation is the difference between the predicted values of the item difficulty, given the unobserved potential and observed values of the binary variable for that observation, the observed values of the other predictors, and the estimated model coefficients. The marginal effect of a continuous predictor on the outcome, for a particular item-state observation, is the partial derivative of the expected value of the outcome with respect to that predictor for that observation. The AME of a predictor, for linear or nonlinear models, is the mean of the marginal effects, taken over all observations in the sample, and indicates the change in the value of the outcome expected for a one-unit increase in that predictor (Wooldridge, 2010). The AME for a generalized linear model is computed from predicted values given by the inverse link of the linear predictions (which, for fractional logit models, are on the logit scale); thus, for fractional logit models the AME is on the same proportion scale as the observed outcome values. For general linear regression models, the AME for a predictor is the partial regression coefficient. The AME calculation for a predictor in a generalized linear model, such as those used in this study, generalizes the linear regression coefficient to provide an interpretable effect size measure.

As a second type of external validation evidence for the SEC alignment index, addressing Research Question 1, I will examine the relationship between state-level curricular emphasis and instructional emphasis in 2007. To judge the size of the correlation between SEC curricular content emphasis proportions and teacher-reported mean instructional content emphasis, by topic, across states, I will refer to the descriptive terms suggested by Cohen (1992). Because outlying cases may be particularly problematic for the stability of a correlation coefficient estimated from the small sample available for this study, I will also present a scatterplot of the data.

The limited evidence available to support conclusions regarding Research Question 2 suggests that only weak inferences will be possible. Nevertheless, to determine whether the relationship between the proportion of curriculum objectives on a given item’s topic at or above the item’s cognitive demand level, rather than the proportion of curriculum objectives corresponding to each item’s topic-cognitive demand combination, and test item performance should be used as evidence for the validity of the SEC alignment index, addressing Research Question 2, I will qualitatively compare the relative magnitude of the regression coefficient for the curriculum objectives proportion to that from the corresponding Research Question 1 model, which differs only by substitution of an alternate measure for one predictor—the curriculum proportion. I will also make note of the Bayesian information criterion (BIC) values, an indicator of model plausibility appropriate for comparison of non-nested models like these containing different variants of the curriculum objectives proportion measure.

This chapter described the data, statistical models, and planned analytic strategies to address my three research questions. Necessarily, particular analytic decisions or robustness checks indicated by empirical results from the item difficulty modeling have not been presented here, but will be reported and justified in the next chapter.
CHAPTER 4: RESULTS
This chapter presents primary and ancillary findings regarding the three research questions of interest in this study. The chapter is organized by research question, and contains many results displays. The section for each research question begins with a summary of the findings pertinent to that question; discussion of the findings is reserved for the next chapter. For both the NAEP and TIMSS assessment datasets, model results will first be presented overall, and then by content topic and state for Research Questions 1 and 3. Generally, each table will be presented following its first mention in the text. In keeping with agreements to anonymize alignment findings for some states, states will be identified by generic labels, which correspond across assessments only within a particular grade level. Before proceeding to interpret results of the data analyses, I evaluate the extent to which assumptions of the statistical models appear to have been satisfied by the population of state-item observations that were analyzed. The major threats to unbiased estimation of the regression coefficients in this study were the possibility of important omitted variables potentially correlated with both state-level curricular emphasis and student performance on the standardized test items, the possibility of random measurement error in the predictor variables, and, in the Grade 4 NAEP data, the occurrence of a small fraction of influential cases that caused notable changes in the estimates. Plots of each predictor against the response residual values from each model did not show any nonlinear patterns that would suggest that the functional form of a particular predictor should be reconsidered. Natural clustering of the item difficulty outcome variable by state and by test item violated the fractional logit models' assumption that errors are independently distributed, yielding standard errors for the regression coefficients that will tend to be underestimated. Further, given the modest sample sizes used to estimate the item difficulty models and relatively large correlations among the predictors, multicollinearity hindered the ability to separately estimate all coefficients of interest in some of the models. Two types of adjustments to the standard errors to deal with the effects of item and state clustering were entertained. Cluster-robust standard errors based on states as the source of non-independence of observations were considerably smaller than the initial (robust) standard errors for all variables in the cross-state models; because they were consistently more liberal than the baseline estimates, these standard error estimates are not reported. Cluster-robust standard errors with test item as the source of non-independence of observations were larger than heteroskedasticity-robust standard errors for item characteristics variables in the cross-state models, and, in theory, should be preferred over the standard error estimates that treat observations as independent. In results tables for the cross-state models, which are of main interest for statistical hypothesis testing and have a sufficiently large number of distinct item clusters, I will report item-cluster-robust standard errors. In results tables for within-topic or within-state models, since the numbers of item or state groups are quite small, I will report the heteroskedasticity-robust standard errors that treat observations as independent.
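The two covariance estimators discussed above can be obtained from the same fitted model. The sketch below is illustrative only, reusing the hypothetical file and column names from the earlier sketch (an outcome p_correct, a small set of predictors, and an item identifier item_id); it is not the analysis code for this study.

```python
# Sketch: heteroskedasticity-robust vs. item-cluster-robust standard errors
# for the same fractional logit fit (hypothetical file and column names).
import pandas as pd
import statsmodels.api as sm

df = pd.read_csv("state_item_data.csv")
y = df["p_correct"]
X = sm.add_constant(df[["prop_objectives_x10", "item_complexity", "instr_emphasis"]])
model = sm.GLM(y, X, family=sm.families.Binomial())

# treats state-item observations as independent; robust to non-constant variance only
res_hc = model.fit(cov_type="HC1")

# allows arbitrary correlation among observations that share the same test item
res_cluster = model.fit(cov_type="cluster", cov_kwds={"groups": df["item_id"]})

print(pd.DataFrame({"HC1 SE": res_hc.bse, "item-cluster SE": res_cluster.bse}).round(3))
```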
As explained in the previous chapter, I do not intend to rely on hypothesis testing to interpret the exploratory results from these within-topic and within-state models, so my choice between two standard error estimators that require different, inappropriate assumptions—independent model errors, or a large number of clusters—should not affect my conclusions.
4.1 Research Question 1: Are Counts of Curriculum Objectives a Valid Measure of Curricular Emphasis?
Research Question 1 evaluates two forms of external validation evidence for the SEC alignment index. If proportions of curriculum objectives classified into particular content topic and cognitive demand domains quantify intended curricular emphasis in a state, and other assumptions about the educational system explicated in Chapter 3 hold, they are expected to be positively correlated with other measures of curricular emphasis beyond a chance level. I first examine the relationship between SEC curricular content analysis proportions and a relatively proximal measure of curricular emphasis: state mean instructional content emphasis ratings by broad mathematics topic, reported by teachers of students in the NAEP sample. These mean instructional content emphasis ratings are taken to represent the implemented curriculum. I then consider associations between SEC curricular content analysis proportions and a set of more distal measures of curricular emphasis: the "attained" curriculum expressed by mean student performance on large-scale mathematics achievement test items. Results of my analyses indicate a strong, positive, statistically-significant linear relationship between Grade 4 curricular content emphasis proportions and mean instructional emphasis ratings when proportions under broad content topics are consolidated to correspond to the teacher questionnaire categories. Further results suggest that there are statistically significant positive relationships between the average proportion of curriculum objectives corresponding to a particular item's topic-cognitive demand combination in Grades 3 and 4, and classical item difficulty in both the NAEP and TIMSS fourth-grade data; that is, as the proportion of curriculum objectives increases, it is projected that a greater proportion of students will answer the item correctly. However, the size of the average marginal effect is very small: as shown in Model 2 of Tables 5 and 8, only a .013 increase or a .037 increase in NAEP or TIMSS proportion-correct item difficulty values, respectively, would be expected to follow a 10 percentage-point increase in the proportion of curriculum objectives corresponding to a particular item's topic-cognitive demand combination, all else held fixed. Estimation of the models within content topic reveals that the strength of the relationship between the proportion of objectives measure and item difficulty varies by content topic; for items in some topic areas, there is no apparent relationship between difficulty and the proportion of objectives, although differences in major content categories of the two assessments and limited numbers of items within each make it difficult to compare the NAEP and TIMSS results by topic. Item difficulty models analyzed within states suggest that small positive associations between the proportion of objectives covering a topic and item performance are common across states, and that results of the cross-state significance test are not being driven by unique correlation patterns in one or a few states.
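The within-topic and within-state models summarized above amount to re-fitting the cross-state specification on subsets of the state-item observations. A rough sketch of that workflow, again with hypothetical column names and a reduced predictor set, is shown below; it is not the study's analysis code.

```python
# Sketch: re-fitting the fractional logit model within each content topic and each state,
# returning the AME of the proportion-of-objectives measure for each subset
# (hypothetical column names; these exploratory fits are read as effect sizes).
import pandas as pd
import statsmodels.api as sm

df = pd.read_csv("state_item_data.csv")
predictors = ["prop_objectives_x10", "item_complexity", "instr_emphasis"]

def ame_of_proportion(sub):
    X = sm.add_constant(sub[predictors])
    res = sm.GLM(sub["p_correct"], X, family=sm.families.Binomial()).fit(cov_type="HC1")
    mu = res.predict(X)
    return res.params["prop_objectives_x10"] * (mu * (1 - mu)).mean()

ame_by_topic = df.groupby("topic").apply(ame_of_proportion)
ame_by_state = df.groupby("state").apply(ame_of_proportion)
print(ame_by_topic.round(3), ame_by_state.round(3), sep="\n")
```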
4.1.1 NAEP Grade 4
The first type of validation evidence I considered was the correlation between the cellwise curricular emphasis measures underlying the SEC alignment index and mean instructional coverage ratings from teachers. Teachers of the fourth-grade NAEP examinees were asked about the extent to which their instruction emphasized numbers and operations, measurement, geometry, data analysis, and algebra and functions. Emphasis was reported on a 3-point scale indicating "no," "moderate," or "heavy" emphasis of each content topic. The relationship between the SEC proportion of curriculum objectives measure, condensed across cognitive demand categories, and state mean instructional emphasis by topic is illustrated by the scatterplot in Figure 1. Figure 1 suggests a fairly strong linear relationship between the proportion of curriculum objectives and instructional coverage by topic in the nine states. Both reported Grade 4 instruction and Grade 4 curriculum documents place the greatest emphasis on number properties and basic operations, although the fraction of the curriculum devoted to numbers and operations objectives varies widely by state. Generally, fourth-graders' teachers report giving heavy emphasis, on average, to number properties and operations, and moderate-to-heavy emphasis to the other four major content strands. They tend to give the least emphasis to data analysis objectives, as appears to be intended by most states' curriculum documents. In most states, students' teachers report giving somewhat greater weight to algebra instruction, on average, than would be projected by the proportion of objectives targeting that topic. The correlation between the state-specific residualized SEC proportion of objectives and the state mean instructional emphasis in nine states was 0.78 (p < .001), which is a "large" positive correlation according to Cohen's (1992, p. 157) criteria.
FIGURE 1. Scatterplot of Proportions of Curriculum Objectives and Mean Instructional Emphasis by Mathematics Content Topic for Nine Unidentified States, with Ordinary Least-squares Regression Line.
As a second external source of validation evidence, I modeled the relationship between the cellwise curricular emphasis measures underlying the SEC alignment index and proportion-correct item difficulty values for fourth-graders in nine states on the NAEP and TIMSS 2007 assessments. As shown in Table 5, the results for Model 2, the preferred cross-state model for the Grade 4 NAEP item difficulties, indicate that there is a marginally statistically-significant, but very small, positive relationship (p = .025) between the proportion of curriculum objectives on a particular item's topic and at that item's cognitive demand level in a particular state, and proportion-correct item difficulty. Controlling for differences in instructional emphasis and other variables, a 10 percentage-point increase in the proportion of curriculum objectives corresponding to a particular item's topic-cognitive demand combination would be expected to produce only a .013 increase in NAEP item proportion-correct.
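The correlation and scatterplot behind Figure 1 can be reproduced with standard tools. The sketch below computes a plain Pearson correlation between topic-level curriculum proportions and state mean instructional emphasis and draws the corresponding scatterplot; the file and column names are hypothetical, and the sketch omits the state-specific residualization used for the reported value of 0.78.

```python
# Sketch: correlation between SEC curriculum proportions and mean teacher-reported
# instructional emphasis, by topic and state (hypothetical long-format data).
import pandas as pd
from scipy.stats import pearsonr
import matplotlib.pyplot as plt

emph = pd.read_csv("topic_emphasis_by_state.csv")   # columns: state, topic, sec_prop, mean_instr_emph
r, p = pearsonr(emph["sec_prop"], emph["mean_instr_emph"])
print(f"r = {r:.2f}, p = {p:.3f}")                   # size judged against Cohen's (1992) benchmarks

ax = emph.plot.scatter(x="sec_prop", y="mean_instr_emph")
ax.set_xlabel("Proportion of curriculum objectives on topic")
ax.set_ylabel("State mean instructional emphasis (1-3)")
plt.show()
```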
Comparing the Model 1 and Model 2 estimates for the AME of the proportion of curriculum objectives on item difficulty, displayed in Table 5, the AME is adjusted slightly downward by addition of the teacher-reported instructional emphasis variable to the model, but a unique relationship between state-level curricular emphasis and item difficulty remains even after accounting for differences in instructional emphasis on test items' topics. State mean instructional content emphasis, which each NAEP examinee's math teacher reported on a 3-point scale, is positively related to proportion-correct item difficulty. Item mathematical complexity, low mean state socioeconomic status (principal component 1 scores), and heterogeneity of states' Grade 4 student populations (principal component 2 scores) are negatively related to proportion-correct item difficulty, suggesting that these three control variables are functioning as would be anticipated; directions of coefficients for the other control variables would have been difficult to predict in advance. The Model 3 results that appear in Table 5 will be discussed in the section pertaining to Research Question 2. Sorting DFBETA values by size, I identified one NAEP item that was influential across all states, except Vermont, on the estimated AME of the proportion of curriculum objectives on NAEP item difficulty. This item was a High-complexity Number Properties and Operations item. When the nine state observations on this item were dropped from the cross-state analysis, as shown in appendix Table A4 (which corresponds to the display in Table 5), the AME of the proportion of objectives on NAEP item difficulty was noticeably reduced to .009, although the coefficient was still statistically significant at the .025 level (p = .018). This item was one of only two High-complexity Number items in the NAEP Grade 4 data. The second High-complexity Number item had a DFBETA value above the cut-off criterion in three states. The relatively high correlations of about .15 between states' proportions of curriculum objectives and their item difficulty on the two High-complexity Number items render the AME larger than it would be absent these items—the estimated AME for the proportion of curriculum objectives is not entirely robust to exclusion of these two items, particularly the one flagged as most influential, from the data. However, although these items are outliers, there is no reason to believe that they should not be considered elements of the population item set. Ideally, more items in this topic-cognitive complexity category would be available to confirm that the AME estimate is not being positively biased by some item feature that is unrelated to the type of academic content targeted by these two items. Cook's D values flagged another NAEP item, a Low-complexity Measurement item, as influential on the slope of the regression across all nine states. This item had unusually low proportion-correct difficulty values, ranging between .07 and .15, for a Low-complexity item, explaining its outlier status.
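The influence screening described above can be approximated without specialized DFBETA routines by dropping all state observations on one item at a time and re-estimating the AME, which mirrors the sensitivity checks reported in this section. The sketch below is illustrative only and reuses the hypothetical column names from the earlier sketches.

```python
# Sketch: a leave-one-item-out check of how much each test item shifts the estimated
# AME of the proportion-of-objectives measure (hypothetical column names).
import pandas as pd
import statsmodels.api as sm

def ame_prop(data):
    X = sm.add_constant(data[["prop_objectives_x10", "item_complexity", "instr_emphasis"]])
    res = sm.GLM(data["p_correct"], X, family=sm.families.Binomial()).fit()
    mu = res.predict(X)
    return res.params["prop_objectives_x10"] * (mu * (1 - mu)).mean()

df = pd.read_csv("state_item_data.csv")
full_ame = ame_prop(df)
shifts = {item: full_ame - ame_prop(df[df["item_id"] != item])
          for item in df["item_id"].unique()}
print(pd.Series(shifts).abs().sort_values(ascending=False).head(5))   # most influential items
```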
In spite of the high Cook’s D values, dropping the nine state TABLE 5 Fractional Logit Regression Predicting State-Specific NAEP Grade 4 Classical Item Difficulty (N = 1458) Model 1 Model 2 Model 3 Coef Coef Coef (SE) AME (SE) AME (SE) AME Proportion of curriculum objectives on topic at item cognitive demand level (x10) 0.061* 0.014 0.057* 0.013 (0.025) (0.025) Proportion of curriculum objectives on topic at item cognitive demand level or higher (x10) 0.055* 0.013 (0.024) Item topic (ref = Number Properties and Operations) Measurement 0.107 0.025 0.332 0.077 0.332 0.077 (0.176) (0.179) (0.178) Geometry 0.520* 0.119 0.718*** 0.164 0.714*** 0.163 (0.205) (0.206) (0.205) Data Analysis, Statistics, and Probability 0.322 0.075 0.561** 0.129 0.551** 0.127 (0.212) (0.210) (0.209) Algebra 0.336 0.078 0.525* 0.121 0.517* 0.119 (0.222) (0.220) (0.219) Item complexity (NAEP categories) -0.852*** -0.199 -0.854*** -0.199 -0.853*** -0.199 (0.113) (0.113) (0.113) State principal component 1 score -0.051*** -0.012 -0.055*** -0.013 -0.054*** -0.013 (0.003) (0.003) (0.003) State principal component 2 score -0.047*** -0.011 -0.045*** -0.01 -0.044*** -0.010 (0.003) (0.003) (0.003) State principal component 3 score -0.001 0.000 -0.009* -0.002 -0.006 -0.001 (0.005) (0.005) (0.004) State principal component 4 score -0.057*** -0.013 -0.059*** -0.014 -0.058*** -0.013 (0.004) (0.004) (0.004) State mean instructional content emphasis on topic (scale 1–3) 0.366*** 0.085 0.363*** 0.085 (0.041) (0.041) BIC -10305 -10298 -10298 2 R 0.29 0.29 0.29 2 0.24, 0.33 R 95% CI 0.25, 0.33 0.25, 0.33 Notes. * p < 0.05, ** p < 0.01, *** p < 0.001; AME = average marginal effect; ref = reference group; BIC = Bayesian information criterion; CI = confidence interval 116 TABLE 6 Fractional Logit Regression Predicting State-Specific NAEP Grade 4 Classical Item Difficulty, by Content Topic Data Analysis, Number Properties Measurement Geometry Statistics, and and Operations Probability Coef Coef Coef Coef (SE) AME (SE) AME (SE) AME (SE) AME Proportion of curriculum objectives on topic at item cognitive demand level (x10) 0.083* 0.019 -0.068 -0.017 0.234 0.050 0.550** 0.129 (0.034) (0.119) (0.173) (0.209) Item complexity (NAEP categories) -0.895*** -0.208 -0.528*** -0.128 -1.220*** -0.259 -0.570*** -0.133 (0.063) (0.091) (0.122) (0.104) State mean instructional content emphasis on topic (scale 1–3) 1.967* 0.457 0.233 0.057 0.421 0.089 -0.913 -0.214 (0.915) (0.502) (0.643) (0.677) State principal component 1 score -0.055** -0.013 -0.041 -0.010 -0.043 -0.009 -0.048 -0.011 (0.018) (0.037) (0.034) (0.033) State principal component 2 score -0.053** -0.012 -0.059 -0.014 -0.048 -0.010 -0.029 -0.007 (0.019) (0.035) (0.035) (0.035) State principal component 3 score -0.046 -0.011 0.002 0.000 -0.017 -0.004 0.129 0.030 (0.033) (0.054) (0.061) (0.070) State principal component 4 -0.061 -0.014 -0.023 -0.006 -0.041 -0.009 -0.034 -0.008 score (0.038) (0.079) (0.076) (0.073) R2 0.34 0.12 0.46 0.26 2 R 95% CI 0.28, 0.41 0.05, 0.18 0.34, 0.56 0.13, 0.40 N 576 315 207 180 Algebra Coef (SE) 0.323 (0.340) 0.075 -0.851*** (0.119) -0.198 -0.240 (0.841) -0.056 -0.029 (0.058) -0.007 -0.038 (0.038) -0.009 0.052 (0.063) 0.012 0.004 (0.086) 0.29 0.19, 0.40 180 0.001 Notes. 
* p < 0.05, ** p < 0.01, *** p < 0.001; AME = average marginal effect; BIC = Bayesian information criterion; CI = confidence interval 117 AME TABLE 7 Fractional Logit Regression Predicting NAEP Grade 4 Classical Item Difficulty, by State (N = 162) State A State B State C Coef (SE) AME Coef (SE) AME Coef (SE) AME Proportion of curriculum objectives on topic at item cognitive demand level (x10) 0.189* 0.043 0.208* 0.048 0.179 0.042 (0.090) (0.098) (0.138) Item topic (ref = Number Properties and Operations) Measurement 0.613 0.138 0.748 0.168 0.464 0.106 (0.366) (0.384) (0.548) Geometry 0.939** 0.205 1.064** 0.234 0.868 0.198 (0.345) (0.395) (0.488) Data Analysis, Statistics, and Probability 0.886* 0.195 1.045* 0.231 0.661 0.151 (0.384) (0.457) (0.559) Algebra 0.826* 0.183 1.065* 0.235 0.758 0.173 (0.387) (0.479) (0.533) Item complexity (NAEP categories) -0.816*** -0.185 -0.805*** -0.186 -0.830*** -0.196 (0.126) (0.121) (0.120) 2 R 0.28 0.29 0.29 2 R 95% CI 0.16, 0.39 0.17, 0.40 0.18, 0.41 State D Coef (SE) AME Coef (SE) AME 0.210* (0.102) 0.049 0.310 (0.189) 0.071 0.796 (0.420) 1.190** (0.449) 0.177 0.757 (0.520) 1.060* (0.504) 0.169 0.259 1.161* (0.539) 1.100* (0.524) 0.253 -0.854*** (0.127) 0.28 0.16, 0.40 -0.198 Notes. * p < 0.05, ** p < 0.01, *** p < 0.001; AME = average marginal effect; ref = reference group; CI = confidence interval 118 State E 0.241 0.231 0.813 (0.442) 0.942 (0.545) 0.181 -0.864*** (0.128) 0.31 0.18, 0.44 -0.197 0.208 TABLE 7 (cont’d) State F Coef (SE) AME Proportion of curriculum objectives on topic at item cognitive demand level (x10) Item topic (ref = Number Properties and Operations) Measurement Geometry Data Analysis, Statistics, and Probability Algebra Item complexity (NAEP categories) R2 R2 95% CI State G State H State I Coef (SE) AME Coef (SE) AME Coef (SE) AME 0.220* (0.107) 0.052 0.417** (0.154) 0.097 0.501*** (0.135) 0.119 0.506** (0.175) 0.119 0.625 (0.346) 1.242** (0.480) 0.139 0.335 (0.211) 1.128** (0.350) 0.077 0.553* (0.216) 1.693*** (0.375) 0.115 0.565* (0.239) 1.138*** (0.342) 0.130 1.795*** (0.463) 1.791*** (0.479) 0.367 0.623* (0.259) 0.989** (0.370) 0.143 -0.746*** (0.124) 0.27 0.16, 0.39 -0.177 0.273 1.130* (0.553) 1.296* (0.580) 0.249 -0.792*** (0.123) 0.26 0.14, 0.37 -0.188 0.284 0.246 0.831** (0.298) 1.062** (0.397) 0.186 -0.805*** (0.129) 0.29 0.17, 0.40 -0.187 0.233 0.348 0.366 -0.708*** (0.128) 0.25 0.13, 0.37 Notes. * p < 0.05, ** p < 0.01, *** p < 0.001; AME = average marginal effect; ref = reference group; CI = confidence interval 119 0.250 0.221 -0.166 observations on this item from the NAEP data and re-estimating the cross-state models had no detectable consequence for the estimated AME of the proportion of curriculum objectives. Thus, this item was retained in the data, although its elimination perhaps could have been justified on the basis of faulty classification. The overall small positive AME of the proportion of curriculum objectives on item difficulty in Model 2 of Table 5 conceals some heterogeneity in the size of this effect within content topics. The analyses for subsets of the Grade 4 NAEP items by broad mathematics topic, reported in Table 6, suggest that the overall positive effect is driven partly by a large positive AME of the proportion of curriculum objectives on item difficulty for the Data Analysis, Statistics, and Probability items. 
The model for these items projects a .129 increase in a state's mean item proportion-correct for each .1 increase in the proportion of Data Display, Statistics and Probability-related curriculum objectives that target a given item's cognitive demand level. Although the coefficient standard errors in the within-state models cannot suffer from inflation due to intra-item correlation of item difficulty values, as those in the cross-state models do, I will focus on interpreting effect sizes for, rather than statistical significance of, the proportion of curriculum objectives in these models. Multicollinearity among the predictors in these models tended to be high—for some states' models, the maximum VIF value for the predictors exceeded 10—but standard errors for the regression coefficients were not so large as to bar drawing any conclusions from the results. The within-state item difficulty models reported in Table 7 suggest positive relationships between the proportion of curriculum objectives corresponding to a particular item's topic-cognitive demand combination and classical item difficulty. The estimated AME of a 10 percentage-point (i.e., .1) increase in the proportion of objectives on item difficulty ranges between about .04 and .12 across the nine states.
4.1.2 TIMSS Grade 4
As shown in Table 8, the results for Model 2, the preferred cross-state model for the Grade 4 TIMSS item difficulties, indicate that there is a statistically-significant, but very small, positive relationship (p = .012) between the proportion of curriculum objectives on a particular item's topic and at that item's cognitive demand level, and proportion-correct item difficulty in a particular state. All other variables held fixed, a 10 percentage-point increase in the proportion of curriculum objectives corresponding to a particular item's topic-cognitive demand combination would be expected to produce only a .037 increase in TIMSS item proportion-correct. Unlike the NAEP state mean instructional emphasis measure, there is no evidence that the TIMSS state mean percentage of instructional time measure is related to students' average test item performance, but the TIMSS result is based on only two of the nine states in the NAEP sample, and collinearity among predictors in the TIMSS models became particularly acute when state mean instructional time percentage was added, judging from predictors' VIF values, rendering comparison of the NAEP and TIMSS results for this variable difficult. It can further be observed that proportion-correct item difficulty tended to be higher in one of the TIMSS benchmarking states than in the other. Two TIMSS items had DFBETA values for the proportion of curriculum objectives and Cook's D values that were slightly above their respective cut-off criteria in both states. Both items were Data Display items. One was classed in the Reasoning cognitive domain and had high proportion-correct difficulty values of .93 and .94. The other was a Knowing item with moderate item difficulty values. When the two state observations on either of these items were dropped from the analysis, the AME for the proportion of curriculum objectives was reduced by .003, as compared to the value shown in Table 8 for the full model, Model 2, but the coefficient was still statistically significant at the .025 level. However, there was little reason to support dropping either of these items from the data.
The TIMSS test developers describe their item cognitive domain classification schemes as categorical, not as explicitly ordered—it would be TABLE 8 Fractional Logit Regression Predicting State-Specific TIMSS Grade 4 Classical Item Difficulty (N = 348) Model 1 Model 2 Model 3 Coef Coef Coef (SE) AME (SE) AME (SE) AME Proportion of curriculum objectives on topic at item cognitive demand level (x10) 0.179* 0.038 0.177* 0.037 (0.070) (0.071) Proportion of curriculum objectives on topic at item cognitive demand level or higher (x10) 0.165* 0.035 (0.066) Item topic (ref = Number) Data Display 1.465*** 0.264 1.241** 0.226 1.243** 0.226 (0.327) (0.433) (0.435) Geometric Shapes and Measures 0.468* 0.102 0.265 0.058 0.259 0.057 (0.211) (0.321) (0.318) Item cognitive domain (ref = Knowing) Applying -0.217 -0.046 -0.218 -0.046 -0.218 -0.046 (0.134) (0.134) (0.134) Reasoning -0.151 -0.031 -0.159 -0.033 -0.156 -0.032 (0.337) (0.340) (0.340) State A 0.233*** 0.049 0.215*** 0.045 0.234*** 0.049 (0.020) (0.025) (0.030) State mean percent instructional time on topic -0.006 -0.001 -0.006 -0.001 (0.005) (0.005) BIC -1952 -1946 -1946 2 R 0.26 0.26 0.26 R2 95% CI 0.17, 0.34 0.18, 0.34 0.18, 0.34 Notes. * p < 0.05, ** p < 0.01, *** p < 0.001; AME = average marginal effect; ref = reference group; BIC = Bayesian information criterion; CI = confidence interval 122 expected, for instance, that on some Knowing items few examinees would respond correctly, and that these items’ difficulty values would appear to be regression outliers unless the item sample was very large. High maximum VIF values for some predictors in the initial TIMSS Grade 4 within-topic models suggested multicollinearity was a serious problem for stability of the coefficients in repeated sampling. To reduce the number of related parameters that had to be estimated, I modified the item cognitive domain variables. On the basis of distributions of the item difficulty values by cognitive domain category, and increasing order of the mean difficulty values for Knowing, Applying, and Reasoning items, I constructed a single variable that treated cognitive domain categories as ordered and linearly related to item difficulty. Replacing the set of cognitive domain indicators in the model with the new cognitive domain variable, plots of residuals against values of this modified predictor indicated that its relationship with item difficulty could reasonably be modeled as linear. Although the separate analyses for Grade 4 TABLE 9 Fractional Logit Regression Predicting State-Specific TIMSS Grade 4 Classical Item Difficulty, by Content Topic Geometric Shapes and Data Display Measures Number Coef (SE) AME Coef (SE) AME Coef (SE) AME Proportion of curriculum objectives on topic at item cognitive demand level (x10) 0.774 0.129 -0.196 -0.043 0.181*** 0.039 (1.105) (0.123) (0.041) Item cognitive domain (TIMSS categories, ordered) 0.051 0.009 -0.542*** -0.119 -0.045 -0.010 (0.305) (0.163) (0.127) State A 0.322 0.054 0.039 0.009 0.262* 0.056 (0.264) (0.151) (0.105) 2 R 0.04 0.11 0.33 R2 95% CI 0.01, 0.23 0.02, 0.25 0.20, 0.45 N 52 114 182 Notes. 
* p < 0.05, ** p < 0.01, *** p < 0.001; AME = average marginal effect; BIC = Bayesian information criterion; CI = confidence interval 123 TIMSS items by broad mathematics topic, reported in Table 9, have small sample sizes, the model for the Number items fits the data well, judging from the R2 value and its confidence interval, and estimates a small .039 AME of the curriculum objectives proportion measure on item difficulty that is similar in magnitude to the AME estimated in the overall data. The within-state item difficulty models reported in Table 10 suggest small positive relationships between the proportion of curriculum objectives corresponding to a particular item’s topic-cognitive demand combination and classical item difficulty. The AME of a .1 increase in the proportion of objectives on item proportion-correct ranges is similar in magnitude in the two states, about .03–.04, and also similar to the estimate in the cross-state regression. TABLE 10 Fractional Logit Regression Predicting TIMSS Grade 4 Classical Item Difficulty, by State (N = 174) State A State B Coef Coef (SE) AME (SE) AME Proportion of curriculum objectives on topic at item cognitive demand level (x10) 0.161* 0.033 0.198* 0.043 (0.068) (0.077) Item topic (ref = Number) Data Display 1.393*** 0.24 1.535*** 0.288 (0.321) (0.351) Geometric Shapes and Measures 0.378 0.081 0.546** 0.122 (0.222) (0.210) Item cognitive domain (ref = Knowing) Applying -0.209 -0.042 -0.226 -0.049 (0.141) (0.131) Reasoning -0.165 -0.033 -0.139 -0.03 (0.339) (0.354) 2 R 0.23 0.27 R2 95% CI 0.13, 0.34 0.15, 0.38 Notes. * p < 0.05, ** p < 0.01, *** p < 0.001; AME = average marginal effect; ref = reference group; CI = confidence interval 124 4.2 Research Question 2: Can the Cognitive Demand Categories of the SEC Content Classification Matrix be Treated as Partially Ordered? The purpose of Research Question 2 is to check the appropriateness of the model underlying Question 1. If cognitive demand is best modeled as an ordinal property of test items, such that instruction requiring application of certain more-demanding cognitive processes related to a particular content topic also benefits students’ ability to perform other less-demanding types of cognitive tasks related to the same topic (e.g., Ebel, 1956), models of the relationship between curricular emphasis measures and achievement should account for the proportion of curricular content at or above a particular cognitive level. The follow-up analyses in this study rely on the assumption that the cognitive processes listed by the SEC higher-order cognitive demand categories overlap heavily with those described by particular NAEP mathematical complexity and TIMSS cognitive domain categories. Revising the proportion of curriculum objectives measure, I find no evidence that substituting the proportion of objectives on a given item’s topic at or above the item’s cognitive demand level for the proportion of objectives corresponding to each item’s topic-cognitive demand combination increases the effect size for the proportion of curriculum objectives, or leads to an improvement in model plausibility as indicated by BIC values. Comparing Models 2 and 3 in Table 5, there is little difference in the estimated AME for the proportion of curriculum objectives on NAEP Mathematics item difficulty, or in overall model fit, whether or not objectives targeting higher levels of cognitive demand related to an item’s topic are included in the proportion measure. 
Table 8 shows that the estimate of the AME of the proportion of curriculum objectives on TIMSS item difficulty is slightly lower when objectives targeting higher levels of cognitive demand related to an item's topic are included in the proportion measure in Model 3, than when they are excluded from the proportion measure in Model 2, but the BIC indices of overall model fit are identical when rounded to the ones place. These results do not support the conclusions that the SEC cognitive demand categories are semi-ordered and that instruction addressing higher-order cognitive skills benefits performance. However, it should be noted that the proportions of Grade 4 curriculum objectives coded as requiring higher-order reasoning were very small across states, so there was little difference between the curriculum objective proportion measure that was specific to a particular topic-cognitive demand cell, and the measure that counted objectives at the same demand level or higher within a given item's topic. In addition, the proportion of High-complexity items in the NAEP data, which were assumed to require higher-order reasoning, was quite low (see Table 1), so only a small fraction of state-item observations in the NAEP data reflected the change to the proportion of objectives measure.
4.3 Research Question 3: To What Extent are Item-Level Alignment Measures Related to Achievement?
Although I found limited support for the assertion that the proportion of objectives from SEC content analyses is a valid measure of intended curricular emphasis, and predicts achievement on mathematics test items—the main effects of the proportion of objectives measure on Grade 4 item difficulty, while positive and statistically significant, were quite small—in the following analyses I proceed to test the hypothesis that Grade 8 proportion-correct item difficulty will increase with cellwise test-curriculum alignment, at least when curricular emphasis of the material is high. Previous authors have proposed various methods for modeling the posited interaction between alignment and emphasis; in this study, I represent the interaction as a product of the main effects of alignment and emphasis. Overall, I find little evidence of a relationship between test-curriculum alignment and achievement that depends on curricular emphasis—alignment and curricular emphasis do not appear to interact as hypothesized at Grade 8, at least not after controlling for important covariates such as item complexity. Neither the average proportion of curriculum objectives measure for Grades 7 and 8 nor the cellwise alignment measure appears to be significantly related to test item performance at Grade 8 after controlling for prior topic-specific mean achievement, and other item and state characteristics. Collinearity among the alignment and proportion of objectives measures, their interaction, and the other predictors was considerable in all of the Grade 8 models. Potentially important predictors had to be eliminated from the intended NAEP within-state and within-topic models when near-perfect collinearity prevented their estimation, so those results should be interpreted with particular caution, but the cross-state NAEP and TIMSS models of main interest include the full complement of predictors.
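The Research Question 3 specification, with the alignment-by-emphasis interaction entered as a product term, and the collinearity screening described above can be sketched as follows. The file and column names (alignment, prop_objectives_x10, pretest_subscore, item_id) are hypothetical placeholders for the measures described in Chapter 3, not the study's actual variable names.

```python
# Sketch: interaction between cellwise alignment and curricular emphasis, plus a VIF check
# (hypothetical file and column names).
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

df = pd.read_csv("grade8_state_item_data.csv")
df["align_x_prop"] = df["alignment"] * df["prop_objectives_x10"]   # product term

X = sm.add_constant(df[["alignment", "prop_objectives_x10", "align_x_prop",
                        "item_complexity", "pretest_subscore"]])
res = sm.GLM(df["p_correct"], X, family=sm.families.Binomial()).fit(
    cov_type="cluster", cov_kwds={"groups": df["item_id"]})
print(res.summary())

vifs = pd.Series([variance_inflation_factor(X.values, i) for i in range(X.shape[1])],
                 index=X.columns)
print(vifs.round(1))   # values well above 10 flag severe multicollinearity
```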
4.3.1 NAEP Grade 8
Results from a fractional logit regression model for NAEP Grade 8 item difficulty that includes only alignment-related predictors, shown in Model 1 of Table 11, depict a positive interaction between the test-curriculum alignment measure and the curricular proportion of objectives measure—predicted item proportion-correct increases with the proportion of objectives when alignment is high, but slightly decreases with the proportion of objectives when alignment is low. Once other covariates are added to the model, however, as shown in the Model 2 results column, the interaction term is no longer statistically significant at the .025 (or even .05) level. Multicollinearity in Model 2 was considerable; the VIF for the NAEP Grade 4 2003 pretest score was greater than 10. Removing the interaction in Model 3, neither the main effect of alignment nor that of the proportion of objectives is significant. Although instructional content emphasis information was not collected from eighth-grade teachers, a pretest measure, state mean NAEP Mathematics 2003 Grade 4 scale scores for each content topic, was available for the Grade 8 item difficulty models. States' mean performance on particular content topics during a previous assessment of the 2007 Grade 8 cohort is positively related to proportion-correct item difficulty in this later assessment. As in the Grade 4 NAEP data, item mathematical complexity and low mean state socioeconomic status (principal component 1 scores) are negatively related to proportion-correct item difficulty. Conversely, mean item performance tends to increase with the homogeneity of states' Grade 8 student populations (principal component 2 scores), a relationship that is significant at the .05 level. One item had high DFBETA values in more than half of the states for both the alignment and proportion of curriculum objectives variables. The item was a High-complexity Algebra item, one of only three High-complexity items administered in Grade 8, all of which were in Algebra (see Table 1). When all ten state observations on this item were dropped from the sample, the estimated AMEs of the proportion of curriculum objectives and alignment measures, which were very small and positive in Model 3 of Table 11, both became slightly negative, but remained very small and non-significant at the .025 level. It seems unsurprising that one of the few High-complexity items, all of which covered the same broad content topic, would appear to be an outlier in the data and would be influential for the full-sample regression estimates, but it is not clear that these estimates should be viewed as biased by inclusion of the state observations on this item.
The availability of more High-complexity items would likely have produced more stable regression estimates and permitted more powerful statistical tests of the hypotheses represented by the alignment and proportion of objectives variables in the Table 11 models, but the secondary data analysis presented in this study is limited by the design of the NAEP 128 TABLE 11 Fractional Logit Regression Predicting State-Specific NAEP Grade 8 Classical Item Difficulty (N = 1670) Model 1 Model 2 Model 3 Coef (SE) Coef (SE) Coef (SE) AME Test-curriculum alignment -2.200* -0.377 0.040 0.009 (1.055) (0.513) (0.140) Proportion of curriculum objectives on topic at item cognitive demand level (x10) 0.289** 0.061 0.024 0.006 (0.089) (0.134) (0.085) Test-curriculum alignment × proportion of curriculum objectives 4.194*** 0.932 (1.095) (1.236) Topic (ref = Number Properties and Operations) Measurement -0.354 -0.392 -0.092 (0.261) (0.226) Geometry -0.507* -0.552** -0.131 (0.225) (0.190) Data Analysis, Statistics, -0.359 -0.404 -0.095 and Probability (0.297) (0.258) Algebra -0.286 -0.323 -0.076 (0.193) (0.170) Item complexity (NAEP categories) -0.756*** -0.763*** -0.18 (0.119) (0.120) Mean NAEP Mathematics 2003 Grade 4 subscore 0.011*** 0.010* 0.002 (0.003) (0.004) State principal component 1 score -0.059*** -0.063*** -0.015 (0.006) (0.011) State principal component 2 score 0.005 0.011* 0.003 (0.006) (0.005) State principal component 3 score -0.002 0.001 0.000 (0.004) (0.005) State principal component 4 score -0.038*** -0.044** -0.01 (0.008) (0.015) 2 R 0.06 0.28 0.28 2 R 95% CI 0.04, 0.08 0.24, 0.32 0.24, 0.32 Notes. * p < 0.05, ** p < 0.01, *** p < 0.001; AME = average marginal effect; ref = reference group; CI = confidence interval 129 assessment, which allows for few time-consuming extended reasoning items. Limiting the analytic sample to only items with a particular content topic, after controlling for other covariates, I found that interactions between the test-curriculum alignment measure and the proportion of objectives measure were not statistically significant for Algebra, Geometry, or Number Properties and Operations items, as presented in Table 12. (Prior to accounting for item mathematical complexity and cohort pretest subscores, there appeared to be a significant positive interaction between the test-curriculum alignment and proportion of objectives measures in predicting Number Properties and Operations item difficulty, and positive main effects of both the alignment and proportion of objectives measures on proportion-correct Algebra item difficulty. Evidently this set of alignment-related predictors had no explanatory value in models for Geometry item difficulty—the lower bound of the 95% confidence interval for the R2 of the initial model was nearly 0.) Multicollinearity in these models was fairly high, with maximum VIF values exceeding 10 in all three content areas. Removing the interaction term from each model revealed that neither alignment nor the proportion of objectives was a significant predictor of item difficulty within any of the content topics. For the Data Analysis and Measurement item subsets, which had the smallest sample sizes among the topics, models that included both the alignment and proportion of objectives measures encountered estimation problems due to severe multicollinearity; thus, results for these content topics could not be reported. Similarly, models for state subpopulations of the NAEP Grade 8 data had serious multicollinearity problems. 
It was not possible to estimate models that included both the alignment and proportion of objectives measures, and indicators for all of the content topic areas. 130 TABLE 12 Fractional Logit Regression Predicting State-Specific NAEP Grade 8 Classical Item Difficulty, by Content Topic Test-curriculum alignment Proportion of curriculum objectives on topic at item cognitive demand level (x10) Test-curriculum alignment × proportion of curriculum objectives Number Properties and Operations Coef Coef Coef (SE) (SE) (SE) AME 4.477* 1.522 0.663 0.153 (1.822) (1.890) (1.070) 0.652*** (0.182) 0.153 (0.229) 15.471** (5.437) 3.405 (6.214) Item complexity (NAEP categories) Mean NAEP Mathematics 2003 Grade 4 subscore R2 R2 95% CI N 0.04 0.01, 0.09 370 0.035 (0.071) 0.008 Coef (SE) -2.181 (3.460) Geometry Coef Coef (SE) (SE) 2.957 0.935 (3.418) (2.420) 0.333 (0.461) -0.459 (0.471) 12.615 (18.192) -14.771 (18.078) AME 0.224 -0.099 (0.161) -0.024 -0.726*** (0.070) -0.726*** (0.070) -0.168 -0.771*** (0.097) -0.771*** (0.097) -0.184 0.029** (0.009) 0.28 0.21, 0.37 370 0.032*** (0.007) 0.28 0.20, 0.36 370 0.008 0.034*** (0.009) 0.24 0.16, 0.33 310 0.031*** (0.008) 0.24 0.16, 0.33 310 0.007 0.00 0.00, 0.03 310 Notes. * p < 0.05, ** p < 0.01, *** p < 0.001; AME = average marginal effect; CI = confidence interval 131 TABLE 12 (cont’d) Test-curriculum alignment Proportion of curriculum objectives on topic at item cognitive demand level (x10) Test-curriculum alignment × proportion of curriculum objectives Coef (SE) 11.296*** (2.715) AME 0.429 0.038 0.078 0.162 (0.223) (0.235) (0.221) 0.145 0.944 (0.948) (0.904) Mean NAEP Mathematics 2003 Grade 4 subscore R R2 95% CI N Coef (SE) 1.818 (2.410) 1.057*** Item complexity (NAEP categories) 2 Algebra Coef (SE) 0.301 (2.799) 0.11 0.05, 0.18 450 -0.717*** -0.718*** (0.075) (0.075) 0.031*** 0.029*** (0.008) 0.28 0.21, 0.38 450 (0.008) 0.28 0.21, 0.37 450 -0.170 0.007 Notes. * p < 0.05, ** p < 0.01, *** p < 0.001; AME = average marginal effect; CI = confidence interval 132 TABLE 13 Fractional Logit Regression Predicting NAEP Grade 8 Classical Item Difficulty, by State (N = 167) State J State K State L State M Coef Coef Coef Coef AME AME AME AME (SE) (SE) (SE) (SE) Test-curriculum alignment 0.973 0.226 -8.111 -1.913 -10.624 -2.272 -1.216 -0.289 (2.885) (7.610) (6.965) (2.84) Proportion of curriculum objectives on topic at item cognitive demand level (x10) 0.182 0.042 0.202 0.048 0.054 0.008 0.214** 0.051 (0.098) (0.106) (0.281) (0.069) Test-curriculum alignment × proportion of curriculum objectives 35.952* (16.348) Item complexity (NAEP categories) -0.751*** -0.174 -0.735*** -0.173 -0.760*** -0.177 -0.763*** -0.181 (0.121) (0.125) (0.130) (0.129) Mean NAEP Mathematics 2003 Grade 4 subscore 0.009 0.002 0.079 0.019 0.038 0.009 0.002 0.001 (0.055) (0.059) (0.044) (0.025) 2 R 0.24 0.25 0.25 0.26 R2 95% CI 0.14, 0.37 0.14, 0.37 0.14, 0.37 0.15, 0.38 Notes. 
* p < 0.05, ** p < 0.01, *** p < 0.001; AME = average marginal effect; CI = confidence interval 133 State N Coef AME (SE) -2.284 -0.542 (2.392) 0.275 (0.145) 0.065 -0.751*** (0.130) -0.178 -0.018 (0.023) 0.24 0.13, 0.37 -0.004 TABLE 13 (cont’d) Test-curriculum alignment Proportion of curriculum objectives on topic at item cognitive demand level (x10) State O Coef AME (SE) 0.100 0.024 (2.078) State P Coef AME (SE) -1.416 -0.336 (2.805) State Q Coef AME (SE) -4.530* -1.076 (2.216) State R Coef AME (SE) -2.082 -0.499 (2.127) State S Coef AME (SE) -9.334 -2.234 (8.872) 0.153 (0.136) 0.037 0.221 (0.115) 0.053 -0.072 (0.267) -0.017 0.158 (0.120) 0.038 0.238 (0.139) 0.057 -0.696*** (0.125) -0.167 -0.744*** (0.123) -0.177 -0.804*** (0.128) -0.191 -0.702*** (0.120) -0.168 -0.656*** (0.124) -0.157 -0.030 (0.028) 0.22 0.11, 0.34 -0.007 0.028 (0.022) 0.26 0.15, 0.38 0.007 -0.018 (0.024) 0.26 0.13, 0.37 -0.004 -0.029 (0.018) 0.24 0.12, 0.36 -0.007 0.045 (0.043) 0.21 0.10, 0.33 0.011 Test-curriculum alignment × proportion of curriculum objectives Item complexity (NAEP categories) Mean NAEP Mathematics 2003 Grade 4 subscore R2 R2 95% CI Notes. * p < 0.05, ** p < 0.01, *** p < 0.001; AME = average marginal effect; CI = confidence interval 134 The set of content topic indicators was removed from the model, making omitted variables bias in the regression coefficients shown in Table 13 more likely than if the content topic indicators could have been included. One state, State L, had a statistically significant (at the .05 level) positive interaction between the alignment and proportion of objectives measures after controlling for item mathematical complexity and NAEP Grade 4 pretest scores (p = .028). However, because the actual probability of Type I error for this test was probably considerably greater than .05 due to the number of within-state tests conducted, stability of this result in repeated sampling from that state’s population, if such sampling were possible, seems doubtful. 4.3.2 TIMSS Grade 8 As in the Grade 8 NAEP data, regressing only alignment-related predictors on TIMSS Grade 8 item difficulty, as shown in Model 1 of Table 14, there appears to be a positive interaction between the test-curriculum alignment measure and the curricular proportion of objectives measure—predicted item proportion-correct increases with the proportion of objectives when alignment is high, but slightly decreases with the proportion of objectives when alignment is low. Once other covariates are added to the model, however, as shown in the Model 2 results column, the interaction term is no longer statistically significant at the .025 level. Removing the interaction in Model 3, neither the main effect of alignment or of the proportion of objectives is significant at the .0125 level. To reduce the degree of multicollinearity among predictors in the TIMSS Grade 8 models, which was excessive when separate indicators for each cognitive domain (except one reference domain category) were used as predictors, I constructed a single variable that treated cognitive domain categories as ordered and linearly related to item difficulty. As in the Grade 4 135 TIMSS data, mean item difficulty values for the cognitive domain categories increased in the order: Knowing, Applying and Reasoning. 
A plot of the distribution of the item difficulty values TABLE 14 Fractional Logit Regression Predicting State-Specific TIMSS Grade 8 Classical Item Difficulty (N = 422) Model 1 Model 2 Model 3 Coef (SE) Coef (SE) Coef (SE) AME Test-curriculum alignment Proportion of curriculum objectives on topic at item cognitive demand level (x10) Test-curriculum alignment × proportion of curriculum objectives -0.821 (0.882) 1.440 (1.026) 1.359 (1.075) 0.312 0.046 (0.052) -0.039 (0.090) -0.030 (0.082) -0.007 5.428*** (1.359) 0.592 (1.306) -0.418* (0.164) 0.306 (0.283) 0.251 (0.246) -0.229 (0.146) -0.412* (0.162) 0.312 (0.281) 0.271 (0.230) -0.227 (0.145) -0.097 -0.518*** (0.100) 0.170*** (0.022) 0.28 0.20, 0.34 -0.521*** (0.100) 0.167*** (0.019) 0.28 0.20, 0.34 -0.12 Item topic (ref = Number) Algebra Chance Data Geometry Item cognitive domain (TIMSS categories, ordered) State J R2 R2 95% CI 0.05 0.02, 0.10 0.068 0.059 -0.053 0.038 Notes. * p < 0.05, ** p < 0.01, *** p < 0.001; AME = average marginal effect; ref = reference group; CI = confidence interval by cognitive domain category, and plots of residuals against values of this predictor, suggested that its relationship with item difficulty could reasonably be modeled as linear. Unlike in the Grade 4 TIMSS results, item cognitive domain was a significant predictor of item difficulty in 136 many of the Grade 8 models in which it was entered—controlling for other variables, in Model 3 we observe that proportion-correct item difficulty is expected to decrease by .12 (i.e., items becomes more difficult to answer correctly) as cognitive domain is increased to the next ordered category. Limiting the analytic sample to only items with a particular content topic, after controlling for other covariates, I found that interactions between the test-curriculum alignment measure and the proportion of objectives measure were not statistically significant for any of the content topics, so I removed the interaction term for all of the models. Results for these more restricted models, which appear in Table 15, indicate that neither alignment nor the proportion of objectives is an important predictor of item difficulty within any of the content topics, but the ordered cognitive domain variable again predicts substantial decreases in proportion-correct item difficulty as cognitive domain is increased to the next-most-challenging category in three of the four content topics. It should be noted that the model of interest could not be estimated for the 10 items that covered the “Chance” subtopic in the two states, so these items are not represented in the Table 15 results, although they were included in the overall analyses presented in Table 14. The results in Table 16 show that, controlling for potential confounding state and item characteristics, the interaction between the test-curriculum alignment measure and the proportion of objectives measure was not statistically significant in either state in the TIMSS Grade 8 data, although the coefficient for the interaction was larger than its standard error in one of the states. Maximum VIF values were greater than 10 in the models that included the interaction and main effects of alignment and the proportion of objectives, and all the covariates. 
Eliminating the interaction variable from each model, as shown in the final two columns for each state, there was 137 TABLE 15 Fractional Logit Regression Predicting State-Specific TIMSS Grade 8 Classical Item Difficulty, by Topic Algebra Data Geometry Number Test-curriculum alignment Proportion of curriculum objectives on topic at item cognitive demand level (x10) Item cognitive domain (TIMSS categories, ordered) State J R2 R2 95% CI N Coef (SE) -3.670 (3.398) AME -0.89 Coef (SE) -9.148 (10.076) AME -1.895 Coef (SE) -0.344 (3.897) AME -0.083 Coef (SE) 4.961 (18.333) AME 1.094 0.136 (0.149) 0.033 -0.282 (0.751) -0.058 -0.085 (0.145) -0.021 -0.081 (0.131) -0.018 -0.128 (0.174) 0.402 (0.265) 0.15 0.06, 0.26 128 -0.031 -0.497** (0.183) 0.389 (0.333) 0.41 0.21, 0.60 60 -0.103 -0.551** (0.199) 0.175 (0.155) 0.17 0.05, 0.32 94 -0.134 -0.736*** (0.154) 0.273 (0.870) 0.26 0.15, 0.36 120 -0.162 0.098 0.081 0.043 Notes. * p < 0.05, ** p < 0.01, *** p < 0.001; AME = average marginal effect; CI = confidence interval 138 0.06 TABLE 16 Fractional Logit Regression Predicting TIMSS Grade 8 Classical Item Difficulty, by State (N = 211) State J State K Coef (SE) Coef (SE) Coef (SE) AME Coef (SE) Coef (SE) Coef (SE) AME Test-curriculum alignment Proportion of curriculum objectives on topic at item cognitive demand level (x10) Test-curriculum alignment × proportion of curriculum objectives -6.521** (2.400) -3.191 (4.668) -1.668 (3.754) -0.379 3.336 (1.903) 4.802 (4.185) 1.743 (3.254) 0.404 0.054 (0.059) -0.064 (0.111) -0.077 (0.108) -0.018 -0.114* (0.057) -0.195 (0.155) -0.025 (0.097) -0.006 1.095 (2.459) -2.174 (3.378) 13.091*** (2.270) 9.243 (7.191) Item cognitive domain (TIMSS categories, ordered) Item topic (ref = Number) Algebra Chance Data Geometry R2 R2 95% CI 0.05 0.01, 0.12 -0.530*** (0.111) -0.521*** (0.110) -0.118 -0.495*** (0.127) -0.559*** (0.116) -0.129 -0.112 (0.315) 0.354 (0.386) 0.362 (0.327) 0.014 (0.305) 0.26 0.17, 0.37 -0.227 (0.244) 0.243 (0.328) 0.269 (0.272) -0.119 (0.198) 0.26 0.17, 0.37 -0.053 -0.209 (0.380) 0.607 (0.366) 0.148 (0.372) 0.178 (0.380) 0.29 0.19, 0.38 -0.424 (0.325) 0.365 (0.296) 0.323 (0.359) -0.213 (0.197) 0.28 0.18, 0.38 -0.101 0.053 0.059 -0.027 0.15 0.07, 0.24 Notes. * p < 0.05, ** p < 0.01, *** p < 0.001; AME = average marginal effect; ref = reference group; CI = confidence interval 139 0.081 0.072 -0.05 no statistically significant relationship between the alignment or proportion of objectives measures and item difficulty. Although the estimated AMEs for alignment were quite large, they were in opposite directions in the two states, providing little evidence of a consistent interpretable association between alignment and item difficulty. 4.4 Robustness Check Although normality of the population error distribution is not an assumption of generalized linear models (Gill, 2001), Breslow (1996) promoted Box-Cox transformation selection as a tool to determine if the link function for a generalized linear model has been misspecified. As noted previously, the exponent of the optimal power transformation to normality for the Grade 8 NAEP item difficulty values, which followed a nearly symmetric but heavy-tailed distribution, determined by the Box-Cox equation was about 1, so the preferred link function identified by Breslow’s technique was the identity link. 
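The Box-Cox check mentioned above can be carried out directly on the observed difficulty values. The sketch below is illustrative only; the file and column names are hypothetical, and Box-Cox requires strictly positive inputs, so near-zero proportions are floored.

```python
# Sketch: Breslow-style link check via the Box-Cox exponent of the outcome
# (hypothetical file and column names).
import pandas as pd
from scipy.stats import boxcox

df = pd.read_csv("grade8_naep_state_item_data.csv")
p = df["p_correct"].clip(lower=1e-3)          # guard against zero values
_, lam = boxcox(p)
print(f"estimated Box-Cox exponent: {lam:.2f}")   # a value near 1 points to an identity link
```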
To probe the robustness of the results reported above to potential misspecification of the link in the model for NAEP Grade 8 item difficulty, I used a linear probability model with cluster-robust standard errors to account for high intra-item correlations of the difficulty outcome variable. Estimating the results of models equivalent to those shown in Table 11 using ordinary least squares, I found that the results reported previously for the alignment and proportion of objectives measures were robust to use of a different link function and estimator: when only the alignment and proportion of objectives variables and their interaction were entered in the model, there appeared to be a positive interaction between alignment and the proportion of objectives, but this apparent effect was reduced to non-significance once other predictors were added to the model. Removing the interaction term from the model, the main effects of the alignment and proportion of objectives measures were found not to be statistically significant. Histograms and normal probability plots 140 of the residuals suggested that the normality assumption was reasonably well-satisfied for all three of these models. The magnitudes of the regression coefficients from these linear models and the corresponding AMEs from the generalized linear models reported in Table 11, which estimate the same quantity in the same metric, were identical to the second decimal place, providing further evidence that the Grade 8 NAEP results are robust to possible misspecification or less-than-ideal use of a logit link for the outcome variable. 141 CHAPTER 5: CONCLUSIONS AND DISCUSSION In this final chapter, I draw conclusions regarding each research question, judging the extent to which the results support or fail to support the validity of the SEC alignment index as a measure of test-curriculum correspondence, and compare these conclusions to the findings of previous studies. Evaluating the plausibility of assumptions that were made by the statistical models, in retrospect, I discuss the likely accuracy and stability of the study results as estimates of alignment index component effects in these subpopulations of test items and state curriculum documents, which Cook and Campbell (1979) referred to as “statistical conclusion validity.” This discussion establishes the degree to which the model coefficients and standard errors may be considered good estimates for the parameters in the observed population of test items and state curricular content analyses. I then consider the similarity between the theoretical model of the relationships between curricular content emphasis or alignment, instructional emphasis and student achievement, and the statistical model, considering evidence of any departures from the assumptions about the educational system described in Chapter 3. I will discuss the generalizability of the results to other alignment index variants, and other test-state curriculum pairs. The discussions of replicability and generalizability of the results highlight limitations of the study design that qualify the answers to my research questions. Considering the accumulated evidence, I judge the claim that SEC alignment measures can predict student achievement score gains as still in need of further support. However, this study provided some weak external validation evidence for the curriculum emphasis data: proportions of objectives or “grade-level expectations,” used to compute the SEC test-curriculum alignment index. 
Although the results of this study have limited implications for the selection or use of alignment methods by operational testing programs, evaluated together with results from 142 previous studies, they suggest the need for two specific lines of future research to bolster claims about the validity of alignment indices as measures of test-curriculum correspondence, and about the validity of conclusions from test-curriculum alignment reviews more generally. 5.1 Research Question 1 The results of this study offer weak support for the validation claim that the proportions of equally-weighted curriculum objectives from SEC content analyses, which underlie the coarse-grained SEC test-curriculum alignment index (Porter, 2002), can be interpreted as measures of intended state curricular emphasis. The detected statistically-significant relationships between a proportion of objectives measure derived from SEC content analysis data, and measures of the implemented and attained Grade 4 mathematics curriculum from nine states were positive, as would be anticipated if proportions of objectives are an accurate measure of curricular emphasis. Models indicate that proportion-correct item difficulty in Grade 4 is expected to increase as the proportion of state curriculum objectives targeting the given item’s topic and cognitive demand type in Grades 3 and 4 increases. This conclusion differs from that of a similar study by Mehrens and Phillips (1987). Using textbook emphasis proportions to predict classical item difficulty, they found that textbook emphasis proportion differences, computed after matching Grade 5 and 6 textbook content blocks to a 180-cell classification matrix, had no visible relationship to mathematics achievement test item difficulty (p) differences among sixth-graders. However, the effect of a 10-percentage-point increase in the proportion of curriculum objectives corresponding to an item’s topic-cognitive demand combination, a sizable shift in the curriculum content, is quite small—between .01 and .03 on average—although the size of the estimated effect varies by content topic. As will be discussed further below, the magnitude of the apparent relationship between the proportion of objectives 143 and item difficulty may have been attenuated by instruction that deviated from the state curriculum, or inflated by failure to account for important state characteristics that affect both curricular emphasis and test item performance. The proportions of curriculum objectives classified into particular broad content topic domains were also strongly correlated with mean teacher-reported instructional emphasis of those topics in the nine states with available content analyses of established Grade 4 mathematics curriculum documents, even after controlling for state differences in average socioeconomic status and student diversity that might be confounded with distributions of objectives in the curriculum documents. Since all the reported analyses combined proportions of mathematics objectives within a small number of fairly general topics, conclusions about the validity of the fine-grained SEC test-curriculum alignment index do not follow directly from the results of this study. 
Because no comparison to other alignment indices or to other formulations of the SEC index was made, I cannot judge whether this index is the best existing measure of intended state curricular emphasis, only that the content analysis proportions that summarize panelists’ judgments about objectives’ topics and cognitive demand requirements seem to relate in expected ways to other measures of state curricular emphasis in Grade 4.

5.2 Research Question 2

While the preponderance of evidence from previous research on mathematics learning indicates that instruction on higher-order cognitive skills should benefit performance on topical test items that require lower levels of cognitive processing for a correct response (e.g., Lobato & Siebert, 2002), I found no evidence supporting this hypothesis when I modified the proportion of curriculum objectives measure used in this study. Accounting for the proportion of objectives at a particular cognitive demand level or higher neither strengthened, nor appreciably weakened, apparent relationships between the curricular proportion measure and item difficulty. Absent replication with state curricula that show greater variation in coverage of high-complexity objectives, I interpret this result as suggesting (a) insufficient variation in curricular emphasis on highly-demanding objectives, (b) problems with my operationalization of the proportion of objectives measure, and/or (c) violations of my assumptions about the educational system.

Although the modified proportion of objectives measure, in principle, could have increased the coverage proportions corresponding to all low- or moderate-demand items (most items in both assessments), because the Grade 3 and 4 state curricula in the sample generally included no, or very few, high-cognitive-demand curriculum objectives, the revised proportion of objectives measure was quite similar to the original measure. High correlation between the modified and original measures likely contributed to the limited change in the estimated relationship between the proportion of objectives and item difficulty when the modified measure was substituted for the original measure in the regression models. Also, my empirical model for the relationship between curricular coverage and test item performance assumed that the behaviors defined by the highest NAEP mathematical complexity and TIMSS cognitive domain categories (or at least those behaviors that were assessed) would fall into the two highest-demand SEC cognitive demand categories. To the extent that this assumption was violated, differences in model fit when using the original and modified proportion of objectives measures may have been attributable to systematic measurement error. Finally, if instruction in these nine states focused on the portion of the curriculum assessed by state achievement tests, as reported by teachers in several states (Stecher et al., 2008), and the state achievement tests seldom assessed the most demanding objectives (e.g., Webb, 1999), no relationship between proportions of high-demand objectives and test item difficulty would be expected; my results might reflect poor correspondence between instruction and the more-demanding segments of the curriculum.
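The original and modified proportion of objectives measures discussed above can be sketched as follows. The example assumes a tidy table of panelists' codings in which each curriculum objective carries equal weight and has been assigned to a broad topic and an ordered cognitive demand level; the column names, data values, and function are hypothetical, and the sketch ignores the averaging over multiple panelists and grades that the actual SEC procedure involves.

```python
import pandas as pd

# Hypothetical objective-level codings: one row per curriculum objective,
# already collapsed to coarse-grained topics and ordered demand levels (1 = lowest).
codings = pd.DataFrame({
    "topic":  ["Number", "Number", "Algebra", "Algebra", "Geometry", "Number"],
    "demand": [1, 2, 2, 4, 3, 5],
})

n_objectives = len(codings)

# Original measure: proportion of equally weighted objectives in each
# topic-by-demand cell of the coarse-grained matrix.
cell_prop = (codings.groupby(["topic", "demand"]).size() / n_objectives).rename("prop")

# Modified measure: proportion of objectives on the topic at the item's
# demand level OR HIGHER (cumulative from the top of the demand ordering).
def prop_at_or_above(topic: str, demand: int) -> float:
    mask = (codings["topic"] == topic) & (codings["demand"] >= demand)
    return mask.sum() / n_objectives

print(cell_prop)
print(prop_at_or_above("Number", 2))  # e.g., Number objectives at demand level 2 or higher
```

Because the sampled curricula contained few high-demand objectives, the cumulative version would differ from the simple cell proportion only slightly for most cells, which is consistent with the high correlation between the two measures noted above.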
While the results of this study fail to support the hypothesis that the SEC cognitive demand categories are partially ordered in the complexity of cognitive processing required to execute the listed example behaviors in each category, I hesitate to draw any conclusion about the ordinal nature of the categories, and suggest that further empirical and theoretical evaluation is needed to address this question.

5.3 Research Question 3

Despite high collinearity among the predictors in models for Grade 8 item difficulty, my results hinted at positive interactions between cellwise alignment and the proportion of curriculum objectives in Grades 7 and 8. In initial models for NAEP item difficulty by content topic that included only a cellwise alignment measure, a proportion of objectives measure, and their interaction as predictors, the interaction appeared to be an important predictor of Algebra item difficulty, suggesting that proportion-correct item difficulty increased with alignment when curricular emphasis (the proportion of objectives) was high. The alignment and proportion of objectives measures also appeared to have positive main effects on Numbers and Operations proportion-correct item difficulty. However, these measures were entirely unrelated to Geometry item difficulty even in this restricted initial model, and once the pretest scores were included in the item difficulty models, there was no statistically significant interaction between, or unique main effects of, the alignment and proportion of objectives measures in any topic.

Results from the TIMSS item difficulty models suggested a potentially important but not statistically significant interaction between alignment and the proportion of curriculum objectives in one of the two states, but a negative association between alignment and proportion-correct item difficulty in an initial restricted model in the other state. It is possible that the amount of measurement error in the SEC curricular content analysis data, or the fidelity of instruction to the written curriculum, varied considerably in these two states. Because the models of NAEP item difficulty for these two states, which might otherwise be viewed as replications of the TIMSS models, could not incorporate the set of item topic indicators as predictors due to heavy collinearity, they differ meaningfully from the TIMSS models and cannot clarify the reason for the apparent differences in the TIMSS results by state.

Although the relationship between alignment and test item performance at Grade 8 may not be uniform across content topics, the most definitive conclusion of these analyses is that, in the presence of other covariates, cellwise test-curriculum alignment measures are not significantly related to mathematics test item performance in Grade 8 on average, even as curricular emphasis on an item’s topic and cognitive demand in Grades 7 and 8 increases. This conclusion is consistent with the findings of Mehrens and Phillips (1986) that test-curriculum correspondence ratings were not a significant predictor of school mean mathematics test subscores for third- or sixth-graders, and that differences in classical item difficulty values for schools with high or low test-curriculum correspondence (dichotomizing a 1–5 rating scale) were negligible (Phillips & Mehrens, 1988).
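As a schematic of the by-topic comparisons summarized above, the sketch below fits a restricted fractional logit model (alignment, proportion of objectives, and their interaction only) and a fuller model that adds a state pretest score and other controls, separately by content topic, and inspects the interaction term. The variable and column names (difficulty, alignment, prop_obj, pretest, poverty, item_id, topic) and the file name are hypothetical placeholders rather than the study's actual data or specifications.

```python
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

df = pd.read_csv("grade8_item_state_data.csv")  # placeholder file name

restricted = "difficulty ~ alignment * prop_obj"
full = "difficulty ~ alignment * prop_obj + pretest + poverty"

for topic, sub in df.groupby("topic"):
    kw = {"groups": sub["item_id"]}
    for label, formula in [("restricted", restricted), ("full", full)]:
        fit = smf.glm(formula, data=sub, family=sm.families.Binomial()).fit(
            cov_type="cluster", cov_kwds=kw)
        # Interaction term name follows the formula interface's product convention.
        print(topic, label,
              round(fit.params["alignment:prop_obj"], 3),
              round(fit.pvalues["alignment:prop_obj"], 3))
```

A pattern like the one reported above, where the apparent Algebra interaction in the restricted specification does not survive the addition of the pretest and other covariates, would show up here as an interaction p-value that is small only in the restricted fit.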
This conclusion is also consonant, to some degree, with the more recent findings of Polikoff and Porter (2012), who tested interactions between an SEC-type instruction-curriculum alignment measure and pedagogical quality measures as predictors of teachers’ mean residualized student mathematics achievement scores; while most of the coefficients for interactions predicting mean residualized scores from different mathematics achievement tests were in the positive direction, suggesting teachers’ scores increased with alignment when pedagogical quality was relatively high, only 1 out of 12 was significantly different from zero at the .05 level.

The study by Gamoran et al. (1997), which concluded that about one-quarter of the variability in classroom mean mathematics score gains could be explained by a product term computed from instruction-test alignment and reported instructional time measures, is often cited as providing important validation evidence for SEC alignment indices. The alignment measure used in that study was obtained by matching test content to a precursor of the current SEC content analysis matrix. Due to collinearity problems, the research model did not include main effects of instruction-test alignment or instructional emphasis together with the product term that otherwise would have been interpretable as an interaction. Using more typical linear regression models that include main effects of instruction-curriculum or test-curriculum alignment and content emphasis, as well as their product, as predictors of classroom or state mean achievement, Polikoff and Porter’s recent (2012) study and this study have failed to convincingly reproduce Gamoran et al.’s findings. Operationalizing instruction-test alignment and instructional emphasis measures using ratings from two different researcher-developed instruments, however, D’Agostino et al. (2007) found that the interaction between alignment and emphasis was a positive, significant predictor of fifth-graders’ math scores in a multilevel model, lending some support to Gamoran et al.’s conclusion that instruction-test alignment influences test score gains, at least when instructional emphasis on the tested material is high.

A simple interpretation of the mixed results could be that instruction-test alignment matters for test performance gains, but that the additional assumptions about the educational system required to also link instruction-curriculum or test-curriculum alignment to achievement outcomes (e.g., the outcome measure corresponds to the curriculum, instruction follows the curriculum) do not hold in practice. Other explanations for the difficulty replicating Gamoran et al.’s findings using SEC-type alignment measures may be that the relatively high-poverty schools sampled by the study were not sufficiently representative of the general school population (Schmidt & Maier, 2009), or that the instruction-curriculum alignment variable used captured differences in pedagogical quality or mean student affluence that were not represented elsewhere in the math achievement model. A final possible explanation for the inconsistent conclusions about the empirical relationship between SEC alignment indices and achievement is that the Gamoran et al. research model tested the hypothesis that mean mathematics achievement score gains increased as content coverage (either alignment or instructional time, or both) increased.
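The distinction drawn above between a product-term-only specification and a conventional interaction model can be written out explicitly; this is a generic illustration of the two specifications rather than a reproduction of Gamoran et al.'s (1997) estimating equation.

\[
\text{(product only)}\qquad \bar{y}_c = \beta_0 + \beta_3\,(A_c \times E_c) + \varepsilon_c
\]
\[
\text{(full interaction)}\qquad \bar{y}_c = \beta_0 + \beta_1 A_c + \beta_2 E_c + \beta_3\,(A_c \times E_c) + \varepsilon_c
\]

Here $\bar{y}_c$ is a classroom or state mean outcome, $A_c$ an alignment measure, and $E_c$ a content emphasis or instructional time measure. Only in the second specification does $\beta_3$ isolate how the effect of alignment changes with emphasis; when the main effects are omitted, the coefficient on the product also absorbs whatever direct association each component has with the outcome, which is why the positive product term in the earlier study cannot be read as evidence of an interaction per se.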
Although that study’s results have been interpreted as demonstrating the effect of curricular alignment on achievement test scores, the reported positive effect of the alignment-by-emphasis product term on score gains may have been driven primarily by instructional time spent on tested topics (e.g., Coates, 2003), rather than by instruction-test alignment. Considering the accumulated research, overall, I judge the claim that SEC alignment measures can predict student achievement score gains as still in need of further support.

5.4 Accuracy of the Results in the Mathematics Achievement Test Item-State Population

The previous discussion presenting the conclusions of this study and comparing them to others’ findings implicated model design differences that may have contributed to discrepancies in our conclusions about the validity of SEC-type alignment indices. Limitations of this study arising from failure of my assumptions for statistical estimation and hypothesis testing, or from mismatch between my theoretical and statistical models of the data population, also could have caused my results to diverge from those of previous studies. I will comment first on the most doubtful assumptions underlying my statistical conclusions. To the extent that the assumptions of the fractional logit models used in this study are not satisfied in the test item-state population, the computed regression coefficients or their standard errors could under- or overestimate the true population parameters. Lack of independence of item-state observations in the cross-state models was addressed through use of a sandwich standard error estimator, but, as in many studies based on observational data, the possibility of important omitted predictors or systematic measurement error in the predictors remains a threat to the accuracy of my statistical conclusions.

Interpreting the regression coefficients from the item difficulty models as unbiased estimates of population quantities requires the assumption that no relevant variables have been omitted from the models. Failure to control for state or item characteristics that co-vary with states’ curricular emphasis patterns and cause differences in state item difficulty could result in biased estimates of the effects of alignment, the proportion of curriculum objectives, or their interaction, on item difficulty. In this study, the omitted variable of greatest concern may be average state mathematics achievement in each content topic at the beginning of Grade 3, before students were exposed to instruction on the Grade 3 and 4 math curricula. The models for Grade 4 item difficulty that were used to address Research Questions 1 and 2 did not account for possible existing state differences in average, topic-specific mathematics ability at the start of Grade 3 because large-scale achievement testing of students prior to Grade 3 seldom occurs, and so no suitable measure of initial state mean mathematics topic achievement was available. Because state-level mathematics curriculum development before 2006 was not always methodical, it is not immediately clear that there would be a relationship between students’ mean subtopic performance early in Grade 3 and the distribution of objectives in the Grade 3 and 4 curriculum documents. If there were such a relationship, it is impossible to predict what its direction might be, and so difficult to anticipate the direction of any bias in the alignment-related regression coefficients that might result.
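The reasoning about the direction of any omitted variable bias can be made explicit with the standard linear-model approximation for a single omitted regressor; this is a generic textbook expression, not a quantity estimated in this study.

\[
\operatorname{plim}\,\hat{\beta}_{\text{align}} \;=\; \beta_{\text{align}} + \beta_{\text{omit}}\,\delta,
\qquad
\delta = \frac{\operatorname{Cov}\!\left(x_{\text{omit}},\, x_{\text{align}} \mid \text{other controls}\right)}{\operatorname{Var}\!\left(x_{\text{align}} \mid \text{other controls}\right)}
\]

The sign of the bias is the product of the sign of the omitted variable's effect on item difficulty ($\beta_{\text{omit}}$) and the sign of its partial association with the alignment-related predictor ($\delta$). Because neither sign can be anticipated for prior topic-specific achievement here, the direction of the bias is indeterminate, as stated above.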
The models for Grade 8 NAEP item difficulty that were used to address Research Question 3 did include a topic-specific mathematics pretest score for each state, although it was measured at the end of Grade 4, rather than at the beginning of Grade 7 instruction; lack of a well-timed pretest measure may again have resulted in some degree of bias in the regression coefficients unless achievement differences among states in each topic remained approximately constant between the end of Grade 4 and the start of Grade 7. Although ideal mathematics pretest measures for each state were not available, all models accounted for state differences in child poverty rates, which could be expected to serve, to some degree, as a proxy for initial mean mathematics achievement.

Previous studies of mathematics achievement test item difficulty indicate that the most important predictors of difficulty are items’ mathematical complexity and linguistic features (Enright & Sheehan, 2002; Shaftel et al., 2006). All models controlled for item mathematical demands. While I did not have access to information about item linguistic features, linguistic features would not be expected to have any relation to state mathematics curricular emphasis patterns, except incidentally, so inclusion of these relevant item characteristics in the models would not be expected to change the size of the regression coefficients for the alignment-related variables, although it might adjust their standard errors. It is also difficult to conceive of the proportion of curriculum objectives measure acting as a proxy for some unobserved item characteristic, such that the apparent effect of curricular emphasis on item difficulty in the Grade 4 data is actually due to unmeasured features of the test items.

As well as requiring the assumption of no omitted variables, interpreting the regression coefficients and average marginal effects from fractional logit models as unbiased estimates necessitates the assumption that predictor variables are measured without error. In this study, the variables likely to contain the most random measurement error are the predictors of main interest: the alignment and proportion of curriculum objectives measures, and the teacher instructional emphasis ratings in the Grade 4 data. Consolidating objective proportions to the coarse-grained level should reduce the fraction of random measurement error in the proportion measure, which has already been averaged over multiple judges, relative to that which would be found in proportion measures derived from fine-grained SEC content analysis based on a larger classification matrix (Mehrens & Phillips, 1986). However, if the fine-grained rather than the coarse-grained objective proportion and alignment measures are theorized to be related to student achievement outcomes (Porter, 2002), then the coarse-grained variables used in this study should be viewed as fine-grained variables measured with error, and their relationship to achievement should be interpreted as likely attenuated relative to the relationship that would be expected if fine-grained alignment-related variables could be linked to the NAEP and TIMSS test items.
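The expected attenuation can be expressed with the classical errors-in-variables result for a single mismeasured regressor; again, this is a generic expression offered for interpretation, with the reliability ratio $\lambda$ unknown for these data.

\[
x = x^{*} + e, \qquad
\operatorname{plim}\,\hat{\beta} = \lambda\,\beta, \qquad
\lambda = \frac{\sigma^{2}_{x^{*}}}{\sigma^{2}_{x^{*}} + \sigma^{2}_{e}} \in (0,\, 1)
\]

If the coarse-grained proportion and alignment variables behave like noisy versions of the fine-grained quantities $x^{*}$ that are theorized to drive achievement, their estimated coefficients would be shrunk toward zero by a factor of roughly $\lambda$, which is one reason the small Grade 4 effects and null Grade 8 effects reported here may understate any true fine-grained relationship.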
The coarse-grained proportion of objectives measure, which averaged cellwise curriculum proportions across the grades during and immediately prior to testing, also ignored any effects of curriculum coverage in earlier grades (subsequent to the Grade 4 pretest measure, in the Grade 8 NAEP data) on achievement, perhaps further weakening the observed relationship between curricular emphasis and state item difficulty. The instructional emphasis ratings self-reported by teachers, used as a control variable in some Grade 4 model specifications, are also likely to contain measurement error. If teachers systematically over-report their emphasis on all curriculum topics, only the model intercepts will be biased—not a major concern. If, however, there is substantial random measurement error in the state mean instructional emphasis ratings, which are correlated with the curricular emphasis proportion measure, bias in the regression coefficient for the curriculum proportion measure could result. The estimated regression coefficients for curricular emphasis in the full models that include instructional emphasis may be too large. I would argue, though, that any effect of instructional emphasis on achievement could be considered a part of the effect of curricular emphasis on achievement, with curricular emphasis as the a priori cause (e.g., Holland, 1986), so that the reduced models excluding the possibly unreliable instructional emphasis variable may produce the best estimates of the relationship between the proportion of curriculum objectives and item difficulty. Average marginal effects for the proportion of objectives measures in these reduced models for the Grade 4 NAEP and TIMSS data, reported in Chapter 4, were small, positive, and very similar in magnitude to those interpreted in my conclusions.

5.5 Defensibility of Assumptions about the US Elementary Education System

If a test-curriculum alignment index is determined not to be positively related to student achievement gains as theorized, or to be only weakly related to achievement gains, it may be that the index is based on invalid measures of content emphasis, or is otherwise not functioning as a meaningful quantitative variable. However, numerous alternative explanations are possible (Porter, 2006). My theoretical model of mathematics item difficulty for elementary school students in the US essentially combined the model of curricular achievement from the early TIMSS studies (Travers & Westbury, 1989) with recent models of item difficulty on state mathematics achievement tests (Ferrara et al., 2011; Shaftel et al., 2006). Serious departure of my theoretical model and assumptions from the realities of the elementary education system in the mid-2000s could lead to incorrect conclusions about the practical importance of curricular emphasis and test-curriculum alignment for mathematics test item performance, even if the statistical conclusions presented previously are accurate. In particular, relationships between alignment-related measures and item difficulty in the large-scale mathematics assessment data, which appeared to be weakly positive (Grade 4) or null (Grade 8), could have been attenuated to the extent that instruction did not follow the state curricula, test items were not sensitive to instruction, or student motivation during test-taking was low. This section will weigh each of these counter-explanations.
The high correlation between curricular content emphasis proportions and mean teacher-reported instructional emphasis at the broad topic level observed in this study suggests that instruction, on average, did not diverge widely from the content specified by state curricula, but this correlation reflects a macro-level view of instruction-curriculum correspondence, and does not imply any direction of causality. Elementary mathematics teachers in these states may have willfully decided not to follow the specific lists of curriculum objectives given in curriculum documents, interpreted curriculum objective statements differently than state policy makers intended (e.g., Spillane, 2004), or, perhaps most likely, diverged from the curriculum due to reliance on state- or district-adopted instructional materials (e.g., Senk & Thompson, 2003) that were not designed to match their state’s curriculum. However, as described in Chapter 2, teachers were under significant pressure to deliver instruction following their state curriculum, and, in surveys of three states between 2004 and 2006 (Stecher et al., 2008), most elementary and middle school math teachers reported having modified the content of their instruction to better address state curriculum objectives, so there is some evidence that teachers were aware of state curriculum documents and intended their instruction to target grade-level student performance goals.

Achievement test items are often assumed or explicitly claimed to be sensitive to instruction following a particular curriculum. Although the items used in this study were not designed to test attainment of objectives set forth in a single specific curriculum document, both TIMSS and NAEP are intended as tests of school mathematics achievement (Mullis et al., 2005; NAGB, 2006), rather than, for instance, tests of mathematics literacy like the Programme for International Student Assessment studies. Instructional sensitivity of many items in IEA mathematics assessments prior to TIMSS 2007 has been demonstrated (Miller & Linn, 1988; Muthén, 1988; Schmidt et al., 1999). Because the TIMSS 2007 assessment framework draws heavily on the frameworks of previous IEA studies, performance on the TIMSS items would be expected to be influenced by differences in instruction across educational jurisdictions. While the NAEP assessment framework and test development procedures differ from those of TIMSS, Muthén et al. (1991) asserted that instructional sensitivity appeared to be highest for definitional and other low-complexity items, which comprise the largest fraction of the NAEP 2007 items, suggesting that many NAEP items should also be sensitive to differences in examinees’ instructional histories. The positive relationships generally observed in this study between the proportion of curriculum objectives measure and classical item difficulty indicate that the response processes for these assessments’ items can be influenced by state differences in instruction, as posited by Grissmer et al. (2000) regarding NAEP scale scores.

To gauge whether low student motivation during test-taking was likely to have distorted the relationship between alignment and test item performance, I examined self-reports of the NAEP examinees on two questionnaire items about effort.
Among Grade 4 examinees who responded to the effort questions in the nine states that had SEC curricular content analyses, tabulated in appendix Table A5, between one-tenth and one-fifth viewed success on NAEP as “somewhat” or “not” important, rather than “important” or “very important.” Similar proportions of fourth-graders asserted that they had not tried as hard on NAEP as on other tests. There was no obvious relationship between the proportions of fourth-graders responding affirmatively to these two prompts in the nine states. Eighth-graders were more likely than fourth-graders to make a distinction between the perceived importance of performing well on NAEP and the level of effort they actually exerted in responding to the test items. While between one-third and more than half of Grade 8 students, depending on state, recognized NAEP as a low-stakes test, only about one-fifth of them asserted that they had offered less than complete effort on the assessment. Many eighth-graders may have perceived some value in the test-taking experience besides tangible rewards or sanctions, which they recognized would be limited (e.g., Brophy & Ames, 2005). Judging from their self-reports, eighth-graders in Alabama and Kansas appear to have been more likely than those in the other eight states, on average, to have put forth at least as much effort in completing the NAEP items as they would in taking other tests, while those in Massachusetts seem to have been less likely to devote full effort to NAEP test-taking. Comparing the rightmost columns of appendix Table A5 in the eight states that were analyzed at both grade levels, overall, higher proportions of Grade 8 than Grade 4 students reported exerting less than their full effort on the NAEP assessment, a pattern of decreasing effort that would be expected across the NAEP-tested grade levels (Brophy & Ames, 2005). However, even among eighth-graders, the hypothesized relationship between alignment and mean test item performance appears unlikely to have been much attenuated by lack of examinee motivation on the outcome tests, presuming that students engaged similarly with the TIMSS assessment tasks.

5.6 Generalizability of the Results to Other Alignment Indices and State Curriculum Documents

As suggested by the discussion above, the findings of a weak positive relationship between the “coarse-grained” (Porter, 2002) SEC proportion of curriculum objectives and item difficulty in the Grade 4 data, and of no relationship between curricular emphasis or test-curriculum alignment measures and item difficulty in the Grade 8 data, would be expected to have limited generalizability to “fine-grained” versions of the same measures. Porter (2002, 2006) has contended that the alignment among curriculum documents, instruction, and assessments that matters most for student achievement is alignment at the level of specific content performance goals—matching of emphasis or instructional time at the level of broad content strands is argued to be inadequate to produce well-aligned tests, or achievement gains, depending on the intended use of an alignment index. The associations between analogous measures constructed from fine-grained SEC content analysis matrices and mathematics item difficulty or instructional content emphasis could be either larger or smaller in this state-item sample than those reported for the coarse-grained measures.
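For reference, the overall SEC alignment index described by Porter (2002) is computed in the same way at either grain size; only the set of cells changes. Writing $x_c$ for the proportion of curriculum content in cell $c$ and $y_c$ for the proportion of test content in that cell, with each set of proportions summing to one, the index is

\[
A \;=\; 1 - \frac{\sum_{c}\,\lvert x_c - y_c \rvert}{2}, \qquad 0 \le A \le 1.
\]

The cellwise measures analyzed in this study are item-level quantities built from the same kinds of cell proportions rather than this overall summary, but the point above applies to both: moving from coarse- to fine-grained cells changes $x_c$ and $y_c$, and the resulting associations with item difficulty could shift in either direction.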
The results of this study do not have direct implications for the validity of SEC test-curriculum alignment indices in other content areas (e.g., English, Language Arts, and Reading), or of SEC-type indices based on different content classification matrices (e.g., Liu & Fulmer, 2008). While my conclusions strictly deal with test-curriculum alignment indices only, validity evidence for test-curriculum alignment indices may bear on the functioning and interpretation (Porter et al., 2007) of instruction-curriculum alignment indices generated from SEC data by teachers, administrators, and researchers. Because the validation evidence collected for the SEC instruction-curriculum index thus far does not provide consistent support for the index as a measure of instruction-curriculum correspondence when controlling for preexisting group achievement differences (Gamoran et al., 1997; Polikoff & Porter, 2012), and this study produced only limited support for the curriculum emphasis proportions underlying SEC instruction-curriculum alignment indices, the need to further evaluate the validity of these indices is suggested.

The findings of this study cannot be readily generalized to support the validity of results from other popular alignment methods (e.g., Webb, Achieve, Human Resources Research Organization) because the types of results generated by various methods (e.g., qualitative or quantitative, single criterion or multiple criteria) differ considerably. While conclusions regarding the appropriateness of using unweighted counts of curriculum objectives to compute test-curriculum alignment indices would be equally relevant to the Webb balance-of-representation index and the SEC alignment index of interest in this study, the small magnitude of the positive relationships between SEC curricular emphasis proportions and Grade 4 test item difficulty observed in this study is not clearly attributable to problems with the objective weighting scheme. Based on this and previous research, the SEC index might be judged to be better supported by external validation evidence than other test-curriculum alignment indices, simply because no comparable information for other methods or indices has been published, although rater agreement data are now typically reported for various alignment methods’ document reviews.

The external validation evidence for the coarse-grained SEC test-curriculum alignment index produced by this study would be expected to have some generalizability to the group of states with established mathematics curriculum documents during this time frame. The weak or absent connection observed between alignment-related measures and NAEP mathematics item difficulty, based on test item performance in eleven states from different regions of the US, would be projected to have more generalizability to the US populations of Grade 4 and 8 students in 2007 than the TIMSS results, which reflect test item responses in only two states. Although elements and overall emphases of state curriculum documents at the same grade level are likely to have varied widely in 2007 (Porter et al., 2009; Reys et al., 2007), they were more similar in format than during previous decades (e.g., Webb, 2007), so relationships between the document content and other indicators of curricular emphasis could be predicted to be similar to those observed in this study’s cross-state sample.
Appendix Table A6, which displays means of state means for selected characteristics in the study states and all 50 states in Grades 4 and 8, suggests that the state curriculum content analysis samples should not be claimed to be representative of those from all states, as the study states, for instance, have a lower mean proportion of students eligible for the federal school meal program and higher average mean NAEP Mathematics scale scores than observed across all states at both grade levels. Even if the study states appeared to be representative with respect to these mean characteristics, they may not have been representative with respect to curriculum emphases. The results are more likely to generalize to states that had longstanding curriculum documents than to states with curriculum documents that were under initial development or re-development.

5.7 Suggestions for Future Validation Research

This study sought to investigate the validity of the interpretation of a commonly-used alignment index as a measure of test-curriculum correspondence by examining linear relationships between components of the index and concurrent measures of state curricular emphasis in mathematics. I interpreted the results as providing weak external validation evidence for the content analysis summary data (the proportions of objectives, or “grade-level expectations,” used to compute the SEC test-curriculum alignment index) as curriculum emphasis measures. However, even taken together with other validation studies’ results, there is little compelling support for any existing alignment method’s rating data as a replicable, meaningful indicator of test-curriculum correspondence. Additional scrutiny of commonly-used alignment methods appears to be warranted.

Noting that alignment methodologies themselves cannot be considered “valid,” Davis-Becker and Buckendahl (2013, p. 24) recommend collecting several types of evidence to validate the inferences from results of a given alignment study: procedural evidence (e.g., rater qualifications, execution of rater training), internal evidence (e.g., rater agreement, reliability coefficients), external evidence (e.g., replication studies), and utility evidence (i.e., observed usefulness of results to test and curriculum developers). Of the four types of evidence, internal agreement measures would seem to be the most easily reported, but they also could be inflated by implementation of consensus-seeking steps within a particular alignment procedure. Although not always systematically documented, the procedural information they list could usually be recorded in a straightforward manner following established quality-monitoring procedures for standard-setting studies (e.g., Cizek & Bunch, 2007). External validation evidence, though, may be expensive and thus difficult to collect (Davis-Becker & Buckendahl, 2013). Utility judgments would appear to be relevant only after alignment results have otherwise been established as sound.

Kane (2013) argues that validation studies should concentrate on testing the most doubtful claims in a particular test score interpretive argument. For the alignment indices and other alignment results used in validation of achievement test score interpretations, the most crucial, yet dubious, claims may be that (1) overall and item-level alignment conclusions are replicable across independent panels trained by different facilitators, and (2) the content classification schemes used capture distinct types of performances with mathematical content.
There is a need for evidence showing that alignment matching or rating frameworks have been systematically developed, reviewed, and potentially revised by diverse groups of curriculum and rater cognition experts. Likewise, there is a need to demonstrate that alignment results can be sufficiently replicated across independent review occasions, using real data collection as well as generalizability projections or agreement indices estimated from single panels. Since reproducibility will depend on features of the content classification scheme, these two claims are interrelated. The additional claim that proportions of curriculum objectives corresponding to particular content strands indicate intended emphasis by state policymakers also seems doubtful, but perhaps less fundamental than claims about the meaningfulness of the classification scheme used and the reproducibility of results.

Advising that the organizations sponsoring particular test-curriculum alignment reviews (often state education agencies) will seldom have the resources to finance collection of external validation evidence, Davis-Becker and Buckendahl note that assessment professionals have the responsibility to report and interpret such evidence if it is available. They suggest that external validation evidence could include evaluations of the same test-curriculum pair by alignment panels using multiple methods or the same method, results from other types of content analysis studies, or comparisons with test item content classifications assigned by item writers. Thus far, the primary external validation claim made for SEC alignment indices has been that they predict student achievement gains (e.g., Porter, Polikoff, Barghaus, & Yang, 2013), so the present study sought evidence relevant to this claim. However, interpreting alignment indices, or indeed any alignment results, as meaningful indicators of test-curriculum correspondence also minimally requires a claim that the results would not vary too greatly if an independent panel of qualified experts (having adequate familiarity with the given curriculum and examinee population), trained by a different facilitator, conducted the content analysis. This claim is implicit in the interpretation of results from any alignment review, including but not limited to SEC procedures, and is necessary to warrant further claims about state achievement test scores, for instance, that they measure students’ mastery of curriculum objectives. While monitoring curricular alignment and demonstrating its impact on student learning gains may be separately of interest, the most direct external validation evidence to obtain for results of SEC and other alignment methods, requiring no assumptions about instructional quality, would be evidence that the results are reproducible. Such a validation study would require greater resources than computation of agreement or generalizability coefficients, but could demonstrate that alignment results are not heavily dependent on the particular panelists selected (Webb et al., 2007) or on anomalies in the training or matching process, but rather represent an interpretation of the degree of test-curriculum correspondence that would be largely held in common by qualified experts.
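One minimal design for the kind of reproducibility evidence described above would have two independently trained panels content-analyze the same test and curriculum, and then compare the resulting cell proportions and overall index values. The sketch below assumes each panel's output is a topic-by-demand matrix of proportions for the test and for the curriculum; the function names, the illustrative numbers, and the use of Porter's (2002) index formula are assumptions for illustration, not a prescribed protocol.

```python
import numpy as np

def porter_index(curriculum: np.ndarray, test: np.ndarray) -> float:
    """Porter (2002) alignment index for two matrices of cell proportions.

    Each matrix holds topic-by-cognitive-demand proportions that sum to 1.
    """
    return 1.0 - np.abs(curriculum - test).sum() / 2.0

def panel_discrepancy(panel_a: np.ndarray, panel_b: np.ndarray) -> float:
    """Total absolute difference between two panels' cell proportions (0 = identical)."""
    return np.abs(panel_a - panel_b).sum()

# Hypothetical output from two independent panels (rows = topics, cols = demand levels).
curr_a = np.array([[0.20, 0.15], [0.40, 0.25]])
curr_b = np.array([[0.25, 0.10], [0.40, 0.25]])
test_a = np.array([[0.30, 0.10], [0.35, 0.25]])
test_b = np.array([[0.30, 0.15], [0.30, 0.25]])

print("panel A index:", porter_index(curr_a, test_a))
print("panel B index:", porter_index(curr_b, test_b))
print("curriculum cell discrepancy:", panel_discrepancy(curr_a, curr_b))
```

Close agreement between the two panels' index values and small cell discrepancies, observed across several such replications, would address the replicability claim directly, without any assumption about instruction or achievement.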
The procedural validation evidence outlined by Davis-Becker and Buckendahl (2013) implies that high-quality, high-fidelity implementation of any existing alignment method that evaluates both content topic and cognitive demand match between tests and curriculum documents may yield valid conclusions regarding test-curriculum alignment. However, not all test content classification schemes will be of equal quality or utility (Schmidt & Maier, 2009). Recent studies have particularly questioned the extent to which expert judges can reliably categorize test item content using published cognitive demand coding schemes (Schneider et al., 2013). In addition to seeking empirical evidence of alignment index validity as in the present study (see also, e.g., Porter et al., 2008; Webb et al., 2005), future development of commonly-used alignment indices should call on sizable, diverse groups of subject-matter curriculum experts and learning scientists to evaluate the theoretical underpinnings of the classification schemes used to rate behavioral tasks from tests and curriculum documents (Schmidt & Maier, 2009); topic or cognitive demand categories that are viewed as ill-defined, overlapping, or inconsistent with knowledge of mathematics learning should be revised. Future research should consider whether content topics are too fine-grained, or not fine-grained enough (Porter et al., 2013), and whether the descriptions, labels, and/or examples used to define cognitive demand categories are sufficiently distinct from one another, and comprehensive of the cognitive processing or behaviors that could be required by achievement test items. Consideration of cognitive demand classification schemes in mathematics is particularly needed to find schemes that can be reliably utilized by item raters (e.g., Ferrara et al., 2011) and item writers (e.g., Porter et al., 2013), and that ideally are also sensible from a cognitive science perspective. Alternatively, if devising cognitive demand schemes that can be consistently applied by alignment panelists and item writers proves infeasible, it may be necessary, for curricular achievement tests, to simply deem test items judged to require the particular behaviors described by curriculum objectives as “aligned” and items requiring other behaviors, even those that may require similarly difficult cognitive processing, as not aligned (e.g., D’Agostino et al., 2008). This method would highlight current measurement limitations of large-scale testing, but could provide a realistic, transparent assessment of test-curriculum match. Regardless of the test and curriculum content classification scheme adopted for a particular alignment review, detailed information about its development process, and about the consistency with which it can be applied by independent content experts on different occasions, should be viewed as important pieces of validation evidence in interpretive arguments for state achievement test scores as measures of curricular attainment.

APPENDIX

TABLE A1 SEC Task Cognitive Demand Category Description

Memorize. Illustrative examples: Recite basic mathematical facts. Recall mathematics terms and definitions. Recall formulas and computational procedures.

Perform procedures. Illustrative examples: Use numbers to count, order or denote. Perform computational procedures or algorithms. Follow procedures/instructions. Make measurements. Solve equations or routine word problems. Organize or display data. Read or produce graphs and tables. Execute geometric constructions.
Demonstrate understanding. Illustrative examples: Communicate mathematical ideas. Use representations to model mathematical ideas. Explain findings and results from data analysis. Develop/explain relationships between concepts. Explain relationships between models, diagrams or other representations.

Conjecture, generalize, prove. Illustrative examples: Determine the truth of a mathematical pattern or proposition. Write formal or informal proofs. Analyze data. Find a mathematical rule to generate a pattern or number sequence. Identify faulty arguments or misrepresentations of data. Reason inductively or deductively. Use spatial reasoning.

Solve nonroutine problems, make connections. Illustrative examples: Apply and adapt a variety of appropriate strategies to solve problems. Apply mathematics in contexts outside of mathematics. Synthesize content and ideas from several sources.

Note. From CCSSO & WCER (2004).

TABLE A2 NAEP Item Mathematical Complexity Category Description

Low complexity. This category relies heavily on the recall and recognition of previously learned concepts and principles. Items typically specify what the student is to do, which is often to carry out some procedure that can be performed mechanically. It is not left to the student to come up with an original method or solution. The following are some, but not all, of the demands that items in the low-complexity category might make: Recall or recognize a fact, term, or property. Recognize an example of a concept. Compute a sum, difference, product, or quotient. Recognize an equivalent representation. Perform a specified procedure. Evaluate an expression in an equation or formula for a given variable. Solve a one-step word problem. Draw or measure simple geometric figures. Retrieve information from a graph, table, or figure.

Moderate complexity. Items in the moderate-complexity category involve more flexibility of thinking and choice among alternatives than do those in the low-complexity category. They require a response that goes beyond the habitual, is not specified, and ordinarily has more than a single step. The student is expected to decide what to do, using informal methods of reasoning and problem-solving strategies, and to bring together skill and knowledge from various domains. The following illustrate some of the demands that items of moderate complexity might make: Represent a situation mathematically in more than one way. Select and use different representations, depending on situation and purpose. Solve a word problem requiring multiple steps. Compare figures or statements. Provide a justification for steps in a solution process. Interpret a visual representation. Extend a pattern. Retrieve information from a graph, table, or figure and use it to solve a problem requiring multiple steps. Formulate a routine problem, given data and conditions. Interpret a simple argument.

High complexity. High-complexity items make heavy demands on students, who must engage in more abstract reasoning, planning, analysis, judgment, and creative thought. A satisfactory response to the item requires that the student think in abstract and sophisticated ways. Items at the level of high complexity may ask the student to do any of the following: Describe how different representations can be used for different purposes. Perform a procedure having multiple steps and multiple decision points. Analyze similarities and differences between procedures and concepts. Generalize a pattern.
Formulate an original problem, given a situation. Solve a novel problem. Solve a problem in more than one way. Explain and justify a solution to a problem. Describe, compare, and contrast solution methods. Formulate a mathematical model for a complex situation. Analyze the assumptions made in a mathematical model. Analyze or produce a deductive argument. Provide a mathematical justification.

Note. From U.S. Department of Education, National Assessment Governing Board (2006, pp. 36–40).

TABLE A3 TIMSS Item Cognitive Domain Category Description

Knowing covers the facts, procedures, and concepts students need to know. This cognitive domain covers the following behaviors: Recall definitions, terminology, number properties, geometric properties, and notation. Recognize mathematical objects, shapes, numbers and expressions, and mathematical entities that are equivalent. Carry out algorithms for addition, subtraction, multiplication or division, or a combination of these, with whole numbers, fractions, decimals or integers. Approximate numbers to estimate computations. Carry out routine algebraic procedures. Retrieve information from graphs, tables or other sources. Read simple scales. Using measuring instruments, use units of measurement appropriately and estimate measures. Classify/group objects, shapes, numbers or expressions; make correct decisions about class membership. Order numbers and objects by attributes.

Applying focuses on the ability of students to apply knowledge and conceptual understanding to solve problems or answer questions. This cognitive domain covers the following behaviors: Select an efficient/appropriate operation, method or strategy for solving problems where there is a known algorithm or method of solution. Display mathematical information in diagrams, tables, charts or graphs, and generate equivalent representations for a given mathematical entity or relationship. Generate an appropriate model, such as an equation or diagram, for solving a routine problem. Follow and execute a set of mathematical instructions. Given specifications, draw figures or shapes. Solve routine problems, similar to those encountered in class (e.g., use geometric properties to solve problems; compare and match different data representations [Grade 8]; use data from charts, graphs or maps).

Reasoning goes beyond the solution of routine problems to encompass unfamiliar situations, complex contexts, and multi-step problems. This cognitive domain covers the following behaviors: Determine and describe or use relationships between variables or objects in mathematical situations. Use proportional reasoning (Grade 4). Decompose geometric figures to simplify solving a problem. Draw the net of a given unfamiliar solid. Visualize transformations of three-dimensional figures. Compare and match different representations of the same data (Grade 4). Make valid inferences from given information. Extend the domain to which the results of mathematical thinking and problem solving are applicable by restating results in more general, widely applicable terms. Combine mathematical procedures to establish results, and combine results to produce a further result. Make connections between different elements of knowledge and related representations, and link related mathematical ideas. Provide a justification for the truth or falsity of a statement by reference to mathematical results or properties.
Solve non-routine, unfamiliar problems in mathematical, real-life and/or complex contexts. Use geometric properties to solve non-routine problems. Note. From Mullis et al. (2005, pp. 33–38). 169 TABLE A4 Fractional Logit Regression Predicting State-Specific NAEP Grade 4 Classical Item Difficulty, Dropping One Influential Item (N = 1449) Model 1 Model 2 Model 3 Coef Coef Coef (SE) AME (SE) AME (SE) AME Proportion of curriculum objectives 0.039* 0.009 0.035* 0.008 on topic at item cognitive demand level (x10) (0.016) (0.017) Proportion of curriculum objectives 0.035* 0.008 on topic at item cognitive demand level or higher (x10) (0.016) Item topic (ref = Number Properties and Operations) 0.043 0.010 0.272 0.064 0.275 0.064 Measurement (0.171) (0.174) (0.174) 0.445* 0.103 0.647** 0.148 0.648** 0.149 Geometry (0.200) (0.202) (0.202) Data Analysis, Statistics, 0.242 0.057 0.485* 0.113 0.483* 0.112 and Probability (0.204) (0.203) (0.203) 0.248 0.058 0.440* 0.102 0.440* 0.102 Algebra (0.213) (0.213) (0.214) -0.839*** -0.196 -0.840*** -0.197 -0.840*** -0.196 Item complexity (NAEP categories) (0.115) (0.115) (0.115) -0.050*** -0.012 -0.054*** -0.013 -0.053*** -0.012 State principal component 1 score (0.003) (0.003) (0.003) -0.046*** -0.011 -0.044*** -0.01 -0.044*** -0.01 State principal component 2 score (0.003) (0.003) (0.003) 0.002 0.001 -0.006 -0.001 -0.004 -0.001 State principal component 3 score (0.004) (0.004) (0.003) -0.056*** -0.013 -0.058*** -0.014 -0.058*** -0.013 State principal component 4 score (0.004) (0.004) (0.004) State mean instructional content 0.375*** 0.088 0.373*** 0.087 emphasis on topic (scale 1–3) (0.039) (0.039) 2 R 0.269 0.271 0.271 Notes. * p < 0.05, ** p < 0.01, *** p < 0.001; AME = average marginal effect; ref = reference group; BIC = Bayesian information criterion 170 TABLE A5 Measures of Average Test-taking Effort by NAEP 2007 Examinees, by Grade and State Proportion reporting success on NAEP is somewhat or not important Grade 4 AL CA IN KS MA MI MN NJ OH OR VT Proportion reporting not trying as hard on NAEP as on other tests Grade 8 p SE 0.14 0.15 0.10 0.17 0.16 0.16 0.004 0.007 0.006 0.007 0.007 0.007 0.13 0.18 0.18 0.006 0.007 0.008 Grade 4 p 0.36 SE 0.009 0.49 0.39 0.60 0.48 0.52 0.54 0.50 0.54 0.57 0.010 0.010 0.010 0.010 0.010 0.010 0.010 0.010 0.011 Source. National Assessment of Educational Progress 2007. 171 Grade 8 p SE 0.21 0.16 0.11 0.11 0.18 0.12 0.005 0.007 0.006 0.006 0.007 0.006 0.15 0.16 0.10 0.007 0.007 0.006 p 0.18 SE 0.008 0.20 0.14 0.26 0.21 0.21 0.25 0.22 0.22 0.23 0.008 0.007 0.009 0.009 0.008 0.008 0.009 0.009 0.010 TABLE A6 Average Means of Selected State Characteristics for Study and All States in 2007, by Grade Grade 4 Grade 8 Study Proportion minority students Proportion students in rural schools Proportion federal school meal programeligible students Mean NAEP Mathematics scale score All Study All M SE M SE M SE M SE 0.291 0.060 0.351 0.027 0.260 0.033 0.334 0.026 0.336 0.069 0.384 0.028 0.384 0.062 0.401 0.027 0.392 0.028 0.438 0.016 0.344 0.022 0.390 0.015 243 2.60 240 0.899 286 2.01 281 0.898 Source. National Assessment of Educational Progress 2007. 172 REFERENCES 173 REFERENCES Abedi, J., & Lord, C. (2001). The language factor in mathematics tests. Applied Measurement in Education, 14(3), 219–234. American Educational Research Association, American Psychological Association, & National Council on Measurement in Education. (1999). Standards for educational and psychological testing. Washington, DC: American Psychological Association. 
American Federation of Teachers. (2008). Sizing up state standards 2008. Washington, DC: Author. Anderson, L. W. (2002). Curricular alignment: A re-examination. Theory Into Practice, 41(4), 255–260. Beck, M. D. (2007). Review and other views: Alignment as a psychometric issue. Applied Measurement in Education, 20(1), 127–135. Bejar, I. I. (1993). A generative approach to psychological and educational measurement. In N. Fredriksen, R. J. Mislevy, & I. I. Bejar (Eds.), Test theory for a new generation of tests (pp. 323–359). Hillsdale, NJ: Erlbaum. Bhola, D. J., Impara, J. C., & Buckendahl, C. W. (2003). Aligning tests with states’ content standards: Methods and issues. Educational Measurement: Issues and Practice, 22(3), 21–29. Blank, R. K., & Smithson, J. (2009). Alignment content analysis of TIMSS and PISA Mathematics and Science assessments using the Surveys of Enacted Curriculum methodology. Paper prepared for the National Center for Education Statistics and American Institutes for Research. Bloom, B. S. (1956). Taxonomy of educational objectives, Handbook I: The cognitive domain. New York: David MacKay Co Inc. Bollen, K. A., & Jackman, R. W. (1990). Regression diagnostics: An expository treatment of outliers and influential cases. In J. Fox & J. S. Long (Eds.), Modern methods of data analysis (pp. 257–291). Newbury Park, CA: Sage. Breslow N. E. (1996). Generalized linear models: Checking assumptions and strengthening conclusions. Statistica Applicata, 8, 23–41. 174 Buckendahl, C. W., Plake, B. S., Impara, J. C., & Irwin, P. M. (2000, April). Alignment of standardized achievement tests to state content standards: A comparison of publishers’ and teachers’ perspectives. Paper presented at the Annual Meeting of the National Council on Measurement in Education, New Orleans, LA. Brophy, J., & Ames, C. (2005). NAEP testing for twelfth graders: Motivational issues. Washington, DC: National Assessment Governing Board. Retrieved from ERIC database. (ED500959) Brown, R. S., & Conley, D. T. (2007). Comparing state high school assessments to standards for success in entry-level university courses. Educational Assessment, 12(2), 137–160. Cameron, A. C., & Miller, D. L. (in press). A practitioner’s guide to cluster-robust inference. Journal of Human Resources. Retrieved from http://cameron.econ.ucdavis.edu/research/papers.html Carmichael, S. B., Martino, G., Porter-Magee, K., & Wilson, W. S. (2010). The state of state standards—and the Common Core—in 2010. Washington, DC: Thomas B. Fordham Institute. Chalifour, C., & Powers, D. E. (1989). The relationship of content characteristics of GRE analytical reasoning items to their difficulties and discriminations. Journal of Educational Measurement, 26(2), 120–132. Cizek, G. J., & Bunch, M. B. (2007). Standard setting: A guide to establishing and evaluating performance standards on tests. Thousand Oaks, CA: Sage. Coates, D. (2003). Education production functions using instructional time as an input. Education Economics, 11(3), 273–292. Cohen, J. (1992). A power primer. Psychological Bulletin, 112(1), 155–159. Cook, T. D., & Campbell, D. T. (1979). Quasi-experimentation: Design and analysis issues for field settings. Chicago: Rand-McNally. Council of Chief State School Officers & Wisconsin Center for Educational Research [CCSSO & WCER]. (2004, October 25). Coding procedures for curriculum content analyses. Retrieved from https://secure.wceruw.org/seconline/Reference/CntCodingProcedures.pdf Cox, D. R., & Snell, E. J. (1989). Analysis of binary data (2nd ed.). 
New York: Chapman and Hall. Crocker, L., & Algina, J. (1986). Introduction to classical and modern test theory. New York: Holt, Rinehart, and Winston. 175 Crocker, L., Llabre, M., & Miller, M. D. (1988). The generalizability of content validity ratings. Journal of Educational Measurement, 25(4), 287–299. Crocker, L. M., Miller, M. D., & Franks, E. A. (1989). Quantitative methods for assessing the fit between test and curriculum. Applied Measurement in Education, 2(2), 179–194. Cronbach, L. J., Gleser, G. C., Nanda, H., & Rajaratnam, N. (1972). The dependability of behavioral measurements: Theory of generalizability for scores and profiles. New York, NY: Wiley. D’Agostino, J. V., Welsh, M. E., Cimetta, A. D., Falco, L. D., Smith, S., VanWinkle, W. H., & Powers, S. J. (2008). The rating and matching item-objective alignment methods. Applied Measurement in Education, 21(1), 1–21. D’Agostino, J. V., Welsh, M. E., & Corson, N. M. (2007). Instructional sensitivity of a state’s standards-based assessments. Educational Assessment, 12(1), 1–22. D’Agostino, R. B., Belanger, A., & D’Agostino, R. B., Jr. (1990). A suggestion for using powerful and informative tests of normality. American Statistician, 44(4), 316–321. Davis-Becker, S. L., & Buckendahl, C. W. (2013). A proposed framework for evaluating alignment studies. Educational Measurement: Issues and Practice, 32(1), 23–33. Donald, S. G. & Lang, K. (2007). Inference with differences-in-differences and other panel data. Review of Economics and Statistics, 89(2), 221–233. Ebel, R. L. (1956). Obtaining and reporting evidence on content validity. Educational and Psychological Measurement, 16(3), 269–282. Efron, B. (1987). Better bootstrap confidence intervals. Journal of the American Statistical Association, 82(397), 171–185. Embretson, S. E., & Daniel, R. C. (2008). Understanding and quantifying cognitive complexity level in mathematical problem solving items. Psychological Science Quarterly, 50(3), 328–344. Enright, M. K., & Sheehan, K. M. (2002). Modeling the difficulty of quantitative reasoning items: Implications for item generation. In S. H. Irvine & P. C. Kyllonen (Eds.), Item generation for test development (pp. 129–157). Mahweh, NJ: Erlbaum. Ferrara, S., & Duncan, T. (2011). Comparing science achievement constructs: Targeted and achieved. The Educational Forum, 75, 143–156. 176 Ferrara, S., Duncan, T., Freed, R., Velez-Paschke, A., McGivern, J., Mushlin, S., . . . Westphalen, K. (2004, April). Examining test score validity by examining item construct validity: Evidence of the alignment of observed skills, cognitive processes, and response strategies with test specifications. In E. A. Vanderputten (Chair), Putting alignment to the test. Symposium conducted at the Annual Meeting of the American Educational Research Association, San Diego, CA. Retrieved from http://www.air.org/files/AERA2004MidSchlScienceAssess.pdf Ferrara, S., Svetina, D., Skucha, S., & Davidson, A. H. (2011). Test design with performance standards and achievement growth in mind. Educational Measurement: Issues and Practice, 30(4), 3–15. Fischer, G. H. (1997). Unidimensional linear logistic Rasch models. In W. J. van der Linden & R. K. Hambleton (Eds.), Handbook of modern item response theory (pp. 225–243). New York: Springer. Floden, R. (2002). The measurement of opportunity to learn. In A. C. Porter & A. Gamoran (Eds.), Methodological advances in cross-national surveys of educational achievement (pp. 231–266). Washington, DC: National Academies Press. 
Flowers, C., Wakeman, S., Browder, D. M., & Karvonen, M. (2009). Links for Academic Learning (LAL): A conceptual model for investigating alignment of alternate assessments based on alternate achievement standards. Educational Measurement: Issues and Practice, 28(1), 25–37.
Freeman, D. J., Belli, G. M., Porter, A. C., Floden, R. E., Schmidt, W. H., & Schwille, J. R. (1983). The influence of different styles of textbook use on instructional validity of standardized tests. Journal of Educational Measurement, 20(3), 259–270.
Frisbie, D. A. (2003). Checking the alignment of an assessment tool and a set of content standards. Iowa Technical Adequacy Project (ITAP). Iowa City, IA: University of Iowa.
Fulmer, G. W. (2011). Estimating critical values for strength of alignment among curriculum, assessments, and instruction. Journal of Educational and Behavioral Statistics, 36(3), 381–402.
Gamoran, A., Porter, A. C., Smithson, J., & White, P. A. (1997). Upgrading high school mathematics instruction: Improving learning opportunities for low-achieving, low-income youth. Educational Evaluation and Policy Analysis, 19(4), 325–338.
Gill, J. (2001). Generalized linear models: A unified approach. Thousand Oaks, CA: Sage.
Glatthorn, A. A. (1999). Curriculum alignment revisited. Journal of Curriculum and Supervision, 15(1), 26–34.
Gorin, J. S. (2006). Test design with cognition in mind. Educational Measurement: Issues and Practice, 25(4), 21–35.
Grissmer, D. W., Flanagan, A., Kawata, J. H., & Williamson, S. (2000). Improving student achievement: What state NAEP test scores tell us. Santa Monica, CA: RAND.
Guion, R. E. (1977). Content validity: The source of my discontent. Applied Psychological Measurement, 1(1), 1–10.
Gulliksen, H. (1950). Intrinsic validity. American Psychologist, 5(10), 511–517.
Haertel, E. A. (1985). Content validity and criterion-referenced testing. Review of Educational Research, 55(1), 23–46.
Haladyna, T. M., Downing, S. M., & Rodriguez, M. C. (2002). A review of multiple-choice item-writing guidelines for classroom assessment. Applied Measurement in Education, 15(3), 309–333.
Hambleton, R. K. (1980). Test score validity and standard setting methods. In R. A. Berk (Ed.), Criterion-referenced measurement: The state of the art. Baltimore: Johns Hopkins University Press.
Hambleton, R. K., & Jirka, S. J. (2006). Anchor-based methods for judgmentally estimating item statistics. In S. M. Downing & T. M. Haladyna (Eds.), Handbook of test development (pp. 399–420). Mahwah, NJ: Erlbaum.
Hambleton, R. K., Pitoniak, M. J., & Copella, J. M. (2012). Essential steps in setting performance standards on educational tests and strategies for assessing the reliability of results. In G. J. Cizek (Ed.), Setting performance standards: Foundations, methods, and innovations (pp. 47–76). New York: Routledge.
Hattie, J., Jaeger, R. M., & Bond, L. (1999). Persistent methodological questions in educational testing. Review of Research in Education, 24, 393–446.
Herman, J. L., Webb, N. M., & Zuniga, S. A. (2007). Measurement issues in the alignment of standards and assessment: A case study. Applied Measurement in Education, 20(1), 101–126.
Hill, H. C. (2001). Policy is not enough: Language and the interpretation of state standards. American Educational Research Journal, 38(2), 289–318.
Holland, P. W. (1986). Statistics and causal inference. Journal of the American Statistical Association, 81(396), 945–960.
Jolliffe, I. T. (2002). Principal component analysis (2nd ed.). New York: Springer.
Jones, D. H., & Szatrowski, T. H. (1983). On the statistical determination of content validity. Educational and Psychological Measurement, 43(4), 995–1004.
Kane, M. T. (2006). Validation. In R. L. Brennan (Ed.), Educational measurement (4th ed., pp. 17–64). Westport, CT: American Council on Education/Praeger.
Kane, M. (2009). Validating the interpretations and uses of test scores. In R. W. Lissitz (Ed.), The concept of validity (pp. 39–64). Charlotte, NC: Information Age.
Kane, M. T. (2013). Validating the interpretations and uses of test scores. Journal of Educational Measurement, 50(1), 1–73.
Klein, D. (2005). The state of state math standards 2005. Washington, DC: Thomas B. Fordham Institute.
Klein, S. P., & Kosecoff, J. P. (1975). Determining how well a test measures your objectives. Los Angeles: Center for the Study of Evaluation, University of California. Retrieved from ERIC database. (ED109226)
Koretz, D. (2008). Further steps toward the development of accountability-oriented science of measurement. In K. E. Ryan & L. A. Shepard (Eds.), The future of test-based educational accountability (pp. 71–91). New York: Taylor & Francis.
Kurz, A., Elliott, S. N., Wehby, J. H., & Smithson, J. L. (2010). Alignment of the intended, planned, and enacted curriculum in general and special education and its relation to student achievement. Journal of Special Education, 44(3), 131–145.
La Marca, P. M. (2001). Alignment of standards and assessments as an accountability criterion. Practical Assessment, Research & Evaluation, 7(2).
La Marca, P. M., Redfield, D., & Winter, P. (2000). State standards and state assessment systems: A guide to alignment. Washington, DC: Council of Chief State School Officers.
Leighton, J. P., & Gokiert, R. J. (2008). Identifying potential test item misalignment using student verbal reports. Educational Assessment, 13(4), 215–242.
Leinhardt, G., & Seewald, A. M. (1981). Overlap: What's tested, what's taught? Journal of Educational Measurement, 18(2), 85–96.
Lepik, M. (1990). Algebraic word problems: Role of linguistic and structural variables. Educational Studies in Mathematics, 21(1), 83–90.
Linn, R. L. (1980). Issues of validity for criterion-referenced measures. Applied Psychological Measurement, 4(4), 547–561.
Linn, R. L., Baker, E. L., & Betebenner, D. W. (2002). Accountability systems: Implications of the requirements of the No Child Left Behind Act of 2001. Educational Researcher, 31(6), 3–16.
Liu, X., & Fulmer, G. W. (2008). Alignment between science curriculum and assessments in selected New York State Regents exams. Journal of Science Education and Technology, 17(4), 373–383.
Lobato, J., & Siebert, D. (2002). Quantitative reasoning in a reconceived view of transfer. Journal of Mathematical Behavior, 21(1), 87–116.
Martineau, J., Paek, P., Keene, J., & Hirsch, T. (2007). Integrated, comprehensive alignment as a foundation for measuring student progress. Educational Measurement: Issues and Practice, 26(1), 28–35.
Martinez, M. E. (1999). Cognition and the question of test item format. Educational Psychologist, 34(4), 207–218.
Martone, A. (2007). Exploring the impact of teachers’ involvement in an assessment-standards alignment study. Unpublished doctoral dissertation, University of Massachusetts Amherst.
Martone, A., & Sireci, S. G. (2009). Evaluating alignment between curriculum, assessment, and instruction. Review of Educational Research, 79(4), 1332–1361.
McMaken, J., & Porter, A. (2012). The Surveys of Enacted Curriculum as a measure of implementation. In D. J. Heck, K. B. Chval, I. R. Weiss, & S. W. Ziebarth (Eds.), Approaches to studying the enacted mathematics curriculum (pp. 173–193). Charlotte, NC: Information Age.
Mehrens, W. A., & Phillips, S. E. (1986). Detecting impacts of curricular differences in achievement test data. Journal of Educational Measurement, 23(3), 185–196.
Mehrens, W. A., & Phillips, S. E. (1987). Sensitivity of item difficulties to curricular validity. Journal of Educational Measurement, 24(4), 357–370.
Messick, S. (1989). Validity. In R. L. Linn (Ed.), Educational measurement (3rd ed., pp. 13–103). New York: American Council on Education and MacMillan.
Miller, M. D., & Linn, R. L. (1988). Invariance of item characteristic functions with variations in instructional coverage. Journal of Educational Measurement, 25(3), 205–219.
Millman, J., & Greene, J. (1989). The specification and development of tests of achievement and ability. In R. L. Linn (Ed.), Educational measurement (3rd ed., pp. 335–366). New York: American Council on Education and MacMillan.
Mislevy, R. J. (2009). Validity from the perspective of model-based reasoning. In R. W. Lissitz (Ed.), The concept of validity (pp. 83–108). Charlotte, NC: Information Age.
Mislevy, R. J., Steinberg, L. S., & Almond, R. G. (2002). On the roles of task model variables in assessment design. In S. H. Irvine & P. C. Kyllonen (Eds.), Item generation for test development (pp. 97–128). Mahwah, NJ: Erlbaum.
Mislevy, R. J., & Zwick, R. (2012). Scaling, linking, and reporting in a periodic assessment system. Journal of Educational Measurement, 49(2), 148–166.
Mullis, I. V., Martin, M. O., & Foy, P. (2008). TIMSS 2007 International Mathematics Report: Findings from IEA’s Trends in International Mathematics and Science Study at the Fourth and Eighth Grades. Chestnut Hill, MA: Boston College, Lynch School of Education, TIMSS & PIRLS International Study Center.
Mullis, I. V., Martin, M. O., Ruddock, G. J., O’Sullivan, C. Y., Arora, A., & Erberber, E. (2005). TIMSS 2007 Assessment Frameworks. Chestnut Hill, MA: Boston College, Lynch School of Education, TIMSS & PIRLS International Study Center.
Muthén, B. O. (1988). Instructionally sensitive psychometrics: Applications to the Second International Mathematics Study (CSE Technical Report 286). Los Angeles, CA: Center for Research on Evaluation, Standards, and Student Testing.
Muthén, B. O., Kao, C.-F., & Burstein, L. (1991). Instructionally-sensitive psychometrics: Application of a new IRT-based detection technique to mathematics test items. Journal of Educational Measurement, 28(1), 1–22.
Neidorf, T. S., Binkley, M., Gattis, K., & Nohara, D. (2006). Comparing mathematics content in the National Assessment of Educational Progress (NAEP), Trends in International Mathematics and Science Study (TIMSS), and Program for International Student Assessment (PISA) 2003 assessments (NCES 2006-029). Washington, DC: US Department of Education, National Center for Education Statistics.
No Child Left Behind Act of 2001. Pub. L. No. 107-110 U. S. C. §115, Stat. 1450 (2002).
Notice of Final Priorities for Race to the Top Fund, 74 Fed. Reg. 59,688 (Nov. 18, 2009).
Olson, J. F., Martin, M. O., & Mullis, I. V. (Eds.). (2008). TIMSS 2007 Technical Report. Washington, DC: US Department of Education, National Center for Education Statistics.
O'Neil, H. F., Jr., Sugrue, B., & Baker, E. L. (1996). Effects of motivational interventions on NAEP mathematics performance. Educational Assessment, 3(2), 135–157.
Papke, L. E., & Wooldridge, J. M. (1996). Econometric methods for fractional response variables with an application to 401(k) plan participation rates. Journal of Applied Econometrics, 11(6), 619–632.
Peak, H. (1953). Problems of objective observation. In L. Festinger & D. Katz (Eds.), Research methods in the behavioral sciences (pp. 243–300). New York: Dryden Press.
Pedulla, J. J., Abrams, L. M., Madaus, G. F., Russell, M. K., Ramos, M. A., & Miao, J. (2003). Perceived effects of state-mandated testing programs on teaching and learning: Findings from a national survey of teachers. Chestnut Hill, MA: National Board on Educational Testing and Public Policy. (ED481836)
Phillips, S. E., & Camara, W. J. (2006). Legal and ethical issues. In R. L. Brennan (Ed.), Educational measurement (4th ed., pp. 733–755). Westport, CT: American Council on Education and Praeger.
Phillips, S. E., & Mehrens, W. A. (1988). Effects of curricular differences on achievement test data at item and objective levels. Applied Measurement in Education, 1(1), 33–51.
Plake, B. S., Impara, J. C., & Buckendahl, C. W. (2004). Technical quality criteria for evaluating district assessment portfolios used in the Nebraska STARS. Educational Measurement: Issues and Practice, 23(2), 12–16.
Poggio, J. P., Glasnapp, D. R., Miller, M. D., Tollefson, N., & Burry, J. A. (1986). Strategies for validating teacher certification tests. Educational Measurement: Issues and Practice, 5(2), 18–25.
Polikoff, M. S. (2012a). Instructional alignment under No Child Left Behind. American Journal of Education, 118(3), 341–368.
Polikoff, M. S. (2012b). The association of state policy attributes with teachers’ instructional alignment. Educational Evaluation and Policy Analysis, 34(3), 278–294.
Polikoff, M. S., & Fulmer, G. W. (2013). Refining methods for estimating critical values for an alignment index. Journal of Research on Educational Effectiveness, 6(4), 380–395.
Polikoff, M. S., & Porter, A. C. (2012). Surveys of Enacted Curriculum Substudy of the Measures of Effective Teaching Project: Final report. Retrieved from http://www.aefpweb.org/sites/default/files/webform/FINAL%20SEC%20REPORT_Polikoff_Porter.pdf
Porter, A. C. (2002). Measuring the content of instruction: Uses in research and practice. Educational Researcher, 31(7), 3–14.
Porter, A. C. (2006). Curriculum assessment. In J. Green, G. Camilli, & P. Elmore (Eds.), Handbook of complementary methods in education research (pp. 141–160). Washington, DC: American Educational Research Association.
Porter, A. C., McMaken, J., & Blank, R. K. (2011). Surveys of Enacted Curriculum and the State School Officers Collaborative. In W. F. Tate, K. D. King, & C. R. Anderson (Eds.), Disrupting tradition: Research and practice pathways in mathematics education (pp. 21–31). Reston, VA: National Council of Teachers of Mathematics.
Porter, A. C., Polikoff, M. S., Zeidner, T., & Smithson, J. (2008). The quality of content analyses of state student achievement tests and state content standards. Educational Measurement: Issues and Practice, 27(4), 2–14.
Porter, A. C., Polikoff, M. S., Barghaus, K. M., & Yang, R. (2013). Constructing aligned assessments using automated test construction. Educational Researcher, 42(8), 415–423.
Porter, A. C., Polikoff, M. S., & Smithson, J. (2009). Is there a de facto national intended curriculum? Evidence from state content standards. Educational Evaluation and Policy Analysis, 31(3), 238–268.
Rabinowitz, S., Roeber, E., Schroeder, C., & Scheinker, J. (2006, January). Creating aligned standards and assessment systems (Issue paper 3). Retrieved from http://www.ccsso.org/Documents/2006/Creating_Aligned_Standards_2006.pdf
Raudenbush, S. W., & Bryk, A. S. (2002). Hierarchical linear models: Applications and data analysis methods. Thousand Oaks, CA: Sage.
Reckase, M. D., & Chen, J. (2012). The role, format, and impact of feedback to standard setting panelists. In G. J. Cizek (Ed.), Setting performance standards: Foundations, methods, and innovations (pp. 149–164). New York: Routledge.
Resnick, L. B., Rothman, R., Slattery, J. B., & Vranek, J. L. (2004). Benchmarking and alignment of standards and testing. Educational Assessment, 9(1&2), 1–27.
Reys, B., Chval, K., Dingman, S., McNaught, M., Regis, T. P., & Togashi, J. (2007). Grade-level learning expectations: A new challenge for elementary mathematics teachers. Teaching Children Mathematics, 14(1), 6–11.
Roach, A. T., McGrath, D., Wixson, C., & Talapatra, D. (2010). Aligning an early childhood assessment to state kindergarten content standards: Application of a nationally recognized alignment framework. Educational Measurement: Issues and Practice, 29(1), 25–37.
Roach, A. T., Niebling, B. C., & Kurz, A. (2008). Evaluating the alignment among curriculum, instruction, and assessments: Implications and applications for research and practice. Psychology in the Schools, 45(2), 158–176.
Robitaille, D., Schmidt, W. H., Raizen, S., McKnight, C., Britton, E., & Nicol, C. (1993). The Third International Mathematics and Science Study: Curriculum frameworks for mathematics and science (Monograph No. 1). Vancouver, Canada: Pacific Educational Press.
Rothman, R. (2003, March). Imperfect matches: The alignment of standards and tests. Paper commissioned by the National Research Council, Center for Education, Committee on Test Design for K-12 Science Achievement. Washington, DC: National Academy of Sciences.
Sanford, E. E., & Fabrizio, L. M. (1999, April). Results from the North Carolina-NAEP comparison and what they mean to the End-of-Grade Testing Program. Paper presented at the Annual Meeting of the American Educational Research Association, Montreal, Canada.
Schafer, W. D., Wang, J., & Wang, V. (2009). Validity in action: State validity evidence for compliance with NCLB. In R. W. Lissitz (Ed.), The concept of validity (pp. 173–193). Charlotte, NC: Information Age.
Schmeiser, C. B., & Welch, C. J. (2006). Test development. In R. L. Brennan (Ed.), Educational measurement (4th ed., pp. 307–353). Westport, CT: American Council on Education/Praeger.
Schmidt, W. H., & Maier, A. (2009). Opportunity to learn. In G. Sykes, B. Schneider, & D. N. Plank (Eds.), Handbook of education policy research (pp. 541–559). New York: Routledge.
Schmidt, W. H., McKnight, C. C., Cogan, L. S., Jakwerth, P. M., & Houang, R. T. (1999). Facing the consequences: Using TIMSS for a closer look at U.S. mathematics and science education. Dordrecht, The Netherlands: Kluwer.
Schmidt, W. H., McKnight, C. C., Houang, R. T., Wang, H., Wiley, D. E., Cogan, L. S., & Wolfe, R. G. (2001). Why schools matter: A cross-national comparison of curriculum and learning. San Francisco, CA: Jossey-Bass.
Schneider, M. C., Huff, K. L., Egan, K. L., Gaines, M. L., & Ferrara, S. (2013). Relationships among item cognitive complexity, contextual demands, and item difficulty: Implications for achievement-level descriptors. Educational Assessment, 18(2), 99–121.
Senk, S. L., & Thompson, D. R. (2003). Standards-based school mathematics curricula: What are they? What do students learn? Mahwah, NJ: Erlbaum.
Shaftel, J., Belton-Kocher, E., Glasnapp, D., & Poggio, J. (2006). The impact of language characteristics in mathematics test items on the performance of English language learners and students with disabilities. Educational Assessment, 11(2), 105–126.
Sireci, S. G. (1998). Gathering and analyzing content validity data. Educational Assessment, 5(4), 299–321.
Sireci, S. G., & Geisinger, K. F. (1992). Analyzing test content using cluster analysis and multidimensional scaling. Applied Psychological Measurement, 16(1), 17–31.
Smithson, J. L., & Collares, A. C. (2007, April). Alignment as a predictor of student achievement gains. Paper presented at the Annual Meeting of the American Educational Research Association, Chicago.
Snow, R. E. (1994). A person-situation interaction theory of intelligence in outline. In A. Demetriou & A. Efklides (Eds.), Intelligence, mind, and reasoning: Structure and development (pp. 11–28). Amsterdam: North-Holland.
Snow, R. E., & Lohman, D. F. (1989). Implications of cognitive psychology for educational measurement. In R. Linn (Ed.), Educational measurement (3rd ed., pp. 263–331). New York: Macmillan.
Spillane, J. S. (2004). Standards deviation: How schools misunderstand education policy. Cambridge, MA: Harvard University Press.
Stecher, B. M., Epstein, S., Hamilton, L. S., Marsh, J. A., Robyn, A., McCombs, J. S., . . . Naftel, S. (2008). Pain and gain: Implementing No Child Left Behind in three states, 2004–2006. Santa Monica, CA: RAND.
Swanson, C. B., & Stevenson, D. L. (2002). Standards-based reform in practice: Evidence on state policy and classroom instruction from the NAEP state assessments. Educational Evaluation and Policy Analysis, 24(1), 1–27.
Tatsuoka, K. K., Corter, J. E., & Tatsuoka, C. (2004). Patterns of diagnosed mathematical content and process skills in TIMSS-R across a sample of 20 countries. American Educational Research Journal, 41(4), 901–926.
Travers, K. J., & Westbury, I. (1989). The IEA Study of Mathematics I: Analysis of mathematics curricula. Oxford: Pergamon.
Truxillo, C. (2005). Maximum likelihood parameter estimation with incomplete data. Proceedings of the Thirtieth Annual SAS® Users Group International Conference. Retrieved from http://www2.sas.com/proceedings/sugi30/111-30.pdf
Turner, L. C., & Carlson, L. (2003). Indexes of item-objective congruence for multidimensional items. International Journal of Testing, 3(2), 163–171.
University of Wisconsin, Wisconsin Center for Educational Research, Measures of Enacted Curriculum Group [MECG]. (2004, October 25). Coding procedures for curriculum content analyses. Retrieved from https://secure.wceruw.org/seconline/Reference/CntCodingProcedures.pdf
University of Wisconsin, Wisconsin Center for Educational Research, Measures of Enacted Curriculum Group [MECG]. (2010, November 9). Mathematics content analysis [Data files]. Retrieved from http://seconline.wceruw.org/MSP/Content/ELA/ELACntRpt/WSELACntRptMenu.asp
US Department of Education. (2004, April). Standards and assessments peer review guidance: Information and examples for meeting the requirements of the No Child Left Behind Act of 2001. Washington, DC: Author. Retrieved from http://dese.mo.gov/divimprove/fedprog/grantmgmnt/NCLB_PDF/Standards_Assessemnts_Peer_Review_Guidance_04282004.pdf
US Department of Education. (2012, June). ESEA flexibility: Review guidance for Window 3. Washington, DC: Author. Retrieved from http://www2.ed.gov/policy/elsec/guid/esea-flexibility/index.html
US Department of Education, Institute of Education Sciences, National Center for Education Statistics [NCES]. (2008). Digest of education statistics (2008 ed.) [Statistical tables]. Retrieved from http://nces.ed.gov/programs/digest/2008menu_tables.asp
US Department of Education, Institute of Education Sciences, National Center for Education Statistics [NCES]. (2009, May 13). NAEP technical documentation. Retrieved from http://nces.ed.gov/nationsreportcard/tdw/
US Department of Education, Institute of Education Sciences, National Center for Education Statistics [NCES]. (2013). NAEP data explorer [Statistical tables]. Retrieved from http://nces.ed.gov/nationsreportcard/naepdata/dataset.aspx
US Department of Education, National Assessment Governing Board [NAGB]. (2006, September). Mathematics framework for the 2007 National Assessment of Educational Progress. Washington, DC: Author.
Vockley, M. (2009). Alignment and the states: Three approaches to aligning the National Assessment of Educational Progress with state assessments, other assessments, and standards. Washington, DC: Council of Chief State School Officers.
Webb, N. L. (1997). Criteria for alignment of expectations and assessments in mathematics and science education. Research Monograph No. 8. Washington, DC: Council of Chief State School Officers.
Webb, N. L. (1999). Alignment of science and mathematics standards and assessments in four states. Research Monograph No. 18. Madison, WI: University of Wisconsin, National Institute for Science Education.
Webb, N. L. (2007). Issues related to judging the alignment of curriculum standards and assessments. Applied Measurement in Education, 20(1), 7–25.
Webb, N. L., Alt, M., Ely, R., Cormier, M., & Vesperman, B. (2005). The Web Alignment Tool: Development, refinement, and dissemination. Washington, DC: Council of Chief State School Officers.
Webb, N. L., Alt, M., Ely, R., & Vesperman, B. (2005). Web Alignment Tool (WAT): Training Manual, Version 1.1. Retrieved from http://wat.wceruw.org/index.aspx
Webb, N. M., Herman, J. L., & Webb, N. L. (2007). Alignment of mathematics state-level standards and assessments: The role of reviewer agreement. Educational Measurement: Issues and Practice, 26(2), 17–29.
Welsh, M. E., D’Agostino, J. V., & Kaniskan, B. (2013). Grading as a reform effort: Do standards-based grades converge with test scores? Educational Measurement: Issues and Practice, 32(2), 26–36.
Wiley, D. E., & Yoon, B. (1995). Teacher reports on opportunity to learn: Analyses of the 1993 California Learning Assessment System (CLAS). Educational Evaluation and Policy Analysis, 17(3), 355–370.
Winfield, L. F. (1993, April). Investigating test content and curriculum content overlap to assess opportunity to learn. Paper presented at the Annual Meeting of the American Educational Research Association, Atlanta, GA.
Wixson, K. K., & Yochum, N. (2004). Research on literacy policy and professional development: National, state, district, and teacher contexts. The Elementary School Journal, 105(2), 219–242.
Woolard, J. C. (2007, April). Measuring systemic alignment of a state’s instruction, standards, and assessments: A baseline analysis. Paper presented at the Annual Meeting of the American Educational Research Association, Chicago.
Wooldridge, J. M. (2010). Econometric analysis of cross section and panel data (2nd ed.). Cambridge, MA: MIT Press.
Wyse, A. E., & Viger, S. G. (2011). How item writers understand depth of knowledge. Educational Assessment, 16(4), 185–206.
Yalow, E. S., & Popham, W. J. (1983). Content validity at the crossroads. Educational Researcher, 12(8), 10–14.
Zheng, B., & Agresti, A. (2000). Summarizing the predictive power of a generalized linear model. Statistics in Medicine, 19(13), 1771–1781.