APPLYING ITEM RESPONSE THEORY METHODS TO DESIGN A LEARNING PROGRESSION-BASED SCIENCE ASSESSMENT

By

Jing Chen

A DISSERTATION

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of

DOCTOR OF PHILOSOPHY

Measurement and Quantitative Methods

Curriculum, Teaching, and Educational Policy

2012

ABSTRACT

APPLYING ITEM RESPONSE THEORY METHODS TO DESIGN A LEARNING PROGRESSION-BASED SCIENCE ASSESSMENT

By

Jing Chen

Learning progressions are used to describe how students' understanding of a topic progresses over time and to classify students' progress into steps or levels. This study applies Item Response Theory (IRT) based methods to investigate how to design learning progression-based science assessments. The research questions of this study are: 1) how to use items in different formats to classify students into levels on the learning progression, 2) how to design a test that gives good information about students' progress through the learning progression of a particular construct, and 3) what characteristics of test items support their use for assessing students' levels.

Data used for this study were collected from 1,500 elementary and secondary school students during 2009-2010. The written assessment was developed in several formats: Constructed Response (CR) items, Ordered Multiple Choice (OMC) items, and Multiple True or False (MTF) items. The main findings of this study are as follows. The OMC, MTF and CR items may measure different components of the construct. A single construct explained most of the variance in students' performances; however, additional dimensions defined by item format explain a certain amount of the variance in student performance. So additional dimensions need to be considered when we want to capture the differences in students' performances on different types of items targeting understanding of the same underlying progression. Items in each format need to be improved in certain ways to classify students more accurately into the learning progression levels.

This study also establishes some general steps that can be followed to design other learning progression-based tests. For example, first, the boundaries between levels on the IRT scale can be defined using the means of the item thresholds across a set of good items. Second, items in multiple formats can be selected to achieve the information criterion at all the defined boundaries; this ensures the accuracy of the classification. Third, when item threshold parameters vary across items, the scoring rubrics and the items need to be reviewed to make the threshold parameters similar across items. This is because one important design criterion for learning progression-based items is that, ideally, a student should be at the same level across items, which means that the item threshold parameters (d1, d2 and d3) should be similar across items.

To design a learning progression-based science assessment, we need to understand whether the assessment measures a single construct or several constructs and how items are associated with the constructs being measured. Results from the dimensionality analyses indicate that items on different carbon-transforming processes measure different aspects of the carbon cycle construct. However, items involving different practices assess the same construct. In general, there are high correlations among the different processes and practices.
It is not clear whether the strong correlations are due to inherent links among these process/practice dimensions or to the fact that the student sample does not show much variation on these dimensions. Future data are needed to examine the dimensionality in terms of process/practice in detail. Finally, based on the item characteristics analysis, recommendations are made for writing more discriminating CR items and better OMC and MTF options. Item writers can follow these recommendations to write better learning progression-based items.

To My Father

ACKNOWLEDGEMENTS

The current work is the result of several years of effort, challenges and hard work. It would not have been possible without the help and support of my advisors, colleagues, friends and family. I am especially grateful to my two advisors, Professor Charles Anderson and Professor Mark Reckase. They guided me in navigating the complexity of educational research in both its qualitative and quantitative dimensions. I joined Andy's Environmental Literacy project when I started my graduate program five years ago. I am deeply influenced by his devotion and enthusiasm for teaching and research. His knowledge and expertise greatly broadened my vision of learning progressions as well as of science education in general. I started working with Mark when I began my dual degree program in Measurement and Quantitative Methods. I have benefited greatly from Mark's teaching style, broad knowledge and down-to-earth attitude towards research. I would like to express my sincere appreciation to Prof. Christina Schwarz, Prof. Amelia Gotwals and Prof. Edward Roeber for their valuable comments and suggestions on my dissertation. I thank my colleagues in the Environmental Literacy project; we worked together to code the 1,500 test papers, and all the data used in this dissertation are the result of their hard work. I want to thank my parents for their long-standing support and encouragement. Finally, I owe my deepest appreciation to my husband, Jiangang. His love and encouragement were indispensable in getting me through the challenges in my research and life.

PREFACE

First, this research is supported in part by grants from the National Science Foundation: Learning Progression on Carbon-Transforming Processes in Socio-Ecological Systems (NSF 0815993); Targeted Partnership: Culturally Relevant Ecology, Learning Progressions and Environmental Literacy (NSF-0832173); and CCE: A Learning Progression-based System for Promoting Understanding of Carbon-transforming Processes (DRL 1020187). Additional support comes from the Great Lakes Bioenergy Research Center. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation or the United States Department of Energy.

Second, the focus of this study is assessment design. When the analyses indicate that an item is not well aligned with the learning progression framework, it is difficult to tell whether the problem is with the item, the rubric, or the learning progression framework itself. The learning progression framework may not capture some alternate trajectories students take to achieve environmental literacy. Thus, further investigating the validity of the learning progression framework is an important topic.
However, this dissertation focuses on the problems with the assessment and the items rather than the learning progression framework itself. So I consider the problems from the assessment perspective and suggest improvements from the assessment perspective only. vi TABLE OF CONTENTS LIST OF TABLES ......................................................................................................................... ix LIST OF FIGURES ....................................................................................................................... xi Chapter 1 Introduction .................................................................................................................... 1 1.1 Environmental and social background .................................................................................. 1 1.2 Environmental literacy project and learning progression-based assessments ...................... 3 1.3 Purpose of this study ............................................................................................................. 6 Chapter 2 Literature Review ......................................................................................................... 11 2.1 Assessment triangle ............................................................................................................ 11 2.2 Cognition—Learning progression of carbon cycle ............................................................. 14 2.2.1. Learning progression and progress variables .............................................................. 14 2.2.2. Carbon cycle learning progression.............................................................................. 16 2.2.3. Carbon cycle progress variables: process and practice ............................................... 19 2.3 Observation—Items in different formats ............................................................................ 22 2.4 Interpretation— The psychometric models ........................................................................ 26 Chapter 3 Psychometric Theories and Related Terminologies ..................................................... 28 3.1 Classical test theory ............................................................................................................ 28 3.2 Item response theory ........................................................................................................... 29 3.2.1 Unidimensional IRT .................................................................................................... 30 3.2.2 Multidimensional IRT (MIRT) .................................................................................... 32 3.2.3 Maximum likelihood estimation .................................................................................. 40 3.2.4 Information function .................................................................................................... 43 3.3 Information and hypothesis testing ..................................................................................... 44 3.3.1 Some basic notation ..................................................................................................... 44 3.3.2 Hypothesis testing ........................................................................................................ 47 3.4 Learning progression and IRT ............................................................................................ 
51 Chapter 4 Methodology ................................................................................................................ 53 4.1 Data ..................................................................................................................................... 53 4.2 The carbon cycle test design ............................................................................................... 53 4.3 Scoring rubrics and coding process .................................................................................... 57 4.4 Data analysis ....................................................................................................................... 58 Chapter 5 Design of a Test Consisting of Items in Multiple Formats .......................................... 60 5.1 Research purpose and procedure ........................................................................................ 60 5.2 Classical test statistics—Item discrimination index ........................................................... 60 5.3 Dimensionality in terms of item format .............................................................................. 61 5.4 Select items to meet the design criteria............................................................................... 63 5.4.1 Design criteria of the learning progression-based carbon cycle assessment ............... 63 5.4.2 Design a test to meet each criterion ............................................................................. 64 5.5 Design discriminative OMC, MTF and CR items .............................................................. 72 vii 5.5.1. OMC options............................................................................................................... 73 5.5.2 MTF options................................................................................................................. 77 5.5.3 CR items....................................................................................................................... 84 Chapter 6 Design of a Test to Assess a Particular Process or Practice ......................................... 91 6.1 Research purpose and procedure ........................................................................................ 91 6.2 Dimensionality in terms of process or practice................................................................... 94 6.3 Design a test to assess a particular process ......................................................................... 97 Chapter 7 Item Characteristics .................................................................................................... 102 7.1 Research purpose and procedures ..................................................................................... 102 7.2 Write good learning progression-based items: How item statistics are related to the item characteristics .......................................................................................................................... 102 7.3 Write good learning progression-based items: How item statistics are related to suggestions from qualitative evaluation.................................................................................. 104 7.3.1 CR items..................................................................................................................... 104 7.3.2 OMC and MTF items ................................................................................................. 
111 7.4 Recommendations for writing items in future .................................................................. 111 Chapter 8 Discussion and Conclusions ....................................................................................... 113 8.1 Summary of main findings and implications .................................................................... 113 8.1.1. Items in different formats are associated with one main construct but also measure slightly different aspects of the construct ........................................................................... 113 8.1.2. Improve the quality of the OMC, MTF and CR items .............................................. 114 8.1.3. Use items in multiple formats to meet the test information criterion ....................... 115 8.1.4. Design a test to assess a particular process or practice ............................................. 116 8.1.5. Implications from the item characteristics analysis .................................................. 117 8.2 Discussion of the results ................................................................................................... 117 8.2.1. Items in different formats.......................................................................................... 117 8.2.2. Assessing a particular process .................................................................................. 120 8.3 The broader implications to learning progression-based assessments .............................. 121 8.4 Limitations of this study and future work ......................................................................... 122 APPENDICES ............................................................................................................................ 124 Appendix A Item list ............................................................................................................... 125 REFERENCES ........................................................................................................................... 147 viii LIST OF TABLES Table 4.1 Number of test papers collected during 2009-2010 ...................................................... 53 Table 4.2 Three alternative high school test forms ....................................................................... 54 Table 4.3 Number of responses per item ...................................................................................... 55 Table 5.1 Correlations among the EAP estimates in each dimension .......................................... 62 Table 5.2 Descriptive statistics of the item parameters of 38 selected good items ...................... 65 Table 5.3 OMC and MTF item difficulty (recoded as dichotomous items) ................................. 70 Table 5.4 Cross-tabulation between OMC levels and CR levels for ACORN item ..................... 74 Table 5.5 Cross-tabulation between OMC levels and CR levels for BODYTE item ................... 76 Table 5.6 Percentages of each response string of the ENERPLNT item and the average ability estimates for each response string......................................................................................... 79 Table 5.7 Percentages of students at each CR level for students who selected Y and those who selected N to each T or F question ........................................................................................ 
83 Table 5.8 Compare the average CR level of two groups of students: students who selected Y and those who selected N ............................................................................................................ 83 Table 6.1 Goodness of fit test among four models ....................................................................... 95 Table 6.2 Correlations among process dimensions....................................................................... 95 Table 6.3 Correlations among practice dimensions ...................................................................... 96 Table 6.4 The item parameters of each process ............................................................................ 99 Table A.1 Descriptions of the four achievement levels of carbon cycle learning progression .. 135 Table A.2 The specific rubric of the CARGAS item .................................................................. 137 Table A.3 Unidimensional PCM results ..................................................................................... 140 Table A.4 The step threshold parameters of 38 good items ....................................................... 142 Table A.5 Excluded items (misfit items and items that have thresholds not in the correct order) ............................................................................................................................................. 144 ix Table A.6 Effective options of each MTF item .......................................................................... 145 x LIST OF FIGURES Figure 2.1 The assessment triangle ............................................................................................... 12 Figure 3.1 Structures of between-item and within-item multidimensionality .............................. 37 Figure 3.2 Fractional error of the recovered variance as a function of test information. ............. 50 Figure 5.1 Item difficulty (b) and threshold (d) parameter distribution ....................................... 66 Figure 5.2 Information curve of 16 real items .............................................................................. 68 Figure 5.3 Information curve of 14 simulated ideal items ............................................................ 68 Figure 5.4 Information curve formed by 3 CR with thresholds at the boundaries and 14 ideal dichotomous items with item difficulties at the boundaries ................................................. 72 Figure 5.5 Item threshold parameters of all the CR items ............................................................ 85 Figure 5.6 The characteristic curves by category of nine CR items ............................................. 86 Figure 6.1 Graphical representation of the unidimensional and multidimensional models ......... 92 Figure 6.2 Information of all items of each process ..................................................................... 97 Figure 7.1 Item information curve of OCTAMOLE and CARGAS .......................................... 106 xi Chapter 1 Introduction Designing assessments is a complex process, involving numerous interdependent components. How to design high-quality science assessments is a tough question to answer. By examining the fundamental components of the carbon cycle assessment developed by the Environmental Literacy project in the past several years and the interplay among these components, this study aims at exploring the ways to design high-quality science assessments. 
In this chapter, I will first introduce the environmental and social background that shapes the Environmental Literacy project, and then introduce the assessment developed by the project. Finally, I will elaborate on the specific research purposes of this study.

1.1 Environmental and social background

The global climate is changing. The global surface temperature has increased 1.4°F since the beginning of the 20th century, with about 1.1°F of the increase occurring in the past 30 years. The temperature will likely rise at least another 2°F, and possibly more than 11°F, in the next 100 years (IPCC, 2007). The predicted consequences of this increase include widespread melting of snow and ice, rising global average sea level, increasing frequency and severity of storms, and other effects on natural ecosystems and human agriculture (Crowley, 2000; Falkowski et al., 2000; Keeling & Whorf, 2005).

Most scientists agree that the warming in recent decades has been caused primarily by increasing concentrations of greenhouse gases such as carbon dioxide and water vapor in the atmosphere. The climate changes because sunlight can pass through the atmosphere and warm the planet, but the greenhouse gases hinder the escape of heat from the Earth to outer space. The increasing concentration of greenhouse gases in the atmosphere results primarily from human activities such as the burning of fossil fuels and deforestation. Human activities upset the balance between the amount of carbon emitted into the atmosphere and the amount of carbon absorbed by plants and other "sinks" on the surface of the Earth. Currently, combustion of fossil fuels releases about 7 billion tons of carbon to the atmosphere every year (Hotinski, 2007). Only about half of this excess carbon dioxide is absorbed by the ocean, plants, and trees, while the rest accumulates in the atmosphere. So human activities have influenced the ecological carbon cycle, causing more and more carbon to move from forests and fossil fuels to the atmosphere. This enhances the natural greenhouse effect and leads to the increase in global surface temperature.

The burning of fossil fuels usually occurs in automobiles, in factories, and in power plants that provide energy for people. It is a result of both individual and collective human activities. Hence, slowing down the rate of carbon dioxide emission and global warming requires both individual and collective efforts. In order to make responsible and knowledgeable decisions about the urgent issue of climate change, people need to understand the complex carbon cycling processes and to be environmentally literate. The topic of "carbon cycling" therefore has unique scientific and practical importance. It is especially important for K-12 students to understand carbon cycling while they are in school so that they can become more environmentally responsible citizens in the future.

However, many studies have shown that Americans are by and large uninformed or misinformed about environmental science (KACEE News, 2005). The Ninth Annual National Report Card shows that environmental "illiteracy" remains widespread among American adults, even though 95 percent of them endorse environmental education in the schools (NEETF & Roper, 2001). Less than half of the American public realizes that driving cars and using electrical appliances in their homes contribute to global climate change.
Among the general public, only 45% of people can correctly identify emissions from autos, homes, and industries as the main cause of global climate change (NEETF & Roper, 2001). Only 12% of Americans can pass a basic quiz on awareness of energy topics (Coyle, 2005). Both global climate change and the widespread environmental "illiteracy" among Americans make it imperative to improve environmental science education. The Environmental Literacy in America report indicates that environmentally knowledgeable people are: 10% more likely to save energy in their homes; 50% more likely to recycle; 10% more likely to purchase environmentally safe products; and 50% more likely to avoid using chemicals in yard care (Coyle, 2005). Environmental knowledge makes a difference in the decisions that people make about environmental issues. Science education therefore needs to equip students with environmental knowledge so that they are more likely to make environmentally friendly decisions. Investigating students' understanding of carbon cycling and how that understanding progresses over time is a necessary first step toward finding ways to improve current science education and help more students become environmentally literate.

1.2 Environmental literacy project and learning progression-based assessments

[Footnote 1: Learning progressions have been referred to by many different names, including progress variables, learning trajectories, progressions of developmental competence, and profile strands.]

The goal of the Environmental Literacy project at Michigan State University (MSU) is to improve students' environmental literacy by the time they graduate from high school or college. Environmental literacy involves an understanding of the underlying environmental principles and applying these principles in everyday life to make informed decisions. The Environmental Literacy project has several research strands, and carbon cycle is one of the strands. The carbon cycle strand aims to investigate students' learning progression in understanding the carbon cycle.

The carbon cycle is a key to understanding environmental systems. All living organisms are made of carbon compounds. Plants are the producers that generate organic carbon and convert light energy into chemical potential energy. All living organisms transform carbon compounds in order to grow and oxidize carbon compounds to obtain energy. In human systems, the combustion of organic carbon supplies energy to run vehicles, electrical appliances, and so on. Thus, the key biogeochemical processes include (a) organic carbon generation (photosynthesis), (b) organic carbon transformation (biosynthesis, digestion, food webs, and carbon sequestration), and (c) organic carbon oxidation (cellular respiration, combustion). Because these processes are the means by which living organisms and human systems acquire energy and the means by which environmental systems regulate levels of atmospheric CO2, they are used to describe environmental systems. These carbon-transforming processes are essential for students to understand environmental systems.

To explore students' understanding of these carbon-transforming processes and how their understanding progresses over time, the Environmental Literacy research team developed learning progressions to describe students' progress over time.
The idea of a learning progression implies that "science learning is not simply a process of acquiring more knowledge and skills, but rather a process of progressing toward greater levels of competence as new knowledge is linked to existing knowledge, and as new understandings build on and replace earlier, naïve conceptions" (Wilson & Bertenthal, 2005, p. 114). In the Environmental Literacy project, the learning progression is developed in an iterative process that moves back and forth among three major elements: a framework, assessments and scoring rubrics.

The framework describes learning of a specific concept over long periods of time. First, researchers develop an initial framework. Under the guidance of the framework, assessment tasks are designed to assess students' understanding of the concept. Based on the framework and the patterns in the assessment data, scoring rubrics are then developed to grade students' responses to the assessment tasks. Researchers then use the results to revise the framework. After the framework has been revised, they revise existing items and develop new items according to the revised framework. This is an iterative process, in which the results from the assessments lead to revisions in the framework and vice versa. Over the past five years, the Environmental Literacy project has gone through three iterative development cycles. Each iterative cycle represents an effort to strengthen the linkage and coherence among the elements. The assessments include both written assessments and clinical interviews.

The learning progression hypothesis suggests that there are general patterns in the development of students' knowledge and practice that are both conceptually coherent and empirically verifiable (Anderson, 2010). Through an iterative process of design-based research, moving back and forth between the development of frameworks and empirical studies of students' reasoning and learning, researchers can develop research-based frameworks, assessments and scoring rubrics that are both conceptually coherent and empirically verifiable. In the Environmental Literacy project, the researchers implicitly make claims such as:

1) The learning progression framework represents empirically-verified levels of students' achievement in developing accounts of carbon-transforming processes.

2) The learning progression framework and the carbon cycle assessment are unidimensional. The learning progression levels, which represent increasing understanding and complexity, can be ordered along a continuum. Though the assessment includes items assessing different processes, students use the same ability to answer these items.

3) The assessment can accurately locate individual students' understanding within the framework.

There are multiple threats to the validity of these claims. In a previous study (Mohan, Chen, Baek, Choi, Lee, & Anderson, 2009), the carbon cycle research team investigated the validity of the first and second claims. The study showed that the carbon cycle learning progression framework represented empirically-verified levels of students' achievement. Another conclusion was that students show a similar level of reasoning across different carbon-transforming processes. While there were unique patterns for each item, the overall trend did not suggest major differences in reasoning among the process dimensions. So the assessment and the learning progression framework were essentially unidimensional. However, this conclusion was not based on a statistical analysis.
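One way such a conclusion could be put on a statistical footing is to compare the fit of a unidimensional IRT model against a multidimensional alternative calibrated on the same responses; goodness-of-fit comparisons of this general kind are reported later (Chapter 6, Table 6.1). The sketch below is only an illustration of a likelihood-ratio comparison, not the procedure actually used in this dissertation; the log-likelihoods, parameter counts, and function name are hypothetical placeholders standing in for output from whatever IRT calibration program is used.

from scipy.stats import chi2

def lr_test(loglik_uni, k_uni, loglik_multi, k_multi):
    """Likelihood-ratio test of a unidimensional IRT model against a nested
    multidimensional alternative fit to the same item responses."""
    g2 = 2.0 * (loglik_multi - loglik_uni)  # deviance difference
    df = k_multi - k_uni                    # extra parameters in the larger model
    return g2, df, chi2.sf(g2, df)          # chi-square p-value

# Hypothetical values from an IRT calibration run:
g2, df, p = lr_test(loglik_uni=-25400.0, k_uni=120,
                    loglik_multi=-25350.0, k_multi=126)
print(g2, df, p)  # a small p-value favors the multidimensional model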
Continually monitoring the patterns in these dimensions is needed when revising the assessments and the framework. The focus of this dissertation is to investigate how to design learning progression-based science assessments that accurately classify students among the achievement levels. This is related to both the second and the third claims mentioned above. The dimensionality analysis can inform the assessment design: whether students use the same ability or different abilities to answer the items has different implications for the assessment design. Meanwhile, designing assessments to accurately classify students among the learning progression achievement levels supports the validity of the third claim above. The research focus of this dissertation is specified in the three research questions listed in the following section.

1.3 Purpose of this study

The general purpose of this study is to investigate how to design learning progression-based science assessments that accurately classify students' understanding into learning progression levels. There are three specific research questions:

1) How can tests be designed that use items in different formats (constructed response, ordered multiple-choice, multiple true or false) to accurately classify students' understanding into levels on the learning progression?

2) The carbon cycle learning progression framework includes students' understanding of different carbon-transforming processes and different scientific practices. Is students' understanding of these processes and practices associated with a single construct or with different constructs? If items on different processes/practices measure different constructs, how should a test be designed to estimate students' proficiency for a particular process or practice?

3) What characteristics of test items support their use for assessing students' levels in a learning progression? How can these characteristics be used as design criteria for test items?

The first, second and third research questions are investigated in Chapters 5, 6 and 7, respectively. Both the first and second questions concern designing a test that is sufficient to accurately classify students' understanding into levels on the learning progression in science. More specifically, accurately classifying students means the test has small measurement error for students over a range of abilities. The psychometric information reflects the amount of measurement error in persons' ability estimates: the larger the information, the smaller the measurement error (more details about information can be found in Section 3.2.4). A test with high test information is desirable since it can measure students' ability more precisely.

In practice, test information around 10 can be considered good for detecting differences between individual students. Test information around 5 can be considered sufficient to detect the difference between two groups of about 30 students each (see Section 3.3 for a detailed discussion of why information of 10 or 5 is chosen as a rule of thumb). If the test information is 5, then the test can detect a certain difference between two groups of students at given significance and power levels. For example, if each group has 30 students and the observed variance of the ability estimates in each group is assumed to be 0.5, the test can detect a difference of 0.58 (on the latent trait scale rather than the raw score scale) between the mean ability estimates of the two groups at a significance level of .05 and a power of 0.8.
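This link between test information and detectable group differences can be sketched with a short calculation. The sketch below is a minimal illustration, assuming a two-sided two-sample z-test with equal group sizes and equal variances; treating the error variance implied by the test information (1/information) as added to the observed variance of the ability estimates is a further assumption, and the function names are illustrative rather than part of this dissertation's analyses.

from math import sqrt
from statistics import NormalDist

def sem_from_information(info):
    """Standard error of measurement implied by test information I: SEM = 1 / sqrt(I)."""
    return 1.0 / sqrt(info)

def detectable_difference(var_per_group, n_per_group, alpha=0.05, power=0.80):
    """Smallest group-mean difference detectable with a two-sided two-sample
    z-test (normal approximation, equal group sizes and equal variances)."""
    z = NormalDist()
    se_diff = sqrt(2.0 * var_per_group / n_per_group)
    return (z.inv_cdf(1.0 - alpha / 2.0) + z.inv_cdf(power)) * se_diff

print(round(sem_from_information(10), 2))  # 0.32: small enough for individual-level decisions
print(round(sem_from_information(5), 2))   # 0.45
# Groups of 30 with observed variance 0.5 per group; adding the error variance
# 1/5 implied by test information 5 gives a detectable difference of roughly 0.6,
# close to the value of about 0.58 quoted above.
print(round(detectable_difference(0.5 + 1.0 / 5.0, 30), 2))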
The first research question investigates how to design a test composed of items in different formats to classify students' understanding into levels. The items developed by the Environmental Literacy project are in three formats: 1) Constructed Response (CR) items, 2) Ordered Multiple-Choice (OMC) (Briggs, Alonzo, Schwab, & Wilson, 2006) plus CR items, and 3) Multiple True or False (MTF) plus CR items. Each item format is explained below.

The CR items require examinees to create their own responses rather than choosing a response from a set of options. The two most common types of CR items are short-answer items and essay items. The CR items developed by the research group are short-answer items, each of which requires about five minutes for students to answer. Each CR question is scored according to a scoring rubric that gives varying degrees of credit according to the learning progression achievement levels.

The OMC + CR items are two-tier items with two parts. The first part is an OMC question that requires students to choose a response from a list of options. A unique feature of OMC items, in comparison to traditional multiple-choice (MC) items, is that each option is linked to a particular developmental level of students' understanding of the target concept. Students get partial credit if they select a response that represents lower-level understanding. The second part is a CR question that asks students to explain the choice they made in the OMC part.

The MTF + CR items are also two-tier items with two parts. In the MTF part, a set of true or false questions asks students to judge, from a list of options, what the matter or energy source(s) are for events such as tree growth or human growth. Based on their responses, we pinpoint their achievement levels. The CR part then asks students to explain the choices they made in the MTF part.

Each item format has its advantages and disadvantages. OMC and MTF are efficient item formats, requiring relatively shorter administration time and less scoring effort than CR items. However, guessing can be involved when students answer items in these formats, especially for students at low ability levels, who are more likely to guess; so these items may not measure low-level students precisely. CR items require longer administration time and more scoring effort, but they are more appropriate for measuring students' higher-order thinking. How to design a test composed of items in different formats that utilizes the advantages of each format while measuring the target construct is therefore the first research question.

The second research question investigates whether students' understandings of different processes/practices are associated with a single latent construct or several constructs. The extent to which students' understandings of these processes/practices are related to each other will be investigated. The results have implications for the test design and item selection process.
For example, if students' understandings of different processes/practices are very distinct from each other, then only items on a particular process/practice should be selected to assess students' understanding of that process/practice. The research reported here investigated whether the processes/practices define a unique ability scale in the item response dimensional space and whether the different processes/practices are highly intercorrelated.

The third research question is what characteristics of test items support their use for assessing students' levels in a learning progression and how these characteristics can be used as criteria for designing test items. Many guidelines and tips on how to write good items can be found in the literature (American Educational Research Association, American Psychological Association, & National Council on Measurement in Education, 1999; Thorndike, 2005). The purpose here is to relate item characteristics to the information an item provides about students' learning progression levels. Learning progression-based items not only assess whether or not students have mastered the concept, but also assess the trajectories of students' learning of the measured concept and where they are along the trajectory. Results from the item characteristics analysis can tell us what kind of OMC/MTF options accurately differentiate students among levels and what kind of CR item stem elicits detailed responses from students at different ability levels. These results can provide guidelines and clear targets, in addition to the test specifications, for item writers to produce good learning progression-based items.

Chapter 2 Literature Review

The research questions proposed in the first chapter are centered on how to design learning progression-based science assessments. Before investigating this, we need some fundamental understanding of the assessment and the learning progression. The first section (2.1) of this chapter starts by reviewing assessments from a theoretical perspective and illustrates the assessment triangle that underlies every assessment. Each vertex of the assessment triangle is then explained in the following three sections. Section 2.2 reviews a model of cognition and learning; it introduces the learning progression idea and the carbon cycle learning progression framework on which the assessment is based. Section 2.3 summarizes the literature on the advantages and disadvantages of items in different formats to provide guidelines for evaluating the effectiveness of items in multiple formats. Section 2.4 reviews the psychometric models that are used for the data analysis.

2.1 Assessment triangle

Assessments are designed based on a coherent argument to suit the assessment's purpose (Mislevy, Steinberg & Almond, 2003). Assessment is a process of reasoning that starts from observations of students' performances on the assessment tasks and leads to inferences about their knowledge or skills with respect to the measured concept. Mislevy and his colleagues pointed out that good assessment tasks cannot be developed in isolation. Development must start from the intended inferences, then move to the observations and performances that are needed to support those inferences, then to the assessment tasks that will elicit these performances, and finally to the reasoning that connects each component into a coherent argument for assessment design (Almond, Steinberg & Mislevy, 2002; Mislevy, Steinberg & Almond, 2003).
The National Research Council (NRC) portrayed assessment as a triangle with three corners—cognition, observation, and interpretation (Pellegrino, Chudowsky, & Glaser, 2001). This triangle underlies all assessment, and it presents assessment as an integration of the three fundamental elements (see Figure 2.1 below). Pellegrino et al. claimed in the NRC report: "These three elements—cognition, observation, and interpretation—must be explicitly connected and designed as a coordinated whole. If not, the meaningfulness of inferences drawn from the assessment will be compromised" (p. 2).

Figure 2.1 The assessment triangle

The Cognition vertex plays the central role in assessment design. It refers to the theories and beliefs about what people know, how they know it, and what knowledge and skills are important to measure. In other words, it concerns not only which knowledge and skills are to be assessed but also how the knowledge and skills develop. The design of an assessment needs to be consistent with the best available understanding of how students learn. In measurement terminology, the targeted knowledge and skill to be assessed is referred to as the "construct". An assessment should start from an explicit and clearly conceptualized cognitive model of learning with a well-defined construct that is considered most important to assess.

The Observation vertex refers to the set of assessment tasks that are used to elicit responses from examinees. These tasks are based on theories and beliefs about the kinds of tasks that will prompt students to provide valid and rich responses. The assessment tasks developed for observation need to serve the purpose of the assessment. They must be carefully designed to elicit the knowledge and cognitive processes that the model of learning suggests are most important for competence in the domain. Meanwhile, observations need to support the inferences and decisions that will be made based on the assessment results.

The Interpretation vertex includes all the methods and tools that are used to reason from observations to inferences about students' learning. Interpretation methods or tools, such as statistical models or qualitative models, are used to characterize and summarize the patterns in the data collected through assessment tasks. The interpretation model needs to fit the model of cognition and learning in order to characterize the knowledge and skills that cognitive theories suggest are important to pursue. Meanwhile, the interpretation model depends on the type of data collected through observation. Through interpretation, observations of students' performances are synthesized into inferences about the knowledge, skills and other attributes being assessed.

The three vertices of the assessment triangle need to be connected with each other to lead to an effective assessment and sound inferences (Pellegrino, 2009). Assessment developers should not only focus on the observation corner, but also pay explicit attention to all three elements of the assessment triangle (cognition, observation, and interpretation) and their coordination. They need to use the assessment triangle as a foundation to develop systematic approaches to designing assessments. These systematic approaches will differ from the common approaches that merely focus on the development of "good items" in isolation from all other important facets of design (NRC, 2006).
Similar to the assessment triangle, the Berkeley Evaluation and Assessment Research (BEAR) center developed four building blocks for constructing quality assessments that map onto the NRC assessment triangle: construct maps, items design, the outcome space, and the measurement model (Kennedy, 2005; Wilson, 2005). The difference between BEAR's building blocks and the assessment triangle is that the building blocks emphasize that assessments need to be based on a developmental perspective on student learning. They also emphasize the alignment between what is taught and what is assessed in constructing assessments. The BEAR center supported the development of the carbon cycle assessment following these building blocks. The design of a high-quality assessment is a complex process. The analyses proposed in this study investigate the connections between the cognition, observation and interpretation vertices.

2.2 Cognition—Learning progression of carbon cycle

2.2.1. Learning progression and progress variables

The NRC emphasizes the central role of a model of cognition and learning in assessment design (NRC, 2001). Good science assessments need to be based on a modern understanding of students' science learning. Contemporary theories of learning emphasize that learning is a process of constructing understanding that involves ongoing revision and reorganization of current thinking as new knowledge is acquired (NRC, 2007). This suggests one should take a developmental approach to science assessment design.

A developmental approach to assessment is the process of monitoring students' progress in a domain of learning over time. This helps to find the best ways to facilitate their further learning. A developmental approach involves knowing what students know now, and what they need to know in order to progress. The developmental approach can be applied to develop large-scale assessments at national and state levels, as well as classroom assessments.
This approach uses a learning progression or some other continuum to design assessments that monitor students' progress over time. Learning progressions are "descriptions of the successively more sophisticated ways of thinking about a topic that can follow one another as children learn about and investigate a topic over a broad span of time (e.g., six to eight years)" (Duschl, Schweingruber, & Shouse, 2007, chapter 8). Learning progressions are anchored on one end by what we know about the reasoning of students about specific concepts entering school (i.e., lower anchors). On the other end, learning progressions are anchored by societal expectations (e.g., science standards) about what we want high school students to understand about science when they graduate (i.e., upper anchors). Learning progressions describe the intermediate understandings between these anchor points (Mohan, Chen & Anderson, 2009).

Progress variables are used to track students' increasingly sophisticated understanding of a given concept. Progress variables mediate between big ideas and the specific concepts and skills being learned during instruction (Wilson, 2005). Progress variables are aspects of knowledge and practice that are present at the achievement levels of the learning progression. The development of progress variables can be traced across levels. The difference between learning progressions and progress variables is summarized in Merritt and Krajcik's paper (2009):

"Learning progressions are a means for determining how to support student learning of the big ideas of science. They are big picture, research-based and provide opportunities to think about how to engage students in long-term learning of both content and practice skills. Progress variables serve as a means for tracking students' progress during instruction. Just like learning progressions, they are also research-based. Moreover, the development of these progress variables is important because they can form the basis for tracking students' understanding of the particle nature of matter." (p. 2)

The National Research Council recommended using learning progressions to inform science assessment (NRC, 2006, 2007). However, learning progression-grounded items pose development challenges. It is difficult to write items that provide opportunities for students to respond at multiple levels of a learning progression (Anderson, Alonzo, Smith, & Wilson, 2007). CR items must be carefully designed to elicit complete responses while not telling students what should be included. Multiple-choice items must be written in such a way that the highest level is not indicated by the use of "science-y" terminology not present in lower-level options (Alonzo & Steedle, 2008). So this study aims to investigate the development of learning progression-grounded items.

2.2.2. Carbon cycle learning progression

The carbon cycle assessment is developed based on the achievement levels the Environmental Literacy project identified in previous studies (Mohan, Chen & Anderson, 2009; Jin & Anderson, 2010). Over the past several years, the Environmental Literacy project research team has taken an iterative approach that moves back and forth between the development of the learning progression framework and the development of assessments. The learning progression framework was refined based on the empirical data gathered from assessments. Then the assessments were modified according to the refined framework. The current assessments were developed based on four achievement levels:

Level 1: Simple force-dynamic accounts.

Level 2: Elaborated force-dynamic accounts (e.g., different functions for different organs).

Level 3: Attempts to trace matter and energy, but with errors (e.g., matter-energy confusion, failure to fully account for mass of gases).
Level 4: Correct qualitative tracing of matter and energy through processes at multiple scales (e.g., macroscopic scale, microscopic scale and large scale).

In the following paragraphs, I will provide more detailed descriptions of the learning progression levels and provide example responses from one item (item label ENERPEOP) to illustrate each level. This item asks students "what are the energy sources for people to live and grow?" Students are required to choose Yes or No for a list of five things (water, food, nutrients, sunlight, oxygen) and then explain their answers.

At the lowest level--Level 1--students describe the world in terms of objects and events rather than chemically-connected processes. Their understandings are confined to the macroscopic scale, without recognizing the underlying chemical changes or energy transformations of events. Students describe macroscopic processes in terms of the action-result chain. They think the actors use enablers to accomplish their goals and that the interactions between actors and enablers do not involve any change of matter/energy. A typical Level 1 response to the item is: "People need water when they were exercising, so we can feel energized. You need food so you won't feel weak during the day. You need nutrients to help keep your body healthy and strong. When you exercise, it also helps your body stay strong. You need sunlight to feel refreshed."

At Level 2, students continue to attribute events to the purposes and natural tendencies of actors, but they also recognize that macroscopic changes result from "internal" or "barely visible" parts and mechanisms that involve changes of materials and energy in general. A typical Level 2 response to the item is: "we drink water and eat food. Food has nutrients and vitamins that are converted into energy and pumped to your muscles. We do exercise to burn fat. Sunlight does not give us energy, plant use it but not us." The student begins to pay attention to components of food such as "nutrients" and "vitamins".

At Level 3, students can reason about macroscopic or large-scale phenomena, but because of limited understanding at the atomic-molecular scale, they cannot trace matter and energy separately and consistently through those phenomena. Level 3 students link macroscopic changes to chemical changes and describe chemical changes as changes involving atoms, organic molecules, and energy forms, but they do not successfully conserve matter and energy; for example, they might think organic molecules convert into energy. A Level 3 response to the item is: "All living organisms need water; food contains glucose needed to make ATP for energy. Nutrients are a food source. Exercise controls the size of lipids. Sunlight is the energy source of all life."

Only at the highest level--Level 4--can students use atomic-molecular models to trace matter/energy systematically through multiple processes connecting multiple scales. They use constrained principles (conservation of atoms and mass, energy conservation and degradation) and codified representations (e.g., chemical equations, flow diagrams) to explain chemical changes. A Level 4 response is: "Humans break down carbohydrates as well as fats. The body gets these from food. H2O, nutrients, exercise, CO2, and O2 are all necessary for life but are not energy sources. Humans don't undergo photosynthesis so sunlight isn't an energy source". See Table A.1 in the Appendix for detailed descriptions of these four achievement levels.

This learning trajectory is a typical path followed by American students (Anderson, 2010). It gives clues about the types of assessment tasks that will elicit evidence to support inferences about student achievement at different points along the progression. By considering the ways in which students learn science, science assessments and tasks can be created to gather information on how well and to what degree students are progressing over time toward more expert understanding.

2.2.3. Carbon cycle progress variables: process and practice

There are two identified progress variables that are present at all the achievement levels of the carbon cycle learning progression: process and practice. The Environmental Literacy project identified the process and practice progress variables because these mediate between big ideas (carbon cycle) and the specific concepts and skills being learned in classrooms.
They are used to track students' increasingly sophisticated understanding of the carbon cycle.

Understanding carbon cycling in socio-ecological systems is challenging for most students. Many studies have shown that students do not fully understand carbon cycling in socio-ecological systems. Kempton, Boster, and Hartley (1995) found that many students confused global warming with ozone depletion. Research found that it took time for students to understand the mechanism of global warming over the course of secondary education (Boyes & Stanisstreet, 1993). Other studies (e.g., Anderson, Sheldon, & Dubay, 1990; Songer & Mintzes, 1994; Fisher, et al., 1984) documented a wide range of students' difficulties in understanding carbon-transforming processes such as photosynthesis and cellular respiration.

To explain the carbon cycle in complex coupled human and natural systems, students need to see the key processes that tie systems together and perform certain practices (e.g., tracing energy, tracing matter) to explain those processes. Thus, the Environmental Literacy project identified the key carbon-transforming processes and practices that can be used as conceptual tools for students to reason about carbon cycling in complex systems. Six key carbon-transforming processes and five scientific practices were identified. The six carbon-transforming processes are described below:

• Photosynthesis — the chemical process by which plants convert carbon dioxide into organic compounds using energy from sunlight. The items about plant growth assess students' understanding of photosynthesis.

• Biosynthesis/digestion — organic carbon is transformed during these processes. Biosynthesis is a cellular process by which substrates are converted to more complex products under the catalysis of enzymes. Digestion is the mechanical and chemical breakdown of food into smaller components that are more easily absorbed into the blood stream. The items about animal growth assess students' understanding of biosynthesis and/or digestion.

• Cellular respiration — organic carbon is oxidized during cellular respiration (including decomposition) and combustion. Cellular respiration is the set of metabolic reactions and processes by which the chemical energy of organic molecules (e.g., glucose, carbohydrates, fats, proteins) is released and partially captured in the form of ATP. The animal function items address students' understanding of cellular respiration.

• Decomposition — the process by which organic material is broken down into simpler forms of matter. Decomposition is one type of cellular respiration. Items such as APPLEROT and TREEDECAY (these are item labels; APPLEROT stands for "apple rot" and TREEDECAY stands for "tree decay") measure students' understanding of decomposition.

• Combustion — the burning of a fuel with an oxidant to produce energy and release carbon dioxide. The items about burning fossil fuels, burning candles or matches assess students' understanding of combustion.

• Cross-process events — a set of related carbon-transforming processes. Students' understanding of cross-process events is measured by items that require students to connect their understanding of different carbon-transforming processes to reason about a phenomenon such as global warming.

The five practices identified are specified below:

• Macroscopic practice — students' general account of material kinds and what is happening to materials and forms of energy (or actors, actions, and enablers) at the macroscopic scale.
Items that assess macroscopic practice pose questions such as "Where does the object come from?", "Where does it go?" and "How does it change?".
• Mass/Gases/Amount practice — quantitatively accounting for changes in the mass (or size/amount) of materials. Items that assess this practice pose questions such as "Does air have weight/mass?" and "Does the stuff contribute to weight gain/loss?".
• Energy/Causes practice — accounting specifically for energy or closely related terms (power, light, heat). Items of this practice ask "What are the things that cause changes?", "Where does the energy come from?" and "Where does the energy go?".
• Microscopic practice — using structures and functions of subsystems to account for macroscopic observations. Questions such as the following address this practice: "What are the smaller/invisible parts?", "Are there invisible changes behind the macroscopic phenomena?" and "How are they related to macroscopic phenomena?".
• Large-scale practice — association and tracing among macroscopic processes (using the structure and function of large-scale systems). This practice focuses on questions such as "How are changes/events similar or different?" and "How are changes/events connected?".
Though students have learned some fundamental principles for tracing matter or tracing energy in their science classes, they seldom apply them to environmental issues. Numerous studies have found that students intuitively focus on visible aspects of systems and do not use atomic-molecular accounts to explain macroscopic or large-scale events (Hmelo-Silver, Marathe, & Liu, 2007; Lin & Hu, 2003). A study conducted in the Environmental Literacy project indicated that pre-service science teachers did not trace matter and energy separately to explain chemical changes. Instead, they thought fat was "burned up" or "used for energy" when people lost weight (Wilson, Anderson, Heidemann, Merrill, Merritt, Richmond, Sibley & Parker, 2006). The identified key practices can help students to reason about the carbon-transforming processes. The assessment items are developed to assess these practices and processes. Each item assesses students' understanding of a single process or their understanding of cross-process events (e.g., global warming). Each item also addresses one scientific practice, such as tracing energy. The assessment consists of items focusing on six processes (plant growth, animal growth, animal functioning, decomposition, combustion, cross-process) and five practices (macro, mass, energy, micro, large-scale). One goal of the assessment is to measure students' ability on each practice/process that the assessment is designed to measure. So the "cognition" vertex of the assessment is the set of achievement levels with respect to the processes/practices, and the "observation" tasks need to be designed to measure these processes/practices precisely.
2.3 Observation—Items in different formats
When designing a test, the selected item format(s) should be useful for eliciting evidence of students' understanding of the measured construct. The three item formats used in this assessment (two-tier items consisting of OMC plus CR parts, two-tier items consisting of MTF plus CR parts, and CR-only items) are used to tap into students' learning progression of carbon cycling. The groups of items in these formats are assembled so that the scores they give can shed light on the full range of the science content knowledge, understandings, and skills included in the construct as elaborated by the related learning performances.
Knowing the advantages and disadvantages of each item format will help us select the appropriate format(s) to achieve the goal of our assessment. It is widely recognized that the MC item format is an effective means for determining how well students have acquired basic content knowledge. However, the limitations of MC items are also well recognized, such as the guessing effect and the failure to show students' original thoughts. Researchers have pointed out that MC items might not be able to measure higher-order thinking (Delandshere & Petrosky, 1998; Kennedy, 1999; Lane, 2004) and might encourage teachers to drill students on isolated facts and formulas (Frederiksen, 1984; Shepard, 2000). However, some well-designed MC items can be used to measure complex cognitive processes. For example, the Force Concept Inventory (Hestenes, Wells & Swackhamer, 1992) was an assessment that used MC items but tapped higher-level cognitive processes.
MTF items are similar to MC items. The difference is that, rather than selecting one best answer from several alternatives, students respond to each of several alternatives as a separate True or False question. This item format is especially good for assessing students' commitments to fundamental principles. Since MTF items allow students to select multiple answers, to answer the item correctly students need not only to identify all the correct answer(s) but also to exclude all the incorrect answer(s). This requires students to have a deep understanding of the principles being assessed and to apply those principles consistently. For example, students need to identify sunlight as the energy source for tree growth, and recognize that though trees need nutrients, water and air to grow, these are not energy sources for trees. Some articles have addressed the advantages and disadvantages of the MTF format. Frisbie (1992) gave a comprehensive review of the literature and synthesized the following merits of MTF items: "(a) They are a highly efficient format for gathering achievement data, (b) they tend to yield more reliable scores than MC and other objective formats, (c) they measure the same skills and abilities as content-parallel MC, (d) they are a bit harder than MC for examinees, and (e) they are perceived by examinees as harder but more efficient than MC" (p. 25). There are also some shortcomings of the MTF format. Answering MTF items usually involves a great deal of guessing, especially for examinees with the least knowledge, who guess the most, so MTF items may not be reliable in the low ability range. Grosse and Wright (1985) found that examinees' response style (guessing "T" more often or guessing "F" more often) determined whether the true score or the false score was more reliable. Dunham (2007) found that students' responses to MTF items were influenced by an "optimal number correct" response set. For example, examinees tended to endorse three or four of the six MTF options more frequently than would be expected by chance alone. These results suggest that MTF items can be used as an alternative to MC items, but that when designing and analyzing MTF items, attention needs to be paid to the reliability of the items, the guessing involved in the responses, and the response style factor.
The major advantage of CR items is that they are more appropriate for measuring students' abilities to organize, integrate and synthesize their knowledge and their abilities to solve novel problems.
CR items can be used to demonstrate students' original thoughts, and they allow students to show the process of their reasoning. Hence, CR items can serve as a useful assessment tool for teachers (McNeill & Krajcik, 2007; Champagne, Kouba & Gentiluomo, 2008). CR items also have disadvantages, such as the difficulty of administering and scoring them, inconsistencies among raters, and not always showing students' thinking.
Some previous studies have compared the use of OMC items with the use of other types of items when assessing the same concepts. Briggs, Alonzo, Schwab, and Wilson (2006) used OMC items to assess students' levels on a learning progression of the earth and solar system. The results indicated that test scores based on OMC items compared favorably with scores based on traditional MC items in terms of their reliability. There was a weak to moderate positive correlation between students' scores on OMC items and their scores on comparable tests consisting of traditional MC items. Alonzo and Steedle found that, compared to CR items, OMC items "appear to provide more precise diagnoses of students' learning progression levels and to be more valid, eliciting students' conceptions more similarly to cognitive interviews compared to open-ended items" (Alonzo & Steedle, 2008, p. 1). Other researchers found inconsistencies in students' responses to items in different formats addressing the same underlying principles. Steedle (2006) found that students performed differently on MC and short-answer items targeting the understanding of the same underlying progression. Lee, Liu & Linn (2011) found that, compared to MC items, CR items discriminated between students with high and low knowledge integration ability much more effectively, measured a wider range of knowledge integration levels, and were more sensitive to knowledge integration instruction. Many studies that compared MC and CR items across a range of outcomes suggested that these items might measure different aspects of the construct, especially at the extremes of the distribution (Lee, Liu & Linn, 2011). Ercikan, Schwarz, Julian, Burket, Weber, and Link (1998) used an IRT model to calibrate both item types on a single scale and discovered that, when combined to produce a single scale, the overall measurement accuracy improved because the CR items could tap very-low and very-high ability groups. Wilson and Wang (1995) reported that "performance-based items provided more information than multiple-choice items and also provided greater precision for higher levels of the latent variable" (p. 51). The different results from these studies point to the need to better understand the affordances of different item types for assessing students' learning progression levels. Meanwhile, the analysis of the two-tier items consisting of OMC and CR parts or of MTF and CR parts helps to explore new forms of assessment to be used in classroom and large-scale contexts. This study examines how effective items in these formats are at differentiating students among levels. The study then investigates ways to make good use of items in these formats to form a test that can both effectively and accurately diagnose students' learning progression levels.
2.4 Interpretation—The psychometric models
In large-scale assessment programs, the measurement model used for data analysis is based on either classical test theory (CTT) (Novick, 1966; Lord & Novick, 1968; Allen & Yen, 2002) or item response theory (IRT) (a brief introduction to CTT and IRT can be found in Chapter 3). All models are incorrect to some extent; they are oversimplifications of reality. However, a model does not have to be absolutely correct to be useful. The decision about which measurement model should be used is generally based on the inferences one wants to support with the test results. The idea of learning progressions is that students develop successively more sophisticated ways of thinking about a topic over time, so their abilities are assumed to lie on a latent continuum, a scale along which individuals can be ordered. Applying IRT models to estimate students' proficiency along the latent continuum is therefore appropriate.
A variety of measurement models are available, depending on the type of item and the assumptions that are made. The measurement model used to interpret the data can be evaluated by two criteria. First, it should be tightly connected with the "cognition" part to formalize the relationships posited in the model of cognition and learning. A measurement model grounded in substantive theory is more likely to lead to solid inferences (Rao & Sinharay, 2007). Second, the measurement model needs to fit the data adequately. The fit of a measurement model is evaluated in terms of the extent to which observed data deviate from the predictions of the model. In this study, models will be selected based on their fit with both the substantive theory and the empirical data. The measurement models used in this study are unidimensional and multidimensional partial credit models. These models are reviewed in Chapter 3 and compared in terms of their assumptions and properties. Classical test statistics and IRT models are used to analyze the quality of the assessment. The dimension analysis is used to find out which latent constructs the assessment assesses. After knowing more about the latent constructs and how precisely different composites of ability are assessed, this study explores ways to design a test that can give good information about students' progress through the learning progression of the construct being measured. Thus this study gives specific consideration to all three components of the assessment triangle and integrates the three components as a whole. For the carbon cycle assessment, the cognition to be measured is students' levels of performance on the progress variables: process and practice. The observations are the items in multiple formats. These need to be selected based on what would constitute evidence of student competencies and what the effective ways to collect that evidence are. The interpretation should fit with the data and be supported by the theoretical underpinning.
Chapter 3 Psychometric Theories and Related Terminologies
The analyses in this study are mainly based on IRT. Classical test statistics are also used to evaluate the assessment. This chapter gives a brief introduction to the general frameworks of IRT and classical test theory. The relevant IRT models are reviewed and relevant terminologies are explained. This provides background knowledge for readers who are not familiar with measurement theories and terminologies to make sense of the data analyses.
In addition, this chapter includes a discussion of the relationship between test information and hypothesis testing and the rationale for setting the amount of desired information at 5 as a rule of thumb.
3.1 Classical test theory
Classical test theory (CTT) (Novick, 1966; Lord & Novick, 1968; Allen & Yen, 2002) can be regarded as roughly synonymous with true score theory. CTT is based on the understanding that a given test score can be thought of as consisting of two parts. One is the error of measurement and the other is the actual individual score on the studied attribute, which is of interest. This latter part is called the true score. Up to the 1980s, interpreting test scores was largely based on CTT. In CTT, an observed score (X) is equal to the true score (T) plus error (E):
X = T + E    (3.1)
CTT assumes that each person has a true score, T, which is not directly observable. The true score would equal the observed score if there were no errors in measurement. A person's true score is defined as the expected number-correct score over an infinite number of independent administrations of the test under exactly the same conditions. The observed score, X, is equal to the true score plus measurement error, which by the definition of CTT is not correlated with the true score. The most important concept in CTT is reliability. It describes the relations among the observed score, the true score and the measurement error. Reliability is defined as the proportion of the variance of the observed score that is attributable to the true score rather than to the error:
ρ = σ_T² / σ_X² = 1 − σ_E² / σ_X²    (3.2)
That is, reliability (ρ) is the complement to 1 of the ratio of error variance to observed-score variance.
The item discrimination index based on CTT is analyzed in this study. It is the correlation between students' scores on an item and their total scores. It represents the discrimination ability of an item and is an indicator of item quality. The higher the average item discrimination, the higher the reliability of the test (Ebel, 1979). Ebel proposed ranges for evaluating discrimination indices for items on classroom tests. The ranges are summarized below:
• 0.40 and above: very good
• 0.30 ~ 0.39: reasonably good, consider improving
• 0.20 ~ 0.29: marginal, needs improvement
• 0.19 and below: poor, reject or revise
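As a concrete illustration of the discrimination index and Ebel's ranges above, the short sketch below computes item-total correlations for a made-up score matrix. It is a Python sketch for illustration only; the study's actual summary statistics were computed with R and Excel, and all data shown here are hypothetical.

```python
import numpy as np

# Illustrative sketch: compute the CTT item discrimination index as the
# item-total correlation and label it with Ebel's (1979) ranges quoted above.

def discrimination_index(item_scores, total_scores):
    """Pearson correlation between scores on one item and total test scores."""
    return float(np.corrcoef(item_scores, total_scores)[0, 1])

def ebel_label(d):
    if d >= 0.40:
        return "very good"
    if d >= 0.30:
        return "reasonably good, consider improving"
    if d >= 0.20:
        return "marginal, needs improvement"
    return "poor, reject or revise"

# Hypothetical score matrix: rows are students, columns are polytomous items (levels 1-4).
rng = np.random.default_rng(0)
scores = rng.integers(1, 5, size=(200, 10))
totals = scores.sum(axis=1)
for i in range(scores.shape[1]):
    d = discrimination_index(scores[:, i], totals)
    print(f"item {i + 1}: discrimination = {d:.2f} ({ebel_label(d)})")
```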
3.2 Item response theory
Compared with CTT, IRT (Lord, 1980; Rasch, 1960; Rasch, 1980; Lazarsfeld & Henry, 1968) is considered a modern psychometric theory. The idea of IRT is to model the relationship between a person's proficiency level on the latent continuum and the probability of a correct response to an item. The assumptions of IRT are stronger than those of CTT: IRT assumes that the probability of the observed responses is determined by the examinee's ability and the parameters that characterize the items. Though there is an infinite number of possible IRT models for estimating item parameters and person proficiencies, only a few of them are widely applied.
3.2.1 Unidimensional IRT
Unidimensional IRT (UIRT) assumes that a single underlying trait, or a common composite of traits, explains persons' performance on the test items. The simplest commonly used UIRT model is the one-parameter logistic model. It is used for dichotomously scored items. It has one parameter describing the ability of the person and one parameter describing the difficulty of the item. The equation for the model is given by
P(X_ni = 1) = exp(θ_n − δ_i) / [1 + exp(θ_n − δ_i)]    (3.3)
where X_ni is the response of the nth student to the ith item, θ_n is the latent trait (the ability parameter) of the nth student, and δ_i is the item difficulty of the ith item. Note that the only observable quantity is X_ni; the student ability parameter and the item difficulty parameter are estimated by maximizing the total likelihood (see Section 3.2.3 for more details). That is, as long as we specify the model and collect the data (X_ni), the rest is a purely mathematical process of estimating the person ability parameters and the item difficulty parameters.
The one-parameter logistic model is for items that are dichotomously scored. For items that are polytomously scored, there are other UIRT models. One of the most commonly used models for polytomous items is the partial credit model (PCM) (Masters, 1982). The PCM was designed for test items with two or more ordered categories and is appropriate for test items in which the scores represent levels of performance, with each higher score meaning that the examinee accomplishes more of the desired task. The boundaries between adjacent scores are labeled "thresholds". Threshold parameters are the locations on the ability scale at which students with those abilities have the same probability of receiving the two adjacent score points. For example, the assessment items in this study are all polytomous items and usually have four score points: 1, 2, 3, and 4. There are then three thresholds: the threshold between Levels 1 and 2 (d1), between Levels 2 and 3 (d2), and between Levels 3 and 4 (d3). If a student's ability equals d1, he/she has a 50% chance of receiving a Level 1 score and a 50% chance of receiving a Level 2 score. As his/her ability increases, the probability of receiving a Level 2 becomes higher than the probability of receiving a Level 1. The OMC and MTF questions and the CR items have two or more score categories, and a higher score requires accomplishing more of the desired task. Therefore, it is appropriate to use the PCM to model our data. The probability of student n being graded into level x on item i is given by
P(X_ni = x) = exp[Σ_{j=0}^{x} (θ_n − δ_ij)] / Σ_{k=0}^{m_i} exp[Σ_{j=0}^{k} (θ_n − δ_ij)]    (3.4)
where X_ni is the observed score of person n on item i, m_i is the maximum score on item i, θ_n is the proficiency of person n, and δ_ij is the threshold parameter for score category j of item i (the j = 0 term of each sum is defined to be zero). The unidimensional PCM can be applied to estimate person proficiencies and item parameters.
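To make the threshold interpretation concrete, the sketch below evaluates the PCM category probabilities of Equation 3.4 for a four-level item with hypothetical thresholds; at an ability equal to d1, the probabilities of Level 1 and Level 2 come out equal. This is an illustrative Python sketch, not output from the calibration software used in the study.

```python
import numpy as np

# Minimal sketch of the partial credit model in Equation 3.4 with hypothetical
# thresholds d1, d2, d3 for a 4-level item.

def pcm_probs(theta, thresholds):
    """Return P(score = Level 1..4) for a PCM item with the given thresholds."""
    # Cumulative sums of (theta - d_j); the "j = 0" term is defined as zero.
    cum = np.concatenate(([0.0], np.cumsum(theta - np.asarray(thresholds))))
    num = np.exp(cum)
    return num / num.sum()

d1, d2, d3 = -1.0, 0.2, 1.5           # hypothetical threshold parameters
print(pcm_probs(-1.0, [d1, d2, d3]))  # theta = d1: Level 1 and Level 2 probabilities are equal
print(pcm_probs(2.0, [d1, d2, d3]))   # a high-ability student: Level 4 is most likely
```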
3.2.2 Multidimensional IRT (MIRT)
The actual interactions between persons and test items are often more complicated than what UIRT implies. MIRT is an extension of the UIRT model used to describe situations in which multiple skills and abilities are needed to respond to the test items. MIRT describes students' abilities in a multidimensional space, with each construct as a line in the space. MIRT identifies a mathematical model that can represent the connection between the probability of a response to an item and the location of a person in a multidimensional space (Reckase, 2009). The potential use of MIRT for educational assessment has been recognized for more than twenty years (e.g., Embretson, 1984; Reckase, 1990). It is a useful methodology for assessing competencies in educational assessment and provides a more accurate representation of the complexity of tests. MIRT models provide tools for gaining more detailed information than that gained from more traditional, classical measurement models. Applying MIRT models can help us understand the latent traits that an item measures; for example, fitting MIRT models to the data can tell how many latent traits influence performance on an item. Examinees' proficiency levels on each latent trait are estimated from the MIRT models, and the estimated measurement error tells how precisely different composites of ability are being measured (Ackerman, 1994a, 1994b, 1996; Muraki & Carlson, 1995; Reckase, 1985, 1997; Reckase & McKinley, 1991; Yao & Schwarz, 2006). MIRT models are useful for understanding both the items and the students' abilities in complex domains. So MIRT analysis can provide more detailed information about the items and the test, which can inform instrument construction.
It is appropriate to apply MIRT models to analyze the assessment data, since the assessment has items in multiple formats and the items assess students' understanding of different processes. Researchers have found that several assessments containing a mixture of MC and CR items measure more than one trait (Yao & Boughton, 2009). In addition, the assessment measures students' understanding of different carbon-transforming processes and different scientific practices, which may be intrinsically multidimensional. Moreover, science assessments often require numerous kinds of knowledge and skill, for instance, knowledge of different subject matter areas and a variety of skills such as conceptual understanding and scientific investigation, so science assessments are likely to be multidimensional. When responding to a particular item, students often rely on more than a single ability. The multidimensionality that underlies science assessments has been recognized by researchers (Reckase & Martineau, 2004; Wei, 2008). Therefore, more complex multidimensional models can be applied to describe the data.
The multidimensional model used in this study is the multidimensional random coefficients multinomial logit model (MRCMLM), specified in Adams, Wilson, and Wang (1997). It assumes that a set of traits underlies the persons' responses. It is a general model that includes both dichotomously and polytomously scored test items. The expression for the full model is given by
P(X_ik = 1 | A, B, ξ, θ) = exp(b_ik·θ + a_ik·ξ) / Σ_{k=0}^{K_i} exp(b_ik·θ + a_ik·ξ)    (3.5)
where A is a design matrix with vector elements a_ik that select the appropriate item parameters for scoring the item; B is a scoring matrix with vector elements b_ik that indicate the dimension or dimensions required to obtain a score of k on the item; ξ is a vector of item difficulty parameters; θ is a vector of coordinates locating a person in the construct space; K_i is the highest score category of the item; and k represents the score category. X_ik is an indicator variable that indicates whether or not the observed response to item i is equal to k: if the score is k, the indicator variable is assigned a 1; otherwise, it is 0. Suppose an item has four response categories (0, 1, 2, 3) and three latent abilities are required to solve the item. The MRCMLM is then specified using the following design matrix and scoring matrix:
A =
[0 0 0]
[1 0 0]
[1 1 0]
[1 1 1]
B =
[0 0 0]
[1 0 0]
[1 1 0]
[1 1 1]
A is the design matrix. The elements of the A matrix select the appropriate item parameters to score the item. The rows of the matrix correspond to the scoring categories and the columns are associated with the item parameters.
The elements of the A matrix are specified by the test developer rather than obtained through statistical estimation procedures. B is the scoring matrix. The rows of the B matrix represent the scoring categories and the columns of the matrix correspond to the latent dimensions. The elements of the B matrix indicate the dimension or dimensions that are required to obtain the score of k on the item. For example, the abilities in all three dimensions are required to obtain score 3, so the fourth row of the B matrix is [1, 1, 1]. In this case, the MRCMLM is specified as a multidimensional PCM in which different achievement levels require different latent abilities. The multidimensional PCM is used in this study. The response categories are modeled as
P(X_i0 = 1; A, B, ξ | θ) = 1 / D
P(X_i1 = 1; A, B, ξ | θ) = exp(θ1 + ξ1) / D
P(X_i2 = 1; A, B, ξ | θ) = exp(θ1 + θ2 + ξ1 + ξ2) / D
P(X_i3 = 1; A, B, ξ | θ) = exp(θ1 + θ2 + θ3 + ξ1 + ξ2 + ξ3) / D
where D = 1 + exp(θ1 + ξ1) + exp(θ1 + θ2 + ξ1 + ξ2) + exp(θ1 + θ2 + θ3 + ξ1 + ξ2 + ξ3).
Adams et al. (1997) specified two subclasses of the MRCMLM. One subclass is for between-item multidimensional tests, tests that consist of several unidimensional subscales; each item of the test relates to only one latent dimension. The other subclass is used for within-item multidimensional tests, the case in which each item of the test relates to more than one latent dimension. In this study, both within-item and between-item multidimensional PCMs are applied. These two types of multidimensionality can be modeled by specifying appropriate design and scoring matrices in the MRCMLM. Figure 3.1 below shows the difference between within-item and between-item multidimensionality. For a between-item model, the B matrix has one nonzero element per item, which specifies the coordinate dimension that is the measurement target for the item, and the other elements are all zeros. For within-item dimensionality, the B matrix has more than one nonzero element per item, and these nonzero elements specify the coordinate dimensions that influence performance on the test item.
Figure 3.1 Structures of between-item and within-item multidimensionality. From Adams, Wilson & Wang (1997), p. 9.
The MRCMLM can be estimated by the marginal maximum likelihood method (MML; Bock & Aitkin, 1981; see more details about MML in Section 3.2.3). Bock and Aitkin's formulation of the EM algorithm (Dempster, Laird, & Rubin, 1977) can be used to estimate the structural item parameters. [Footnote 2: The expectation-maximization (EM) algorithm is an iterative technique for finding maximum likelihood estimators when there are unobserved latent parameters. Each iteration involves two major steps: first, use the current guess of the parameters to calculate the expectation of the log likelihood conditioned on the latent parameters; second, estimate the parameters by maximizing the expectation of the log likelihood obtained in the first step. The new parameters are then used as input to repeat the two steps until the likelihood no longer increases appreciably. More details can be found at: http://en.wikipedia.org/wiki/Expectation-maximization_algorithm] Estimation of the parameters in the MRCMLM is implemented in the ConQuest program (Wu, Adams, & Wilson, 1998). The use of the MRCMLM follows a confirmatory procedure that is used to check hypotheses about a dimensional structure when a test has been designed to measure specific constructs.
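The sketch below evaluates the four category probabilities written out above for a single within-item multidimensional PCM item, using the cumulative A and B matrices shown earlier. The person location θ and the difficulty parameters ξ are hypothetical values chosen only to illustrate the calculation; the study's actual estimates come from ConQuest.

```python
import numpy as np

# Sketch of the within-item multidimensional PCM category probabilities written out
# above (Equation 3.5 with the cumulative A and B matrices).

def mpcm_probs(theta, xi, A, B):
    """P(score = k) for one item under the MRCMLM with design matrix A and scoring matrix B."""
    logits = B @ theta + A @ xi      # one logit per score category (category 0 has logit 0)
    num = np.exp(logits)
    return num / num.sum()

# Cumulative design/scoring matrices for a 4-category item loading on 3 dimensions.
A = np.array([[0, 0, 0],
              [1, 0, 0],
              [1, 1, 0],
              [1, 1, 1]], dtype=float)
B = A.copy()

theta = np.array([0.5, -0.2, 1.0])   # hypothetical person location in 3 dimensions
xi = np.array([0.3, -0.4, -1.1])     # hypothetical item difficulty parameters
print(mpcm_probs(theta, xi, A, B))   # probabilities of scores 0, 1, 2, 3
```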
When estimating the item parameters of the model, the vector-valued person parameter θ is assumed to follow a multivariate normal distribution. ConQuest provides estimates of the person parameters, item parameters, means, variances, covariances and correlations of the latent dimensions, and the deviance of the model.
There are also exploratory approaches to examining the dimensionality of assessment data. In the exploratory approaches, the number of dimensions needed to accurately model the relationships in the item response matrix must be determined first. This is determined from the interaction between a particular sample of examinees and the particular sample of items (Reckase, 2009). Parallel analysis, scree plots and residual correlation matrices are often used to determine the number of coordinate axes. [Footnote 3: More details about the methods used to determine the number of coordinate axes, such as parallel analysis, scree plots and residual correlation matrices, can be found in Reckase, M. D. (2009). Multidimensional Item Response Theory. New York: Springer.] Software such as DIMTEST (Stout, Douglas, Junker & Roussos, 1999; Stout, Froelich, & Gao, 2001), Poly-DIMTEST and DETECT (Zhang & Stout, 1999) is commonly used to implement the procedures for determining the number of dimensions needed to model the item response matrix.
There are advantages and disadvantages to both the confirmatory and the exploratory approach. The advantage of the confirmatory approach is that the design and scoring matrices of the MRCMLM model the data explicitly according to the test developer's intended structure, so the results are easier to interpret and the fit statistics can be used as a diagnostic tool to confirm whether the theorized model is an acceptable description of the latent traits. But the confirmatory approach may ignore relationships that are not specified in advance and may result in poorer model fit compared to the exploratory approach. The exploratory approach can achieve better data-model fit by selecting the model that fits the data best. But since the model is not specified in advance for confirmation, the result is often difficult to interpret, and it sacrifices the use of fit statistics as a diagnostic tool to confirm the theorized model. In this study, since there are hypotheses and theories about what the assessment items assess, the confirmatory approach is applied.
Model fit indexes, including Akaike's (1973) information criterion (AIC; Bozdogan, 1987) and the Bayesian information criterion (BIC; Schwarz, 1978), can be used to compare the posited models. [Footnote 4: The Akaike information criterion (AIC) is a statistic that indicates the relative goodness of fit of nested models. It is defined as AIC = −2 log(Lmax) + 2k, where Lmax is the maximized likelihood and k is the number of model parameters. AIC is used to compare the relative goodness of fit of nested models and penalizes the number of parameters in the model. The model with the smaller AIC is the better model.] [Footnote 5: The Bayesian information criterion (BIC) is a statistic that indicates the relative goodness of fit of nested models. It is defined as BIC = −2 log(Lmax) + k·log(N), where k is the number of model parameters, N is the number of data points, and Lmax is the maximized likelihood. The model with the smaller BIC is the better model.] A chi-square test can also be used to determine model fit. In this study, the model fit indexes are estimated using ConQuest (Wu, Adams, & Wilson, 1998), and a chi-square test is used to determine the model fit. [Footnote 6: In the context of this dissertation, the chi-square goodness of fit is a way to estimate how well the model fits the data. In general, the residuals between the data and the model follow a normal distribution, so the sum of the squared residuals follows a chi-square distribution. A good fit requires the reduced chi-square (the sum of the squared residuals divided by the number of degrees of freedom) to be approximately 1. If the reduced chi-square is much greater than 1, the model underfits the data; if it is much less than 1, the model overfits the data. The difference in deviance between two nested models approximately follows a chi-square distribution with degrees of freedom equal to the number of additional parameters estimated in the more complex model (Haberman, 1977). A significant test result indicates that the full model fits the item response data significantly better than the reduced model. For more details, see: http://en.wikipedia.org/wiki/Goodness_of_fit]
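The following sketch simply applies the AIC and BIC formulas from the footnotes above to hypothetical deviance values for a one-dimensional and a three-dimensional model. The numbers are invented for illustration; the study's actual deviances and fit indexes come from ConQuest.

```python
import math

# Sketch of the AIC and BIC formulas quoted above, applied to hypothetical deviances.

def aic(deviance, n_params):
    # deviance = -2 * log(Lmax), so AIC = deviance + 2k
    return deviance + 2 * n_params

def bic(deviance, n_params, n_obs):
    # BIC = deviance + k * log(N)
    return deviance + n_params * math.log(n_obs)

dev_1d, k_1d = 52000.0, 120   # hypothetical deviance and parameter count, unidimensional model
dev_3d, k_3d = 51580.0, 125   # hypothetical values for the three-dimensional model
n = 1500                      # number of students

print("1D:", aic(dev_1d, k_1d), bic(dev_1d, k_1d, n))
print("3D:", aic(dev_3d, k_3d), bic(dev_3d, k_3d, n))
```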
3.2.3 Maximum likelihood estimation
As discussed in the previous section, the person ability and item parameters can be estimated by maximum likelihood estimation. Maximum likelihood estimation is a procedure for finding the value of one or more parameters that makes the observed data most probable. The maximum likelihood estimate for a parameter μ is denoted μ̂. Here I illustrate the maximum likelihood estimation method using the normal distribution as an example. Suppose there is a set of data, denoted {X_i | i = 1, 2, …, n}, where X_i is the value of the ith data point. If the data follow a normal distribution with mean μ and variance σ², then the likelihood function for each data point is given by:
f(X_i | μ, σ) = 1/(σ√(2π)) exp[−(X_i − μ)²/(2σ²)]    (3.6)
The joint likelihood for all data points is given by:
f(X_1, …, X_n | μ, σ) = Π_{i=1}^{n} 1/(σ√(2π)) exp[−(X_i − μ)²/(2σ²)]    (3.7)
= (2πσ²)^(−n/2) exp[−Σ_{i=1}^{n} (X_i − μ)²/(2σ²)]    (3.8)
The log likelihood function is
ln f = −(n/2) ln(2πσ²) − Σ_{i=1}^{n} (X_i − μ)²/(2σ²)    (3.9)
The estimators of μ and σ² can then be obtained by maximizing the likelihood function. Taking the partial derivative of ln f with respect to each of the parameters and setting it equal to zero yields:
Σ_{i=1}^{n} (X_i − μ)/σ² = 0    (3.10)
−n/(2σ²) + Σ_{i=1}^{n} (X_i − μ)²/(2σ⁴) = 0    (3.11)
Solving Equations 3.10 and 3.11 yields:
μ̂ = (1/n) Σ_{i=1}^{n} X_i,  σ̂² = (1/n) Σ_{i=1}^{n} (X_i − μ̂)²    (3.12)
These are the well-known estimators for the mean and variance, but here we obtained them through the maximum likelihood approach. The above example shows the essence of maximum likelihood estimation: as long as we can write down the likelihood function, the model parameters can be estimated by maximizing it. In the one-parameter logistic (1PL) model case, the corresponding likelihood is
L = Π_n Π_i P_ni^(X_ni) (1 − P_ni)^(1 − X_ni), with P_ni = exp(θ_n − δ_i) / [1 + exp(θ_n − δ_i)]    (3.13)
where X_ni is the response of the nth student to the ith item, θ_n is the latent trait (the ability parameter) of the nth student, and δ_i is the item difficulty of the ith item. The person abilities as well as the item difficulties can be estimated by maximizing this likelihood.
Marginal maximum likelihood estimation (MML) is a special case of maximum likelihood estimation. Suppose we have a likelihood function with four parameters, for example L(a, b, c, e), but we are only interested in two of them, say a and b. Then we can marginalize the likelihood by integrating out the parameters c and e, i.e., L(a, b) = ∫∫ L(a, b, c, e) dc de. Now, instead of maximizing the likelihood L(a, b, c, e), we obtain the marginal maximum likelihood estimators for a and b by maximizing the marginal likelihood L(a, b). We maximize a simpler function, but we need to integrate out the other parameters first.
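As a small illustration of Equation 3.13, the sketch below finds the maximum likelihood estimate of one student's ability under the 1PL model, treating a handful of hypothetical item difficulties as known. It uses a generic numerical optimizer rather than the estimation routines built into ConQuest.

```python
import numpy as np
from scipy.optimize import minimize_scalar

# Sketch of maximum likelihood estimation of a single student's ability under the
# 1PL model, with hypothetical known item difficulties and a hypothetical response pattern.

def neg_log_likelihood(theta, responses, difficulties):
    p = 1.0 / (1.0 + np.exp(-(theta - difficulties)))   # P(correct) for each item
    return -np.sum(responses * np.log(p) + (1 - responses) * np.log(1 - p))

difficulties = np.array([-1.5, -0.5, 0.0, 0.7, 1.8])     # hypothetical item difficulties
responses = np.array([1, 1, 1, 0, 0])                    # hypothetical response pattern

result = minimize_scalar(neg_log_likelihood, bounds=(-5, 5), method="bounded",
                         args=(responses, difficulties))
print("MLE of theta:", round(result.x, 3))
```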
3.2.4 Information function
In the previous sections, I introduced the general features of the IRT models and outlined how to estimate the ability parameter as well as the item difficulty/threshold parameters by maximum likelihood estimation. Since the central issue is to estimate the ability parameter for each student, a very important question is how precisely the ability parameter is estimated. To address this question, another important concept is needed: information. In general, the term information tells us how much we know about something. The more information we have about a given quantity, the less uncertain we are about it. Therefore, information should be proportional to the inverse of the uncertainty. Since the person ability parameter is estimated by maximum likelihood estimation, what we want is a measure of information that is proportional to the inverse of the uncertainty of this maximum likelihood estimate. In statistics, a mathematically consistent definition of information of this nature is the Fisher information, which equals the inverse of the variance of the maximum likelihood estimator. It is defined as:
I(θ) = −E[ ∂² ln L(X | θ) / ∂θ² ]    (3.14)
where E denotes the expectation. In practice, taking the expectation of a complex function can be difficult. Therefore, an observed Fisher information is defined as:
I(θ̂_max) = −∂² ln L(X | θ) / ∂θ²  evaluated at θ = θ̂_max    (3.15)
where θ̂_max is the maximum likelihood estimator of θ. This equation shows that as long as we write down the likelihood function and obtain the maximum likelihood estimators of the parameters, we can calculate the information about each parameter. It can be shown from statistical theory that the Fisher information defined above equals the inverse of the variance of the parameter estimate, i.e.,
I(θ) = 1 / var(θ̂ | θ)    (3.16)
where θ is the true ability and θ̂ is the maximum likelihood estimator of the ability. Therefore, as long as we know the information, we know the variance of the parameter estimated via maximum likelihood and therefore the precision of our estimate. In our application, we want to know not only the ability parameter but also its uncertainty. The information function tells how precisely one can estimate the ability parameter. The more precisely we can measure the ability parameter, the better we can tell the difference among persons. Therefore, we want the information of the test to be adequately large.
3.3 Information and hypothesis testing
3.3.1 Some basic notation
A test with high information is desirable since it can differentiate smaller ability differences. There are two cases in terms of ability differences: 1) the difference between two individual students, or 2) the difference between two groups of students. For an individual student, his/her true ability, denoted T, is not directly measurable. What we have is a measured ability θ and the associated measurement error on θ, denoted σ_e. The measured θ, given the true T and the measurement error σ_e, is sampled from a normal distribution whose density is shown in the following equation:
f(θ | T) = 1/(σ_e√(2π)) exp[−(θ − T)²/(2σ_e²)]    (3.17)
The measurement error on θ is related to the test information I at that θ by σ_e² = 1/I(θ). Clearly, the higher the information, the more precisely we can measure θ.
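To connect information with measurement precision numerically, the sketch below sums 1PL item information, p(1 − p), over a set of hypothetical items and converts the total to a standard error via σ_e = 1/√I (Equation 3.16). The difficulties are invented for illustration only.

```python
import numpy as np

# Sketch relating test information to the standard error of the ability estimate.
# For the 1PL model, each item's information at theta is p*(1-p); test information
# is the sum over items, and sigma_e = 1/sqrt(I).

def test_information(theta, difficulties):
    p = 1.0 / (1.0 + np.exp(-(theta - np.asarray(difficulties))))
    return float(np.sum(p * (1.0 - p)))

difficulties = np.linspace(-2.0, 2.0, 20)      # 20 hypothetical dichotomous items
for theta in (-2.0, 0.0, 2.0):
    info = test_information(theta, difficulties)
    se = 1.0 / np.sqrt(info)
    print(f"theta = {theta:+.1f}: information = {info:.2f}, SE = {se:.2f}")
```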
Now, let's consider a group of students, each with a true ability T_i. In the group, the true abilities of the students follow a normal distribution:
f(T_i) = 1/(σ_T√(2π)) exp[−(T_i − μ)²/(2σ_T²)]    (3.18)
In terms of the corresponding observed θ_i, we have
f(θ_i | μ) = 1/√(2π(σ_T² + σ_ei²)) exp[−(θ_i − μ)²/(2(σ_T² + σ_ei²))]    (3.19)
That is, the observed θ_i for the ith student follows a distribution with mean μ and variance σ_T² + 1/I, where σ_T is the true group standard deviation. The maximum likelihood estimator of the group mean μ is given by:
μ̂ = [Σ_i θ_i/(σ_T² + σ_ei²)] / [Σ_i 1/(σ_T² + σ_ei²)]    (3.20)
Here we need to note that σ_T is not directly measurable, but it can be estimated by maximizing the likelihood
L(μ, σ_T) = Π_i 1/√(2π(σ_T² + σ_ei²)) exp[−(θ_i − μ)²/(2(σ_T² + σ_ei²))]    (3.21)
In the case σ_ei << σ_T, the well-known estimators for μ and σ_T are recovered:
μ̂ = (1/n) Σ_i θ_i,  σ̂_T² = (1/n) Σ_i (θ_i − μ̂)²    (3.22)
3.3.2 Hypothesis testing
A. Significance level and test power
In hypothesis testing, two types of errors are of interest. A Type I error refers to rejecting the null hypothesis when it is correct, while a Type II error refers to accepting the null hypothesis when it is incorrect. The probability of making a Type I error is called the significance level, usually denoted α. The probability of NOT making a Type II error is called the power, often denoted 1 − β. In practice, as a rule of thumb, we usually choose α = 0.05 and 1 − β = 0.8. In the current application, two types of hypothesis testing will be considered: 1) for any two students, we want to know whether their true abilities are the same; 2) for two groups of students, we want to know whether the group means are the same. I show in this section how the test information affects these two types of hypothesis testing.
B. Individual student case
For individual students, what we want to test is whether two students differ in their true abilities T. We can run a t-test with the t-statistic defined as
t = (θ_1 − θ_2) / √(σ_e1² + σ_e2²)    (3.23)
where θ_1 and θ_2 are the ability estimates of the two students and σ_e1 and σ_e2 are their corresponding measurement errors. For a given significance level and test power, the greater the information, the smaller the difference between two students' true abilities (T_1 − T_2) that can be detected. If the test information is 10, the test can detect a difference of 0.92 between two abilities on the logit scale at the significance level of .05 and power of .8.
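The sketch below is a generic two-sided power calculation of the kind sketched in Equation 3.23: it computes the smallest ability difference detectable for two students whose estimates each carry error variance 1/I. It is shown only to make the information-precision trade-off tangible; it is not the exact calculation behind the figures quoted in this chapter, so its numbers need not match them.

```python
from math import sqrt
from scipy.stats import norm

# Generic sketch: minimal detectable difference between two students' abilities
# when each estimate has measurement error variance 1/I (assumption for illustration).

def minimal_detectable_difference(information, alpha=0.05, power=0.80):
    se_diff = sqrt(2.0 / information)            # SE of (theta_1 - theta_2)
    z_alpha = norm.ppf(1.0 - alpha / 2.0)
    z_beta = norm.ppf(power)
    return (z_alpha + z_beta) * se_diff

for info in (5, 10, 20):
    d = minimal_detectable_difference(info)
    print(f"information = {info}: detectable difference ≈ {d:.2f} logits")
```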
C. Group student case
For group-wise hypothesis testing, we want to know whether two groups of students differ in their mean true abilities T. The t-statistic for two groups of students is given by
t = (μ̂_1 − μ̂_2) / √(σ_1²/n_1 + σ_2²/n_2)    (3.24)
where μ̂_1 and μ̂_2 denote the means of the true abilities of groups 1 and 2, σ_1 and σ_2 denote the standard deviations of the ability estimates of groups 1 and 2, and n_1 and n_2 are the sample sizes of the two groups. However, the true abilities are not directly measurable, and we can only estimate them based on the measured θ. Though the mean of T can be well estimated by the mean of θ, the variance of T is related to the variance of the measured θ by
σ_T² = Var(θ) − σ_e²    (3.25)
where we have assumed for simplicity that all the σ_ei are equal to σ_e. If we do not have any information about the measurement errors on θ, we can only estimate the variance of the true ability as Var(θ). This over-estimates the variance of the true ability, decreasing the sensitivity of the test. However, if we know the measurement errors, we can estimate the variance of the true ability via Equation (3.25) above. Measurement error is related to the t-statistic because it contributes a small portion of the group variances (σ_1² and σ_2²). Smaller measurement error can increase the differentiability a little by decreasing the group variance, but it does not necessarily lead to a small group variance. If the test information is 10, then the test can detect a certain difference between two groups of students at a certain significance and power level. For example, if each group has 30 students and the true variance of the ability estimates of each group is assumed to be 0.6, the test can detect a difference of 0.87 between the mean ability estimates at the significance level of .05 and the power level of 0.8. If the test information is 5 and the other variables remain the same, then the test can detect a difference of 0.93 between the mean ability estimates at the same significance and power levels.
According to Equation (3.25), it seems that we can always recover the variance of the true ability, no matter how large the measurement errors are. Why, then, do the measurement errors need to be small? The key point is that if the measurement error is large, we cannot recover the variance of the true ability with the same precision as when the measurement error is small. To demonstrate this, I used a Monte Carlo simulation. First, I generated 100 true abilities T from N(0, 1). Then, I let the measurement error σ_e vary from 0.1 to 100 with an increment of 0.1 each time. At each measurement error σ_e, the measured ability was calculated as θ = T + N(0, 1) × σ_e, and the recovered variance of the true ability T was calculated as Var(θ) − σ_e². After that, I calculated the fractional error of the recovered variance with respect to the true variance, [σ̂²(T) − σ²(T)] / σ²(T), as a function of the test information (the inverse of the squared measurement error). The results are shown in Figure 3.2.
Figure 3.2 Fractional error of the recovered variance as a function of test information. For interpretation of the references to color in this and all other figures, the reader is referred to the electronic version of this dissertation. Note: The blue dots are results and the red dots are the means in each bin of size 5. The red error bars are the standard deviations in each bin.
Clearly, as the information increases, the dispersion of the fractional error decreases, while when the information is small, the scatter of the fractional error can reach 50%. If the information is sufficiently large, we can recover the variance of the true ability with significantly less uncertainty. However, in practice, increasing the test information requires increasing the number of test questions, which is constrained by the amount of available test time. Therefore, a rule-of-thumb choice for test information is 10, which corresponds to a fractional error of about 10%. From the graph, a test information of 5 also corresponds to a fractional error close to 10%. So if the test is designed to detect the difference between two groups of students rather than between two individual students, information of 5 can be considered sufficient. As long as we can estimate the variance of the true ability reliably, it is straightforward to calculate the t-statistic and run the t-test.
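The short sketch below is a compact re-implementation of the Monte Carlo idea described above, shown only for illustration; it is not the original simulation script and uses a few representative error levels rather than the full grid from 0.1 to 100.

```python
import numpy as np

# Sketch: recover the true-ability variance as Var(theta) - sigma_e^2 (Equation 3.25)
# and report the fractional error of the recovery for several measurement-error levels.

rng = np.random.default_rng(42)
T = rng.normal(0.0, 1.0, size=100)            # 100 true abilities from N(0, 1)
true_var = T.var(ddof=1)

for sigma_e in (0.1, 0.3, 1.0, 3.0):
    theta = T + rng.normal(0.0, 1.0, size=T.size) * sigma_e   # measured abilities
    recovered_var = theta.var(ddof=1) - sigma_e ** 2           # recovered true variance
    frac_err = (recovered_var - true_var) / true_var
    info = 1.0 / sigma_e ** 2
    print(f"information = {info:7.2f}: fractional error = {frac_err:+.2f}")
```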
3.4 Learning progression and IRT
In the Environmental Literacy project, the researchers are trying to make claims such as: 1) the learning progression framework and the carbon cycle assessment are unidimensional, and the learning progression levels, which represent increasing understanding and complexity, can be ordered along a continuum; 2) the assessment can accurately locate an individual student's understanding within the framework; 3) the same student should be at similar levels across items. Applying IRT methods can investigate the validity of these claims. According to the first claim, that the learning progression framework is unidimensional, there should be a single dimension defined by the achievement levels that accounts for a significant portion of the variance in student performance. The dimension analysis can verify whether the unidimensionality claim is true. It can tell whether students use the same ability or different abilities to answer the items, so it has implications for the learning progression framework and the assessment development. Students' abilities are indicated by the achievement levels defined in the learning progression framework; they are also indicated by the IRT ability estimates. If these two different definitions of ability reconcile with each other, this provides evidence for the validity of the learning progression framework and the validity of the single latent construct defined by the four achievement levels. So IRT analysis can be conducted to examine both the validity of the framework and the validity of the assessment. Second, applying IRT methods can reduce the measurement error so that students can be classified more accurately into learning progression levels. This supports the reliability of the assessment and the second claim above. Third, IRT methods can be used to check how consistently students respond across items. If the item difficulties and item thresholds are similar across items, the same student will be at similar levels across items, so the third claim is valid. There are other advantages of applying IRT methods. Generally speaking, the person ability estimated from IRT models is a better measure of a person's proficiency than the raw levels assigned by raters, because it takes item characteristics such as item difficulty into account and so indicates students' ability more accurately. Because of all these advantages, IRT methods are applied in this study to test the validity of the learning progression claims.
Chapter 4 Methodology
In this chapter, I introduce the assessment developed, the assessment data collected, and the methods used to analyze the data. Section 4.1 describes the sample for this study. Section 4.2 introduces the test design. Section 4.3 describes the data scoring process and Section 4.4 briefly introduces the data analyses.
4.1 Data
During the 2009-2010 school year, the Environmental Literacy project collected written assessment data from elementary to high school students. The data were collected from twelve science teachers' classrooms, with four teachers at each level. These teachers used the teaching materials designed by the Environmental Literacy project research team. They administered the tests before and after they taught the teaching materials, so about half of the test data were collected in the pretest and the rest were collected in the posttest. In total, there are 1500 test papers from 10 rural and suburban schools. The specific numbers of tests from each grade level and from the pre- or posttest are given in Table 4.1.
Table 4.1 Number of test papers collected during 2009-2010
Pre/Post     Elementary   Middle   High   Sum
Pre          167          288      262    717
Post         149          439      195    783
Grade sum    316          727      457    Total: 1500
4.2 The carbon cycle test design
The carbon cycle assessment was designed for each of the three grade levels: elementary, middle and high school. At each grade level, there were three alternative forms. The items administered at each grade level were selected to be appropriate for students at that grade level. Table 4.2 below describes the three test forms and the items that were administered at the high school level. [Footnote 7: In Table 4.2, each item is named using some key words from the item. For example, the EATAPPLE item asks students the question: An apple is eaten by a boy and digested in his body. What happens to the apple when it is digested?]
Table 4.2 Three alternative high school test forms
High Form A | High Form B | High Form C
[Photosynthesis items] [Digestion items] [Cellular respiration items] CARBPATH ENERPLNT THINGTREE PLANTGAS EATAPPLE INFANT CARBBODY ENERPEOP [Combustion items] GLUGRAPE WTLOSS BODYTEMP AIRNBODY [Human energy system items] GASOLINE CAR BRNMATCH WAXBURN GLOBWARM LAMPELEC KLGSEASON [Cross process items] GRANJOHN DIFEVENTS [Cross process items] [Decomposition items] TREEDECAY POTATO BREADMOLD [Cross process items] EATBRTHE ECOSPHERE [Linking items] ENERPEOP(AB) TROPRAIN(AC) [Linking items] PLANTGAS(AB) WTLOSS(BC) DEERWOLF TROPRAIN [Linking items] BRNMATCH(BC) BREADMOLD(AC)
All items assess students' understanding of matter and/or energy transformation in six carbon-transforming processes: plant growth, animal growth, animal function, combustion, decomposition and cross-process events. The items in bold font are the OMC + CR items; the items in italic font are the MTF + CR items; the rest are the CR items. There are linking items across the forms, which are in the last two rows of Table 4.2. About 20% of the items are linking items. Most of the items administered at the high school level are different from those administered at the elementary level. Items administered at the middle school level are a combination of the elementary and the high school items. Some items are used across all three grade levels as vertical linking anchor items. There are 43 items in total, including 25 CR items, 7 OMC + CR items, 10 MTF + CR items and 1 MTF item. All the items are listed in Table 4.3 below. This item pool was developed based on the item pool used in the previous year (2008-2009). Some of the items were used in the previous year but were modified according to the assessment results, while some were newly developed. Except for four elementary items, all the other items have more than 100 responses. The numbers of responses for each item are shown in Table 4.3. These numbers differ from item to item, depending on whether the item is an anchor item used across forms or grade levels.
Table 4.3 Number of responses per item
Item ID   Item format   Item label         Number of OMC/MTF responses   Number of CR responses
1         OMC+CR        ACRON              612                           601
2         OMC+CR        BREADMOLD          418                           397
3         OMC+CR        BRNMATCHA (M/H)    516                           525
4         OMC+CR        DEERWOLF           232                           233
5         OMC+CR        TROPRAIN           637                           641
6         OMC+CR        BODYTEMP           133                           137
7         OMC+CR        WTLOSS             886                           885
8         MTF+CR        AIREVENT           189                           198
9         MTF+CR        ANIMWINTER         79                            80
10        MTF+CR        ENERPEOP           469                           900
11        MTF+CR        ENERPLNT           508                           522
12        MTF+CR        GLOBWARM(M)        172                           185
13        MTF+CR        GLOBWARM(H)        132                           136
14        MTF+CR        INFANT             457                           455
15        MTF+CR        OCTAMOLE           148                           148
16        MTF+CR        THINGTREE          585                           598
17        MTF+CR        STONEWIN           191                           198
18        MTF           POTATO             162                           N/A
19        CR            AIRNBODY           N/A                           339
20        CR            APPLEROT           N/A                           253
21        CR            BRNMATCH (E)       N/A                           182
22        CR            BRNMATCHB (M/H)    N/A                           492
23        CR            CARBODY            N/A                           254
24        CR            CARBPATH           N/A                           361
25        CR            CARGAS             N/A                           355
26        CR            CONNLIFE           N/A                           220
27        CR            CUTTREE            N/A                           75
28        CR            DIFEVENT           N/A                           412
29        CR            EATAPPLE           N/A                           449
30        CR            EATBRTHE           N/A                           348
31        CR            ECOSPHERE          N/A                           283
32        CR            GIRLRUNNAB         N/A                           69
33        CR            GIRLRUNNC          N/A                           68
34        CR            GLUGRAPE           N/A                           265
35        CR            GRANJOHN (D)       N/A                           121
36        CR            GRANJOHN (P)       N/A                           102
37        CR            GROWTH             N/A                           352
38        CR            KLGSEASON          N/A                           124
39        CR            LAMPELEC           N/A                           318
40        CR            PLANTGAS           N/A                           359
41        CR            TREEDECAYAB        N/A                           741
42        CR            TREEDECAYC         N/A                           653
43        CR            WAXBURN            N/A                           260
An example of an MTF + CR item is given below:
Example MTF item: The baby gained more and more weight as she grew. Where did her weight come from? Please circle Yes or No for each of the following and explain your choices.
a. Sunlight Yes / No
b. Water Yes / No
c. Air Yes / No
d. Nutrients Yes / No
e. Foods Yes / No
f. Exercises Yes / No
Paired CR item: Please explain your answer. Try to explain what happens inside the girl's body to each of the materials that you circled "Yes."
4.3 Scoring rubrics and coding process
Students' responses were coded by nine raters in the Environmental Literacy research team. These raters majored in either science education or educational measurement. All raters were familiar with the learning progression framework and the scoring rubrics, and all had some experience in coding similar written assessment data. Seven of the raters had worked on the project for over two years and had coded hundreds of responses collected every year. The other two raters, who joined the project in the last year, went through coding training and coding practice to ensure high accuracy of their coding. The responses to the CR items and the CR part of the two-tier items were coded using the generic rubric and the item-specific rubrics. The generic rubric has general level descriptions that describe the general characteristics across items (see Table A.1 in the Appendix). The item-specific rubrics have specific level descriptions for each level of each item and representative example responses for each level (see Table A.2 in the Appendix for an example). These rubrics were developed in previous studies and refined during the coding process to distinguish responses more clearly. Ten percent of the responses were double coded by a second rater. The inter-rater reliability between the first and second raters was higher than 80% for all items. [Footnote 8: This is the exact plus adjacent agreement, which includes differences within 0.5 level between the two raters.] Discrepancies in coding were discussed and final agreements were reached for each response. Students' responses to the OMC questions were recoded into levels according to the level of understanding that the chosen option represents. Appendix A lists all the items and the level of each OMC option.
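The sketch below illustrates this recoding step with a small option-to-level lookup. The mapping shown for the WTLOSS item is hypothetical; the actual option levels are those listed item by item in Appendix A.

```python
# Sketch of recoding OMC selections into learning progression levels using an
# option-to-level key. The key below is made up for illustration.

omc_level_key = {
    "WTLOSS": {"A": 2, "B": 1, "C": 4, "D": 3},   # hypothetical option-to-level mapping
}

def recode_omc(item_label, selected_option, key=omc_level_key):
    """Return the learning progression level represented by the selected OMC option."""
    return key[item_label][selected_option]

print(recode_omc("WTLOSS", "C"))   # -> 4
```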
Students' responses to the MTF items, which consist of a string of True or False responses, were also recoded into levels based on the number of correct choices made by the students. For example, Level 1 means the student made one correct choice and Level 4 means the student made four correct choices. The correct answers of the MTF items are in bold in Appendix A.
4.4 Data analysis
Both classical test statistics and IRT-based statistics were used to evaluate the quality of the items. Item discrimination indices were analyzed; the discrimination index is the correlation between students' scores on an item (the final score agreed on by both raters) and their total scores. IRT-based statistics such as item fit indices, item difficulty, step difficulty and measurement error were also used to evaluate item quality. Since there were common anchor items across grade levels and test forms, the entire data matrix followed a common anchor item design. The combined set of items used at all grade levels was calibrated through a concurrent calibration using IRT models. ConQuest was used to estimate both the item and the person ability parameters. The open source software R and Microsoft Excel were used to calculate the summary statistics.
The ConQuest program can provide various IRT-based statistics. For example, it can provide person ability estimates such as the Expected A Posteriori (EAP) estimates [Footnote 9: The Expected A Posteriori (EAP) estimate of a person's ability combines the item calibrations, a prior rough idea of person ability, and the observed responses to obtain an improved, a posteriori person ability measure. It is based on the posterior probability distribution of the ability parameter. Suppose we want to estimate the ability parameter θ and we know the posterior distribution p(θ | data) = p(data | θ) p(θ) / p(data). Then the EAP estimate is θ_EAP = ∫ θ p(θ | data) dθ / ∫ p(θ | data) dθ.] and maximum likelihood estimates (MLE; see Section 3.2.3 for a detailed explanation of MLE). It can also provide fit statistics for individual items. These are residual-based indices such as the weighted and unweighted fit statistics developed by Wright and Masters (1982). Weighted fit statistics are usually preferred because they are less sensitive to unexpected responses made by persons for whom the item of interest is far too easy or far too difficult. Wu (1997) has shown that these statistics have approximate scaled chi-square distributions and can be transformed to approximate normal deviates (t-values). An item is considered a misfit item if the absolute value of its associated t-statistic is greater than 2.0. A t-value greater than 4.0 or less than -4.0 indicates serious misfit. Items that do not converge or that show poor IRT fit may be of low quality and may cause potential problems when included in the test.
Chapter 5 Design of a Test Consisting of Items in Multiple Formats
5.1 Research purpose and procedure
The first research question is how to design a test using items in multiple formats to precisely classify students into levels. This is one central question of this study. To measure students' learning progression as accurately as possible, the most appropriate items need to be selected. Three steps of data analyses were conducted to address this question. First, a dimensionality analysis was conducted to see whether items in different formats assess the same ability or not.
A confirmatory approach was applied to analyze the dimensionality using the subjective classification of items according to their formats. Both unidimensional PCM model and the multidimensional PCM model were used to fit students’ OMC, MTF and CR scores. The appropriate model was selected according to the chi-square goodness of fit test and how well it is supported by the theoretical underpinning. Second, after the item parameters were calibrated using the selected model, items were selected to form a test based on certain test design criteria. More details about the design criteria and how items were selected to meet these criteria were discussed in Section 5.4. Third, to design OMC, MTF items to accurately classify students, discriminative OMC/MTF options are needed. Correlation and cross tabulation analyses were conducted to see how well the OMC/MTF choices relate to students’ abilities or their CR levels. To design better CR items, whether the CR items accurately classified students into levels was analyzed. 5.2 Classical test statistics—Item discrimination index The item discrimination was analyzed as a first check of the item quality. It was computed as the correlation between the students’ scores (the final scores agreed by both raters) 60 on the item and their total scores. The item discrimination indices of most items were higher than 0.3. However, the item discrimination indices of nine OMC and MTF items were lower than 0.3 and four of them were even lower than 0.2. These OMC items were BODYTEMP (H) (0.20), TROPRAIN (0.22) and WTLOSS (0.24). The MTF items were AIREVENT (0.24), ANIMWINT (0.17), BODYTEMP (0.25), INFANT (0.16), POTATO (0.18) and THINGTREE (0.19). A discrimination index value below 0.30 indicated that an item might not be measuring what it was intended to measure, and should be reviewed. The discriminations of all CR items were higher than .30. Since the test mainly consisted of CR items, the low discrimination of OMC and MTF items indicated that students’ OMC and MTF scores did not strongly correlate to their CR scores. There might be multidimensionality in terms of item format. This was examined in the dimension analysis. 5.3 Dimensionality in terms of item format The dimensionality analysis was conducted to test whether items in different format assessed the same ability. Two models were used to fit students’ response codes: a unidimensional Partial Credit Model (1PCM) and a three-dimensional Partial Credit Model (3PCM) that classified items into three dimensions according to their formats. To compare the goodness of fit between these two models, a chi-square test was performed on the difference between the deviances of these two models. The difference in the model deviance approximately follows a chi-square distribution with degrees of freedom equal to the number of additional parameters estimated in the more complex model (Haberman, 1977). The difference between model deviances was 420. The additional number of parameters of the 3PCM was 5. A chisquare statistics of 420 with degree of freedom of 5 was statistically significant at 0.001 level. So the 3PCM fit the data significantly better than the 1PCM. 61 There was increase in model fit by applying 3PCM, which suggested the additional dimensions could explain some of the variance in student performance. However, there were moderate correlations among these three latent dimensions and strong correlations between the abilities estimated using the 1PCM and the abilities in each dimension estimated using the 3PCM. 
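This likelihood-ratio comparison can be reproduced directly from the deviances reported by ConQuest. The following is a minimal sketch in R, using only the deviance difference and degrees of freedom reported above:

```r
# Likelihood-ratio (deviance) test for nested IRT models: the deviance
# difference is referred to a chi-square distribution with degrees of freedom
# equal to the number of additional parameters in the more complex model.
deviance_difference <- 420   # deviance(1PCM) - deviance(3PCM), from the analysis above
extra_parameters    <- 5     # additional parameters estimated by the 3PCM

p_value <- pchisq(deviance_difference, df = extra_parameters, lower.tail = FALSE)
p_value   # far below .001, so the 3PCM fits significantly better than the 1PCM
```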
The correlation between the OMC and the CR dimension was 0.69, and the correlation between the MTF and the CR dimension was slightly higher, 0.76. The disattenuated correlations between the unidimensional ability estimates and the ability estimates in the three dimensions were all over 0.9. (A disattenuated correlation is the correlation between two sets of parameters after accounting for the measurement error contained in their estimates; here, the measurement error of the EAP ability estimates is accounted for.) Table 5.1 provides all of these correlations. These high correlations indicate that one dimension is sufficient to approximate students' ability. Though the dimensionality analysis suggested that the OMC, MTF and CR items might measure different components of the construct, the goal of the assessment is to design OMC, MTF and CR items that assess the same construct. The OMC and MTF items need to be revised to predict students' CR levels more accurately (Sections 5.5.1 and 5.5.2 discuss this in detail). A unidimensional model is supported by the cognitive theories that underlie the assessment design. Multiple dimensions are only necessary when we want to account for nuisance dimensions for a particular measurement purpose.

Table 5.1 Correlations among the EAP estimates in each dimension

                         Dimension 1 (OMC)   Dimension 2 (CR)   Dimension 3 (MTF)
Dimension 1 (OMC)        1
Dimension 2 (CR)         0.69                1
Dimension 3 (MTF)        0.44                0.76               1
Unidimensional ability   0.94                0.99               0.96

Note: the correlations in the last row, between the unidimensional ability estimates and the multidimensional ability estimates, are disattenuated correlations.

So the 1PCM was applied to analyze the data. When the 1PCM is applied, a student's measured ability can be considered a composite of the abilities on the multiple dimensions. The results showed that most items fit the 1PCM well: the MNSQs were within [0.67, 1.33] and the t-statistics were within [-2, 2]. This indicates that, in general, the learning progression framework and the unidimensional assumption are supported. Table A.3 in the Appendix lists the item difficulty parameter estimates and the fit statistics from the 1PCM results. Nine items did not fit the 1PCM well (see Table A.3 in the Appendix); these items need to be reviewed.

5.4 Select items to meet the design criteria

5.4.1 Design criteria of the learning progression-based carbon cycle assessment

There are general considerations for any test design, such as reliability and validity. However, each research project has its own goals, and the design criteria must be set so that the results can support the inferences one wants to make. In the Environmental Literacy project, the following criteria are of specific interest:
1) High information at the boundaries between levels on the IRT ability scale (information away from the boundaries is of less concern). The boundaries therefore need to be defined first.
2) Similar item step thresholds across items. For learning progression-based items, ideally, students at the same ability level will receive the same level across all items, which means the item step thresholds need to be similar across items.
3) The ability to detect differences between classes of about 30 students (e.g. a pre-posttest difference at the class level).
4) Use of more OMC and MTF items in the test to reduce scoring effort.
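As a brief aside before turning to the design steps: the disattenuated correlations reported in Table 5.1 follow the classical correction for attenuation. The sketch below, in R, uses illustrative numbers rather than the actual reliabilities of the EAP estimates, which are not reported here.

```r
# Correction for attenuation: the observed correlation between two sets of
# ability estimates is divided by the square root of the product of their
# reliabilities, which removes the dampening effect of measurement error.
disattenuate <- function(r_observed, reliability_x, reliability_y) {
  r_observed / sqrt(reliability_x * reliability_y)
}

# Hypothetical values for illustration only:
disattenuate(r_observed = 0.80, reliability_x = 0.80, reliability_y = 0.90)  # about 0.94
```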
5.4.2 Design a test to meet each criterion

1) Define boundaries on the IRT scale

First, if the ability defined by the IRT scale reconciles with the ability defined by the achievement level codes, the construct—the learning progression framework—is generally supported. To test whether these two definitions of ability reconcile with each other, boundaries must first be set on the IRT scale to classify students into levels. Classifying students based on the IRT scale has some advantages. A student may be at different levels on different items; how should his or her overall achievement level be decided? The ability estimated from the IRT analysis is based on students' responses to all items and takes the item characteristics into account, so it is a better measure of students' ability. Therefore, we want to define boundaries on the IRT scale to classify students into levels. This is usually done through a standard-setting process, which is complex and beyond the scope of this study. A simpler approach was applied in this study to set the boundaries: the mean of the item thresholds across a set of good items was taken as the location of each estimated boundary. The good items were those that fit the 1PCM well and whose thresholds were in the correct order. The boundaries were set based on the 38 items that were considered good items (about two thirds of all items). Table A.4 in the Appendix lists the items included in the good item set, and Table A.5 in the Appendix lists the 22 items that were excluded. The threshold parameters of these excluded items do not represent the boundaries well. Half of the excluded items were OMC and MTF items that did not fit the unidimensional model well. The other half were CR items; some of these CR items have problems in their scoring rubrics and some cannot differentiate students at certain achievement levels. The problems with these CR items are discussed in more detail in Section 5.5.3. It is therefore reasonable to exclude these 22 items when setting the boundaries. The thresholds of the good items were relatively close to each other. The means of the thresholds between levels 1 and 2 (d1), levels 2 and 3 (d2), and levels 3 and 4 (d3) were -1.7, 0.5 and 1.9, respectively, so these values were taken as the estimated boundaries. Once the boundaries are set, students can be classified into levels according to their ability estimates.

2) Similar item step thresholds across items

Some items had thresholds that were closer to the estimated boundaries than others. Table 5.2 gives the descriptive statistics of the item parameters, and Figure 5.1 shows the distribution of the item parameters, including the item difficulty parameter (b) and the item step threshold parameters (d1, d2, d3). The second design criterion is to have similar locations of thresholds across items; ideally, all items should have thresholds at the defined boundaries.

Table 5.2 Descriptive statistics of the item parameters of the 38 selected good items

            b        d1       d2       d3
Mean        0.000    -1.72    0.52     1.91
Median      -0.03    -1.67    0.30     1.82
S.D.        0.65     1.12     1.10     0.80
Skewness    -0.04    0.16     0.32     0.10
Maximum     1.32     0.96     2.90     3.50
Minimum     -1.21    -4.43    -1.28    0.42

Figure 5.1 Item difficulty (b) and threshold (d) parameter distribution

It can be seen that the threshold parameters vary somewhat across items. If the items and rubrics are well designed and the scoring is reliable, the variance of the threshold parameters d1, d2 and d3 should be small.
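A minimal sketch in R of how the boundary estimates and the spread of the thresholds can be computed once the calibrated step thresholds of the good items are available. The threshold values below are made up for illustration; in the actual analysis the matrix would hold the 38 good items.

```r
# Each row holds the step thresholds (d1, d2, d3) of one well-fitting item.
thresholds <- rbind(
  c(-1.9, 0.4, 1.8),
  c(-1.5, 0.7, 2.1),
  c(-1.7, 0.3, 1.9)
)
colnames(thresholds) <- c("d1", "d2", "d3")

boundaries <- colMeans(thresholds)       # estimated level boundaries on the IRT scale
spread     <- apply(thresholds, 2, sd)   # variability of the thresholds across items

boundaries   # with the full good-item set these are about -1.7, 0.5 and 1.9
spread       # large values flag thresholds that deviate from the common boundaries
```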
The items with d1, d2 and d3 deviated far away from the mean values is a sign that either the item or the scoring is not appropriate. For example, on the “histogram of d1 graph”, the d1 of some items (e.g. TROPRAIN, GLUGRAPE, GLOBWARM) are much smaller than the others. These items are not discriminative at the lowest level—Level 1. However, either because the scoring rubric did not clarify that these items were not discriminative at Level 1, or because of coding mistakes, a small proportion of the 66 responses were coded as Level 1. So the threshold parameter between level 1 and level 2 (d1) of these items are much smaller than the d1 of the other items. 3) Detect the differences between classes To accurately classify students into levels, the measurement errors at the boundaries should be small. This means the information at the defined boundaries needs to be high. Since the third design criterion is to detect the difference between classes, information of 5 or above on these boundaries can be considered as sufficient (see Section 3.3.2 about why selecting 5 as the criterion). Items were selected to form a test that could get information above 5 at the defined boundaries. Items were randomly selected from the good item set to see how many items were needed. The result suggested around 16 items were needed. The information curve formed by these 16 items is shown in Figure 5.2 below. According to the second design criterion, all item thresholds should be close to the defined boundaries. So some ideal items that have thresholds at those boundaries are simulated. Then these ideal items are selected to see how many items are needed to achieve information above 5 at the boundaries. The result shows around 14 items are needed. The information curve formed by these 14 ideal items is shown in Figure 5.2 below. 67 Figure 5.2 Information curve of 16 real items Figure 5.3 Information curve of 14 simulated ideal items Sixteen real items are required to get information above 5 at the boundaries, but only 14 items are required to get information above 5 at the boundaries when item thresholds are the same across items. So adjusting rubrics to get similar thresholds across items will slightly reduce the number of items needed to reach the information criterion at the boundaries. Most importantly, having similar thresholds across items is an important criterion to design learning 68 progression-based items. So it is worth to modify the item or adjust the rubrics to get similar item thresholds across items. 4) Using more OMC and MTF items in the test The OMC and MTF item formats can be used as alternative formats to reduce the scoring effort and administration time as long as they can accurately classify students into levels and can elicit responses consistent with those elicited by the CR items. Including OMC/MTF items in the test might also give information about students’ other abilities such as the ability to identify the best/correct answer. However, in the prior analysis, some OMC items and MTF items do not fit well with the unidimensional model. This indicates too much randomness in the data so these OMC/MTF items do not perform well to classify students into levels. Three problems may cause the misfit. The first problem is the OMC and MTF items assess different aspects of the construct (e.g. the ability to recognize a correct answer) as discussed in the dimensionality analysis previously. So the unidimensional model does not fit well. 
The second possible problem is the quality of the OMC and MTF items: some of the OMC and MTF options cannot discriminate among students appropriately. Sections 5.5.1 and 5.5.2 discuss this problem and how to design better OMC and MTF options in detail. The third problem is that the PCM may not be the best model for OMC and MTF items. The randomness in the OMC and MTF data might be due to the restricted range of responses and the effect of guessing. In addition, some OMC items have two options representing understanding at the same level, so the chance probabilities of reaching each level differ. Some new models are under development for OMC items, but they are limited to certain types of OMC items and may not be appropriate for the OMC items in this study.

The misfit of the OMC and MTF items can be addressed in two ways. First, since the PCM might not be the best model for OMC and MTF items, the one-parameter logistic model (1PL) was applied to the OMC and MTF data by recoding the items as dichotomous (1 if the student chose the best answer or all correct answers, 0 otherwise). The result shows that one OMC item has slight misfit; all the other recoded OMC and MTF items fit the 1PL model well. The difficulties of the 7 OMC and 10 MTF items are listed in Table 5.3 below. Most of the MTF items are difficult, which indicates that students need high ability to choose all the correct answers. If an item's difficulty is close to a boundary, the item can still work well to classify students between levels.

Table 5.3 OMC and MTF item difficulty (recoded as dichotomous items)

OMC items           Difficulty
ACRON_OMC           -0.411
BODYTEMP_OMC        -2.037
BREAD_OMC           -1.271
BRNMATCH(M)_OMC     0.405
DEERWOLF_OMC        0.863
TROPRAIN_OMC        0.172
WTLOSS_OMC          -0.236

MTF items           Difficulty
AIREVENT_MTF        -0.552
ANIMWINT_MTF        2.652
STONEWIN_MTF        2.355
ENERPEOP_MTF        1.138
ENERPLNT_MTF        1.622
GLOBWARM(M)_MTF     1.428
GLOBWARM(H)_MTF     3.914
INFANT_MTF          1.904
OCTAMOLE_MTF        1.798
POTATO_MTF          2.498

The second way to address the misfit is to exclude responses to the less discriminative sub-questions from the analysis, since some MTF options are not as discriminative as others (Section 5.5.2 discusses this problem in detail). Students' MTF scores are then calculated based on their responses to the discriminative options only; I discuss this further in Section 5.5.2. Because some MTF items do not have any discriminative option, this approach alone is not sufficient to improve the fit statistics of the MTF items significantly. Sections 5.5.1 and 5.5.2 address other possibilities for improving the quality of the OMC and MTF items.

Besides using OMC and MTF items, an alternative way to reduce scoring effort and administration time is to include dichotomously scored items, such as multiple-choice or True-or-False items, in the test. If the difficulties of the dichotomous items are at the defined boundaries, these items can help classify students accurately. For example, if there are three well-designed CR items that have thresholds at the defined boundaries and 14 dichotomous items with difficulties at the boundaries (five at -1.7, five at 1.9 and four at 0), the information curve formed by these items is shown in the figure below.

Figure 5.4 Information curve formed by 3 CR items with thresholds at the boundaries and 14 ideal dichotomous items with difficulties at the boundaries

The information curve is above 5 across all boundaries.
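The information curves described in this section can be computed from the calibrated item parameters. The sketch below, in R, uses the standard partial credit and 1PL information formulas with the ideal parameter values from the example above; the function names are ours, and ConQuest's reported test information may be computed somewhat differently.

```r
# Item information for a partial credit item: the conditional variance of the
# item score at ability theta, with category probabilities built from the
# step thresholds (deltas).
pcm_info <- function(theta, deltas) {
  cum   <- c(0, cumsum(theta - deltas))   # log-numerators for scores 0..K
  probs <- exp(cum) / sum(exp(cum))       # category probabilities
  k     <- seq_along(cum) - 1
  sum(k^2 * probs) - sum(k * probs)^2
}

# Item information for a dichotomous 1PL item with difficulty b.
rasch_info <- function(theta, b) {
  p <- 1 / (1 + exp(-(theta - b)))
  p * (1 - p)
}

boundaries <- c(-1.7, 0.5, 1.9)
cr_items   <- replicate(3, boundaries, simplify = FALSE)    # 3 ideal CR items
dich_items <- c(rep(-1.7, 5), rep(1.9, 5), rep(0, 4))       # 14 dichotomous items

test_info <- sapply(boundaries, function(th) {
  sum(sapply(cr_items, pcm_info, theta = th)) + sum(rasch_info(th, dich_items))
})
test_info   # test information at each boundary, to be checked against the criterion of 5
```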
So these 3 CR and 14 dichotomous items can form a reliable test to classify students into the four achievement levels. The next question is how to develop dichotomous items that have difficulties close to the boundaries. The difficulties of the recoded OMC items provide some information about what kinds of dichotomous items may need to be developed.

5.5 Design discriminative OMC, MTF and CR items

Using OMC/MTF items can reduce test administration time and scoring effort. In order to use items in these formats in the test, the OMC and MTF items need either to classify students accurately into several achievement levels or to distinguish students between two adjacent levels (that is, to act as dichotomous items). The previous analysis shows that when OMC and MTF items are scored as dichotomous items, they generally fit the 1PL model well. When the rescored OMC and MTF items have difficulties close to one of the estimated boundaries (-1.7, 0.5, 1.9), they can be used to classify responses between two levels. For example, the OMC items are generally easy and can be used to separate level 1 from level 2 students; the MTF items are generally more difficult and can be used to separate level 3 from level 4 students.

On the other hand, OMC and MTF items can be designed to perform better as polytomous items. The quantitative analyses suggest that many OMC or MTF items have poor item statistics when coded as polytomous items. The item discrimination indices of some items are lower than 0.20, and some items do not fit the unidimensional model or have step thresholds that are not in the correct order. Since the test mainly consists of CR items, the low discrimination indices of the OMC and MTF items indicate that students' OMC and MTF scores do not correlate strongly with their CR scores. The misfit and the incorrect order of the step thresholds indicate problems with the OMC and MTF items. Sections 5.5.1 and 5.5.2 therefore discuss how to design OMC and MTF options that align with the CR items.

5.5.1 OMC options

To evaluate how well students' OMC levels predict their CR levels, students' levels on the OMC questions are cross-tabulated with their levels on the paired CR questions. The result shows that the OMC level can predict the CR level to some extent, but there are cases in which the OMC level over-predicts or under-predicts the CR level; more often, the OMC level over-predicts the CR level. Take the ACORN item as an example; it asks students to identify where the weight of a tree comes from.

ACORN OMC part: A small acorn grows into a large oak tree. Where does most of the weight of the oak tree come from? (Circle the best explanation from the list below.)
A) From the natural growth of the tree (level 1)
B) From carbon dioxide in the air and water in the soil (level 3)
C) From nutrients that the tree absorbs through its roots (level 2)
D) From sunlight that the tree uses for food (level 1)
Paired CR part: Please explain why you think that the answer you chose is better than the others. (If you think some of the other answers are also partially right, please explain that, too.)

Table 5.4 Cross-tabulation between OMC levels and CR levels for the ACORN item

             OMC response levels
CR levels    Level 1 (A)   Level 3 (B)   Level 2 (C)   Level 1 (D)
0            6             7             8             2
1            105           14            69            36
2            21            88            170           24
3            3             26            9             4
4            0             2             0             0

Table 5.4 shows how students' choices on the OMC part cross-tabulate with their levels on the paired CR part.
In the table, the numbers in the grey cells represent the number of cases that the OMC level is consistent with the CR level. There are relatively large counts in the grey cells, which means that students’ CR responses were coded the same as their OMC responses. So the OMC options can predict students’ levels for the CR part to some extent. The counts in the other cells served as evidence that the different item formats are not eliciting consistently coded responses from students. The numbers in the cells above the grey 74 cells represent the cases that the OMC part over-estimate students’ CR levels and the numbers in the cells below the grey cells represent the cases that the OMC part under-estimate students’ CR levels. There are more over-estimations than under-estimations, which indicate that the OMC question is easier than the CR question. Many students could identify that the weight came from CO2 and water but could not explain how CO2 and water contributed to weight gain. For instance, one student selected the choice B but his/her response to the CR question was “the carbon in the air and the water in the soil makes the weight more at the bottom then on top”. Clearly, the students could not explain the photosynthesis process. This is also true with the other OMC+CR items. Students’ OMC levels are usually higher than their paired CR levels. The hypothesis for the discrepancy between students’ OMC levels and their CR levels is that students perform better when identifying the input and output of carbon transforming processes than explaining what happens during the processes. Since most OMC questions assess the former ability and most CR questions assess the latter ability, students’ OMC levels are usually higher than their CR levels. The cross-tabulation analysis also shows that in most cases, the OMC options associated with range of levels rather than a single level. For example, the BODYTEMP item (below) asks students where human body heat mainly comes from. Students who selected the correct option (option C) of the OMC question tended to be at multiple levels in the corresponding CR part. For instance, one student who selected option C gave a level 4 explanation: “because the food is then broken down and the chemical energy in that food is then changed into thermal energy to keep you warm”. However, another student who selected option C provided a level 1 explanation: “I think all of these answers were legitimate, but I choose C because you have to eat food, everyone does, so it’s a natural function, it would make sense to make heat from something you do daily.” 75 So the OMC option cannot predict the level of the student’s CR response very well. BODYTEMP OMC part: Your body needs heat to keep its normal temperature. Where does the heat mainly come from? Please choose ONE answer that you think is best. A) The heat mainly comes from sunlight. (Level 1) B) The heat mainly comes from the clothes you are wearing. (Level 1) C) The heat mainly comes from the foods you eat. (Level 3) D) When people exercise, their bodies create energy. (Level 2) Paired CR part: Please explain why you think that the answer you chose is better than the others. (If you think some of the other answers are also partially right, please explain that, too.) In total, 91 students selected option C. Among these students, 59 of them are at level 3, 20 are at level 4 and the others are at level 1 or 2 on the paired CR part. Table 5.5 shows how the OMC levels cross-tabulate the CR levels. 
Most of the OMC options of this item represent understanding at lower levels. Hence, these OMC options under-predict students' CR levels.

Table 5.5 Cross-tabulation between OMC levels and CR levels for the BODYTEMP item

                     OMC response levels
CR response levels   Level 1 (A)   Level 1 (B)   Level 2 (D)   Level 3 (C)
1                    8             3             11            3
2                    1             1             11            9
3                    1             0             6             59
4                    0             0             0             20

In summary, the OMC options can predict students' CR levels to some extent. In most cases, however, the OMC questions are not true ordered multiple-choice questions, because their options do not represent understanding at every level of the learning progression. There are cases in which the OMC level over-predicts or under-predicts the CR level, and over-prediction is the more common pattern. When all the OMC options represent understanding at low levels, the OMC levels may under-predict students' real achievement levels. So, in order to design OMC options that better predict students' CR levels, the OMC options need to represent understanding at multiple levels.

5.5.2 MTF options

The MTF options of the MTF+CR items were analyzed in similar ways to see how students' True or False responses related to their CR responses. If students' True or False responses can predict their levels on the paired CR question, then the MTF format can be used instead of the CR format to detect students' achievement levels. First, the analysis examined how students' response strings to the set of True or False questions related to their levels on the paired CR item. Then, the relation between students' responses to each True or False sub-question and their CR levels was analyzed. The main findings are summarized below:
1) Students who selected all correct responses were usually at very high CR levels and had high ability estimates. This suggests that the set of True or False questions is useful for identifying students in the high ability range.
2) Students' number of correct choices can detect the achievement level of students who are in the middle and lower ability ranges.
3) The patterns of the True or False response strings are not clearly associated with students' paired CR levels. This is mainly because students often select both low-level and high-level options, since they do not have sophisticated enough understanding to rule out the lower-level distracters.
4) Some of the True or False questions work better than others to differentiate students.

In the following paragraphs, examples are given to illustrate each of these four main findings about the MTF items. First, take the ENERPLNT item as an example. Table 5.6 describes the percentages of the most common True or False response strings. Among 508 students, only 4% (20 students) correctly identified "sunlight" as the only energy source for plant growth. The average of their paired CR levels is 3.9 and the average of their ability estimates is 0.906, which is significantly higher than that of the other students. This is a common pattern across most of the MTF items: students who give all correct answers are those at very high ability levels, so MTF items are very useful for identifying these students.

ENERPLNT MTF part: Which of the following are sources of energy for plants? Circle yes or no for each of the following:
a) Water   Yes / No
b) Sunlight   Yes / No
c) Air   Yes / No
d) Nutrients in soil   Yes / No
e) They make their own energy   Yes / No
Paired CR part: Explain what you think is energy for plants.
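Table 5.6 below summarizes students by their complete Yes/No response string, reporting the percentage of students giving each string and their average CR level. A minimal sketch of this aggregation in R, with a few hypothetical responses (the data frame and variable names are invented for illustration):

```r
# Hypothetical MTF data: one row per student, Y/N answers to the five options
# plus the level assigned to the paired CR response.
mtf <- data.frame(
  water = c("Y", "N", "Y"), light = c("Y", "Y", "Y"), air = c("Y", "N", "Y"),
  nutrient = c("Y", "N", "Y"), own_energy = c("N", "N", "Y"),
  cr_level = c(2, 4, 1)
)

# Collapse each student's answers into a single response string.
mtf$pattern <- apply(mtf[, 1:5], 1, paste, collapse = "")

# Average CR level and percentage of students for each response string.
summary_tab <- aggregate(cr_level ~ pattern, data = mtf, FUN = mean)
summary_tab$percent <- 100 * as.numeric(table(mtf$pattern)[summary_tab$pattern]) / nrow(mtf)
summary_tab
```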
Table 5.6 Percentages of each response string of the ENERPLNT item (n = 508) and the average ability estimates for each response string

Percentage     Water   Sunlight   Air   Nutrient   Own energy   CR level average   Average ability estimate
4%             No      Yes        No    No         No           3.9                0.906
28%            Yes     Yes        Yes   Yes        No           1.9                -0.29
25%            Yes     Yes        Yes   Yes        Yes          1.9                0.13
14%            Yes     Yes        No    Yes        No           1.8                -0.09
9%             Yes     Yes        No    Yes        Yes          2.1                -0.38
Others (20%)   …       …          …     …          …            …                  …

Note: the correct response string is the one in the first row.

Second, the patterns of the True or False response strings are not clearly associated with students' paired CR levels. Take the ENERPLNT item as an example: the average CR level of the students who selected "Yes" for the first four options and "No" for the last option was 1.9, and the average CR level of the students who selected "Yes" for all five options was also 1.9. There was no clear pattern in how the response strings were associated with the CR levels; students who gave different response strings were at similar CR levels. The main problem is that students often select both low-level and high-level options because they do not have sophisticated enough understanding to rule out the lower-level distracters. In this case, most students selected sunlight as an energy source, but they selected the other options as energy sources as well. Data from the ANIMWINTER item give another good example of this problem. This item asks what happens to the fat that an animal loses during hibernation. The correct answer is True for the second option, "the fat was turned into water and gases that the animal breathed out," and False for the other options. Some options, such as "turned into waste in the digestive system and left the body as poop," are designed as lower-level distracters. Students who selected the correct answer also selected True for those lower-level distracters. In the CR part, however, when the question is asked in an open-ended way and there are no low-level distracters, students are more likely to be placed at higher levels. For example, many students selected True for both high- and low-level options, but their explanations were mainly about the conversion of fat into heat or energy, which is at level 3.

ANIMWINTER
MTF part: During winter, many animals have problems finding food and may hibernate (sleep through the winter). These animals lose weight by spring. What do you think happens to the fat that the animal lost during hibernation? Circle True OR False for each possibility.
True / False   The fat was turned into heat to keep their bodies warm during the winter.
True / False   The fat was turned into water and gases that the animal breathed out.
True / False   The fat was turned into waste in the digestive system and left the body as poop.
True / False   The fat was turned into other materials in the body that don't weigh as much.
True / False   The fat was used up in the animal's body and disappeared.
CR part: Think about your responses above. Please explain as much as you can about what happens to the fat in the animal's body during hibernation.

Third, since students often choose both low- and high-level options, and it is hard to judge their understanding level from their True or False response strings, another approach was applied to analyze the relation between students' True or False responses and their paired CR levels: students' MTF responses were recoded into scores according to the number of correct choices they made on the True or False questions.
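A minimal sketch in R of this number-correct recoding. The answer key follows the ENERPLNT example above (only sunlight is keyed Yes); the student's responses are hypothetical.

```r
# Key for a five-option MTF item: TRUE where the correct answer is "Yes".
key <- c(water = FALSE, sunlight = TRUE, air = FALSE, nutrient = FALSE, own_energy = FALSE)

# One student's Yes/No responses, recoded so that TRUE means "Yes".
responses <- c(water = TRUE, sunlight = TRUE, air = FALSE, nutrient = TRUE, own_energy = FALSE)

# MTF score = number of sub-questions answered correctly, used as the
# polytomous score in the partial credit analysis.
mtf_score <- sum(responses == key)
mtf_score   # 3 correct choices for this hypothetical student
```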
The IRT analysis suggested students’ MTF scores based on their number of correct choices generally fit the 1PCM model and the step thresholds of the MTF items were in the correct order. This suggests MTF items can also measure students who are in the middle or low ability range using students’ number of correct 80 choices as their MTF scores. One MTF item (THINGTREE) showed serious misfit and three MTF items showed slight misfit (AIREVENT, INFANT, POTATO). The THINGTREE item below asks students to identify things that a tree needs in order to grow from a list: sunlight, soil, water and air. THINGTREE A small oak tree was planted in a meadow. After 20 years, it has grown into a big tree, weighing 500 kg more than when it was planted. Do you think the tree will need any of the following things to grow and gain weight? Please circle Yes or No and explain your choice. If you circled yes, explain how the tree uses it. What happens to it inside the tree? Sunlight YES NO Soil YES NO Water YES NO Air YES NO Most of the students selected that all four things were needed regardless of their ability level. For example, a student who selected Yes to all four options was clearly at high ability level. He/she provided a level 4 response as the one follows: During photosynthesis, chloroplasts absorb and use light energy. CO2 and H2O combined into organic materials and release O2. Light energy is converted into chemical energy. The sugar produced by photosynthesis is converted to starch, which involves in the synthesis of amino acid, protein, and lipid. So the weight of tree increases. Meanwhile, another student who selected T to all four options was at low ability level. He/she gave a level 1 response: 81 It needs sunlight so it can grow. Without soil the tree won’t grow healthy and strong. It always needs water so it can grow just like it needs sunlight. Without the air it won’t even be able to grow. These two students were at significantly different ability levels. However, they got the same number of correct answers. The number of correct choice is not a good measure of students’ ability. So this MTF question is not well designed. The other three MTF items that show misfit have similar problems and need to be reviewed to include discriminative options. Fourth, some of the “T” or “F” questions work better to differentiate students than others. Take the ENERPLNT item as an example, the “water”, “air”, “nutrients” options are most effective to detect students’ differences. Table 5.7 below shows the percentages of students at each CR level for two groups of students: the group who selected Y and the group who selected “No” to the question. For three options, water, air and nutrient, students who circled “No” were more likely to be at higher CR levels than those who selected Y. The t-test indicated for these three options, there was significant group difference in terms of students’ CR levels between the students who selected Y and who selected N. But for the other options (b. sunlight; e. plant make their own energy), the differences between the groups who selected Y and who selected N were not significant. This means that these two options are less effective to differentiate students. The t-test results are in Table 5.8. 
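The group comparisons reported in Table 5.8 are independent-samples t-tests on students' CR levels. A minimal sketch in R with hypothetical data; in the actual analysis, one such test was run for each option of each MTF item.

```r
# Hypothetical CR levels of students who circled "Yes" vs. "No" for one option.
cr_yes <- c(1, 2, 2, 1, 3, 2, 1, 2)
cr_no  <- c(3, 2, 4, 3, 2, 4)

# Two-sample t-test: is the mean CR level different between the two groups?
# A significant difference suggests the option helps differentiate students.
t.test(cr_yes, cr_no)
```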
Table 5.7 Percentages of students at each CR level among students who selected Yes and those who selected No to each True or False question

           WATER (%)      LIGHT (%)      AIR (%)        NUTRIENT (%)   OWN ENERGY (%)
           Yes    No      Yes    No      Yes    No      Yes    No      Yes    No
Level 1    34     9       31     36      34     26      32     25      32     30
Level 2    44     25      42     27      45     35      46     22      33     48
Level 3    22     38      23     36      20     29      22     32      34     15
Level 4    0      28      4      0       0      9       0      21      0      6

Table 5.8 Comparison of the average CR levels of two groups of students: those who selected Yes and those who selected No

           WATER           LIGHT           AIR             NUTRIENT        OWN ENERGY
           Yes     No      Yes     No      Yes     No      Yes     No      Yes     No
Mean CR    1.87    2.85    1.99    2.00    1.85    2.21    1.89    2.48    2.02    1.97
n          455     65      499     22      319     198     431     87      233     279
Sig.       .012            .394            .000            .000            .120

Excluding less discriminative options and including more discriminative options will improve the quality of the MTF items to some extent. Table A.6 in the Appendix lists the effective sub-questions for each MTF item according to the t-statistics. After recalculating students' MTF scores using only their responses to the effective True or False sub-questions, the item discrimination of most MTF items improves. However, since some items do not have any discriminative option, or have only slightly effective options, excluding the less effective sub-questions is not sufficient to improve the quality of those MTF items.

In summary, the set of True or False questions in an MTF item is very useful for identifying students at very high achievement levels; those students often make correct choices on all of the True or False questions. The number of correct True or False choices a student makes can also measure students in the middle or low ability range to some extent. The MTF items that do not fit the PCM well or that have low discrimination need to be redesigned with a better combination of True or False questions, since some options are more effective than others at differentiating students. The design of MTF options needs to be informed by more research so that the most efficient indicators, and combinations of indicators, for differentiating students are included.

5.5.3 CR items

The CR items should be able to elicit responses at multiple levels of the learning progression. If an item can only elicit responses at particular levels, we need to know which levels the item is discriminative at and then use the item appropriately. For example, if an item can only distinguish low-level responses, then the item is most appropriate for low-ability students. To find out which levels each CR item is discriminative at, the item threshold parameters were analyzed. If the threshold parameters are not in the correct order, this may indicate that the item is not discriminative at certain levels or that the item classifies students inaccurately. Knowing this allows us either to modify the CR item so that it is discriminative over a wider range of levels or to use it more appropriately. The results indicate that most of the CR items are effective at differentiating students among levels. Figure 5.5 shows the item threshold parameters of all the CR items. Among all 42 CR items/questions (25 CR items and the 17 CR questions from the OMC+CR and MTF+CR items), the item threshold parameters of 33 items are in the correct order. The threshold parameters (d1, d2, d3) of 9 items are very close to each other or are not in the correct order, which suggests that there are too many or too few responses at a particular level compared with the proportion of responses at that level on other items.
So the item does not accurately classify students at these particular levels. Figure 5.5 Item threshold parameters of all the CR items The threshold parameters of the following nine items are not in the correct order: AIRBODY, BODYTEMP_E, BODYTEMP_H, BREADMOLD, ECOSPHERE and ENERPEOP, TREEDECAY_C, LAMP and OCTAMOLE. Figure 5.6 below shows the item characteristic curves by categories for these nine items. 85 Figure 5.6 The characteristic curves by category of nine CR items Note: the black line represents the probability of getting score 1; red line for score 2, green line for score 3 and blue line for score 4. The AIRBODY item has 87% of the students at level 2, 9% of the students at level 3 and 4% of the students at level 4. The majority of the students are at level 2. Though this item was administered at middle and high school level, it did not elicit many high level responses. Students gave very brief responses such as “lung” or “blood” to sub-question A. They explained the process as breathing in oxygen and breathing out carbon dioxide, which seemed less than they might actually know. Though some sub-questions of this item ask students to explain “how”, these questions still did not work well to elicit detailed responses at cellular or atomicmolecular scale. These questions may need to be revised to elicit higher-level responses. For 86 example, subquestion C can be revised as “where does the carbon in carbon dioxide that people breathe out comes from?” AIRNBODY Humans get oxygen from the air they breathe in, and they breathe out carbon dioxide. a. Where in the body does the oxygen get used? b. How does the oxygen get used? c. How is the carbon dioxide produced in the body? d. Does breathing help your body use energy? If so, how? Four items, BODYTEMP_E, BODYTEMP_H, BREADMOLD, and OCTAMOLE have much fewer level 2 responses but more level 3 responses compared to the other items, so the step difficulties are not in the correct order. These items do not classify level 2 and 3 clearly. Both the BODYTEMP_E and the BODYTEMP_H items ask students to identify the energy source of human body heat. Many students could identify “food” as the energy source of human body heat but did not provide further explanations in terms of how food was used in human body to provide energy, and these responses were coded as Level 3. So some students who might be actually at level 2 got level 3 for these two items. The BREADMOLD and the OCTMOLE items are mostly discriminative at the higher levels such as level 3 and 4, so there are relatively fewer level 2 responses. The other four items, ECOSPHERE, ENERPEOP, TREDECAY and LAMP did not classify students between level 2, 3 and 4 precisely. There were relatively fewer level 3 responses and most responses were scored either as level 2 or level 4. All these four items assess students’ understanding of energy transformations. The problem with these items might due to the scoring rubrics do not clearly distinguish the adjacent levels or due to coding mistakes. The 87 general level descriptions of students’ understanding of energy transformation at level 2, 3 and 4 are as follows: Level 4: Students clearly distinguish matter from energy and knows energy degradation. Level 3: Students do not consistently distinguish matter from energy, mixing forms of matter with forms of energy Level 2: Students do not clearly distinguish energy from other enablers, so they may identify sunlight or other enablers (or nothing) as inputs and heat or other products (or nothing) as outputs. 
Based on the level description, both level 2 and level 3 students confuse matter and energy. They cannot trace energy consistently. The distinction between level 2 and level 3 is not clear in the level description. The level description does not emphasize what a level 3 student can do but a level 2 student cannot. So the responses were coded as either level 2 or level 4. The results of CR items indicate most of the CR items are effective to differentiate students among levels. However, the threshold parameters (d1, d2, d3) of some items are very close to each other or not in the right order, which suggests that the items do not accurately classify students at some levels. The items and the scoring for these items need to be reviewed to classify students more accurately. When the Environmental Literacy project was developing the scoring rubric, the levels that the item was not discriminative at on conceptual grounds were specified in the scoring rubric. Then the responses of this item were classified into the discriminative levels only. As discussed previously in this section, the statistical analyses suggested additional items that had too few or too many responses at a particular level. The statistical analyses can help to identify additional items that might not be discriminative at a particular level. Then all the responses 88 coded at that level should be recoded into the adjacent levels to reduce the measurement error. Sometimes, there are too few or too many responses at a particular level due to the ambiguity of the scoring rubric or due to coding mistakes. Then the rubric needs to be revised and the coding mistakes need to be corrected. In summary, this chapter discusses how to design a learning progression-based test using items in the OMC, MTF and CR format. Though the dimensionality analysis suggests that 3PCM fits better than the 1PCM, there are moderate correlations among these three latent dimensions and strong correlations between the abilities estimated using the 1PCM and the abilities in each dimension estimated using the 3PCM. These high correlations indicate that one dimension is sufficient to describe student’s ability. The additional dimensions may only capture subtle differences among item format. A unidimensional model is supported by the cognitive theories underlie the assessment design. The results show most items fit well with the 1PCM model. This indicates in general, the learning progression framework and the unidimensional assumption are supported. The levels defined by the learning progression framework reconcile with the ability defined by the IRT scale. This is evidence that the learning progression framework is valid. This chapter also discusses how to design a learning progression-based test to accurately locate students’ understanding within the learning progression achievement levels. First, the boundaries between levels are defined on the IRT scale and then items in multiple formats are selected to achieve the information criterion at all the defined boundaries. This ensures the accuracy of the classification. One important design criterion for the learning progression-based item is that ideally, students should be at the same level across items. So the items threshold parameters should be similar. This chapter provides the calibrated item parameters to inform the 89 future development of carbon cycle items. 
Finally, the chapter discusses how to design OMC and MTF items that predict students' CR levels more accurately, and identifies the ranges of levels over which the CR items are valid.

Chapter 6 Design of a Test to Assess a Particular Process or Practice

6.1 Research purpose and procedure

The carbon cycle assessment is designed to assess students' understanding of six carbon-transforming processes: animal functioning, animal growth, plant growth, decomposition, combustion, and cross-process events. The assessment also assesses five practices: macroscopic, mass/gases/amount, energy/causes, microscopic, and large-scale practices (see Section 2.2.3 for a detailed description of these processes and practices). These are the key carbon-transforming processes in socio-ecological systems, and the practices support reasoning about the carbon-transforming processes in complex systems. Appendix A lists the practice and the process that each item measures. In some cases, we want to know students' understanding of a particular process or practice. This can provide teachers with information about students' performance on that process or practice so that they can adjust their teaching. In order to design such a test, the dimensionality of the assessment data is investigated first, to see whether items of different processes or practices assess the same ability or different abilities, and what the correlations among the constructs are. If items of different processes or practices measure students' ability on different dimensions, then to assess students' understanding of a particular process or practice, only items of that process or practice should be used.

The following steps were followed to investigate how to design a test to assess a particular process or practice. First, a dimensionality analysis was conducted to see whether items of different processes/practices assess the same latent construct. In the current item pool, there are 42 CR items. The hypothesis made by the Environmental Literacy project is that students use the same ability to respond to these items even though they assess different processes and practices. In other words, students' understandings of different processes are highly correlated and can be considered the same type of ability, as can their abilities for the different practices. Dimensionality analysis can test this hypothesis. To evaluate the dimensionality of the item response data, the unidimensional PCM and multidimensional PCMs were applied to the data and the model fits were compared. Both within-item and between-item multidimensional models were used to account for the structure of the assessment. The dimensions were defined in terms of the carbon-transforming processes and/or the scientific practices. In total, four models were fit to the data to explore the correlations among the constructs. The model parameters were estimated using the ConQuest software. Figure 6.1 is a graphical representation of these four models. Model 1 is the unidimensional PCM; the assumption is that there is one general latent construct that determines students' performances on all items. Model 2 is a between-item multidimensional PCM whose dimensions are defined in terms of processes. Model 3 is a between-item multidimensional PCM whose dimensions are defined in terms of practices. The last model, Model 4, is a within-item multidimensional PCM whose dimensions are defined by both process and practice. Each item is associated with one process and one practice.
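The distinction between the between-item and within-item models can be made concrete through the item-by-dimension loading (Q) matrix that multidimensional IRT software works with. The R sketch below is purely illustrative: the item names and dimension assignments are invented, and ConQuest specifies this structure through its own command syntax rather than an R matrix.

```r
# Q matrix: rows are items, columns are latent dimensions; a 1 means the item
# is scored on that dimension. Small, invented four-item example.

# Between-item model (as in Model 2): every item loads on exactly one dimension.
Q_between <- rbind(
  item_01 = c(plant_growth = 1, animal_function = 0, combustion = 0),
  item_02 = c(plant_growth = 0, animal_function = 1, combustion = 0),
  item_03 = c(plant_growth = 0, animal_function = 0, combustion = 1),
  item_04 = c(plant_growth = 0, animal_function = 0, combustion = 1)
)

# Within-item model (as in Model 4): each item also loads on one practice
# dimension, so a row can contain more than one nonzero entry.
Q_within <- cbind(Q_between,
                  macroscopic  = c(1, 0, 1, 0),
                  energy_cause = c(0, 1, 0, 1))

Q_within   # this item-by-dimension structure defines the multidimensional PCM
```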
Figure 6.1 Graphical representation of the unidimensional and multidimensional models
Model 1. Unidimensional PCM
Model 2. Multidimensional between-item model (processes as dimensions)
Model 3. Multidimensional between-item model (practices as dimensions)
Model 4. Multidimensional within-item model with processes and practices as dimensions

The names of the dimensions are listed below:

Process dimensions
Dimension 1—Plant growth
Dimension 2—Animal growth
Dimension 3—Animal function
Dimension 4—Combustion
Dimension 5—Decomposition
Dimension 6—Cross-processes

Practice dimensions
Dimension 7—Macroscopic
Dimension 8—Mass/gases/amount
Dimension 9—Energy/causes
Dimension 10—Microscopic
Dimension 11—Large-scale practices

Second, the fits of these four models were compared. Chi-square goodness-of-fit tests were performed on the differences between the model deviances. In addition, the model fit was compared in terms of how well each model is supported by the cognitive theories. The item parameters were then calibrated by fitting the most appropriate model to the data. Third, items were selected to assess a particular process or practice based on the dimensionality results. If the dimensionality is greater than one in terms of process, then only items of a particular process or practice should be selected to assess students' understanding of that process or practice. The selected items need to yield the desired amount of test information at the boundaries for the process or practice being measured. These item selection and test design procedures are similar to those conducted in Chapter 5.

6.2 Dimensionality in terms of process or practice

The results of the chi-square goodness-of-fit tests are given in Table 6.1. From the results, one can see that Model 2 is the most appropriate model. The following observations can be made based on Table 6.1:
- Model 3 vs. Model 1: There is no significant difference between Model 1 and Model 3. The unidimensional model, Model 1, fits the data as well as Model 3.
- Model 2 vs. Model 1: Model 2 fits the data significantly better than Model 1.
- Model 4 vs. Model 1: Model 4 fits the data significantly better than Model 1.
- Model 2 vs. Model 4: A further comparison between Model 2 and Model 4 shows no significant difference; Model 2 explains the data as well as Model 4. Since Model 2 has fewer parameters, it is more parsimonious than Model 4. Therefore, Model 2 is the best model among these four models.

Table 6.1 Goodness-of-fit tests among the four models

Comparison            Difference between deviances   Difference in number of parameters   p-value
Model 2 vs. Model 1   94.217                         20                                   <.001
Model 3 vs. Model 1   5.637                          14                                   0.975
Model 4 vs. Model 1   114.179                        65                                   <.001
Model 2 vs. Model 4   19.962                         45                                   0.999

The dimensionality analyses suggest that the between-item multidimensional model with processes as dimensions fits the data best. This suggests that the data are multidimensional in terms of processes but not in terms of practices. In general, the increase in model fit from applying a multidimensional model is not large. There are moderate to strong correlations among the process dimensions and strong correlations among the practice dimensions, which suggests that the different practices can be considered the same construct. Table 6.2 and Table 6.3 give the correlations among the process dimensions and among the practice dimensions, based on the Model 2 and Model 3 results, respectively.
Table 6.2 Correlations among process dimensions

                     D1       D2       D3       D4       D5       D6
D1 Plant growth
D2 Animal growth     0.675
D3 Animal function   0.870    0.732
D4 Combustion        0.769    0.857    0.799
D5 Decomposition     0.859    0.633    0.808    0.717
D6 Cross-processes   0.854    0.680    0.857    0.754    0.845
Variance             1.985    1.846    0.910    2.184    1.763    1.067

Table 6.3 Correlations among practice dimensions

                       D7       D8       D9       D10      D11
D7 Macroscopic
D8 Mass/gases/amount   0.934
D9 Energy/causes       0.879    0.886
D10 Microscopic        0.894    0.907    0.869
D11 Large-scale        0.789    0.756    0.755    0.798
Variance               2.179    1.666    0.665    1.046    1.594

In general, there are moderate to high correlations among the process and practice dimensions. This suggests that a single latent construct explains most of the variance in students' responses. One possible reason for the high correlations among the practice and process dimensions is that the student sample does not show much variation on these dimensions: even if the items are sensitive to differences between dimensions, the item response data do not show strong multidimensionality. The high correlations among the practice dimensions may also suggest that understanding of the different practices is in fact strongly psychologically linked; for instance, the ability to explain at the microscopic scale is associated with the ability to explain changes at the macroscopic scale. The process dimensions are highly correlated as well, but the correlations are not as strong as those among the practice dimensions: the same student's responses to items of different processes differ to some extent. In this study, it is hard to tell whether the high correlations among the process or practice dimensions should be attributed to psychological links among the dimensions or to the lack of variation of the sample on these dimensions. This can be tested in the future with a sample that does vary on these dimensions (e.g. a group of students who have learned photosynthesis and another group who have not). Since students' understanding of different processes might differ, the design of a test to assess a particular process is discussed in the following section.

6.3 Design a test to assess a particular process

There are 42 CR items in total: six plant growth items, three animal growth items, ten animal function items, six combustion items, five decomposition items and twelve cross-process items. To assess a particular process, only items of that process should be selected. Figure 6.2 shows, for each process, the information obtained by using all of the items of that process in the current item pool. There are more cross-process items than items of any other process, so it is not surprising that the information of the cross-process items is higher than that of the items of the other processes. The figure also shows that, except for the cross-process items, the test information for each process cannot reach 5 across the boundaries even if all of its items are selected.

Figure 6.2 Information of all items of each process

Table 6.4 below shows the item difficulty parameters and threshold parameters derived from the Model 2 results. The difficulty parameter (b) indicates the relative difficulty of an item among all items of the same process. The threshold parameters are the boundaries between levels.
Since the design criterion of learning progression-based item is to have similar thresholds across items, the items with boundaries that are away from other items need to be reviewed. Additional items need to be developed to form a test that can reach the information criterion at the defined boundaries. The procedures are similar to those discussed in Chapter 5. 98 Table 6.4 The item parameters of each process Process Process 1: Plant Growth Process 2: Animal Growth Process 3: Animal Function Process 4: Combusti on Process 5: Decompo sition Process 6: Crossprocess Item label b ACRON CARPATH ENERPLNT GRANJOHN_P PLANGAS THINTREE CARBODY EATAPPLE INFANT AIRBODY ANIMWINT BODYTEMP_E BODYTEMP_H EATBRTHE ENERPEOP GIRLRUN_AB GIRLRUN_C GLUGRAPE WTLOSS BRNMATCH_E BRNMATCH_MA BRNMATCH_MB CARGAS OCTAMOLE WAXBURN APPLEROT BREADMOLD GRANJOHN_D TREEDECAY_AB TREEDECAY_C AIREVENT CONNLIFE CUTTREE DEERWOLF DIFEVENT ECOSPHERE GLOBWARM_M GLOBWARM_H GROWTH KLGSEASON LAMP TROPRAIN 0.534 0.435 -0.564 -0.305 0.568 -0.668 0.582 -0.718 0.135 1.329 -1.125 0.464 -0.434 0.680 0.171 0.315 -0.283 -0.738 -0.379 -1.079 -0.025 0.018 -1.116 0.516 1.685 -0.485 -0.243 0.623 0.148 -0.042 -0.459 -0.685 -0.108 1.276 0.036 0.698 -0.401 -0.814 1.152 0.762 -0.354 -1.103 d1 -2.560 -4.154 -2.727 -2.496 -1.924 -3.576 0.177 -3.473 -1.679 1.710 -2.001 -0.909 -0.860 -1.345 -1.160 -1.664 -0.283 -3.299 -3.140 -2.232 -2.850 -3.052 -3.762 0.516 0.921 -1.719 -0.814 -2.440 -2.581 -0.850 -1.358 -2.996 -0.570 0.548 -1.383 -1.472 -2.387 -3.288 -1.221 -1.166 -1.405 -4.628 99 d2 0.958 2.479 -0.728 1.886 3.060 -0.801 0.988 -1.683 1.950 0.949 -0.249 -1.178 -1.622 1.141 1.278 -0.648 d3 3.203 2.979 1.764 -0.905 -0.305 0.074 0.192 0.312 -0.359 -0.507 2.449 0.748 -1.037 1.920 1.021 0.390 0.440 1.626 0.355 2.003 -0.373 1.765 -0.253 -0.633 1.201 0.491 0.063 -0.749 1.989 2.308 2.373 3.003 3.481 1.179 2.244 0.393 3.255 2.584 2.795 0.774 1.540 1.122 2.388 2.003 0.335 1.865 1.799 1.436 1.480 3.477 2.963 0.280 2.069 The column headers are: b is the item difficulty parameter d1 is the first item threshold parameter. It is the cutting point on the ability scale between score 1 and 2 d2 is the second item threshold parameter. It is the cutting point on the ability scale between score 2 and 3 d3 is the third item threshold parameter. It’s the cutting point on the ability scale between score 3 and 4 In short, science assessments often require numerous types of knowledge and skills, which are likely to be multidimensional. To design a learning progression-based science assessment, we need to understand whether the assessment measures a single construct or several constructs and how items are associated with the constructs being measured. Only items that assess the construct of interest should be used in the test. To scrutinize the dimensionalities among different processes/practices, we need data from student samples that show variances among the process/practice dimensions to investigate the links among the process/practice dimensions. The current assessment does not have enough items to assess a particular process accurately. Two things need to be done to design a test for a particular process. First, according to the design criterion of learning progression-based items, items that have thresholds away from the other items of the same process need to be examined. Second, items that have thresholds close to the estimated boundaries need to be developed to form a test that can reach the test 100 information criterion at the boundaries. 
The current items can be used as a reference to develop new items.

Chapter 7 Item Characteristics

7.1 Research purpose and procedures

Items were evaluated quantitatively in Chapters 5 and 6 in terms of item fit indices, discrimination indices, item difficulty and threshold parameters. According to the quantitative evaluation, some items are better than others: they are more discriminative, fit well with the model, and have thresholds that are close to the estimated boundaries. In this chapter, item characteristics are analyzed to find the characteristics that are related to good item statistics. These characteristics can be used as guidelines to design learning progression-based items in the future. The characteristics of the items are coded in terms of the following aspects:
A. Whether the item includes picture(s) or not
B. The number of sub-questions included in the item
C. The familiarity of the example(s) to students
D. The process that the item assesses
E. The practice that the item assesses
F. The scale of the item (e.g. microscopic item, macroscopic item, large-scale item)
In addition, each item received a qualitative rating from two science education researchers. The qualitative evaluation results are analyzed to provide additional suggestions for writing good learning progression-based items.

7.2 Write good learning progression-based items: How item statistics are related to the item characteristics

I investigated how the item characteristics above were related to the item statistics. The results from t-test and ANOVA analyses indicated that there were no significant group differences in item statistics (e.g. item difficulty and item step thresholds) in terms of the following item characteristics: A. Whether the item includes picture(s) or not; B. The number of sub-questions included in the item; C. The familiarity of the example(s) to students; D. The process that the item assesses; and E. The practice that the item assesses. However, the item difficulty and the location of the step thresholds were significantly different among items of different scales. The difficulties of microscopic items were significantly higher than those of the macroscopic or large-scale items: the mean difficulty of the microscopic items was around 0.7, while the mean difficulty of the macroscopic items and that of the large-scale items were both around -0.2. About half of the microscopic items were not discriminative at level 1. For example, microscopic items such as CARBPATH, CARBBODY and GRANJOHN could not elicit level 1 responses. Students whose understandings were constrained to the macroscopic scale were not able to explain the movement of atoms and molecules at all. One fourth of the macroscopic items were not discriminative at the highest level, level 4. Since there are no general connections between the superficial item characteristics listed above and the item statistics, except for the scale of the item, a more detailed item characteristics analysis is conducted in the following section to find rules for writing items that will result in good item statistics.

7.3 Write good learning progression-based items: How item statistics are related to suggestions from qualitative evaluation

The following suggestions might help the items perform better, according to both the feedback provided by a group of science education researchers and the quantitative results.

7.3.1 CR items

First, some items are not discriminative at some achievement levels.
It’s important to notice the levels that the item is discriminative at and classify responses into those levels only. This will improve the fit statistics and make the thresholds in the correct order. In Section 5.5.3, nine CR items are identified that do not have item threshold parameters in the correct order. This may due to the item is not discriminative at a particular level. If so, collapsing score categories will help to make the thresholds in the correct order. Second, as mentioned in section 7.2, many macroscopic items are not valid for level 4 and microscopic items are often not valid for level 1. Depending on the students who will take the item, the same question can be asked at different scales to measure students more precisely. For example, the OCTAMOLE item and the CARGAS item both assess the concept of the combustion of gasoline. These two items basically assess the same concept. However, the OCTAMOLE item is proposed at microscopic scale and the CARGAS item is proposed at macroscopic scale. The difficulty of the OCTMOLE item is 0.5 and the difficulty of the CARGAS item is only -0.8. The item information curves are in the graph below. The OCTAMOLE item has information peaked in the high ability range and the CARGAS has relatively flat information curve over a wider range of abilities. Therefore, depending on the students who will take the item, the same question can be asked in different scales to measure students more precisely. 104 OCTAMOLE Gasoline is mostly a mixture of hydrocarbons such as octane: C8H18. Decide and circle whether each of the following statements is true (T) or false (F) about what happens to the atoms in a molecule of octane when it burns inside a car. T F Some of the atoms in the octane are incorporated into carbon dioxide in the air. T F Some of the atoms in the octane are incorporated into air pollutants such as ozone or nitric oxide. T F Some of the atoms in the octane are converted into energy that moves the car. T F Some of the atoms in the octane are burned up and disappear. T F Some of the atoms in the octane are converted into heat. T F Some of the atoms in the octane are incorporated into water vapor in the atmosphere. a. When the gas tank is empty and the car stops, where is the energy that was in the gasoline? b. What was the original source of energy of gasoline? c. Is air needed for the car to use the gasoline? If so, how does the air change as the car runs? 105 CARGAS When you are riding in a car, the car uses gasoline to make it run. Eventually the gasoline tank is empty. a. What happens to the materials the gasoline is made of when the car uses the gasoline? b. Is air needed for the car to use the gasoline? If so, how does the air change as the car runs? c. Where does energy come from to make the car run? Figure 7.1 Item information curve of OCTAMOLE and CARGAS Third, it is pointed out in Section 7.2 that some of the macroscopic items are discriminative at all four levels while others are not discriminative at level 4 as indicated by their thresholds. Then, the question is what are the item characteristics that make some items work better than others? One feature of the items that are discriminative at all four levels is that these items explicitly require students to trace matter or trace energy. The EATAPPLE item and the GRUGRAPE item below are good examples of this. 106 EATAPPLE An apple is eaten by a boy and digested in his body. a. What happens to the apple when it is digested? b. 
Do you think the apple the boy ate can help all parts of his body (like his fingers) to grow? Please circle one: YES NO If you answered YES, please explain how can an apple that goes to the boy’s stomach help his fingers to grow. If you answered NO, please explain how the boy’s body makes his fingers grow. GLUGRAPE The grape you eat can help you move your body parts such as your legs. a. Please describe how the substances from the grape provide energy to move your legs. Describe as many intermediate stages and processes as you can. b. Can the substances of the grape also be involved in helping to keep your body warm? Please explain your answer. These two questions successfully elicited both low and high level responses. The EATAPPLE item asks students to trace apple through the body and the GLUGRAPE item asks students to trace energy from the grape people eat to energy that help people to move body parts. These questions have a clear focus on matter/energy transformations and encourage students to explain the transformations. Some other items that measured the same concept were not able to elicit high-level responses. For example, the INFANT item also measured the concept of animal growth. Same as the EATAPPLE item, it was administered to students at all three grade levels. However, the INFANT item mainly elicited level 1 and 2 responses (97%). The question is not as focused as 107 the EATAPPLE item and it does not explicitly require students to trace matter. INFANT Do you think the baby girl will need any of the following things to grow and gain weight? Please circle Yes or No and explain your choice. If you circled yes, explain how the girl’s body uses it. What happens to it inside the inside the girl’s body? Sunlight Yes No Water Yes No Air Yes No Food Yes No The response that the same student gave for the INFANT item tended to be at lower levels than the one he/she gave for the EATAPPLE item, though these two items measure the same concept generally. For example, below are the responses that a high school student gave to the INFANT item and the EATAPPLE item respectively: INFANT: The girl does not need sunlight. She needs water to stay hydrated and live. The girl needs air to breath and respire. She needs food to grow. EATAPPLE: The body removes all of the helpful vitamins, nutrients, and/or helpful substances out of the apple and the rest becomes waste. The apple can help all parts of his body to grow. The energy from the apple goes all over the body because the body cells pick of the energy from the apple in the villi and take it to cells all around the body. The student’s response to the INFANT item was at level 2, but his/her response to the EATAPPLE item was at level 3. The EATAPPLE item explicitly requires students to trace matter from the apple to human body parts. This helped to elicit higher level responses. Fourth, it is better to propose a specific problem than a general problem to elicit 108 responses focusing on the measured concept. For example, both the KLGSEASON and the CUTTREE item assess students understanding of global warming and how photosynthesis is related to global warming. The KLGSEASON specifically focus on the changes in concentration of CO2 in the atmosphere, it asks students why the atmospheric carbon dioxide levels decreasing in the summer and fall every year and increasing in the winter and spring. But the CUTTREE item asks students why cutting down trees increases global warming. 
It requires students to make two connections, one between the CO2 level and global warming, and the second between trees and CO2. Some students fail to make one of the connections and cannot provide any related answer to this question. The question would be better if it asks students why cutting down tree will increase the CO2 concentration. The item discrimination (the correlation between students’ scores of the item and their total scores) of the CUTTREE item is .38 and the discrimination of the KLGSEASON is .54, which is much higher. Both items fit the PCM model. The CUTTREE item is not valid for level four and the thresholds are -0.24 (d1) and 0.30 (d2). The KLGSEASON item is valid for all four levels and the thresholds are -1.05 (d1), 0.30 (d2), and 2.34 (d3). The KLGSEASON item did a better job to differentiate students over a wider ability range. Fifth, some types of CR items cannot elicit detailed explanations so these items do not measure students as precisely as others. For example, when the item gives students several examples and ask students to explain each example, students often do not provide enough details for each example. The CARBPATH item asks students where carbon can be found inside a tree. The item gives students three locations (leaves, wood, and roots) and asks them to judge whether 109 they can find carbon at each location and how carbon gets there. Students often do not give detailed explanation about how carbon gets to each location. As a result, 96% of the responses are at level 1 or 2, and only around 4% of the responses are at level 3 or 4. This makes the item invalid to measure high-level students. Similarly, students often do not provide detailed explanations to items that require them to make comparisons, connections or analogies among examples. Their explanations usually focus on the observable similarities or differences between the events and ignore deeper connections between events at atomic-molecular scale. For example, the GROWTH item asks students to think of ways that the plants and animals are different in ways they use water, air and nutrients to grow. For example, a typical student’s response is that “plants use water to suck it up with their roots, animals drink it. Plants make air, animals breathe in air. Plants use nutrients to grow, animals eat nutrients.” There are not many deep descriptions of what is happening and how food might be different for plants and animals. In addition, there are some other general rules to follow when writing CR items. For example, the item needs to be scientifically rigorous, irrelevant information should not be provided, words or phrases that may involve construct irrelevant difficulty should not be included in the item stem. Researchers who evaluated the items pointed out some of our items had these problems. For example, the stem of the TROPRAIN item has two words that may produce construct irrelevant difficulty, one is "recycled" and the other is "ecosystem". These words may increase the difficulty for students to understand the item. As a result, this item has a lot of missing responses, which suggests the item is not well received by examinees. 110 7.3.2 OMC and MTF items This study only includes 7 OMC and 11 MTF items. It’s difficult to draw conclusions based on this small sample of items in terms of how item characteristics are related to the item statistics. 
Section 5.4.2 discussed how to address the misfit of the OMC and MTF items, and how to design better OMC and MTF options was discussed in Sections 5.5.1 and 5.5.2.

7.4 Recommendations for writing items in the future

Based on the above analyses, the following recommendations can be made for writing good learning progression-based items:
• The scale of the item has an impact on the item difficulty and the discriminative range. When the question is asked at the microscopic scale, it often cannot discriminate lower level students. When the question is asked at the macroscopic scale, it often cannot discriminate high level students. So depending on the target group, the same question can be asked at different scales.
• Since some items are not discriminative at particular achievement levels, it is important to notice the levels at which an item is discriminative and classify responses into those levels only.
• Items that explicitly require students to trace matter or trace energy are more likely to be discriminative at all achievement levels.
• Items that pose concrete and specific problems usually work better to elicit detailed and focused responses.
• To gather more detailed responses, avoid asking students to explain too many examples in one item.
• Items that ask students to make comparisons, connections, or analogies need to be carefully designed, since students will make comparisons, connections or analogies in all possible aspects, which may not address the construct being assessed.
• According to the analyses of the OMC and MTF options in Chapter 5, OMC and MTF options need to be designed carefully.
  o When the OMC options are associated with a restricted range of learning progression levels, the OMC item often under- or over-predicts students' real learning progression levels. So in order to design OMC options that can better predict students' real levels, the options need to represent understanding at multiple levels.
  o The design of MTF options needs to be informed by more research to include the most efficient indicators and combinations of indicators to differentiate students.
  o The OMC and MTF options should be designed based on the ability of the examinees. Options need to be discriminative for the examinees who take the item.
  o OMC and MTF items can be treated as dichotomous items (1 if the student chooses the best answer or makes all correct choices, 0 otherwise) to provide information about students' achievement level. In general, the recoded MTF items are difficult and the recoded OMC items are easy. So MTF items can be used to classify high level students and OMC items can be used to classify low level students.

Chapter 8 Discussion and Conclusions

8.1 Summary of main findings and implications

The focus of this dissertation is to investigate how to design learning progression-based science assessments that accurately classify students into achievement levels. It investigates 1) how to design OMC, MTF and CR items to classify students among levels, 2) how to design a test for a particular process or practice and 3) what item characteristics support the use of items to classify students among levels. The following are the main findings from the investigation of these three research foci.

8.1.1. Items in different formats are associated with one main construct but also measure slightly different aspects of the construct
The analysis suggests that OMC, MTF and CR items are associated with one main construct but may also measure slightly different aspects of the construct. The data from most of the items fit well with the unidimensional model. This suggests that a single construct explains most of the variance in students' performances on the carbon cycle assessment. The abilities defined on the IRT scale are consistent with the abilities defined by the learning progression levels, which provides evidence for the validity of the learning progression framework and the single construct defined by the four achievement levels. So the unidimensional hypothesis of the carbon cycle learning progression framework is generally supported. There are, however, additional dimensions in terms of item format that can explain some amount of the variance in student performance. There are moderate correlations among students' abilities in the CR, OMC and MTF dimensions and high correlations among students' unidimensional and multidimensional ability estimates. So the unidimensional model might be sufficient to describe the data, but multiple dimensions are necessary when we want to account for the nuisance dimensions for a particular purpose. This finding can inform assessment design. Depending on the purpose of the assessment, items in these formats can reasonably be used to measure what is intended to be measured. For example, to measure students' general understanding of the carbon cycle, items in all these formats can be used. If the assessment focuses on students' abilities to organize, integrate and synthesize their knowledge and their abilities to solve novel problems, then the CR format is preferred. In addition, items in each format need to be improved in certain ways.

8.1.2. Improve the quality of the OMC, MTF and CR items

The OMC and MTF questions have some problems indicated by the discrimination indices, item fit indices, and item thresholds. In terms of the OMC items, students' OMC levels can predict their CR levels to some extent: students' OMC levels are significantly correlated with their levels on the paired CR questions. But in some cases, the OMC level over-predicts or under-predicts the CR level. Often, OMC levels over-predict CR levels. When all the OMC options represent understanding at low levels, the OMC levels may under-predict students' real achievement levels. In terms of the MTF items, the set of T or F questions is useful for identifying students at very high achievement levels; those students often make all correct choices on the T or F questions. The number of correct T or F choices students make can indicate, to some extent, the ability level of students who are in the middle or low ability range. The MTF items that have low discrimination indices or show misfit with the PCM model need to be revised. Some T or F options are more effective than others in differentiating students. Therefore, the design of the MTF options needs to be informed by more research to include the most efficient indicators and combinations of indicators to differentiate students. Most of the CR items fit well with the unidimensional model and are effective for differentiating students among levels. The threshold parameters (d1, d2, d3) of a few CR items are very close to each other or not in the correct order. This suggests that these items do not accurately classify students at some levels.
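This kind of screening is easy to script. The sketch below is illustrative only (item labels and parameter values are invented): it flags items whose d1-d3 estimates are out of order or closer together than an assumed minimum gap, the two patterns that signal inaccurate classification at some level.

```python
MIN_GAP = 0.5   # assumed minimum useful spacing between thresholds, in logits

items = {                        # hypothetical {label: (d1, d2, d3)}
    "CR_A": (-2.4, 0.9, 3.1),
    "CR_B": (-0.8, -1.1, 1.6),   # disordered: d2 below d1
    "CR_C": (0.2, 0.4, 2.5),     # d1 and d2 nearly coincide
}

for label, (d1, d2, d3) in items.items():
    if not (d1 < d2 < d3):
        print(f"{label}: thresholds disordered -> review the item and rubric, "
              f"or collapse the affected score categories")
    elif min(d2 - d1, d3 - d2) < MIN_GAP:
        print(f"{label}: adjacent thresholds within {MIN_GAP} logits -> "
              f"weak separation between the corresponding levels")
```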
This can be improved by adjusting the scoring rubrics, correcting coding mistakes, or recoding the data into the levels at which the items are discriminative.

8.1.3. Use items in multiple formats to meet the test information criterion

To detect the difference between two groups of students, I set 5 as the information criterion at the boundaries between levels on the ability scale. This amount of test information is sufficient to detect the difference between two groups of students (30 students in each group, with a difference of 0.93 between the group mean ability estimates) at a significance level of 0.05 and a power level of 0.8. To accurately classify students into levels, first, the boundaries between levels on the IRT scale were defined by using the means of the threshold parameters across a set of good items. Then items were selected to achieve high information at these defined boundaries. To reach test information above 5 at the boundaries, we need 16 items from the current item pool. When the thresholds of the items are close to the defined boundaries, only 14 items are needed to reach the same amount of information (above 5) at the defined boundaries. So adjusting rubrics to obtain similar thresholds will reduce the number of items needed to reach the same amount of test information at the defined boundaries. Most importantly, one design criterion of learning progression-based items is that, ideally, students at the same ability level will get the same level across all items. This means that the item thresholds (d1, d2 and d3) should be similar across items. The item threshold parameters of our current items vary a bit, so the items and the scoring need to be reviewed to adjust the thresholds to be similar across items. Our current assessment has about 10-12 items on each form, which is not too long to be administered in one science class. If the rubrics can be adjusted to make the items classify students more consistently, and if the items can be revised to be more discriminative, then around 14 items on one test form will be sufficient to detect the difference between student groups. The analyses also suggest other ways to reduce test administration time and scoring effort: dichotomously scored items that have difficulties at the boundaries can be used. For example, using 14 dichotomous items (around 4-5 items with difficulty at each boundary) and 3 polytomous items that have thresholds at the defined boundaries will also reach the information criterion.

8.1.4. Design a test to assess a particular process or practice

To design a learning progression-based science assessment, we need to understand whether the assessment measures a single construct or several constructs and how items are associated with the constructs being measured. This study examines whether different carbon transforming processes and different scientific practices are associated with a single latent construct or with different constructs. The results show there is multidimensionality in terms of process but not practice. In general, the correlations among process/practice dimensions are moderate to strong. It is not clear whether the high correlations are due to the inherent links among processes/practices or due to the fact that the student sample does not show much variation in these process/practice dimensions. Future data are needed to examine the dimensionality in terms of process/practice in more detail.
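As one illustration of how such future data could be examined, the sketch below simulates ability estimates on two process dimensions and checks whether their correlation, after a rough correction for unreliability, stays clearly below 1. It is a hypothetical example, not the dissertation's analysis; the simulated estimates and the reliabilities are assumptions made only for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
n_students = 500

# Hypothetical person estimates on two process dimensions (say, plant growth
# and combustion); in a real analysis these would come from the fitted model.
plant = rng.normal(0.0, 1.3, n_students)
combustion = 0.8 * plant + rng.normal(0.0, 0.7, n_students)

r_observed = np.corrcoef(plant, combustion)[0, 1]

# Assumed reliabilities of the two sets of estimates, used to disattenuate r.
rel_plant, rel_combustion = 0.82, 0.78
r_corrected = min(1.0, r_observed / np.sqrt(rel_plant * rel_combustion))

print(f"observed r = {r_observed:.2f}, disattenuated r = {r_corrected:.2f}")
```

A disattenuated correlation that remains well below 1 in a sample with real variation on both dimensions would support reporting the processes separately; a value near 1 would support a single scale.

8.1.5.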
Implications from the item characteristics analysis Based on item characteristics analysis, recommendations are made at the end of Chapter 7 in terms of how to write items to achieve better item statistics. These can provide guidelines for item writers to write learning progression-based items besides the general item writing guidelines and tips that can be found in literature. Finally, the results from this study can inspire another iteration of assessment design, in which we refine the learning progression framework, modify existing items, develop new assessment tasks, collect more data and analyze the data with an appropriate statistical model to understand students’ learning progression of carbon cycle. 8.2 Discussion of the results 8.2.1. Items in different formats 1) OMC format Students’ OMC levels can predict their CR level to some extent. However, OMC level often over-predict the CR levels. The hypothesis for the discrepancy between students’ OMC levels and their CR levels is that students perform better to name the input and output of carbon transforming processes, which is assessed by most OMC questions, than to explain what happens during the processes which is assessed by most CR questions. When the OMC options mostly represent understanding at low levels, the OMC levels will under-predict students’ real achievement level. These over and under-predictions need to be noticed when using OMC items. Options at multiple levels need to be developed to reduce cases of over or under predictions. 117 Since OMC options associated with a more restricted range of learning progression levels compare to CR items, the OMC items do not measure students as precisely as the CR items at the extremes of the ability distribution. This result is consistent with some previous studies mentioned in the literature review. For example, Lee, Liu & Linn (2011) found that compared to MC items, CR items discriminate between high and low knowledge integration ability students much more effectively and measure a wider range of knowledge integration levels. Ercikan, Schwarz, Julian, Burket, Weber, and Link (1998) and Wilson and Wang (1995) have similar conclusions from their studies. Less discrimination of the OMC items at the low end of the ability distribution may result from guessing involved when answering OMC item. And less discrimination of the OMC items at the high end of the ability distribution might due to fewer options designed at the high levels, especially at level 4. It is difficult to write OMC options at higher achievement levels without using “science-y” terminologies. So in order to reduce measurement errors due to the low discrimination of OMC format at the high end, we need to either develop high-level options without using science-y terminologies or use OMC items mainly to measure median and lowlevel understanding. 2) MTF format The MTF items work especially well to assess students’ commitments to fundamental principles. Since MTF item allow students to select multiple answers, to answer the item correctly, students need not only to identify all the correct answer(s) but also to exclude all the incorrect answer(s). This requires students to have deep understanding of the principles being assessed and apply those principles consistently. The result shows only a small proportion of students with high abilities can correctly answer all the T or F questions. 118 The number of correct choices students made can indicate their ability level to some extent. 
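The two ways of using MTF responses described here, an all-correct dichotomous score and a number-correct count, can be derived directly from the raw response strings. The following sketch is a hypothetical illustration (the key and the response strings are invented), not the scoring procedure used in this study.

```python
# Hypothetical MTF item with six true/false statements.
key = "TFFTFT"                      # invented answer key
responses = {                       # invented response strings for four students
    "s01": "TFFTFT",                # judges every statement correctly
    "s02": "TFFTTT",
    "s03": "TTFTFT",
    "s04": "FTTFTF",
}

for student, resp in responses.items():
    number_correct = sum(r == k for r, k in zip(resp, key))
    all_correct = int(number_correct == len(key))    # dichotomous recoding
    print(student, "number correct:", number_correct, "| all correct:", all_correct)
```

Under the dichotomous recoding, credit requires both selecting the true statements and rejecting the false ones, which is consistent with the earlier observation that recoded MTF items behave like difficult items.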
Some of the T or F questions distinguish students better than others. More research is needed to find out the most effective options and most efficient combinations of options. When designing the MTF items, the item writers need to design the options to avoid the situation that the students will make the same choices regardless of their ability levels. If students at different ability levels will give different T or F response strings, then the MTF item is effective. The fixed options in OMC or MTF question restrict students’ thinking. Some of the lower level distractors lowered their responses. Especially for the MTF items, since students can chose T for more than one option, they may choose T for both low level options and high level options. But when the same question is asked in an open-ended way, their responses are at higher levels. Thus, some of the MTF questions can be asked in a more open-ended so there will be less influence from the item itself on students’ responses. Students might be able to provide more focused and detailed responses then. 3) CR format Most of the CR items are effective for differentiating students. There are a small number of CR items that do not have item thresholds parameters in the right order. This is a sign that either the item or the scoring rubric is not appropriately designed. So the item or the scoring needs to be reviewed. As Anderson et al (2007) pointed out; one challenge with developing learning progression grounded items is that it is difficult to write items that provide opportunities for students to respond at multiple levels of a learning progression. The results of this study also suggest some CR items can only elicit responses at particular levels rather than all levels. For 119 example, most items proposed at microscopic scale are only discriminative for level 2 and above. These items need to be used appropriately so they are discriminative for the examinees. 8.2.2. Assessing a particular process Our previous study (Mohan, Chen, Baek, Choi, Lee, & Anderson, 2009) showed that students had a similar level of reasoning on different carbon transforming processes. So the assessment and the learning progression framework are essentially unidimensional. This study indicates that there is multidimensionality in terms of carbon transforming processes but not in terms of scientific practices. But the correlations among process/practice dimensions are moderate to strong. It is not clear whether the strong correlations are due to the inherent links among processes/practices or due to the student sample does not show much variation in these process/practice dimensions. Future data are needed to examine the dimensionalities in terms of process/practice in detail. There were some differences between the assessments used in the previous study and those used in this study. In the previous study, students answered only one or two items of each of those six processes. In this study, each student answered around three to five items of each process but the test included items of fewer processes. So students’ ability in each dimension was measured more precisely. Hence, the conclusion based on this study is more reliable. After knowing more about what are the latent constructs that the assessment measures, and how items are associated with the latent constructs, items can be selected more purposefully to measure a particular construct such as knowledge of a carbon transforming process. 
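One lightweight way to support this kind of purposeful selection is to keep the process and practice labels (like the bracketed tags in Appendix A) as machine-readable metadata alongside the item parameters. The sketch below is hypothetical; the tags and parameter values are placeholders, not the study's estimates.

```python
# Hypothetical item bank with process tags patterned on the Appendix A labels.
item_bank = [
    {"label": "PLANTGAS", "process": "plant growth",  "b": 0.6},
    {"label": "CARGAS",   "process": "combustion",    "b": -1.1},
    {"label": "WAXBURN",  "process": "combustion",    "b": 1.7},
    {"label": "APPLEROT", "process": "decomposition", "b": -0.5},
]

def items_for_process(bank, process):
    """Keep only the items tagged with the requested carbon-transforming process."""
    return [item for item in bank if item["process"] == process]

combustion_form = items_for_process(item_bank, "combustion")
print("Combustion form:", [item["label"] for item in combustion_form])
```

Tagging items this way also makes it easy to check, for each process, whether enough items sit near each level boundary before a form is assembled.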
This can provide teachers information about students’ performances on the particular process so that they can adjust their teaching of a particular unit. 120 8.3 The broader implications to learning progression-based assessments Findings from this study can generalize to other research on designing learning progression-based science assessment. Dimensional analysis is a way to provide evidence for the construct validity of the assessment. It is an approach that can be applied to the design of other learning progression-based science assessments. Science assessments often assess various knowledge and skills, for instance, knowledge of different subjects and a variety of skills, such as conceptual understanding and scientific investigation. So science assessments are likely to be sensitive to differences on multiple dimensions. The dimensionality analysis is one way to understand the construct being measured by the assessment and how items are associated with the construct being measured. Furthermore, this study establishes some typical procedures that can be followed to design other learning progression-based tests. For example, first, the test developers should set a test information criterion depending on the purpose of the test (e.g. detect the difference between individual students, or detect the difference between groups of students). Then, the test developers can use either statistical approaches (e.g. take the mean of the item threshold parameters across a set of items) or standard setting approaches to set the boundaries between levels and select items to reach the information criterion at the boundaries. Third, since for learning progression-based items, ideally, the thresholds should be similar across items, the items with thresholds different from those of the other items should be examined. In addition, findings about the item formats can inform the future use of these items in other learning progression-based assessments. We know that items in different formats might assess slightly different aspects of the construct. So depending on the goal of the assessment, an appropriate item format(s) should be selected to measure what is intended to measure. This study 121 also provided suggestions to improve the quality of the OMC, MTF and CR items respectively. These suggestions are based on both statistical analyses and qualitative item characteristic analyses. These suggestions are generalizable to the design of other learning progression-based items in these formats. 8.4 Limitations of this study and future work Four problems limit the validity or generalizability of the findings from this study and suggest directions for future work. First, the OMC, MTF and CR items are not completely independent items. Each OMC question is paired with a CR question and each MTF question is paired with a CR question as well. The OMC/MTF question and the CR question share the same item stem. So there is inherent correlation between a student’s OMC response and his/her response to the paired CR question. This may inflate the correlations between students’ ability measured by the OMC/MTF items and their ability measured by the CR items a little bit. 
But since the correlations between the abilities in the OMC/MTF dimension and the CR dimension are calculated based on students’ responses to all CR items (42 in total) instead of just the paired CR items, the correlations between the OMC/MTF and the CR dimension are not inflated much by the inherent relations between the OMC/MTF questions and their paired CR questions. Second, the partial credit model is used to fit OMC and MTF responses in this study. However, PCM might not the best model for the OMC and MTF items. Guessing was not accounted for in the model. In addition, for some OMC items, there are two options at the same level. So the PCM model may not be the best choice. Some new models are under development to fit OMC responses. For instance, Briggs & Alonzo (in press) introduce the Attribute Hierarchy Method (AHM; Leighton, Gierl & Hunka, 2004) as a relatively novel approach for modeling 122 OMC items. Since the new models are still under development and they are mainly suitable for certain types of OMC items, this study did not apply the novel models for the OMC items. Third, this study mainly focused on the assessment items rather than the students who took the items. The information about the students such as their general science achievement or the science courses they had taken were not considered in this study. Since not much information was known about the student sample, it was not clear whether the unidimensionality in terms of practice was because the practices were psychologically linked or because the student sample did not show much variation in these practice dimensions. In this study, it is difficult to tell whether the unidimensionality is attributed to the former or the latter. In the future, more information about students can be collected. And this can be tested with a different sample that does show variation in the practice dimensions (e.g. a group of students who have learned energy and another group of students who do not). Finally, this study only involves 7 OMC and 11 MTF items. So the findings about the OMC and MTF formats are based on the data from relatively small numbers of items. These findings need to be verified in the future with data from more OMC and MTF items. 123 APPENDICES 124 Appendix A Item list ACORN [Plant growth, Mass/gases/amount] A small acorn grows into a large oak tree. Where does most of the weight of the oak tree come from? (Circle the best explanation from the list below). a. From the natural growth of the tree. (Level 1) b. From carbon dioxide in the air and water in the soil. (Level 3) c. From nutrients that the tree absorbs through its roots. (Level 2) d. From sunlight that the tree uses for food. (Level 1) Explain why you think that the answer you chose is the best answer. AIREVENT [Cross-processes, Macroscopic] The 4 pictures below show 4 events happening. Do you think air is needed for each of the events? Please circle Yes or No and explain your choice. A. Plant growth B. Girl running C. Burning wood D. Food decay Events A. Plant growth B. Girl running C. Burning wood D. Food decay Does the event need air? (Circle) Yes No Yes Yes No No Yes If you circled yes, explain how is air used in the event? No Do the events that you circled “Yes” use air in similar ways or in different ways? Please explain your answer. AIRNBODY [Animal function, Macroscopic] Humans get oxygen from the air they breathe in, and they breathe out carbon dioxide. a. Where in the body does the oxygen get used? b. How does the oxygen get used? c. 
How is the carbon dioxide produced in the body? d. Does breathing help your body use energy? If so, how? ANIMWINTER (E) [Animal function, Macroscopic] During winter, many animals have problems finding food and may hibernate (sleep through the winter). These animals lose weight by spring. What do you think happens to the fat that the animal lost during hibernation? Circle True OR False for each possibility. True False The fat was turned into heat to keep their bodies warm during the winter True False The fat was turned into water and gases that the animal breathed out True False The fat was turned into waste in the digestive system and left the body as poop. True False The fat was turned into other materials in the body that don't weigh as much. True False The fat was used up in the animal’s body and disappeared. 125 Think about your responses above. Please explain as much as you can about what happens to the fat in the animal’s body during hibernation. APPLEROT [Decomposition, Mass/gases/amount] When an apple is left outside for a long time, it rots. a. What causes the apple to rot? b. The weight of the apple decreases as it rots. What do you think happens to the matter or stuff that was once in the apple? c. Is there energy involved when the apple rots? Circle one: Yes / No Please explain your answer. BODYTEMP (E) [Animal function, Energy/causes] You are playing outside in a cold winter. You find a stone on the ground. When you pick up the stone, you find that the stone is very cold. Why can people keep warm on a cold day, but stones cannot? Which of the thing(s) from the list below can help to keep people’s bodies warm? Please circle YES or NO for each thing in the list below. a. Water b. Food c. Air d. Exercise YES YES YES YES NO NO NO NO Try to write an explanation of how people’s bodies stay warm that includes ALL of the things you circles “YES” for in the list above. BODYTEMP (H) [Animal function, Energy/causes] Your body produces heat to maintain its normal temperature. Where does the heat mainly come from? Please choose the ONE answer that you think is best. a. The heat mainly comes from sunlight. (Level 1) b. The heat mainly comes from the clothes you are wearing. (Level 1) c. The heat mainly comes from the foods you eat. (Level 2) d. The heat mainly comes from your body when you are exercising. (Level 1) Please explain why you think that the answer you chose is better than the others. (If you think some of the other answers are also partially right, please explain that, too.) BREADMOLD (M, H) [Decomposition, Mass/gases/amount] A loaf of bread was left inside its plastic bag for two weeks on a balance measuring its mass. Three different kinds of mold grew on it. Assuming that the bread did not dry out, which of the following is a reasonable prediction of the weight of the bread and mold together? A) The mass has increased, because the mold has grown. (Level 1) B) The mass remains the same as the mold converts bread into biomass. (Level 2) C) The mass decreases as the growing mold converts bread into energy. (Level 3) D) The mass decreases as the mold converts bread into biomass and gases. (Level 4) Please explain your answer and indicate any important transformations. 126 BRNMATCH(E) [Combustion, Mass/gases/amount;] When a match burns, it loses weight and becomes smaller. a. What does the flame need to keep burning the match? b. What happens to the materials the match is made of as the match burns? c. Is air needed for the match to burn? 
Please circle one: YES NO If you answered YES, please explain how does the air change as the match burns. d. Where does energy needed for the match to burn come from? BRNMATCH (M, H) [Combustion, Microscopic, Energy/causes] When a match burns, the released energy a. comes mainly from the match. (Level 3) b. comes mainly from the air. (Level 2) c. is created by the fire. (Level 1) d. comes from the energy that you used to strike the match. (Level 2) e. none of the above. (Level 1) Please explain your answer. CARBBODY [Animal growth, Microscopic] Use the table below to explain where you think that carbon is found inside a person’s body and how it gets there. Location Do people have carbon in their muscles? Do people have carbon in their fat? Do people have carbon in their blood? If you circled yes, explain how the carbon gets to that location. Include molecules in your explanation if you can. Yes No Yes Yes No No CARBPATH [Plant growth, Microscopic] Use the table below to explain where you think that carbon is found inside a tree and how it gets there. Location Circle Yes or No Does a tree have carbon in its leaves? Does a tree have carbon in its wood? Does a tree have carbon in its roots? Yes No Yes No Yes No 127 If you circled yes, explain how the carbon gets to that location. Include molecules in your explanation if you can. CARGAS [Combustion, Macroscopic] When you are riding in a car, the car uses gasoline to make it run. Eventually the gasoline tank is empty. a. What happens to the materials the gasoline is made of when the car uses the gasoline? b. Is air needed for the car to use the gasoline? If so, how does the air change as the car runs? c. Where does energy come from to make the car run? CONNLIFE [Cross-processes, Large-scale practices] Explain how the following living things are connected with one another: Grass Cows Human beings Decomposing bacteria CUTTREE [Cross-processes, Large-scale practices] Some people say that cutting down trees in a forest will increase global warming. Do you agree? Circle one: YES NO Please explain your answer. DEERWOLF [Cross-processes, Large-scale practices] A remote island in Lake Superior is uninhabited by humans. The primary mammal populations are white-tailed deer and wolves. The island is left undisturbed for many years. Select the best answer(s) below for what will happen to the average populations of the animals over time. a. The deer will all die or be killed. (Level 1) b. The wolves will all die or be killed. (Level 1) c. On average, there will be a few more deer than wolves. (Level 2) d. On average, there will be a few more wolves than deer. (Level 1) e. On average, there will be many more deer than wolves. (Level 3) f. On average, there will be many more wolves than deer. (Level 1) g. On average, the populations of each would be about equal. (Level 1) h. None of the above. My answer would be: Please explain your answer to what happens to the populations of deer and wolves. DIFEVENTS [Cross-processes, Energy/causes] A. Eating a hamburger B. Filling up a car with gasoline C. Watering plants The pictures above show three things happening. A science teacher says that pictures “A” and “B” are similar events, but picture “C” is different from “A” and “B”. What reason do you think the science teacher might have for saying that? Explain as much as you can. EATAPPLE [Animal growth, Microscopic] An apple is eaten by a boy and digested in his body. a. What happens to the apple when it is digested? b. 
Do you think the apple the boy ate can help all parts of his body (like his fingers) to grow? Please circle one: YES NO 128 If you answered YES, please explain how can an apple that goes to the boy’s stomach help his fingers to grow. If you answered NO, please explain how the boy’s body makes his fingers grow. EATBRTHE [Animal function, Macroscopic] Humans must eat and breathe in order to live and grow. Are eating and breathing related to each other? (Circle one) YES NO If you circled “Yes” explain how eating and breathing are related. If you circled “No” then explain why they are not related. Give as many details as you can. ECOSPHERE [Cross-processes, Energy/causes] NASA scientists invented the EcoSphere – inside a sealed glass container, there are air, water, gravel, and three types of living things – algae, shrimp, and bacteria. Usually, these three living things can stay alive in the container for two or three years until the shrimp become too old to live. The picture above shows an EcoSphere and its contents. Do you think that the living things need to get energy from outside of the EcoSphere to keep living? Circle one: YES / NO If your answer is NO, how can the living things stay alive without getting energy from the outside world? If your answer is YES, what form of energy do they get from outside of the EcoSphere? Do you think the living things will release energy out of the EcoSphere? Circle one: YES NO Please explain your answer. If you circled YES above, what’s the form of energy that is released out of the EcoSphere? ENERPEOP [Animal function, Energy/causes] People need energy to live and grow. Which of the following is/are energy source(s) for people? Circle yes or no for each of the following and explain your answers. a. Water YES NO b. Food YES NO c. Nutrients YES NO d. Exercise YES NO e. Sunlight YES NO g. Carbon dioxide YES NO (E, M don't ask for CO2) h. Oxygen YES NO Please explain ALL your answers, including why the things you circled “No” for are NOT sources of energy for humans. ENERPLNT(E,M,H) [Plant growth, Energy/causes] Which of the following is (are) energy source(s) for plants? Circle yes or no for each of the following. a. Water YES NO b. Light YES NO c. Air YES NO d. Nutrients in soil YES NO 129 e. Plants make their own energy. YES NO Please explain ALL your answers, including why the things you circled “No” for are NOT sources of energy for plants. GIRLRUNN [Animal function, AB- Energy/causes, C-Macroscopic] The following picture shows a girl running. a. When a girl runs, how does her body make her legs move? Try to list everything that the girl’s body needs to make her legs move and explain how her body uses those things. b. Is food one of the things that the girl’s body needs to move her legs? Circle one: YES NO If you answered “YES,” try to explain how food that goes to the girl’s stomach can help her legs to move. c. Is air needed for her to run? Circle one: YES NO If you answered “YES”, explain how the air that goes into her lung helps her run? Is the air she breathes out different from the air she breathes in? GLOBWARM (M, H) [Cross-processes, Large-scale practices] a. How would you define or describe global warming? b. What events from the list below do you think could cause global warming? Events Will the event contribute to global warming? (Note: Please circle “Yes” even if you think the contribution is small) If you circled Yes, please explain why the event will contribute to global warming. 
[High] Driving trucks long distances on the highway Yes No Cutting down forests to have land for farming Yes No Running a refrigerator with electricity Yes No Using aerosol (spray can) hairspray Yes No Eating lots of beef for dinner Yes No [Middle] Driving trucks long distances on the highway Yes Cutting down forests to have land for farming Yes Burning 95 candles on your great-great-aunt's birthday cake) Yes Using aerosol (spray can) hairspray Yes No No No No GLUGRAPE [Animal function, Energy/causes] The grape you eat can help you move your body parts such as your legs. a. Please describe how the substances from the grape provide energy to move your legs. Describe as many intermediate stages and processes as you can. b. Can the substances of the grape also be involved in helping to keep your body warm? Please explain your answer. GRANJOHN [Decomposition, Plant growth, Macroscopic, Microscopic] Grandma Johnson had very sentimental feelings toward Johnson Canyon, Utah, where she and her late husband had honeymooned long ago. Because of these feelings, when she died she requested to be buried under a creosote bush in the canyon. Describe below the path of a carbon 130 atom from Grandma Johnson’s remains, to inside the leg muscle of a coyote. NOTE: The coyote does not dig up and consume any part of Grandma Johnson’s remains. GROWTH [Cross-processes, Macroscopic] Both plants and animals need air, water, and nutrients to grow. Can you think of ways that the plants and animals are different in the ways they use water, air, and nutrients to grow? INFANT(E,M,H) [Animal growth, Mass/gases/amount] Do you think the baby girl will need any of the following things to grow and gain weight? Please circle Yes or No and explain your choice. If you circled yes, explain how the girl’s body uses it. What happens to it inside the inside the girl’s body? Sunlight Yes No Water Yes No Air Yes No Food Yes No KLGSEASON [Cross-processes, Large-scale practices] The graph given below shows changes in concentration of carbon dioxide in the atmosphere over a 47-year span at Mauna Loa observatory at Hawaii, and the annual variation of this concentration. a. Why do you think this graph shows atmospheric carbon dioxide levels decreasing in the summer and fall every year and increasing in the winter and spring? b. Why do you think this graph shows atmospheric carbon dioxide levels increasing from 1960 to 2000? LAMPELEC [Cross-processes, Energy/causes] When you turn on a lamp, you can see the light. Where does the light energy come from? Trace the energy as far as you can. You may or may not fill up all of the spaces in the table. What form of energy was it? Where was it? Light energy of the light Before that… Before that… Before that… Before that… Before that… Before that… OCTAMOLE [Combustion, Microscopic] Gasoline is mostly a mixture of hydrocarbons such as octane: C8H18. Decide and circle whether each of the following statements is true (T) or false (F) about what happens to the atoms in a molecule of octane when it burns inside a car. T F Some of the atoms in the octane are incorporated into carbon dioxide in the air. 131 T F Some of the atoms in the octane are incorporated into air pollutants such as ozone or nitric oxide. T F Some of the atoms in the octane are converted into energy that moves the car. T F Some of the atoms in the octane are burned up and disappear. T F Some of the atoms in the octane are converted into heat. T F Some of the atoms in the octane are incorporated into water vapor in the atmosphere. a. 
When the gas tank is empty and the car stops, where is the energy that was in the gasoline? b. What was the original source of energy of gasoline? c. Is air needed for the car to use the gasoline? If so, how does the air change as the car runs? PLANTGAS [Plant growth, Macroscopic] Plants take in gas(es) from their environments. Please circle the gas(es) that plants take from their environments (You may circle more than one). You may also write down other gas(es). Oxygen Carbon dioxide Other:_____________ Explain what happens to the gas(es) once it is (they are) inside the plant. POTATO [Decomposition, Microscopic] A potato is left outside and gradually decays. One of the main substances in the potato is the starch amylose, which is made of many glucose molecules bonded together. What happens to the atoms in amylose molecules as the potato decays? Circle True (T) or False (F) for each option. T T T T T F F F F F Some of the atoms are converted into nitrogen and phosphorous: soil nutrients. Some of the atoms are used up by decomposers and disappear. Some of the atoms are incorporated into carbon dioxide. Some of the atoms are turned into energy by decomposers. Some of the atoms are incorporated into water. STONEWIN (E, M) [Animal function, Energy/causes] You are playing outside in a cold winter. You find a stone on the ground. When you pick up the stone, you find that the stone is very cold. Why can people keep warm on a cold day, but stones cannot? Which of the thing(s) from the list below can help to keep people’s bodies warm? Please circle YES or NO for each thing in the list below. a. Water YES NO b. Food YES NO c. Air YES NO d. Exercise YES NO Try to write an explanation of how people’s bodies stay warm that includes ALL of the things you circles “YES” for in the list above. THINGTREE (E, M, H) [Plant growth, Mass/gases/amount] A small oak tree was planted in a meadow. After 20 years, it has grown into a big tree, weighing 500 kg more than when it was planted. Do you think the tree will need any of the following things to grow and gain weight? Please circle Yes or No and explain your choice.If you circled yes, explain how the tree uses it. What happens to it inside the tree? 132 Sunlight Soil Water Air YES YES YES YES NO NO NO NO TREEDECAY [Decomposition, AB-Macroscopic, C-Energy/causes] A tree falls in the forest. After many years, the tree will appear as a long, soft lump on the forest floor. a. The lump on the forest floor weighs less than the original tree. What happened to it? Where would you find the matter that used to be in the tree? b. What caused those changes in the wood? Explain as much as you can how these changes happened. c. Is energy involved when the tree decays? Circle one: Yes / No If your answer is yes, please explain how energy is involved. TROPRAIN [Cross-processes, Large-scale practices] A tropical rainforest is an example of an ecosystem. Which of the following statements about matter and energy in a tropical rainforest is the most accurate? Please choose ONE answer that you think is best. a. Energy is recycled, but matter is not recycled. (Level 2) b. Matter is recycled, but energy is not recycled. (Level 4) c. Both matter and energy are recycled. (Level 3) d. Both matter and energy are not recycled. (Level 2) Please explain why you think that the answer you chose is better than the others. (If you think that some of the other answers are partially right, please explain that, too.) 
WAXBURN [Combustion, Mass/gases/amount] A burning candle is put into an air-tight container. After some time, the candle stops burning. a. Predict whether the air inside the candle will have more, the same, or less of the gases below. Explain where the gases come from or go to. Gas Prediction Explanation: How did burning the candle produce (circle) or use the gas? Oxygen More Same Less Carbon More Same Less dioxide Water vapor More Same Less b. Where does the energy for burning come from? Please explain your answer. WTLOSS(M,H) [Animal function, Microscopic] When a person loses weight, what happens to some of the fat in the person’s body? Choose ONE answer that you think is best. a. The fat is broken down and leaves the person’s body as water and gas. (Level 3) b. The fat is converted into energy. (Level 2) c. The fat is used up providing energy for the person’s body functions. (Level 2) 133 d. The fat is broken down and leaves the person’s body as feces and urine. (Level 1) Please explain why you think that the answer you chose is better than the others. (If you think some of the other answers are also partially right, please explain that, too). 134 Table A.1 Descriptions of the four achievement levels of carbon cycle learning progression Explaining Level 4. Linking processes with matter and energy as constraints Specific Level Description Macro: Describe systems as conserving matter and energy in hierarchy of scales; Link macroscopic processes to chemical reactions with matter and energy as constraints; Link macroscopic processes to large-scale carbon cycle and energy flow. Gases: Correct explanation of gases (CO2 or O2) change in chemical reactions or in global-scale changes. Micro: Atomic-molecular accounts conserving atoms in chemical changes and/or conserving energy with degradation. Large: Describe matter cycle involving carbon transforming between organic and inorganic forms; Describe energy flow with degradation or connected to chemical reactions. Level 3. Macro: Describe actors as systems containing matter and energy in hierarchy of scales, Changes of but do not conserve matter/energy successfully; Molecules and Energy Link macroscopic changes to chemical changes and describe chemical changes Forms with as changes involving atoms, organic molecules, and energy forms, but do not Unsuccessful successfully conserve matter and energy. (e.g., organic molecule and energy Constraints conversion; energy conservation without degradation) Gases: Describe air as mixture of gases including CO2 or O2; Describe gas cycle as changes between CO2 and O2 and CO2 and O2 as different substances or molecules; Connect gas cycles with chemical changes. Identify CO2 as the product of combustion or cellular respiration. Micro: Trace materials to and from cells; Provide incomplete atomic-molecular accounts about changes of molecules and energy forms. Large: Link large-scale processes to macro or atomic-molecular processes, but without full conservation of matter and energy; Describe large-scale processes as materials passing on without organic carbon generation and oxidation; Describe energy passing on without degradation or connecting to chemical reactions. 135 Table A.1 (Cont’d) Level 2. Forcedynamic accounts with hidden mechanisms Macro: Still focus on actors, enablers, and results, but link the macroscopic changes to hidden mechanisms that involving changes of materials and energy in general. Gases: May use CO2 or O2 to describe the quality of the air. Describe gas changes in life-related events. 
Micro: Link macro-processes with unobservable mechanisms or hidden actors (e.g., decomposers).
Large: Describe networks of actors and enablers (e.g., food chains, with emphasis on eating rather than matter/energy flow).

Level 1. Macroscopic force-dynamic accounts
Macro: Describe macro-processes in terms of the action-result chain: actors use enablers to accomplish their goals; interactions between actors and enablers are like macroscopic physical push-and-pull and do not involve any change of matter/energy.
Gases: Air (fresh air, bad air) as an enabler or waste product of the actor. No explicit gas exchange.
Large: No connections to larger systems. Simple food chains as series of events.
Micro: Connections to subsystems limited to parts the student can see or feel.

Table A.2 The specific rubric of the CARGAS item

CARGAS (E, M): When you are riding in a car, the car uses gasoline to make it run. Eventually the gasoline tank is empty.
a. What happens to the materials the gasoline is made of when the car uses the gasoline?
b. Is air needed for the car to use the gasoline? If so, how does the air change as the car runs?
c. Where does energy come from to make the car run?

Level 4. Linking processes with matter and energy as constraints

Specific level description for this item:
Macro:
- Describe systems as conserving matter and energy in a hierarchy of scales.
- Link car running to the combustion (or burning) of gasoline, in which they trace matter OR energy successfully. Tracing matter successfully means that they explain the consumption of gasoline by stating that the materials of gasoline and air (NOTE: mentioning O2 is not necessary because it is not asked for) change into CO2, which goes into the air. Tracing energy successfully means that they explain that the energy that makes the car run ultimately comes from high-energy bonds (C-C, C-H) or chemical energy in the materials of gasoline. To be level 4, they also must not confuse matter and energy in tracing them (e.g., no matter/energy conversion).
- Link macroscopic processes to the large-scale carbon cycle and energy flow.
Gases:
- Correctly explain that air is needed and CO2 is produced in the combustion of gasoline.
Micro:
- Correctly describe atomic-molecular accounts tracing carbon through combustion (materials of gasoline -> CO2).
- Identify the materials of gasoline and air as reactants and carbon dioxide as a key product.
Large:
- Describe the matter cycle involving carbon transforming between organic and inorganic forms.
- Describe energy flow with degradation or connected to chemical reactions.
- Explain combustion at the atomic-molecular level, consistently tracing matter and energy through the process.

Typical examples:
a. The gasoline is turned into CO2 and H2O through combustion. b. The air is connected to carbon and hydrogen molecules in the fuel and is turned into CO2 and H2O. c. The fuel - which is chemical energy.
a. when the gas runs out, that means all of the high energy bonds were broken down b. Yes / the air helps break down the bonds c. the ultimate source of energy in the gasoline is c_c, and C-H bonds

Level 3. Changes of Molecules and Energy Forms with Unsuccessful Constraints

Specific level description for this item:
Macro:
- Describe actors as systems containing matter and energy in a hierarchy of scales, but do not conserve matter/energy successfully.
- Link car running to the combustion (or burning) of gasoline, in which they trace matter or energy, yet unsuccessfully.
This may entail one of several things: (1) they trace both matter and energy, but confuse matter and energy in tracing them (e.g., the materials of gasoline change into energy that makes the car run); (2) they trace matter only, and unsuccessfully (e.g., they explain that the materials of gasoline change into some gases without identifying them); (3) they trace energy only, and unsuccessfully (e.g., they mention some forms of energy other than chemical energy or "bonds" without identifying them).
Gases:
- Describe air as a mixture of gases including CO2 or O2.
- Describe the gas cycle as changes between CO2 and O2, with CO2 and O2 as different substances or molecules.
- Connect gas cycles with chemical changes.
- Identify CO2 as a product of the combustion of gasoline.
Micro:
- Trace materials to and from cells.
- Provide incomplete atomic-molecular accounts about changes of molecules and energy forms.
Large:
- Link large-scale processes to macroscopic or atomic-molecular processes, but without full conservation of matter and energy.
- Describe large-scale processes as materials passing on without organic carbon generation and oxidation.
- Describe energy passing on without degradation or connection to chemical reactions.

Typical example:
a. They get made into different types of energy. b. No c. The gas being changed into motion energy.

Level 2. Force-dynamic accounts with hidden mechanisms

Specific level description for this item:
Macro:
- Still focus on actors, enablers, and results, but link the car running to hidden mechanisms that involve changes of materials and energy in general (e.g., identify gasoline as the source of the energy that makes the car run).
Gases:
- May use CO2 or O2 to describe the quality of the air without any explanation of mechanism. Mention air, smoke, and/or ash as byproducts of burning. Describe gas changes in life-related events.
Micro:
- Link macro-processes with unobservable mechanisms or hidden actors (e.g., decomposers). May describe an explosion or any process that provides energy in the engine, cylinder, and piston.
Large:
- Describe networks of actors and enablers (e.g., food chains with emphasis on eating rather than matter/energy flow; car running with emphasis on burning gasoline).

Typical example:
A: The materials get burned and they come out of your gas pipe where smoke comes out. B: No, cause when the materials burn it makes smoke and air. C: The energy comes from the gas in order for the car to run.

Level 1. Macroscopic force-dynamic accounts

Specific level description for this item:
Macro:
- Describe car running in terms of the action-result chain: the car uses enablers (e.g., gasoline) to run; interactions between actors and enablers are like macroscopic physical push-and-pull and do not involve any change of matter/energy (e.g., "without gasoline your car cannot work," "Energy comes from running of the car").
Gases:
- Describe air in terms of tires, the engine, or people breathing in the car, OR state that air is not involved in this process.
Large:
- No connections to larger systems.
Micro:
- Connections to subsystems limited to parts the student can see or feel.
- Describe how the materials the gasoline is made of get used with single-process/step words (e.g., "evaporate," "gets burned," "used up") without providing additional steps (e.g., "go out in the air"), hierarchy in structure, or smaller entities.

Typical example:
A. It is used up. B. No C. The engine.
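The rubric above assigns each written response an ordinal level score from 1 to 4. As a bridge to the calibration results reported in Table A.3, the following is a minimal Python sketch of how such rubric-coded scores might be laid out as a person-by-item response matrix for a polytomous IRT analysis; the student identifiers, the scores, and the variable names are hypothetical, and this layout is an assumption for illustration rather than a description of the actual data files.

```python
# Hypothetical example: rubric-coded level scores (1-4) arranged as the
# person x item matrix that a polytomous IRT calibration would consume.
# None marks an item a student did not answer.
responses = {
    "student_001": {"CARGA_CR": 3, "TROPRA_CR": 2, "ANIM_MTF": 2},
    "student_002": {"CARGA_CR": 4, "TROPRA_CR": 3, "ANIM_MTF": None},
    "student_003": {"CARGA_CR": 1, "TROPRA_CR": 2, "ANIM_MTF": 1},
}

# Fixed column order: every item that appears for any student
items = sorted({item for scores in responses.values() for item in scores})

# Rectangular matrix (rows = students, columns = items), None where missing
matrix = [[responses[s].get(item) for item in items] for s in sorted(responses)]

for student, row in zip(sorted(responses), matrix):
    print(student, row)
```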
Table A.3 Unidimensional PCM results

ID  Item         Estimate  Error   MNSQ  CI             T
1   ACRON_MC      0.09     0.06    1.18  (0.91, 1.09)    3.9
2   ACRON_CR      0.842    0.065   1.01  (0.89, 1.11)    0.2
3   AIREV_CR     -0.169    0.091   0.94  (0.81, 1.19)   -0.6
4   AIRBO_CR      0.95     0.091   0.96  (0.70, 1.30)   -0.2
5   ANIMW_CR     -1.097    0.106   1     (0.75, 1.25)    0
6   APPLR_CR     -0.107    0.085   0.9   (0.84, 1.16)   -1.2
7   BODYE_CR      0.19     0.075   1.02  (0.85, 1.15)    0.3
8   BODYH_MC     -1.842    0.097   1.18  (0.74, 1.26)    1.3
9   BODYH_CR     -0.64     0.081   1.04  (0.79, 1.21)    0.4
10  BREAD_MC     -1.024    0.057   1.09  (0.90, 1.10)    1.7
11  BREAD_CR     -0.194    0.054   0.96  (0.88, 1.12)   -0.6
12  MATCHEL_CR   -0.407    0.09    0.91  (0.82, 1.18)   -1
13  MATCHM_MC    -0.372    0.061   1.16  (0.90, 1.10)    3.0
14  MATCHMA_CR    0.104    0.061   0.98  (0.88, 1.12)   -0.3
15  MATCHMB_CR    0.127    0.065   0.93  (0.87, 1.13)   -1
16  CARBO_CR      0.991    0.094   1.03  (0.74, 1.26)    0.3
17  CARPA_CR      0.696    0.096   0.94  (0.76, 1.24)   -0.5
18  CARGA_CR     -0.65     0.066   0.91  (0.85, 1.15)   -1.2
19  CONNL_CR     -0.432    0.099   0.99  (0.80, 1.20)   -0.1
20  CUTTR_CR      0.032    0.112   1.07  (0.65, 1.35)    0.4
21  DEERW_MC      0.292    0.085   1.13  (0.77, 1.23)    1.1
22  DEERW_CR      1.037    0.098   0.97  (0.74, 1.26)   -0.2
23  DIFEV_CR      0.086    0.061   0.98  (0.88, 1.12)   -0.4
24  EATAP_CR      0.101    0.062   0.94  (0.89, 1.11)   -1.1
25  EATBR_CR      0.415    0.071   1.01  (0.85, 1.15)    0.2
26  ECOSP_CR      0.62     0.08    1.05  (0.81, 1.19)    0.5
27  ENERP_CR     -0.05     0.044   0.92  (0.90, 1.10)   -1.5
28  ENPLN_CR     -0.094    0.056   0.96  (0.89, 1.11)   -0.7
29  GIRLAB_CR     0.009    0.105   0.95  (0.72, 1.28)   -0.3
30  GIRLC_CR     -0.389    0.121   0.83  (0.79, 1.21)   -1.6
31  GLOBM_CR     -0.297    0.084   0.92  (0.80, 1.20)   -0.8
32  GLOBH_CR     -0.809    0.092   0.94  (0.78, 1.22)   -0.5
33  GLUEG_CR     -0.906    0.079   0.92  (0.84, 1.16)   -1
34  GRAND_CR      0.749    0.107   1.01  (0.69, 1.31)    0.1
35  GRANP_CR      0.086    0.111   0.92  (0.72, 1.28)   -0.5
36  GROWT_CR      0.992    0.078   0.92  (0.85, 1.15)   -1.1
37  INFAN_CR      0.767    0.076   0.92  (0.89, 1.11)   -1.4
38  KLGSE_CR      0.532    0.094   0.98  (0.78, 1.22)   -0.2
39  LAMPE_CR     -0.345    0.061   1.11  (0.85, 1.15)    1.4
40  OCTAM_CR      0.417    0.096   0.85  (0.79, 1.21)   -1.4
41  PLANG_CR      0.796    0.084   0.9   (0.88, 1.12)   -1.7
42  THINT_CR     -0.151    0.057   0.82  (0.90, 1.10)   -3.5
43  TREDEAB_CR    0.252    0.056   0.92  (0.89, 1.11)   -1.4
44  TREDEC_CR    -0.015    0.048   0.94  (0.88, 1.12)   -0.9
45  TROPRA_MC    -1.041    0.063   1.17  (0.89, 1.11)    3.0
46  TROPRA_CR    -0.94     0.061   0.97  (0.89, 1.11)   -0.6
47  WAXBUR_CR     1.367    0.1     0.95  (0.70, 1.30)   -0.3
48  WTLOSS_MC    -1.488    0.057   1.14  (0.91, 1.09)    3.0
49  WTLOSS_CR    -0.512    0.051   1.02  (0.91, 1.09)    0.4
50  AIREVE_MTF   -1.2      0.089   1.22  (0.83, 1.17)    2.3
51  ANIM_MTF      0.003    0.11    1.25  (0.67, 1.33)    1.5
52  BODY_MTF      0.562    0.105   1.09  (0.82, 1.18)    1
53  ENERPE_MTF   -0.431    0.055   1.12  (0.88, 1.12)    1.9
54  ENERPL_MTF    0.141    0.065   1.11  (0.83, 1.17)    1.2
55  GLOBM_MTF     0.107    0.104   1.11  (0.79, 1.21)    1
56  GLOBH_MTF     0.693    0.093   1     (0.80, 1.20)    0.1
57  INFA_MTF      0.794    0.088   1.26  (0.79, 1.21)    2.2
58  OCTAM_MTF     0.281    0.086   1.08  (0.76, 1.24)    0.7
59  POTATO_MTF    0.572    0.624   1.36  (0.72, 1.28)    2.3
60  THINGT_MTF   -1.353    0.073   1.2   (0.89, 1.11)    3.3

Notes on the columns:
- Estimate: the item difficulty estimate for each item.
- Error: the error of the item difficulty estimate.
- MNSQ: the weighted fit mean square, i.e., the mean square of the residuals between what is observed and what is expected under the model.
- CI: the confidence interval for the MNSQ.
- T: the t statistic used to indicate how well the item fits the model.
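Table A.3 reports partial credit model (PCM) difficulty estimates together with residual-based fit statistics. As an illustration of how PCM step parameters translate into score-category probabilities and a model-expected score (the quantity that observed scores are compared against when a weighted MNSQ-type statistic is computed), here is a minimal Python sketch. The function names and the example step values are illustrative, and the parameterization shown is the generic partial credit form, not a transcript of the output of the software used for Table A.3.

```python
import numpy as np

def pcm_category_probs(theta, steps):
    """Partial credit model category probabilities for one item.

    theta : person location on the logit scale
    steps : step parameters [d1, d2, ...], the points where adjacent
            score categories are equally likely (cf. Table A.4)
    Returns probabilities for scores 0, 1, ..., len(steps).
    """
    steps = np.asarray(steps, dtype=float)
    # Exponent for score k is the cumulative sum of (theta - d_j) for j <= k;
    # score 0 has an exponent of 0 by convention.
    exponents = np.concatenate(([0.0], np.cumsum(theta - steps)))
    exponents -= exponents.max()              # guard against overflow
    unnormalized = np.exp(exponents)
    return unnormalized / unnormalized.sum()

def expected_score(theta, steps):
    """Model-expected score that residual-based fit statistics compare against."""
    probs = pcm_category_probs(theta, steps)
    return float(np.dot(np.arange(len(probs)), probs))

if __name__ == "__main__":
    # Illustrative step parameters in the style of Table A.4 (hypothetical values)
    steps = [-2.0, 0.1, 1.5]
    for theta in (-2.0, 0.0, 2.0):
        print(theta,
              pcm_category_probs(theta, steps).round(3),
              round(expected_score(theta, steps), 2))
```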
Table A.4 The step threshold parameters of the 38 good items

ID  Item         b        d1       d2       d3
1   ACRON_CR      0.745   -1.778    1.132    2.882
2   AIREV_CR     -0.277   -1.033    0.478
3   ANIMW_CR     -1.208   -1.981   -0.435
4   APPLR_CR     -0.227   -1.196    0.743
5   MATCHEL_CR   -0.592   -1.367    0.184
6   MATCHM_MC    -0.518   -1.293    0.258
7   MATCHMA_CR   -0.005   -2.156    0.221    1.919
8   MATCHMB_CR    0.017   -2.362    0.317    2.098
9   CARBO_CR      0.957    0.754    1.161
10  CARPA_CR      0.562   -3.348    2.428    2.605
11  CARGA_CR     -0.824   -2.792   -0.099    0.418
12  CONNL_CR     -0.583   -2.718    1.551
13  CUTTR_CR     -0.031   -0.34     0.277
14  DEERW_CR      0.986    0.4      1.573
15  DIFEV_CR     -0.028   -1.265   -0.452    1.633
16  EATAP_CR     -0.038   -2.341   -1.025    3.251
17  EATBR_CR      0.299   -1.646    0.773    1.771
18  ENPLN_CR     -0.248   -1.838   -0.427    1.52
19  GIRLAB_CR    -0.084   -1.665   -0.824    2.238
20  GLOBM_CR     -0.445   -2.182   -0.273    1.119
21  GLOBH_CR     -0.978   -3.268   -0.807    1.139
22  GLUEG_CR     -1.103   -3.576   -1.278    1.544
23  GRAND_CR      0.703   -2.095    1.773    2.432
24  GRANP_CR     -0.009   -1.919    1.901
25  GROWT_CR      0.915   -1.128    1.069    2.805
26  INFAN_CR      0.674   -0.893    2.241
27  KLGSE_CR      0.486   -1.272    0.242    2.488
28  PLANG_CR      0.666   -1.563    2.895
29  TREDEAB_CR    0.123   -2.131    0.899    1.603
30  TROPRA_CR    -1.138   -4.431   -0.805    1.822
31  WAXBUR_CR     1.317    0.959    1.674
32  WTLOSS_CR    -0.665   -3.314   -0.605    1.925
33  ANIM_MTF     -0.089   -2.123    0.11     1.746
34  BODY_MTF      0.411   -1.442    2.264
35  ENERPE_MTF   -0.599   -1.495   -1.088    0.784
36  GLOBM_MTF    -0.024   -1.38     1.333
37  GLOBH_MTF     0.616   -1.414   -0.229    3.492
38  OCTAM_MTF     0.236   -0.565    0.584    0.688

Notes on the columns:
- b is the item difficulty parameter.
- d1 is the first step threshold parameter: the point on the ability scale that separates score 1 from score 2.
- d2 is the second step threshold parameter: the point on the ability scale that separates score 2 from score 3.
- d3 is the third step threshold parameter: the point on the ability scale that separates score 3 from score 4. Items with no d3 entry do not use the highest score category.

Table A.5 Excluded items (misfitting items and items whose thresholds are not in the correct order)
ACORN_MC, THINT_CR, AIREVE_MTF, INFA_MTF, POTATO_MTF, THINGT_MTF, AIRBO_CR, BODYE_CR, BODYH_MC, BODYH_CR, BREAD_MC, BREAD_CR, DEERW_MC, ECOSP_CR, ENERP_CR, LAMPE_CR, OCTAM_CR, TREDEC_CR, ENERPL_MTF, GIRLC_CR, TROPRA_MC, WTLOSS_MC.
Items were excluded either because they misfit the model or because their threshold parameters were not in the correct order. A short sketch following Table A.6 illustrates how the threshold ordering can be checked and how the retained thresholds can be summarized across items.

Table A.6 Effective options of each MTF item
Columns: Item Name; Sub T or F questions; Effective option (X).
Item names: AIREVENT, ANIMWINT, BODYTEMP, ENERPEOP, ENERPLNT, GLOBWARMM, GLOBWARMH, INFANT, OCTMALE, THINGTREE.
Sub T or F questions (in order): PLANT GROWTH, GIRL RUN, BURNING WOOD, FOOD DECAY, HEAT, GAS/WATER, WASTE, OTHER MATERIAL, DISAPPEAR, WATER, FOOD, AIR, EXERCISE, WATER, FOOD, NUTRIENT, EXERCISE, SUNLIGHT, CO2, O2, WATER, LIGHT, AIR, NUTRIENTS, OWN ENERGY, TRUCK, FOREST, CANDLE, HAIR SPRAY, TRUCK, CUT TREE, REFRIGERATOR, AEROSOL, BEEF, SUN, WATER, AIR, FOOD, CO2, AIR POLLUTION, ENERGY DISAPPEAR, HEAT, WATER, SUN, SOIL, WATER, AIR.
X indicates a significant group difference, in terms of the paired CR score, between students who selected T and students who selected F for that sub-question.
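Tables A.4 and A.5 rely on two properties of the calibrated thresholds: whether each item's thresholds are in the correct order (items violating this were excluded, per Table A.5) and, for retained items, where the thresholds fall on the ability scale. The sketch below is a minimal Python illustration of both checks; the function and variable names are illustrative, the three example items copy values from Table A.4, and averaging thresholds across items is shown only as one possible way to summarize them, not as the exact procedure used in this study.

```python
from statistics import mean

# Step thresholds per item (d1, d2, and, where present, d3), copied from Table A.4
item_thresholds = {
    "CARGA_CR": (-2.792, -0.099, 0.418),
    "EATBR_CR": (-1.646, 0.773, 1.771),
    "GRANP_CR": (-1.919, 1.901),            # this item has only two steps
}

def thresholds_ordered(thresholds):
    """True when d1 < d2 < d3 ..., the ordering requirement used to screen
    items (items violating it are listed in Table A.5)."""
    return all(a < b for a, b in zip(thresholds, thresholds[1:]))

disordered = [name for name, d in item_thresholds.items() if not thresholds_ordered(d)]
print("items with disordered thresholds:", disordered)

# One possible summary: average each threshold position across the retained
# items to obtain provisional boundaries between adjacent score levels.
max_steps = max(len(d) for d in item_thresholds.values())
boundaries = [
    mean(d[k] for d in item_thresholds.values() if len(d) > k)
    for k in range(max_steps)
]
print("provisional level boundaries (d1, d2, d3 averages):",
      [round(b, 3) for b in boundaries])
```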