This is to certify that the dissertation entitled The Relationship Between Item Format and Cognitive Processes in Wide-scale Assessment of Mathematics, presented by Diane R. Garavaglia, has been accepted towards fulfillment of the requirements for the Ph.D. degree in Counseling, Educational Psychology, and Special Education.

Major professor

THE RELATIONSHIP BETWEEN ITEM FORMAT AND COGNITIVE PROCESSES IN WIDE-SCALE ASSESSMENT OF MATHEMATICS

By

Diane R. Garavaglia

A DISSERTATION

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of

DOCTOR OF PHILOSOPHY

Department of Counseling, Educational Psychology, and Special Education

2001

ABSTRACT

THE RELATIONSHIP BETWEEN ITEM FORMAT AND COGNITIVE PROCESSES IN WIDE-SCALE ASSESSMENT OF MATHEMATICS

By

Diane R. Garavaglia

The purpose of this study was to determine whether test takers used different cognitive processes when they solved multiple-choice versus constructed-response items. I conducted this study in an era when school accountability and high-stakes, large-scale assessments were seemingly as important as student learning itself. The high-stakes nature of testing created an environment in which both testing advocates and challengers scrutinized tests even more than normal. In particular, some challengers asserted that multiple-choice items were ill-suited for assessing certain types of cognitive processes or for providing useful information about student achievement. Using information as the central idea, I approached the question from an information-value (or value-added) perspective and attempted to determine whether there was a difference in the type of cognitive processes elicited by each item format. The question was narrowly contextualized in one area of mathematics, namely 8th grade algebraic pattern items. I selected 34 students who were enrolled in 8th grade mathematics courses in the spring of 1998. I examined the question using a think-aloud procedure, an analysis tool seldom used in the field of measurement. The overall results suggested that students used similar cognitive processes to solve both multiple-choice and constructed-response pattern items. However, the results were likely related to the characteristics of the items; that is, many constructed-response items allowed for one solution path. I referred to these items as "multiple-choice items in disguise." Recommendations were offered to test users, developers, and other researchers.
Copyright by
Diane Rose Garavaglia
2001

TABLE OF CONTENTS

LIST OF TABLES ..... vii
LIST OF FIGURES ..... viii
CHAPTER I ..... 1
Introduction ..... 1
CHAPTER II ..... 5
Review of Related Literature ..... 5
Information-value Studies ..... 5
Think-aloud Methodology ..... 9
Comparability of Items Written in Multiple Item Formats ..... 15
Mathematical Cognitive Processes ..... 18
Summary of Literature Review ..... 21
Contribution of this Study ..... 23
CHAPTER III ..... 25
Study Design and Procedures ..... 25
Research Question ..... 25
Sample ..... 25
Procedure for Sample Selection ..... 26
Instruments ..... 28
Algebra Strand ..... 28
Design ..... 32
Section 1: Pilot Study ..... 32
Item Selection for the Pilot Study ..... 33
Section 2: Item Development ..... 35
Model-shell Item Selection Criteria ..... 36
Item Writing Procedure and an Example ..... 38
Section 3: Assignment of Items and Students to Forms ..... 42
Section 4: Testing Procedures ..... 43
Threat of Confounding ..... 45
Data Collection ..... 46
Data Analyses ..... 47
Descriptive Item Level Statistics ..... 47
Identify and Validate Cognitive Processes ..... 48
Full-scale Analysis ..... 54
Cognitive Process Similarities and Differences ..... 55
Post hoc Evaluation of Steps 2 and 3 ..... 58
Summary of Design and Procedures ..... 60
CHAPTER IV ..... 63
Results ..... 63
Descriptive Statistics ..... 63
Summary of Descriptive Statistics ..... 64
Cognitive Process Categories ..... 65
Cognitive Process Comparisons ..... 65
Summary of Cognitive Process Comparisons ..... 83
External Post hoc Evaluation ..... 85
Results of the Evaluation ..... 85
Results of the Teacher Think-aloud Interview ..... 89
Summary of External Post hoc Evaluation ..... 98
Overall Summary of Results ..... 98
CHAPTER V ..... 101
Summary, Conclusions and Next Steps ..... 101
Overview of Study ..... 101
Conclusions ..... 105
Recommendations ..... 107
Limitations and Next Steps ..... 110
APPENDIX A ..... 113
Student Demographic Survey ..... 113
APPENDIX B ..... 115
Algebra Items ..... 115
BIBLIOGRAPHY ..... 129

LIST OF TABLES

Table 1. Demographic Information ..... 26
Table 2. Distribution of Item Format and Mathematical Strand ..... 34
Table 3. Composition of an Item Family ..... 37
Table 4. Assignment of Items to Test Forms ..... 43
Table 5. Item Statistics - Form A ..... 63
Table 6. Item Statistics - Form B ..... 64
Table 7. Frequency Distribution of Abridged Category List ..... 67
Table 8. Family 1: Arrows and U-shapes ..... 72
Table 9. Family 2: Tacks-Top only/Top-bottom ..... 73
Table 10. Family 3: Extend Pattern of Numbers ..... 75
Table 11. Family 4: Vertex-Diagonal, Vertex-Triangle ..... 76
Table 12. Family 5: Columns ..... 77
Table 13. Family 6: Puppy's Weight ..... 77
Table 14. Family 7: Pattern of Letters ..... 77
Table 15. Family 8: Dots and Stars ..... 79
Table 16. Overall Patterns for Families 1 through 8 ..... 79
Table 17. Diagonal Rectangle Item ..... 80
Table 18. Teacher Response Distribution for Abridged Category List ..... 91

LIST OF FIGURES

Figure 1. Cognitive Process Categories ..... 52
Figure 2. Tape Distribution Between Researchers ..... 53
Figure 3. Depth of Cognitive Engagement Categories ..... 57

CHAPTER I
Introduction

Testing has always been scrutinized, and it has come under even more scrutiny in recent years by everyone involved in it: test takers, teachers, parents, school administrators, community members, legislators, and measurement professionals. With the high stakes and large costs often associated with tests, it is no wonder they are scrutinized. Many critics of testing argue that there is too much testing in our schools, that the time spent on testing is time taken away from instruction and learning. Others argue that tests emotionally harm some students by putting too much stress on them, especially when younger students are the test takers. Still others say that tests do not tap important cognitive processes, such as higher-order thinking skills.
And, within the last ten years, both testing advocates and challengers have asserted that particular item formats are ill-suited for assessing certain types of cognitive processes or for providing useful information about student achievement. Of all the critiques, this last one may be the most important; in my opinion, the most important reason for testing is to provide information to test users. Information has "value" in that it helps users make decisions or draw conclusions about questions that matter to them (Pearson and Garavaglia, 1997). In the field of measurement and within the arena of large-scale testing, a few questions that matter stem from our interest in the interplay between assessment and curriculum. Test scores are just one type of useful information to answer questions such as:

• Are my students reading at grade level?
• How well are my students performing in reading comprehension?
• What do the students know about mathematical problem-solving?

Mehrens and Lehmann (1987) implicated both the qualitative and quantitative aspects of information-value in their statement that it is necessary to have as much of the relevant information as possible to make an informed decision: "The more, and more accurate, the information on which a decision is based, the better that decision is likely to be" (p. 10). The question of interest from an information-value perspective is whether an additional datum of information would help test users better answer the question(s) of interest.

The question of interest specific to this study, in a broad sense, was to explore the information-value of constructed-response items when they are mixed with multiple-choice items on large-scale assessments. More narrowly, the purpose of this study was to determine whether constructed-response and multiple-choice items require students to use similar cognitive processes.

This question is relevant given a development witnessed over the last dozen years or so in test development: the creation of tests that measure a single content area using multiple item formats. Many state-level tests and the National Assessment of Educational Progress (NAEP) use a combination of multiple-choice and constructed-response items to assess a content area. But strong advocates of the constructed-response item format say that certain cognitive processes cannot be tapped by multiple-choice items: namely, higher-order thinking skills (HOTS). Thus, when using only multiple-choice items on a test, higher levels of knowledge are not being tested (Snow, 1993). Conversely, there are others who question how it can be said with any certainty that multiple-choice items are not tapping HOTS or that they are not providing meaningful information about the content being assessed (Haladyna, 1994; Stiggins, 1994; Martinez, 1993; Mehrens, 1992). Rather than discounting the multiple-choice item format in favor of another format, Snow (1993) offers a balanced perspective and suggests that further research be conducted to add meaningful information about the relationship between cognitive processing and item formats. Haladyna (1994) writes,

We must learn quite a bit more about the effects of item format on cognitive learning before we can make confident statements about the effectiveness of any format. Research is needed that shows the optimal formats for measuring newly defined abilities and various forms of higher level achievement. (p. 183)
The general goal of conducting this study echoes Haladyna's statement; that is, to add meaningful and practical empirical evidence to the literature on the interplay between cognitive processes and item format. The question is interesting both theoretically and practically. Theoretically, the question of ensuring measures that tap higher-order thinking is an essential feature of the validity of such a test. Practically, the question is one of cost-effectiveness and curricular information. Some people want to measure higher-order thinking, but how can we find a format that is simultaneously effective, informative, and precise and places the least burden on schools, teachers, and students in terms of both money and time?

I approached the issue from a value-added perspective and attempted to determine whether there was a difference in the type of cognitive processes elicited by each type of item format. The question was contextualized in mathematics. I examined the question by using an analysis tool seldom used in the field of measurement, namely, a think-aloud procedure that provided verbal evidence about the cognitive processes utilized by students as they answered items. The items used were released and non-released 8th grade algebra items from the 1992 and 1996 National Assessment of Educational Progress (NAEP) and one 8th grade algebra item from the Balanced Assessment project (Balanced Assessment Package, 1997).

CHAPTER II
Review of Related Literature

The relevant literature to answer the research question can be classified in two ways: (a) do constructed-response items provide us with more information (information-value) about what students are capable of doing than we get from multiple-choice items alone, and (b) are the cognitive processes needed to answer constructed-response items unique to the constructed-response item format? Much of the relevant literature comes from studies that analyzed data from several of the College Board's Advanced Placement (AP) tests. Perhaps this is because the AP tests have been using both multiple-choice and constructed-response item formats for several years. The literature review also included studies that used achievement instruments other than the AP tests.

Besides the two main classifications above, three additional topics were reviewed to learn what information already existed about the issues relevant to this study. The topics included reviews of the think-aloud methodology, comparability of items written in multiple formats, and mathematical cognitive processes. Each topic is presented separately in this chapter. I end the chapter by identifying how this study will contribute to the literature.

Information-value Studies

Several of the AP studies looked at the amount of new information gained when mixed item formats were used on a single test. Although the authors did not provide operational definitions for information-value (or value-added; they are seemingly used interchangeably), I interpreted it as how much additional or new information about a construct is gained when constructed-response items are added to multiple-choice items on a test. (See Pearson and Garavaglia, 1997, for a description of how information-value can be conceptualized in large-scale assessment programs.)

Lukhele, Thissen, and Wainer (1994) used item response theory (IRT) models to examine the amount of information obtained from different item formats when presented on the same test.
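As background (and not part of any of the studies reviewed here), "information" in IRT has a precise technical meaning: the Fisher information an item contributes to the precision of the ability estimate at a given ability level, and item information sums across the items on a test. A minimal sketch, assuming a standard three-parameter logistic (3PL) item with made-up parameter values:

```python
import numpy as np

def p_3pl(theta, a, b, c):
    """Probability of a correct response under the 3PL model."""
    return c + (1.0 - c) / (1.0 + np.exp(-a * (theta - b)))

def info_3pl(theta, a, b, c):
    """Fisher information of a 3PL item at ability level theta."""
    p = p_3pl(theta, a, b, c)
    # Standard 3PL information: a^2 * ((1-p)/p) * ((p-c)/(1-c))^2
    return a**2 * ((1.0 - p) / p) * ((p - c) / (1.0 - c))**2

theta = 0.0  # ability level at which information is evaluated
one_item = info_3pl(theta, a=1.2, b=0.0, c=0.20)  # hypothetical multiple-choice item
sixteen_items = 16 * one_item                      # information adds across items
print(round(one_item, 3), round(sixteen_items, 3))
```

Comparisons of the kind described next rest on this additivity: sum the information of a set of multiple-choice items and weigh the total against the information of a single constructed-response item, the latter computed under a polytomous model such as the graded response model.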
They analyzed the multiple-choice data using the 3-parameter IRT model and analyzed the constructed-response items using a graded response model. Using data from the 1989 AP Chemistry and 1988 AP US History tests, they found for both tests that adding constructed-response items to the tests provided little information beyond what the multiple-choice items yielded. The authors also examined the amount of time test takers used to respond to multiple-choice items in comparison to constructed-response items and the cost to score the two item types. They found that test takers could answer 16 multiple-choice items in the same amount of time that was needed to answer one constructed-response item, and that the 16 multiple-choice items cost much less to score compared to the one constructed-response item. Most important, the information yield of the 16 multiple-choice items (on the chemistry exam) was double that of the one constructed-response item. They also showed that multiple-choice items were more cost effective compared to constructed-response items. In conclusion, they found that constructed-response items yielded less information, required more testing time, and incurred larger costs compared to multiple-choice items.

Information-value studies have also been conducted in the areas of science, chemistry, and computer science (Thissen, Wainer, & Wang, 1994; Wainer & Thissen, 1993; Wainer, Wang, & Thissen, 1991; Wang, Wainer, & Thissen, 1993). The findings from all of the studies suggested that when response data from constructed-response items were combined with response data from multiple-choice items, little new information was obtained about any of the areas.

By contrast, research findings from content areas other than those discussed thus far suggested that differences (e.g., traits) existed across different item formats. For instance, in the area of writing, Werts et al. (1980) worked with first-year college students and attempted to determine whether different item formats would detect different writing traits. The design was a variation of a multitrait-multimethod design. Three administrations of the Test of Standard Written English (TSWE) and three short (20-minute) essay prompts were used to collect response data. The three essay prompts were considered to be three separate and therefore independent tests. (The authors did not provide information about the essay prompts.) All of the tests were given within the same year and over several test occasions. The nonzero covariation for the essay residuals showed that the essays measured a common trait that was different from whatever traits the essays and TSWE shared. So, even though all of the assessments were measuring writing, the essays seemingly measured something unique.

Bennett et al. (1990) conducted two studies using the same test, the AP Computer Science Examination, but different measurement models to examine differences between item formats. They used a confirmatory factor analysis model in the 1990 study and a model hypothesizing separate format factors in the 1991 study. In the 1990 study they first treated each constructed-response item as a separate variable. They then grouped ten or more of the multiple-choice items; each group of multiple-choice items represented a separate variable. A one-factor covariance structure model was used to analyze the data. The results indicated that both item formats measured the same characteristics.
They concluded that adding constructed-response items to multiple-choice items did not add additional information about computer science. However, in 1991 Bennett et al. discovered that the disattenuated correlation coefficients (correlations corrected for measurement unreliability) from the model hypothesizing separate format factors were significantly different from unity. In the 1991 study, the researchers found differences between the item formats, but, like the 1990 study, a limitation of the finding was the lack of information about the source of the differences.

Using factor analysis, Thissen et al. (1994) found evidence that the constructed-response items on AP Computer Science and Chemistry tests measured something unique from the multiple-choice items because the factors were significantly different for the constructed-response items compared to the general factor. The evidence also indicated that the constructed-response and multiple-choice questions both measured the same thing, because the loadings for the constructed-response items were larger on the multiple-choice factor than they were on the constructed-response factor. Further, although the constructed-response items measured something different from the multiple-choice items, they did not measure that different thing very well. The researchers based this conclusion on the observation that the factor loadings for the constructed-response items were small on the factors specified for the constructed-response items.

In short, the findings about item format differences and, in particular, the value added of constructed-response formats when combined with multiple-choice items on a test, appear to be mixed. In some content areas, constructed-response items seem to add little or no information, while in others they seem to add unique information to the process of making decisions on the basis of test scores.

Think-aloud Methodology

The think-aloud methodology is an interview between the researcher and the respondent (i.e., student). Generally, the researcher and student sit together at a table as the student performs the specified task. As the student performs the task, he or she talks aloud and tells the researcher what he or she is thinking. The researcher prompts the student for clarity or elaboration when necessary. The setting is informal and collegial.

There are two general approaches when conducting a think-aloud. Ericsson and Simon (1993) categorized them into two families, concurrent and retrospective interviews. I chose a concurrent approach for two reasons. First, the accuracy, and therefore utility, of retrospective verbal reports has been questioned by some (Mueller, 1911; Nisbett and Wilson, 1977). Mueller (1911) noted that subjects sometimes confused other retrievable information with information related to the processes used to solve the tasks. Hamilton et al. (1997) reported similar findings in a more recent study. Hamilton et al. found that the time lapse between responding to the item and participating in the interview could result in forgetting, interference, and other memory lapses that compromise the accuracy of the verbal reports. The findings from the studies provided convincing evidence that the use of retrospective verbal reports would likely introduce measurement error in the data. Second, the concurrent approach also allowed me to observe the moment-by-moment sequential thinking of the student as he or she responded to the items, without altering the cognitive processes used to solve the items (Ericsson and Simon, 1993).
Because the purpose of this study was to determine whether similar cognitive processes were needed to solve multiple-choice items and constructed-response items, the sequence of the cognitive processes had to be maintained and not interrupted during data collection. Therefore, because the students' cognitive processes were of primary interest, the concurrent think-aloud procedure was well suited for the purpose of this study.

Ericsson and Simon (1993) examined the myriad ways to conduct a think-aloud interview and the appropriate method to use for a particular purpose. Rather than reporting the myriad approaches here, I instead reviewed studies that used think-aloud procedures and principles that mirrored those planned for this study, regardless of the content area.

Montague and Applegate (1993) used a think-aloud to compare the problem-solving behaviors used by learning disabled, average, and gifted groups of middle school students. They were particularly interested in learning whether the group of students identified as learning disabled used different cognitive processes when solving word problems compared to the other two groups of students. To test their hypothesis, they asked the three groups of students to think aloud as they answered one-step, two-step, and three-step word problems. The researchers identified the cognitive process categories a priori, based on an information processing theoretical framework, and then used the students' think-aloud data to count the number of verbalizations students made within a cognitive process category. By counting the number of times students, within each of the three groups, used a particular category, Montague and Applegate (1993) found that students identified as learning disabled used different approaches to problem-solving than the other two student groups. The researchers confirmed their hypothesis that students with disabilities approached problem solving in less effective ways than students without a disability.

Hamilton et al. (1997) used the think-aloud procedure to examine how useful the verbal data would be for supporting the findings from a statistical analysis (full-information item factor analysis). Specifically, the researchers wanted to learn whether combining quantitative and qualitative data would be an effective way to examine the validity of science items written in several different formats. They used a concurrent think-aloud procedure with high school students. To examine their research question, they first analyzed the students' item responses with a factor analytical procedure. Three science dimensions emerged from the results of the factor analysis. Often this is where factor analysis studies end. But these researchers, using the factor analysis results, then selected 16 multiple-choice items and three constructed-response items to represent the science knowledge assessed by the three dimensions. They tried to select items that varied in difficulty. To compare cognitive processes associated with the different item formats, they matched two items based on the content assessed by the items. They then used the interview data from these items to clarify the meaning of the three science dimensions. After the study, they concluded that "the most important benefit [of think-alouds] is in identifying knowledge and skills that test items require or permit but that are ignored in test interpretation" (p. 196).
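Both of the studies above ultimately reduce verbal protocols to counts of coded utterances per cognitive-process category. A minimal sketch of that tallying step (the group labels, category names, and coded segments below are hypothetical, not taken from either study):

```python
from collections import Counter

# Hypothetical coded think-aloud segments: (student_group, category) pairs.
coded_segments = [
    ("learning_disabled", "guess_and_check"),
    ("learning_disabled", "rereads_problem"),
    ("average", "writes_equation"),
    ("gifted", "writes_equation"),
    ("gifted", "checks_answer"),
]

# Tally how often each group used each cognitive-process category.
tallies = {}
for group, category in coded_segments:
    tallies.setdefault(group, Counter())[category] += 1

for group, counts in tallies.items():
    print(group, dict(counts))
```

Group or format comparisons then amount to comparing these frequency profiles.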
Research in other content areas used the think-aloud procedure as the main tool for collecting data, rather than combining it with a statistical procedure as described in the previous study. Reading comprehension was the content area most frequently studied. For example, a reading comprehension study by Farr et al. (1990) provided insight about the kind of information obtained from think-alouds. The researchers examined only multiple-choice items to learn whether the items assessed the reading comprehension processes intended by the test developer. To make this determination, the researchers had 26 college students take a standardized reading comprehension test. They asked them to think aloud as they read the passages that involved a set of context-dependent items. The most common of the four strategies identified from the verbal reports was that students read the passage, then read the items, and then returned to the passage to find the correct answer, rather than reading for in-depth understanding the first time they read the text.

In addition to describing the type of reading comprehension utilized by test takers, Farr et al. (1990) also concluded that the development of items determines the type(s) of cognitive processes that can be used by the respondent. Hamilton et al. (1997) support this conclusion. While this conclusion seems plausible, it begs the fundamental item development question: How do you develop items that encourage the type of thinking you intend to measure? To take this idea one step further, after the test items are written, it seems reasonable to assume that think-alouds will have to be conducted in parallel with field testing to validate the cognitive process intended by the item writers. Whether or not achievement test developers will use such a (costly) validation method is unknown. And, although Farr et al. (1990) did not draw this conclusion from their work, the quality of the item would seem to play an important part when determining whether items elicit the intended cognitive process(es). Hence, item quality became a major focus for this study, as is seen later.

Haladyna (1994) suggested that think-alouds could be used as an item review procedure employed during field testing. In his assessment of the procedure, he compared the think-aloud method to field testing items by saying,
And, when the researcher uses a one-to-one interview, he or she has the benefit of seeing (in real-time) how the student moves through a test booklet, interrelates information on the test, and how the student retrieves particular information from an item stem (or reading passage) to answer the item. In contrast, field testing does not allow for probing so the researcher is limited by the information obtained from students’ written or bubbled responses. It also does not provide an environment for closely observing students’ test taking behaviors. However, 13 field testing allows for the collection of several data points on every item because of the minimal amount of direct interaction needed between the test taker and the test administrator. Thus, larger numbers of items and test data are collected in an efficient way from field testing than from think-alouds. Providing the comparisons between think-alouds and field testing is not an argument against conducting field tests. However, the different kinds of information obtained from a think-aloud and from a field test could be used in complementary ways to validate items on achievement tests. The study by Hamilton et al. (1997) provided empirical evidence to support this assertion. A final thought about the think-aloud method comes from Norris (1990) who said, Verbal reports of thinking would be useful in the validation of multiple- choice critical thinking tests, if they could provide evidence to judge whether good thinking was associated with choosing keyed answers and poor thinking was associated with unkeyed answers (p. 55). Norris' comment about the accessibility of poor thinking associated with unkeyed answers is an important idea when the users of the information are teachers. Teachers often want to know why a student answered an item incorrectly. The following questions often come to mind, 0 Did the student simply misread the test item? 0 Was the item defective? 0 Did the student have a misconception about the particular concept that prohibited him or her from responding correctly? 14 The teacher has a difficult time answering any of the questions without specific information obtained directly from the student. The think-aloud procedure meets this need. Compaflbilitv of Items Written in Multiple Item Forma_ts_ Research that examines the relationship between cognitive processes and item formats is vulnerable to how items are selected for inclusion in the study. One cannot assume that an item written in the constructed-response format better assesses cognitive processes than an item written in the multiple-choice format or vice versa. Chaucey and Dobbin (1963) said, “multiple-choice questions can be written so as to require substantial thought.” Hamilton et al. (1997) stated that there are some multiple-choice items that assess more than factual knowledge; typically, this happens when the items require students to generate answers that have not been previously memorized. Hamilton et al. also say that performance items may assess factual or simplistic knowledge when written to assess those kinds of knowledge. For example, some constructed-response items require examinees to provide a short list of facts that are easily recalled directly from instruction. A question that requires students to list the steps in the water cycle is an example of this type of low-level item. With adequate item writing training, experience, and skill, multiple-choice and constructed-response items can be written at levels above recall. 
Haladyna (1994) provides extensive information about the technique of writing items to assess a range of cognitive processes and difficulty levels.

When researchers study the effects of item format on cognitive processes, they typically match pairs of existing items, one multiple-choice and one constructed-response, using content as a means to match them. Matching two items based on content minimizes error introduced into the equation by only allowing the variable of interest, in this case cognitive processes, to vary. The following studies show how the researchers matched items when examining whether item format and cognitive processes interacted.

Campbell (1995) used NAEP reading items to look for an item format and cognitive process interaction. He used existing items and attempted to create item pairs, one multiple-choice and one constructed-response, so that the two items were as "similar" in content as possible. Three criteria were utilized to select and match the items: (a) NAEP reading stance classification (initial understanding, developing an interpretation, personal reflection, or critical stance), (b) national percent correct, and (c) type of reading text or situation (literary experience, informational, or perform a task). Even by Campbell's admission, there was no guarantee, even after matching the items as best he could, that the content and comprehension aspects were similar between the paired items. Also, items that appear similar in content might vary in terms of quality. For example, in a set of matched items the multiple-choice item may have been a better item in terms of the depth of knowledge needed by the respondent to select the best answer, whereas the constructed-response version may have invited a vague or surface-level response. A lesson from Campbell's (1995) work was that matching existing items did not necessarily guarantee that the matched items were assessing similar content and/or cognitive processes. Thus, when examining cognitive processes associated with different item formats, the quality and content of the paired items must be comparable.

Martinez (1991) used the stem-equivalent approach when comparing item-level statistical characteristics of figural items (items that require students to construct a response and use figural information, such as illustrations or graphs, as the response medium) and multiple-choice items in the area of science. Martinez wrote 25 figural science items to match the NAEP science specifications. He then matched the figural items with 25 existing NAEP multiple-choice items. The 50 items were administered on parallel test forms. He did not draw conclusions about the comparability of the stem-equivalent items, but he did report item statistics for the matched items, which was the intended purpose of the study. His finding suggested that the figural items were comparable to or better than their multiple-choice counterparts in terms of item difficulty and discrimination. The finding is useful for showing how different item formats compare in terms of item statistics. The researcher did not examine the cognitive processes associated with the two item formats or the degree of content or cognitive process similarity between the item formats. In fact, Martinez proposed that additional research was needed to determine the extent to which item formats draw upon unique abilities.
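For readers unfamiliar with the two statistics Martinez reported, classical item difficulty is the proportion of examinees who answer an item correctly, and discrimination is commonly indexed by the point-biserial correlation between the item score and the total test score. A minimal sketch with fabricated response data (not Martinez's):

```python
import numpy as np

# Hypothetical scored responses: rows are examinees, columns are items (1 = correct).
scores = np.array([
    [1, 1, 0, 1],
    [1, 0, 1, 1],
    [0, 1, 0, 0],
    [1, 1, 1, 1],
    [0, 0, 0, 1],
])

total = scores.sum(axis=1)

# Classical difficulty (p-value): proportion correct for each item.
difficulty = scores.mean(axis=0)

# Discrimination: point-biserial correlation of each item with the total score
# (a corrected version would exclude the item from the total).
discrimination = np.array([
    np.corrcoef(scores[:, j], total)[0, 1] for j in range(scores.shape[1])
])

print("difficulty:", np.round(difficulty, 2))
print("discrimination:", np.round(discrimination, 2))
```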
Martinez (1991) did not report whether he conducted a content item review of the new figural items to ensure that they mapped back to the science framework. It also was unclear if he purposefully used the item specifications to write the figural items or whether he randomly wrote 25 figural items and then determined which of the 25 multiple-choice items most closely matched the figural items. If he did not first select the multiple-choice items, identify the item specifications that mapped to the items, and then develop the figural items using the identified specifications, the content match across the item formats may have been different from the beginning. Furthermore, the statistical differences he found could have been confounded with substantive differences in the items themselves. The same argument holds for matching cognitive processes between the existing multiple-choice items and the new figural items.

Martinez's (1991) work established the basis for writing items for this study, but I added an extra step to the process to address the comparability of content across item formats, as discussed above. To do this, I borrowed ideas from Frederiksen's (1984) research on test bias and Haladyna's extensive work in item development in an attempt to write content-comparable items. Details about the item writing process are reported in Chapter 3.

Mathematical Cognitive Processes

Demby (1997) used a retrospective interview approach to determine which procedures students used to perform algebraic operations on classroom-level tests. A cohort of 108 students was first tested in the 7th grade and re-examined in the 8th grade. The study was conducted in two phases. First, the students were administered an algebra test in their regular classroom. The researcher then analyzed the students' written work, classified the observed errors, and selected 51 students to participate in a follow-up interview. In the second phase of the study, Demby returned each student's original test. The students were instructed to correct any mistakes they made during their original solution strategies and re-work the items. After the students corrected the mistakes, the researcher interviewed the 51 students and asked them to explain how they obtained the answer to each item.
Gerace and Mestre (1982) examined the cognitive processes employed by 9‘h graders enrolled in Algebra I classes, and more specifically the errors students made when solving algebra problems. Data were collected using a think-aloud procedure. The results indicated that students had difficulty differentiating between labels and variables. For example, the students were presented with the following question, Use S and P to represent that there are 6 times as many students as professors at this university. Thirty- five percent of the 14 students wrote 6S=P. The interviewers concluded that the students used S and P as labels rather than treating them as variables. Students made the same 19 error in three more “label versus variable” items like the one presented above. In fact, the researchers concluded that the first noun the student read in the problem statement triggered the students to treat the variable as a label. The researchers also concluded that many of the students approached algebra as rule-based rather than concept-based. But, they observed that students often misapplied the use of algebraic rules. This finding was similar to Demby’s (1997) finding. The last study reviewed was conducted in the early 1980’s and therefore was solidly grounded in information-processing theory. Leino (1981) investigated the relationship between cognitive processes and mathematical achievement (i.e., course grades), among other things. To examine the types of processes students used, Leino collected think—aloud data on 21 7th grade students when solving mathematics items. He described the mathematics items as a collection of problems or tasks that assessed arithmetic, algebraic, and geometric problems. The items were included in an appendix; all of the items were presented in the constructed-response format. Of particular interest was the list of cognitive processing and strategies Leino listened for while coding student think-alouds on mathematics items. The processes were grouped into three general categories: 1. Obtaining information 0 Perceiving the given information (facts, figures, etc.) 0 Perceiving geometric information in embedding context 0 Finding out the relations between information given 0 Grasping the formal structure of a problem 20 2. Processing information 0 Using trial-and-error method 0 Using appropriate notations and combining them to the initial information 0 Getting the expression of the solution 0 Operating with numerals and other symbols 0 Drawing inferences - Generalizing objects, relations, and operations 0 Changing the direction of reasoning (forward-backward) o Curtailing the reasoning process or using some curtailing model 0 Making helpful drawings, figures, or graphs 0 Processing fast 3. Retaining and recalling information - Recalling terminology, formulas, or concepts 0 Recalling generalizations o Recalling problem type These processes represented one perspective about the development of cognitive processes in mathematics; they also served as a basis for subsequently comparing the cognitive processes developed for the current study. _S_u_rr;m_arv of Literature Review A literature review was conducted on four primary aspects of this study 1) the amount of new information gained when mixed item formats are used on one test, 2) the think-aloud methodology as a research tool, 3) issues related to item characteristics and 21 item quality, and 4) what is already known about mathematical cognitive processes. Each aspect is briefly summarized below. 
The findings from the first section of the literature review were mixed. In some studies, the researchers found significant differences between item formats; findings from other studies indicated no differences. But I did observe that these findings varied by content area. That is, in some areas constructed-response items seemed to add little or no information, but in others they seemed to add unique information to the process of making decisions on the basis of test scores.

The second section of the literature review cited the various ways researchers have used think-alouds. For instance, researchers used think-alouds to (a) determine whether test items actually measure the cognitive processes intended by the item writers, (b) examine group differences, and (c) examine how useful the verbal data would be for supporting the findings from a statistical analysis. One researcher advocated the use of think-alouds during test development as another way to assess item performance. Although none of the studies employed think-alouds to specifically examine and identify cognitive processes used by test takers, they do confirm that the qualitative methodology would be an effective method for answering this type of question.

Findings from the third section of the literature review indicated that it was difficult to create item pairs from already existing items. I concluded that the limited number of items in a test's item bank and the unbalanced number of multiple-choice and constructed-response items available in an item bank compound the difficulty of matching two items.

The last section of the literature review pointed to the variety of procedures researchers have used to investigate the cognitive processes students used when answering mathematics items. Some of the researchers used prominent learning theories to create a priori categories, which were then used to code interview data. One researcher allowed the categories to emerge from retrospective interview data. Most of the categories differed across the studies, but two researchers concluded that middle school students often misapplied algebraic rules when solving items.

Contribution of this Study

This study contributes to the literature in a few unique ways. First, all of the researchers who have examined the value-added of combining multiple-choice and constructed-response items on a test used analytical models. Although my question examines the issue from a value-added lens as well, I also employed a qualitative approach that I believe better assesses the question in general, and my question in particular. For example, if the goal of examining value-added is to ascertain the amount of added technical information (i.e., an IRT information perspective) gained by combining item formats on a test, then analytical models are the most appropriate means for that examination. But I took a different perspective on the value-added question and focused on whether we gain information about the content area by looking at the cognitive processes students used when solving the two item formats. The think-aloud procedure assesses this question better than an analytical model does. A secondary, but no less important, purpose of the study was to encourage practitioners and psychometricians to consider the benefit of the think-aloud methodology to inform classroom instruction and curriculum and an item's contribution to the content area being assessed.
Think-alouds can illuminate both the similar and different ways that students solve algebra test items, which could result in a change in the way teachers instruct or test developers write items. An analytical model would not have provided as useful information for these types of purposes.

Third, the results could conceivably contribute to the art of item writing. The item writing technique used in this study could be used to generate multiple items quickly, in both multiple-choice and constructed-response item formats. All of the items would presumably measure the same content, but they would perhaps elicit different cognitive processes and thereby contribute to the depth of assessing the content area.

Last, the results of this study provide information about what cognitive processes 8th grade students use when solving algebra items regardless of format. Researchers could compare and contrast these processes with their own research experiences or with other research available in the literature. Other researchers could single out the methodology and duplicate the study using another content area. Regardless of which parts of the study are excerpted, the overall contribution of this study is two-fold. One, I hope to encourage measurement professionals to think about item information in an alternative way from how it is traditionally considered. Two, I want to encourage measurement professionals to use a non-traditional measurement tool as they continue to examine how students interact with test items.

CHAPTER III
Study Design and Procedures

Research Question

A grounded theory model was used to examine the verbal protocols and answer the following research question: Are different cognitive processes used by test takers when responding to multiple-choice and constructed-response mathematics items?

Sample

The sample was drawn from two school districts in the Lansing, Michigan region. Two schools participated in the study, one school from each school district. One school was in an urban school district setting, with an ethnically diverse student population and a range of low to middle socioeconomic status. The other school was in a suburban setting, with a less ethnically diverse student population composed primarily of students of middle socioeconomic status. All of the students were enrolled in the 8th grade. No students were intentionally omitted from participating in the study, but, as will be explained later, not all participated. Gender and ethnic information were used to provide details about the composition of the sample, rather than used as independent variables. See Table 1 for the demographic composition of the sample.

Table 1. Demographic Information

                    Urban School   Suburban School   Total Students
Female                    8              10                18
Male                      9               7                16
Total                    17              17                34
African-American          2               1                 3
Asian                     1               0                 1
Hispanic                  3               0                 3
White                     7              16                23
Multi-racial              1               0                 1
Other                     1               0                 1
Blank                     2               0                 2
Total                    17              17                34

As seen, the number of boys and girls comprising the sample is similar. Most of the students were white (68%), with a small number of students represented by the other ethnic categories.

Procedure for Sample Selection

The selection of students was nonrandom because the teachers had the option of withholding their classes from participation. Teachers at the urban school selected students in two out of four 8th grade mathematics classrooms, and teachers at the suburban school allowed students from four out of six 8th grade mathematics classrooms to participate.
The four suburban classrooms represented four different levels (tracks) of mathematics instruction: transitional, regular, pre-algebra, and algebra. Classrooms in the urban school were not tracked, or at least not identified by the teachers as being tracked.

Parent permission was obtained prior to data collection. To facilitate this process, the classroom teachers distributed, and collected, parent permission letters to every 8th grade student in the selected classrooms. Passive parental permission (if a parent did not return the letter, then his/her child could participate in the study) was used in both schools. Over 400 students received permission and became part of the sampling pool. For unknown reasons, ten suburban parents and one urban parent denied permission. As parent permission was obtained, I kept a master list of student names that was subsequently used to select the sample. (Students and teachers knew from the beginning that not every student would be asked to participate in the think-aloud study, as only a small number of students were needed.)

A sample of 24 students was originally planned, but I oversampled to account for attrition and other unforeseen problems that would reduce the final sample size. I selected 34 students, 17 from each of the two schools, using a version of systematic sampling with a random start. This type of sampling method can result in a biased sample if the list is ordered (i.e., alphabetical or rank ordered according to a criterion measure) (Fraenkel & Wallen, 1993). The list was not ordered in any particular way. Nonetheless, as another precaution against bias, I showed the 34 student names to their respective teachers to verify that a range of mathematics achievement was represented. The teachers verified the sample's range of achievement. Because of the initial limitation imposed by the teachers, it was impossible to attain a true random sample. But the sample was randomly selected within the school sampling constraints. Perhaps even more important, the sample size for this study was very small. Thus, the combination of the constrained random sample and the small sample size precludes generalizing the results beyond the 8th grade or beyond the schools where this study occurred.

Instruments

Three instruments were used to collect the data: the test booklet, the protocol guide, and a short demographic survey. The interviewers also used the protocol guide to record notes about students' responses during the think-alouds. The instruments were all pre-coded with a unique number that was matched to each student's name.

The test booklet. Two test booklets, composed of items that originally appeared in the National Assessment of Educational Progress (NAEP) program and one item from the Balanced Assessment Package (1997) project, were used to collect the data. A total of 17 algebra items appeared in each booklet. In each booklet, eight of the items were multiple-choice and nine were constructed-response. The two item formats were dispersed throughout each booklet. The non-secure items are presented in Appendix B.

The protocol guide. The protocol guide mirrored the test booklet, with the addition of item-specific prompts and space for the interviewers to record notes. The interviewers used the protocol guide as a script, which ultimately served to help standardize the think-aloud procedure.

The surveys. The students responded to a short demographic survey. They provided information about their gender, age, frequency of doing math and reading homework, and school name. The students recorded the information themselves.
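As a brief aside to the sample-selection step described earlier in this chapter, systematic sampling with a random start can be sketched in a few lines; the roster, list size, and interval below are hypothetical and are not the study's actual permission list:

```python
import random

def systematic_sample(roster, n):
    """Draw n cases by systematic sampling with a random start."""
    k = len(roster) // n                 # sampling interval
    start = random.randrange(k)          # random start within the first interval
    return [roster[start + i * k] for i in range(n)]

# Hypothetical roster of students with parental permission.
roster = [f"student_{i:03d}" for i in range(1, 201)]
print(systematic_sample(roster, 17))
```

Because the interval steps evenly through the list, any ordering in the roster (alphabetical or ranked) would propagate into the sample, which is the bias noted in this chapter.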
They provided information about their gender, age, frequency of doing math and reading homework, and school name. The students recorded the information themselves. Algebra Strand Typically, a content framework defines the content and cognitive processes measured on a test. Framework developers often write a framework to represent a particular mathematics program. Although several different mathematics programs are in 28 use, the intent of this study was not to evaluate a particular program or to compare two or more programs. Instead, an effort was made to select a neutral mathematics framework — a framework that reportedly did not promote a specific mathematical program. The National Assessment of Educational Progress (NAEP) was a testing program that meets this criterion. Because I used NAEP items as the medium for data collection, the five mathematical content strands assessed on NAEP limited my choices of content areas. The NAEP mathematical construct is defined as five content strands: number sense, properties and operations; measurement; geometry and spatial sense; data analysis, statistics, and probability; and algebra and functions. Of the five content strands, the selection of algebra, from which to select items, was not an arbitrary decision for many reasons. First, the NCT M Standards targeted algebra instruction at the eighth grade (Silver, 1997). Second, it is commonly the students’ first class in mathematics where they are introduced to abstract concepts compared to the more concrete mathematical operations and number manipulations taught in early grades. And third, understanding algebraic concepts provides the foundation needed to be successful in more advanced mathematics courses. Furthermore, the cognitive processes available to be studied were limited to the cognitive processes defined on NAEP. As described in the NAEP mathematics framework (1996) items are written to assess one of three mathematical abilities: conceptual understanding, procedural knowledge, and problem solving. The mathematical abilities describe the characteristics of the knowledge or process needed by the respondent to successfully manage the task presented in the item. Thus, when we are 29 in the parlance of N AEP, mathematical ability represents the cognitive process an item is supposed to elicit. The NAEP framework provides detailed descriptions of the three processes assessed on the test. According to the NAEP framework (1996), conceptual knowledge is defined as a class of objects that share a common set of characteristics. Procedural knowledge is defined as a series of related actions connected with an object or result. And, problem solving is defined as a combination of conceptual and procedural knowledge. The N AEP designers provided more complete and descriptive definitions for each of the mathematical abilities. Conceptual Understanding 0 recognize, label, and generate examples and nonexamples of concepts; 0 create, interpret, and relate models, diagrams, graphs, and varied representations of concepts; 0 identify and apply mathematical principles; 0 make valid statements that generalize relationships among concepts in conditional forms; 0 understand the meaning of facts and definitions; 0 compare, contrast, and integrate related concepts and principles; 0 recognize, interpret, and use the signs, symbols, and terms used to represent concepts; or 0 interpret the assumptions and relations involving concepts in mathematical settings. 
30 Procedural Knowledge 0 select and apply appropriate procedures correctly; 0 analyze the efficiency of different procedures; 0 verify or justify the correctness of a procedure using concrete models of symbolic methods; 0 apply important formulas; or extend or modify procedures to address factors inherent in problem settings. Problem solving 0 use accumulated knowledge of mathematics in new situations; 0 recognize and formulate problems; 0 understand assumptions made with respect to given information; 0 use strategies, data, models, and relevant mathematics; 0 generate, extend, and modify procedures; 0 use reasoning in new settings (i.e., inductive, deductive, algorithmic, or algebraic); or 0 judge the reasonableness and correctness of solutions. Students were not likely to use all of the components within each definition as they respond to any one item, but the complete definition was given for the reader’s benefit. And, because any one item cannot capture every component of the definition, the items selected for the study limited which part(s) of the definition(s) were used by the students as they responded to an item. 31 Design The study was designed to determine whether students use different cognitive processes when responding to multiple-choice and constructed-response items. The presentation of the design is organized into several sections. The first section is a review of a pilot study conducted prior to this study, because some of the decisions for this study were based on what I learned from the pilot study. The second section is a description of how the items for this study were developed. The third section explains how the items from each item family were assigned to the test booklets and to the students. The fourth section describes the testing procedures used to collect the data. Section 1: Pilot Study I conducted a pilot study during the summer of 1997 for another research project (Pearson and Garavaglia, 1997). Eighth grade children participating in an after school program participated in the study. Many of the students were from the urban school used in the current study. Three questions were examined, 1. How many items could 8‘h grade students answer during an hour think-aloud session? 2. Could 8‘h graders sustain thinking aloud for an hour? 3. Which mathematical strand, either algebra or measurement, worked best during a think-aloud? Findings from the pilot study indicated that students could easily answer up tol4 items (7 multiple-choice and 7 short constructed-response) in a 50-minute to one-hour think- aloud session, without experiencing fatigue. 32 Findings related to the third question indicated that measurement items provided almost no evidence about the cognitive processes used by respondents when solving the items. First, the items were so basic and not engaging that the evidence gathered was not very telling. This was seen for all of the measurement items piloted. Second, many of the measurement items required the students to use a ruler or a protractor to measure a diagonal or an angle. The students talked about how they used the tool rather than how they solved the problem. Some mathematics educators may say that describing how respondents use a tool is evidence about how one solves an item, but the items themselves did not allow for variation in responses, because they were very easy. Based on the findings from the pilot study, measurement items were not included in the current study. 
Item Selection for the Pilot Study This section briefly describes the number and types of items selected for the pilot study. The entire pool of 1992 and1996 released algebra and measurement NAEP items were available to select the pilot items. A total of 14 items were selected for the pilot study; seven of the items assessed algebra and seven assessed measurement. The item selection process was limited by the number of constructed-response algebra items in the set of released N AEP items. There were only three constructed-response algebra items in the entire set of the released algebra items, therefore all three of the constructed-response items were included in the pilot study. Four constructed-response items from the measurement area were then selected. In order to have seven items in each area, four algebra items were multiple-choice items and three measurement items were multiple- choice. See Table 2 for the distribution of measurement and algebra items by item 33 format. As displayed, a total of 14 items were used in the pilot study, seven from each mathematical strand and seven from each of two item formats. Table 2: Distribution of Item Format and Mathematical Strand Math Area MC Format CR Format Total Measurement 3 4 7 Algebra 4 3 7 In addition to the number of items represented in each item format, three other item-related criteria were used to select the items. Two of the criteria were statistical in nature and the third criterion was based on the cognitive process (procedural knowledge, conceptual knowledge, or problem solving) associated with the items. Item difficulty and discrimination statistics (the IRT parameters were obtained from operational administration of the NAEP items) were used to obtain a range of items, in terms of their statistical properties. Although an attempt was made to select items from each of the three cognitive processes, many of the items came from the problem-solving dimension. Items within each strand, measurement and algebra, were matched across item format by using the above three item related criteria: item difficulty, item discrimination, and cognitive process. A match was defined as two algebra items, for example, from different item formats, with similar item difficulty and discrimination statistics, and the same cognitive process. However, by the end of the pilot study I learned, as did Campbell (1995), that matching on these criteria did not necessarily result in "perfectly" matched item pairs. Furthermore, obtaining close matches using the three criteria was often difficult when using already existing items — the items were not intentionally developed for the purpose of this type of study. Based on what was learned from the 34 pilot study and from Campbell’s (1995) experience, it became apparent that an alternate method for matching items was needed for the full-scale study, namely a method that did not exclusively rely on existing items. An unanticipated finding from the pilot study also informed the current study. I developed a protocol guide that included common prompts across items and item—specific prompts, to standardize the think-aloud sessions. The protocol guide booklet worked well. The protocol was user—friend] y and the prompts were easily understood by the students. The protocol guide was very helpful during the think-aloud sessions because it standardized the think-aloud sessions across students (Ericsson and Simon, 1993). I retained the protocol guide for this study. 
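The item-matching criteria described above (same cognitive process, similar item difficulty and discrimination) were applied by hand in the pilot study; the dissertation does not describe an algorithm for the matching. Purely as an illustration, the following Python sketch shows one way such matching could be operationalized. The item records, parameter values, and function names are hypothetical, not the NAEP statistics or procedures actually used.

    # Hypothetical item records; the real NAEP parameters are not reproduced here.
    mc_items = [
        {"id": "MC1", "difficulty": -0.4, "discrimination": 0.9, "process": "problem solving"},
        {"id": "MC2", "difficulty": 0.6, "discrimination": 1.1, "process": "conceptual understanding"},
    ]
    cr_items = [
        {"id": "CR1", "difficulty": -0.3, "discrimination": 1.0, "process": "problem solving"},
        {"id": "CR2", "difficulty": 0.9, "discrimination": 0.7, "process": "conceptual understanding"},
    ]

    def distance(a, b):
        # Closeness on the two statistical criteria (equal weighting is an assumption).
        return abs(a["difficulty"] - b["difficulty"]) + abs(a["discrimination"] - b["discrimination"])

    # Greedy pairing: require the same cognitive process, then take the nearest statistics.
    pairs = []
    available = list(cr_items)
    for mc in mc_items:
        candidates = [cr for cr in available if cr["process"] == mc["process"]]
        if not candidates:
            continue  # no constructed-response item assesses the same process
        best = min(candidates, key=lambda cr: distance(mc, cr))
        pairs.append((mc["id"], best["id"]))
        available.remove(best)

    print(pairs)  # [('MC1', 'CR1'), ('MC2', 'CR2')]

A greedy pairing like this also makes the limitation discussed above concrete: with a small pool of existing items, the "nearest" item on the statistical criteria can still be a poor content match.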
The information obtained from the pilot study was valuable in many ways. The three questions examined by Pearson and Garavaglia (1997 ) provided information about how to design a think-aloud study. And, I learned how to conduct a research study in a school setting (e. g., negotiating a space, getting students out of class, accurately projecting the length of time students would spend in a think-aloud session). Everything that was learned during the pilot study was carried forward to this study in an attempt to improve upon what should or should not be done. Section 2: Item Development The goal of writing items for this study was to develop multiple-choice and constructed-response algebra items that were as comparable in content as possible. It was important to hold content constant across formats so that any differences in cognitive engagement that might be observed could be attributed to format. In other words, item format was the only item related factor allowed to vary in this study. 35 There is no one tried and true method of developing comparable or parallel items mentioned in the literature. To maximize the chance of developing genuine comparability across item formats, I relied on the experiences of others (c.f., Campbell, 1995; Frederiksen, 1984) as well as on existing item development procedures (Haladyna, 1994). One way to write similar items in two different item formats is to use already developed multiple-choice and constructed-response items and transform them into the corresponding item format (Frederiksen, 1984). Fredericksen suggests that this approach maximizes the likelihood of obtaining construct equivalence. On the surface, removing or adding response options to existing items seems to be a good suggestion. However, Frederiksen also suggests that existing items should be used when making the conversion. But, taking into account what was learned from the experiences of other researchers, 1 did not think that Frederiksen’s suggestion was sufficient, in and of itself, for the purpose of this study. So, I buttressed Frederick's item conversion suggestion with Haladyna's (1994) item-shell method of writing items. The term “model-shell” references the item writing method used for this study. Model-shell Item Selection Criteria I started the item development process by choosing eight NAEP items and used them as models for developing the other items. Because the model item became one of the items studied, the model had to meet certain item selection criteria. The four criteria were: 0 Items had to measure algebraic patterns. 0 Items represented a range of item difficulties (to ensure variability in the data). 36 0 Respondents had to use different algebraic equations to solve the items. 0 Four of the eight model items had to be multiple-choice items and four had to be constructed-response items. The first three item selection criteria were met; however, the last criterion was not met. After sorting all of the 1992 and 1996 released and secure NAEP items that measured algebraic patterns, I discovered that few of these items were written in the constructed-response format. Achieving a perfect balance in the number of multiple- choice and constructed-response model items was not possible. The final selection of model items was five multiple-choice and three constructed-response items. 
One of the multiple-choice model items is used here to simultaneously illustrate the item writing procedure and to introduce the notion of a "family" of items: two multiple-choice and two constructed-response items, each with similar content. I followed the same item writing process whether the original item was a multiple-choice item or a constructed-response item. Table 3 displays the composition of items within an item family.

Table 3. Composition of an Item Family

  Item   Format
  1      Original: original content and format (either MC or CR)
  2      Converted: change the format (MC to CR, or CR to MC)
  3      Transformed: a "clone" of the original; for example, different numbers or different stimuli (stars versus dots) might be used
  4      Converted Transformed: change the format of the transformed (clone) item

Item Writing Procedure and an Example
The first item in the family was the original NAEP item. To write the second item in the family, the response options were removed to convert a multiple-choice item into a constructed-response item (or response options were added to convert a constructed-response item into a multiple-choice item). The conversion left the item stem intact. An example is provided.

Original NAEP multiple-choice item:

  Puppy's Age    Puppy's Weight
  1 month        5 lbs.
  2 months       12 lbs.
  3 months       17 lbs.
  4 months       20 lbs.
  5 months       ?

1. Jim records the weight of his puppy every month in a chart like the one shown above. If the pattern of the puppy's weight gain continues, how many pounds will the puppy weigh at 5 months?
   A. 30   B. 25   C. 23   D. 21

Converted constructed-response item:

  Puppy's Age    Puppy's Weight
  1 month        5 lbs.
  2 months       12 lbs.
  3 months       17 lbs.
  4 months       20 lbs.
  5 months       ?

2. Jim records the weight of his puppy every month in a chart like the one shown above. If the pattern of the puppy's weight gain continues, how many pounds will the puppy weigh at 5 months?
   Answer:

At this point, two of the four items in the item family were written. As seen in the example, the same algebraic equation can be used to answer both items, regardless of the item format. The original multiple-choice item then served as a model-shell for writing the third and fourth items of the item family (see Table 4). To hold the content constant within the item family, the transformed items assessed the same algebraic concept as the original item but slightly altered some feature of the item, such as the numbers that define the pattern (and therefore the specific equation needed to solve the problem). For example, in the original and converted sample items, the puppy's monthly weight gain decreases by a difference of two pounds each month. For the transformed items, the puppy's monthly weight gain decreases by a difference of one pound each month. The transformed constructed-response (or multiple-choice) item was then converted to the transformed multiple-choice (or constructed-response) format. The two examples below illustrate the development of the third and fourth items in an item family.

Transformed constructed-response item:

  Puppy's Age    Puppy's Weight
  1 month        10 lbs.
  2 months       15 lbs.
  3 months       19 lbs.
  4 months       22 lbs.
  5 months       ?

3. John records the weight of his puppy every month in a chart like the one shown above. If the pattern of the puppy's weight gain continues, how many pounds will the puppy weigh at 5 months?
   Answer:

Rewritten transformed multiple-choice item:

  Puppy's Age    Puppy's Weight
  1 month        10 lbs.
  2 months       15 lbs.
  3 months       19 lbs.
  4 months       22 lbs.
  5 months       ?

4. John records the weight of his puppy every month in a chart like the one shown above. If the pattern of the puppy's weight gain continues, how many pounds will the puppy weigh at 5 months?
   A. 30   B. 27   C. 25   D. 24

The four items represent a "family" of comparable items, two written in the multiple-choice format and two written in the constructed-response format. In total, eight families of four comparable items were developed, resulting in 32 items. (See Appendix B. To maintain the integrity of the secure items, only the publicly released NAEP items appear in Appendix B.) As seen, the NAEP constructed-response items consist of short-answer (one or two sentences), fill-in-the-blank, or extended constructed-response item types.
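To make the family structure concrete, the following sketch (hypothetical code, not part of the study's procedures) assembles the four members described in Table 3 from a single stem, mirroring the puppy-weight example above: the original multiple-choice item, its constructed-response conversion, a transformed version with altered numbers, and the conversion of that transformed version. The function and field names are assumptions made for illustration only.

    # Hypothetical sketch of a four-member item family (not the study's tooling).
    def build_family(question, original_rows, transformed_rows, original_options, transformed_options):
        def stem(rows):
            table = "\n".join(f"{age:<10}{weight}" for age, weight in rows)
            return f"{table}\n{question}"

        return [
            {"member": "original",              "format": "MC", "stem": stem(original_rows),    "options": original_options},
            {"member": "converted",             "format": "CR", "stem": stem(original_rows),    "options": None},
            {"member": "transformed",           "format": "CR", "stem": stem(transformed_rows), "options": None},
            {"member": "converted transformed", "format": "MC", "stem": stem(transformed_rows), "options": transformed_options},
        ]

    family = build_family(
        "If the pattern of the puppy's weight gain continues, how many pounds "
        "will the puppy weigh at 5 months?",
        [("1 month", "5 lbs."), ("2 months", "12 lbs."), ("3 months", "17 lbs."),
         ("4 months", "20 lbs."), ("5 months", "?")],
        [("1 month", "10 lbs."), ("2 months", "15 lbs."), ("3 months", "19 lbs."),
         ("4 months", "22 lbs."), ("5 months", "?")],
        ["30", "25", "23", "21"],
        ["30", "27", "25", "24"],
    )
    print(family[1]["stem"])  # the converted constructed-response version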
The final step in the item development process was a content review to ensure that all of the items measured algebraic patterns. A mathematics educator reviewed the 32 items for content validity. She also reviewed the four items within each item family to verify their content comparability.

In addition to the NAEP items, a performance item from the Balanced Assessment Package (1997) was included in the item set. Items from the Balanced Assessment Package were intentionally developed to be integrated with classroom instruction and to assess mathematical concepts common to middle school curricular goals. The Balanced Assessment item selected for this study measured an algebraic pattern, consisted of multiple, scaffolded steps, and required several minutes to solve. I purposefully added this item to the assortment of NAEP items so that students could respond to an item that had been intentionally developed to appear on a performance assessment. So, if I found no differences between the NAEP constructed-response and multiple-choice items but found some differences in the cognitive processes elicited by the Balanced Assessment item, I would be able to attribute the absence of between-format differences to the idea that the constructed-response items did not tap the sorts of cognitive processes that were tapped by the performance item. To this end, the constructed-response items could be thought of as multiple-choice items in disguise.

In summary, the purpose of writing new items, rather than only using existing items, was to obtain a set of comparable items whose content would be similar across the two item formats. The item writing process was purposefully developed because, as indicated in previous research (Campbell, 1995; Haladyna, 1994), matching items by using item difficulty and discrimination statistics does not guarantee that the matched items will have equivalent content.

Section 3: Assignment of Items and Students to Forms
Recall that four items defined a family of items. Placing items with the same pattern but a different item format in the same test form would likely introduce item dependency and a practice effect. To address these issues, I assigned items with the same pattern to two different forms. Conversely, items with the altered patterns were assigned to the same form. Table 4 represents a sample assignment of items to different forms. This distribution resulted in the assembly of the two test forms.

Table 4.
Assignment of Items to Test Forms Form Designation Item Assignment A Original content and format (multiple-choice) A Transformed constructed- response: slightly changed content—different format than original B Converted constructed-response: same original content-different format than original version B Rewritten transformed multiple- choice: same slightly changed content-different format than transformed version The two forms were randomly assigned to the 34 students. Random assignment would control for pre-existing achievement level differences within the non-random sample (Stanley and Campbell, 1963). And, random assignment of forms would control for curricular and instructional differences between the classes. By randomly assigning the forms, half of the students responded to two members of an item family (say 1 and 4) while the other half responded to the other two members of a family (2 and 3). One member of a given family was randomly assigned to a serial position within the first half of a form; the other member was assigned a comparable position within the second half of that form. The last item in each form was the performance item from the Balanced Assessment package. Section 4: Testing Procedures An interviewer escorted each student from his or her regular classroom to a quiet room where the think-aloud took place. Prior to the start of an interview, the interviewer told every student what to expect during the think-aloud session, to eliminate or reduce 43 any feelings of nervousness or apprehension. The explanation included the interviewer’s role and the student’s role throughout the session. Furthermore, the interviewer assured each student that his or her answers to the items would not count towards classroom grades. The interviewer also explained that the intent of the think-aloud was to obtain verbal accounts of what the student was thinking as he or she solved the items, rather than whether or not the student provided a correct answer. Finally, the interviewer used a protocol guide during every think-aloud session, to standardize each of the 34 sessions. Seventeen algebra pattern items presented in multiple-choice (8 items) and constructed-response (9 items, one was a performance task) formats were administered during a single think-aloud session. Each session lasted about an hour and was audiotaped. The reason for taping the think-alouds was to facilitate the transcription of the qualitative data. Hand-written notes also were taken during the think-alouds, however copious notes were not recorded to ensure that the interviewers would not miss something a student said or miss an opportunity to probe a student’s verbal account. The following steps were followed for every think-aloud: l. Introductions between the interviewer and the students. 2. The interviewer explained what a think-aloud was and shared with the student exactly what would happen during the session. 3. The demographics survey was completed by the student. 4. The interviewer began the session with a warm-up think-aloud question. The interviewer answered the question first to demonstrate how to think-aloud. The interviewer then presented the same warm-up question to the student. The student answered the question while thinking-aloud. (The warm-up question asked was “how many times have you talked on the phone over the last 3 days?”). 5. If the student did not have questions, the think-aloud began. 6. 
The students were instructed to read every question aloud, and then verbally express what they were thinking while they solved each item. 7. When necessary, the interviewer reminded students to “think-aloud” if they became quiet, or introspective, while answering an item. 8. The students continued through all 17 items at their own pace. 9. The interviewer administered the “think-aloud method perception survey”. 10. The interviewer asked the students whether they had any questions about the session. 11. The interviewer thanked the students for their participation. Threat of Confounding Potential threats to the outcome of the study needed to be realized, and if possible controlled for, prior to its implementation. One potential threat may come from some students feeling inhibited to express themselves verbally because of the audiotapes. To address this threat, the students were assured their comments would be kept confidential and anonymous. Lack of student motivation may be one of the largest threats to obtaining accurate and complete information in situations like this one. That is, the students knew that no stakes were attached to their performance on the items and therefore they may not have exerted much effort to solve them. This phenomenon often is found when pilot testing new items. To counter this likely problem, the interviewer encouraged the students (and 45 teachers) to take this study seriously and to do their best when answering the items. The interviewers also told the students that the purpose of the study was to determine how they solved the items rather than on the number of items they solved correctly. Data Collection Data collection occurred during the spring of 1998. All of the data were collected within two weeks. Tape recorders were used to facilitate data collection rather than relying on interviewer notes alone. One benefit of recording the sessions was to decrease data recording errors that would likely occur with hand written accounts. The interviewers were also free to concentrate on probing the students. Conducting the data collection in two schools introduced a few logistical issues. First, a quiet location with an electrical outlet for the tape recorder was needed for the think-aloud sessions. And second, I had to work within the schedule provided to me by the teachers. As it turned out, neither of the logistical issues was difficult to solve. Adequate space was provided at both schools and the teachers were very flexible with their classroom schedules. Four trained interviewers and I conducted the think-alouds. Two interviewers were involved in the pilot study and were therefore already familiar with the protocols. I trained two additional interviewers to use the interviewer protocol guides and to conduct a think-aloud session. And, prior to conducting an interview, both of the interviewers observed one of the three experienced interviewers conduct a think-aloud, to further familiarize them with the process. Because of the limited time in which to collect all of the think-aloud data, the two novice interviewers did not conduct an initial, supervised think-aloud interview. Instead, I monitored their interviews by sitting in on some think- 46 aloud sessions to ensure that they followed the protocol guides and that they did not ask leading questions. I was accessible to the interviewers throughout data collection. Data Analyses The purpose of this section is to present the analyses used to examine the research questions. The data analyses consisted of five steps. 
These were: 1. Use descriptive statistics to examine each item’s difficulty, standard deviation, and frequency distribution of score points. 2. Use the grounded research approach to identify and validate emerging categories (in the tradition of the constant comparative analysis) in students’ verbal protocols. 3. Complete a full-scale analysis using the identified themes. 4. Compare cognitive process similarities and/or differences between item formats within item-pairs. Create an index from the original themes that represented depth of cognitive processing engagement. 5. Conduct a post hoc evaluation of Steps 2 and 3 using external evaluators who have content expertise and curriculum and instruction knowledge. Descriptive Item Level Statistics Descriptive statistics were calculated separately by form. All non-responses (e. g., skipped items) were considered wrong answers and subsequently re-coded as zeroes. I first calculated frequencies, maximum, and minimum statistics for all variables (e. g., student id, test number, form, school, iteml through item16) to verify that the data was keyed in correctly. I then calculated traditional classical test theory item means and standard deviations to get an initial examination of each item’s distribution. Finally, I 47 calculated the mean score on the eight multiple-choice items and the mean score on the eight constructed-response items. Identifv and Validate Cognitive Processes The verbal data were used to identify which cognitive processes the students used when answering the algebra questions. To that end, a grounded theory approach was used to examine whether students used different cognitive processes when responding to multiple-choice and to constructed-response items. The first step in analyzing the data was the establishment and validation of the cognitive processes used by the students as they answered the algebra items. The steps for identifying and validating the cognitive processes are presented here rather than in the methodology section for two reasons: (a) they are integral parts of the protocol analysis phase, and (b) grounded theory blurs methodology and analysis. To facilitate the initial development of categories that exemplified the cognitive process, six interviews were transcribed (almost verbatim) so that the cognitive moves were easily identifiable. Two graduate students and I began the analysis by examining several responses to one item and recording the cognitive processes used to answer the item. We then broadened our analysis by carrying the cognitive processes forward to other items and different students, revising, adding, and deleting cognitive processes as necessary (i.e., open coding). We developed plausible categories that accounted for most of the verbal data. We then tested the categories with another tape to build our confidence that the categories accounted for most of the responses (category saturation). This constant comparative nature of grounded theory gave the emerging concepts specificity because we 48 continuously asked questions of ourselves while we established the categories (Strauss and Corbin, 1990). During the initial phase of identifying the cognitive processes described above, we listened to six tapes (two tapes per person) and independently recorded the cognitive processes verbalized by the students. To ensure that we were on a similar analysis path, we met after listening to eight items and discussed the cognitive processes identified. 
We identified similar processes and were able to justify the ones that differed. We compiled a larger list of cognitive processes by combining the processes each researcher independently identified. We each then finished recording the cognitive processes associated with the remaining eight items. After coding two tapes, we met again to compare notes. Twenty-eight categories exemplified the cognitive processes used by the eighth graders (see Figure 1 below). Each category is listed with its letter code, its definition, and an example from the protocols.

A. Overall pattern recognition: indicates an overall grasp of the item. Example: understands the pattern represented in the item (from beginning to end); "pattern repeats itself."
B. Pattern not recognized: indicates lack of understanding. Example: test taker indicates, "I can't figure out what the pattern is"; "I know there's a pattern, but I don't see it."
C. No information used from item: indicates lack of organizing information. Example: "I've seen an item like this before and the answer was..."
D. Partial information used from question: indicates concern for organizing or fully understanding information. Example: student knew 28 x 2 = 56 but then did not add the last two tacks; or uses information at the beginning of the pattern and ignores information in the middle and end of the pattern.
E. All information used from question: indicates thoroughness in organizing information. Example: determines the pattern by using the information given in the item, e.g., uses all the numbers listed in a column, not just the first few.
F. Visual representation (e.g., draws picture): indicates the importance of transforming information into a manageable framework. Example: draws a chart, picture, or table to solve the item, with no indication the student understands that an algebraic equation could also be used; "I have to draw a picture to solve this"; "I have to make a chart to figure out the pattern."
G. Applicable equation used: indicates a connection between the problem and learned mathematical knowledge. Example: uses an equation to solve the item; "The equation is 28 x 2 + 2"; "I solved the item by using picture + 1 = number of pictures."
H. Informed guess (uses some data given in the item, e.g., information in the multiple-choice options): indicates concern for understanding the problem. Example: uses the multiple-choice options as a guide to solve the item; "I looked at the answers and used B to solve the pattern"; knows an answer is wrong because it isn't listed as an option.
I. Guess without use of information given (blind guess): indicates lack of understanding of the problem. Example: student admits to guessing; "I picked an answer that looks the best"; "I don't know, I just guessed."
J. Calculation error: indicates lack of concern or attention. Example: subtracts rather than adds; adds numbers incorrectly.
K. Calculation error, but adjusts answer to fit choices provided: indicates ability to recognize an error and connect it to the problem. Example: provides a wrong answer, then recognizes the answer is wrong and re-works the item.
L. Non-applicable equation used: indicates inability to connect the problem and prior knowledge. Example: an equation is used that doesn't fit the pattern; "28 / 2 = 14 - 2 = 12."
M. Test "wiseness": indicates some ability to connect the problem with prior knowledge. Example: uses something in the item to help solve it; "That choice was weird because the item said that she didn't want to draw all of the dots"; "D is too big. C is too low and 220 is kinda low. So, 420 is the answer"; "My answer doesn't make any sense."
N. Information from previous situation recalled: indicates carry-over from one situation to another. Example: student recalls how he or she solved an item in a different situation; "That's how I solved the item before."
O. Information from previous, comparable item recalled (carry-over effect): indicates carry-over from one item to another item. Example: student recalls how he or she solved a comparable item on the test; "This item looks like the other one."
P. Student returns to question and changes answer: indicates concern for understanding the problem. Example: student returns to the problem after solving the comparable item and changes the answer; "I think I did the other one wrong. I'm going to go back and check."
Q. Uses estimation: indicates ability to connect a response with a likely answer. Example: solves the problem to a certain step, sees that the answer is higher than two of the multiple-choice options, and judges another to be too high or too low; or picks the multiple-choice option closest to the answer the student computed; "because 420 is the closest to my answer."
R. Mental math (work not shown): indicates an organizational method. Example: student does not have to solve the item by writing pictures or equations on paper.
S. Student checks work: indicates thoroughness in overall approach. Example: checks a solution by using other information in the item; "I came up with an equation and checked whether it was correct by seeing if it worked for steps 2 and 3."
T. Misinterprets question asked/answers a question other than that being asked: indicates inability to connect the problem with prior knowledge. Example: student thinks the question is asking him or her to solve for something it really isn't; "I think they mean to solve for the area."
U. Partial pattern recognized: indicates a grasp of the item. Example: thinks the pattern stops at some point and a different one is used.
V. Complex pattern extension: indicates understanding of the item and ability to generalize the process. Example: student identifies the pattern and then extends it several "steps" beyond what is given in the item, e.g., information is provided for the first few steps and the student has to solve for step 20.
W. Simple pattern extension: indicates understanding of the problem. Example: sequential steps in solving the problem are provided; "It's a continuous pattern. The next arrow would be left."
X. No control of math vocabulary/says or writes an operation but does not use that operation: indicates concern for mathematical understanding. Example: student says add but then multiplies; uses nonmathematical terms to express a computation; "Numbers go up 5, down 2."
Y. Relationship between numbers given in question recognized: indicates some concern for organizing information. Example: when two sets of numbers are given, the student sees that a pattern exists between them.
Z. Relationship between numbers given in question not recognized: indicates concern for the ability to organize information. Example: when two sets of numbers are given, the student sees the information in each column as being independent; "I don't need to use the numbers in this column to figure out the pattern in this column."
AA. Grapples with information to try to solve question: indicates a tendency to consider multiple data sources or possibilities. Example: tries multiple computational strategies to solve the item; "That's not working so I have to try something else"; persists in solving the item.
BB. Vocabulary in question not understood: indicates concern for mathematical understanding. Example: does not understand mathematical terms; "I don't know what that word is" (infinity).

Figure 1. Cognitive Process Categories
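The category system was applied by hand to audiotaped protocols; no software is described in the dissertation. Purely as an illustration of how the coding scheme and coded responses could be represented for the tallies reported in Chapter IV, a Python sketch follows. The dictionary of labels, the record structure, and the example data are all hypothetical.

    from collections import Counter

    # Hypothetical representation of the coding scheme and of coded responses.
    CATEGORIES = {
        "A": "Overall pattern recognition",
        "E": "All information used from question",
        "G": "Applicable equation used",
        "S": "Student checks work",
        # ... the remaining codes through "BB" would be listed here
    }

    # Codes are recorded in the order the student verbalized them.
    coded_responses = [
        {"student": 7, "item": 3, "format": "cr", "codes": ["A", "E", "G", "S"]},
        {"student": 9, "item": 3, "format": "cr", "codes": ["A", "F"]},
    ]

    # Tallying codes per item supports the frequency tables reported later.
    tallies = Counter()
    for record in coded_responses:
        for code in record["codes"]:
            tallies[(record["item"], code)] += 1

    print(tallies[(3, "A")])  # 2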
To validate the categories, before starting the full-scale analysis, we each listened to the same two tapes and independently analyzed them using the 28 categories.
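The agreement checks described in the following paragraphs compare, item by item, the categories each rater assigned. As a rough illustration only, and assuming that a "match" means the two raters assigned identical sets of codes to an item (the text does not fully specify this), percent agreement could be computed as in the sketch below; the codings shown are hypothetical.

    def percent_agreement(rater_a, rater_b):
        # Proportion of items for which the two raters assigned identical sets of codes.
        matches = sum(1 for item in rater_a if set(rater_a[item]) == set(rater_b[item]))
        return matches / len(rater_a)

    # Hypothetical codings of three items on one form (item number -> category codes).
    rater_1 = {1: ["A", "E"], 2: ["A", "G", "S"], 3: ["D"]}
    rater_2 = {1: ["A", "E"], 2: ["A", "G"],      3: ["D"]}
    print(round(percent_agreement(rater_1, rater_2), 2))  # 0.67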
We regrouped to determine whether additional cognitive processes were identified, to further explain and discuss our interpretation of the 28 categories, and to examine the degree of agreement in identifying the cognitive processes, for each item. Because we all analyzed the same two tapes (see Figure 2 for a graphical representation) agreement was determined by comparing how each of us categorized each item. Figure 2. Tape Distribution Between Researchers For example, researchers A, B, and C’s categorizations were compared to each other. Agreement was defined as the percentage of matches across all items on a form. Agreement ranged from .91 to .98. This level of agreement indicated that we had internalized similar meanings of the 28 cognitive processes and that we were able to reliably identify the cognitive processes verbalized by the students. The Categories. Refer to Figure l for the 28 cognitive processes that emerged from the analysis. The categories were not hierarchically arranged. As the analysis progressed, it became evident that the categories appeared in different frequencies within 53 and across items. In fact, some categories were used infrequently but frequency of appearance did not result in the deletion of a category. Full-scale Analysis The two graduate students and I independently analyzed the protocol data. Because of equipment failures and/or inaudible tapes, a total of 28 protocols (about 14 protocols from each of Form A and B) were included in the analysis. Responses to some items on a usable protocol were inaudible resulting in a different number of usable responses across the items. As we listened to a tape, we recorded the category that represented the cognitive processes verbalized by the student in the same sequence as the students verbalized them. Besides using the cognitive process categorizations, the researchers took notes that explained/justified the identified cognitive processes. About mid-way through analyzing the protocols, another check of rater-agreement was conducted to ensure the reliable categorizations of the protocols. To calculate an agreement index, each researcher independently coded the same two protocols. Agreement here meant that all three researchers selected the same cognitive processes for each item. Agreement across the three researchers was high (.90). To organize the qualitative data and facilitate analysis, the categories, notes, student information (e. g., student id, school), and item information (e. g., item format, right/wrong answer) were entered into a database. One record was established for each student (that is, each student represented an individual record). Queries were used to facilitate analyzing the data in the following ways, 54 0 Cognitive processes associated with one item (i.e., frequency of categories), and 0 Cognitive processes between item formats within item-pairs. Cogaitive Process Similarities and Differences The next phase of the analyses involved examining the cognitive processes used to answer each item and then compare the similarities and/or differences of cognitive processes between the two item formats within an item-pair. For the individual item analysis, the cognitive processes were grouped for each item. The number of unique categories per item were examined to get a sense of the type of processes used to answer each item, regardless of item format. The between item format analysis appeared more informative and useful for answering the research question. 
Here, the cognitive processes between the item formats for item-pairs were analyzed using a meta-analysis-like approach. This analysis was done for all eight item-pairs on each form. The one Balanced Assessment item was compared to the other items, as it did not have a comparable multiple-choice item. One of the graduate students and I independently examined the cognitive processes for the two comparable items on a form (see Table 5) and the Balanced Assessment item. I started the analysis by comparing the cognitive processes associated with each item in the item pair. Frequencies of cognitive processes were computed for each item and presented in tables. Each item’s mean was presented as well. A narrative account was prepared for each set of items. Closer examination of the 28 cognitive process categories revealed that some represented deeper engagement of cognitive processes than did others. Thus, to focus the 55 analysis, the 28 categories were examined to determine which categories represented key elements associated with deeper cognitive engagement in algebra. To aid in identifying the key elements, I examined the cognitive processes used by some low and high performing mathematics students. Nine categories were selected as indicators of deeper cognitive engagement (see Figure 3). The categories were not listed in the table in any kind of hierarchy. The following rationale led to their selection. The first two categories were prerequisites for understanding the area of algebraic patterns. They also would provide evidence about the degree of understanding the students had about patterns. The next two categories (E and D) would capture whether students had the capacity to identify and use information that was necessary to solve the item. To elicit these processes, the students would have to be mentally engaged with the item to even begin to think about how to solve it. Categories F and G indicated engagement because the student would have to be thoughtful about the way he or she decided to manipulate the information. The student would also have to re- present the information given in the question using one of these two processes. The observance of category Y would occur if the student understood one of the fundamental concepts in algebra and therefore would facilitate the student solving the item. Furthermore, the processing and manipulation of variables would occur only if the student was being thoughtful as he or she made the cognitive moves. I thought category S indicated engagement because it represented a thoughtful and deliberate action on the part of the student. And finally, category AA would occur when the student had a difficult time finding a solution for an item and therefore would have to change strategies or continue to ineffectively fumble with information. Neither process would occur if the 56 student were disengaged from the situation — in this case, the item. The other categories indicated some thoughtfulness as the student solved an item (e. g., H, K, P, Q, and R), a description of the item itself (V and W), or a description of the events as the student solved the item (e.g., J, N, O, P, and X). None of these categories represented depth of engagement when compared to the nine selected categories. 
Each code is listed with its cognitive process category and the reason for its selection.

A. Overall pattern recognition (indicates an overall grasp of the item)
U. Partial pattern recognition (indicates a grasp of the item)
E. All information used from question (indicates thoroughness in organizing information)
D. Partial information used from question (indicates concern for organizing information)
F. Visual representation of information, e.g., makes a chart or table (indicates the importance of transforming information into a manageable framework)
Y. Relationship between numbers recognized (indicates some concern for organizing information and the recognition of variables)
G. Applicable equation used (indicates a connection between the problem and prior knowledge)
S. Student checks work (indicates thoroughness in overall approach)
AA. Grapples with information to try to solve question (indicates a tendency to consider multiple data sources or possibilities)

Figure 3. Depth of Cognitive Engagement Categories

The final step in the analysis was to use the nine categories to report a "depth of cognitive process index." However, after further inspection of the data, it became evident that additional analyses would be helpful. Thus, in addition to the depth of process index, one additional index common to all items (overall pattern recognition) and two to three family-specific indices were reported, because it became evident that not every category had equal relevance to every family. The item family was the focus of the analysis. To examine the trends across and within the item formats, the four items within an item family were grouped in a table. The depth of cognitive engagement global index was the arithmetic mean of the proportions of the nine categories. The global index for each item was obtained by summing the frequencies of the nine categories (see Table 7) and dividing by the number of students who responded to the item times nine (the number of categories). The denominator varied because of the different numbers of students (between 13 and 17) who responded to an item. For the general and family-specific indices, the frequency as well as the relative frequency (proportion) of responses was reported. The proportion for each item was obtained by dividing the frequency of the category (see Table 7) by the number of students who responded to the item; the value in the denominator varied because different numbers of students responded to each item. The mean for each item was reported to provide an index of difficulty.

Post hoc Evaluation of Steps 2 and 3
A small panel of mathematics educators and 8th grade teachers was convened to review the category system and the process used to develop it. Specifically, the panel was asked to listen to and code two students' think-aloud interviews using the 28 cognitive processes. Almost simultaneously, I replicated the think-aloud procedure with one of the 8th grade mathematics teachers from the suburban school. These activities
During the session, the panelists were shown copies of the original protocol guides, the student test booklets, and the list of 28 cognitive process categories. During the session, the panel had time to review the training materials and reflect on the activities presented to them. However, most of active review work occurred outside of the training session. The four panel members were divided into two review groups consisting of one teacher and one mathematics educator. Each of the two subgroups were given one think- aloud tape, a transcription of the think-aloud interview, and the associated test booklet. The reviewers were instructed to review the transcript and identify the cognitive processes expressed by the student. They were given the audiotape to supplement their understanding of the transcription. The reviewers were told to attend to whether the list of 28 categories captured the student’s processes and whether they heard any processes that were missing from the list. The reviewers listened to their assigned tape individually. During the same time frame as the reviewers were conducting these activities, I duplicated the think-aloud session with an 8’h grade teacher. Working within the teacher’s schedule, I arranged for a one hour meeting to conduct the think-aloud. I 59 instructed the teacher to first solve each item using a strategy that she thought the students would probably use and then to look for alternative strategies that the students may have missed but that she saw because of her content knowledge. The latter instruction was used as a vehicle to generate alternative cognitive processes, especially if the alternative strategy required the use of different cognitive processes. The teacher’s think-aloud was audiotaped and subsequently analyzed using the same procedures I used to analyze the students’ think-alouds. The results of this and the other four primary data analysis steps are presented in the next chapter. Summary of Design and Procedures The study was designed to determine whether students use different cognitive processes when responding to multiple-choice and constructed-response items. To examine the research question, thirty-four 8Lh grade students from two schools in two different school districts were selected to participate in think-aloud interviews. I selected the students using a version of systematic sampling with a random start. To build the master list of participant candidates, parent permission was obtained from over 400 students. Teachers, parents, and students knew that not every student on the list would be selected to participate in the study. Three instruments were used to collect the data: the test booklet, the protocol guide, and a short demographic survey. Each student was administered the three instruments during their think-aloud interview. The main instrument was the test booklet because it contained the items that were used to elicit the students’ cognitive processes. The items that appeared in the test booklet were written specifically for this study. The goal of writing items was to develop multiple-choice and constructed-response algebra 60 items that were as comparable in content as possible. It was important to hold content constant across formats so that any differences in cognitive processes that might be observed could be attributed to format. A total of 33 items were used for this study; 16 unique NAEP items and one Balanced Assessment item, which appeared on both test booklets. Seventeen items appeared on each test booklet. 
The test booklets were randomly assigned to the students. Think-alouds were used to collect the data. I first used the procedure in a pilot study to iron out some of the details (e.g., the number of items 8th graders could comfortably answer in an hour) and to practice using the technique. For this study I applied what I learned from the pilot study and from the published think-aloud research (cf. Ericsson and Simon, 1993). Specifically, I employed the concurrent think-aloud procedure, standardized the think-alouds by using protocol guides, wrote prompts for each item that were asked of every student, and trained and monitored the interviewers who helped me collect the data. I also audiotaped every interview to facilitate analyzing the data, rather than relying on the interviewers' notes.

I identified six main steps to analyze the think-aloud data. The steps consisted of both quantitative and qualitative analysis approaches. The quantitative tools consisted of item-level statistics to gauge an item's level of difficulty, frequency counts of the category codes, and arithmetic means and proportions of a subset of categories for each item family. The qualitative tools consisted of identifying the categories that represented the cognitive processes, narrative accounts of the students' use of the categories, and the post hoc reviewers' findings.

To review the data collection and analysis procedures, I convened four experts in the field of mathematics: two mathematics educators and two 8th grade mathematics teachers. I asked them to listen to a think-aloud interview, review the cognitive processes to determine whether they adequately represented the cognitive processes they heard during the interview, and identify whether any cognitive processes were missing from the list. The last part of the post hoc review consisted of duplicating the think-aloud and analysis procedures used with the students by interviewing one of the teachers and analyzing her think-aloud verbalizations. The main purposes of these activities were to (a) validate the category system and the coding process, (b) learn whether the teacher used similar cognitive processes as the students, and (c) determine whether additional cognitive processes emerged when an expert solved the items.

CHAPTER IV
Results

Descriptive Statistics
Form A. Descriptive statistics were calculated for the 17 students who responded to Form A during the think-aloud interview. Whether or not the students answered the items correctly was not the primary intention of the interviews, but the information was useful to review because it provided a general idea about student performance. The item means are reported in Table 5. The means were consistently higher for the multiple-choice items than for the constructed-response items, except for one item-pair. And, in two item-pairs the means for the two item formats were the same. Out of 24 possible points, the mean test score was 18.69 with a standard deviation of 4.21.

Table 5. Item Statistics - Form A

  Item/Format   Mean   Std. Dev.     Item/Format   Mean   Std. Dev.
  1/mc          1.00    .00          9/cr          1.00    .00
  2/cr           .62    .51          10/mc          .70    .48
  3/cr           .85    .30          11/mc          .92    .19
  4/mc           .85    .38          12/cr          .69    .48
  5/cr           .62    .51          13/mc          .85    .38
  6/mc           .92    .28          14/cr          .92    .28
  7/mc           .92    .28          15/cr         1.00    .00
  8/cr           .58    .40          16/mc          .67    .40
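The item statistics in Tables 5 and 6 are classical difficulty indices (item means) and standard deviations, computed after recoding non-responses to zero as described in Chapter III. A minimal sketch of that computation is shown below with hypothetical scores; whether the study used the population or the sample standard deviation is not stated, so the population form is assumed here.

    from statistics import mean, pstdev

    # Hypothetical 0/1 scores: one row per student, one column per item (None = no response).
    scores = [
        [1, 0, 1, None, 1],
        [1, 1, 0, 1,    1],
        [0, 1, 1, 1,    None],
    ]
    formats = ["mc", "cr", "cr", "mc", "cr"]  # format of each item (illustrative labels)

    # Non-responses were treated as wrong answers, so recode them to zero first.
    recoded = [[0 if x is None else x for x in row] for row in scores]

    item_means = [mean(col) for col in zip(*recoded)]
    item_sds = [pstdev(col) for col in zip(*recoded)]   # population SD assumed

    mc_mean = mean(m for m, f in zip(item_means, formats) if f == "mc")
    cr_mean = mean(m for m, f in zip(item_means, formats) if f == "cr")
    print(item_means, round(mc_mean, 2), round(cr_mean, 2))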
Form B. The same statistics calculated for Form A were calculated for Form B (Table 6). The item means were larger for the multiple-choice items except for one item-pair, where the mean for the constructed-response item was larger. One set of item-pairs had identical means. The mean test score on the 24-point test was 16.61 with a standard deviation of 4.23.

Table 6. Item Statistics - Form B

  Item/Format   Mean   Std. Dev.     Item/Format   Mean   Std. Dev.
  1/mc           .70    .48          13/cr          .54    .52
  2/cr           .62    .51          14/mc          .92    .28
  3/cr           .92    .28          15/mc          .92    .28
  4/mc           .54    .42          16/cr          .49    .33
  5/cr           .92    .28          9/cr          1.00    .00
  6/mc           .85    .38          10/cr          .69    .48
  7/mc           .77    .39          11/cr          .81    .38
  8/cr           .85    .38          12/mc          .92    .28

Summary of Descriptive Statistics
Consistent findings were seen across both Forms A and B. On each form, the students generally found the multiple-choice items easier than the constructed-response items, with a few exceptions. The overall means for the multiple-choice items were greater than the means for the constructed-response items. However, the mean score for the multiple-choice items on Form B was slightly larger than the mean score for the multiple-choice items on Form A, .84 and .81 respectively. The converse finding was seen for the constructed-response items on Forms A and B, .69 and .66 respectively.

Receiving higher mean scores on the multiple-choice items was not surprising. First, students could check their answers against the options; if their answer was not listed as an option, they knew that they must have made a mistake in their solution strategy. The same checking procedure was not available with constructed-response items. Second, multiple-choice items facilitated educated guesses because students could eliminate options that they knew, with some level of certainty, were not the correct answers. The constructed-response items did not lend themselves to this type of guessing. Last, students could solve some multiple-choice items using the options. That is, they could use each of the options to inform and guide their solution strategy, because they knew one option had to be the correct answer. The interviewers saw all of these possible explanations in use during the think-alouds.

Cognitive Process Categories
Only one primary result was expected from the second and third steps of the analysis plan, namely, the discovery of the categories that exemplified the cognitive processes and the full-scale analysis of the protocol data using those categories. After using data analysis strategies from the grounded theory tradition, 28 categories emerged from the verbal data (see Figure 1). Detailed information about the development of the categories can be found in Chapter III, Identify and Validate Cognitive Processes. The full-scale analysis also is described in Chapter III. Grounded theory blurs the line between methodology and analysis, which are often discernible in experimental design traditions. When analyzing the verbal data, it became evident that the line between analysis and results also became indistinguishable. The constant comparative nature of grounded theory forced the blurring of the methodology, analysis, and results components of the research. Because the results of the analysis emerged during the analysis itself, I avoided re-reporting the results here and refer the reader to Chapter III for details.

Cognitive Process Comparisons
The means reported above provided an index of performance on each item, across and within item-pairs. However, that index was not useful for examining cognitive processes. To get a general idea about the types of cognitive processes used by the test takers, I examined the frequency distributions of the cognitive process categories for each item in the eight item-pairs, per form.
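The frequency distributions referred to here are simple counts of how often each category code was assigned to each item, kept separate for the multiple-choice and constructed-response members of an item-pair. A sketch of such a tabulation, using hypothetical coded records rather than the study's data, is:

    from collections import defaultdict

    # Hypothetical coded records: (item number, item format, category code), one per occurrence.
    records = [
        (1, "mc", "A"), (1, "mc", "E"), (1, "mc", "E"),
        (2, "cr", "A"), (2, "cr", "E"), (2, "cr", "F"),
    ]

    # Frequency of each category for each item, by format, for within-pair comparison.
    freq = defaultdict(lambda: defaultdict(int))
    for item, fmt, code in records:
        freq[(item, fmt)][code] += 1

    # Compare the distributions for the two members of one item-pair.
    mc_member, cr_member = freq[(1, "mc")], freq[(2, "cr")]
    print(dict(mc_member), dict(cr_member))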
As it turned out, not all of the 28 categories were used often. To get the most information from the descriptive analysis, I reduced the number of categories by excluding the ones that did not offer much information (i.e., the categories used infrequently). I first reviewed the original list of 28 categories. I then selected the categories that best referenced a significant cognitive move or that were needed for the students to have an understanding of algebraic patterns. This activity resulted in nine categories: A, D, E, F, G, S, U, Y, and AA (refer to Figure 1 for details). The categories also had to appear with some frequency to be analytically useful, which they did. I then determined the frequency at which each category appeared for all 33 items. The frequencies in Table 7 were grouped by the eight item-families. The grouping facilitated within-format and across-format comparisons for each item family.

Table 7. Frequency Distribution of Abridged Category List
(Frequencies of categories A, D, E, F, G, S, U, Y, and AA for each item, grouped by item family: Family 1, Arrows & U-shaped; Family 2, Tacks; Family 3, Extend Pattern of Numbers; Family 4, Vertex; Family 5, Columns; Family 6, Puppy's Weight; Family 7, Pattern of Letters; Family 8, Dots & Stars; and the Diagonal Rectangle constructed-response item. The individual frequency entries are not legible in this copy of the document.)

The results indicated that categories A ("overall pattern recognition") and E ("all information used from the question") appeared most often for all 33 items. Category Y appeared in the "polygon," "column," and "puppy weight" questions. These three item families present numbers in a columnar format. To solve the items, the students had to recognize the pattern of the independent and dependent variables (that is, the pattern within each column of numbers). For example, several students solved the multiple-choice "column" item by noticing that the "B's are going up 4 and A's are going up 2." This item required students to solve the pattern represented by each variable so that they could continue the pattern despite the missing cells in the A and B columns. But to solve the item, the students had to add on 4 three times in the B column. That is, they did not have to know how many numbers were missing in column A to know how many steps to extend the pattern in column B. In fact, for all three of the item-families, the students only had to recognize the pattern of the dependent variable to complete the pattern.

A similar type of approach was used to answer the "polygon" items. Here, the pattern extended several steps beyond the initial part of the pattern presented to the students in the item stem. The students used one of two approaches to solve the polygon items. Both solutions required students to understand that the pattern ended at 20. That is, they had to use the information given about the polygon and the number of triangles or diagonals to know where to stop extending the pattern. The students first recognized that the polygon variable consistently increased by one. They then saw that the triangle/diagonal variable also consistently increased by one. The solution path varied at this point. The students either continued to extend the variables by 1's or subtracted 2 from the polygon variable to get the number of diagonals represented in the other variable. The following student quote exemplifies the latter solution strategy: "the pattern is going down by 3's because the diagonals are 3 less than the sides." Or the students continued the pattern by listing the polygon numbers to 20, and then they continued to list the diagonal/triangle pattern until that pattern ended at the same place as the polygon pattern. The students had to generalize the pattern beyond one step to successfully answer the polygon and column items.
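For reference, and on the assumption that these items use the standard one-vertex construction (the item text itself is not reproduced here), the two shortcut relationships the students alluded to can be written out:

    For an n-sided polygon, the diagonals drawn from a single vertex number
        d(n) = n - 3,
    and those diagonals divide the polygon into
        t(n) = n - 2
    triangles.

At n = 20 this gives 17 diagonals and 18 triangles, so a student who subtracts a constant from the polygon variable reaches the end of the pattern without extending the table row by row.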
Next, a more detailed analysis was conducted to examine the trends in the cognitive processes. In addition to applying the scheme across all items, I selected the categories that best represented the cognitive moves made by the students within an item family. The categories differed by family because the trends indicated that certain categories represented what might be called family-appropriate processes. The indices were labeled "family specific index" to indicate that the index should be interpreted in relation to the particular items within a family. To this point, most of the previous analyses were reported at the item-pair level within a form. For this analysis the indices were reported by item format within the eight item families, to facilitate comparing the two item formats across the four items in an item family.

In addition to the family-specific indices, two additional indices were reported for every item-family: (a) a global cognitive process index that represented the mean proportion of the nine categories, and (b) a common index that represented the most frequently used cognitive category seen across all items. The items' means were reported to serve as another comparison index. The aforementioned indices are illustrated in Tables 8 through 15. A sample interpretation of the indices is provided for the first item-family.

Table 8. Family 1: Arrows and U-shapes
(Mean = item mean. Global index: overall cognitive process. Common index: overall pattern recognition, N (p). Family specific indices: visual representation and used relevant information in item*, each N (p).)

Family     Format   Mean   Cognitive   Pattern       Visual       Relevant
                           process     recognition   represent.   information
Arrows     mc       1.00   .20         11 (.85)      2 (.14)      10 (.71)
U-shapes   mc       1.00   .18         15 (.88)      0            12 (.76)
MC mean             1.00   .19         26 (.87)      2 (.06)      22 (.74)
Arrows     cr       1.00   .20         13 (.77)      0            15 (.88)
U-shapes   cr        .92   .27         10 (.77)      1 (.08)      11 (.85)
CR mean              .96   .24         23 (.77)      1 (.08)      26 (.87)

* For all item-families, categories D and E were added together.

The items associated with Family 1 were the easiest of all 33 items. A high value for the mean index indicated that the item was easy (several students correctly answered the item). As seen, the items' means were very high, and the items therefore very easy. The overall cognitive process global index was slightly larger for the items presented in the constructed-response format, although the mean index was very similar. The U-shaped constructed-response item captured more of the overall cognitive process than the other three items. This means that the students who responded to that item used more of the nine categories than did students responding to the other items in the family. Very few students used a visual representation, as indicated by the very low value of the family specific index. And a similar number of students across item formats recognized the pattern illustrated in the items. That is, the high index values (.77-.88) indicated that most students used this cognitive process when solving the items. In general, the indices did not indicate variations in cognitive processes across the four items.
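To make the index definitions above concrete, the sketch below shows one way the global, common, and family-specific indices could be computed from coded protocol data. It is a minimal illustration under assumed data structures (a set of category codes per responding student for each item); it is not the actual analysis code used in this study, and the example data are hypothetical.

    # Illustrative computation of the indices reported in Tables 8 through 15,
    # assuming coded protocol data: one set of category codes per responding student.
    ABRIDGED = ["A", "D", "E", "F", "G", "S", "U", "Y", "AA"]

    def indices(item_codes, focal_category):
        """item_codes: list of sets of category codes, one per responding student."""
        n = len(item_codes)
        # Global index: mean proportion of the nine abridged categories used per student.
        global_index = sum(len(c & set(ABRIDGED)) for c in item_codes) / (n * len(ABRIDGED))
        # Common index: students showing overall pattern recognition (category A).
        common = sum("A" in c for c in item_codes)
        # Family specific index: students showing the category chosen for this family.
        specific = sum(focal_category in c for c in item_codes)
        return global_index, (common, common / n), (specific, specific / n)

    protocols = [{"A", "E"}, {"A", "F"}, {"E", "U"}, {"A", "E", "F"}]  # four students
    print(indices(protocols, focal_category="F"))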
The protocol data showed that students most often saw that the figures rotated to the right or to the left, depending on the item. They then continued the rotation pattern to fill in the missing figure. They often used the words "continuous pattern" during the interview.

The next family was referred to as "tacks." An illustration of overlapping pictures, with either tacks on the top of them or tacks on the top and bottom of the pictures, accompanied the item. The students were asked how many tacks it would take to hang 29 pictures. These items were more involved in terms of the cognitive load the students had to maintain because they not only had to find the pattern but then had to extend it several steps (25 steps) beyond the illustration.

Table 9. Family 2: Tacks - Top only / Top-bottom
(Mean = item mean. Global index: overall cognitive process. General index: overall pattern recognition, N (p). Family specific indices: visual representation, applicable equation, and used relevant information, each N (p).)

Family       Format   Mean   Cognitive   Pattern       Visual       Equation   Relevant
                             process     recognition   represent.              information
Top          mc        .70   .20         7 (.41)       4 (.24)      3 (.17)    14 (.82)
Top-bottom   mc        .92   .25         6 (.46)       2 (.15)      5 (.38)    11 (.85)
MC mean                .81   .22         13 (.43)      6 (.20)      8 (.27)    25 (.83)
Top          cr        .62   .21         4 (.31)       4 (.31)      0          8 (.62)
Top-bottom   cr        .62   .19         6 (.35)       4 (.24)      3 (.17)    13 (.77)
CR mean                .62   .20         10 (.33)      8 (.28)      3 (.10)    21 (.71)

As with the first family, these data were more remarkable for similarities than for differences in the global index; however, there was some differentiation among the family specific indices for the four items. The multiple-choice items appeared to elicit better use of relevant information and appropriate equations, as indicated by the larger index values compared to the constructed-response items. The constructed-response versions prompted students to construct visual representations a little more often than the multiple-choice variations. However, the qualitative analysis showed that the visual representations did not help the students.

The items in Family 3 could be described as completing a pattern by filling in the last two numbers in a series of numbers. As part of the answer, the students had to provide the rule they used to solve the pattern. Their responses exhibited almost identical indices for both item difficulty and overall cognitive processing. The constructed-response variations led students to rely more often on visual representations (i.e., writing out how the numbers increased and decreased across the pattern), and the multiple-choice versions apparently led students to use information provided in the items to solve them. Perhaps a reliance on visuals dampened the need to attend to all of the information offered in the item.

Examination of the protocol data showed a slight difference in the way the students solved the items, although the cognitive process appeared to be similar, as exemplified in the general index. Two patterns were inherent in these items. The students found the alternate rule (every other number increased by 3) equally often in both item format presentations, even though the multiple-choice options did not reflect the alternate pattern.[1] Essentially, the qualitative analysis supported the general index because the same cognitive process was used to solve the items even though students were able to solve them using one of two valid solution paths.

[1] The original item was a 2-point constructed-response item. The counterpart multiple-choice versions were written to reflect the 2-point characteristic of the original item (according to the scoring rubric). To accomplish this, the part of the item that addressed the rule mirrored the answer on the rubric. The disconnection between the multiple-choice options and the alternate pattern perplexed some students to the point of not being able to see the rule in the correct answer choice. This happened rarely, as indicated by both the protocol data and the percentage-correct index.
Table 10. Family 3: Extend Pattern of Numbers
(Mean = item mean. Global index: overall cognitive process. General index: overall pattern recognition, N (p). Family specific indices: visual representation and used relevant information, each N (p).)

Family   Format   Mean   Cognitive   Pattern       Visual       Relevant
                         process     recognition   represent.   information
Ptrn 1   mc        .77   .23         10 (.71)      2 (.14)      11 (.39)
Ptrn 2   mc        .92   .20         13 (.81)      1 (.06)      12 (.38)
MC mean            .85   .21         23 (.76)      3 (.10)      23 (.38)
Ptrn 1   cr        .85   .20         11 (.65)      2 (.12)      12 (.35)
Ptrn 2   cr        .81   .19         10 (.77)      3 (.23)      6 (.23)
CR mean            .83   .20         21 (.70)      5 (.17)      18 (.32)

The results of the analysis for the Family 4 items (Vertex Diagonal and Vertex Triangle) mirrored the results of Family 3. All of the cognitive process indices were remarkably similar, each differing by only a few percentage points.

Table 11. Family 4: Vertex-Diagonal, Vertex-Triangle
(Mean = item mean. Global index: overall cognitive process. General index: overall pattern recognition, N (p). Family specific indices: visual representation, relationship between numbers recognized, and used relevant information, each N (p).)

Family     Format   Mean   Cognitive   Pattern       Visual       Relationship   Relevant
                           process     recognition   represent.   recognized     information
Diagonal   mc        .92   .11         7 (.61)       4 (.19)      3 (.21)        3 (.10)
Triangle   mc        .85   .29         12 (.71)      3 (.18)      11 (.65)       10 (.26)
MC mean              .89   .21         19 (.67)      7 (.19)      14 (.56)       13 (.19)
Diagonal   cr        .69   .12         10 (.70)      2 (.13)      6 (.55)        3 (.09)
Triangle   cr        .85   .29         9 (.69)       3 (.23)      8 (.62)        5 (.12)
CR mean              .77   .20         19 (.69)      5 (.17)      14 (.59)       8 (.11)

The mean index associated with Family 5 suggested that the constructed-response format depressed performance (.58 versus .78), which may have signaled significantly different cognitive processes. But comparison of the other indices indicated relatively similar patterns of cognitive processes at the global and family specific levels. Slight differences were observed between the visual representation and the overall pattern recognition indices. This suggested that more students used a visual approach (students wrote the pattern they observed in each column) to solve the constructed-response items than the multiple-choice items. The students may have used the visual approach because they were unable to recognize the overall pattern, or because the visual approach negated the need to recognize the overall pattern.

The results for Families 6 (puppy's weight) and 7 (pattern of letters) were notably flat in many of the indices, especially the global and general indices, reflecting the consistent pattern of results found for earlier families, most notably 1, 2, and 3. Slight differences were observed in the visual representation index again.
Table 12. Family 5: Columns
(Mean = item mean. Global index: overall cognitive process. General index: overall pattern recognition, N (p). Family specific indices: visual representation, relationship between numbers recognized, and used relevant information, each N (p).)

Family        Format   Mean   Cognitive   Pattern       Visual       Relationship   Relevant
                              process     recognition   represent.   recognized     information
Columns, 13   mc        .85   .24         8 (.50)       2 (.13)      7 (.44)        11 (.34)
Columns, 14   mc        .70   .27         8 (.64)       3 (.23)      4 (.29)        11 (.39)
MC mean                 .78   .26         16 (.57)      5 (.19)      11 (.37)       22 (.37)
Columns, 13   cr        .54   .25         7 (.50)       3 (.21)      5 (.36)        9 (.32)
Columns, 14   cr        .62   .26         9 (.53)       4 (.24)      5 (.29)        14 (.41)
CR mean                 .58   .26         16 (.52)      7 (.23)      10 (.33)       23 (.40)

Table 13. Family 6: Puppy's Weight
(Same column layout as Table 12.)

Family      Format   Mean   Cognitive   Pattern       Visual       Relationship   Relevant
                            process     recognition   represent.   recognized     information
Weight 21   mc        .92   .23         12 (.71)      2 (.12)      6 (.35)        13 (.38)
Weight 24   mc        .92   .21         8 (.57)       4 (.29)      2 (.14)        11 (.39)
MC mean               .92   .22         20 (.64)      6 (.20)      8 (.26)        24 (.38)
Weight 21   cr        .62   .14         7 (.55)       3 (.15)      3 (.15)        8 (.43)
Weight 24   cr        .92   .22         12 (.71)      2 (.12)      6 (.35)        13 (.38)
CR mean               .77   .19         19 (.64)      5 (.14)      9 (.26)        21 (.42)

Table 14. Family 7: Pattern of Letters
(Mean = item mean. Global index: overall cognitive process. General index: overall pattern recognition, N (p). Family specific indices: visual representation and used relevant information, each N (p).)

Family    Format   Mean    Cognitive   Pattern       Visual       Relevant
                           process     recognition   represent.   information
As & Bs   mc        .92    .20         12 (.71)      2 (.12)      11 (.32)
Cs & Ds   mc        .92    .20         11 (.79)      3 (.21)      10 (.36)
MC mean             .92    .20         23 (.75)      5 (.16)      21 (.34)
As & Bs   cr        .92    .20         8 (.62)       3 (.23)      10 (.38)
Cs & Ds   cr       1.00    .23         14 (.82)      3 (.18)      14 (.41)
CR mean             .96    .22         22 (.73)      6 (.20)      24 (.40)

A different sequence of patterns became apparent in Family 8. First, the mean index stands apart from the previous indices because the four items in this family were much more difficult than the other items, but the two formats within the family continued to be comparable. Second, the constructed-response version, despite its similar mean difficulty, prompted a somewhat different global cognitive process (.21 versus .15). The difference appeared to come primarily from the "used relevant information" index, which represented a difference in the number of students using relevant information provided in the item. The general index complemented the family specific index because students would find it difficult to recognize the overall pattern if they were unable to use the relevant information provided in the item, and vice versa.

This item family was the most challenging of the entire set of eight item families, from writing comparable items to the students' apparent difficulty in answering the items. Of the eight item families, it most closely represented a performance item, in that the items had multiple solution paths and required the students not only to show their mathematical work but also to offer a written explanation of their solution strategy. The items required either the facile use of an equation (which only seven students managed to use, and then not always correctly) or some careful logical reasoning paired with sequential problem representation.
Table 15. Family 8: Dots and Stars
(Mean = item mean. Global index: overall cognitive process. General index: overall pattern recognition, N (p). Family specific indices: visual representation, applicable equation, and used relevant information, each N (p).)

Family   Format   Mean   Cognitive   Pattern       Visual       Equation   Relevant
                         process     recognition   represent.              information
Dots     mc        .58   .23         6 (.46)       2 (.15)      4 (.31)    9 (.35)
Stars    mc        .54   .08         7 (.48)       6 (.45)      0 (0)      6 (.28)
MC mean            .56   .15         13 (.48)      8 (.31)      4 (.13)    14 (.33)
Dots     cr        .49   .19         4 (.27)       4 (.27)      1 (.07)    10 (.33)
Stars    cr        .67   .23         5 (.42)       2 (.17)      2 (.17)    9 (.38)
CR mean            .58   .21         9 (.34)       6 (.23)      3 (.11)    19 (.35)

The data from all eight families were summarized (see Table 16) for the global and general indices: overall cognitive process and overall pattern recognition, respectively. Again, the data were more remarkable for their similarities than their differences. Recognizing that some variation existed among item families, the overall trends, both within and between families, pointed to the conclusion that students used similar cognitive processes as they solved both item formats.

Table 16. Overall Patterns for Families 1 through 8

Format    Mean   Overall cognitive process   Overall pattern recognition
                 (global index)              (general index)
MC mean    .78   .21                         .60
CR mean    .75   .21                         .56

To examine the robustness of the conclusion, the data associated with the Balanced Assessment item were summarized (see Table 17) and subsequently compared to the previous results. This item required students to extend the given pattern, which was contextualized in a diagonal rectangle problem. There were five separate scaffolded pattern extensions.

Table 17. Diagonal Rectangle Item
(Mean = item mean. Global index: overall cognitive process. General index: overall pattern recognition, N (p). Family specific indices: visual representation, grapples with information, and used relevant information in item, each N (p).)

Family               Format   Mean   Cognitive   Pattern       Visual       Grapples with   Relevant
                                     process     recognition   represent.   information     information
Diagonal rectangle   cr        .23   .26         2 (.09)       14 (.61)     7 (.30)         18 (.39)

The first index that stood apart from the previous ones was the global index, but the finding was not overly striking when compared to the other 32 items. For instance, the global index was only slightly larger than the summarized index in Table 16. It was comparatively larger than the index for each individual item family, except for item Family 5 (columns). As indicated by the low index, very few students recognized the overall pattern, a finding that complements the mean index, despite their attempts at representing the pattern visually. The applicable equation was difficult for the students to figure out, so they apparently relied on visual approaches to find the pattern.

A new family specific index, grapples with information, emerged to examine trends; it was used so infrequently with the other items that it did not emerge as a family specific index elsewhere. As seen, seven students struggled with the item, manipulating the given information every possible way, resolute to successfully extend the pattern. The fact that 30% of the students who attempted to answer the item were observed using the strategy primarily with this item suggested that it may have elicited some different cognitive processes, even though, as the qualitative data suggested, most of them were perplexed by the item. A close look at the qualitative records for all the students who attempted this item revealed that the majority of them were highly engaged cognitively.
One possibility, supported in the analysis, was that they could not rely on "easier" methods such as visually representing the diagonal rectangle and/or counting the dots (a cognitive process often used in the stars-dots items). Most of the students gave an answer. They did not guess blindly or give up entirely. This observation was significant because this item came at the end of the think-aloud protocol, a point at which students were more likely to be fatigued. Further, it seemed that many of the students had never encountered a problem like this one; they could easily have become frustrated with the item and resigned themselves to a failed attempt. The protocol evidence supported the assertion that, for the performance item, the students used cognitive processes different from those used for the multiple-choice and constructed-response items.

Two trends emerged from the interviews with students completing the Balanced Assessment item. First, the number of students using formulas, even if they were wildly incorrect, increased significantly from part A to part D. In part A, only one student used any formula, but eight students were using formulas by part D. Though unsuccessful, students were actively trying to solve the problem by looking for a pattern and creating an equation that accurately represented it. Most of the students used the area equation to solve the problem, but other students tried to look for equations from the patterns that they saw in the earlier parts of the item. Charlie's response illustrates this: "I am trying to find a pattern . . . in a 4x5 rectangle you make each number go up by 1, and it was 18 higher in the 5x6, so then the 10x11 would be . . . well I could do this: 10 (in the 10x11) - 5 (in the 5x6) = 5, so then the pattern is going up by 5. So 18x5=90, and 90 + 50 (number of dots in 5x6) = 140." Many students used this approach to find a pattern and an equation.

Second, students attempted several different mathematical processes to solve the problem. In general, students drew upon the strategies they knew from previous experiences with solving mathematical items and tried many of them before deciding upon an approach. This was particularly true of students who had used visual representations of the rectangles to solve earlier problems but found that they could not continue to use this strategy. Allison's transcript illustrates this point. Allison drew the 5x4 and the 5x6 rectangles and counted the dots inside. On the 10x11 problem, she commented, "I need a way to get it so I don't have to count." She attempted to draw the 10x11 rectangle on the worksheet, but realized that it would not fit. She said she did not know how to do the item, but continued to struggle with various solution strategies to find a way to solve it. She looked at the 5x6 rectangle and said, "Well this 10 by 11 is double 5x6," and began to double the answer for the 5x6 to solve the problem. Then she realized that 6 is not double 11, and looked at the problem again. She then commented, "This is really hard," and began to look for patterns of numbers in the 4x5 and the 5x6 items to get the answer (i.e., "5x6 is 4x5 + 1x1"). She commented, "That's not working, so I need to try something else." After a few other attempts, she finally decided to calculate the area for her final answer (10x11=110). Students like Allison explicitly demonstrated the cognitive complexity of the item and the various cognitive processes they used in their attempt to solve the item.
They commented that the problem was difficult, they did not know what they were doing, and they thought their answers were wrong. Despite these challenges, they continued to work to solve the problem. I called this “playing with the numbers” or “grappling” (category AA) because students went in with a pretty wide lens of possibility solution strategies. They generally tried different mathematical computational strategies or looked for patterns inside of the given numbers. For example, Allison revisited the 5x4 and 5x6 rectangles to look for patterns within the numbers by ‘playing with them. This kind of “playing” was not observed to the same extent in the multiple—choice or constructed-response items. Generally, the students either knew the answers to the items or they solved the items quickly, without having to search their mathematical toolbox for multiple solution strategies or ways of reasoning or making sense of the information. Consequently, students were less likely to “play” or grapple with the information provided in the multiple—choice and constructed-response items, or to think meta-cognitively about what they were doing as they solved the items. Summary of Cognitive Process Comparisons The findings from the descriptive analyses and the two analyses that used the subset of cognitive process categories did not, in general, reveal significantly different cognitive processing between the multiple-choice and constructed-response item formats, within or across the eight item families. The global index, which represented the nine cognitive process categories that represented the significant cognitive moves on the part of the students, did not elicit substantial findings when looked within the item families 83 and in the aggregate. The most interesting findings were found with the family specific indices, most likely because the aggregate indices masked the cognitive processes differences, albeit small, that were elicited by the items. Many of the family specific indices were comparable between item formats within an item family but a few inconsistent findings were observed. Most notably, the students used visual representation more often to solve the constructed-response items (five of the eight item families plus the Balanced Assessment item) than the multiple- choice items. This was evident in the higher indices associated with the constructed- response items compared to the indices associated with the multiple-choice items within an item family. In some item families, especially for those prone to visual solution paths, the visual index interacted with the overall pattern recognition index. Apparently, students had to recognize the entire pattern to employ a visual solution strategy. But, it should be noted that although a visual solution strategy worked with the items in this study, it may not work as well with pattern items that are more complex and cognitively challenging. Although the protocol data suggested some differences in the cognitive processing used to answer the multiple-choice and constructed-response items, the differences were small in terms of the relative differences in proportions observed within an index in any one-item family. For example, the largest difference in proportions in the visual representation index was .08, which represented two additional students using the cognitive process (see item families 2 and 8). The most remarkable difference was seen in the Balanced Assessment performance item. 
A unique cognitive process (category AA), not seen in the other items, was elicited when the students solved the performance item. The item challenged the students to grapple with the information in a way that was not seen in the other 32 items. The items that came closest to eliciting a similar cognitive struggle were the stars-dots items. But that particular cognitive process was used less frequently than the other eight categories, so category AA was not reported as a family specific index (see Table 7 for the frequency distribution of the nine categories).

External Post hoc Evaluation

The purposes of the expert panel review were to evaluate whether I had accurately and sufficiently identified the cognitive processes expressed by the students and to evaluate whether the analysis procedures were appropriately conducted. The review occurred after the student think-alouds and data analyses were completed. Nonetheless, evaluating the cognitive processes that emerged, and the data analysis procedures from which they emerged, even in a post hoc review, would strengthen the results of the study. The results of the evaluation were generally positive. The next section details the germane findings from the evaluation, both the post hoc review and the think-aloud interview with the 8th grade teacher.

Results of the Evaluation

After the panel of four mathematics experts met for a one-day information sharing and training session, they individually evaluated the validity of the 28 cognitive process categories and the coding system by listening to two think-aloud audiotapes. I divided the panel into two groups, and each group listened to the same think-aloud interview. One 8th grade teacher and one mathematics educator were assigned to each group. The panelists reconvened after their review, and each member had the opportunity to share her impressions. The panelists' comments fell within three general categories. Specific examples follow each general comment. As seen, their suggestions would clarify the meaning of the original cognitive process categories.

Change the wording of some categories to better convey their meaning.
• Reword Category G: Uses applicable formula (where formula means a symbolic representation, e.g., length x width), rather than generating a heuristic.
• Reword Category F: Re-presentation of information in question, e.g., tables, lists.
• Reword Category R: Work not shown.
• Reword Category Y: Relationship between independent (single) and dependent (multiple) variables recognized.
• Reword Category Z: Relationship between independent (single) and dependent (multiple) variables not recognized.
• Reword Category AA: Persists, even when uncertain of solution path.
• Reword Category M: Gets right answer for wrong reason; uses some irrelevant information to get the answer right (e.g., selected longest answer).
• Reword Category X: Mis-states math terms; mismatch between verbal explanation and actual use of math terms.

Create new categories to account for some unidentified cognitive processes.
• Uses existing knowledge from previous experiences (e.g., classroom instruction).
• Monitors understanding of item.
• Uses real-world knowledge or practical reasoning to solve item.
• Applies appropriate arithmetic operations (+, -, x, ÷).
• Applies inappropriate arithmetic operations (+, -, x, ÷).
• Uses deep cognitive processing or engagement: student shifts strategy; has a repertoire of tools available to solve pattern.

Retain all original 28 categories (but reword as indicated above).
After the panelists shared their evaluation of the cognitive process categories, they were asked to comment on their experience of coding the students’ think-alouds. (Time constraints prevented one of the reviewers from completing this exercise.) The purpose of the activity was to expose potential problems with the original coding procedure. I was not interested in assessing inter-rater reliability. The panel’s general observations follow. 1. There seemed to be an inverse relationship between students mathematical ability and their depth of engagement when solving the items. 2. The students’ low reading level seemed to prevent them from understanding the items. The panelists were uncertain about whether the students misunderstood the meaning of certain words, whether they accidentally misread certain words, or whether they could not read an item at all. 3. There were qualitative and quantitative differences between what students verbally reported and what they wrote as answers to the constructed-response items. 4. The coding procedures were appropriate and easy to apply. 87 5. The audiotapes were essential to code the cognitive processes; that is, the transcripts and test booklets alone were insufficient information to accurately code the processes. As seen, most of their comments represented observations about the students’ behaviors rather than problems uncovered with the coding procedures. In fact, the last two observations indicate that the panelists found the procedures sound. Because the panel listened to two of the 34 student think-aloud interviews, I assessed the panel’s observations by discussing them with the other think-aloud interviewers —I would have been remiss to assume that their observations, from two tapes, reflected all of the qualitative data collected for this study. We concluded that the first point noted above was not universally observed. For example, some of the students enrolled in an algebra course and/or who had strong mathematical ability found the diagonal rectangle item difficult, yet they were deeply engaged when solving the item. Their engagement was evident in their persistence and struggle to solve the item. Conversely, some of the students enrolled in non-algebra classes and/or who appeared to find most of the items challenging solved some items with little engagement. Scant evidence existed to confidently refute or support the second point, except for the last notion, which questioned the students’ literacy. The interviewers asked all of the students to read every item aloud. None of the interviewers encountered illiterate students. The think-aloud interviewers also observed the third point listed above, even though it was not formally documented during the think-aloud interviews. I did not 88 disregard this observation when I encountered it during the think-alouds because of lack of interest but rather because it was not the focus of the analysis. As mentioned early, because of limited resources, the panel only reviewed two tapes when they evaluated the categories and the coding system. Because of this limitation, one must consider the conclusions from the post hoc review with some caution. For instance, the two students did not use all 28 cognitive processes when solving the items; therefore, the panelists were unable to conduct a full review of the categories based on verbal evidence. However, in the absence of complete verbal data, the panelists relied on their knowledge of 8th grade students and on 8th grade mathematics curriculum and instruction. 
Their experiences allowed them to evaluate the categories for which no direct think-aloud evidence existed on the two tapes available to them. Based on the direct verbal evidence and on their experience, the panel ultimately concluded that the original categories seemed appropriate and represented the cognitive processes students used when solving the pattern items.

Results of the Teacher Think-aloud Interview

The think-aloud procedure used with the students was duplicated with an 8th grade mathematics teacher. She responded to all 17 items on Form A. The teacher was first asked to solve the items from an 8th grader's perspective and then to determine whether there were alternate solution strategies that may have been foreign to the students. Three results emerged from this activity and suggested that (a) the teacher used similar cognitive processes when solving most of the multiple-choice and constructed-response items, (b) the teacher and the students often used the same verbal explanations and cognitive moves when solving many of the items, and (c) the students saw alternative ways to solve a few of the items that the teacher did not. The information in Table 18 represents the cognitive moves made by the teacher for each of the 17 items. It served as the basis for the first and second conclusions. Evidence to support the third conclusion was presented in narrative form, using excerpts from the teacher's verbal protocol when appropriate.

Table 18. Teacher Response Distribution for Abridged Category List
(For each of the 17 Form A items, the table marked which of the abridged categories (A, D, E, F, G, S, U, Y, and AA) appeared in the teacher's think-aloud. The items, in the order listed, were: Family 1, Arrows (constructed response, item 9) and U-shaped (multiple choice, item 1); Family 2, Tacks top only (multiple choice, item 10) and Tacks top & bottom (constructed response, item 2); Family 3, Extend number pattern 1,6,4,9 (constructed response, item 3) and 4,3,7,6 (multiple choice, item 11); Family 4, Vertex diagonal (constructed response, item 12) and Vertex triangle (multiple choice, item 4); Family 5, Columns 13 (multiple choice, item 13) and Columns 14 (constructed response, item 5); Family 6, Puppy's weight 21 lbs (multiple choice, item 6) and 24 lbs (constructed response, item 14); Family 7, As & Bs (multiple choice, item 7) and Cs & Ds (constructed response, item 15); Family 8, Dots (multiple choice, item 16) and Stars (constructed response, item 8); and the Diagonal rectangle (constructed response, item 17). The individual category marks are not legible in this copy of the document.)

The first result was based on an examination of the multiple-choice and constructed-response items within a family. As seen in Table 18, the teacher used the same categories to solve the multiple-choice and constructed-response items across six of the eight families (excluding the diagonal rectangle item because it did not have a multiple-choice counterpart). For Families 5 and 8 she used additional cognitive processes to solve the constructed-response items compared to the multiple-choice items.
Specifically, in Family 5 she checked her work (category S) while solving the constructed-response item but she did not check her work while solving the multiple- choice item. Three additional cognitive processes emerged while the teacher solved the constructed-response item in Family 8, namely applicable equation (category G), checks work (category S), and grapples with information (category AA). I am not sure why she checked her work when solving the constructed-response item in Family 5 except to hypothesize that she did not have the response options available to confirm her own solution, as she did for the multiple-choice item within the family. The same reason could be offered for category S emerging for the constructed- response item in Family 8. The other two categories may have emerged because of the ordering of the items in the booklet and/or because of bias I may have introduced during the interview. This conjecture is based on the following facts. In Form A, the constructed- response, stars item appeared before the multiple-choice, dots item. Compared to the other items, the teacher struggled to find a solution (category AA) for the stars item as the item was more complex and was therefore more difficult. There were two general strategies for solving the item, either using a visual approach by noticing the relationship 92 between the number of stars in the columns and rows or by finding the algebraic equation that represented the pattern. The teacher relied solely on her knowledge that an algebraic equation (category G) was at the root of every pattern, and thereby applied this knowledge to the star item. She had difficulty finding the applicable equation but eventually solved the item. Interviewer bias likely occurred when we discussed the various ways that the students could have solved the item. During the conversation I shared with her that some students used a visual strategy and described it to her. The teacher commented that the visual approach was easier than an equation approach. When she encountered the dots item she used the visual approach and easily solved the item — hence the omission of categories G, S, and AA for the multiple-choice version of the item. If the conjecture is correct, then the difference observed between the stars and dots items was spurious. If it is wrong, then the teacher likely learned from her first encounter with the stars item and applied her learning to solving the dots item. The second result — the teacher and the students often used the same verbal explanations and cognitive moves when solving many of the items — was based on the data from the teacher’s and students’ (see Table 7) abridged category systems. I observed that the teacher and students often used the same cognitive moves when solving the 16 items. A few differences were identified in the data. The most obvious difference occurred in categories D (partial information used from question) and U (partial pattern recognition). For every item, at least one student, and often three or four students, received a D and for the majority of the items at least one student received a U. These two categories were not assigned to the teacher. Apparently, the teacher always 93 recognized the entire pattern (e. g., she never assumed that the pattern identified in the beginning of a series was the same pattern at the end of the series) and used all of the information available in the item to solve the item (e. 
g., she never jumped from the beginning of the item to immediately solving the item, or ignored information in the middle or at the end of the series of information in the pattern). The teacher’s cognitive moves also differed for categories F (visual representation) and G (applicable equation used). She often used the two strategies, moving back-and-forth between the two, whereas the students often relied on only one of the two cognitive processes when solving an item. The use of multiple cognitive moves within an item may be an indication of the teacher’s lmowledge that there are often several ways to solve an algebraic pattern item. I also observed that the teacher relied on formulas much more often than did the students. She first attempted to find a formula that fit the observed pattern, regardless if a visual approach would have been easier. For example, while solving the stars item, she first saw that each step had one more row and column compared to the previous step. She then noticed that the difference between steps 1 and 2 was 5 and steps 2 and 3 was 7, and thereby noticed that the pattern between steps was an increase of two. She verified her observations by applying this knowledge to the fourth step. Up to this point, she and most of the students used similar cognitive moves. However, her knowledge of mathematical rules and relationships (e. g., she knew a quadratic formula was needed to represent the pattern) led her down a path to find a formula that fit her observed pattern, whereas most of the students used some type of visual approach (e.g., listed the numbers in the pattern all the way to the 16th step). She maneuvered through several formulaic 94 strategies, changing strategies several times before completing the item. She perseverated on finding a formula to the point of her not seeing the easier visual approach. When I explained the “area” method (rows x columns) to her she responded “This is an easy way, wow. Students have an advantage if they see the item like this because it is easier to solve, rather then looking for a formula.” As mentioned earlier, some of the students used alternative solution strategies when solving some of the items that the teacher did not notice until I drew her attention to them. This happened for three item-pairs: columns, extend pattern of numbers, and dots & stars. The first two item-pairs had multiple patterns but the students used the same cognitive moves to solve the items. The last item-pair had multiple solution strategies, which required different cognitive moves. The columns items required students to complete a pattern by identifying the pattern in each of two columns (independent and dependent variables), continuing the pattern across a gap in the pattern, and then completing the pattern for the dependent variable. Most of the students and the teacher solved the items the same way; they applied the cognitive moves just listed. However, a few students noticed the relationship between the two numbers in columns A and B, rather than the difference between consecutive numbers within a column. After the teacher solved the item I asked her if there were alternative ways to solve it. She said no. I then shared with her that a few students discovered an alternative pattern. Upon hearing this she re-examined the item but still could not see the second pattern. I explained that there was a relationship between the numbers across the two columns; but she had a difficult time seeing the 95 alternate pattern. 
She did not see the alternate pattern until I pointed to the numbers and said it aloud. A similar conversation occurred between the teacher and I when she was solving the extend pattern of numbers item-pair. These items required students to fill-in the last two numbers of a series eight numbers that followed a certain pattern. Most of the students and the teacher solved the two items in the same way, that is, by recognizing the pattern between consecutive numbers. However, some students noticed that a difference of three existed between every other number. After the teacher completed the items, I told her that some students noticed a second pattern. The second pattern was not apparent to her until I said it aloud. She explained that some pattern items have more than one pattern within them, which makes working with pattern items difficult of students. The dots and stars items had multiple solution strategies inherent in them. The teacher immediately looked for a formula to solve the items whereas most of the students used some kind of visual, or re-presentation of information approach. Most of the students relied on listing a series of numbers from the 4m step to the 16th (or 20th step, depending on the item), using a “counting-on” approach to figure out the answer. A few students were able to see a very straightforward way to solve the item, which required the least number of cognitive moves compared to the other solution strategies. These students recognized that the number of dots (stars) in a row multiplied by the number of rows equaled the number of dots (stars) in the block of dots (stars). They also noted that the number of rows and the step number were the same number. They solved the item by using these two pieces of information. That is, they saw how the item grew 96 geometrically. This approach became known as the “area” method. The area method was not apparent to the teacher until I explained it aloud. She immediately recognized the simplicity of this approach and commented that students who can see things visually often have an easier time solving pattern items. I do not have a good explanation for the above results except that they may have something to do with the richness and depth of the teacher’s mathematical knowledge, which ultimately led her to find the underlying mathematical formula of the observed pattern. Conversely, the students limited exposure to patterns (in terms of their level in school and variations of patterns) may have hindered their understanding of what really composed a pattern and therefore resulted in them using visual or other familiar mathematical approaches (e. g., counting-on strategy) that they had learned up to the 8th grade. The teacher could not proffer additional insights except to explain that patterns are difficult items to work with because one really does not know the true pattern within an item. That is, a series of five numbers could be presented in an item with a pattern up plus one, minus two, plus one, minus two. This pattern may match the five presented numbers; however, the sixth and subsequent numbers, which are unknown to the respondent, could result in a second viable pattern for that series of numbers. An unexpected and unintended result occurred during the teacher’s think-aloud session. As mentioned earlier, after the teacher solved an item, I shared with her some of the alternate ways that the students solved the item. 
Because of the teacher’s classroom experience, she was able to see the item and solution strategy through the lens of a student. She took notes during the think-aloud session that connected their solution strategies with her instructional approaches. She seemed quite engaged and excited about 97 her leaming and commented about how useful think-alouds could be to inform instruction. Summary of External Post hoc Evaluation The main results of the post hoc evaluation showed 0 That the teacher and students often used the same cognitive processes to solve the pattern items, 0 The category system sufficiently captured the major cognitive processes used by the students when solving the items, 0 The need to re-word some of the categories to make them more understandable to the mathematics community, 0 That a teacher and students can see different patterns in an item and both patterns can be correct, 0 That a teacher is unable to see a pattern that some students can see, 0 The importance of including content experts at the beginning of this type of research, and 0 The usefulness of the think-aloud procedure to inform instruction. Overall Summgy of Results This study examined whether students used different cognitive processes to answer multiple-choice and constructed-response items. Every analysis led to the same conclusion; that is, regardless of item format, students tended to use the same cognitive processes when solving the algebraic pattern items. One exception arose when I compared the cognitive processes associated with the Balanced Assessment item with those of the other 32 items. In this one situation, students verbalized a few different 98 cognitive processes (e. g., category AA) and they appeared much more engaged (e. g., relentless pursuit of a solution) when solving the item. This finding suggested that if test developers want students to be deeply engaged with an item and foster their use of problem solving processes, then the items have to be intentionally written in a way that elicits such thinking. The Balanced Assessment item was developed to elicit these behaviors; however, it also required more time to solve and to score compared to the other items in this study. The multiple-choice and constructed-response items in the dots and stars item- farnily came closest to eliciting the level of engagement and types of cognitive processes observed in the Balanced Assessment item. Each of these items had a few common characteristics, which perhaps contributed to the common results: 0 They allowed for multiple solution paths, They required several minutes to solve, o The problem space required students to extend the pattern several steps beyond the set-up pattern, 0 They required an understanding of an algebraic formula to facilitate finding the solution, and 0 They were difficult items (low p-values). These complex combinations of item characteristics were absent in the other seven item-families. As anticipated from the beginning of the study, item characteristics and item quality apparently played integral roles in eliciting higher order thinking processes, in both multiple—choice and constructed-response item formats. 99 One unintended finding emerged from the panel’s post hoc evaluation. I came to realize the importance of including mathematics experts when conducting this type of study. In hindsight, the experts should have been involved from the beginning of the category development, rather than enlisted as post hoc reviewers. 
It became clear that their expert knowledge would have aided in the initial development of the cognitive process categories and it would have facilitated in the identification of the cognitive processes, especially when the students’ thinking was difficult to follow. Furthermore, the categories would have been based on a formal, content based language, which would have been familiar to mathematics teachers, compared to the informal, unrefined language invented by a non-mathematics expert. 100 CHAPTER V Summary, Conclusions and Next Steps Overview of Study The purpose of this study was to determine whether test takers use different cognitive processes when they solve multiple-choice and constructed-response items. I conducted this study in an era when school accountability and hi gh-stakes, large-scale assessments were seemingly as important as student learning itself. Thus, given the importance placed on accountability and testing by policy-makers, parents, school personnel, and other stake-holders, I sought to deepen our understanding about how students interact with test items — the foundation of a test score. Measurement professionals use a myriad of statistical procedures to assess how students interact with test items. For example, psychometricians use either classical test theory or IRT models to examine the difficulty and discriminating ability of an item. The impact of an item’s contribution to a student’s final test score partly depends on how the total test score is calculated (i.e., pattern scoring, unweighted number correct, or weighted number correct). But, regardless of how the final test score is calculated, each item “possesses” a certain amount of information (in the parlance of IRT) that ultimately contributes to a student’s test score. The amount of item information varies by item and often by item format. Historically, multiple-choice items, on average, contribute more information (per unit of testing time and cost) to a test score compared to constructed-response items. This type of item information is important for many aspects of testing, for instance during forms assembly when a test developer has to select items that match a predetermined test 101 information function. Pearson and Garavaglia (1997) proposed additional ways to think about test and item information for large-scale assessments. One of their notions was to look at item information through the lens of information-value (or value-added). This general perspective was adopted for this study. The sample for this study consisted of 8th grade students who were enrolled in mathematics courses that ranged from general mathematics to algebra. The teachers selected the classrooms from which they allowed students to be selected. Because all 8th graders were not accessible to me, the sample could not be considered a true random sample. However, to obtain a sample that was as representative as possible, while working within the constraints of the two schools, I generated a sample list from which I drew the sample. The list consisted of over 400 students from the two schools. Using a variation of systematic sampling with a random start, I selected 17 students from each of the two schools. The students knew that their verbal responses would remain anonymous, and that their participation would neither improve nor harm their mathematics course grades. There was no attrition in the sample. 
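A small sketch of the sampling step described above, systematic sampling with a random start, is shown below. The roster construction and names are hypothetical; only the sampling logic mirrors the procedure described.

    # Illustrative systematic sample with a random start: select k students
    # from an ordered roster by stepping through it at a fixed interval.
    import random

    def systematic_sample(roster, k):
        step = len(roster) // k            # sampling interval
        start = random.randrange(step)     # random start within the first interval
        return [roster[start + i * step] for i in range(k)]

    roster = [f"student_{i:03d}" for i in range(1, 201)]  # e.g., one school's list
    print(systematic_sample(roster, k=17))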
Three instruments were used to collect the data: (a) a short demographic survey, (b) a test booklet, and (c) a protocol guide booklet. The test booklet was the main instrument as it contained the test items. The items were from the algebra strand and specifically assessed the area of algebraic patterns. Each of two test booklets was composed of 17 pattern items; eight of the items were multiple-choice, eight were constructed-response, and one was a performance item. The one performance item appeared in both test booklets. The 17 students within each school were randomly 102 assigned to each test booklet to control for any cunicular and instructional differences between classrooms and schools. I developed 32 (except the performance item) pattern items from the model-shell item writing procedure I created for this study. Generally, I started the item writing process with an original NAEP item that was written as a multiple-choice (or, in some instances the original item was written as a constructed-response) item. I then removed the response options (or created response options) and created a constructed-response version (or a multiple-choice version) of the original item. The third item was created by slightly altering the content of the original item and writing a multiple-choice (or constructed-response) item. The fourth item was created by removing the response options (or adding response options) from the third item to create a constructed-response (multiple-choice) version of the third item. These four items were called an item-family. The first and third and the second and fourth items were called comparable items because they represented the items with the slightly altered, yet comparable content, presented in the two different item formats. I created eight item-families from eight original N AEP items. The purpose of writing the 32 items was to maintain content across the four items within an item-family so that differences in cognitive processes could be attributed to item format rather than item content. The performance item was added to the other 32 items for the following reason. If I found no differences between the constructed-response and multiple-choice items but I found some differences in the cognitive processes elicited by the Balanced Assessment item, I then would be able to attribute the absence of between item format 103 differences to the idea that the constructed-response items did not tap the kinds of cognitive processes that were tapped by the performance item. A think-aloud procedure was used to collect the data for this study. I selected the procedure after I reviewed several analytical models (e.g., IRT, factor analysis, MANOVA, structural equation models) that were used in studies from the relevant published literature. However, these models do not capture the cognitive processes students used to solve the items. The think-aloud procedure does capture the needed data so I ultimately selected a methodology that was (and is) seldom used within the field of measurement, but that was most appropriate for this study. The verbal data that emerged from the think-alouds were analyzed using a constant comparative approach to identify the categories that represented the cognitive processes. Both descriptive statistics and qualitative narratives were used to analyze the verbal data. After the data were analyzed, I convened a group of four mathematics experts to conduct a post hoc evaluation of the cognitive process categories and the analysis procedures. 
As part of the post hoc review, I duplicated the think-aloud procedure with one of the reviewers. She answered the items twice, first as a mathematics expert and second as an 8th grade student. That is, she was instructed to solve each item using a strategy that she thought an 8th grade student would use.

The overall result from both the qualitative and quantitative data analyses indicated that students employed similar cognitive processes to solve algebraic pattern items written in the multiple-choice and constructed-response formats, especially for the cloned NAEP items. A few unique cognitive processes emerged from the dots and stars and Balanced Assessment items that were not observed in the other items; however, the cognitive processes were similar between the multiple-choice and constructed-response versions of these items.

Conclusions

I offer four conclusions based on the findings in this study. First, the addition of constructed-response items, at least as they have been operationalized in the current National Assessment of Educational Progress, does not appear, on the basis of the current analysis, to add information about algebraic patterns beyond what could be gathered from multiple-choice items. This conclusion is based on the common cognitive processes students used to solve the multiple-choice and constructed-response items, both within each item-family and between them. A plausible reason for the lack of between-format differences is that the constructed-response items available in this study do not function like genuine performance items. Even the stars and dots items, which come closest to resembling a performance item, do not elicit different cognitive processes between the item formats.

Two features distinguish the items in the item-families from the performance item. First, all the constructed-response items, except the stars and dots items, had, or at least elicited, only one solution path. The Balanced Assessment item, the only item in the corpus that had the look and feel of a genuine performance item, elicited multiple solution paths (albeit largely unsuccessful ones for the students in my sample). Second, the students found most of the items easy to moderately easy to solve, except for the stars and dots items and the performance item. In fact, very few students successfully solved the latter item. Thus, perhaps one way to capture different processing between item formats is to develop constructed-response items that more closely resemble the unique features of a performance item — open solution paths and a moderate to high level of item difficulty.

Second, students appeared to use the most efficient strategy to solve any given item, regardless of item format. That is, if the presentation of an item leads students to the one most efficient way to solve it, then that is the strategy the students will use, regardless of item format. Evidence of this conclusion was even seen in the Balanced Assessment item. On the surface, the most efficient way to solve the item is to count the dots within the diagonal rectangle. All of the students who attempted this item used the counting strategy to solve the first problem of the item. Most of the students continued to use, or at least attempted to use, this strategy as the problem became more difficult.
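A brief illustration may help here. For the stars and dots items (Appendix B), drawing and counting works for the printed steps, but the efficient strategy is a rule for step n. The closed forms below are my own illustration, consistent with the step counts printed in Appendix B; they are not a prescribed scoring rubric.

Stars: 3, 8, 15, ... fits S(n) = n(n + 2) = n^2 + 2n, so S(16) = 16 × 18 = 288.
Dots:  2, 6, 12, ... fits D(n) = n(n + 1) = n^2 + n, so D(20) = 20 × 21 = 420.

Both values appear among the multiple-choice response options for those items in Appendix B.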
Third, item features — item difficulty, problem-space complexity, and multiple solution paths — seem to play an integral role in determining the types of cognitive processes elicited by an item. Comparing the dots and stars item-family and the Balanced Assessment item to the other items provides the evidence for this conclusion. For instance, most of the items used in this study are easy (indicated by item means ranging from .49 to 1.0), have one solution path, and are not intrinsically complex patterns. The dots and stars and Balanced Assessment items had the opposite features. As mentioned above, different cognitive processes became evident when I analyzed the verbal data for these five items. Even more telling is the evidence that emerges when comparing the Balanced Assessment item to the dots and stars items. Specifically, the students used a few different cognitive processes to solve these items, compared to the other items in the study, but the degree to which they used the cognitive processes differed qualitatively. That is, although category AA (grapple with information) emerged for both the dots and stars and the Balanced Assessment items, the students struggled longer and harder on the Balanced Assessment item than on the dots and stars items. The Balanced Assessment item was the most difficult and complex item in the study, yet it allowed for the emergence of unique cognitive processes. If test developers want items on large-scale assessments to measure a range of cognitive processes, then perhaps the items need to be moderately difficult, offer multiple solution strategies, and present a complex problem space.

Fourth, I conclude that the think-aloud is a valuable tool in the field of measurement, especially when investigating research questions that deepen or broaden our understanding of the cognitive traits underlying items or that add new information about the construct of interest. As mentioned previously, psychometricians rarely employ the think-aloud methodology for either of these purposes. This study provides some evidence of its effectiveness.

Recommendations

Researchers, test developers, and test users would likely be interested in the results of this study. I propose recommendations for each group by suggesting improvements to this study's design or applications of the methodology.

Researchers. There are a few design features that other researchers may consider altering if they attempt to replicate this study. First, I recommend involving content experts and teachers from the outset, rather than at the end of the study in a post hoc fashion. Their content knowledge will help define the cognitive processes that emerge from the students' thinking. Their content knowledge also allows them to describe the categories precisely, using terminology familiar to the mathematics community. They also have unique insight, gained from daily interaction with students, into how students approach certain mathematics concepts; this is especially true of teachers with several years of teaching experience. Second, I recommend that researchers practice the think-aloud procedure in a pilot study or other practice setting. Knowing when to remain quiet and when to probe comes with practice. In the pilot study, my first few think-alouds resembled tutoring sessions, a common occurrence among new users (Paulsen, 1999).
I also had to practice maneuvering through the protocol guide and asking prompts in a way that would not interrupt or lead the students' thinking. Beyond practice, I suggest that novices work with an experienced user of think-alouds and be observed and critiqued by an experienced interviewer during practice interviews. Researchers could also consider using videotapes to monitor and evaluate their interview technique. Third, researchers could examine the relationship between instruction and performance by determining whether students are more likely to use (and more successful when using) strategies that they have been explicitly taught as part of their curriculum. If the students all use the same solution approach, regardless of instructional practice and item format, then the results would support the assertion that students use the most intuitively efficient and straightforward solution strategy rather than one they have been taught as part of their curriculum. Finally, when feasible, researchers should use think-alouds to supplement information obtained from IRT or factor analysis when examining test- or item-related research questions. For instance, the item information curve generated by an IRT model is a useful indicator of the variability of an item, but the reasons for that variability cannot be detected by the model and therefore remain unknown. The same is true for the factors produced by factor analysis. The unknown can become better understood if the information obtained from a think-aloud complements, deepens, broadens, or contradicts the analytical information.

Test Developers. I recommend that the think-aloud methodology be used to inform item development and research on the validity of test items. The results of one study (Paulsen, 1999) indicated that the think-aloud procedure can be successfully employed to assess the construct validity of items by asking students to state their understanding of the items and then comparing their understandings with the intention of the item writer. Paulsen also suggested that think-alouds can be used to detect potential item problems — such as confusing or awkward sentence structure, multiple correct answers, typographical errors, or confusing graphics — that may confuse the test takers and thereby lead them to give an erroneous answer. This use of think-alouds can bolster the validity of items. In the Paulsen study, the think-alouds occurred while the items were still in the development stage and therefore had not been fully screened or pilot tested. Perhaps the think-aloud procedure should instead be used after items have been fully screened, because the think-aloud approach is too expensive and labor-intensive to detect the sorts of item problems that experienced item reviewers can detect more efficiently. Paulsen (1999) agrees, adding that eliminating obvious item problems before items are subjected to a think-aloud means that students will spend less time deciphering the errors and more time thinking about how to answer the items.

Test Users. Teachers could use the think-aloud procedure to make curriculum and instruction decisions. For instance, teachers can gain firsthand information about the various solution paths students use to solve the items, which can in turn inform them about how students apply (or misapply) the concepts taught by the teacher.
Teachers also can detect whether subgroups of students — for example, students with some type of cognitive disability — have difficulty applying certain mathematical concepts by purposefully studying the cognitive processes used by the subgroup(s). Finally, teachers can assess their own item writing ability by conducting think-alouds on the items they write for their classroom tests. They may learn that the students' understanding of an item is not aligned with the teacher's intention of what the item is supposed to measure. In addition, there is some evidence in the reading education literature (cf. Baumann and Seifert-Kessel, 1992) that the think-aloud approach can actually help students develop new and transferable strategies.

Limitations and Next Steps

As I have implied throughout this report, this work has serious limitations, limitations that compromise its generalizability. First, the subjects in the study are by no means nationally representative, even though I did attempt to obtain diversity with respect to ethnicity and income. Second, the sample is small; a larger sample would have increased the power of the analysis, thus increasing the capacity to detect subtle but significant differences. Third, the sample of items is even more limited; the items represent one very narrow strand in the middle school mathematics curriculum. Fourth, a different and more elaborate categorization scheme for coding the think-alouds, the kind that might have emerged had I engaged the content experts earlier in the process, may have yielded greater sensitivity to the cognitive depth that proved so elusive to uncover. Fifth, other items — other items from the algebra strand and items from other mathematical strands — might have yielded different cognitive processes.

There is a need for further research on the cognitive processes that students use to solve test items. Specifically, certain student populations may use or apply cognitive processes differently, which in turn may affect how teachers decide to teach the concepts to the various subgroups. This type of research could be implemented by purposefully sampling the subgroup(s) of interest and replicating the think-aloud methodology used for this study. In addition, further research could be conducted to determine whether mixing items from more than one mathematical strand (e.g., number sense and geometry) affects how students cognitively process test items. In this study, I purposefully limited the area to algebraic patterns to obtain a "tight" study design. But the design does not reflect the multiple mathematics concepts assessed on large-scale assessments. This design alteration could also reveal how easily students switch their cognitive processing when alternating from one mathematical area to another. Such information could also affect test development. That is, a mathematics test may include items from geometry and algebra and alternate the presentation of the two strands. If the results of the think-aloud study indicate that students are not effective or efficient at moving between the strands, then perhaps test developers would reconsider the ordering of items on a test. Finally, this study focused on a very narrow area of mathematics. Additional research is needed within mathematics and in other content areas, especially in areas often tested on large-scale assessments, such as science and writing.

APPENDIX A

Student Demographic Survey

Student ID:

1. How old are you?

2. What school do you attend?
3. Are you a male or female? (Circle one)
   A. Female
   B. Male

4. How often do you do math homework? (Circle one)
   A. Few times a month
   B. Once a week
   C. Few times a week
   D. Every day

5. How often do you read for fun? (Circle one)
   A. Few times a month
   B. Once a week
   C. Few times a week
   D. Every day

APPENDIX B

Algebra Items

Form A

Family 1 - comparable items:

[Pattern figure not reproduced.]

1. In the pattern above, which figure would be next?
   A.-D. [Figure response options not reproduced.]

[Pattern figure not reproduced.]

2. In the pattern above, what figure would be next?
   Answer:

Family 4 - comparable items:

1. From 1 vertex of a 4-sided polygon, 2 triangles can be drawn.
   From 1 vertex of a 5-sided polygon, 3 triangles can be drawn.
   From 1 vertex of a 6-sided polygon, 4 triangles can be drawn.
   From 1 vertex of a 7-sided polygon, 5 triangles can be drawn.

   How many triangles can be drawn from 1 vertex of a 20-sided polygon?
   A. 17
   B. 18
   C. 20
   D. Infinity

2. From any vertex of a 4-sided polygon, 1 diagonal can be drawn.
   From any vertex of a 5-sided polygon, 2 diagonals can be drawn.
   From any vertex of a 6-sided polygon, 3 diagonals can be drawn.
   From any vertex of a 7-sided polygon, 4 diagonals can be drawn.
   [Small polygon figures not reproduced.]

   How many diagonals can be drawn from any vertex of a 20-sided polygon?
   Answer:

Family 5 - comparable items:

1. If the pattern shown in the table were continued, what number would appear in the box at the bottom of column B next to 14?

      A     B
      2     5
      4     9
      6    13
      8    17
     14     ?

   Answer:

2. If the pattern shown in the table were continued, what number would appear in the box at the bottom of column B next to 13?

      A     B
      1     4
      3     8
      5    12
      7    16
     13     ?

   A. 18
   B. 26
   C. 28
   D. 32

Family 6 - comparable items:

      Puppy's Age    Puppy's Weight
      1 month        5 lbs.
      2 months       12 lbs.
      3 months       17 lbs.
      4 months       20 lbs.
      5 months       ?

1. Jim records the weight of his puppy every month in a chart like the one shown above. If the pattern of the puppy's weight gain continues, how many pounds will the puppy weigh at 5 months?
   A. 30
   B. 25
   C. 23
   D. 21

      Puppy's Age    Puppy's Weight
      1 month        10 lbs.
      2 months       15 lbs.
      3 months       19 lbs.
      4 months       22 lbs.
      5 months       ?

2. John records the weight of his puppy every month in a chart like the one shown above. If the pattern of the puppy's weight gain continues, how many pounds will the puppy weigh at 5 months?
   Answer:

Family 8 - comparable items:

1. This question requires you to show your work and explain your reasoning. You may use drawings, words, and numbers in your explanation. Your answer should be clear enough so that another person could read it and understand your thinking. It is important that you show all your work.

   A pattern of stars is shown below. At each step, more stars are added to the pattern. The number of stars added at each step is more than the number added in the previous step. The pattern continues infinitely.

   (1st step)    (2nd step)    (3rd step)
   [Star arrays for the three steps not reproduced.]
   3 Stars       8 Stars       15 Stars

   Joan has to determine the number of stars in the 16th step, but she does not want to draw all 16 pictures and then count the stars. Explain or show how she could do this and give the answer that Joan should get for the number of stars.

2. A pattern of dots is shown below. At each step, more dots are added to the pattern. The number of dots added at each step is more than the number added in the previous step. The pattern continues infinitely.

   (1st step)    (2nd step)    (3rd step)
   [Dot arrays for the three steps not reproduced.]
   2 Dots        6 Dots        12 Dots

   Marcy has to determine the number of dots in the 20th step, but she does not want to draw all 20 pictures and then count the dots.

   2a. How could Marcy figure out how many dots are in the 20th step?
       A. She could figure out an algebraic formula that explains all 3 steps given and then apply the formula to the 20th step.
       B. She could figure out the answer for the 4th step and then multiply the answer by 5.
       C. She could draw the figures for steps 4 through 10 and then double the answer she got in the 10th step.
       D. She could subtract the number of the step (20th step) from the square of the step (20²).

   2b. What answer should Marcy get in the 20th step?
       A. 100
       B. 220
       C. 420
       D. 1,220

Form B

Family 1 - comparable items:

1. If the pattern shown in the table were continued, what number would appear in the box at the bottom of column B next to 14?

      A     B
      2     5
      4     9
      6    13
      8    17
     14     ?

   A. 19
   B. 21
   C. 23
   D. 25
   E. 29

2. If the pattern shown in the table were continued, what number would appear in the box at the bottom of column B next to 13?

      A     B
      1     4
      3     8
      5    12
      7    16
     13     ?

   Answer:

Family 2 - comparable items:

      Puppy's Age    Puppy's Weight
      1 month        5 lbs.
      2 months       12 lbs.
      3 months       17 lbs.
      4 months       20 lbs.
      5 months       ?

1. Jim records the weight of his puppy every month in a chart like the one shown above. If the pattern of the puppy's weight gain continues, how many pounds will the puppy weigh at 5 months?
   Answer:

      Puppy's Age    Puppy's Weight
      1 month        10 lbs.
      2 months       15 lbs.
      3 months       19 lbs.
      4 months       22 lbs.
      5 months       ?

2. John records the weight of his puppy every month in a chart like the one shown above. If the pattern of the puppy's weight gain continues, how many pounds will the puppy weigh at 5 months?
   A. 30
   B. 27
   C. 25
   D. 24

Family 4 - comparable items:

1. A pattern of stars is shown below. At each step, more stars are added to the pattern. The number of stars added at each step is more than the number added in the previous step. The pattern continues infinitely.

   (1st step)    (2nd step)    (3rd step)
   [Star arrays for the three steps not reproduced.]
   3 Stars       8 Stars       15 Stars

   Joan has to determine the number of stars in the 16th step, but she does not want to draw all 16 pictures and then count the stars.

   1a. How could Joan figure out how many stars are in the 16th step?
       A. She could figure out an algebraic formula that explains all 3 steps given and then apply the formula to the 16th step.
       B. She could figure out the answer for the 4th step and then multiply the answer by 5.
       C. She could draw the figures for steps 4 through 8 and then double the answer she got in the 8th step.
       D. She could subtract the number of the step (16th step) from the square of the step (16²).

   1b. What answer should Joan get in the 16th step?
       A. 100
       B. 220
       C. 288
       D. 1,118

2. This question requires you to show your work and explain your reasoning. You may use drawings, words, and numbers in your explanation. Your answer should be clear enough so that another person could read it and understand your thinking. It is important that you show all your work.

   A pattern of dots is shown below. At each step, more dots are added to the pattern. The number of dots added at each step is more than the number added in the previous step. The pattern continues infinitely.

   (1st step)    (2nd step)    (3rd step)
   [Dot arrays for the three steps not reproduced.]
   2 Dots        6 Dots        12 Dots

   Marcy has to determine the number of dots in the 20th step, but she does not want to draw all 20 pictures and then count the dots.
   Explain or show how she could do this and give the answer that Marcy should get for the number of dots.

Family 5 - comparable items:

[Pattern figure not reproduced.]

1. In the pattern above, what figure would be next?
   Answer:

[Pattern figure not reproduced.]

2. In the pattern above, which figure would be next?
   A.-D. [Figure response options not reproduced.]

Family 8 - comparable items:

1. From 1 vertex of a 4-sided polygon, 2 triangles can be drawn.
   From 1 vertex of a 5-sided polygon, 3 triangles can be drawn.
   From 1 vertex of a 6-sided polygon, 4 triangles can be drawn.
   From 1 vertex of a 7-sided polygon, 5 triangles can be drawn.

   How many triangles can be drawn from 1 vertex of a 20-sided polygon?
   Answer:

2. From any vertex of a 4-sided polygon, 1 diagonal can be drawn.
   From any vertex of a 5-sided polygon, 2 diagonals can be drawn.
   From any vertex of a 6-sided polygon, 3 diagonals can be drawn.
   From any vertex of a 7-sided polygon, 4 diagonals can be drawn.

   How many diagonals can be drawn from any vertex of a 20-sided polygon?
   A. 14
   B. 17
   C. 19
   D. 20
   E. Infinity

Diagonal Rectangle Problem

[Figure of a diagonal rectangle drawn on a dot grid not reproduced.]

Let's call this shape a "4 by 5 diagonal rectangle." (One edge is 4 diagonal units long, and the other is 5 diagonal units long.)

1. How many dots are inside the 4 by 5 diagonal rectangle?
2. How many dots will lie inside a 5 by 6 diagonal rectangle?
3. How many dots will lie inside a 10 by 11 diagonal rectangle?
4. How many dots will lie inside a 100 by 101 diagonal rectangle? Explain how you got your answer.
5. How many dots will lie inside an n by (n+1) diagonal rectangle? Explain how you got your answer.

BIBLIOGRAPHY

Balanced Assessment Package (1997). Balanced assessment for the mathematics curriculum. Berkeley, CA: University of California.

Baumann, J. F., Seifert-Kessel, N., & Jones, L. A. (1992). Effect of think-aloud instruction on elementary students' comprehension monitoring abilities. Journal of Reading Behavior, 24, 143-167.

Bennett, R., Rock, D., Braun, H., Frye, D., Spohrer, J., & Soloway, E. (1990). The relationship of constrained free-response to multiple-choice and open-ended items. Applied Psychological Measurement, 13(2), 151-162.

Bennett, R., Rock, D., & Wang, M. (1991). Equivalence of free-response and multiple-choice items. Journal of Educational Measurement, 28(1), 77-92.

Bennett, R., & Ward, W. (Eds.) (1993). Construction versus choice in cognitive measurement: Issues in constructed response, performance testing, and portfolio assessment. Hillsdale, NJ: Lawrence Erlbaum Associates.

Campbell, J. (1995). A comparison of thinking processes elicited by multiple-choice and constructed-response questions on an assessment of reading comprehension. Paper presented at the National Reading Conference, New Orleans, LA.

Campbell, D. T., & Stanley, J. C. (1963). Experimental and quasi-experimental designs for research. Boston, MA: Houghton Mifflin.

Chauncey, H., & Dobbin, J. E. (1963). Testing: Its place in education today. New York: Harper and Row. Cited in M. Martinez (1993), Problem-solving correlates of new assessment forms in architecture, Applied Measurement in Education, 6(3), 167-180.

Curriculum and evaluation standards for school mathematics (1989). Reston, VA: National Council of Teachers of Mathematics.

Demby, A. (1997). Algebraic procedures used by 13- to 15-year-olds. Educational Studies in Mathematics, 33, 45-70.

Downing, S., & Haladyna, T. (1997). Test item development: Validity evidence from quality assurance procedures. Applied Measurement in Education, 10(1), 61-82.
Ericsson, K. A., & Simon, H. A. (1993). Protocol analysis: Verbal reports as data (Rev. ed.). Cambridge, MA: The MIT Press.

Farr, P., Pritchard, R., & Smitten, B. (1990). A description of what happens when an examinee takes a multiple-choice reading comprehension test. Journal of Educational Measurement, 27, 209-226.

Fraenkel, J. R., & Wallen, N. E. (1993). How to design and evaluate research in education (2nd ed.). New York: McGraw-Hill.

Frederiksen, N. (1984). The real test bias. American Psychologist, 39(3), 193-202.

Gerace, W. J., & Mestre, J. P. (1982). The learning of algebra by 9th graders: Research findings relevant to teacher training and classroom practice. Paper prepared for the National Institute of Education, Washington, DC.

Goldstein, H. (1994). Recontextualizing mental measurement. Educational Measurement: Issues and Practice, 12(1), 16-19, 43.

Hair, J., Anderson, R., Tatham, R., & Black, W. (1992). Multivariate data analysis with readings (3rd ed.). New York: Macmillan.

Haladyna, T. (1994). Developing and validating multiple-choice test items. Hillsdale, NJ: Lawrence Erlbaum Associates.

Hambleton, R. K., & Swaminathan, H. (1987). Item response theory: Principles and applications. Boston, MA: Kluwer-Nijhoff.

Hamilton, L., Nussbaum, E., & Snow, R. (1997). Interview procedures for validating science assessments. Applied Measurement in Education, 10(2), 181-200.

Harris, D. (1993). Practical issues in equating. Paper presented at the Annual Meeting of the American Educational Research Association, Atlanta, GA.

Lukhele, R., Thissen, D., & Wainer, H. (1994). On the relative value of multiple-choice, constructed response, and examinee-selected items on two achievement tests. Journal of Educational Measurement, 31, 234-250.

Martinez, M. (1991). A comparison of multiple-choice and constructed-response figural response items. Journal of Educational Measurement, 28(2), 131-145.

Martinez, M. (1993). Problem-solving correlates of new assessment forms in architecture. Applied Measurement in Education, 6(3), 167-180.

Mehrens, W. A., & Lehmann, I. J. (1987). Using standardized tests in education (4th ed.). New York: Longman.

Montague, M., & Applegate, B. (1993). Middle school students' mathematical problem solving: An analysis of think-aloud protocols. Learning Disability Quarterly, 16, 19-30.

Mueller, G. E. (1911). Zur Analyse der Gedächtnistätigkeit und des Vorstellungsverlaufes: Teil 1. Zeitschrift für Psychologie, 5. Cited in Ericsson, K. A., & Simon, H. A. (1993), Protocol analysis: Verbal reports as data, Cambridge, MA: The MIT Press.

National Center for Education Statistics. (1996). National Assessment of Educational Progress (NAEP), 1996 Mathematics Assessment.

Nisbett, R. E., & Wilson, T. D. (1977). Telling more than we can know: Verbal reports on mental processes. Psychological Review, 84, 231-259.

Norris, S. P. (1990). Effects of eliciting verbal reports of thinking on critical thinking test performance. Journal of Educational Measurement, 27, 41-58.

Paulsen, C. (1999). An exploratory study of cognitive laboratories for development and construct validation of reading and mathematics achievement test items. Unpublished doctoral dissertation, University of Pennsylvania.

Pearson, P. D., & Garavaglia, D. (1997). Improving the information value of performance items in large scale assessments. Washington, DC: Author.

Pearson, P. D., Garavaglia, D., Rodriguez, M., Danridge, J., & Montanez, M. (1997).
Investigating cognitive engagement through think-alouds. Unpublished manuscript.

Silver, E. (1997). Algebra for all - Increasing students' access to algebraic ideas, not just algebra courses. Mathematics Teaching in the Middle School, 2(4), 204-207.

Snow, R. E. (1993). Construct validity and constructed-response tests. In R. E. Bennett & W. C. Ward (Eds.), Construction versus choice in cognitive measurement: Issues in constructed response, performance testing, and portfolio assessment (pp. 45-60). Hillsdale, NJ: Lawrence Erlbaum Associates.

Thissen, D., Wainer, H., & Wang, X.-B. (1994). Are tests comprising both multiple-choice and free-response items necessarily less unidimensional than multiple-choice tests? An analysis of two tests. Journal of Educational Measurement, 31, 113-123.

Werts, C., Breland, H., Grandy, J., & Rock, D. (1980). Using longitudinal data to estimate reliability in the presence of correlated errors of measurement. Educational and Psychological Measurement, 40(1), 19-29.