EFFECTS OF FEEDBACK TIMING AND TYPE ON LEARNING ESL GRAMMAR RULES

By

Elizabeth H. P. Lavolette

A DISSERTATION

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of Second Language Studies – Doctor of Philosophy

2014

ABSTRACT

EFFECTS OF FEEDBACK TIMING AND TYPE ON LEARNING ESL GRAMMAR RULES

By Elizabeth H. P. Lavolette

The optimal timing of feedback on formative assessments is an open question, with the cognitive processing window theory (Doughty, 2001) underlying the interaction approach suggesting that immediate feedback may be most beneficial for language acquisition (e.g., Gass, 2010; Polio, 2012) and two educational psychology hypotheses conversely suggesting that delayed feedback may be superior for error correction (dual-trace hypothesis, Kulik & Kulik, 1988; interference-perseveration hypothesis, Kulhavy & Anderson, 1972). To explore the effects of varied feedback timing on both item learning and rule generalization, 118 intermediate ESL students were randomly assigned to item-by-item or end-of-test computerized feedback conditions. Within each timing group, half of the students received feedback that indicated the correct answer and whether they had answered correctly or incorrectly (without metalinguistic feedback). The other students received additional feedback that stated a rule that applied to the item (metalinguistic feedback). A pretest, two treatments, a 5-minute-delayed posttest, and a 1-week-delayed posttest were administered. Each treatment contained 17 multiple-choice items that were followed by item-by-item or end-of-test feedback. The pretest and both posttests included all items from the treatment (to test item learning) plus 10 new multiple-choice items to test generalization of rules. The data were analyzed using mixed-design ANOVAs.

The item-by-item metalinguistic feedback group had higher gain scores than the other feedback groups on the treatment items on both posttests, although no significant main effects were found for either feedback timing or type. This suggests that item-by-item metalinguistic feedback is better for item learning. On the items that did not appear on the treatment, the item-by-item groups outperformed the end-of-test groups, with a marginally significant main effect of feedback timing, F(1, 108) = 3.61, p = .06, partial η² = .032. This suggests that item-by-item feedback may be better for learning to generalize. In addition, the groups that received item-by-item feedback spent significantly less time reading the feedback than did the groups who received end-of-test feedback, F(1, 108) = 4.14, p = .044, partial η² = .037. These combined results suggest that item-by-item metalinguistic feedback may be more effective and efficient for language learners for both item learning and learning to generalize, although the small effect sizes indicate that providing this type and timing of feedback should be only one of many interventions to improve instruction. In addition, these results lend support to the cognitive processing window theory and the attention-based theory underlying the interaction approach.

I dedicate this work to my husband, R. Jess Lavolette. This work was only possible with your love and support.

ACKNOWLEDGEMENTS

Many people contributed to this study. First, I thank all of the members of my committee, Dr. Susan Gass and Dr. Paula Winke, who have provided helpful guidance and feedback throughout the process. In particular, my co-chairs, Dr. Senta Goertler and Dr.
Charlene Polio, have spent countless hours advising me and challenging me to improve my work. I am grateful to numerous others who contributed to this work. Thank you to Rod Ellis, who provided helpful feedback on the study design. Thank you to the friends and family who anonymously responded to the norming survey for the test items. Thank you to Brian Adams in the MSU College of Arts and Letters, who installed the Concerto testing platform on a server, which allowed me to collect all of the data. Thank you to the students in Dr. Goertler’s CALL class, who gave me helpful feedback on the pilot study proposal and report drafts. Thank you to Dr. Daniel Reed, who kindly granted permission for me to conduct my research in ELC classes and provided data on mean TOEFL scores. Thank you to the teachers in the MSU English Language Center who allowed me to use their class time and access their students for my pilot and main studies: Leah Addis, Collin Blair, Janet Colson, Carmella Gillette, Ashley Hewlett, Peter Hoffman, David Krise, Ann Letson, Alicia Norgrove, Laura Ramm, Stacy Sabraw, Peter Sakura, Carlee Salas, and Cristen Vernon. Thank you to Mike Kramizeh, who was unendingly patient in scheduling the computer labs, and thank you to the lab assistants, who were very helpful in getting the labs prepared for my study. Last but not least, thank you to the ELC students who participated in the study.

TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES
CHAPTER 1: INTRODUCTION
1.1 Purpose and Rationale of the Study
1.2 Definitions
1.3 Importance of the Current Study
1.4 Overview
CHAPTER 2: THEORETICAL BACKGROUND AND LITERATURE REVIEW
2.1 SLA Theories and Feedback Timing
2.2 Language Research on Feedback Timing
2.2.1 SLA research
2.2.2 CALL research
2.2.3 Language assessment research
2.2.4 Other language-related research
2.3 Educational Psychology Theories of Feedback Timing
2.4 Educational Psychology Research on Feedback Timing
2.4.1 Review articles
2.4.2 Individual studies
2.4.2.1 Delay longer than end-of-test more effective than end-of-test
2.4.2.2 Delay longer than end-of-test more effective than item-by-item
2.4.2.3 Delay longer than end-of-test more effective than delay shorter than end-of-test; end-of-test more effective than delay shorter than end-of-test
2.4.2.4 End-of-test more effective than item-by-item
2.4.2.5 Delay shorter than end-of-test more effective than item-by-item
2.4.2.6 Item-by-item more effective than end-of-test
2.4.2.7 Item-by-item more effective than delay longer than end-of-test
2.4.2.8 Item-by-item more effective than delay shorter than end-of-test
2.4.2.9 No difference between item-by-item and end-of-test/various delays
2.4.2.10 No difference between end-of-test and longer delay
2.4.2.11 Other
2.5 SLA Theories and Metalinguistic Feedback
2.6 Language Research on Metalinguistic Feedback
2.6.1 SLA and CALL research
2.6.2 L2 assessment research on metalinguistic feedback
2.7 Educational Psychology Theories of Informational Feedback
2.8 Educational Psychology Research on Informational Feedback
2.9 Research Questions and Hypotheses
2.9.1 Predictions for Research Question 1
2.9.2 Predictions for Research Question 2
2.9.3 Predictions for interaction effects between feedback timing and feedback type
CHAPTER 3: METHOD
3.1 Participants
3.2 Materials
3.2.1 Pretest
3.2.2 Treatments 1 and 2
3.2.3 Five-minute-delayed posttest
3.2.4 One-week-delayed posttest
3.3 Procedure
3.4 Analysis
3.4.1 Research Question 1
3.4.2 Research Question 2
3.4.3 Question and feedback display times
3.5 Summary of Analyses
CHAPTER 4: RESULTS
4.1 Preliminary Results
4.2 Research Question 1
4.2.1 Research Question 1a: Repeated items
4.2.2 Research Question 1b: New items
4.3 Research Question 2
4.3.1 Repeated items
4.3.2 New items
4.4 Summary of Results
4.4.1 Research Question 1a: Item learning
4.4.2 Research Question 1b: System learning
4.4.3 Research Questions 2a and b: Reinforcing correct responses and correcting errors
CHAPTER 5: DISCUSSION
5.1 Summary of Findings
5.2 Research Question 1
5.2.1 Research Question 1a: Item learning
5.2.2 Research Question 1b: System learning
5.3 Interaction Effects
5.3.1 Time x feedback timing x feedback type interaction for item learning
5.3.2 Time x feedback timing interaction for item learning
5.4 Why Results Differ Between Current Study and Previous Literature
CHAPTER 6: CONCLUSION
6.1 Summary of Findings
6.2 Pedagogical and CALL Implications
6.3 Theoretical and Research Implications
6.4 Limitations
6.5 Future Directions
APPENDICES
Appendix A: Consent Form
Appendix B: Article Rules
Appendix C: Test Items
Appendix D: Exit Questionnaire
REFERENCES

LIST OF TABLES

Table 1: Summary of Language Learning Studies on Feedback Timing
Table 2: Summary of Educational Psychology Review Articles on Feedback Timing
Table 3: Summary of Individual Educational Psychology and Other Studies on Feedback Timing
Table 4: Theoretical Predictions for Posttest Results Based on Feedback Timing and Type
Table 5: Participants’ Demographic Information
Table 6: Research Questions and Corresponding Analyses of Gain Scores and Conditional Probabilities
Table 7: Analyses of Feedback and Question Display Times
Table 8: Overall Test Results
Table 9: Mean (SD) Gain Scores From Pretest to Each Posttest, Repeated Items
Table 10: Mean (SD) Total Time in Seconds Questions Were Displayed, Repeated Items Only
Table 11: Mean (SD) Total Time in Seconds Feedback Was Displayed
Table 12: Mean (SD) Gain Scores From Pretest to Each Posttest, New Items
Table 13: Mean (SD) Total Time in Seconds Questions Were Displayed, New Items Only
Table 14: Mean Conditional Probabilities (Standard Deviations), Repeated Items
Table 15: Mean Conditional Probabilities (Standard Deviations), New Items
Table 16: Research Questions and Results
Table 17: Results of ANOVAs of Display Times

LIST OF FIGURES

Figure 1: Example question.
Figure 2: Division of participants into feedback groups. IBI = item by item; EOT = end of test.
Figure 3: Metalinguistic feedback on an incorrect response.
Figure 4: Procedure.
Figure 5: Gain scores for two feedback timings on both posttests, repeated items only. IBI = item-by-item feedback; EOT = end-of-test feedback.
Figure 6: Gain scores of feedback groups on both posttests, repeated items only. IBI = item-by-item feedback; EOT = end-of-test feedback; meta = metalinguistic feedback; nonmeta = no metalinguistic feedback.
Figure 7: Interaction between time and feedback timing for total feedback display time. IBI = item-by-item feedback; EOT = end-of-test feedback.
Figure 8: Display time interaction for time x feedback timing x feedback type, new questions only. IBI = item-by-item feedback; EOT = end-of-test feedback; meta = metalinguistic feedback; nonmeta = no metalinguistic feedback.
Figure 9: Probability of selecting the correct response on an item on the 5-minute-delayed posttest (R2/W1) or 1-week-delayed posttest (R3/W1), given that it was answered incorrectly on the pretest, repeated items only. IBI = item-by-item feedback; EOT = end-of-test feedback; meta = metalinguistic feedback; nonmeta = no metalinguistic feedback.
Figure 10: Probability of selecting the correct response on an item on the 5-minute-delayed posttest (R2/R1) or 1-week-delayed posttest (R3/R1), given that it was answered correctly on the pretest, new items only. IBI = item by item; EOT = end of test.
Figure 11: (Duplicate of Figure 6.) Gain scores of feedback groups on both posttests, repeated items only. IBI = item-by-item feedback; EOT = end-of-test feedback; meta = metalinguistic feedback; nonmeta = no metalinguistic feedback.
Figure 12: (Duplicate of Figure 5.) Gain scores of feedback groups on both posttests, repeated items only. IBI = item-by-item feedback; EOT = end-of-test feedback.

CHAPTER 1: INTRODUCTION

1.1 Purpose and Rationale of the Study

Computer-assisted language learning (CALL) applications can provide instant feedback to language learners on a wide variety of activities, from multiple-choice questions and cloze exercises to speaking and writing activities.
Proponents of using technology in language teaching and assessment often explicitly claim that immediate feedback is superior to delayed feedback, without citing any evidence (e.g., Alderson, 2005; Brown, 1997; Chun & Brandl, 1992; García & Arias, 2010; Kane-Iturrioz, 2008; Lan, Sung, & Chang, 2007). When these claims are not explicit, they are often implicit in the design of language-learning applications (e.g., Amaral & Meurers, 2011; Heift, 2010; Nagata & Swisher, 1995; Nagata, 1999). In addition, major commercial CALL applications like Rosetta Stone, Tell Me More, Duolingo, Open English, and Pimsleur all provide immediate feedback to learners on most of their activities. However, no evidence shows that the immediate feedback produced by computer applications is more useful to language learners than similarly produced delayed feedback. Related research in educational psychology has shown that immediate feedback on multiple-choice questions (provided by a computer or by using specially prepared paper forms) may help students retain material better than feedback that is delayed until the end of the activity or until a day later (e.g., Dihoff, Brosvic, Epstein, & Cook, 2004; Kulik & Kulik, 1988). Other researchers have found delayed feedback to be more effective (e.g., Guzmán-Muñoz & Johnson, 2008; Schooler & Anderson, 1990). However, little is known about how feedback timing affects second language learning. Given these conflicting results in educational psychology and the gap in the CALL and SLA literature, the purpose of this study is to provide evidence of how varied feedback timing affects the acquisition of English by adult second language learners. Specifically, in this study, I focus on measuring how differing timings of feedback on multiple-choice questions affect learning to apply rules for using English articles. In addition, I address the question of how providing or not providing metalinguistic feedback affects learning the same rules and whether it interacts with feedback timing.

1.2 Definitions

Some concepts need to be defined before proceeding. First, a fundamental concept in this study is feedback. I use Cohen's (1985) simple definition from the context of computer-based instruction of “the message which follows the response made by the learner” (p. 33). This message may be delivered by an interlocutor during conversational interaction or during a test by a computer, and the message may follow more or less quickly after the response made by the learner.

The types of feedback that will be discussed are defined by the amount of information provided. At the end of the spectrum that provides less information, knowledge of results feedback tells the learner only whether he or she answered correctly or incorrectly. Providing slightly more information, knowledge of correct response feedback implicitly or explicitly includes knowledge of results feedback, but also indicates the correct response. Finally, informational feedback includes knowledge of correct response feedback plus additional information, such as a rule that can be extended beyond the current question.

For timing, the terms immediate and delayed are imprecise. The terms are not used in a systematic way in the literature, either within second language acquisition (SLA) or in educational psychology, despite the publication 25 years ago of a taxonomy of such feedback in computer-based instruction (Dempsey & Wager, 1988).
Drawing on this taxonomy and the work of Henshaw (2011), in the current study, I will focus on feedback on test items under two conditions: feedback provided on an item-by-item basis and feedback provided at the end of a test (corresponding to Dempsey and Wager’s (1988) “end-of-module” definition of feedback). When considering the previous literature, I will use two further categories: delays shorter than end-of-test, such as a delay of 7 seconds after a response to a question before feedback is provided; and delays longer than end-of-test, such as a 24-hour delay. I will use these terms to better describe and compare the previous literature.

The next concept that needs to be defined is learning. Two types of learning, system and item learning, will be considered in this study (Schmidt, 1995). A system has been learned if the learner can correctly apply a rule to a previously unseen situation, such as a multiple-choice question or a writing task. Note that being able to state the rule (a type of verbal information) is not what is considered here. Rather, system learning is an intellectual skill (Gagné, 1985) in which the rule is applied. Compared to item learning, this type of learning is closer to what is generally investigated by SLA researchers. As an example of system learning, imagine that a learner knows an explicit rule, such as “Use the when the context makes a noun known to the reader/listener.” Then, an item is presented such as “Its front tire was flat, so __ bicycle was unusable.” The learner has to fill in the correct article. The learner has no memory of the correct answer, so he or she must apply the rule to respond correctly.

The second type of learning is item learning, which is the type of learning generally studied in the educational psychology research on feedback timing. An item has been learned if the learner can correctly respond to it after a previous exposure to the item. If the learner has received feedback on the item that included the correct response, item learning may be the same as memorizing the correct response. This is a type of verbal information learning (Gagné, 1985). For example, imagine that the same item, “Its front tire was flat, so __ bicycle was unusable,” is presented, followed by the correct answer. If the item has been learned, the next time the learner sees this exact item, he or she may simply retrieve the correct answer from memory.

Finally, I will use the terms error correction and reinforcement of a correct response to differentiate between two possible feedback conditions. In both cases, feedback is provided to the learner on his or her response to a test item by providing the correct answer. Error correction is the case in which a learner has answered an item incorrectly. For example, if a learner is presented with the item “Its front tire was flat, so __ bicycle was unusable” and he or she chooses the response “a,” the feedback would indicate that the correct answer is “the” and would be an instance of error correction. Reinforcement of a correct response is the case in which the learner has answered correctly. For example, if a learner is presented with the same item and chooses the correct answer, “the,” then the same feedback (i.e., that the correct answer is “the”) would be an instance of the reinforcement of a correct response. Both of these types of feedback are subtypes of knowledge of correct response feedback, and both are derived from the behaviorist paradigm (e.g., Skinner, 1968).
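To make these definitions concrete in the computer-based context of the current study, the short sketch below shows how a CALL exercise could assemble knowledge of results, knowledge of correct response, and informational (i.e., metalinguistic) feedback for a multiple-choice article item, using the bicycle example above. This is only a minimal illustration under my own naming assumptions; the Item class and build_feedback function are hypothetical and do not correspond to the Concerto testing platform used in the study.

```python
from dataclasses import dataclass

@dataclass
class Item:
    stem: str      # sentence containing a blank
    choices: list  # possible answers
    key: str       # correct answer
    rule: str      # metalinguistic rule that applies to the item

def build_feedback(item, response, feedback_type):
    """Assemble a feedback message of the requested type (illustrative only)."""
    correct = response == item.key
    knowledge_of_results = "Correct." if correct else "Incorrect."
    if feedback_type == "KR":   # knowledge of results only
        return knowledge_of_results
    knowledge_of_correct_response = (
        f"{knowledge_of_results} The correct answer is '{item.key}'."
    )
    if feedback_type == "KCR":  # knowledge of correct response
        return knowledge_of_correct_response
    # informational feedback: KCR plus a rule that extends beyond this item
    return f"{knowledge_of_correct_response} Rule: {item.rule}"

item = Item(
    stem="Its front tire was flat, so __ bicycle was unusable.",
    choices=["a", "the", "(no article)"],
    key="the",
    rule="Use 'the' when the context makes a noun known to the reader/listener.",
)

# Item-by-item timing: the message is shown right after each response.
print(build_feedback(item, response="a", feedback_type="informational"))

# End-of-test timing: messages are stored and shown only after the last item.
pending = [build_feedback(item, response="a", feedback_type="KCR")]
# ... answer the remaining items, appending a message for each ...
for message in pending:
    print(message)
```

In the terms used later in this dissertation, the knowledge of correct response message corresponds to the condition without metalinguistic feedback and the informational message to the metalinguistic condition; the item-by-item and end-of-test conditions differ only in when such messages are displayed.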
1.3 Importance of the Current Study

The current study is relevant to SLA researchers, language teachers, and instructional designers for several reasons. Beginning with SLA researchers, the current study provides precise language for sorting out the varied definitions of immediate and delayed, as they are used in both SLA research and research in related fields. The terms item by item and end of test are defined above for use within this dissertation and beyond. In addition, SLA researchers can gain further information by using conditional probabilities to analyze data, breaking it down into categories that reveal whether correct responses have been reinforced and whether errors have been corrected. This is commonly done in educational psychology research, but has not yet gained a foothold in SLA research. Finally, researchers will be interested in the evidence provided in the following study in support of Doughty's (2001) theory of a cognitive processing window, within which feedback is most usefully provided, and in support of an SLA attention-based theory (e.g., Gass, 1997; Pica, 1994; Schmidt, 1995).

Teachers and instructional designers will be interested in the results of the current study for other reasons. First, the results provide a starting point for determining when to more effectively and efficiently provide which type of feedback to learners. In addition, the results may help teachers make an informed decision about which commercial CALL products will be most useful to their students.

1.4 Overview

The rest of this dissertation is organized as follows. Chapter 2 explains the theoretical background for the current study and gives an overview of relevant literature in CALL, SLA, language assessment, other language-related fields, and educational psychology. The research questions and predictions for the current study are at the end of Chapter 2. Chapter 3 provides details on the method that I used to investigate the research questions, including the participants, materials, procedure, and analysis. Chapter 4 presents the results, followed by a discussion in Chapter 5. The conclusion is in Chapter 6, followed by the appendices and references.

CHAPTER 2: THEORETICAL BACKGROUND AND LITERATURE REVIEW

Feedback timing is an under-researched area in SLA, perhaps because few theories explicitly or implicitly bear on the question. Therefore, I begin the theoretical discussion by considering which theories address this topic generally, without a concern for the applicability of the theories to the current study. I then look more closely at the interaction approach, which gives a psycholinguistic account for the superiority of immediate feedback. It is this psycholinguistic account, rather than the interaction approach itself, that I will apply to the research study described below. After the SLA theoretical background, I review the language-related empirical research on feedback timing. Next, I examine selected theories from educational psychology that make predictions about the relative effectiveness of immediate and delayed feedback, and I then give an overview of the related research in that field. After addressing the topic of feedback timing, I move to a discussion of the effectiveness of feedback with and without metalinguistic information. I present the theoretical position of the interaction approach, then language-related research on this topic.
Although educational psychology theories do not directly address the topic of metalinguistic feedback, I review theories and research related to informational feedback. Finally, I present the research questions and predictions for the current study.

2.1 SLA Theories and Feedback Timing

I temporarily put aside the problem of whether various SLA theories apply to a situation in which a human answers multiple-choice questions and the computer provides feedback. For the moment, I focus on what the theories can contribute to a general discussion of feedback timing. In fact, most mainstream SLA theories have little to say about when feedback should be provided. Feedback (i.e., negative evidence) plays little, if any, role in theories based on Universal Grammar (e.g., Cook, 1989; Schwartz & Gubala-Ryzak, 1992; but cf. White, 1991). In theories in which feedback may play a limited role, such as processability theory (e.g., Pienemann, 1998), the ideal timing of feedback in relation to the error is not specified by the theory, either explicitly or implicitly. Similarly, in purely usage-based approaches (e.g., N. Ellis, 2006, 2008), feedback may play a limited role (without an ideal feedback timing specified by the theory), although Ellis incorporates ideas from the interaction approach to acknowledge feedback and focus-on-form as a means of drawing learners’ attention to features of language. I discuss the interaction approach itself below. In skill acquisition theories (e.g., DeKeyser, 1997, 2007), the requirement for feedback to be timely is implicit, but no mention is made of why timeliness is important or how to define it (but see Hartshorn et al., 2010, for 24-hour-delayed feedback as one interpretation of timely).

In sociocultural theory, feedback is adjusted online (i.e., during interaction) to the needs of the individual learner (Aljaafreh & Lantolf, 1994, p. 466), but this may not imply that feedback immediately follows an error. Rather, researchers have variously operationalized feedback as provided immediately after an error is made, as in the Computerized Dynamic Assessment of Language Proficiency (http://calper.la.psu.edu), delayed until the completion of an oral narrative (e.g., Lantolf, 2008; Poehner, 2008), and delayed longer, as in individual tutoring on a piece of writing that has been completed at an unspecified earlier time (e.g., Aljaafreh & Lantolf, 1994).[1] Thus, SLA researchers have not interpreted sociocultural theory as specifying a particular feedback timing as better than another.

[1] Arguably, some of the feedback provided in the study by Aljaafreh and Lantolf (1994) came immediately following failed attempts to correct errors. However, from another perspective, the initial errors were made during the writing, not during the later process of reading aloud.

The interaction approach is the SLA theory that most clearly and explicitly specifies the ideal feedback timing. In this approach, negotiation for meaning during a conversation between an L1 and L2 speaker causes interactional adjustments to the speech of the L1 speaker, facilitating language learning on the part of the L2 speaker (Long, 1996). This approach has been formulated in terms of language use in conversation and posits a role for immediate feedback on errors (e.g., Gass, 2010b; Long, 1996). Doughty (2001), for example, wrote the following.
    If the verbatim format of recent speech remains activated in memory and available for use in subsequent utterance formulation, this can be taken to be an important cognitive underpinning for facilitating the opportunity to make cognitive comparisons. With regard to the timing of the information to be compared, the most efficient means to promoting cognitive comparison would seem to be provision of immediately contingent recasts. (p. 253; emphasis added)

I interpret immediately contingent recasts in a conversational setting as analogous in timing to item-by-item feedback on a multiple-choice test.

The reasoning behind the interaction approach, as described above, is similar to that of the direct contrast hypothesis, which states that in child first language acquisition, negative evidence, typically provided as recasts by adults, allows a child to retreat from overgeneralizations in his or her developing grammar (e.g., Saxton, Backley, & Gallaway, 2005; Saxton, 1997). According to this hypothesis, the immediate contingency of the recast on the error is what leads to the effectiveness of recasts. This hypothesis might also be applicable to L2 learning, with the prediction that recasts will be more effective for L2 speakers than models of accurate speech provided at a time somewhat removed from an error. However, after adapting the hypothesis to L2 learning, it is equivalent to the interaction approach in terms of its predictions for the most effective feedback timing. In addition, no clear psycholinguistic basis has been proposed for the direct contrast hypothesis, which makes it difficult to argue for its applicability in the current study. Therefore, I do not consider it further here.

Now that I have established that the interaction approach is the SLA theory that most directly deals with the topic of feedback timing, I turn to the question of its applicability to the current study. This approach has been extended to computer-mediated interaction between learners and expert language users, such as video and audio chats (e.g., Yanguas, 2010) and text chats (e.g., Lee, 2008). Moving even further from the approach’s origins in face-to-face conversational interaction, Heift (2004) applied the approach to the interaction between a learner, who inputs a sentence into a computer, and a natural-language processing program, which outputs metalinguistic feedback. Indeed, Chapelle (2003) has proposed that the interaction approach could be extended to the input-enhancing interaction that occurs when learners interact with computers. She gave the example of a learner clicking a hyperlinked word to get a definition when reading a passage. From this perspective, the interaction approach may also be extended to the interaction between a human and a computer in the context of feedback on multiple-choice questions, as explained below.

Consider this scenario: A learner reads a sentence with a blank on a computer screen. From a list of answer choices, he or she drags and drops a phrase into the blank. After clicking a “submit” button, the learner is presented with feedback that indicates (a) that the selected response was incorrect, (b) the correct answer, embedded in the original sentence, and (c) a rule explaining why it is correct. Part (a) of the feedback serves as an explicit indication that the learner has made an error. Part (b) provides information equivalent to that in a recast in spoken interaction, and Part (c) is metalinguistic information.
Certainly this type of feedback qualifies as enhanced input. Therefore, one might argue that the interaction approach can be applied to this situation. Admittedly, the argument above goes quite far afield from the origin of the interaction approach. However, one aspect of the approach in particular may be more readily extendable.

I next examine the basis within the interaction approach for the claim that immediate feedback is superior to delayed feedback. This claim is based on the idea of attention to contrast. That is, a language learner who produces a nontargetlike utterance will contrast his or her utterance with a targetlike recast that another speaker produces in response (e.g., Goo & Mackey, 2013; Long, 2007). According to Gass (2010),

    Attention alone is not sufficient. A contrast must be attended to, or in SLA parlance, a gap must be noticed. And conversation provides a forum for the contrast to be detected, especially when the erroneous form and a correct one are in immediate juxtaposition. (p. 230)

I argue that multiple-choice questions with correct answers given as feedback also provide a forum for the contrast to be detected, especially when the feedback is provided on an item-by-item basis. Similarly, Long (1996) stated that “[n]egative feedback of this type (i.e., in the form of implicit correction immediately following an ungrammatical learner utterance) is potentially of special utility because it occurs at a moment in a conversation when the NNS is likely to be attending to see if a message got across, and to assess its effect on the interlocutor” (p. 429). Although Long referred to a spoken conversation, it follows that when negative feedback comes immediately after a learner responds to a multiple-choice question, the learner is likely to be attending to the feedback to find out whether he or she answered the question correctly. Conversely, when the negative feedback comes at a later time, the learner is less likely to be attending.

Of course, conversation typically occurs in a face-to-face setting between two (or more) people, and a multiple-choice computer-based test is taken by one person looking at a computer screen. In addition, feedback provided in the two contexts may be perceived differently by learners for affective reasons. For example, learners may perceive the feedback provided by a computer as less face-threatening than that provided by another person. These differences are potentially important in determining how attention is focused. However, little research exists to clarify how attention to contrast in a written, computer-based format may differ from attention to contrast in spoken interaction. Therefore, I proceed here under the tentative assumption that item-by-item negative feedback on a multiple-choice test on a computer screen will be more effective in drawing a learner’s attention to contrast than feedback that comes later. Another potentially important difference between these contexts is that learners typically focus on meaning during conversation, while they may be more focused on form when they are taking a multiple-choice test. This may prime learners to be more attentive to contrasts in a computer-based test than during conversation.

The attention to contrast described by both Long (1996) and Gass (2010) has a psycholinguistic basis in what Doughty (2001) termed the cognitive processing window.
According to Doughty (2001), negative feedback must be immediate to be effective because learners’ ability to perform cognitive comparisons is limited by working memory (p. 225). Negative feedback that focuses on form can most usefully be provided within this cognitive window, which may last up to 40 seconds (p. 227). On the other hand, Doughty and Long (2003, p. 65) more recently claimed that this window is not well understood and that it may not even depend on working memory, implying that the previous estimate of the duration of the window may not be accurate. However, given no alternative hypothesis as to why the window should be limited, I proceed here under the assumption that working memory constrains the window.

While Doughty (2001) cites the working memory model of Cowan (1995), the model of Baddeley (2003) is also widely used in psycholinguistic research, and various other models exist. When considering the capacity of working memory, which model is referenced may make little difference. Here, it suffices to say that, according to Baddeley (2003), working memory is “a limited capacity system, which temporarily maintains and stores information, supports human thought processes by providing an interface between perception, long-term memory and action” (p. 829). Similarly, according to Cowan (2005), working memory is “the set of processes that hold a limited amount of information in a readily accessible state for use in an active task” (p. 39). Note that in both definitions, the capacity of working memory is limited. Information (in this case, an error made by a learner) activated in working memory is likely to remain activated until item-by-item feedback is provided. Information is unlikely to remain activated until end-of-test feedback is provided, not only because of the time delay, but also because of the intervention of other items. Of course, in the end-of-test feedback condition, upon receiving feedback, a learner could reactivate an error in working memory by repeating the processing that occurred when he or she initially saw the item, but this is not guaranteed to occur.

The preceding discussion of feedback has focused on errors. However, receiving feedback on an error is only one of two possible outcomes for a learner answering a multiple-choice question. The other possibility is that the learner answers the question correctly and receives feedback that reinforces the correctness of the response. In this case, when the learner cognitively compares his or her response to the provided correct response, no difference is found. Although the literature on the cognitive processing window does not directly address this case, the theory can be logically extended to predict that the cognitive comparison in this case should also be more effective when the feedback comes within the cognitive processing window, as limited by working memory. Effective should be understood here to mean that the learner continues to answer the item correctly in the future, rather than (erroneously) changing his or her answer, in the case of item learning, and to mean that the learner correctly answers novel items that follow the same rule, in the case of system learning.

2.2 Language Research on Feedback Timing

In the following sections, I summarize the previous research on feedback timing in SLA, CALL, language assessment, and other language-related areas.
A summary is shown in Table 1.

Table 1: Summary of Language Learning Studies on Feedback Timing

Studies | More effective feedback timing
Brosvic, Epstein, Dihoff, and Cook (2006a); Brosvic, Epstein, Dihoff, and Cook (2006b); Opitz, Ferdinand, and Mecklinger (2011); Schroth and Lund (1993) | IBI more effective than delay shorter than EOT.
Nagata (1996) | IBI more effective than EOT.
Schroth and Lund (1993) | Delay shorter than EOT more effective than IBI.
Aubrey and Shintani (2014) | Delay shorter than EOT more effective than delay longer than EOT.
Goda (2004); Henshaw (2011); Quinn (2013); Sheen (2012) | No difference between IBI and EOT.
Lavolette, Polio, and Kahng (2013) | No difference between EOT and longer delay.
Lai, Fei, and Roots (2008); Sakai (2004) | IBI recasts noticed more often than EOT recasts/models.
Note. IBI = item by item; EOT = end of test.

2.2.1 SLA research

Within SLA, most researchers have found no difference between item-by-item and end-of-test feedback. To study the effects of oral feedback timing, Sheen (2012) had adult ESL students perform a narration task using the past tense, and she provided explicit, metalinguistic feedback. In one condition, the feedback was provided immediately after a student made an error (analogous to item-by-item feedback), while in the other condition, the feedback was delayed until the end of the task (analogous to end-of-test feedback). No significant differences were found between the feedback groups on either a posttest or a delayed posttest. As Sheen noted, the end-of-test feedback took more time to provide than the item-by-item feedback, making the conditions difficult to directly compare. Quinn (2013) similarly tested the effects of providing item-by-item and end-of-test oral feedback on oral tasks, in this case, related to English passive constructions. He found no significant difference between the feedback conditions, possibly because the learners in the delayed group were asked to repeat the task in which the error occurred before feedback was provided, making the feedback essentially item-by-item.

Without looking at the effectiveness of feedback on learning, SLA researchers have also examined the effects of the immediate contingency of recasts on the noticing of a gap between the learner’s production and the recast. Lai, Fei, and Roots (2008) looked at the effects of timing on ESL learners’ noticing of recasts in typed CMC interactions in a laboratory setting. Noticing was measured using the learners’ reports in stimulated recalls or think-aloud protocols. The researchers found that recasts were noticed more often when they immediately followed an error or were separated from the error by only material not related to the content, as compared to recasts that were separated from the relevant error by additional content. Sakai (2004) also looked at the effects of contingency on noticing, but in spoken interaction in a laboratory setting. He compared item-by-item recasts and delayed “models,” which were similar to recasts, but provided a few minutes later. For the EFL learners in the study, the item-by-item recasts were more effective for noticing, as measured using a stimulated recall.

Loewen (2004) examined the timing of recasts in relation to uptake, or the response of a learner to a recast. One reason to examine uptake is that it may be an indication of noticing the gap between an erroneous and correct form.
Loewen did not define the term immediate in the study, but I surmise from the examples given that recasts were immediate if they were produced in the first turn following an erroneous utterance by a learner, analogous to item-by-item feedback. Other recasts were classified as delayed, or in Loewen’s terminology, deferred. The results showed that for the adult ESL learners in the study, item-by-item recasts were five times as likely to be followed by uptake as recasts produced during later turns. However, as Loewen pointed out, the learners were often given no opportunity for uptake when the recasts were delayed.

2.2.2 CALL research

To my knowledge, only six studies have examined the effect of feedback timing on language learning in a CALL setting (Aubrey & Shintani, 2014; Dabrowski, LeLoup, & MacDonald, 2013; Goda, 2004; Henshaw, 2011; Lavolette et al., 2013; Nagata, 1996). Two of them found significant differences based on feedback timing. The first study that revealed a difference was that of Nagata (1996), who did not design the study with the intention of examining a difference based on feedback timing. Rather, she was interested in the difference between CALL and non-CALL teaching practices. Therefore, the feedback timing factor is confounded with how feedback was provided, preventing any firm conclusions. In Nagata’s study, half of the undergraduate participants received item-by-item feedback from a computer program on their usage of Japanese particles, while the other half received similar (although less individualized) feedback after completing the entire exercise on paper. Both groups participated in the study during their normal class period. The results of immediate and delayed posttests showed that the computerized item-by-item feedback was significantly more effective than the paper-based end-of-test feedback.

Second, Aubrey and Shintani (2014) found that feedback provided at a longer delay than item-by-item feedback (but shorter than end-of-test) was more effective than feedback provided at a delay longer than end-of-test. In their study, English learners in Japan wrote essays using Google Docs, an application that allows multiple people to synchronously edit a document. The researchers provided feedback by inserting comments into the margin of the document, providing only the correct form, and only targeting hypothetical conditionals. One group (synchronous feedback) received feedback while they were writing, generally after they had finished writing the targeted sentence and before they had finished writing the following sentence. A second group (asynchronous feedback) received feedback a few minutes after they finished writing the entire essay. A third group (control group) received no feedback. Both of the treatment groups performed significantly better than the control group on the immediate posttest, but only the synchronous feedback group outperformed the control group on the delayed posttest.

Four CALL studies showed no difference between various feedback timings. Goda (2004) found no effect of differing feedback timings (item-by-item and end-of-test) on EFL students’ scores on TOEFL structure questions. Two test versions were used, with the questions on the treatment test being different from those on the two posttests. However, the questions on the treatment were randomly chosen, so the structures on them may not have been relevant to the structures on the posttest.
Dabrowski, LeLoup, and MacDonald (2013) looked at the effects of instructor feedback, provided to one group after a delay, and computer feedback, provided immediately to the other group. Note the confound between the variable of immediate/delayed feedback and that of computer/instructor feedback, preventing conclusions about the relative efficacy of the immediate or delayed feedback in this study. The computer feedback was provided using My Spanish Lab, but it is not clear whether the feedback was item-by-item or end-of-test (which prevents the inclusion of this study in Table 1). No differences were found in the chapter test scores for the two groups. In L2 writing, Lavolette, Polio, and Kahng (2013) examined the effects of feedback timing using Criterion, a program provided by ETS that uses natural language processing to give feedback on ESL students’ TOEFL-style essays. One group received feedback immediately upon completing an essay (analogous to end-of-test feedback), while the other received the feedback a week later. The timing of the feedback did not affect the students’ responses to the feedback.

The CALL study that is most similar to the current study is that of Henshaw (2011). Using processing instruction, Henshaw examined the effects of feedback timing on English native speakers’ learning of the Spanish subjunctive. The learners in her study were first screened for previous knowledge of the subjunctive using a pretest, then received explicit instruction on this grammar structure. Next, they answered multiple-choice questions testing their recognition and interpretation of the subjunctive, with one group receiving item-by-item feedback, a second group receiving end-of-test feedback, a third group receiving feedback 24 hours after taking the test, and a fourth group receiving no feedback. The feedback was an indication of whether they had answered correctly or incorrectly plus a metalinguistic explanation. A week later, the students took a posttest that included the items that the students had previously seen and new items. No significant differences on old or new items were found among the groups who received feedback, although all feedback groups outperformed the no-feedback group.

2.2.3 Language assessment research

Much of the literature on assessment involving feedback is far removed from the context of the current study. In fact, little empirical research has addressed the question of whether assessments that provide feedback can affect student learning. Of the various types of assessments, three major types provide learners feedback on their responses: formative, dynamic, and diagnostic assessment. To my knowledge, only investigations of diagnostic assessment have considered the timing of the feedback, all of which were reported in a monograph by Alderson (2005).

Alderson (2005) listed some suggested characteristics of diagnostic tests, including that “[d]iagnostic tests provide immediate results, or results as little delayed as possible after test-taking” (p. 11). However, he provides little support for this suggestion. He also reported on two unpublished studies (Floropoulou, 2002; Yang, 2003) that asked users about their preferences for receiving item-by-item feedback on an early version of DIALANG (http://www.lancaster.ac.uk/researchenterprise/dialang/about), which is a diagnostic test of language proficiency for European languages. Alderson does not mention which languages were tested in either study.
All users got end-of-test feedback in the form of a review, and the users had the option of leaving on a default option of item-by-item feedback or turning it off. In both studies, the users’ behavior varied as to whether they chose item-by-item feedback. Floropoulou (2002) asked six users about their reasons for their feedback timing preferences. She quoted two users who preferred no item-by-item feedback. One preferred no item-by-item feedback because getting the feedback during the test might influence his or her thinking about the test. The other believed that it allowed him or her to avoid the encouraging effect of getting answers right and the discouraging effect of getting them wrong. One user who preferred item-by-item feedback was also quoted, mentioning only that it is good to know immediately whether your answers are right or wrong.

The second relevant study that Alderson (2005) reported on was an unpublished MA thesis by Yang (2003), who studied 13 users of DIALANG. The users’ behavior was somewhat different from that in Floropoulou’s (2002) study. While four of Yang’s users chose to receive the item-by-item feedback throughout the test and six chose not to get it on any items, two users turned off item-by-item feedback after feeling discouraged due to answering the first four items wrong, and one student selectively turned on item-by-item feedback for items that she was not confident of having answered correctly. Those who always got item-by-item feedback chose it because they wanted to know how they were doing. Those who never got item-by-item feedback chose against it because they wanted to finish the test quickly and because knowing that they got questions wrong would be demotivating. In addition, “one student said that, since she could not change the answer if it was wrong, she saw no point in using immediate feedback” (Alderson, 2005, p. 215). All users indicated that they found the end-of-test review useful.

Overall, the assessment literature includes little mention of the issue of feedback timing, and even less of the effectiveness of one feedback timing compared to another. One reason for this is that the purpose of assessments is most often to assess learning, rather than to promote it. Another reason may be the potential for item-by-item feedback on items at the beginning of a test to affect students’ answers later on the test. However, this concern would not prevent an investigation into the differences between end-of-test feedback and feedback provided at a longer delay.

2.2.4 Other language-related research

Because little SLA, CALL, or language assessment research has looked at the relative effectiveness of varied feedback timing, as reviewed above, I also review here studies of vocabulary and artificial grammar learning that were undertaken from non-SLA perspectives. The studies of language learning that have examined the variable of timing are summarized in Table 1.

Several groups of researchers working from non-SLA perspectives have studied language learning. Some findings showed an advantage for item-by-item feedback over feedback with a delay shorter than end-of-test. For example, in neuroscience, Opitz, Ferdinand, and Mecklinger (2011) found that participants who received item-by-item feedback while learning an artificial grammar responded correctly to significantly more items than participants who received item-by-item feedback delayed by 1 second.
This is interesting in light of the fact that a 1-second delay is likely to still be within the cognitive processing window. In educational psychology, two studies that looked at students learning form-meaning mappings found that item-by-item feedback was more effective than end-of-test feedback or longer feedback delays. In a laboratory, Brosvic, Epstein, Dihoff, and Cook (2006a) tested undergraduate students on the definitions of Esperanto words. The treatments varied in the timing of feedback (none, item-by-item, end-of-test, or delayed by 24 hours) and whether the student could select another response if the first selection was incorrect. The tests and feedback were provided using special paper forms, which had a coating over the answer choices that was removed by the participants to reveal the feedback for the chosen response, or feedback was provided by an assistant who held up an index card. The students who received item-by-item feedback outperformed the students in the other groups on all posttest measures, with the students who could select multiple responses outperforming those who could only select one. A similar study that manipulated the length of delay and the number of questions answered until feedback was provided had similar results: namely, that receiving item-by-item feedback was most effective (Brosvic et al., 2006b).

A study by Schroth and Lund (1993) produced mixed results: some supported item-by-item feedback as more effective than a delay shorter than end-of-test, while others supported a delay shorter than end-of-test as more effective than item-by-item feedback. In a laboratory, undergraduate students learned an artificial grammar that consisted of patterns of letters that followed simple rules. The materials were presented on paper cards, and the experimenter provided the feedback verbally. The results showed that the participants who received item-by-item feedback learned the patterns more quickly than the participants who received feedback delayed by 10, 20, or 30 seconds. However, the participants who received the 30-second delayed feedback were more accurate at transferring what they had learned to a new task on both an immediate and a delayed posttest. Note, however, that the results are questionable, given that the participants in the item-by-item feedback group reached criterion on the training task more quickly than the participants in the delayed groups, resulting in fewer practice trials.

2.3 Educational Psychology Theories of Feedback Timing

Despite the dearth of SLA theories that address feedback timing, many educational psychology theories have been proposed to explain why immediate or delayed feedback is superior for learning. Perhaps most famously, behaviorist psychology contends that feedback (or reinforcement, in behaviorist terms) must be immediately contingent upon a response to have a learning (or conditioning) effect. In fact, in a discussion of the effects of a teacher's feedback on a student's learning of mathematics, Skinner (1968) wrote, "It can easily be demonstrated that, unless explicit mediating behavior has been set up, the lapse of only a few seconds between response and reinforcement destroys most of the effect" (p. 16). The early work of behavioral psychologists like Skinner continues today in the form of behavior analysis. Here, I specifically consider relational frame theory (e.g., Hayes, Barnes-Holmes, & Roche, 2001).
This theory extends behavioral principles to verbal behavior, based on the human ability to create links or relations between stimuli. However, a major change from Skinner's theory is that feedback need not be immediate to be effective (Barnes, 1996). Thus, from a behavior-analytic perspective, it is not clear whether item-by-item feedback is predicted to be more effective than end-of-test feedback for either reinforcement of correct responses or error correction, and I do not investigate this type of theory any further. Next, I concentrate on three theories that predict that delayed feedback is superior: the interference-perseveration hypothesis (Kulhavy & Anderson, 1972), the dual-trace hypothesis (Kulik & Kulik, 1988), and an attention-based account (Phye & Andre, 1989).

To understand the predictions of the three theories, it is helpful to keep in mind the context for which they were developed. This context prototypically has two phases: learning and testing. In the learning phase, the participant answers questions and gets feedback of some sort. In the testing phase, the participant answers the same questions that were presented in the learning phase but does not get feedback. In addition, the three hypotheses are only designed to deal with item learning, not with system learning. That is, the hypotheses predict learning when the items are the same in the learning and testing phases. They do not make predictions about learning the rules associated with the treatment items and being able to extend those rules to new items on a posttest. Finally, note that the terms immediate and delayed have been interpreted in various ways in the literature, so it is not possible to be more precise here.

The interference-perseveration hypothesis (Clariana, Wagner, & Roher Murphy, 2000; Kulhavy & Anderson, 1972; Smith & Kimball, 2010) predicts differing results for immediate and delayed feedback based on whether a test-taker initially responds correctly or incorrectly. The hypothesis indicates that delayed feedback is superior to immediate feedback for the correction of errors because the time that passes between the incorrect response and the feedback during the learning phase allows the memory trace of the incorrect response to fade. Then, the memory trace of the correct response (provided in the feedback) is stronger than that of the incorrect response. This results in a greater number of correct responses during the testing phase. In the case of a correct response during the learning phase, the interference-perseveration hypothesis predicts that immediate and delayed feedback will produce similar results for reinforcement of the correct response, as measured during the testing phase. That is, because the participant responded correctly during the learning phase, no incorrect response exists or needs to be forgotten; therefore, immediate and delayed feedback are predicted to have similar effects. Interestingly, this hypothesis and the cognitive processing window lead to exactly opposite predictions for the effectiveness of item-by-item feedback based on similar ideas about the limited capacity of memory.

The dual-trace hypothesis (Clariana et al., 2000; Glover, 1989; Kulik & Kulik, 1988; Rankin & Trepper, 1978) indicates that delayed feedback is more effective because it gives the participant two encoding opportunities. In other words, each encounter with an item provides one encoding opportunity.
If the feedback is immediate, the question and the feedback are fused into one encoding opportunity, but if the feedback is delayed, the feedback acts as a second encoding opportunity. Thus, the dual-trace hypothesis predicts that delayed feedback is superior for both the correction of errors and the reinforcement of correct responses because of the additional encoding opportunity that delayed feedback provides.

A final account for why delayed feedback may be superior to immediate feedback for item learning is based on attention. When students are presented with feedback in a normal classroom situation, they choose whether and how long to attend to feedback on a given item. According to the attention-based account (Kulhavy & Anderson, 1972; Phye & Andre, 1989), learners pay more attention to and therefore spend more time studying 24-hour-delayed feedback compared to end-of-test feedback, making the 24-hour-delayed feedback more effective. However, it is not clear what this theory suggests for item-by-item feedback compared to end-of-test feedback, so I will not consider it further.

2.4 Educational Psychology Research on Feedback Timing

In educational psychology and related fields, the study of feedback timing has a history of nearly a century. A summary of the results of relevant meta-analyses is shown in Table 2, and a summary of the results of individual studies is shown in Table 3. Patterns are difficult to distill from the varied findings, but evidence is available to support both item-by-item and end-of-test as the more effective feedback timing for item learning.

2.4.1 Review articles

Researchers have performed at least four meta-analyses and one other review article that examined the variable of timing. However, the results of these studies (with the exception of that of Bangert-Drowns, Kulik, Kulik, & Morgan, 1991) should be interpreted with caution because the meta-analysts adopted each individual study's own definitions of immediate and delayed, which are not consistent from one study to another. This makes the results of the meta-analyses problematic at best. With that caveat in mind, I briefly review the results of the four meta-analyses and one other review article.

Table 2: Summary of Educational Psychology Review Articles on Feedback Timing (Study: More effective feedback timing)
Kulik and Kulik (1988): Most classroom studies found IBI or EOT feedback more effective than feedback that was provided later. Most laboratory studies found later feedback more effective than IBI or EOT feedback.
Kulik and Kulik (1988): 16 studies: more effective learning with IBI feedback. 11 studies: more effective learning with EOT or IBI feedback delayed by seconds.
Bangert-Drowns, Kulik, Kulik, and Morgan (1991): EOT had larger effect size than longer delay; longer delay had larger effect size than IBI.
Azevedo and Bernard (1995): Positive effect for immediate over delayed in computer-aided instruction.
Hattie and Timperley (2007): 5 meta-analyses: delayed more effective; 8 meta-analyses: immediate more effective.
Jaehnig and Miller (2007): Both IBI and feedback provided at various delays are effective in programmed instruction.
Note. IBI = item by item; EOT = end of test.

Table 3: Summary of Individual Educational Psychology and Other Studies on Feedback Timing (More effective feedback timing: Studies)
Delay longer than EOT more effective than EOT: English and Kinzer (1966); Kulhavy and Anderson (1972); Metcalfe et al. (2009), Experiment 1; Surber and Anderson (1975); Webb, Stock, and McCarthy (1994), Experiment 1.
Delay longer than EOT more effective than IBI: Butler et al. (2007); King, Young, and Behnke (2000).
Delay longer than EOT more effective than delay shorter than EOT; EOT more effective than delay shorter than EOT: Sturges (1978).
EOT more effective than IBI: Guzmán-Muñoz and Johnson (2008); Rankin and Trepper (1978); Schooler and Anderson (1990), Experiments 2 and 3.
Delay shorter than EOT more effective than IBI: Rankin and Trepper (1978); Schroth (1992); Schroth (1995); Smith and Kimball (2010).
IBI more effective than EOT: Brosvic and Epstein (2007); Dihoff, Brosvic, Epstein, and Cook (2004); Lin, Lai, and Chuang (2013).
IBI more effective than delay longer than EOT: Brosvic and Epstein (2007); Dihoff, Brosvic, Epstein, and Cook (2004); King, Young, and Behnke (2000).
IBI more effective than delay shorter than EOT: Schroth (1992); Schroth (1995).
No difference between IBI and EOT/various delays: Clariana et al. (2000); El Saadawi et al. (2008); Gaynor (1981); Lewis and Anderson (1985), Experiments 2 and 3; Schooler and Anderson (1990), Experiment 1; Smith and Kimball (2010); Surber and Anderson (1975); Van der Kleij, Eggen, Timmers, and Veldkamp (2012).
No difference between EOT and longer delay: Metcalfe et al. (2009), Experiment 2; Webb et al. (1994), Experiment 2.
Note. IBI = item by item; EOT = end of test.

In a classic meta-analysis of the effects of delayed and immediate feedback, Kulik and Kulik (1988) reviewed both classroom and laboratory studies on the learning of test content in a variety of subjects (e.g., chemistry, psychology, and math). Of the 11 classroom studies, 9 found that immediate feedback was significantly more effective than feedback that was provided later, either at the end of the test or a day to a week later, based on the amount of the materials that students retained during the original period of learning. In 13 of the 14 laboratory studies, the opposite result was found: The participants who received delayed feedback performed better than the participants who received immediate feedback. Note, however, that both Butler, Karpicke, and Roediger (2007) and Metcalfe, Kornell, and Finn (2009) claimed that the classroom versus laboratory distinction made by Kulik and Kulik (1988) was not the reason for the different findings, arguing instead that learner attention to the feedback was the key difference.

Several other meta-analyses have also examined feedback timing. Looking exclusively at computer-aided instruction, Azevedo and Bernard (1995) meta-analyzed 22 studies and found a positive effect for immediate over delayed feedback. In a meta-analysis of meta-analyses, Hattie and Timperley (2007) looked at the results of 74 meta-analyses of the effects of feedback on learning and found seemingly contradictory results regarding immediate versus delayed feedback. Five meta-analyses found delayed feedback more effective, with an effect size of 0.34, while eight meta-analyses found immediate feedback more effective, with an effect size of 0.24. However, as the authors argued, there is a key difference between the conditions under which immediate and delayed feedback were beneficial: immediate feedback was beneficial for easier items, whereas delayed feedback was beneficial for more difficult items. The reason for this may be that more difficult items take longer to process, making the extra time before feedback is provided useful for learning, while this extra time is unhelpful for easier items.
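Because effect sizes of this kind recur throughout the reviews summarized here, a brief formal note may be useful. The values cited are standardized mean differences; a minimal formulation, assuming a simple two-group (treatment vs. comparison) design rather than the specific pooling procedures of any one meta-analysis, is

\[
d = \frac{\bar{X}_{1} - \bar{X}_{2}}{s_{\text{pooled}}},
\qquad
s_{\text{pooled}} = \sqrt{\frac{(n_{1} - 1)\,s_{1}^{2} + (n_{2} - 1)\,s_{2}^{2}}{n_{1} + n_{2} - 2}}
\]

where the group means (e.g., the posttest means of two feedback-timing conditions) appear in the numerator, s_1 and s_2 are the group standard deviations, and n_1 and n_2 are the group sample sizes. Individual meta-analyses may weight or combine these estimates differently, so this formula is offered only as a point of reference for interpreting the values reported above and in Table 2.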
Bangert-Drowns et al. (1991) conducted a meta-analysis in which they specified that they analyzed end-of-test and item-by-item feedback. Based on 40 studies, they found that end-of-test feedback produced larger effect sizes than longer delays and that longer delays had larger effect sizes than item-by-item feedback.

Finally, Jaehnig and Miller (2007) reviewed feedback types in programmed instruction. According to the authors, programmed instruction is a type of instruction grounded in behaviorist psychology in which a stimulus is presented to a learner, who has an opportunity to respond. The response is followed by feedback on the correctness of the response. Note that this definition applies to many studies in CALL and SLA as well as educational psychology, and indeed, studies such as those of Nagata (1993) and Rosa and Leow (2004) are included in the review. Although feedback in programmed instruction is defined as immediate (i.e., item-by-item), studies within the review examined various delays. While the authors did not perform a meta-analysis, they concluded that both item-by-item feedback and feedback provided at various delays are effective, with the caveat that the effects of delaying elaborated feedback (i.e., feedback that includes information beyond knowledge of the correct response) have not been fully explored.

2.4.2 Individual studies

The results of studies of feedback timing are varied, and the lack of systematicity in defining immediate and delayed makes the results challenging to synthesize. Further complicating the issue is the fact that an individual study may contain multiple experiments with differing findings, or even a single experiment whose results suggest an advantage for different feedback timings, based on different independent and dependent variables. The studies summarized below have two factors in common. First, in nearly all of the studies, the participants were learning using their L1s. Second, the posttests generally included the same items as the treatments, so any learning demonstrated in the studies is item learning. Exceptions to these two generalizations are noted. A summary of all studies is shown in Table 3.

2.4.2.1 Delay longer than end-of-test more effective than end-of-test

In an older classroom study, English and Kinzer (1966) tested undergraduate students on the content of articles using a multiple-choice test completed on paper. There were four treatment groups: The first received end-of-test feedback, and the others received feedback delayed by 1 hour, 2 days, or 1 week. The results showed that the 1-hour and 2-day delays were superior to the other conditions.

Kulhavy and Anderson (1972) tested high school students, using multiple-choice questions completed on paper, after they had studied material about psychology; the students received end-of-test or 1-day delayed feedback. The authors found that the students who received the 1-day delayed feedback did significantly better on a delayed posttest (a week after the initial learning) than the students who received the end-of-test feedback. These results led to the original statement of the interference-perseveration hypothesis.

Metcalfe et al. (2009) studied the learning of L1 English vocabulary using one experiment with child participants and one with adult participants. In the first experiment, sixth-grade children played a computer game that taught them vocabulary. In the testing phase, a definition appeared, and the learner typed the corresponding word.
The children received end-of-test feedback on some of the questions and 1- to 4-day delayed feedback on other questions. This experiment is noteworthy because the lag between the feedback and the test was controlled for half of the participants, while the lag between the questions and the feedback was controlled for the other half. For all participants, the 1- to 4-day delayed feedback was more effective than the end-of-test feedback. The second experiment was similar to the first, but the results showed no difference between the two feedback timings, so its results are described below.

Surber and Anderson (1975) tested high school students on their comprehension of an article using multiple-choice questions on paper, providing them end-of-test or 1-day delayed feedback. The 1-day delayed feedback was more effective for error correction (although there was no significant difference in the feedback timings for reinforcement of correct responses).

Webb, Stock, and McCarthy (1994) tested undergraduate students on general knowledge, such as history and geography, using multiple-choice questions in two experiments. Both were conducted in a laboratory setting on computers. Half of the participants received end-of-test feedback, and half of the participants received feedback a day later. In Experiment 1, the 1-day delayed feedback was more effective than the end-of-test feedback. However, note that the lag between the feedback and posttest was shorter for the 1-day delayed feedback than for the end-of-test feedback. In addition, the participants who received 1-day delayed feedback had a significantly higher proportion of errors that they corrected from the training to the posttest than did those who received end-of-test feedback. Experiment 2 was similar to Experiment 1, but with slightly different results, so it is described below.

2.4.2.2 Delay longer than end-of-test more effective than item-by-item

Butler et al. (2007) tested the effects of item-by-item versus 10-minute and 1-day delayed feedback on the performance of undergraduate students on a multiple-choice reading comprehension test on computers in a laboratory. The reading passages were taken from study guides for standardized tests. Based on a constructed-response posttest, the researchers found that the students answered more questions correctly when they had the 1-day delayed feedback.

King, Young, and Behnke (2000) looked at the effects of feedback timing on speech performance by undergraduates in a laboratory setting. Feedback was given to the participants during the speech performance (analogous to item-by-item feedback) or at a 1-day delay. The researchers found that the item-by-item feedback was significantly more effective than the 1-day delayed feedback in getting the participants to make more eye contact, which the researchers characterized as a task that required little processing. The 1-day delayed feedback was significantly more effective than the item-by-item feedback in getting the participants to lengthen their planned introductions, which the researchers characterized as a task that required more processing.

2.4.2.3 Delay longer than end-of-test more effective than delay shorter than end-of-test; end-of-test more effective than delay shorter than end-of-test

Sturges (1978) tested undergraduate students on the content of a psychology lecture using a computer. The test included both multiple-choice and short-answer items. Students received 2-second delayed, end-of-test, or 1-day delayed feedback.
A posttest given 1 to 3 weeks later showed that the end-of-test and 1-day delayed feedback were significantly more effective than the 2-second delay.

2.4.2.4 End-of-test more effective than item-by-item

In a study by Guzmán-Muñoz and Johnson (2008), L1 and L2 Dutch undergraduate students participated in a computerized laboratory experiment in which they dragged Dutch city names to their correct locations on a map. One group viewed a completed map while they performed the task, another group received feedback after placing each city (item-by-item), and a third group received feedback after placing all cities on the map (end-of-test). The learners in the end-of-test group showed the greatest gains on both immediate and 1-week delayed posttests. Note that this task requires visual and spatial skills, making it different from the more traditional test-like tasks in most of the other studies reviewed here. This may explain why end-of-test feedback was more effective.

Rankin and Trepper (1978) asked undergraduate and graduate students to complete a 10-item multiple-choice test on human sexuality on a computer, presumably in a laboratory. The participants were divided into groups that received item-by-item, 15-second delayed, and end-of-test feedback. A retention test was given 24 hours after the treatment. The 15-second delayed and end-of-test feedback were significantly more effective for retention of knowledge than was the item-by-item feedback. However, note that the participants did not take a pretest and that the total number of questions (10) was very small.

Schooler and Anderson (1990) conducted three experiments in which they taught novices how to use the LISP programming language. The results of the first experiment differed from those of Experiments 2 and 3, so Experiment 1 is described in a different section below. In Experiments 2 and 3, a LISP tutor program was used to instruct the participants and provide feedback either as they were typing their solutions to problems (item-by-item) or after they submitted a complete solution (end-of-test). Note that the program prevented the participants from continuing to write an incorrect solution in the item-by-item condition, while the program allowed them to continue writing following an error in the end-of-test condition. As a posttest, the participants completed new problems a day later, but without feedback from the tutor. In Experiments 2 and 3, the end-of-test feedback was more effective than the item-by-item feedback in terms of the number of errors that the participants made on the posttest and in the total time they required to complete the posttest.

2.4.2.5 Delay shorter than end-of-test more effective than item-by-item

As described above, Rankin and Trepper (1978) found that 15-second-delayed and end-of-test feedback were significantly more effective for retention of knowledge than was item-by-item feedback.

Schroth conducted two similar studies (1992, 1995) in a laboratory setting in which the experimenter presented undergraduate students with cards and asked them to decide whether or not they fit a concept. Because the students were not told the concept in advance, they had to determine the concept based on the feedback. In the first experiment, the feedback was provided item-by-item or at a 10-, 20-, or 30-second delay. Cards were presented until the student correctly responded to 9 out of 10 cards in a row or until 100 cards had been presented.
A week later, a transfer test was administered that used a new concept, with item-by-item feedback provided to all participants. The number of trials to criterion during the treatment was significantly lower for the item-by-item group than for any of the delay groups. Note that this means that the item-by-item group had less practice than the delay groups. In the retention test, the 30-second delay group reached criterion faster than any of the other groups.

In Schroth's 1995 study, two further experiments were conducted under conditions similar to those of the 1992 study. In the first experiment of the newer study, instead of the 20-second-delayed feedback, feedback was provided at a delay that randomly varied between 10 and 30 seconds. An immediate transfer test was added after the training, and in both the immediate and delayed transfer tests, the feedback was provided using the same timing as that during the treatment. The results showed that the item-by-item group needed significantly fewer trials to criterion on the treatment than the other groups. On both the immediate and delayed transfer tests, the varied-delay group was the fastest to criterion. The second experiment was similar, but item-by-item feedback was provided to all participants during the two transfer tests. The results of the trials to criterion on the treatment task were the same as those for the first experiment. The results of the immediate transfer test showed that the 30-second-delayed and variably delayed feedback groups achieved criterion significantly faster than the other two groups. On the delayed transfer test, the variably delayed feedback group was significantly faster than the other groups.

Smith and Kimball (2010) investigated the effects of varied feedback timing on participants' learning of trivia facts using a short-answer test in two experiments conducted on computers in a lab. In both experiments, undergraduate students received item-by-item or 8-minute delayed feedback. That is, feedback in the delayed condition was mixed in with the questions. A posttest was administered a week after the treatment. In the first experiment, the 8-minute delayed feedback was more effective than item-by-item feedback for reinforcement of correct responses, but there was no difference between the conditions for error correction.

2.4.2.6 Item-by-item more effective than end-of-test

Brosvic, Epstein, and colleagues conducted two studies that found that item-by-item feedback is superior to end-of-test feedback on multiple-choice tests presented on paper. Dihoff, Brosvic, Epstein, and Cook (2004) conducted a classroom experiment in which undergraduate students were provided with item-by-item, end-of-test, or 1-day delayed feedback. The group that received item-by-item feedback did significantly better on the posttest (2 weeks later) than the other groups. Brosvic and Epstein (2007) found that undergraduate students in introductory psychology courses performed better on several posttest measures of learning, including delayed posttests 3, 6, 9, and 12 months after treatment, when they had been trained using item-by-item feedback as opposed to end-of-test feedback, 1-day delayed feedback, or no feedback.

In a computer science class, Lin, Lai, and Chuang (2013) tested the effects of various feedback timings on learning to create database concept diagrams.
They developed a system that provided diagnostic feedback at each step of the process of solving a problem (item-by-item) and compared it to similar systems that provided diagnostic feedback at the end of the process (end-of-test-a) or only information on whether the solution was correct or incorrect at the end of the process (end-of-test-b). The students who received item-by-item feedback scored significantly higher on an immediate posttest than did the students in the other two groups. No delayed posttest was administered.

2.4.2.7 Item-by-item more effective than delay longer than end-of-test

Each study in this category is also included in a category above. Dihoff et al. (2004) conducted a classroom experiment using undergraduate student participants. The researchers found that a group that received item-by-item feedback did significantly better on the 2-week delayed posttest than the other groups, including the group that received 1-day delayed feedback. Similarly, Brosvic and Epstein (2007) studied undergraduate students in introductory psychology courses. The students who had been trained using item-by-item feedback performed better than the students who had been trained using 1-day delayed feedback on delayed posttests that were taken 3, 6, 9, and 12 months after treatment. King, Young, and Behnke (2000) looked at the effects of feedback timing on speech performance by undergraduates in a laboratory setting. They found that item-by-item feedback was significantly more effective than 1-day delayed feedback in getting the participants to make more eye contact.

2.4.2.8 Item-by-item more effective than delay shorter than end-of-test

Both of Schroth's studies (1992, 1995) summarized above found that item-by-item feedback was more effective than various delays shorter than end-of-test for one purpose: achieving criterion on a task in which the participants had to learn a sorting rule based on feedback from the researcher.

2.4.2.9 No difference between item-by-item and end-of-test/various delays

Clariana et al. (2000) tested a connectionist model of feedback timing using item-by-item and end-of-test feedback. Their participants were high school students who read passages of varied content and answered comprehension questions on computers. The researchers found no significant difference between item-by-item and end-of-test feedback for item learning. However, they found the interesting trend that item-by-item feedback was more effective with difficult items and end-of-test feedback was more effective with easy items, which is similar to Kulhavy and Anderson's (1972) claim that delayed feedback (operationalized as 1-day delayed) is more effective than immediate feedback (operationalized as end-of-test) for difficult tests.

El Saadawi et al. (2008) looked at the pathology reports written by medical residents who had been trained using a computer program in two conditions: the residents received feedback as they wrote (item-by-item) or after they had written the entire report (end-of-test). All of the residents showed significant improvement from pretest to posttest, but there were no significant differences between the two conditions.

In a study by Gaynor (1981), undergraduate students answered free-response questions about business statistics. A pretest was given on paper during class, then a four-lesson sequence was completed on a computer outside of class. Three types of feedback were provided during the lessons: item-by-item, 30-second delayed, and end-of-test.
A paper-based posttest was given during class, 24 hours after the end of the treatment period, and a paper-based delayed posttest was given in class a week later. No significant differences were found between any of the groups.

Lewis and Anderson (1985) conducted three experiments, but only Experiments 2 and 3 addressed feedback timing. In these experiments, participants played a computer game in which they typed instructions to move through rooms in a maze, with the goal of exiting it. They could choose various actions depending on what features they saw in the rooms, but only one action was correct (i.e., leading to progress out of the maze) in a given room. In Experiment 2, the participants received feedback immediately upon completing an action (item-by-item) or after completing the next action (i.e., after proceeding one room down the wrong path). In an immediate posttest, no significant difference was found between the two feedback timings. No delayed posttest was administered. Experiment 3 was similar to Experiment 2, except that the amount of practice with the computer game was doubled and spread out over two days. In the immediate posttest, the participants in the item-by-item feedback group significantly outperformed those in the 1-step delayed feedback group. Again, no delayed posttest was administered.

Schooler and Anderson (1990) conducted three experiments in which they taught novices how to use the LISP programming language. The results of Experiments 2 and 3 are summarized above, while the results of Experiment 1 fit the current category. As in Experiments 2 and 3, a LISP tutor program was used to instruct the participants and provide feedback either as they were typing their solutions to problems (item-by-item) or after they submitted a complete solution (end-of-test). Experiment 1 differed from the other two in that once a participant in the end-of-test condition submitted a solution and received feedback once, additional feedback on the same problem became item-by-item. No significant difference was found between the end-of-test and item-by-item feedback conditions in terms of the number of errors that the participants made on the posttest and in the total time they required to complete the posttest.

Smith and Kimball (2010) investigated the effects of feedback timing on participants' learning of trivia facts using a cued-response (short-answer) test in two experiments conducted on computers in a lab. The first experiment is summarized above. In the second experiment, as in the first, undergraduate students received item-by-item feedback or feedback delayed by 8 minutes. In the second experiment, the delay between the feedback and test was controlled. A posttest administered a week after the treatment showed no difference between the two feedback conditions.

As summarized above, Surber and Anderson (1975) tested high school students on their comprehension of an article using multiple-choice questions on paper, providing them end-of-test or 1-day delayed feedback. No difference was found in the feedback timings for reinforcement of correct responses (although the 1-day delayed feedback was more effective for error correction).

Van der Kleij, Eggen, Timmers, and Veldkamp (2012) investigated the effects of three computer-based feedback conditions on L1 adult speakers of Dutch who answered multiple-choice questions related to marketing (in Dutch).
The conditions were item-by-item knowledge of the correct response plus elaborated feedback (an explanation of how to arrive at the correct response), end-of-test knowledge of the correct response plus elaborated feedback, and end-of-test knowledge of the correct response. No significant differences were found between the groups on an immediate posttest, and no delayed posttest was administered. However, the participants in this study reported that the item-by-item feedback was most beneficial for their learning.

2.4.2.10 No difference between end-of-test and longer delay

Metcalfe et al. (2009) performed two experiments, the first of which is described above. The second experiment was similar to the first, but the participants were college students instead of sixth-grade students. The object of study was again L1 English vocabulary. For the group that did not have the time lag between feedback and test controlled, the 1- to 4-day delayed feedback was more effective than the end-of-test feedback. For the group that had the lag controlled, there was no significant difference between the two feedback timings. This implies that for the college students, the lag was a more important factor in the learning of vocabulary items than the timing of the feedback.

The first experiment of Webb et al. (1994) is described above. In the second experiment, undergraduate students answered multiple-choice questions on general knowledge and received end-of-test or 1-day delayed feedback, as in Experiment 1. In Experiment 2, the participants also rated their confidence in each response, but those results are irrelevant to the current discussion. Although the posttest results for the end-of-test and 1-day delayed feedback were not significantly different, the participants who received 1-day delayed feedback corrected a significantly higher proportion of errors from the training to the posttest than did those who received end-of-test feedback.

2.4.2.11 Other

Peeck and Tillema's (1978) study does not fit into the scheme of Table 3 because the researchers did not include a group that received either item-by-item or end-of-test feedback. The researchers studied the effects of varied feedback timings on Dutch fifth-grade students' responses to multiple-choice questions about an article in their L1. The study took place in the students' classrooms and was completed on paper. One group received feedback 30 minutes after answering the questions, while another group received feedback 1 day later. The participants took an immediate posttest on the second day of the study and a delayed posttest on the seventh day. No differences were found between the groups on the immediate posttest, but the group that received 1-day delayed feedback performed significantly better than the other group on the delayed posttest. In addition, the students who received 1-day delayed feedback corrected a significantly higher proportion of errors from the treatment to the posttest than the students who received 30-minute delayed feedback.

2.5 SLA Theories and Metalinguistic Feedback

Although the cognitive processing window theory addresses the timing of negative feedback, it does not indicate whether feedback that includes metalinguistic information will be more effective within the window than feedback that does not. To begin to address this question, I consider the idea of attention to form.
According to Schmidt (1995), metalinguistic feedback may be more useful to learners than recasts alone, and attention or noticing is the key to this prediction. As suggested by Pica (1994) and Gass (1997), providing metalinguistic feedback draws learners' attention to the correct forms, making metalinguistic feedback more effective than feedback that does not include metalinguistic information. However, the question remains as to whether this prediction can be applied to a situation that does not involve interaction per se, but a learner answering multiple-choice questions and receiving feedback. According to Gass (1997),

"If what is crucial about interaction is the fact that input becomes salient in some way (i.e., enhanced), then it matters little how salience comes about—whether through a teacher's self-modification, one's own request for clarification, or observation of another's request for clarification. The crucial point is that input becomes available for attentional resources and attention is focused on a particular form or meaning." (p. 129)

Thus, it is again the psycholinguistic basis of the interaction approach that I adopt here, with the key to the effectiveness of metalinguistic feedback being attention.

2.6 Language Research on Metalinguistic Feedback

The language-related research on the relative effectiveness of metalinguistic and nonmetalinguistic feedback is summarized in the following section. First, I cover SLA and CALL research, followed by language assessment research.

2.6.1 SLA and CALL research

The effectiveness of various types of feedback has been a major focus within second language acquisition (SLA) research. Starting with the explicitness of instruction, researchers have shown in two meta-analyses that explicit instruction is generally more effective than implicit instruction (Norris & Ortega, 2000; Spada & Tomita, 2010). First, Norris and Ortega (2000) performed a meta-analysis of empirical studies of the effectiveness of L2 instruction. They defined explicit instruction as instruction that included rule explanation or that asked learners to arrive at metalinguistic explanations. Implicit instruction, on the other hand, was defined as instruction that did not include rule explanations or ask learners to arrive at them. Learning was measured differently among the studies, with 65% using constrained constructed response, 39% using selected response, 29% using metalinguistic judgment, and 16% using free constructed response. The results showed that explicit instruction was more effective than implicit instruction, with a large mean effect size. This effect size was even larger for studies that used constrained constructed response and selected response.

More recently, Spada and Tomita (2010) used the same definitions for implicit and explicit instruction as Norris and Ortega (2000), and the studies that they meta-analyzed also overlapped with those in the Norris and Ortega study. However, they collapsed the learning outcome measures into two categories: free and controlled responses (including constrained constructed response, selected response, and metalinguistic judgment). They further divided studies into those that examined simple and complex features, a distinction determined by the number of transformations needed to apply the relevant rule. Their results for explicit instruction overall were similar to those of Norris and Ortega, with Spada and Tomita finding larger mean effect sizes for explicit compared to implicit instruction.
However, the results for free versus controlled outcome measures were somewhat different, with Spada and Tomita finding the largest effect size for free outcome measures after explicit instruction on complex features.

Although the results of Norris and Ortega (2000) and Spada and Tomita (2010) provide some insight into the effects of metalinguistic instruction, they do not directly address the effects of metalinguistic feedback. Three meta-analyses have investigated this factor, with two showing a greater effect of metalinguistic feedback, and one showing a greater effect of nonmetalinguistic feedback. Li (2010) meta-analyzed studies on both written and oral corrective feedback on L2 learning, and two of the variables he examined were metalinguistic feedback and recasts. The results showed larger effect sizes for metalinguistic feedback in both immediate and delayed posttests. Turning to oral corrective feedback exclusively, Lyster and Saito (2010) performed a meta-analysis of classroom studies. They classified feedback as recasts, explicit correction, and prompts. The category of prompts included metalinguistic feedback in addition to clarification requests, repetition of error, and elicitation. Their results showed larger effect sizes for prompts compared to recasts. Unlike the other meta-analyses summarized above, Mackey and Goo (2007) found nonmetalinguistic feedback to be more effective than metalinguistic feedback. They performed a meta-analysis of studies of synchronous interaction on the acquisition of grammar structures, including both face-to-face and computer-mediated interaction. The types of feedback examined were recasts, negotiation, and metalinguistic feedback, and the mean effect size for metalinguistic feedback was smaller than those for recasts and negotiation.

As with the meta-analyses, somewhat mixed results have come from experimental and quasi-experimental SLA research focused on the relative effectiveness of metalinguistic explanations and other types of feedback. First, Bitchener and Knoch (2009) studied the correction of article errors made in writing by ESL learners and found no advantage for metalinguistic feedback. The authors compared three types of direct corrective feedback, all of which included providing the correct form: written and oral metalinguistic explanation; written metalinguistic explanation; and direct corrective feedback only. They found no significant difference in the groups' accuracy scores on immediate and delayed posttests, although all groups improved over time compared to the pretest scores.

Conversely, many SLA researchers have found an advantage for metalinguistic feedback. Carroll and Swain (1993) studied Spanish L1 learners of ESL. The participants needed to produce the dative alternation for a given sentence or state that it did not alternate. The researchers divided the learners into five groups who got the following types of feedback when they answered incorrectly: metalinguistic information, an explicit indication that the answer was wrong, a recast, a question about whether they were sure, and no feedback. The metalinguistic feedback group outperformed all other groups. With a focus on English articles, Muranoi (2000) also found an advantage for L2 learners provided with metalinguistic rule feedback after a group interaction in which implicit negative feedback was provided, compared to the group that received the implicit feedback only.
In this study, the treatment was provided as part of a task in which students played roles in oral conversation, and learning was measured using a similar task. Ellis, Loewen, and Erlam (2006) found that explicit, metalinguistic feedback on oral errors in using the past tense in English was more effective than recasts. They measured learning using an oral imitation test, an untimed grammaticality judgment test, and a metalinguistic knowledge test, but did not test production of the structure. Similarly, Sheen (2007) found that explicit, metalinguistic feedback on oral errors in English article use was more effective than recasts. She measured learning using a speeded dictation test, a writing test prompted by a series of pictures and words, and an error correction test. Goo (2011) studied Korean L1 learners of English in a foreign language setting and the effect of metalinguistic feedback and recasts on their oral production. Learning was measured using oral production and grammaticality judgment tests. The results differed depending on the structure investigated, with metalinguistic feedback being more effective for the that-trace filter, while recasts were more effective for past unreal conditionals. Goo stated that the differing results may be due to the relative complexity of the two rules.

In CALL, several studies by Nagata and Heift on the learning of L2 Japanese and German, respectively, demonstrated an advantage for explicit, metalinguistic feedback over other types of feedback, such as increasing the salience of the error using highlighting or repetition when the student reviews his or her response (e.g., Heift, 2004, 2006; Nagata & Swisher, 1995; Nagata, 1997). In contrast, studies on CALL feedback on the acquisition of Spanish as a foreign language by L1 English speakers have generally revealed either no difference between metalinguistic and nonmetalinguistic feedback or an advantage for nonmetalinguistic feedback (e.g., Kregar, 2011; Moreno, 2007; Sanz & Morgan-Short, 2004). CALL feedback has also been studied with ESL learner participants, with mixed results for the relative effectiveness of metalinguistic and nonmetalinguistic feedback (e.g., Loewen & Erlam, 2006; Sauro, 2009).

Finally, a study by Murphy (2010) bridged the divide between the SLA and CALL research reported in this section and the educational psychology literature reported in the next section. That is, although Murphy studied learning in a CALL context, he did not use metalinguistic feedback, but rather, elaborative feedback, which consisted of hints to foster interaction and repeated attempts to choose the correct answer. This form of elaborative feedback is to some extent similar to the scaffolding provided in sociocultural-theory-based interventions (e.g., Aljaafreh & Lantolf, 1994; Poehner & Lantolf, 2013). Other types of elaborative feedback are considered in the next section. In Murphy's study, pairs of English learners in Japan answered multiple-choice reading comprehension questions. One group of learners received feedback that included only the correct response, while another group received elaborative feedback followed by the correct response. Murphy found that the elaborative feedback was superior to the correct response feedback based on the students' accuracy results on a follow-up comprehension exercise.
2.6.2 L2 assessment research on metalinguistic feedback

As mentioned above, three major types of assessments provide learners feedback on their responses: formative, diagnostic, and dynamic assessments. Despite the fact that the feedback is provided with the intention of promoting learning, no L2 empirical research on these types of assessment has addressed the question of the effectiveness of metalinguistic feedback. Research on dynamic assessment comes the closest, so I briefly describe this below. Dynamic assessment is based on the principles of sociocultural theory. It is described as follows:

Dynamic assessment integrates assessment and instruction into a seamless, unified activity aimed at promoting learner development through appropriate forms of mediation that are sensitive to the individual's (or in some cases a group's) current abilities. In essence, DA is a procedure for simultaneously assessing and promoting development that takes account of the individual's (or group's) zone of proximal development (ZPD). (Lantolf & Poehner, 2004, p. 50)

Mediation is a form of feedback that may assist the learner in correctly responding to a question. This mediation may take the form of metalinguistic or nonmetalinguistic feedback. One of the purposes of dynamic assessment is to promote development, and some research in this area has been conducted on the effectiveness of dynamic assessment to promote language development (e.g., Ableeva, 2010; Poehner & Lantolf, 2013; Poehner, 2007, 2008). For example, Poehner and Lantolf (2013) developed computerized multiple-choice dynamic assessments of Chinese and French reading and listening comprehension. The mediation was a series of hints that gradually narrowed down the search space for the correct answer. In addition to normal items, the test included transfer items, which were in the same format as the normal items, but more difficult. The authors claimed that their computerized dynamic assessments showed that the participants learned during the test, but given that there was no pretest for what they already could do on the transfer tasks, this claim can only be supported anecdotally. The relative effectiveness of the types of mediation (i.e., feedback) used was not compared, and this is true of other studies of dynamic assessment as well.

2.7 Educational Psychology Theories of Informational Feedback

Because educational psychology as a field does not have a special focus on language learning, theories within the field have little to say about the effectiveness of metalinguistic feedback. However, among the categories of feedback that are often distinguished, such as knowledge of results, knowledge of correct response, and informational feedback (e.g., Jaehnig & Miller, 2007), metalinguistic feedback fits into the last category. To reiterate the definitions, knowledge of results feedback tells the learner only whether he or she answered correctly or incorrectly. Knowledge of correct response feedback includes knowledge of results feedback (implicitly or explicitly), but also indicates the correct response. Informational feedback includes knowledge of correct response feedback plus additional information, such as metalinguistic information. The literature on these types of feedback in educational psychology is largely atheoretical.
Although many researchers have compared the feedback types under varied conditions, most have not proposed theoretical accounts as to why, for example, informational feedback may be more effective than knowledge of correct response feedback. One exception to this is the work of Smits, Boon, Sluijsmans, and Van Gog (2008), who examined the influence of working memory capacity on learning to perform a task. They claimed that for complex cognitive tasks, learners with little prior knowledge benefit from informational feedback that provides details at each step because it helps them avoid overloading working memory capacity, presumably because they can refer to the details in writing, rather than needing to hold them in working memory. On the other hand, learners with more prior knowledge can benefit more from global feedback that does not include details, presumably because the knowledge of the details is already integrated into their knowledge schemata, freeing working memory capacity to look at the task more globally. However, this theory does not directly apply to less complex tasks, such as responding to multiple-choice questions, which do not contain explicit steps. In addition, the distinction between detailed and global feedback is not analogous to that between metalinguistic and nonmetalinguistic feedback.

2.8 Educational Psychology Research on Informational Feedback

Several review articles have compared informational feedback and knowledge of correct response feedback, with the two more recent (Bangert-Drowns et al., 1991; Jaehnig & Miller, 2007) finding an advantage for informational feedback, although the oldest (Kulhavy & Stock, 1989) found no consistent pattern. In a meta-analysis, Bangert-Drowns, Kulik, Kulik, and Morgan (1991) found a larger effect size for informational feedback (referred to as explanation; d = 0.53) than for knowledge of correct response feedback (d = 0.22). Most recently, in a review of feedback types in programmed instruction (a behaviorist approach to individual instruction), Jaehnig and Miller (2007) found that knowledge of results feedback was not effective, while knowledge of correct response feedback was less effective than informational feedback (referred to in the article as elaborative feedback).

One additional review article compared informational feedback and knowledge of results feedback. Crooks (1988) reviewed studies and meta-analyses on classroom evaluation, without performing a meta-analysis himself. Crooks concluded that knowledge of results should be provided for all questions, with more informative feedback (e.g., metalinguistic information) only necessary when the student has made an error.

As a counterpoint, at least one study showed an advantage for knowledge of correct response feedback over informational feedback. Kulhavy, White, Topp, Chan, and Adams (1985) had undergraduate students read a passage about the U.S. Navy, then answer multiple-choice questions about it. Four different feedback conditions were used, with the feedback varying along a continuum of complexity from knowledge of correct response alone to knowledge of correct response plus an explanation of why each incorrect alternative was incorrect plus the section of the passage where the answer could be found. The researchers found an advantage for less complex feedback.

Many studies have found no advantage to providing explanation feedback compared to providing correct answer feedback (e.g., Park & Gittelman, 1992; Whyte, Karolick, Nielsen, Elder, & Hawley, 1995).
I will elaborate on a few recent examples here. Mandernach (2005) studied undergraduate students in a psychology class who received one of a number of different kinds of feedback on multiple-choice questions. One group received no feedback, a second group got knowledge of results, a third group got knowledge of correct response, a fourth group got knowledge of results and was presented with a paragraph in which the correct answer could be found, and a fifth group got knowledge of correct response and received explanations of the selected response and the correct response. The feedback that the fifth group received is clearly a type of informational feedback, while the feedback that the fourth group received could also be considered informational. No significant differences were found among the conditions.

As another example, Smits, Boon, Sluijsmans, and Van Gog (2008) had Dutch secondary school students work through genetics problems (presumably in their L1) in a web-based environment, and they got either global feedback, which included the correct answer and a problem-solving approach, or informational feedback, which included the correct answer, problem-solving approach, a fully worked solution, and an explanation for why the answer was correct. The learners with higher prior knowledge performed better under the global feedback condition, while no difference between the conditions was found for the learners with lower prior knowledge.

Note that most of the individual studies of informational feedback and those in the reviews examined item learning only. Counter to this trend is the research of Butler, Godbole, and Marsh (2013), who looked at both item learning, using questions that asked about definitions of concepts, and extension of learning to inference questions in reading comprehension. The definition questions appeared on both 5-minute-delayed posttests and 1-week-delayed posttests, and the inference questions only appeared on the final posttests. The types of feedback that were provided were knowledge of correct response or informational feedback, which included knowledge of correct response and two additional sentences from the passage that elaborated on the correct answer. The researchers found that the informational feedback was more effective for promoting extension of learning than correct answer feedback, but that the two types of feedback were equivalent for promoting item learning.

Overall, the educational psychology literature concurs with the L2 literature: Informational feedback (such as metalinguistic feedback) generally seems to be more effective than knowledge of correct response for promoting both item learning and system learning.

2.9 Research Questions and Hypotheses

The current study focuses on learning the correct usage of English articles when feedback that includes or does not include metalinguistic information is provided on an item-by-item basis or at the end of a test. This results in four conditions: item-by-item feedback including metalinguistic information, item-by-item feedback without metalinguistic information, end-of-test feedback including metalinguistic information, and end-of-test feedback without metalinguistic information. To summarize the theory and research introduced so far, SLA and educational psychology theories make conflicting predictions about the ideal timing of feedback, and empirical studies can be cited to support or refute any given prediction.
The effectiveness of metalinguistic feedback is much clearer and has been generally supported by both theory and empirical results. However, none of the theories considered above predicts how providing or not providing metalinguistic feedback will interact with item-by-item or end-of-test timing. For example, it is possible that the effects of increasing attention by providing metalinguistic feedback and of increasing attention by providing that feedback on an item-by-item basis (as predicted by the theories supporting the interaction approach) has an additive effect, leading to 51 the most attention and therefore the most learning, but it is also possible that attention cannot meaningfully be increased in this way. Because of this gap in both theoretical and empirical knowledge and the potential for this knowledge to help teachers better plan feedback to increase student learning, the first research question is as stated below: 1. On a multiple-choice drag-and-drop test, does the timing (item-by-item or end-of-test) and type (with or without metalinguistic information) of feedback affect ESL students’ gain scores on 5-minute-delayed and 1-week-delayed posttests a. on the same (repeated) questions? b. on new questions? Note that repeated and new questions are included here in order to measure the effects on both item learning and system learning. (Recall that an item has been learned if the learner can correctly respond to it after a previous exposure to the item, whereas a system has been learned if the learner can correctly apply a rule to a previously unseen situation.) Next, the cognitive processing window theory, the dual-trace hypothesis, and interference-perseveration hypothesis make different predictions based on whether the learner answers a question correctly, so it is also important to break down the results by correct and incorrect responses. This may help to distinguish which theories are best supported by the data. Thus, the second research question is as follows: 2. On a multiple-choice drag-and-drop test, does the conditional probability of correctly answering a question on 5-minute-delayed and 1-week-delayed posttests differ for groups based on the timing (item-by-item or end-of-test) and type (with or without metalinguistic information) of feedback for questions that are initially answered 52 a. correctly? b. incorrectly? A summary of the predictions based on three theoretical perspectives is shown in Table 4. The two item types that are considered are multiple-choice items that the learners have received feedback on during the treatment (repeated items) and multiple-choice items that the learners have not received feedback on (new items). 2.9.1 Predictions for Research Question 1 The cognitive processing window, as hypothesized by Doughty (2001), provides an opportunity for learners to notice a contrast between their selected response and the correct response. This window is limited by the constraints of working memory. Therefore, item-by-item feedback on errors is predicted to be more effective for language learning than end-of-test feedback. When the feedback comes at the end of the test, the learner may no longer have his or her erroneous response in working memory and may be unable to make the comparison. The situation is similar for correct responses, although the prediction in this case is not completely clear because it is not directly addressed by the theory. 
Therefore, for all item types (repeated and new), one may predict that the participants in the item-by-item group will have higher gain scores on the 5-minute-delayed and 1-week-delayed posttests than the end-of-test and no-feedback groups. The dual-trace hypothesis and interference-perseveration hypothesis both predict that end-of-test feedback will be superior to item-by-item feedback for repeated items that are initially answered incorrectly (error correction). However, the two hypotheses predict slightly differing results for item-by-item and end-of-test feedback when the test taker responds correctly (reinforcement of correct responses). The interference-perseveration hypothesis predicts no difference in this case, while the dual-trace hypothesis predicts that end-of-test feedback will be superior. Because the dual-trace hypothesis and interference-perseveration hypothesis are not designed to deal with system learning, only speculation is possible about their predictions for the new items. The dual-trace hypothesis may predict the best results for end-of-test feedback because the rules have, in effect, been reviewed twice as many times in this condition as in the item-by-item condition. That is, a learner receiving item-by-item feedback would have attempted to remember a rule once when answering a question and would then have read the rule immediately, with the two instances blurring into one. In the end-of-test condition, the learner would have attempted to remember the rule when answering the question and then read the rule later, at the end of the test, resulting in two separate instances of reviewing the rule. The interference-perseveration hypothesis may predict no difference between item-by-item and end-of-test feedback because no memory trace of the new items exists.
The results of a pilot of the current study (Lavolette, 2013) do not fit the predictions of any of the theories perfectly. The (nonsignificant) trends seen in the pilot followed the predictions of the educational psychology theories for error correction (i.e., end-of-test > item-by-item). However, an advantage was seen for item-by-item feedback in the reinforcement of correct responses, as predicted by the cognitive processing window theory. Finally, a very slight, nonsignificant advantage was seen for end-of-test feedback for system learning. Because of its small magnitude, this advantage is likely to remain nonsignificant in the current study, if it appears at all. This may best fit the predictions of the interference-perseveration hypothesis.
2.9.2 Predictions for Research Question 2
The only theory that bears on the issue of metalinguistic feedback is the attention-based theory supporting the interaction approach. From this perspective, metalinguistic feedback should increase learners' attention to the error and the correct response, leading to more error correction than feedback that does not include metalinguistic information. For the same reasons, metalinguistic feedback should increase learners' attention to feedback on correct responses, leading to better reinforcement of correct responses. This increased attention should also lead to better system learning.
Table 4: Theoretical Predictions for Posttest Results Based on Feedback Timing and Type
Theory | Reinforcement of correct responses (repeated items) | Error correction (repeated items) | System learning (new items)
Cognitive processing window (SLA) | EOT < IBI | EOT < IBI | EOT < IBI
Dual trace | EOT > IBI | EOT > IBI | EOT > IBI (?)
Interference-perseveration | EOT = IBI | EOT > IBI | EOT = IBI (?)
Attention-based (SLA) | No meta < Meta | No meta < Meta | No meta < Meta
Note. IBI = item by item; EOT = end of test; meta = metalinguistic.
2.9.3 Predictions for Interaction Effects Between Feedback Timing and Feedback Type
No empirical evidence exists that would allow me to predict an interaction effect between feedback timing and providing or not providing metalinguistic information. In addition, because the educational psychology theories considered here do not directly address metalinguistic feedback, they cannot make predictions about the potential interaction. Speculatively, I predict that the effects of feedback provided during the cognitive processing window and the attention-enhancing effects of metalinguistic feedback may be cumulative, leading to the greatest effect on learning when metalinguistic feedback is provided on an item-by-item basis (i.e., within the cognitive processing window). However, it is possible that metalinguistic feedback provided outside of the cognitive processing window is just as effective as metalinguistic feedback provided within it.
CHAPTER 3: METHOD
The first chapter above explained the purpose of the current study, which is to provide evidence of how varied feedback timing and types affect the acquisition of English by adult second language learners. As explained in the following chapters, this was accomplished by measuring how providing or not providing metalinguistic information, item-by-item or at the end of a test, affects ESL learners who answer multiple-choice questions that require them to apply rules for using English articles. In Chapter 2, I reviewed the theories and previous literature from various fields that bear on this question. The theories lead to conflicting predictions about the optimal feedback timing, and the research on feedback timing has been inconclusive. On the other hand, theory and research generally agree that metalinguistic feedback is more effective than nonmetalinguistic feedback, although no theory predicts how feedback type interacts with varied feedback timing, nor do any empirical results bear on this issue. I presented the research questions and predictions at the end of Chapter 2. Below, I provide details on the participants who took part in the study, then describe the materials used. Following that is a description of the procedure, then details of how the data were analyzed.
3.1 Participants
Participants were recruited from classes at the English Language Center at Michigan State University, and 221 learners agreed to participate. Excluding those who did not complete all portions of the study (n = 35) and those who scored at ceiling on the pretest (n = 74), 112 students were included in the analyses, with about 25 to 30 randomly assigned participants remaining in each feedback group. All 112 participants completed the exit questionnaire, and demographic information on the students is shown in Table 5. The students were in the top two levels of the Intensive English Program (Levels 3 and 4) and the single level of the English for Academic Purposes Program (Level 5). The mean iBT TOEFL scores of the students in these levels overall were about 67 for Level 3, 68 for Level 4, and 72 for Level 5 (D. Reed, personal communication, March 28, 2014). The average age of the participants in each feedback group was from 20 to 23 years.
Their native languages were primarily Mandarin Chinese and Arabic, which reflects the population of the English Language Center.
Table 5: Participants' Demographic Information
Item-by-item metalinguistic (n = 28). Gender: 18 male, 10 female. Level: 10 Level 3, 11 Level 4, 7 Level 5. Age: mean 22, SD 4.3, range 18–38. Native language: 10 Mandarin, 2 Cantonese, 11 Arabic, 1 Japanese, 3 Portuguese, 1 Taiwanese. Years studying English: mean 5.5, SD 5.1, range 1–10. Months in US: mean 10.8, SD 9.1, range 1–40.
Item-by-item without metalinguistic (n = 25). Gender: 12 male, 13 female. Level: 5 Level 3, 11 Level 4, 9 Level 5. Age: mean 20, SD 1.5, range 18–25. Native language: 14 Mandarin, 1 Cantonese, 3 Arabic, 1 Korean, 3 Japanese, 2 Thai, 1 Mandarin-Cantonese bilingual. Years studying English: mean 8.0, SD 4.0, range 1–15. Months in US: mean 9.6, SD 8.2, range 1–30.
End-of-test metalinguistic (n = 29). Gender: 13 male, 16 female. Level: 7 Level 3, 13 Level 4, 9 Level 5. Age: mean 20, SD 2.7, range 18–30. Native language: 13 Mandarin, 1 Cantonese, 4 Arabic, 2 Korean, 5 Japanese, 1 Taiwanese, 2 Thai, 1 missing. Years studying English: mean 7.3, SD 3.4, range 1–13. Months in US: mean 8.4, SD 8.9, range 1–39.
End-of-test without metalinguistic (n = 30). Gender: 16 male, 13 female, 1 missing. Level: 6 Level 3, 16 Level 4, 8 Level 5. Age: mean 23, SD 7.2, range 18–54. Native language: 16 Mandarin, 6 Arabic, 2 Korean, 4 Japanese, 1 Mandarin-Cantonese bilingual, 1 Mandarin-Taiwanese bilingual. Years studying English: mean 7.7, SD 4.1, range 1–20. Months in US: mean 13.2, SD 15.1, range 1–69.
Also on the exit questionnaire, the participants indicated whether, prior to participating in the study, they had been familiar with the six rules for articles that were tested (Appendix B). All but four participants were familiar with at least one rule, and the mean number of rules that participants knew was 3.4 (SD = 1.6). Fourteen participants indicated that they knew all six rules. The average number of participants who knew each rule was 64 (SD = 8.0), with a minimum of 49 and a maximum of 71 participants knowing a given rule.
3.2 Materials
I administered a pretest, two identical treatments, a 5-minute-delayed posttest, a 1-week-delayed posttest, and an exit questionnaire, all via computers. The purpose of the materials was to measure the participants' knowledge of article usage (pretest and posttests), to teach the participants how to use articles (treatments), and to gather demographic information about the participants (exit questionnaire).
The topic of the items on the tests and treatments was article usage. I chose this topic because of its persistent difficulty for many learners (e.g., Master, 2002). Although using articles in a nonnativelike way may not lead to misunderstandings, in my experience, ESL learners often express an interest in improving their accuracy in the use of articles.
All items on all of the tests and treatments were multiple-choice. This made it possible to give item-by-item and end-of-test feedback that was specific to each participant's responses. I wrote all of the items, which were then tested on 8 native English speakers to ensure that the experimental items had only one possible response and the filler items had only two. Items for which 7 out of 8 native speakers did not agree on the intended response were revised and retested.
The participants answered the items by dragging and dropping the selected response into the blank in the sentence (see Figure 1). Each sentence had one blank, and the three answer choices were "a/an [noun]," "the [noun]," and "[noun]." The drag-and-drop movement was selected because it allowed the participants to see their selected response within the context of the sentence, as opposed to checking a box.
In addition, previous research has shown that dragging is more effective for learning than clicking (Heift, 2003). To respond to the questions, the participants dragged and dropped the article and the following noun as a single piece. Because the article and noun were linked in this way, responding to a question always involved the same type of mouse movement, whether the chosen answer contained an overt article or not. If the article and following noun were separate pieces, making a “no article” response would have involved dragging and dropping only one word, while making a response that contained an overt article would have required dragging and dropping two separate words. Therefore, keeping the article and noun linked made the participants’ task more uniform. Figure 1: Example question. The target nouns were singular and countable, and each noun appeared an equal number of times in each context (no article, a/an, the, and either a/an or the). This ensured that the participants would not receive a score higher than a person guessing randomly by using a 61 strategy such as always choosing the response with the definite article. Each item had one blank and three answer choices, which corresponded to the three possible article choices of no article, an indefinite article (a or an as appropriate), or the definite article (the). One item appeared per screen on all tests and treatments, and the participant was required to submit an answer before proceeding to the next item. This prevented missed responses due to clicking the “Submit” button without selecting a response. Participants could not go back to previous items, which prevented the participants from changing previous responses due to what they learned later in the test or treatment. The items on each test and treatment were randomized for each participant. All tests and treatments were presented using a web-based open-source testing platform, Concerto (http://www.psychometrics.cam.ac.uk/). This platform was chosen because of its flexibility to deliver feedback whenever and however the programmer/researcher desired, as well as regulate the timing of breaks. The platform was installed on servers owned by Michigan State University’s College of Arts and Letters. It recorded the responses to the questions and the time that each screen was displayed, in addition to recording the responses to the exit questionnaire. No record was made of the clicking or mouse movements that occurred while the participants were choosing their responses. Having this data might provide some insight into what the participants were thinking when they were choosing their responses. However, this information would provide little information to support or refute the hypotheses being tested in the current study, so the collection and analysis of this type of data is left to future studies. 3.2.1 Pretest The purpose of the pretest was to determine how accurately the participants could use articles following the targeted rules before the treatments began. The pretest contained 48 62 multiple-choice items, listed in Appendix C. Thirty-six of the items are experimental items that have only one possible answer, and the remaining 12 items are fillers that have two possible answers. The purpose of the filler items was to show the students that the domains of the rules are not mutually exclusive in actual usage. However, these items were not analyzed because two out of three possible answers were correct, giving a high likelihood of answering them correctly by guessing. 
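To make the guessing baselines concrete, the short sketch below shows the scores that purely random guessing would be expected to produce on the 36 experimental and 12 filler pretest items; all numbers come from the item design just described, and the code is only an illustration of the design rationale, not part of the analyses reported later.

```python
# Expected pretest scores under random guessing, given the item design above.
# Experimental items: 1 of 3 choices is correct; filler items: 2 of 3 are correct.

n_experimental = 36   # experimental pretest items (one possible answer)
n_filler = 12         # filler pretest items (two possible answers)

p_exp = 1 / 3         # chance of guessing an experimental item correctly
p_fill = 2 / 3        # chance of guessing a filler item correctly

expected_exp = n_experimental * p_exp     # 12.0
expected_fill = n_filler * p_fill         # 8.0

print(f"Expected experimental-item score by guessing: {expected_exp:.1f} / {n_experimental}")
print(f"Expected filler-item score by guessing:       {expected_fill:.1f} / {n_filler}")

# Because each target noun appears equally often in each article context, a fixed
# strategy such as always choosing "the" also earns only the 1/3 chance rate on the
# experimental items, while the 2/3 guessing rate on fillers is why they were not analyzed.
```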
The pretest took about 15 minutes for the participants to complete. 3.2.2 Treatments 1 and 2 Each treatment contained 32 of the multiple-choice items from the pretest. Using only 32 of the 48 pretest items reserved 16 items from the pretest that did not appear on the treatments and that the participants therefore did not receive feedback on. These new items were used to answer Research Question 1b, while the items on the treatment, which the participants received feedback on, were repeated items and were used to answer Research Question 1a. Of the 32 treatment items, 24 were experimental items, and 8 were fillers. The number of filler items was chosen in proportion to the number of experimental items included on the treatment. That is, the proportion of experimental to filler items on the pretest (36:12 = 3:1) is the same as the corresponding proportion on the treatments (24:8 = 3:1). An example question is shown in Figure 1. The treatments provided four types of feedback, depending on the feedback group (i.e., item-by-item with metalinguistic information, item-by-item without metalinguistic information, end-of-test with metalinguistic information, end-of-test without metalinguistic information; see Fig. 2). The feedback for all groups followed the same format, with the feedback provided on one question per screen. The feedback screen first showed the correct answer(s), followed by “You answered correctly” for correct responses or “You answered incorrectly” for incorrect 63 responses (Figure 3). I decided on this arrangement of the feedback for the following reason: by learning that he or she answered the question incorrectly before seeing the correct answer, a participant could repeat the processing that occurred when he or she initially saw the item, thus reactivating the error in working memory. This is of particular concern for participants in the end-of-test feedback condition, who would not have seen the question on the immediately preceding screen. Providing the correct answer first without the learner’s erroneous response helps to minimize the possibility of reactivating the error. Note that the participants were shown the correct answer, but not their own, potentially erroneous, response. After the indication of the correctness of the learner’s response, for the two metalinguistic feedback groups, next on the same screen was a metalinguistic rule, regardless of whether the participant answered correctly or incorrectly. Note that no instruction was provided other than the feedback itself. A week passed between the two identical treatments, which each lasted about 20 minutes. Figure 2: Division of participants into feedback groups. IBI = item by item; EOT = end of test. 64 Figure 3: Metalinguistic feedback on an incorrect response. 3.2.3 Five-minute-delayed posttest The participants completed the 5-minute delayed posttest during the same session as Treatment 2, after a 5-minute break. I included this break to ensure that the 40-second cognitive processing window had fully elapsed for all of the items and feedback on the preceding treatment, as well as to give the participants a chance to rest. The 5-minute-delayed posttest included all items from the pretest so that both item learning and system learning could be measured. No feedback was provided on any of the posttest items to any of the groups because the effects of feedback on the posttest would have been impossible to separate from the effects of the treatment, especially for the participants who received item-by-item feedback. 
While it is possible that the participants may have remembered the correct answers for the items that appeared on the treatments (once on each treatment, or twice for each item), 16 of the items (12 experimental and 4 filler) appeared only on the pretest and the posttests. Because no feedback was provided on these items, the participants needed to apply the rules that they had learned. The participants completed the posttest in about 15 minutes. 3.2.4 One-week-delayed posttest The participants completed the 1-week-delayed posttest 1 week after the 5-minutedelayed posttest, and it contained the same items as the 5-minute-delayed posttest. All of the 65 participants received end-of-test metalinguistic feedback as a courtesy. This feedback did not affect the posttest results because it came after the final posttest was completed. The participants completed this posttest in about 15 minutes. 3.3 Procedure Figure 4 shows an overview of the procedure. In advance of Session 1, I randomly assigned the participants to the feedback groups (Figure 2), with approximately the same number of participants in each group per class. During Session 1, I demonstrated how to respond to the items. This was important so that the mechanics of answering the items themselves would not cause the participants to answer them incorrectly. Then, the participants completed the pretest, which took 15 to 20 minutes, depending on the participant. After a 5-minute break, the participants completed Treatment 1. This took 20 to 30 minutes. During this first session, I told the participants that I would come to their class a total of three times. Although I did not tell them that they would encounter the same items on all of the tests and treatments, it is probable that many of them figured this out after the first treatment. Some of them asked me why they saw the same items again, and I told them that repetition was useful for learning. A week later, during Session 2, the participants completed Treatment 2, took a 5-minute break, then completed the 5minute-delayed posttest, which altogether took 35 to 45 minutes. One week later, they completed the 1-week-delayed posttest and the exit questionnaire, which took 20 to 30 minutes. 66 Figure 4: Procedure. I was present in the classes where the study took place only during the experimental sessions themselves, so I do not know the exact content that the participants were exposed to 67 outside of class. However, the instructors of the classes told me that they did not explicitly study articles in their classes during the class sessions before the study or during the study itself. 3.4 Analysis As a preliminary step in analyzing the data, I excluded from the analysis participants who scored very high on the pretest. These participants were not of interest because they had very little margin to learn from the feedback. Specifically, I eliminated participants who scored above 85% (an arbitrarily chosen cut-off score) on the new or repeated items: 11 or 12 out of 12 on the new items or 20 or more out of 24 on the repeated items. This eliminated 61 participants. I also calculated the item discrimination of each item on the pretest, considering the repeated items and the new items separately. The item discrimination indicates “the degree to which an item separates the students who performed well from those who performed poorly on the test as a whole” (Brown, 2005, p. 68). 
That is, an item with low discrimination may be easy for participants who score low on the test overall and difficult for participants who score high on the test overall, which indicates that the item may not be testing the same construct as the other items on the test. There may be something wrong with the statement of the item itself which makes it tricky, or it may be an item that is very easy or very difficult for all participants. Therefore, items with a discrimination of lower than .2 (marked with asterisks in Appendix C) were excluded from the analysis, leaving 10 new items and 17 repeated items. This discrimination level was chosen based on Ebel's (1979, p. 267) recommendations, as cited by Brown (2005, p. 75). After excluding these items, participants who were at the new ceiling of 9 or 10 out of 10 new items or 15 or more out of 17 repeated items were excluded. This eliminated a further 13 participants. After the elimination of these participants, the total number included in the analyses was 112, with demographic information as given above. 68 All of the analyses described below are mixed-design ANOVAs, which are used when both repeated-measures and between-groups variables are involved in the analysis (Field, 2009, p. 507). This type of analysis was chosen, rather than the other possible analyses (e.g., a repeatedmeasures ANOVA and a series of t-tests), to minimize the number of analyses needed to answer the research questions, thereby reducing the likelihood of finding a statistically significant result by chance alone. In addition, the ANOVAs provide information on interactions between variables, which cannot be produced by a t-test. For all of the ANOVAs, I calculated effect sizes as partial eta squared (η2part). According to Brown (2008), these effect sizes “indicate the percentage of variance in each of the effects (or interaction) and its associated error that is accounted for by that effect (or interaction)” (p. 42). Thus, partial eta squared can be understood as a percentage. To my knowledge, no one has assigned standard interpretations to these effect sizes. However, for eta squared effect sizes, J. Cohen (1988) interpreted .01 as a small effect, .06 as a medium effect, and .14 as a large effect. Using this as guidance for partial eta squared, I interpreted values above .15 as a large effect, between .15 and .05 as a medium effect, between .05 and .01 as a small effect, and below .01 as a very small effect. I used a significance level of p ≤ .05 for all analyses. An argument could be made for using a correction to the significance level because of the multiple ANOVAs used. However, as detailed below, I performed the analyses on mutually exclusive portions of the data. For example, I used the gain scores for the repeated questions in one ANOVA, while I used the gain scores for the new items in a separate ANOVA. In that sense, I did not make multiple comparisons of the treatment groups, so I chose not to use a correction for the significance level. Before performing the ANOVAs, I checked the three assumptions of normality, 69 homogeneity of variance, and sphericity. For the assumption of normality, I used KolmogorovSmirnov tests and examined the Q-Q plots. For each ANOVA, some of the KolmogorovSmirnov tests were significant, indicating that the assumption of normality had been violated, and the corresponding Q-Q plots confirmed this in some cases. However, ANOVAs are generally robust to violations of normality when group sizes are equal (Field, 2009, pp. 
359–360), so I decided that the assumption of normality was sufficiently met in all cases. The results of checking the other two assumptions vary for each of the ANOVAs, so they are provided below. 3.4.1 Research Question 1 After the preliminary steps of excluding items and participants, I ran the analyses described below to determine whether the participants who received the four treatments differed in how much they learned, addressing Research Question 1. To investigate item learning (Research Question 1a), I analyzed the old items only. I used a 2 x 2 x 2 mixed-design ANOVA with the gain scores (from the pretest to the 5-minute-delayed and 1-week-delayed posttests) as a within-subject factor and feedback timing (item-by-item or end-of-test feedback) and feedback type (with or without metalinguistic information) as between-subjects factors. To investigate system learning (Research Question 1b), I ran a separate 2 x 2 x 2 mixed-design ANOVA using the gain scores for the new items, using the same factors. Before performing the ANOVAs, I checked the three assumptions of normality, homogeneity of variance, and sphericity. Information about the normality assumption is given above. I used Levene’s test to check the homogeneity of variance. No significant results were found, so this assumption was met. Finally, the assumption of sphericity is automatically met because at least three levels of a variable are necessary for sphericity to become a problem (Field, 2009, p. 459), and each variable had only two levels. 70 3.4.2 Research Question 2 To answer Research Question 2, I further analyzed the posttest results of the repeated and new items based on whether a given participant answered correctly or incorrectly on the corresponding pretest item, following Clariana, Wagner, and Roher Murphy (2000; see also Butler et al., 2007; Peeck & Tillema, 1978; Smith & Kimball, 2010; Surber & Anderson, 1975). The terminology used in this previous literature is conditional probabilities, and I adopt this terminology here as well. However, the calculations are performed post-hoc, so probability here should not be understood as a prediction. Instead, it is a description of the learners’ responses to the posttest questions based on their pretest responses. These conditional probabilities provide information about how the treatment affected reinforcement of correct responses (when a participant answers correctly on both a pretest and posttest) and error correction (when a participant answers incorrectly on a pretest and correctly on a posttest). For both the repeated and new items, the four sets of conditional probabilities that I calculated are R2/R1, the probability that an item is answered correctly on the 5-minute-delayed posttest, given that it has been answered correctly on the pretest; R3/R1, the probability that an item is answered correctly on the 1-week-delayed posttest, given that it has been answered correctly on the pretest; R2/W1, the probability that an item is answered correctly on the 5-minute-delayed posttest, given that it has been answered incorrectly on the pretest; and R3/W1, the probability that an item is answered correctly on the 1-week-delayed posttest, given that it has been answered incorrectly on the pretest. For example, I calculated R2/R1 by first counting the number of correct responses for a given participant on the pretest (R1), then counting the number of correct responses for those same questions on the 5-minute-delayed posttest (R2). I then divided R2 by R1 to get R2/R1. 
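As a concrete illustration of this calculation, the sketch below shows one way the four conditional probabilities could be computed from per-participant correctness matrices. The data frames and layout (one row per participant, one column per item, 1 = correct, 0 = incorrect) and the names pretest, post5min, and post1week are assumptions made for the example, not the format of the Concerto logs.

```python
import pandas as pd

def conditional_probabilities(pretest: pd.DataFrame, posttest: pd.DataFrame) -> pd.DataFrame:
    """For each participant, return the share of pretest-correct and pretest-incorrect
    items that were answered correctly on the given posttest."""
    correct_pre = pretest == 1          # items answered correctly on the pretest
    wrong_pre = pretest == 0            # items answered incorrectly on the pretest

    # Mask the posttest responses by the pretest outcome, then divide the number of
    # correct posttest responses by the number of items in each pretest category.
    r_given_r = posttest[correct_pre].sum(axis=1) / correct_pre.sum(axis=1)
    r_given_w = posttest[wrong_pre].sum(axis=1) / wrong_pre.sum(axis=1)
    return pd.DataFrame({"R_given_R1": r_given_r, "R_given_W1": r_given_w})

# R2/R1 and R2/W1 use the 5-minute-delayed posttest; R3/R1 and R3/W1 the 1-week-delayed one:
# probs_5min = conditional_probabilities(pretest, post5min)
# probs_1week = conditional_probabilities(pretest, post1week)
```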
For the repeated and new items, I performed two 2 x 2 x 2 mixed-design ANOVAs with the 71 conditional probabilities on the 5-minute-delayed and 1-week-delayed posttests (R2/R1 and R3/R1; R2/W1 and R3/W1) as a within-subject factor and the two feedback timing groups and the with/without metalinguistic feedback groups as between-subjects factors. The analysis of the conditional probabilities is necessary to investigate differences in the predictions of the educational psychology theories. In addition, the results obtained from investigating the conditional probabilities have the potential to provide more detailed insight into what the participants learned correctly or incorrectly through their participation in the study. If, for example, I examine the results from item learning overall, I can determine the gain scores from pretest to posttest. However, gain scores do not tell the whole story. The intended result of the treatment is that participants who answered questions incorrectly on the pretest will answer those questions correctly on the posttest, without incorrectly answering any of the questions that they originally answered correctly. However, imagine this scenario: Out of 20 questions, a participant answers 10 correctly and 10 incorrectly on a pretest. After a treatment, the same participant answers the 20 items again on a posttest. Of the 10 items answered correctly on the pretest, she answers 7 of the items correctly on the posttest, and of the 10 items answered incorrectly on the pretest, she answers 3 correctly on the posttest. Her score is still 10, with a gain score of zero, which may lead to the conclusion that the treatment had no effect. However, a closer investigation of the conditional probabilities would reveal that the treatment had both positive and negative effects. Without separating the results into conditional probabilities, this information is lost. Therefore, while overall gain scores provide useful information, akin to averages, breaking down scores into conditional probabilities may provide further useful information. However, the conditional probabilities will be volatile and therefore unreliable if many of the participants answer relatively few questions either correctly or incorrectly on the 72 pretest. Ideally, a test for which conditional probabilities are used will provide a large pool of items that are answered both correctly and incorrectly on the pretest for each participant. As the results below will show, relatively few items were answered incorrectly on the pretest, making the results of the analyses of error correction (but not reinforcement of correct responses) unreliable. Before performing the ANOVAs, I checked the three assumptions of normality, homogeneity of variance, and sphericity. Information about the normality assumption is given above. Next, I used Levene’s test to check the homogeneity of variance. A significant result was found for R2/W1 for the item-by-item and end-of-test groups for the repeated items, which lends further support to the claim that the analyses involving this conditional probability are not reliable. In all other cases, this assumption was met. Finally, the assumption of sphericity was met because at least three levels of a variable are necessary for sphericity to become a problem (Field, 2009, p. 459), and each variable had only two levels. 
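The hypothetical participant described above can be worked through in a few lines. The numbers are exactly those in the scenario (20 items, 10 initially correct, 7 retained and 3 corrected on the posttest); the sketch simply makes explicit how a gain score of zero can coexist with informative conditional probabilities.

```python
# The hypothetical participant: 20 items, 10 answered correctly and 10 incorrectly on the pretest.
pre_correct, pre_wrong = 10, 10

# On the posttest, 7 of the originally correct items stay correct,
# and 3 of the originally incorrect items become correct.
post_correct_given_correct = 7
post_correct_given_wrong = 3

pretest_score = pre_correct                                               # 10
posttest_score = post_correct_given_correct + post_correct_given_wrong   # 10
gain = posttest_score - pretest_score                                     # 0

r2_r1 = post_correct_given_correct / pre_correct   # 0.7
r2_w1 = post_correct_given_wrong / pre_wrong       # 0.3

print(f"Gain score: {gain}")                 # 0 -> looks like "no effect"
print(f"R2/R1 = {r2_r1}, R2/W1 = {r2_w1}")

# R2/R1 < 1 shows that some originally correct items were lost (a negative effect),
# while R2/W1 > 0 shows that some errors were corrected (a positive effect);
# the overall gain score of zero hides both.
```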
3.4.3 Question and feedback display times Next, I wanted to determine whether the time that the participants spent viewing the questions and feedback on the treatments was significantly different among the with/without metalinguistic feedback groups and two feedback timing groups. Because of the self-paced nature of the tests and treatments, I needed to investigate the possibility that the time spent with the feedback displayed differed among the feedback groups. If so, the time differences, rather than the treatments themselves, could explain any differences in the learning outcomes. Although data that is a set number of standard deviations from the mean is often trimmed from reaction times, in the current study, this would have eliminated display times in which the participant may simply have been viewing the screen for an unusually long time. Although classroom distractions, 73 rather than long reading times, undoubtedly caused some of these outliers, it is impossible to separate the two types of long display times. That said, these classroom distractions most likely affected all feedback groups in the same way, so including all of these long display times in the analysis should not qualitatively affect the results. Of course, the display times should not be taken as an exact measure of how long the participants were actually reading the screen, but only as an upper bound. With these limitations in mind, I eliminated from the current analysis only the display time for the first question on each test and treatment; which question was eliminated varied by participant because of the randomization of the questions. This first display time was often longer than the other display times for two reasons. First, the participants sometimes waited to be told to proceed before answering the first question. Second, Treatment 1 and the 5-minutedelayed posttest both came after a computer-timed 5-minute break, and the participants were generally not ready to proceed instantaneously after the break. Note, however, that this reasoning does not apply to the time that the feedback was displayed, so none of the feedback display times were excluded from the analysis. I did not remove display times that were exceptionally short for either the questions or the feedback because these very short display times simply indicated that the participant spent very little time with that page displayed. To analyze the feedback and question display times, I ran three mixed-design ANOVAs on the total display times of each participant for each test or treatment (Table 7). The display times for the 1-week-delayed posttest were not included in any of the analyses because this posttest occurred last, and therefore could not have affected the previous tests. All ANOVAs used feedback timing (item-by-item and end-of-test) and type (with and without metalinguistic information) as between-subjects factors, but the within-subject (repeated measures) factor varied. The first ANOVA was 4 x 2 x 2 and used the display times for the repeated questions on 74 the pretest, Treatment 1, Treatment 2, and the 5-minute-delayed posttest as the within-subjects variable. The second ANOVA used the display times for the new questions on the pretest and the 5-minute delayed posttest as the within-subjects variable. Note that the new questions did not appear on the treatments, so no question display times were recorded. The third ANOVA was 2 x 2 x 2 and used the feedback display times for Treatment 1 and Treatment 2 as the within-subjects variable. 
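A minimal sketch of this aggregation is shown below. The log layout and the column names (participant, test, screen_type, presentation_order, seconds) are hypothetical, not the actual Concerto output; the point is only that the single first-displayed question on each test or treatment is dropped, while all feedback display times are kept.

```python
import pandas as pd

# Assumed layout: one row per displayed screen, with the participant, the test or
# treatment ("pretest", "treatment1", ...), the screen_type ("question" or "feedback"),
# the ordinal position of the question within that test (presentation_order, 1 = first),
# and the seconds the screen was displayed.

def total_question_time(log: pd.DataFrame) -> pd.DataFrame:
    questions = log[log["screen_type"] == "question"]
    # Drop only the first question displayed on each test or treatment; because item
    # order was randomized, which item this removes differs by participant.
    questions = questions[questions["presentation_order"] > 1]
    return (questions
            .groupby(["participant", "test"])["seconds"]
            .sum()
            .unstack("test"))

def total_feedback_time(log: pd.DataFrame) -> pd.DataFrame:
    # Feedback display times are kept in full; no screens are excluded.
    feedback = log[log["screen_type"] == "feedback"]
    return (feedback
            .groupby(["participant", "test"])["seconds"]
            .sum()
            .unstack("test"))
```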
Note that the treatments were the only materials where the feedback appeared and that the participants saw this feedback before they took either posttest.
Before performing the ANOVAs, I checked the three assumptions of normality, homogeneity of variance, and sphericity. Information about the normality assumption is given above. The results of checking the other two assumptions are given for each ANOVA in turn below.
I began with the analysis of the display times for the repeated questions (ANOVA 1 in Table 7). I used Levene's test to check the homogeneity of variance, and the results showed that this assumption was met. Mauchly's test showed that the assumption of sphericity was violated, so I used the Greenhouse-Geisser correction for the significance values in this analysis.
Next, I analyzed the display times for the new questions (ANOVA 2 in Table 7). I used Levene's test to check the homogeneity of variance, and the results showed that this assumption was met. The assumption of sphericity was met because at least three levels of a variable are necessary for sphericity to become a problem (Field, 2009, p. 459), and each variable in this ANOVA had only two levels.
Finally, I analyzed the display times for the feedback (ANOVA 3 in Table 7). I used Levene's test to check the homogeneity of variance, and the results showed that this assumption was violated. The assumption of sphericity was met because each variable in this ANOVA had only two levels. Although not all of the assumptions of a mixed-design ANOVA were met in this case, I decided to run the analyses anyway because no alternative analysis was available. The results, therefore, will be interpreted with caution.
3.5 Summary of Analyses
To sum up this section, the tables below detail the research questions and the analyses used to investigate them. The analyses used on the participants' gain scores and conditional probabilities are shown in Table 6. The analyses used on the display times are shown in Table 7.
Table 6: Research Questions and Corresponding Analyses of Gain Scores and Conditional Probabilities
Research question | Within-subject (repeated-measures) factor
1a. Do feedback timing and type affect gain scores on the 5-minute- and 1-week-delayed posttests on repeated questions? | Gain scores from the pretest to the 5-minute- and 1-week-delayed posttests
1b. Do feedback timing and type affect gain scores on the 5-minute- and 1-week-delayed posttests on new questions? | Gain scores from the pretest to the 5-minute- and 1-week-delayed posttests
2a. Does the conditional probability of a correct posttest answer differ by feedback timing and type for repeated questions initially answered correctly? | Conditional probabilities on the 5-minute- and 1-week-delayed posttests (R2/R1 and R3/R1)
2a. Does the conditional probability of a correct posttest answer differ by feedback timing and type for new questions initially answered correctly? | Conditional probabilities on the 5-minute- and 1-week-delayed posttests (R2/R1 and R3/R1)
2b. Does the conditional probability of a correct posttest answer differ by feedback timing and type for repeated questions initially answered incorrectly? | Conditional probabilities on the 5-minute- and 1-week-delayed posttests (R2/W1 and R3/W1)
2b. Does the conditional probability of a correct posttest answer differ by feedback timing and type for new questions initially answered incorrectly? | Conditional probabilities on the 5-minute- and 1-week-delayed posttests (R2/W1 and R3/W1)
Note. All analyses are 2 x 2 x 2 mixed-design ANOVAs with feedback timing (item-by-item or end-of-test) and feedback type (with or without metalinguistic information) as between-subjects factors.
Table 7: Analyses of Feedback and Question Display Times
ANOVA # | Research question | Display times for | Levels of repeated-measures variable | Tests and treatments included
1 | 1a | Repeated items | 4 | Pretest, Treatments 1 and 2, 5-minute-delayed posttest
2 | 1b | New items | 2 | Pretest, 5-minute-delayed posttest
3 | 1a and 1b | Feedback | 2 | Treatments 1 and 2
Note. All analyses are mixed-design ANOVAs with feedback timing (item-by-item or end-of-test) and feedback type (with or without metalinguistic information) as between-subjects factors.
CHAPTER 4: RESULTS
As explained in Chapter 1, the purpose of the current study is to provide evidence of how varied feedback timing and types affect the acquisition of English by adult second language learners. To accomplish this, ESL learner participants answered multiple-choice questions that required them to apply rules for using English articles, and I investigated how providing or not providing metalinguistic feedback, either item-by-item or at the end of the test, affected their responses. In Chapter 2, I reviewed the relevant theories and previous literature. Although theories disagree and the experimental research is inconclusive about the optimal feedback timing, theory and research generally agree that providing metalinguistic feedback is more effective than not providing it. In Chapter 3, I described the participants, who were ESL learners who primarily spoke Chinese and Arabic as their first languages. I also described the materials, which were drag-and-drop multiple-choice questions, feedback, and questionnaires, all delivered via computer. The procedure, shown in Figure 4, took place in the participants' ESL classrooms and was conducted over three class periods. Finally, I described the analysis of the gain scores (Table 6) and question and feedback display times (Table 7), which was primarily accomplished using mixed-design ANOVAs. Below, I provide preliminary results of the pretest and each posttest, then address each of the research questions in turn. Following that is a summary of the results.
4.1 Preliminary Results
Descriptive statistics for the pretest and both posttests are shown in Table 8. I calculated the reliability of each test using Cronbach's alpha. Kline (2005) recommended that alpha be at least .7 for low-stakes tests (p. 182), and some of the statistics for the old and new items do not reach this level. However, this is likely due to the small number of items included (Cortina, 1993). When all 27 test items are included in the calculation, the minimum alpha is .79, which is well within the acceptable range.
Table 8: Overall Test Results
Test | Items | Number of items | Cronbach's alpha | Mean | SD | Min | Max | Mean item discrimination
Pretest | Old items | 17 | .57 | 10.3 | 2.8 | 3 | 14 | .37
Pretest | New items | 10 | .09 | 6.1 | 1.5 | 1 | 8 | .32
Pretest | All items | 27 | .79 | 16.4 | 4.3 | 4 | 22 | .35
5-min.-delayed posttest | Old items | 17 | .81 | 12.2 | 3.7 | 4 | 17 | .49
5-min.-delayed posttest | New items | 10 | .40 | 6.5 | 1.8 | 2 | 9 | .40
5-min.-delayed posttest | All items | 27 | .86 | 18.7 | 5.5 | 6 | 26 | .46
1-week-delayed posttest | Old items | 17 | .74 | 12.0 | 3.3 | 4 | 17 | .44
1-week-delayed posttest | New items | 10 | .25 | 6.7 | 1.6 | 3 | 10 | .36
1-week-delayed posttest | All items | 27 | .81 | 18.6 | 4.9 | 7 | 27 | .41
Note. The number of test takers was 112 for all tests.
4.2 Research Question 1
First, I address the question of whether the timing (item-by-item or end-of-test) or type (with or without metalinguistic information) of feedback on a multiple-choice drag-and-drop test affects student scores on 5-minute-delayed and 1-week-delayed posttests on repeated and new questions.
That is, I investigated whether feedback timing or type affects item or system learning, in the short or somewhat longer term. 4.2.1 Research Question 1a: Repeated items To investigate item learning, I first analyzed the repeated items. For all participants combined, the mean gain from the pretest to the 5-minute-delayed posttest was 1.9 points (SD = 2.9), and the mean gain from the pretest to the 1-week delayed posttest was 0.9 points (SD = 2.5). Thus, overall, gain scores were higher on the first posttest, with a loss of about one point from the first to the second posttest. However, the standard deviations were quite high, with a wide range of individual scores. A breakdown by feedback timing and type of feedback is shown in Table 9. 82 Table 9: Mean (SD) Gain Scores From Pretest to Each Posttest, Repeated Items 5-minute-delayed posttest 1-week-delayed posttest Item-by-item 2.2 (2.6) 0.8 (2.4) End-of-test 1.6 (3.0) 0.9 (2.5) Metalinguistic 2.2 (2.8) 1.1 (2.3) No metalinguistic 1.7 (2.9) 0.6 (2.6) Item-by-item metalinguistic 2.4 (2.8) 1.3 (2.4) Item-by-item, no metalinguistic 2.1 (2.6) 0.1 (2.3) End-of-test metalinguistic 2.0 (2.9) 0.9 (2.2) End-of-test, no metalinguistic 1.3 (3.2) 1.0 (2.8) I performed a 2 x 2 x 2 mixed-design ANOVA with the gain scores as a within-subject factor and feedback timing and with/without metalinguistic feedback as between-subjects factors. The main effect of time (from the 5-minute-delayed posttest to the 1-week-delayed posttest) was significant, F(1, 108) = 27.83, p < .001, η2part = .21 (a large effect size), with the gain scores on the 1-week-delayed posttest being lower. That is, on average, the participants increased their scores from the pretest to each posttest, but the gains were not fully maintained on the 1-weekdelayed posttest. Note that this effect is relatively large. The main effect of feedback timing was not significant, F(1, 108) = 0.18, p = .67, η2part = .002 (a very small effect size), nor was the main effect of feedback type, F(1, 108) = 1.13, p = .29, η2part = .01 (a small effect size). The interaction between feedback timing and feedback type was not significant, F(1, 108) = 0.24, p = .62, η2part =.002 (a very small effect size). The interaction between time and feedback timing was significant, F(1, 108) = 4.20, p = .043, η2part = .037 (a small effect size), with the participants 83 who got item-by-item feedback doing better than the participants who got end-of-test feedback on the 5-minute-delayed posttest, but the two groups performing similarly on the 1-week-delayed posttest (Fig. 5). The effect size is small. This interaction will be considered in more detail in the discussion. The interaction between time and feedback type was not significant, F(1, 108), = 0.013, p = .91, η2part < .001 (a very small effect size). The three-way interaction of time, feedback timing, and feedback type was significant, F(1, 108) = 4.90, p = .029, η2part = .043, which is a small effect size. As Figure 6 shows, the groups that received metalinguistic feedback dropped in their gain scores at approximately the same rate from the 5-minute-delayed to the 1-week delayed posttest, while the groups that did not receive metalinguistic feedback showed differing patterns. The group that got item-by-item nonmetalinguistic feedback showed a sharper drop in gain scores from the first to the second posttest, while the group that got end-of-test feedback without metalinguistic information showed very little decrease in gain scores. 
The interpretation of this interaction will be further considered in the discussion. 84 Figure 5: Gain scores for two feedback timings on both posttests, repeated items only. IBI = item-by-item feedback; EOT = end-of-test feedback. 85 Figure 6: Gain scores of feedback groups on both posttests, repeated items only. IBI = item-byitem feedback; EOT = end-of-test feedback; meta = metalinguistic feedback; nonmeta = no metalinguistic feedback. To determine whether the display time for the questions on the pretest, two treatments, and first posttest was significantly different among the with/without metalinguistic feedback groups and two feedback timing groups, I performed a 4 x 2 x 2 ANOVA on the total display times per participant. The mean display times are shown in Table 10. The main effect of time was significant, F(2.4, 254.4) = 192.6, p < .001, η2part = .64, which is a large effect size, with the mean total display times decreasing over time. This is not surprising because the participants 86 became increasingly familiar with the questions over time, which meant that they needed less time to read them and respond to them. No other effects were significant: feedback timing, F(1, 108) = 0.23, p = .64, η2part = .002 (a very small effect size); feedback type, F(1, 108) = 0.25, p = .62, η2part = .002 (a very small effect size); time x feedback timing, F(2.4, 254.4) = 0.33, p = .75, η2part = .003 (a very small effect size); time x feedback type, F(2.4, 254.4) = 0.99, p = .39, η2part = .009 (a very small effect size); feedback timing x feedback type, F(1, 108) = 2.54, p = .072, η2part = .023 (a small effect size); time x feedback timing x feedback type, F(2.4, 254.4) = 1.77, p = .17, η2part = .016 (a small effect size). 87 Table 10: Mean (SD) Total Time in Seconds Questions Were Displayed, Repeated Items Only Test or treatment Pretest Treatment 1 Treatment 2 5-minute-delayed posttest IBI 250.3 (72.7) 153.4 (43.4) 171.5 (49.2) 136.4 (39.9) EOT 249.5 (73.9) 150.7 (38.3) 168.2 (53.9) 126.4 (50.8) Meta 251.8 (72.1) 158.0 (41.9) 171.3 (55.0) 128.4 (42.2) Nonmeta 247.9 (74.6) 145.7 (38.6) 168.2 (48.1) 134.0 (50.0) IBI meta 265.6 (78.9) 162.0 (44.3) 176.4 (55.7) 139.1 (44.2) IBI no meta 233.1 (62.3) 143.8 (41.1) 166.1 (41.0) 133.4 (35.0) EOT meta 238.4 (63.3) 154.2 (39.9) 166.4 (54.9) 118.0 (38.1) EOT no meta 260.2 (82.6) 147.3 (37.1) 170.0 (53.8) 134.6 (60.3) Note. IBI = item-by-item feedback; EOT = end-of-test feedback; meta = metalinguistic feedback. 88 To determine whether the display time for the feedback on the two treatments was significantly different among the with/without metalinguistic feedback groups and two feedback timing groups, I performed a 2 x 2 x 2 ANOVA on the total display times per participant. The mean display times are shown in Table 11. Note that the assumption of homogeneity of variance was violated for the analysis of the feedback display time, so the results should be interpreted with caution. However, the results are easily explained by the design of the study, as shown below, which leads me to believe that they may be accurate. A significant main effect was found for time, F(1, 108) = 18.37, p < .001, η2part = .15 (a large effect size), with the time spent with the feedback displayed decreasing from the first treatment to the second. Like the question display times, the participants were increasingly familiar with the feedback as they progressed through the treatments, so it is unsurprising that they took less time to read the feedback on the second treatment. 
A significant main effect was also found for feedback timing, F(1, 108) = 4.14, p = .044, η2part = .037 (a small effect size), with the participants who got end-of-test feedback displaying the feedback longer than those who got item-by-item feedback. This may be due to the fact that the participants in the item-by-item feedback group had read the context sentence immediately before getting the feedback, so they did not need to take the time to read the sentence again in detail, while the participants in the end-of-test feedback group needed to take time to refamiliarize themselves with it to understand the feedback. A significant main effect was also found for feedback type, F(1, 108) = 6.90, p = .010, η2part = .060 (a medium effect size), with metalinguistic feedback displayed significantly longer than feedback without metalinguistic information. This effect is expected because there was simply more text on the screen for the metalinguistic feedback condition. 89 Table 11: Mean (SD) Total Time in Seconds Feedback Was Displayed Treatment 1 2 IBI 43.1 (30.0) 34.2 (25.2) EOT 59.8 (50.1) 38.6 (28.3) Meta 61.1 (46.2) 40.8 (30.7) Nonmeta 42.4 (35.5) 32.0 (21.6) IBI meta 55.8 (33.6) 40.4 (31.8) IBI no meta 29.0 (12.5) 27.1 (11.8) EOT meta 66.2 (55.9) 41.1 (30.2) EOT no meta 53.6 (43.9) 36.1 (26.7) Note. IBI = item-by-item feedback; EOT = end-of-test feedback; meta = metalinguistic feedback. A marginally significant interaction was found between time and feedback timing, F(1, 108) = 3.31, p = .072, η2part = .030, which is a small effect size. As Figure 7 shows, the participants who got end-of-test feedback decreased their feedback display times more from Treatment 1 to Treatment 2 than did the participants who got item-by-item feedback. This result may be due to the participants in the end-of-test group spending more time to read the feedback on Treatment 1 because of their need to refamiliarize themselves with the context sentences as explained above, but being familiar enough with the feedback itself on Treatment 2 that they did not read it at all. That is, on Treatment 2, the overall average display time for the feedback was about 36 seconds. Because there were 17 questions, this averages out to just over 2 seconds per feedback screen, which indicates that the participants in general were spending very little time reading the feedback. Thus, although the end-of-test feedback participants needed more time than the item-by-item feedback participants to read the feedback on Treatment 1, both groups 90 nearly stopped reading the feedback at all on Treatment 2. Figure 7: Interaction between time and feedback timing for total feedback display time. IBI = item-by-item feedback; EOT = end-of-test feedback. 4.2.2 Research Question 1b: New items To investigate system learning, I analyzed the new items. For all participants combined, the mean gain from the pretest to the 5-minute-delayed posttest was 0.39 points (SD = 2.8), and the mean gain from the pretest to the 1-week delayed posttest was 1.38 points (SD = 2.0). Thus, overall, gain scores were lower on the first posttest, with a gain of about one point from the first to the second posttest. However, as with the repeated items, the standard deviations were quite 91 high, indicating a wide range of individual scores. A breakdown by feedback timing and type of feedback is shown in Table 12. To investigate system learning, I ran a 2 x 2 x 2 mixed-design ANOVA using the new items. 
The main effect of time (from the 5-minute-delayed posttest to the 1-week-delayed posttest) was significant, F(1, 108) = 41.22, p < .001, η2part = .28 (a large effect size), with the gain scores on the 1-week-delayed posttest being higher. This shows that system learning increased over the time of the study. The main effect of feedback timing was marginally significant, F(1, 108) = 3.61, p = .060, η2part = .032 (a small effect size), with the group that received item-by-item feedback scoring higher. This shows that item-by-item feedback was more effective for system learning than end-of-test feedback. The main effect of feedback type was not significant, F(1, 108) = 1.04, p = .31, η2part = .01 (a small effect size). None of the interaction effects were significant: between feedback timing and feedback type, F(1, 108) = 0.85, p = .36, η2part = .008 (a very small effect size); between time and feedback timing, F(1, 108) = 0.069, p = .79, η2part = .001 (a very small effect size); between time and feedback type, F(1, 108), = 0.044, p = .84, η2part < .001 (a very small effect size); and between time, feedback timing, and feedback type, F(1, 108) = 2.91, p = .097, η2part = .025 (a small effect size). 92 Table 12: Mean (SD) Gain Scores From Pretest to Each Posttest, New Items 5-minute-delayed posttest 1-week-delayed posttest Item-by-item 0.7 (1.5) 1.7 (2.0) End-of-test 0.1 (2.0) 1.1 (2.0) Metalinguistic 0.6 (1.8) 1.5 (2.1) No metalinguistic 0.2 (1.8) 1.2 (1.9) Item-by-item metalinguistic 0.6 (1.7) 1.9 (2.2) Item-by-item no metalinguistic 0.8 (1.3) 1.6 (1.8) End-of-test metalinguistic 0.6 (1.9) 1.2 (2.0) End-of-test no metalinguistic −0.3 (2.1) 0.9 (1.9) To determine whether the display time for the questions on the pretest and first posttest was significantly different among the metalinguistic/nonmetalinguistic feedback groups and two feedback timing groups, I performed a 2 x 2 x 2 ANOVA on the total display times per participant. The mean display times are shown in Table 13. The main effect of time was significant, F(1, 108) = 179.1, p < .001, η2part = .62 (a large effect size), with the display times decreasing over time. As with the repeated question display times, this effect is expected because the participants became more familiar with the questions after reading them the first time, thus needing less time to read them the second time. No significant main effect was found for feedback timing, F(1, 108) = 0.68, p = .41, η2part = .006 (a very small effect size), or for feedback type, F(1, 108) = 0.21, p = .65, η2part = .002 (a very small effect size). No significant interactions were found between time and feedback timing, F(1, 108) = 0.024, p = .88, η2part < .001 (a very small effect size); time and feedback type, F(1, 108) = 0.37, p = .54, η2part = .003 (a very small 93 effect size); or feedback timing and feedback type, F(1, 108) = 1.92, p = .17, η2part = .017 (a small effect size). A significant interaction was found between time, feedback timing, and feedback type, F(1, 108) = 6.29, p = .014, η2part = .055 (a medium effect size). This interaction is shown in Figure 8: The four groups show some differences in how long they display the new questions on the pretest, but cluster closer together on the 5-minute-delayed posttest. This interaction cannot be explained by the variables in the study because none of the participants received feedback during the pretest. 
Therefore, the participants in the groups may have had some differences before the study began that led them to spend different amounts of time reading the questions on the pretest, although these differences were no longer evident by the first posttest. However, it is not clear why these differences affected the display times for the new questions, but not the repeated questions or the feedback. Table 13: Mean (SD) Total Time in Seconds Questions Were Displayed, New Items Only Test Pretest 5-minute-delayed posttest IBI 143.8 (45.9) 88.9 (29.0) EOT 138.7 (49.8) 83.1 (31.2) Meta 143.4 (49.8) 86.2 (32.7) No meta 138.8 (46.1) 85.5 (27.7) IBI meta 155.4 (53.0) 88.4 (32.6) IBI no meta 130.9 (32.9) 89.5 (25.2) EOT meta 131.7 (44.4) 84.0 (33.3) EOT no meta 145.4 (54.5) 82.1 (29.6) Note. IBI = item-by-item feedback; EOT = end-of-test feedback; meta = metalinguistic feedback. 94 Figure 8: Display time interaction for time x feedback timing x feedback type, new questions only. IBI = item-by-item feedback; EOT = end-of-test feedback; meta = metalinguistic feedback; nonmeta = no metalinguistic feedback. 95 4.3 Research Question 2 Next, I addressed the question of whether the timing (item-by-item or end-of-test) and type (with or without metalinguistic information) of feedback has a differential effect on questions that are answered correctly and incorrectly. I calculated four sets of conditional probabilities for the repeated items and for the new items: R2/R1, the probability that an item is answered correctly on the 5-minute-delayed posttest, given that it has been answered correctly on the pretest; R3/R1, the probability that an item is answered correctly on the 1-week-delayed posttest, given that it has been answered correctly on the pretest; R2/W1, the probability that an item is answered correctly on the 5-minute-delayed posttest, given that it has been answered incorrectly on the pretest; and R3/W1, the probability that an item is answered correctly on the 1-week-delayed posttest, given that it has been answered incorrectly on the pretest. Then, for each type of question (repeated and new), I performed two 2 x 2 x 2 mixed-design ANOVAs with the conditional probabilities on the 5minute-delayed and 1-week-delayed posttests (R2/R1 and R3/R1; R2/W1 and R3/W1) as a withinsubject factor and the two feedback timing groups and the groups who received or did not received metalinguistic information as between-subjects factors. 4.3.1 Repeated items To investigate item learning, I performed two 2 x 2 x 2 mixed-design ANOVA using the conditional probabilities for the repeated items. The first ANOVA used the probabilities of answering an item correctly on each posttest given that the item had been answered correctly on the pretest. The means for each condition are shown in Table 12. The results showed no significant main effects of time, F(1, 108) = 0.07, p = .78, η2part = .001 (a very small effect size), feedback timing, F(1, 108) = 0.14, p = .71, η2part = .001 (a very small effect size), or feedback 96 type, F(1, 108) = 1.66, p = .20, η2part = .015 (a small effect size). There were also no significant interaction effects for time x feedback timing, F(1, 108) = 1.63, p = .20, η2part = .015, time x feedback type, F(1, 108) = 0.95, p = .33, η2part = .009 (a very small effect size), feedback timing x type, F(1, 108) = 0.16, p = .69, η2part = .002 (a very small effect size), or time x feedback timing x feedback type, F(1, 108) = 0.002, p = .96, η2part < .001 (a very small effect size). 
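To make the computation concrete, the following minimal Python sketch shows one way such per-participant conditional probabilities could be calculated from 0/1-scored responses. The data layout and variable names are hypothetical illustrations, not the study's actual data or code.

import numpy as np

# Hypothetical layout: rows = participants, columns = items,
# 1 = correct, 0 = incorrect on the pretest and on a posttest.
pretest = np.array([[1, 0, 1, 0, 1],
                    [0, 0, 1, 1, 1]])
posttest_5min = np.array([[1, 1, 1, 0, 1],
                          [1, 0, 1, 1, 0]])

def conditional_probability(pretest_scores, posttest_scores, pretest_value):
    # P(correct on posttest | pretest response == pretest_value), per participant
    probs = []
    for pre_row, post_row in zip(pretest_scores, posttest_scores):
        mask = pre_row == pretest_value
        probs.append(post_row[mask].mean() if mask.any() else np.nan)
    return np.array(probs)

r2_given_r1 = conditional_probability(pretest, posttest_5min, 1)  # R2/R1
r2_given_w1 = conditional_probability(pretest, posttest_5min, 0)  # R2/W1
print(r2_given_r1)  # approximately [1.0, 0.67] for the invented data above
print(r2_given_w1)  # [0.5, 0.5]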
4.3.1 Repeated items

To investigate item learning, I performed two 2 x 2 x 2 mixed-design ANOVAs using the conditional probabilities for the repeated items. The first ANOVA used the probabilities of answering an item correctly on each posttest given that the item had been answered correctly on the pretest. The means for each condition are shown in Table 14. The results showed no significant main effects of time, F(1, 108) = 0.07, p = .78, η2part = .001 (a very small effect size), feedback timing, F(1, 108) = 0.14, p = .71, η2part = .001 (a very small effect size), or feedback type, F(1, 108) = 1.66, p = .20, η2part = .015 (a small effect size). There were also no significant interaction effects for time x feedback timing, F(1, 108) = 1.63, p = .20, η2part = .015, time x feedback type, F(1, 108) = 0.95, p = .33, η2part = .009 (a very small effect size), feedback timing x type, F(1, 108) = 0.16, p = .69, η2part = .002 (a very small effect size), or time x feedback timing x feedback type, F(1, 108) = 0.002, p = .96, η2part < .001 (a very small effect size).

The second ANOVA used the probabilities of answering an item correctly on each posttest given that the item had been answered incorrectly on the pretest. The means for each condition are shown in Table 14. The results showed no significant main effects of time, F(1, 108) = 2.44, p = .12, η2part = .022 (a small effect size), feedback timing, F(1, 108) = 0.003, p = .96, η2part < .001 (a very small effect size), or feedback type, F(1, 108) = 0.10, p = .75, η2part = .001 (a very small effect size). There were also no significant interaction effects for time x feedback timing, F(1, 108) = 0.034, p = .85, η2part < .001 (a very small effect size), time x feedback type, F(1, 108) = 0.35, p = .55, η2part = .003 (a very small effect size), or feedback timing x type, F(1, 108) = 0.039, p = .85, η2part < .001 (a very small effect size). A significant three-way interaction was found for time x feedback timing x feedback type, F(1, 108) = 9.15, p = .003, η2part = .078, which is a medium effect size. Figure 9 shows this interaction: The group that got end-of-test metalinguistic feedback patterned with the group that got item-by-item feedback without metalinguistic information, and the group that got item-by-item metalinguistic feedback patterned with the group that got end-of-test feedback without metalinguistic information. On the 5-minute-delayed posttest, the former two groups were more likely to correctly answer a question that they had answered incorrectly on the pretest, as compared to the latter two groups. On the 1-week-delayed posttest, however, the former groups' probabilities of answering correctly dropped, while the latter groups' probabilities increased. Note, however, that the probability of the item-by-item metalinguistic group did not drop as sharply as that of the end-of-test group without metalinguistic information.

Table 14: Mean Conditional Probabilities (Standard Deviations), Repeated Items

                                 R2/R1        R3/R1        R2/W1        R3/W1
Item-by-item                     .81 (.16)    .79 (.17)    .58 (.23)    .56 (.22)
End-of-test                      .79 (.18)    .80 (.16)    .59 (.28)    .56 (.24)
Metalinguistic                   .79 (.19)    .77 (.18)    .60 (.27)    .56 (.23)
No metalinguistic                .81 (.15)    .82 (.14)    .57 (.24)    .56 (.23)
Item-by-item metalinguistic      .80 (.18)    .76 (.18)    .56 (.25)    .58 (.23)
Item-by-item no metalinguistic   .83 (.11)    .82 (.15)    .60 (.20)    .53 (.20)
End-of-test metalinguistic       .78 (.20)    .78 (.19)    .64 (.29)    .53 (.23)
End-of-test no metalinguistic    .79 (.17)    .82 (.14)    .54 (.27)    .58 (.24)

Figure 9: Probability of selecting the correct response on an item on the 5-minute-delayed posttest (R2/W1) or 1-week-delayed posttest (R3/W1), given that it was answered incorrectly on the pretest, repeated items only. IBI = item-by-item feedback; EOT = end-of-test feedback; meta = metalinguistic feedback; nonmeta = no metalinguistic feedback.

Rather than attempt to interpret this complex interaction at face value, I believe that some caution is needed. For this particular interaction, the conditional probabilities used in the analysis were calculated using only the repeated items that the participants answered incorrectly on the pretest. Some participants (n = 5) included in the analysis got as few as three of the items wrong, and 42 of the participants got 3, 4, or 5 questions wrong on the pretest.
This means that the conditional probabilities were very volatile: If a participant who answered only 3 questions incorrectly on the pretest answered one of those questions correctly on the 5-minute-delayed posttest, then answered two of them correctly on the 1-week-delayed posttest, that participant's conditional probability changed from .33 to .67, just by answering one additional question correctly. For this reason, I do not believe that the results of this analysis are reliable or generalizable, especially given the interaction effect, which does not fit well with the results of other analyses. Note that this caution does not apply to questions that were answered correctly on the pretest because only 5 participants answered as few as 5 questions correctly on the pretest, and no participants answered fewer than 5 questions correctly. To test this, I removed all participants who answered 5 or fewer questions incorrectly on the pretest and ran the ANOVA on the conditional probabilities (R2/W1 and R3/W1) again. The analysis contained 70 participants, and none of the main or interaction effects were significant. Of course, this analysis differs from the one that includes all participants, not only in the number of participants included, but also in that the pretest scores were lower. However, I believe that this analysis adds support to my argument that the analysis above should be interpreted with caution.

4.3.2 New items

To investigate system learning, I performed two 2 x 2 x 2 mixed-design ANOVAs using the conditional probabilities for the new items. The first ANOVA used the probabilities of answering an item correctly on each posttest given that the item had been answered correctly on the pretest. The means for each condition are shown in Table 15. The results showed no significant main effects of time, F(1, 108) < 0.001, p = .99, η2part < .001 (a very small effect size), feedback timing, F(1, 108) = 0.02, p = .88, η2part < .001, or feedback type, F(1, 108) = 0.002, p = .97, η2part < .001 (a very small effect size). There were also no significant interaction effects for time x feedback timing, F(1, 108) = 0.57, p = .45, η2part = .005, time x feedback type, F(1, 108) = 0.81, p = .37, η2part = .007 (a very small effect size), feedback timing x type, F(1, 108) = 1.25, p = .27, η2part < .001 (a very small effect size), or time x feedback timing x feedback type, F(1, 108) = 0.78, p = .38, η2part = .007 (a very small effect size).

Table 15: Mean Conditional Probabilities (Standard Deviations), New Items

                                 R2/R1        R3/R1        R2/W1        R3/W1
Item-by-item                     .77 (.19)    .76 (.19)    .47 (.27)    .54 (.26)
End-of-test                      .76 (.22)    .77 (.20)    .45 (.26)    .43 (.23)
Metalinguistic                   .75 (.23)    .77 (.23)    .48 (.26)    .53 (.23)
No metalinguistic                .77 (.18)    .76 (.16)    .44 (.27)    .44 (.27)
Item-by-item metalinguistic      .77 (.20)    .79 (.21)    .49 (.27)    .58 (.26)
Item-by-item no metalinguistic   .76 (.18)    .72 (.16)    .44 (.27)    .50 (.27)
End-of-test metalinguistic       .74 (.25)    .75 (.24)    .48 (.25)    .48 (.19)
End-of-test no metalinguistic    .78 (.18)    .79 (.16)    .43 (.27)    .38 (.26)

The second ANOVA used the probabilities of answering an item correctly on each posttest given that the item had been answered incorrectly on the pretest. The means for each condition are shown in Table 15. The results showed no significant main effects of time, F(1, 108) = 1.15, p = .29, η2part = .011 (a small effect size), feedback timing, F(1, 108) = 2.23, p = .14, η2part = .02 (a small effect size), or feedback type, F(1, 108) = 2.38, p = .13, η2part = .022 (a small effect size).
There was a significant interaction effect for time x feedback timing, F(1, 108) = 5.08, p = .026, η2part = .045 (a small effect size), with similar probabilities of answering correctly on the 5-minute-delayed posttest after answering incorrectly on the pretest for participants who received item-by-item and end-of-test feedback, but a higher probability of answering correctly on the 1-week-delayed posttest for participants who received item-by-item feedback (Fig. 10). There were no significant interaction effects for time x feedback type, F(1, 108) = 0.73, p = .40, η2part = .007 (a very small effect size), feedback timing x type, F(1, 108) = 0.008, p = .93, η2part < .001 (a very small effect size), or time x feedback timing x feedback type, F(1, 108) = 0.13, p = .72, η2part = .001 (a very small effect size).

Figure 10: Probability of selecting the correct response on an item on the 5-minute-delayed posttest (R2/W1) or 1-week-delayed posttest (R3/W1), given that it was answered incorrectly on the pretest, new items only. IBI = item by item; EOT = end of test.

As above for the repeated items, I do not believe that this analysis is reliable because of the volatility of the conditional probabilities for participants who answered very few questions incorrectly on the pretest. I ran the same analysis after removing all participants with 3 or fewer incorrect responses on the pretest (n = 64), and the results showed no significant effects. This lends some support to the claim that the analysis of the conditional probabilities above is not reliable.

4.4 Summary of Results

Below, the significant results for each research question are summarized in Table 16. The display time results are summarized in Table 17. Following that, I summarize the answer to each research question based on the results.

4.4.1 Research Question 1a: Item learning

This question addresses whether the timing (item-by-item or end-of-test) and type (metalinguistic/no metalinguistic) of feedback affected students' scores on posttests on the same questions (i.e., items that the participants got feedback on during the treatments). Thus, this question is about item learning. Overall, the item-by-item metalinguistic feedback was superior to the other timing and type combinations, although the effect sizes involved were relatively small.

Table 16: Research Questions and Results

1. Does timing & type of feedback affect gain scores on 5-minute- & 1-week-delayed posttests
   a. on repeated questions?
      - Main effect of time, F(1, 108) = 27.83, p < .001, η2part = .21. Direction: 5-min.-delayed posttest > 1-week-delayed posttest.
      - Interaction of time x feedback timing, F(1, 108) = 4.20, p = .043, η2part = .037. Direction: 5-min.-delayed posttest, IBI > EOT; 1-week-delayed posttest, IBI ≈ EOT (Fig. 5).
      - Interaction of time x feedback timing x feedback type, F(1, 108) = 4.90, p = .029, η2part = .043. Direction: meta groups dropped in gain scores at the same rate from the 5-min.-delayed to the 1-week-delayed posttest; the IBI group without meta showed a sharper drop in gain scores from the 5-min.-delayed to the 1-week-delayed posttest, while the EOT group without meta showed very little decrease in gain scores (Fig. 6).
   b. on new questions?
      - Main effect of time, F(1, 108) = 41.22, p < .001, η2part = .28. Direction: 5-min.-delayed posttest < 1-week-delayed posttest.
      - Nearly significant main effect of feedback timing, F(1, 108) = 3.61, p = .060, η2part = .032. Direction: IBI > EOT.
2. Does the conditional probability of correctly answering a question on the 5-minute- and 1-week-delayed posttests differ for groups based on timing & type of feedback for questions initially answered
   a. correctly
      - on repeated questions? No significant results.
      - on new questions? No significant results.
   b. incorrectly
      - on repeated questions? Interaction of time x feedback timing x feedback type, F(1, 108) = 9.15, p = .003, η2part = .078. Direction: 5-min.-delayed posttest, EOT meta & IBI no meta > IBI meta & EOT no meta; 1-week-delayed posttest, EOT meta & IBI no meta < IBI meta & EOT no meta (Fig. 9).
      - on new questions? Interaction of time x feedback timing, F(1, 108) = 5.08, p = .026, η2part = .045. Direction: 5-min.-delayed posttest, IBI ≈ EOT; 1-week-delayed posttest, IBI > EOT (Fig. 10).

Note. IBI = item by item; EOT = end of test; meta = metalinguistic feedback.
Table 17: Results of ANOVAs of Display Times

Time: repeated questions, F(2.4, 254.4) = 192.6, p < .001, η2part = .64*; new questions, F(1, 108) = 179.1, p < .001, η2part = .62*; feedback, F(1, 108) = 18.37, p < .001, η2part = .15*. Display times became shorter over time.

Feedback timing: repeated questions, F(1, 108) = 0.23, p = .64, η2part = .002; new questions, F(1, 108) = 0.68, p = .41, η2part = .006; feedback, F(1, 108) = 4.14, p = .044, η2part = .037*. EOT > IBI.

Feedback type: repeated questions, F(1, 108) = 0.25, p = .62, η2part = .002; new questions, F(1, 108) = 0.21, p = .65, η2part = .002; feedback, F(1, 108) = 6.90, p = .010, η2part = .060*. Meta > no meta.

Time x feedback timing: repeated questions, F(2.4, 254.4) = 0.33, p = .75, η2part = .003; new questions, F(1, 108) = 0.024, p = .88, η2part < .001; feedback, F(1, 108) = 3.31, p = .072, η2part = .030.

Time x feedback type: repeated questions, F(2.4, 254.4) = 0.99, p = .39, η2part = .009; new questions, F(1, 108) = 0.37, p = .54, η2part = .003; feedback, F(1, 108) = 2.30, p = .13, η2part = .021.

Feedback timing x feedback type: repeated questions, F(1, 108) = 2.54, p = .072, η2part = .023; new questions, F(1, 108) = 1.92, p = .17, η2part = .017; feedback, F(1, 108) = 1.06, p = .31, η2part = .010.

Time x feedback timing x feedback type: repeated questions, F(2.4, 254.4) = 1.77, p = .17, η2part = .016; new questions, F(1, 108) = 6.29, p = .014, η2part = .055*; feedback, F(1, 108) = 0.18, p = .67, η2part = .002. The significant three-way interaction for the new questions reflects the four groups showing some differences in how long they displayed the new questions on the pretest, but clustering closer together on the 5-minute-delayed posttest.

Note. Results marked with an asterisk are significant at the level of p < .05. IBI = item-by-item feedback; EOT = end-of-test feedback; meta = metalinguistic feedback.

Two significant or nearly significant results support this conclusion. First, the larger gain scores for the item-by-item group on the 5-minute-delayed posttest show that the item-by-item feedback was more effective in the short term for item learning, although in the longer term, both feedback timings were equally effective. At first glance, it may seem that neither timing was terribly effective in the longer term, with both groups gaining less than one point on average from the pretest to the 1-week-delayed posttest. However, the total number of repeated items was 17, and many of the participants (n = 42) answered only 3, 4, or 5 of these items incorrectly on the pretest. A second result supports the conclusion that item-by-item metalinguistic feedback was superior: The interaction between time, feedback timing, and feedback type showed that the group that received item-by-item metalinguistic feedback had higher gain scores on both posttests than all other groups.
Still, the item-by-item metalinguistic feedback group’s average gain from the pretest to the 1-week-delayed posttest was less than 1.5 points. 4.4.2 Research Question 1b: System learning This question addresses whether the timing (item-by-item or end-of-test) and type (metalinguistic/no metalinguistic) of feedback affected participants’ scores on posttests on new questions (i.e., items that the participants did not encounter or get feedback on during the treatments). Thus, this question is about system learning. The results support the superiority of item-by-item feedback generally, regardless of the type of feedback, with the caveat that the effect approached significance, with a relatively small effect size. 4.4.3 Research Questions 2a and b: Reinforcing correct responses and correcting errors Research Question 2a addresses whether differing feedback timings and types affect the degree to which correct responses were reinforced from the pretest to each posttest. However, no significant main effects or interactions were found for any of the groups on either repeated or 107 new items. Therefore, I found no evidence to support a claim of differences due to feedback timing or type on these conditional probabilities. Research Question 2b addresses whether differing feedback timings and types affect the degree to which a participant corrects initially incorrect responses from the pretest to each posttest. The results indicate a complicated answer. First, looking at item learning, the interaction of time x feedback timing x feedback type was significant, with a small to medium effect size. The results showed that in the shorter term, the group that got end-of-test metalinguistic feedback and the group that got item-by-item feedback without metalinguistic information corrected more of their initially incorrect responses than did the group that got item-by-item metalinguistic feedback and the group that got end-oftest feedback without metalinguistic information. In the longer term, this trend reverses, such that the groups most likely to correct their initial errors were those that got item-by-item metalinguistic feedback and the group that got end-of-test feedback without metalinguistic information. However, due to the volatility of the conditional probabilities, I do not believe that this interaction should be interpreted. Next, for system learning, the group that got item-by-item feedback showed similar or better conditional probabilities than the end-of-test group on both posttests, with a small effect size. If one were to interpret this effect at face value, it would indicate that for system learning, item-by-item feedback was superior to end-of-test feedback in helping students to correct errors, especially on the 1-week-delayed posttest. However, as with the interaction effect above, I believe that the volatility of these conditional probabilities limits their usefulness. 108 CHAPTER 5: DISCUSSION Below, I summarize the findings of the current study in relation to the purpose, which was to provide evidence of how varied feedback timing and type affect the learning of English article rules by adult learners. Following that is a discussion of possible explanations for the findings related to Research Question 1, then a discussion of the two interaction effects found for item learning. Finally, I discuss some potential reasons why the results of the current study differed from those of previous studies with similar designs. 
5.1 Summary of Findings Some of the results reported above support the cognitive processing window theory (Doughty, 2001) and the SLA attention-based theory (e.g., Gass, 1997; Pica, 1994; Schmidt, 1995). For item learning, the item-by-item metalinguistic feedback group showed greater mean gain scores than the other feedback groups, lending support to both the cognitive processing window theory and the SLA attention-based theory. For system learning, the item-by-item feedback groups showed higher mean gain scores than the end-of-test feedback groups on the 5minute-delayed posttest, and the item-by-item metalinguistic feedback group had the highest mean gain scores on the 1-week-delayed posttest, again lending support to the cognitive processing window and the SLA attention-based theory. The results for the reinforcement of correct responses provided no evidence to support either theory, either for item or system learning. The results of error correction for item and system learning also provided no evidence due to the unreliability of the conditional probabilities. Because none of the results showed that end-of-test feedback was more effective than item-by-item feedback, no support was found for either of the educational psychology theories examined above, that is, the interferenceperseveration hypothesis (Kulhavy & Anderson, 1972) and the dual-trace hypothesis. 109 5.2 Research Question 1 In addition to the analyses of the gain scores, I further investigated the results for Research Questions 1a and b by analyzing the time that the participants spent with the questions and feedback displayed on their screens. Below, I discuss how these further analyses show that this display time does not explain the item and system learning results, and I provide more plausible explanations for these results. 5.2.1 Research Question 1a: Item learning As a possible explanation for why item-by-item metalinguistic feedback was generally superior to end-of-test feedback and feedback without metalinguistic information for item learning, I explored the relative time that the participants spent with questions and feedback displayed. No statistically significant differences were found in the question display times for the repeated questions, other than an overall decrease in display time as the participants moved through the tests and treatments, which was also seen for the new item and feedback display times. For the feedback display times, the metalinguistic feedback was displayed longer than the feedback without metalinguistic information, and the end-of-test feedback was displayed longer than the item-by-item feedback. Therefore, the length of time that the feedback was displayed might explain why the metalinguistic feedback was more effective, but not why the item-by-item feedback was more effective. None of the predictions of the educational psychology theories for item learning, which include the dual-trace hypothesis (Clariana et al., 2000; Glover, 1989; Kulik & Kulik, 1988; Rankin & Trepper, 1978) and interference-perseveration hypothesis (Kulhavy & Anderson, 1972), were borne out by the results. A more convincing theoretical explanation of the differences in the gain scores among the groups is provided by two of the theories underlying the 110 interaction approach: the limitations of the cognitive processing window (Doughty, 2001) in combination with the attention-based explanation (e.g., Gass, 1997; Pica, 1994; Schmidt, 1995). 
As predicted by the cognitive processing window theory, the feedback that was provided immediately after a participant answered a question was more effective than feedback that was provided at the end of the test. In addition, as predicted by the attention-based explanation, metalinguistic feedback was more effective than feedback without metalinguistic information. Although it was not possible to make a theoretical prediction for the interaction between feedback timing and type, the results showed that the effects were, to some degree, additive, with item-by-item metalinguistic feedback the most effective condition. 5.2.2 Research Question 1b: System learning For system learning, the item-by-item feedback groups showed higher mean gain scores than the end-of-test feedback groups on the 5-minute-delayed posttest, and the item-by-item metalinguistic feedback group had the highest mean gain scores overall. The display times for the questions and feedback were analyzed in an effort to explain this result. For the question display times, the mean display times decreased over the course of the study for all participants. The only other significant result was a three-way interaction of time x feedback timing x feedback type. This showed that on the pretest, the item-by-item metalinguistic feedback group spent, on average, longer with the new items displayed than the other groups did. This could explain why the item-by-item metalinguistic group performed better than the other groups. However, because the pretest was identical for all feedback groups, I cannot explain why the item-by-item metalinguistic group spent longer reading the questions on this test. It may have been due to individual differences that randomly were concentrated in this group. Thus, if the time spent reading the questions is indeed the reason for the higher gain scores, this result may 111 not be replicable. If, on the other hand, the nature and timing of the feedback is what caused the higher gain scores, rather than the time spent reading the questions, then the results will be replicable. No analyses could be run for the feedback display times on the new items because, by design, no feedback was ever given on these items. However, the feedback display times for the repeated items might explain the gain scores on the new items, given that the feedback was also relevant to the new items. The groups that got end-of-test feedback displayed the feedback significantly longer than the item-by-item groups did, which cannot explain the fact that the item-by-item groups had higher gain scores. The metalinguistic feedback groups also displayed the feedback significantly longer than the feedback groups who did not receive metalinguistic information, and no corresponding relationship was seen in the gain scores. Thus, the feedback display times do not explain the gain scores. Given the inconsistent relationships seen between the gain scores and the question and feedback display times for the new questions, the display times are not a convincing explanation for the difference between the groups in gain scores. As with item learning, the results for system learning corresponded to the predictions made based on the cognitive processing window theory, with no support for any of the educational psychology theories mentioned. 
However, unlike the results for item learning, the results for system learning do not provide any support for the SLA attention-based theory because no significant differences were found between the groups that got feedback with and without metalinguistic information. Another interesting difference between the results for item learning and system learning can be seen in the trends from the 5-minute-delayed posttest to the 1-week-delayed posttest. Although item learning decreased on average from the first to second posttest, system learning 112 conversely increased. In both cases, the effect sizes were large. A possible explanation for this is that for item learning, the learners did not necessarily need to deeply process the metalinguistic information they received as feedback; they only needed to memorize the correct responses. Thus, a week after receiving feedback for the last time, they had forgotten some of the feedback. While many educational psychology studies used only one posttest, those that used more than one tended to find a similar drop in item learning over time (e.g., Brosvic et al., 2006a, 2006b; Clariana, Ross, & Morrison, 1991). On the other hand, for system learning, it may be that the learners benefitted from the additional time to process the rules and/or exemplars provided in the feedback (e.g., Mackey, Gass, & Mcdonough, 2000, p. 474). 5.3 Interaction Effects I found two significant interaction effects, both for item learning. Because these effects were not fully predicted by any of the theories detailed above, I consider here what may have caused them. 5.3.1 Time x feedback timing x feedback type interaction for item learning First, an interaction effect was found for time x feedback timing x feedback type when examining item learning. The graphical representation of this effect is repeated below as Figure 11. The two metalinguistic feedback groups dropped in their gain scores at approximately the same rate from the 5-minute-delayed to the 1-week delayed posttest, while the groups that received feedback without metalinguistic information showed differing patterns. The gain scores of the group that got item-by-item feedback without metalinguistic information dropped sharply from the first to the second posttest, while the gain scores of the group that got end-of-test feedback without metalinguistic information remained nearly the same. 113 Figure 11: (Duplicate of Figure 6.) Gain scores of feedback groups on both posttests, repeated items only. IBI = item by item feedback; EOT = end of test feedback; meta = metalinguistic feedback; nonmeta = no metalinguistic feedback. Before proceeding, consider that although I have mentioned that item learning may be the result of memorization, in fact, the repeated items can be answered correctly either by memorizing the correct response or by applying a metalinguistic rule. Therefore, it should not be unexpected that on the repeated items, the groups that got metalinguistic feedback did as well as or better than the groups that did not get metalinguistic feedback, simply because they had more potential resources to draw on. In addition, the downward trend in gain scores for all groups is 114 likely due to the participants forgetting the memorized items (e.g., Brosvic et al., 2006a, 2006b; Clariana, Ross, & Morrison, 1991), although forgetting may have also affected their memory of the metalinguistic rules. 
In addition, getting feedback on an item-by-item basis, even without metalinguistic information, may have provided an advantage on the 5-minute-delayed posttest for memorizing the correct answers because the feedback was provided within the cognitive processing window. This explains why the group that got end-of-test feedback without metalinguistic information had the lowest mean gain score on the 5-minute-delayed posttest. What remains to be explained is why the downward trends occurred at different rates.

To explain these effects, a few assumptions are necessary. Because the results of the current study agree well with the cognitive processing window theory and the SLA attention-based theory, I assume that these theories apply to this interaction effect as well. First, let us assume that the groups that got metalinguistic feedback were better able to explicitly learn the metalinguistic rules than the groups that did not get metalinguistic feedback, as predicted by the SLA attention-based theory. Let us also assume that the item-by-item feedback groups had an easier time memorizing the answers because the feedback was provided within the cognitive processing window. Finally, while both memorized responses and metalinguistic rules may be forgotten, metalinguistic information (implicit or explicit) may take time to be fully processed (e.g., Mackey, Gass, & McDonough, 2000, p. 474). Therefore, a stronger effect of forgetting may be seen for memorized responses than for metalinguistic rules on the 1-week-delayed posttest.

With these assumptions in place, I can explain the differing rates at which the four groups decreased in gain scores from the 5-minute-delayed posttest to the 1-week-delayed posttest. The item-by-item feedback group that did not get metalinguistic information had a relatively easy time memorizing the correct responses to answer the questions on the 5-minute-delayed posttest because these participants received feedback within the cognitive processing window, but they forgot most of the correct responses by the time of the 1-week-delayed posttest. Without much metalinguistic information to fall back on, this group's mean gain scores fell to around zero by the time of the delayed posttest. On the other hand, the end-of-test feedback group that did not get metalinguistic information did not do as well at memorizing the correct responses because the answers were not provided within the cognitive processing window. However, because all of the feedback was provided consecutively, without being interrupted by the need to respond to new questions, this group may have been better able to implicitly or explicitly derive the rules from the feedback they did receive, compared to the item-by-item groups. Therefore, their mean gain score fell less from the first to the second posttest.

5.3.2 Time x feedback timing interaction for item learning

Next, an interaction effect was found for time x feedback timing in item learning (Figure 12). On the 5-minute-delayed posttest, the participants who got item-by-item feedback outperformed the participants who got end-of-test feedback, while the mean scores were nearly the same on the 1-week-delayed posttest, with the end-of-test group even doing somewhat better than the item-by-item group. This interaction may best be understood by considering the three-way interaction that was explained above.

Figure 12: (Duplicate of Figure 5.) Gain scores of feedback groups on both posttests, repeated items only.
IBI = item by item feedback; EOT = end of test feedback. Because these two interaction effects come from the same analysis, this two-way interaction can be understood as the result of averaging the effects seen in the three-way interaction above. That is, the item-by-item feedback group here is the average of the item-byitem feedback groups (that did and did not get metalinguistic information) in the three-way interaction, and the end-of-test feedback group here is the average of the end-of-test feedback groups (that did and did not get metalinguistic information) in the three-way interaction. Thus, the explanation above also applies here. 117 5.4 Why Results Differ Between Current Study and Previous Literature Given that previous research on feedback timing and metalinguistic feedback found a wide range of results from a wide range of initial conditions, the results of the current study inevitably differ from some of the previous results. Below, I consider why my results differ from those of studies that were similar in design to the current study and from those of studies that made generalizations that appear to apply to the current study. The design of the current study is most similar to that of Henshaw (2011). However, some of the results differ, with Henshaw finding no significant difference between feedback provided on an item-by-item basis and at the end of a test. A potential explanation for this difference is that the design of Henshaw’s (2011) study included instruction that was provided before the treatments and feedback, whereas no instruction was provided in the current study. According to Goo and Mackey (2013), the effects of instruction may mitigate the effects of feedback (p. 152). Thus, the lack of a significant difference between the item-by-item and endof-test feedback found by Henshaw may have been due to the effectiveness of the instruction, which all participants received. In addition, the current results are somewhat at odds with the conclusions of Shute (2008), who distilled the disparate results of the educational psychology literature into recommendations for providing formative feedback. She found that immediate feedback (left defined relative to delayed feedback) was generally preferable to delayed feedback, which is clearly consistent with the current results. However, she also suggested that delayed feedback might be better for promoting the transfer of learning, citing Kulhavy et al. (1985) and Schroth (1992). In contrast, the current study showed that item-by-item feedback was superior to end-of-test feedback for both item and system learning. One obvious difference between the contexts is that Shute’s 118 conclusion was drawn based on studies that did not involve language learning, and she may not have intended transfer to include rule learning. In addition, Shute’s suggestion about delayed feedback, by her own admission, needed more research. Two researchers have demonstrated that relatively immediate feedback is more effective for easy items, while relatively delayed feedback is more effective for difficult items (Clariana et al., 2000; Kulhavy & Anderson, 1972). I did not determine difficulty ratings for the items in the current study, and it would be useful to address this in future studies. However, to some extent, the participants themselves determined which questions were easy for them personally by answering them correctly on the pretest. These questions can be considered subjectively easy for each participant. 
Conversely, the questions that were difficult for individual participants were those that they answered incorrectly on the pretest. These questions can be considered subjectively difficult. Thus, the analyses of the conditional probabilities in which the participant answered correctly (R2/R1 and R3/R1) or incorrectly (R2/W1 and R3/W1) on the pretest are relevant here. The former analyses did not produce any significant results, and I argued that the latter results were unreliable, both for item and system learning. None of these results support the generalization that immediate feedback is better for easy items and delayed feedback better for difficult items. The reason for this discrepancy between the current study and the previous research may be that previous studies determined item difficulty not by individual participant, but for the population overall, using data from previous research (Clariana et al., 2000) or the researchers' intuition (Kulhavy & Anderson, 1972). In addition, the previous research did not involve second-language learning, which by its nature may be different from other types of learning.

On a related note, Goo (2011) found that metalinguistic feedback was more effective when the rule involved was a simple one, whereas nonmetalinguistic feedback was more effective when the rule was complex. Goo does not precisely define complexity, so it is difficult to say how the rules in the current study would be classified. Because they are not purely syntactic rules (i.e., semantic features must be taken into account to use the rules appropriately), they might be classified as complex. Therefore, simpler rules than the ones used in the current study might show a stronger effect for metalinguistic feedback and potentially provide more support for the SLA attention-based theory.

CHAPTER 6: CONCLUSION

Below, I summarize the major findings of the current study, then explore the pedagogical, CALL, theoretical, and research implications. Following that, I examine the limitations of the study and some possible future directions.

6.1 Summary of Findings

Overall, the results of the current study show that item-by-item feedback is superior to end-of-test feedback for both item learning and system learning, a result suggested by the findings of Lai, Fei, and Roots (2008), Nagata (1996), and Sakai (2004). The results also provide evidence that providing metalinguistic feedback is superior to not providing metalinguistic feedback for item learning, in agreement with the findings of Li (2010) and Lyster and Saito (2010). Little evidence was found that either feedback timing (item-by-item or end-of-test) or feedback type (with or without metalinguistic information) affected the reinforcement of correct responses or the correction of errors, for either item or system learning. Thus, the bulk of the evidence in this study supports the cognitive processing window theory (Doughty, 2001), and some of the evidence also supports the SLA attention-based theory (e.g., Gass, 1997; Pica, 1994; Schmidt, 1995). In addition, the results show that metalinguistic feedback provided on an item-by-item basis may provide more of an advantage than other combinations of feedback timing and type. Interestingly, the results of the current study support both the efficacy of item-by-item feedback and its efficiency. That is, the participants who got the item-by-item feedback spent less time with the feedback displayed on the screen than did the learners who got the end-of-test feedback.
No significant differences related to feedback timing were found in the display times for the questions themselves, with the exception of the item-by-item metalinguistic feedback group spending longer reading the new item questions on the pretest, which is unrelated to the treatment. Although the results showed some evidence for the effectiveness of metalinguistic feedback, the participants who got metalinguistic feedback spent more time reading it than the other participants spent reading the feedback without metalinguistic information. Thus, metalinguistic feedback is not more efficient than feedback without metalinguistic information.

6.2 Pedagogical and CALL Implications

One could argue that immediate feedback is the current best practice in computer-assisted language learning. However, as I have shown in the literature review, immediate is interpreted in various ways, and both item-by-item and end-of-test feedback have been regarded as immediate. We cannot blithely assume that all "immediate" feedback timings are the same; in fact, the current study shows that item-by-item and end-of-test feedback may not be equivalent in terms of student learning. From a practical standpoint, a teacher might most effectively provide students with both item-by-item feedback, which may offer learning advantages, and end-of-test feedback, which students could use for further review, as suggested by Cohen (1985). However, if efficiency is a concern, learners may have the least feedback to read while still getting the most effective results if feedback is provided on an item-by-item basis, but with metalinguistic feedback provided only for questions that are answered incorrectly, similarly to Crooks's (1988) suggestion of providing informational feedback beyond knowledge of results only for errors.

The current study provides evidence that item-by-item feedback is superior to end-of-test feedback, which lends some support to the effectiveness claims of the companies that produce CALL applications, such as Rosetta Stone, Tell Me More, Duolingo, Open English, and Pimsleur, which all provide item-by-item feedback on at least some of their exercises. It also lends some support to researchers who have claimed that immediate feedback is superior to delayed feedback (e.g., Amaral & Meurers, 2011; Heift, 2010; Nagata & Swisher, 1995; Nagata, 1999). However, I caution researchers, materials designers, and companies to use precise terms, such as those in Dempsey and Wager's (1988) taxonomy of feedback timing in computer-based instruction, rather than immediate and delayed. I expand on this caution in the following section. In addition, while I did find a statistically significant difference between item-by-item and end-of-test feedback in some cases in the current study, the effect sizes were small. Therefore, I question the emphasis that is placed on the immediacy of feedback by both researchers and CALL companies. While item-by-item feedback may indeed be better than end-of-test feedback in some situations, simply providing feedback on this schedule is unlikely to make a large difference in what students learn. Rather, it is one factor among many that may help students to learn more effectively and efficiently. In other situations, item-by-item feedback may not be appropriate or as effective as end-of-test feedback, and future research is needed to determine when item-by-item feedback is best.

6.3 Theoretical and Research Implications

I draw several implications for SLA research and theory from the current study.
First, the current results must not be extrapolated to claim that immediate feedback is better than delayed feedback. Second, in certain situations, researchers can use conditional probabilities to glean information from data that may otherwise remain hidden. Third, multiple-choice questions can provide a forum for learners to notice a contrast between their own output and a native-like version. Finally, I consider the suggestion that metalinguistic feedback and item-by-item feedback timing have additive effects. I elaborate on each of these implications below.

While the results of the current study show that item-by-item feedback was on average more effective than end-of-test feedback, this should by no means be extrapolated to suggest that, in general, relatively immediate feedback is more effective than relatively delayed feedback. For example, the current study provides no evidence related to the relative efficacy of end-of-test feedback and feedback provided at a 24-hour delay. Based on the cognitive processing window theory, which was generally supported by the current results, I would expect end-of-test and 24-hour-delayed feedback to both be less effective than item-by-item feedback. However, their relative effectiveness is an empirical question that is worth addressing in future studies.

Next, I consider the utility of examining conditional probabilities. My original reason for analyzing them was to investigate differences in the predictions of the educational psychology theories, and in the end no evidence was found to support any of the theories whose predictions required examining the conditional probabilities. However, the results obtained from investigating the conditional probabilities also had the potential to provide more detailed insight into what the participants learned correctly or incorrectly through their participation in the study. The conditional probabilities for error correction were volatile due to many of the participants answering relatively few questions incorrectly on the pretest, which meant that the results were not reliable. Ideally, tests for which conditional probabilities are used would be more difficult and longer than the one used here, providing a bigger pool of items that are answered both correctly and incorrectly on the pretest for each participant. Given appropriate data, I believe that conditional probabilities can make a useful contribution to an analysis. In addition, in the current study, I did not examine whether the incorrect responses selected were the same from the pretest to each posttest (in part because there were only three choices), but that has the potential to provide yet more information about the effect of a treatment. Despite the common use of conditional probabilities in educational psychology research (e.g., Butler et al., 2007; Clariana et al., 2000; Peeck & Tillema, 1978; T. A. Smith & Kimball, 2010; Surber & Anderson, 1975), in the SLA literature, results have not traditionally been broken down in this fashion. SLA researchers are (or should be) interested both in corrective feedback and in feedback that helps reinforce the knowledge of correct answers. Therefore, we should be interested in the type of conditional probabilities that were used in the current study. We should not neglect the potentially revealing information available to us from conditional probabilities, given appropriate data.
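As a purely illustrative follow-up to the suggestion above about tracking whether learners repeat the same incorrect choice, the following minimal Python sketch shows one way such a check could be computed; the answer key and responses are invented for illustration and are not taken from the study data.

import numpy as np

# Invented example: choices coded 0, 1, 2 for one participant across four items.
key         = np.array([2, 0, 1, 2])
pre_choice  = np.array([2, 1, 0, 0])   # pretest choices
post_choice = np.array([2, 1, 2, 1])   # posttest choices

# Among items answered incorrectly on both tests, how often was the
# same distractor chosen both times?
wrong_both = (pre_choice != key) & (post_choice != key)
same_error = wrong_both & (pre_choice == post_choice)
persistence = same_error.sum() / wrong_both.sum() if wrong_both.any() else float("nan")
print(persistence)  # about 0.33 for this invented data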
Next, I consider the role of multiple-choice questions in providing an opportunity for learners to notice a contrast between their own language and that of a native speaker. Long (1996) stated that implicit negative feedback may be especially useful when it comes immediately after an ungrammatical learner utterance because of learner attention to a response at that point in a conversation. In addition, Lai, Fei, and Roots (2008) and Sakai (2004) reported more noticing when recasts were provided on an item-by-item basis, either in typed CMC or spoken interaction, respectively. Based on this, I argued above that multiple-choice questions with correct answers given as feedback provide a forum for the contrast to be detected, especially when the feedback is provided on an item-by-item basis. Based on the results of the current study that item-by-item feedback was generally more effective than end-of-test feedback, multiple-choice questions may indeed provide this forum for noticing a contrast. Finally, the results showed that item-by-item metalinguistic feedback was the most effective condition, which suggests that the effects of metalinguistic feedback and item-by-item feedback were complementary to some extent. This indicates that metalinguistic feedback may be more effective during the cognitive processing window than outside it, perhaps because the learner is cognitively comparing his or her current explicit metalinguistic understanding with the metalinguistic information provided in the feedback. However, this reasoning is speculative and should be investigated in future research. 125 6.4 Limitations While the findings in the current study are a valuable first step toward determining the most effective and efficient feedback timing, some limitations should be kept in mind. Below, I expand on some of the important limitations. Several limitations of the study are related to control. First, no true control group was included in the study. That is, no group received no feedback, so the results cannot tell us whether the feedback in the current study was more effective than no feedback. However, a vast literature has shown the advantages of feedback (e.g., the following meta-analyses: Li, 2010; Mackey & Goo, 2007; Russell & Spada, 2006), and denying a group the predicted benefits of feedback raises ethical issues (Gass, 2010a). Because the current study was conducted in classrooms, some variables were uncontrollable. For instance, normal classroom interruptions occurred that may have distracted the participants from the tests, treatments, and feedback. The effect of these interruptions was not unduly distributed to any given feedback group, however, because the participants were randomly assigned to the groups within each class. Although the learners’ ability to use articles was controlled by eliminating those who scored at ceiling on the pretest, I did not control for more general English proficiency. In addition, nearly all of the participants indicated that they had prior knowledge of some of the rules used in the study. It is possible that the results would have differed if the rules were entirely new to them. Each participant received feedback on a given item only twice (once per treatment), and the last posttest was completed only one week after the last treatment and two weeks after the pretest. Thus, the current study does not contribute to our understanding of what the participants may have retained in the longer term. 
The end-of-test feedback was by design given closer to the time of the 5-minute-delayed posttest than most of the item-by-item feedback. This may have given the participants who got end-of-test feedback an advantage on the 5-minute-delayed posttest, although the advantage should have been largely eliminated by the time of the 1-week-delayed posttest. The 5-minute delay was incorporated in an attempt to mitigate this advantage, but 5 minutes may not have been long enough to do so.

In the end-of-test feedback condition, upon receiving feedback, a learner could have reactivated an error in working memory by repeating the processing that occurred when he or she initially saw the item. I do not know how often this occurred in the current study. This reactivation could be the reason for the small effect sizes found for the comparison between item-by-item and end-of-test feedback. It may be possible to control this reactivation with a clever design, but a real classroom situation is nearly always going to allow for this reactivation.

The instructions for the pretest (and for the treatments and posttests) stated "Choose the best answer to fit the blank. For some questions, more than one answer is possible, so choose the best one" (Appendix C). This wording was chosen so that the participants would not be confused when they encountered the questions that had two possible answers. However, this choice of words may have been confusing to the participants because it implies that one answer is better than the other. A more helpful choice of wording may have been "For some questions, more than one answer is possible, so choose the one that you like best."

Finally, the small number of questions used in the tests may have reduced the effect sizes in this study. For example, only 10 new items were included in the analyses, and some participants who correctly answered 8 out of 10 of these items on the pretest were included in the analysis. Even if the treatments were as effective as possible, these participants could only have gained 2 points on the posttests. Thus, a more difficult test may have led to larger effect sizes.

6.5 Future Directions

The optimal timing of feedback in the context of second language acquisition is an under-researched area, and interest in this area is only beginning to pick up speed (Aubrey & Shintani, 2014; Quinn, 2013; Sheen, 2012). Given the potential applications of this research for teachers, instructional designers, and companies that design CALL applications, I expect to see more research on this topic in the near future. In addition, there is a great need to understand how the timing of feedback interacts with the type of feedback and learners' prior knowledge to more effectively design learning. Below, I describe some future directions for this area of inquiry.

The context of this study was multiple-choice questions, and no writing, listening, speaking, or extended reading comprehension data were collected. Future studies are needed to determine whether the current results also apply to less controlled learner production and contexts in which learners cannot readily access their metalinguistic knowledge. Studies of other English structures as well as structures in other languages are also needed to extend the results. Although I know how long the participants in the current study left the feedback and questions displayed on their screens, I do not know how long they spent reading.
Previous studies have shown that users read feedback (Heift, 2001), but may skip or skim over feedback on questions that they have answered correctly (Pujolà, 2001) and feedback over three lines long (van der Linden, 1993). Given that the feedback in the current study was longer than three lines and that the feedback was displayed for relatively short mean times, the participants may indeed have skipped over much of it. Future studies could use eye-tracking technology to better determine how long participants actually spend reading feedback. Further insight into what the participants were thinking while they were choosing their response could also be gained by 128 recording the screen during the treatments and tests, possibly in combination with think-aloud protocols or stimulated recalls. By manipulating whether the participants in the current study received metalinguistic feedback and when they received feedback, I tested the predictions of the cognitive processing window (Doughty, 2001) and SLA attention-based theories (e.g., Gass, 1997; Pica, 1994; Schmidt, 1995). However, other theories may also apply to the questions of when feedback should be provided and what the feedback should include. For example, cognitive load theory (e.g., Sweller, 1994) suggests that including more text on the screen may increase the cognitive load of participants with low memory resources, thus reducing their reading comprehension (Ardaç & Unal, 2008). This in turn suggests that providing metalinguistic feedback in addition to correct response feedback may increase L2 learners’ cognitive load and reduce their learning. This prediction is not consistent with the results of the current study but merits further investigation. The current study raises the important question of when feedback should be delivered to learners, and the results provide the tentative answer that item-by-item feedback is both more efficient and more effective than end-of-test feedback. Based on these results, I hope that researchers, teachers, and instructional designers will be mindful of how they use the terms immediate and delayed regarding feedback and consider providing item-by-item feedback where possible. 129 APPENDICES 130 Appendix A: Consent Form Michigan State University IRB ID# i044441 Consent Form Study Title: Effects of Feedback Timing on ESL Students’ Learning of Articles Researcher and Title: Dr. Charlene Polio, Professor, Second Language Studies Department and Institution: Department of Linguistics and German, Slavic, Asian and African Languages, Michigan State University Address and Contact Information: B251 Wells Hall, East Lansing, MI 48823. Phone: 517-884-1502 Email: polio@msu.edu Researcher and Title: Elizabeth (Betsy) Lavolette, PhD candidate, Second Language Studies Department and Institution: Department of Linguistics and German, Slavic, Asian and African Languages, Michigan State University Address and Contact Information: 451 Glenmoor #2, East Lansing, MI 48823. Phone: 808-224-0949 Email: betsy@msu.edu 1. EXPLANATION OF THE RESEARCH and WHAT YOU WILL DO: You are being asked to participate in a research study of how students learn from computers in language classrooms. In this study, you will take five tests on articles (a/an, the, no article). You will also fill out an opinion questionnaire and a background questionnaire. You must be at least 18 years old to participate in this research. 2. YOUR RIGHTS TO PARTICIPATE, SAY NO, OR WITHDRAW: This research is being done as part of your regular classroom work. 
You are being asked for your consent to use this classroom work for research purposes. Participation in this study is strictly voluntary; you have the right to say no. You are free to change your mind and withdraw at any time without consequence or penalty. Your grades will not be affected by whether or not you participate in this research.

3. COSTS AND COMPENSATION FOR BEING IN THE STUDY: There are no costs or compensation for participating in this study.

4. CONTACT INFORMATION FOR QUESTIONS AND CONCERNS: If you have concerns or questions about this study, such as scientific issues, how to do any part of it, or to report an injury, please contact the researcher (Betsy Lavolette, 451 Glenmoor #2, East Lansing, MI 48823. Phone: 808-224-0949. Email: betsy@msu.edu).

Name _________________________________________________________

____ I agree to participate.
____ I do not agree to participate.

Appendix B: Article Rules

Rule 1: Use no article for transportation when it follows by or via.
Rule 2: Use no article for some places when they are used for their main purpose.
Rule 3: Use a/an when a noun does not refer to a specific thing or person—it’s ANY [noun].
Rule 4: Use a/an when mentioning a thing or person that the reader (or listener) does not know about.
Rule 5: Use the when something after a noun makes it definite, especially descriptions starting with that.
Rule 6: Use the when the context makes a noun known to the reader (or listener).

Appendix C: Test Items

The experimental items are shown below, sorted according to the rule they exemplify. The filler items, which have two possible answers, are shown after the experimental items. Each item has three answer choices: a/an [noun], the [noun], and [noun] (without an article). The correct answers are shown below in brackets. For each type of experimental item below, all items appeared on the pretest and posttests, and only the first four also appeared on the treatments. The items with a star had poor item discrimination and were not included in the analysis. All of the filler items appeared on the pretest and posttests, and only the first eight also appeared on the treatments.

Instructions: Choose the best answer to fit the blank. For some questions, more than one answer is possible, so choose the best one.

Rule 1
1.1) Instead of driving, I want to travel by [horse].
1.2) The cheapest way is to go by [train].
1.3) A: How are you traveling to Washington? B: Via [airplane].
*1.4) You’ll get there in 10 minutes if you go via [bicycle].
1.5) I don’t like going to school by [bus] in the winter.
1.6) Let’s not go via [car] because that will be slow.

Rule 2
2.1) She didn’t go to [bed] until 1:00 a.m.
2.2) A: Did you knock on their door? B: Yes, but no one was at [home].
2.3) A: My son is a high school student and my daughter is a college student. B: Can I meet them? A: No, they’re both at [school] now.
*2.4) Customer: If you don’t return my money, I will take you to [court]!
2.5) A: I was too sick to go to my three classes last week. B: Do you feel well enough to go to [class] this week?
2.6) Even when I am traveling, I always go to [church] on Sundays.

Rule 3
*3.1) What kinds of animals have you ever ridden? Have you ridden [a horse]?
3.2) A: If you were rich, would you buy expensive cars? B: No, I would buy [an airplane].
3.3) I just moved into my house, and I need [a bed].
*3.4) They were looking for [a home] near the university.
3.5) I’m thinking of taking [a class] at the college this summer.
3.6) A: Have you seen everything in town? B: Not yet. I’d like to see [a church] if there are any here.
Rule 4
4.1) A: There are many trains in San Francisco, aren’t there? B: Yes, and I read that today, [a train] caught fire during rush hour.
4.2) A: Did you buy the bicycle that we looked at yesterday? B: Not that one, but I did buy [a bicycle].
4.3) A: Did you take your problem to the court in Michigan? B: No, it was [a court] elsewhere.
4.4) I read that computers were stolen from [a school] somewhere in California.
*4.5) He is tired of riding the bus, so he’s looking for [a car].
4.6) A: Are you OK? You look like you were hit by a car! B: No, it was [a bus].

Rule 5
*5.1) I was late because [the train] that I took to work was late.
5.2) A: Did you ride today? B: Yes, I rode [the horse] that is eating grass.
*5.3) I couldn’t sleep because [the bed] in my hotel room was too hard.
5.4) My parents still live in [the home] that I grew up in.
5.5) In front of my house, I saw [the car] that my friend had just bought.
5.6) This morning at 8:00, [the class] that I had was not very interesting.

Rule 6
6.1) The captain checked the engines and wings, then we boarded [the airplane].
6.2) Its front tire was flat, so I could not use [the bicycle].
6.3) The judge asked me to read a statement to [the court].
*6.4) Many of the teachers spoke English, but [the school] did not have English classes.
*6.5) When our driver comes back, [the bus] will leave.
6.6) Do you see the large cross at the front of [the church]?

Fillers (Either “a/an” or “the” is possible)
7.1) [A/The horse] with black spots was standing in the field.
7.2) I took [a/the train] to Chicago.
7.3) Before my flight, I watched [a/the airplane] land smoothly.
7.4) He went back to the store to buy [a/the bicycle] with wide tires.
7.5) The room was only big enough to contain [a/the bed].
7.6) We plan to buy [a/the home] near the lake.
7.7) This desk is from [a/the school] in my town.
7.8) He grew up in [a/the town] directly west of Lansing.
7.9) A: Why did you stop? B: [A/The car] ahead of me stopped.
7.10) Just before I got to the bus stop, [a/the bus] drove past.
7.11) [A/The class] that I am taking now is interesting.
7.12) We saw [a/the church] in the middle of town.

Appendix D: Exit Questionnaire

1. Do you think that your ability to use articles (a, an, the) improved from practicing with the computer? yes/no
2. What part of the practice was helpful? What part was not helpful?
3. What would make the computer practice of articles (a, an, the) more useful to you?
4. Did you learn the following rules for using articles (a, an, the) before participating in this study?
a. Use no article for transportation when it follows by or via.
b. Use no article for some places when they are used for their main purpose.
c. Use a/an when a noun does not refer to a specific thing or person—it’s ANY [noun].
d. Use a/an when mentioning a thing or person that the reader (or listener) does not know about.
e. Use the when something after a noun makes it definite, especially descriptions starting with that.
f. Use the when the context makes a noun known to the reader (or listener).
5. Age in years:
6. Gender: Male   Female   Other
7. What is your first language?
Mandarin (Chinese)
Cantonese (Chinese)
Japanese
Other ____
Spanish
Arabic
Korean
8. How many years have you studied English?
9. How many months total have you lived in an English-speaking country?
10. Any comments?

REFERENCES

Ableeva, R. (2010). Dynamic assessment of listening comprehension in second language learning.
(Doctoral dissertation). Available from ProQuest Dissertation and Theses Database. (UMI Number: 3436042). Alderson, J. C. (2005). Diagnosing foreign language proficiency: The interface between learning and assessment. London, England: Continuum. Aljaafreh, A., & Lantolf, J. P. (1994). Negative feedback as regulation and second language learning in the zone of proximal development. The Modern Language Journal, 78(4), 465– 483. Amaral, L. A., & Meurers, D. (2011). On using intelligent computer-assisted language learning in real-life foreign language teaching and learning. ReCALL, 23(01), 4–24. doi:10.1017/S0958344010000261 Ardaç, D., & Unal, S. (2008). Does the amount of on-screen text influence student learning from a multimedia-based instructional unit? Instructional Science, 36, 75–88. doi:10.1007/s11251-007-9035-4 Aubrey, S. C., & Shintani, N. (2014, March). The effects of synchronous and asynchronous written corrective feedback on grammatical accuracy in a computer-mediated environment. Paper presented at the meeting of the American Association of Applied Linguistics, Portland, OR. Azevedo, R., & Bernard, R. M. (1995). A meta-analysis of the effects of feedback in computerbased instruction. Journal of Educational Computing Research, 13(2), 111–127. Baddeley, A. (2003). Working memory: Looking back and looking forward. Nature Reviews: Neuroscience, 4, 829–839. doi:10.1038/nrn1201 Bangert-Drowns, R. L., Kulik, C.-L. C., Kulik, J. A., & Morgan, M. (1991). The instructional effect of feedback in test-like events. Review of Educational Research, 61(2), 213–238. doi:10.3102/00346543061002213 Barnes, D. (1996). Naming as a technical term: Sacrificing behavior analysis at the alter of popularity? Journal of the Experimental Analysis of Behavior, 65(1), 264–267. Bitchener, J., & Knoch, U. (2009). The relative effectiveness of different types of direct written corrective feedback. System, 37(2), 322–329. doi:10.1016/j.system.2008.12.006 Brosvic, G. M., & Epstein, M. L. (2007). Enhancing learning in the introductory course. The Psychological Record, 57(3), 391–408. 140 Brosvic, G. M., Epstein, M. L., Dihoff, R. E., & Cook, M. J. (2006a). Acquisition and retention of Esperanto: The case for error correction and immediate feedback. The Psychological Record, 56, 205–218. Brosvic, G. M., Epstein, M. L., Dihoff, R. E., & Cook, M. J. (2006b). Retention of Esperanto is affected by delay-interval task and item closure: A partial resolution of the delay-retention effect. The Psychological Record, 56, 597–615. Brown, J. D. (1997). Computers in language testing: Present research and some future directions. Language Learning & Technology, 1(1), 412–414. Brown, J. D. (2005). Item analysis in language testing. In Testing in language programs (pp. 66– 88). New York, NY: McGraw-Hill. Brown, J. D. (2008). Effect size and eta squared. Shiken: JALT Testing & Evaluation SIG Newsletter, 12(2), 38–43. Butler, A. C., Godbole, N., & Marsh, E. J. (2013). Explanation feedback is better than correct answer feedback for promoting transfer of learning. Journal of Educational Psychology, 105(2), 290–298. doi:10.1037/a0031026 Butler, A. C., Karpicke, J. D., & Roediger, H. L. (2007). The effect of type and timing of feedback on learning from multiple-choice tests. Journal of Experimental Psychology: Applied, 13(4), 273–281. doi:10.1037/1076-898X.13.4.273 Carroll, S., & Swain, M. (1993). Explicit and implicit negative feedback: An empirical study of the learning of linguistic generalizations. 
Studies in Second Language Acquisition, 15, 357– 386. Chapelle, C. A. (2003). The potential of technology for language learning. In English language learning and technology: Lectures on applied linguistics in the age of information and communication technology (pp. 35–68). Amsterdam, The Netherlands: John Benjamins. Chun, D. M., & Brandl, K. (1992). Beyond form-based drill and practice: Meaning-enhancing CALL on the Macintosh. Foreign Language Annals, 25(3), 255–267. Clariana, R. B., Ross, S. M., & Morrison, G. R. (1991). The effects of different feedback strategies using computer-administered multiple-choice questions as instruction. Educational Technology Research & Development, 39(2), 5–17. Clariana, R. B., Wagner, D., & Roher Murphy, L. C. (2000). Applying a connectionist description of feedback timing. Educational Technology Research & Development, 48(3), 5–22. Cohen, J. (1988). The analysis of variance. In Statistical power analysis for the behavioral sciences (2nd ed., pp. 273–406). Hillsdale, NJ: Lawrence Erlbaum Associates. 141 Cohen, V. B. (1985). A reexamination of feedback in computer-based instruction: Implications for instructional design. Educational Technology, 25(1), 33–37. Cook, V. (1989). Universal Grammar theory and the classroom. System, 17(2), 169–181. Cortina, J. M. (1993). What is coefficient alpha? An examination of theory and applications. Journal of Applied Psychology, 78(1), 98–104. doi:10.1037//0021-9010.78.1.98 Cowan, N. (2005). The present theoretical approach. In Attention and memory: An integrated framework. (pp. 39–73). Oxford, England: Oxford University Press. Crooks, T. J. (1988). The impact of classroom evaluation practices on students. Review of Educational Research, 58(4), 438–481. Dabrowski, R., LeLoup, J. W., & MacDonald, L. (2013). Effectiveness of computer-graded vs. instructor-graded homework assignments in an elementary Spanish course: A comparative study at two undergraduate institutions. The IALLT Journal, 43(1), 79–100. Dekeyser, R. M. (1997). Beyong explicit rule learning: Automatizing second language morphosyntax. Studies in Second Language Acquisition, 19, 195–221. Dekeyser, R. M. (2007). Skill acquisition theory. In B. Vanpatten & J. Williams (Eds.), Theories in second language acquisition: An introduction (pp. 97–113). Mahwah, NJ: Erlbaum. Dempsey, J. V., & Wager, S. U. (1988). A taxonomy for the timing of feedback in computerbased instruction. Educational Technology, 28(10), 20–25. Dihoff, R. E., Brosvic, G. M., Epstein, M. L., & Cook, M. J. (2004). Provision of feedback during preparation for academic testing: Learning is enhanced by immediate but not delayed feedback. The Psychological Record, 54, 207–231. Doughty, C. J. (2001). Cognitive underpinnings of focus on form. In P. Robinson (Ed.), Cognition and second language instruction (pp. 206–257). Cambridge, England: Cambridge University Press. Doughty, C. J., & Long, M. H. (2003). Optimal psycholinguistic environments for distance foreign language learning. Language Learning & Technology, 7(3), 50–75. Ebel, R. L. (1979). How to improve test quality through item analysis. In Essentials of educational measurement (3rd ed., pp. 258–273). Englewood Cliffs, NJ: Prentice-Hall. El Saadawi, G. M., Tseytlin, E., Legowski, E., Jukic, D., Castine, M., Fine, J., … Crowley, R. S. (2008). A natural language intelligent tutoring system for training pathologists: Implementation and evaluation. Advances in health sciences education: Theory and practice, 13(5), 709–722. 
doi:10.1007/s10459-007-9081-3 142 Ellis, N. (2006). Cognitive perspectives on SLA: The associative-cognitive CREED. AILA Review, 19, 100–121. Ellis, N. (2008). Usage-based and form-focused SLA: The implicit and explicit learning of constructions. In A. Tyler, Y. Kim, & M. Takada (Eds.), Language in the context of use: Discourse and cognitive approaches to language (pp. 99–126). Berlin, Germany: Mouton de Gruyter. Ellis, R., Loewen, S., & Erlam, R. (2006). Implicit and explicit corrective feedback and the acquisition of L2 grammar. Studies in Second Language Acquisition, 28, 339–368. doi:10.1017/S0272263106060141 English, R. A., & Kinzer, J. R. (1966). The effect of immediate and delayed feedback on retention of subject matter. Psychology in the Schools, 3(2), 143–147. Field, A. (2009). Discovering statistics using SPSS (3rd ed.). London, England: Sage. Gagné, R. M. (1985). What is learned—Varieties. In The conditions of learning and theory of instruction (3rd. ed., pp. 46–69). New York, NY: Holt, Rinehart and Winston. García, M. R., & Arias, F. V. (2010). A comparative study in motivation and learning through print-oriented and computer-oriented tests. Computer Assisted Language Learning, 13(4–5), 457–465. Gass, S. M. (1997). Input, interaction, and the second language learner. Mahwah, NJ: Lawrence Erlbaum Associates. Gass, S. M. (2010a). Experimental research. In B. Paltridge & A. Phakiti (Eds.), Continuum companions to research methods in applied linguistics (pp. 7–21). London, England: Continuum. Gass, S. M. (2010b). Interactionist perspectives on second language acquisition. In R. B. Kaplan (Ed.), The Oxford handbook of applied linguistics (2nd. ed., pp. 217–231). Oxford, England: Oxford University Press. Gaynor, P. (1981). The effect of feedback delay on retention of computer-based mathematical material. Journal of Computer-Based Instruction, 8(2), 28–34. Glover, J. A. (1989). The “testing” phenomenon: Not gone but nearly forgotten. Journal of Educational Psychology, 81(3), 392–399. doi:10.1037//0022-0663.81.3.392 Goda, Y. (2004). Feedback timing and learners’ response confidence on learning English as a foreign language (EFL): Examining the effects of a computer-based feedback and assessment environment on EFL students' language acquisition. (Doctoral dissertation). Available from ProQuest Dissertation and Theses Database. (UMI No. 3123098). 143 Goo, J. (2011). Corrective feedback, individual variation in cognitive capacities, and L2 development: Recasts vs. metalinguistic feedback. (Doctoral dissertation). Available from ProQuest Dissertation and Theses Database. (UMI Number: 3450729). Goo, J., & Mackey, A. (2013). The case against the case against recasts. Studies in Second Language Acquisition, 35(1), 127–165. doi:10.1017/S0272263112000708 Guzmán-Muñoz, F. J., & Johnson, A. (2008). Error feedback and the acquisition of geographical representations. Applied Cognitive Psychology, 22(7), 979–995. doi:10.1002/acp Hartshorn, K. J., Evans, N. W., Merrill, P. F., Sudweeks, R. R., Strong-Krause, D., & Anderson, N. J. (2010). Effects of dynamic corrective feedback on ESL writing accuracy. TESOL Quarterly, 44(1), 84–109. doi:10.5054/tq.2010.213781 Hattie, J., & Timperley, H. (2007). The power of feedback. Review of Educational Research, 77(1), 81–112. doi:10.3102/003465430298487 Hayes, S. C., Barnes-Holmes, D., & Roche, B. (Eds.). (2001). Relational frame theory: A postSkinnerian account of human language and cognition. Hingham, MA: Kluwer. Heift, T. (2001). 
Error-specific and individualised feedback in a Web-based language tutoring system: Do they read it? ReCALL, 13(1), 99–109. Heift, T. (2003). Drag or type, but don’t click: A study on the effectiveness of different CALL exercise types. Canadian Journal of Applied Linguistics, 6(1), 69–85. Heift, T. (2004). Corrective feedback and learner uptake in CALL. ReCALL, 16(2), 416–431. doi:10.1017/S0958344004001120 Heift, T. (2006). Context-sensitive help in CALL. Computer Assisted Language Learning, 19(2– 3), 243–259. doi:10.1080/09588220600821552 Heift, T. (2010). Developing an intelligent language tutor. CALICO Journal, 27(3), 443–459. Henshaw, F. (2011). Effect of feedback timing in SLA: A computer-assisted study on the Spanish subjunctive. In C. Sanz & R. P. Leow (Eds.), Implicit and explicit language learning: Conditions, processes, and knowledge in SLA and bilingualism (pp. 98–113). Washington, DC: Georgetown University Press. Jaehnig, W., & Miller, M. L. (2007). Feedback types in programmed instruction: A systematic review. The Psychological Record, 57(2), 219–232. Kane-lturrioz, R. (2008). Computer-based language assessment: A formative approach. ReCALL, 9(1), 15–21. doi:10.1017/S0958344000004584 144 King, P. E., Young, M. J., & Behnke, R. R. (2000). Public speaking performance improvement as a function of information processing in immediate and delayed feedback interventions. Communication Education, 49(4), 37–41. Kline, T. (2005). Psychological testing: A practical approach to design and evaluation. Thousand Oaks, CA: Sage. Kregar, S. (2011). Relative effectiveness of corrective feedback types in computer-assisted language learning. (Doctoral dissertation). Available from ProQuest Dissertations and Theses database. (UMI No. 3477247). Kulhavy, R. W., & Anderson, R. C. (1972). Delay-retention effect with multiple-choice tests. Journal of Educational Psychology, 68(5), 505–512. Kulhavy, R. W., & Stock, W. A. (1989). Feedback in written instruction: The place of response certitude. Educational Psychology Review, 1(4), 279–308. Kulhavy, R. W., White, M. T., Topp, B. W., Chan, A. L., & Adams, J. (1985). Feedback complexity and corrective efficiency. Contemporary Educational Psychology, 10, 285–291. Kulik, J. A., & Kulik, C.-L. C. (1988). Timing of feedback and verbal learning. Review of Educational Research, 58(1), 79–97. doi:10.2307/1170349 Lai, C., Fei, F., & Roots, R. (2008). The contingency of recasts and noticing. CALICO Journal, 26(1), 70–90. Lan, Y.-J., Sung, Y.-T., & Chang, K.-E. (2007). A mobile-device-supported peer-assisted learning system for collaborative early EFL reading. Language Learning & Technology, 11(3), 130–151. Lantolf, J. P. (2008). Dynamic assessment: The dialectic integration of instruction and assessment. Language Teaching, 42(3), 355–368. doi:10.1017/S0261444808005569 Lantolf, J. P., & Poehner, M. E. (2004). Dynamic assessment of L2 development: Bringing the past into the future. Journal of Applied Linguistics, 1(1), 49–72. doi:10.1558/japl.1.1.49.55872 Lavolette, E. (2013). Feedback timing effects on ESL students’ learning of articles rules. Unpublished manuscript. Lavolette, E., Polio, C., & Kahng, J. M. (2013). The usefulness of computer-mediated feedback in essay revision. Manuscript submitted for publication. Lee, L. (2008). Focus-on-form through collaborative scaffolding in expert-to-novice online interaction. Language Learning & Technology, 12(3), 53–72. 
Retrieved from 145 http://llt.msu.edu/vol12num3/vol12num3.pdf?q=microsoft-word-700-mhz-faqfinal#page=60 Lewis, M. W., & Anderson, J. R. (1985). Discrimination of operator schemata in problem solving: Learning from examples. Cognitive Psychology, 17(1), 26–65. doi:10.1016/00100285(85)90003-9 Li, S. (2010). The effectiveness of corrective feedback in SLA: A meta-analysis. Language Learning, 60(2), 309–365. doi:10.1111/j.1467-9922.2010.00561.x Lin, J.-W., Lai, Y.-C., & Chuang, Y.-S. (2013). Timely diagnostic feedback for database concept learning. Educational Technology & Society, 16(2), 228–242. Loewen, S. (2004). Uptake in incidental focus on form in meaning-focused ESL lessons. Language Learning, 54(1), 153–188. Loewen, S., & Erlam, R. (2006). Corrective feedback in the chatroom: An experimental study. Computer Assisted Language Learning, 19(1), 1–14. doi:10.1080/09588220600803311 Long, M. H. (1996). The role of the linguistic environment in second language acquisition. In W. Ritchie & T. Bhatia (Eds.), Handbook of second language acquisition (pp. 413–468). New York, NY: Academic. Long, M. H. (2007). Problems in SLA. Mahwah, NJ: Lawrence Erlbaum Associates. Lyster, R., & Saito, K. (2010). Oral feedback in classroom SLA: A meta-analysis. Studies in Second Language Acquisition, 32, 265–302. doi:10.1017/S0272263109990520 Mackey, A., Gass, S., & Mcdonough, K. (2000). How do learners perceive interactional feedback? Studies in Second Language Acquisition, 22(4), 471–497. Mackey, A., & Goo, J. (2007). Interaction research in SLA: A meta-analysis and research synthesis. In A. Mackey (Ed.), Conversational interaction in second language acquisition: A collection of empirical studies (pp. 407–452). Oxford, England: Oxford University Press. Mandernach, B. J. (2005). Relative effectiveness of computer-based and human feedback for enhancing student learning. The Journal of Educators Online, 2(1), 1–17. Master, P. (2002). Information structure and English article pedagogy. System, 30(3), 331–348. doi:10.1016/S0346-251X(02)00018-0 Metcalfe, J., Kornell, N., & Finn, B. (2009). Delayed versus immediate feedback in children’s and adults' vocabulary learning. Memory & Cognition, 37(8), 1077–1087. doi:10.3758/MC.37.8.1077 146 Moreno, N. (2007). The effects of type of task and type of feedback on L2 development in CALL. (Doctoral dissertation). Available from ProQuest Dissertations and Theses database. (UMI No. 3302088). Muranoi, H. (2000). Focus on form through interaction enhancement: Integrating formal instruction into a communicative task in EFL classrooms. Language Learning, 50(4), 617– 673. Murphy, P. (2010). Web-based collaborative reading exercises for learners in remote locations: The effects of computer-mediated feedback and interaction via computer-mediated communication. ReCALL, 22(2), 112–134. doi:10.1017/S0958344010000030 Nagata, N. (1993). Intelligent computer feedback for second language instruction. Modern Language Journal, 77(3), 330–339. Nagata, N. (1996). Computer vs. workbook instruction in second language acquisition. CALICO Journal, 14(1), 53–75. Nagata, N. (1997). The effectiveness of computer-assisted metalinguistic instruction: A case study in Japanese. Foreign Language Annals, 30(2), 187–200. Nagata, N. (1999). The effectiveness of computer-assisted interactive glosses. Foreign Language Annals, 32(4), 469–479. doi:10.1111/j.1944-9720.1999.tb00876.x Nagata, N., & Swisher, M. (1995). 
A study of consciousness-raising by computer: The effect of metalinguistic feedback on second language learning. Foreign Language Annals, 28(3), 337–347. Norris, J. M., & Ortega, L. (2000). Effectiveness of L2 instruction: A research synthesis and quantitative meta-analysis. Language Learning, 50(3), 417–528. doi:10.1111/00238333.00136 Opitz, B., Ferdinand, N. K., & Mecklinger, A. (2011). Timing matters: The impact of immediate and delayed feedback on artificial language learning. Frontiers in Human Neuroscience, 5, 1–9. doi:10.3389/fnhum.2011.00008 Park, O., & Gittelman, S. S. (1992). Selective use of animation and feedback in computer-based instruction. Educational Technology Research & Development, 40(4), 27–38. Peeck, T., & Tillema, H. H. (1978). Delay of feedback and retention of correct and incorrect responses. The Journal of Experimental Education, 47(2), 171–178. Phye, G. D., & Andre, T. (1989). Delayed retention effect: Attention, perseveration, or both? Contemporary Educational Psychology, 14, 173–185. 147 Pica, T. (1994). Research on negotiation: What does it reveal about second-language learning conditions, processes, and outcomes? Language Learning, 44(3), 493–527. Pienemann, M. (1998). An introduction to Processability Theory. In Language processing and second language development: Processability theory (pp. 1–73). Amsterdam, The Netherlands: John Benjamins. Poehner, M. E. (2007). Beyond the test: L2 dynamic assessment and the transcendence of mediated learning. Modern Language Journal, 91(3), 323–340. Poehner, M. E. (2008). Dynamic assessment: A Vygotskian approach to understanding and promoting L2 development. New York, NY: Springer Science+Business Media. Poehner, M. E., & Lantolf, J. P. (2013). Bringing the ZPD into the equation: Capturing L2 development during computerized dynamic assessment (C-DA). Language Teaching Research, 17(3), 323–342. doi:10.1177/1362168813482935 Pujolà, J.-T. (2001). Did CALL feedback feed back? Researching learners’ use of feedback. ReCALL, 13(1), 79–98. Quinn, P. (2013, March). The effects of altering the timing of corrective feedback. Paper presented at the meeting of the American Association of Applied Linguistics, Dallas, TX. Rankin, R. J., & Trepper, T. (1978). Retention and delay of feedback in a computer-assisted instructional task. The Journal of Experimental Education, 46(4), 67–70. Rosa, E. M., & Leow, R. P. (2004). Computerized task-based exposure, explicitness, type of feedback, and Spanish L2 development. Modern Language Journal, 88(2), 192–216. Russell, J., & Spada, N. (2006). The effectiveness of corrective feedback for the acquisition of L2 grammar. In J. M. Norris & L. Ortega (Eds.), Synthesizing research on language learning and teaching (pp. 147–178). Amsterdam, The Netherlands: John Benjamins. Sakai, H. (2004). Roles of output and feedback for L2 learners’ noticing. JALT Journal, 26(1), 25–54. Sanz, C., & Morgan-Short, K. (2004). Positive evidence versus explicit rule presentation and explicit negative feedback: A computer-assisted study. Language Learning, 54(1), 35–78. Sauro, S. (2009). Computer-mediated corrective feedback and the development of L2 grammar. Language Learning & Technology, 13(1), 96–120. Saxton, M. (1997). The contrast theory of negative input. Journal of Child Language, 24(1), 139–161. Retrieved from http://www.ncbi.nlm.nih.gov/pubmed/9154012 148 Saxton, M., Backley, P., & Gallaway, C. (2005). Negative input for grammatical errors: Effects after a lag of 12 weeks. Journal of Child Language, 32, 643–672. 
doi:10.1017/S0305000905006999 Schmidt, R. (1995). Consciousness and foreign language learning: A tutorial on the role of attention and awareness in learning. In R. Schmidt (Ed.), Attention and awareness in foreign language learning (pp. 1–63). Honolulu, HI: University of Hawai’i Press. Schooler, L., & Anderson, J. R. (1990). The disruptive potential of immediate feedback. Proceedings of the Twelfth Annual Conference of the Cognitive Science Society, 702–708. Schroth, M. L. (1992). The effects of delay of feedback on a delayed concept formation transfer task. Contemporary Educational Psychology, 17, 78–82. Schroth, M. L. (1995). Variable delay of feedback procedures and subsequent concept formation transfer. The Journal of General Psychology, 122(4), 393–399. Schroth, M. L., & Lund, E. (1993). Role of delay of feedback on subsequent pattern recognition transfer tasks. Contemporary Educational Psychology, 18, 15–22. Schwartz, B. D., & Gubala-Ryzak, M. (1992). Learnability and grammar reorganization in L2A: Against negative evidence causing the unlearning of verb movement. Second Language Research, 8(1), 1–38. doi:10.1177/026765839200800102 Sheen, Y. (2007). The effects of corrective feedback, language aptitude, and learner attitudes on the acquisition of English articles. In A. Mackey (Ed.), Conversational interaction in second language acquisition: A collection of empirical studies (pp. 301–322). Oxford, England: Oxford University Press. Sheen, Y. (2012, October). The timing of corrective feedback and L2 learning. Paper presented at the Second Language Research Forum, Pittsburgh, PA. Shute, V. J. (2008). Focus on formative feedback. Review of Educational Research, 78(1), 153– 189. Skinner, B. F. (1968). The technology of teaching. New York, NY: Appleton-Century-Crofts. Smith, T. A., & Kimball, D. R. (2010). Learning from feedback: Spacing and the delay-retention effect. Journal of Experimental Psychology: Learning, Memory, and Cognition, 36(1), 80– 95. doi:10.1037/a0017407 Smits, M. H. S. B., Boon, J., Sluijsmans, D. M. A., & Van Gog, T. (2008). Content and timing of feedback in a web-based learning environment: Effects on learning as a function of prior knowledge. Interactive Learning Environments, 16, 183–193. 149 Spada, N., & Tomita, Y. (2010). Interactions between type of instruction and type of language feature: A meta-analysis. Language Learning, 60(2), 263–308. doi:10.1111/j.14679922.2010.00562.x Sturges, P. T. (1978). Delay of informative feedback in computer-assisted testing. Journal of Educational Psychology, 70(3), 378–387. doi:10.1037//0022-0663.70.3.378 Surber, J. R., & Anderson, R. C. (1975). Delay-retention effect in natural classroom settings. Journal of Educational Psychology, 67(2), 170–173. Sweller, J. (1994). Cognitive load theory, learning difficulty, and instructional design. Learning and Instruction, 4(4), 295–312. Van der Kleij, F. M., Eggen, T. J. H. M., Timmers, C. F., & Veldkamp, B. P. (2012). Effects of feedback in a computer-based assessment for learning. Computers & Education, 58(1), 263–272. doi:10.1016/j.compedu.2011.07.020 Van der Linden, E. (1993). Does feedback enhance computer-assisted language learning? Computers & Education, 21(1–2), 61–65. doi:10.1016/0360-1315(93)90048-N Webb, J. M., Stock, W. A., & McCarthy, M. T. (1994). The effects of feedback timing on learning facts: The role of response confidence. Comtemporary Educational Psychology, 19, 251–265. White, L. (1991). 
Adverb placement in second language acquisition: Some effects of positive and negative evidence in the classroom. Second Language Research, 7(2), 133–161. doi:10.1177/026765839100700205 Whyte, M. M., Karolick, D. M., Nielsen, M. C., Elder, G. D., & Hawley, W. T. (1995). Cognitive styles and feedback in computer-assisted instruction. Journal of Educational Computing Research, 12(2), 195–203. doi:10.2190/M2AV-GEHE-CM9G-J9P7 Yanguas, Í. (2010). Oral computer-mediated interaction between L2 learners: It’s about time! Language Learning & Technology, 14(3), 72–93. Retrieved from http://www.llt.msu.edu/issues/october2010/yanguas.pdf 150