UNPACKING THE COMPARISON OF L1 AND L2 GLOSSES IN VOCABULARY
                     LEARNING FROM READING
                                    By
                              Yingzhao Chen
                            A DISSERTATION
                                Submitted to
                        Michigan State University
                in partial fulfillment of the requirements
                             for the degree of
            Second Language Studies – Doctor of Philosophy
                                   2023


                                            ABSTRACT
        The appropriate amount of first language (L1) and second language (L2) to use in L2
learning has been constantly debated (e.g., Cummins, 2007; Hall & Cook, 2012). This study
situates the debate of L1 and L2 use in the context of vocabulary learning from reading. By
examining the potential moderating factors on the comparison of L1 and L2 glosses (i.e., short
word definitions provided during reading), the study aims to provide a nuanced picture of how
L1 and L2 input affects vocabulary learning in various circumstances. Investigating L1 and L2
glosses in the context of vocabulary learning also allows the study to contribute to the theories of
bilingual lexicon (e.g., Kroll & Stuart, 1994; Jiang, 2000), i.e., the development of the bilingual
lexicon as a function of input language.
        One hundred and eighteen L2 learners of English completed the study. Participants first
read part of a graded reader, where 24 target words were embedded. Glosses for the target words
were inserted through hyperlinks: participants could click the target words to access their
glosses, written either in the participants’ L1 or L2. Participants’ time spent on reading each
gloss was tracked. After reading, participants went through unannounced vocabulary posttests
that measured receptive and productive meaning knowledge, and lexical retrieval fluency of the
target words. Participants also filled in an exit questionnaire that aimed to further probe their
reading and gloss access behaviors. Participants’ gloss reading time, their vocabulary size, and
the target words’ frequency of occurrence (FoO) were analyzed as moderating variables on the
comparison of L1 and L2 glosses.
        Findings revealed that the comparative effects of L1 and L2 glosses were primarily
moderated by participants’ gloss reading time and target word FoO, suggesting that the initial
depth of processing and subsequent memory reactivation were the keys to successful vocabulary


learning. Results have pedagogical implications for how to choose the language for glosses,
theoretical implications on the bilingual lexicon development, and methodological implications
of using hyperlinks to track behaviors during learning.


                                   ACKNOWLEDGEMENTS
        I am lucky to have been surrounded by kind and supportive people throughout my PhD
journey. In particular, I would like to express my gratitude to the following individuals, without
whom this endeavor would not have been possible.
        First, I am deeply indebted to my advisor, Dr. Shawn Loewen. Shawn has always
supported my research endeavors, encouraging me to probe deeper and take on new adventures.
He cheers for my success and never hesitates to offer emotional support during challenging
times. He shares personal stories to help me navigate different aspects of academia. He is open-
minded, humorous, and easy going. There are always smiles and laughs when working with him.
I am also grateful to Dr. Aline Godfroid, for her guidance in my interdisciplinary research
pursuit. Many thanks to other members of my committee: Drs. Meagan Driver and Sandra
Deshors, for their constructive feedback on my dissertation and other research proojects.
        I hope to express my appreciation to three individuals who are not on my committee: Drs.
Nan Jiang, Paula Winke, and Patti Spinner. Nan has inspired my research. I gain new insights
into bilingualism research from every interaction with him. Paula is always willing to set aside
time in her busy schedule to talk to me and offer me career advice. Patti is my mentor in
teaching. She is encouraging and kind, giving me much freedom to explore new things.
        I am lucky to have done my PhD in the Second Language Studies program. I would like
to thank students and faculty members in the program for a supportive working environment.
        This dissertation project would not have been possible without the financial support from
the Duolingo Dissertation Grant, the National Federation of Modern Language Teachers
Associations (NFMLTA) Dissertation Support Grant, the International Research Foundation for
English Language Education (TIRF) Doctoral Dissertation Grant, the Mango Languages
                                                 iv


Dissertation Award, and the Dissertation Completion Fellowship from the College of Arts and
Letters and the Second Language Studies program. I would also like to acknowledge the time
and effort of my participants.
        Last but not least, words cannot express my gratitude and love to my family. I would not
have survived all the challenges and made it this far without my family.
                                                v


                                          TABLE OF CONTENTS
INTRODUCTION .......................................................................................................................... 1
CHAPTER 1: LITERATURE REVIEW ........................................................................................ 3
CHAPTER: 2 METHOD .............................................................................................................. 37
CHAPTER 3: RESULTS .............................................................................................................. 60
CHAPTER 4: DISCUSSION AND CONCLUSION ................................................................. 100
REFERENCES ........................................................................................................................... 117
APPENDIX A: TASK INSTRUCTIONS................................................................................... 131
APPENDIX B: VOCABULARY PROFILE OF THE READING MATERIAL....................... 134
APPENDIX C: TARGET WORD CHARACTERISTICS......................................................... 135
APPENDIX D: GLOSSES ......................................................................................................... 138
APPENDIX E: EXIT QUESTIONNAIRE ................................................................................. 141
APPENDIX F: LANGUAGE BACKGROUND QUESTIONNAIRE ....................................... 143
APPENDIX G: STIMULI FOR THE SELF-PACED READING TEST ................................... 150
APPENDIX H: MIXED-EFFECTS MODELLING ................................................................... 153
                                                             vi


                                          INTRODUCTION
         Whether to use second language (L2) alone or both the first language (L1) and the L2 in
L2 learning has long been a contentious issue. Theorists in instructed L2 learning have argued
for a bilingual teaching approach while language policy makers seem to encourage maximal L2
use. Empirical findings from classroom studies indicated that L1 use is beneficial for vocabulary
learning (e.g., Tian & Macaro, 2012; Zhao & Macaro, 2016) but may not be superior to
exclusive L2 use for the learning of other aspects (Brown, 2021; Brown & Lally, 2019). In
contrast to advocates of bilingual teaching in instructed L2 learning, three bilingual lexicon
models, i.e., the Revised Hierarchical Model (the RHM; Kroll & Stewart, 1994), Jiang (2000),
and the Revised Hierarchical Model-Repetition Elaboration Retrieval (RHM-RER; Rice &
Tokowicz, 2020), predicted that L1 input is less conducive to developing high-quality lexical
representations that would allow words to be retrieved fluently. This prediction has been
supported by a number of psycholinguistics studies (e.g., Comesaña et al., 2009; Elgort &
Piasecki, 2014; Jeong et al., 2010).
         Given the contradiction in bilingual lexicon theories and in empirical findings in
instructed L2 learning, this study set out to compare L1 and L2 use in the context of glossing
(i.e., providing a short definition for a word) in vocabulary learning from reading. Current
research on the comparison of L1 and L2 glosses not only has yielded inconsistent findings (H.
S. Kim et al., 2020; Zhang & Ma, 2021) but is also limited in number (H. S. Kim et al., 2020)
and scope, often overlooking important variables related to the learning condition and learner
characteristics. In this study, I examined three variables that may potentially moderate the
comparative effectiveness of L1 and L2 glosses, namely L2 vocabulary size, engagement with
glosses, and target words’ frequency of occurrence (FoO). The study extends previous gloss
                                                   1


language studies by including posttests that measured not only learners’ ability to recognize and
recall word meanings but also their fluency of word retrieval in reading.
        The current study represents one of the first attempts to unpack factors that influence the
effects of gloss language in vocabulary learning. Theoretically, the study contributes to the
debate about how the bilingual lexicon develops as a function of input language. Pedagogically,
findings shed light on how to best utilize glosses based on learner and target word characteristics.
                                                 2


                            CHAPTER 1: LITERATURE REVIEW
Use of First and Second Language in Second Language Learning
         Second language (L2) instructors and learners have the liberty to choose the amount of
first language (L1) and L2 to use in the learning of the L2. For example, instructors may explain
L2 grammatical rules in learners’ L1 and learners may look up the meaning of an L2 word in a
monolingual (i.e., L2 only) dictionary. Whether to use L1 and how much L1 to use in L2
learning have been debated for decades. In this section, I first review classroom research on
language of instruction. I then discuss the use of L1 and L2 input on the development of the
bilingual lexicon. I focus on three bilingual lexicon models, namely the Revised Hierarchical
Model (the RHM; Kroll & Stewart, 1994), Jiang’s (2000) model, and the Revised Hierarchical
Model-Repetition Elaboration Retrieval model (RHM-RER; Rice & Tokowicz, 2020).
Language of Instruction in L2 Classrooms
         The two sides of the debate about language of instruction are monolingual and bilingual
teaching. Monolingual teaching refers to using the L2 exclusively and bilingual teaching
involves the use of both learners’ L1 and L2. Several language teaching approaches have
addressed the amount of L1 and L2 use in L2 classrooms. The grammar-translation approach,
which emphasizes the learning of grammatical rules and the ability to translate written text,
explicitly encourages language teachers to give instruction in learners’ L1 (Celce-Murcia, 2014).
Many have advocated translanguaging practices in language classrooms, which see learners’ L1
and L2 as part of their linguistic repertoire and encourage learners to blend their multiple
languages to achieve communicative goals (e.g., Canagarajah, 2011; Creese & Blackledge, 2010;
De Costa et al., 2017). In contrast, monolingual teaching seems to stem from the view that L2
learning is similar to children acquiring their L1 and takes place when learners are immersed in
                                                  3


an L2-only environment (Cummins, 2007; De la Campa & Nassaji, 2009). For example, the use
of L1 is explicitly denounced in the direct method, which focuses on meaning instead of
grammatical rules and became popular as a response against the grammar-translation approach.
Although communicative language teaching, an approach also with a focus on meaning, does not
ban L1 use entirely, it argues for minimum L1 in L2 classrooms (V. Cook, 2001; Cummins,
2007). Theories aside, when it comes to real-world language teaching, language policy makers
seem to favor the monolingual approach: the American Council on the Teaching of Foreign
Language (ACTFL), in their guiding principles, recommends that teachers should use the L2 at
least 90% of the time (n.d.); both the Japanese and Korean governments promote teaching
English in English (see Kubota, 2018; Macaro & Lee, 2013).
        The different opinions on L1 use in L2 learning reflect several beliefs. The first one
pertains to the goal of language learning. In monolingual teaching, L2 learners’ ultimate goal is
to use the L2 to communicate with native speakers of the language in a monolingual
environment while bilingual teaching aims for learners to use the L2 as a lingual franca in a
multilingual environment (Hall & Cook, 2012). The second belief is on the relationship between
the L1 and the L2. Monolingual teaching largely holds that the L1 and the L2 are separated,
despite the now widely accepted view that the two languages are interdependent (V. Cook, 2001;
Cummins, 2007; Hall & Cook, 2012) and that cognitive, academic, and literary skills in the L1
can be transferred to the L2 (Cummins, 2001). Related to the L1-L2 relationship is the third
belief on the role of L1 in L2 learning. Because the L2 is seen as independent of the L1,
monolingual teaching treats the L1 as an interference that must be eliminated from L2 learning.
Proponents of monolingual teaching argue that only when the L1 is avoided can L2 exposure be
maximized, which is crucial especially in a foreign language learning context, where learners do
                                                4


not have much contact with the L2 beyond the classroom (e.g., Carless, 2007; Chambers, 1991;
Chaudron, 1988; Turnbull, 2001). L1 use is also seen as a means of compensating inadequate L2
proficiency (Sato & Angulo, 2020). On the other hand, in bilingual teaching, the L1 and the L2
are interconnected and the L1 is viewed as a resource L2 learners can draw on. On a cognitive
level, the use of L1 is believed to reduce processing loads on learners and thus facilitate learning
(e.g., Hall & Cook, 2012; Scott & Fuente, 2008); L1 can also be used as a tool to mediate the
process of problem solving (e.g., Antón & DiCamilla, 1998; Moore, 2013; Sato & Angulo, 2020;
Watanabe, 2020). On a social level, L1 use helps preserve L2 learners’ linguistic and cultural
identities (G. Cook, 2010) and maintain linguistic diversity by resisting the dominance of
English as well as English native speakers (Phillipson, 1992).
         In terms of empirical research on the use of L1 and L2 in L2 classrooms, most studies are
observational, documenting the amount of and reasons for L1 use. These studies have found that,
theoretical debate aside, teachers in actual classrooms are likely to use learners’ L1 in one way or
another (e.g., Bruen & Kelly, 2017; De la Campa & Nassaji, 2009; Duff & Polio, 1990; Liu et
al., 2004; Macaro, 2001; Polio & Duff, 1994; Tognini & Oliver, 2012). Teachers use the L1 for
various reasons. Polio and Duff (1994) observed six foreign language classrooms in the US and
identified eight functions of L1 use. Examples of these functions included pedagogy (e.g.,
grammar instruction, explaining difficult words, and ensuring comprehension), classroom
management, and connecting with students (e.g., making jokes and expressing empathy) (see
also Ma, 2019). De la Campa and Nassaji (2009) identified 14 uses of the L1 in two German-as-
a-foreign-language classes. Out of these uses, translating L2 utterances in class was the most
frequent one, followed by building rapport and making the learning atmosphere more
comfortable. Nakatsukasa and Loewen (2015) examined the relationship between the amount of
                                                   5


L1 use and the linguistic areas being taught. They found that in teaching grammar and semantics,
teachers used a similar amount of L1 and L2 while in vocabulary teaching, the majority of
teacher discourse (60%) was in the L2. L1 is also common in peer interactions. Several studies
examining spoken peer interaction have found that learners usually used the L1 for (1) language
issues, such as finding the right word and using grammar correctly; (2) metacognitive purposes,
such as goal setting and planning; and (3) social purposes, such as a casual discussion of
unrelated topics (e.g., Gánem-Gutiérrez & Roehr, 2011; Storch & Aldosari, 2010; Swain &
Lapkin, 2000; Tian & Jiang, 2021; Vraciu & Pladevall-Ballester, 2022; Xu & Fan, 2021). Yu and
Lee (2014) looked into L1 use in peer interaction in written format. Learners in the study gave
peer reviews in an L2 writing task. Results revealed that L1 was mostly used to give comments
on the content of the essays while L2 feedback was mostly directed to form (e.g., vocabulary use
and grammar). L2 proficiency has been shown to affect the amount of L1 use among peers, with
lower-proficiency learners tending to use more L1 than their high-proficiency peers (e.g.,
DiCamilla & Antón, 2012; Xu & Fan, 2021; Yu & Lee, 2014).
        A number of studies went beyond observation and examined learners’ attitudes towards
the inclusion of L1 in L2 learning. Most of these studies indicated that the use of L1 was
welcomed by the majority of learners (e.g., Brevik & Rindal, 2020; Brooks-Lewis, 2009; J. H.
Lee & Lo, 2017; Macaro et al., 2020; Tian & Hennebry, 2016; see Shin et al., 2020 for a review).
Participants in these studies perceived L1 as facilitative in improving comprehension (e.g.,
Brooks-Lewis, 2009), reducing anxiety (e.g., Tian & Hennebry, 2016), and solving language
problems (e.g., Macaro et al., 2020). Studies have also shown that learners, while agreeing that
L1 should be included in L2 classrooms, also asked for more L2 use where possible. Macaro et
al. (2020), for example, listed several areas where learners preferred L2 use over L1, including
                                                 6


giving instructions, explaining new words, asking and answering questions. Primary school
learners in Nilsson (2020) expressed preference for predominant use of the L2, despite concerns
over and actual experiences of difficulties in understanding class content in the L2. Like the
amount of L1 use, attitudes towards inclusion of L1 were shown to be moderated by learners’ L2
proficiency. In general, advanced learners preferred more L2 use than lower-proficiency ones (J.
H. Lee & Lo, 2017; Tian & Hennebry, 2016). Macaro and Lee (2013) found that age also
affected learners’ perception of L1 use, with adult Korean learners being more likely to accept
word definitions given in the L2 than young learners. The authors hypothesized that the age
effect may be related to adult learners’ higher L2 proficiency.
        Fewer studies on language of instruction have directly examined how the use of L1 and
L2 affected language development. Most of these studies focused on word learning (e.g., J. H.
Lee & Levine, 2020; J. H. Lee & Macaro, 2013; Tian & Macaro, 2012; Zhao & Macaro, 2016).
These studies unanimously revealed an advantage of incorporating L1 over using L2 only in
classroom vocabulary learning, whether it was vocabulary learning through reading (J. H. Lee &
Macaro, 2013; Zhao & Macaro, 2016) or listening (J. H. Lee & Levine, 2020; Tian & Macaro,
2012). H. Lee and Lee’s (2022)’s meta-analysis on L2 vocabulary learning through teacher
explanation showed that learning through L2 input yielded fewer gains than L1 input both in the
short term and the long term. Zhao and Macaro (2016) suggested that learning L2 vocabulary
through the L1 made retrieval of word meanings more straightforward, leading to the advantage
of bilingual teaching. Qualitative analysis of their data also indicated that some students
misunderstood or had difficulty understanding word meanings written in the L2, which was
another possible reason for the disadvantage of learning vocabulary solely through the L2. J. H.
Lee and Macaro (2013) found that the comparison of monolingual and bilingual teaching was
                                                 7


moderated by age in that young learners benefitted more from L1 use than adult learners,
echoing findings in Macaro and Lee (2013) that young learners seemed to favored L1 use more
than adult learners. The advantage of bilingual teaching over monolingual teaching was not
altered by L2 proficiency (H. Lee & Lee, 2022.; Tian & Macaro, 2012) nor the concreteness of
target vocabulary items (Zhao & Macaro, 2016). Two longitudinal classroom studies, Brown
(2021) and Brown and Lally (2019), investigated monolingual and bilingual teaching in other
areas of language. Results from these two studies were mixed regarding the comparative
effectiveness of the two teaching approaches. No significant difference was found between the
monolingual and bilingual teaching classes after 15 weeks of instruction in areas of writing and
speaking (Brown & Lally, 2019). Similarly, beginner French L2 learners in the monolingual and
bilingual teaching classrooms made similar progress in listening, reading, writing, and
vocabulary after a 10-week course (Brown, 2021). Beginner Arabic L2 learners in the bilingual
teaching class, however, made significantly more progress in vocabulary than those in the
monolingual teaching class (Brown, 2021).
        The abovementioned empirical studies suggest that the amount of, the attitudes towards,
and the effects on learning of L1 and L2 use in L2 classrooms can be moderated by a number of
factors, including L2 proficiency, age, and the target linguistic areas. This highlights the need for
researchers to switch focus from the ‘whether’, that is, whether L1 should be used at all, to the
‘when’ and ‘how’, that is, choosing the language of instruction based on the target linguistic
structures, learners’ individual characteristics, and other factors. With technology, it is now
possible to adapt language instruction in real time based on learner performance and progress,
making it more important to investigate the ‘when’ and ‘how’ of L1 and L2 use. The current
study’s investigation on factors that affect the comparison of L1 and L2 glosses in a digital
                                                  8


learning environment fits into the broader trend of adaptive language learning. Examining the
moderating variables on the effects of gloss language recognizes the roles of learners’ own
language, seeing a learners’ multiple languages, i.e., L1s and L2s, as a repertoire of linguistic
resources that can be deployed flexibly, rather than separate entities (Canagarajah, 2013; Creese
& Blackledge, 2010, 2015; De Costa et al., 2017).
Input Language in Bilingual Lexicon Development
         Research in the previous section on language of instruction was mostly classroom studies
and came from the perspective of instructed second language acquisition. In this section, I review
L1 and L2 use from a psycholinguistic perspective. Specifically, I focus on bilingual lexicon
development and the effect of input language on it.
         There are two key theoretical issues that most bilingual lexicon models attend to (see
Dijkstra & van Heuven, 2002). The first one concerns the structure of the bilingual lexicon, that
is, whether the L1 and L2 mental lexicons are separated or integrated. The second one discusses
whether bilingual processing is selective or not, that is, whether words from only one language
(i.e., selective) or both languages (i.e., nonselective) are activated. Connectionist models, e.g.,
the Bilingual Interactive Activation (the BIA; Dijkstra et al., 1998) and the BIA+ (Dijkstra & van
Heuven, 2002) models, hypothesized an integrated lexicon and nonselective activation of L1 and
L2 words. The Revised Hierarchical Model (the RHM; Kroll & Stewart, 1994) assumed separate
lexicons for the L1 and the L2, but at the same time acknowledged nonselective lexical access
(Kroll, Bobb, & Wodniecka, 2006; Kroll, van Hell, et al., 2010; cf. Brysbaert & Duyck, 2010).
Jiang’s (2000) model and the Revised Hierarchical Model-Repetition Elaboration Retrieval
model (RHM-RER; Rice & Tokowicz, 2020) were founded on the RHM and posited similar
views on the structure of L1 and L2 lexicons and bilingual processing. The RHM, Jiang (2000),
                                                   9


and the RHM-RER were chosen as theoretical support for the current study mainly for two
reasons. First, separate lexicons better accommodate bilingual processing of language pairs of
different scripts (Kroll, van Hell, et al., 2010), e.g., Chinese and English, which were the L1 and
L2 respectively of participants in the current study. Second, these three models have provided
predictions on the effect of input language (i.e., L1 vs. L2) in bilingual lexicon development,
which is the focus of the current study. The Bilingual Interactive Activation (the BIA; Dijkstra et
al., 1998) and the BIA+ (Dijkstra & van Heuven, 2002) models did not make (explicit)
predictions on the effect of input language.
         The RHM (Figure 1) has three components, namely (1) the L1 lexicon, which stores the
forms of L1 words, (2) the L2 lexicon, which contains the forms of L2 words, and (3) concepts,
which are the meanings of words. Here, forms of words refer to the words’ orthography, i.e.,
spelling and pronunciation. In the RHM, the forms of L1 and L2 words are stored separately and
are connected via a lexical link (○ 1 in Figure 1). The forms of L1 words are connected to their
meanings through direct and strong conceptual links (○     2 in Figure 1). The connections between
the forms of L2 words and meanings, on the other hand, are relatively weaker, especially when
the words are newly learned and/or when learners are in the early stages of L2 learning. In t his
case, access to concepts for L2 words is usually mediated by the L1 through the lexical link. As
L2 learners’ proficiency increases, they may eventually be able to establish direct and strong
conceptual links for L2 words (○  3 in Figure 1). Another key factor that influences learners’
ability to directly access concepts for L2 words is the learning condition. van Hell and Kroll
(2012) pointed out that a meaningful learning context, such as learning through pictures or real-
life situations, contributed to the establishment of direct and strong conceptual links for L2
                                                   10


words while learning through L1 translations only strengthened the lexical links between L1 and
L2 words instead of the direct links between L2 words and concepts.
 Figure 1
 Revised Hierarchical Model
                         lexical links ○  1
             L1                                L2
 conceptual links
          ○2                               conceptual links
                                                   ○3
                           concepts
  Note. Adapted from “Category Interference in Translation and
  Picture Naming: Evidence for Asymmetric Connections between
  Bilingual Memory Representations”, by J. Kroll and E. Stuart, 1994,
  Journal of Memory and Language, 33, p.158
  (https://doi.org/10.1006/jmla.1994.1008). Copyright 1994 by
        Jiang’s (2000) bilingual lexicon model elaborated on the developmental aspect of the
RHM. Jiang proposed three stages for the development of L2 lexical knowledge. In the first
stage, unlike an L1 word, which has L1-specific information for both form and meaning, an L2
lexical entry only has L2-specific information for form but not meaning. Instead, the L2 word
entry contains a ‘pointer’ (p. 50) that links the L2 word to its corresponding L1 word. Access to
meaning for the L2 word will be through its L1 counterpart via the pointer. Jiang hypothesized
that the lack of L2-specific semantic information in the L2 word entry was partly due to the lack
                                                  11


of attention to meaning during L2 word learning: learners can usually understand an L2 word
through its L1 translation without the need to extract meaning from context. As one’s exposure
to the L2 increases, they may reach the second stage where L1 semantic information is integrated
into an L2 word entry instead of being accessed via a pointer. Such integration means that
activation of L1 translations for L2 words is now faster than when L1 translations are accessed
via a pointer. What is common between the first and second stages is that access to meaning for
an L2 word is mediated through the L1 and the link between the L2 word and its concept is
weak. In the final stage of lexical development, an L2 word entry contains L2-specific
information for both form and meaning and an L2 word is connected directly to its concept
without L1 mediation. Jiang cautioned, however, that most L2 words may not reach the final
stage partly because the words were initially taught through their L1 translations. Such learning
method encourages reliance on the L1, and that subsequent exposure to the L2 words may only
serve to reinforce this reliance, preventing the development of L2-specific semantics in a lexical
entry.
        The RHM-RER model (Rice & Tokowicz, 2020) focused on vocabulary training methods
that would contribute to the establishment of direct conceptual links for L2 words. In the model,
the L1 lexicon, L2 lexicon, and concepts were conceptualized as tiers (Figure 2). The first tier
represents the forms of L1 and L2 words. The second tier contains the concepts of the words.
Like in RHM, L1 words are connected to meanings directly whereas the connections between L2
words and meanings are mediated through the L1. Rice and Tokowicz added the third tier to
illustrate training methods that strengthened the connections between the first tier (i.e., form) and
the second tier (i.e., meaning), including repetition, elaboration, and retrieval. The authors
stressed that the key to strong connections between the first and second tiers was to use training
                                                  12


methods that went across tiers, that is, to go beyond form-form connections between L1 and L2
words. Repetition, which usually involves repeating the L2 word and its L1 translation, may be
effective but not sufficient to establish high-quality L2 lexical representations because repetition
stays within the first tier, i.e., making L1-L2 form-form connections. Elaboration, on the other
hand, encourages semantic processing by presenting a word in context, providing synonyms, or
background information of the word. Putting the RHM-RER in the context of gloss language, L1
glosses that simply give L1 translations are similar to the repetition of L1-L2 word pairs while
L2 glosses may contain synonyms and are similar to the method of elaboration.
Figure 2
Revised Hierarchical Model – Repetition, Elaboration, Retrieval
 Note. From “A Review of Laboratory Studies of Adult Second Language
 Vocabulary Training”, by C.A.Rice and N. Tokowicz, 2020, Studies in Second
 Language Acquisition, 42, p.443
 (https://doi.org/10.1017/S0272263119000500). Copyright 2019 by Cambridge
 University Press.
                                                  13


        The prediction of the three abovementioned models that learning through the L1 might
hinder the development of direct conceptual links for L2 words has been supported by a number
of empirical studies in psycholinguistics in the context of intentional learning (e.g., Altarriba &
Mathis, 1997; Comesaña et al., 2009; Elgort, 2011; Elgort & Piasecki, 2014; Finkbeiner & Nicol,
2003; Jeong et al., 2010). For example, Comesaña et al. (2009) taught Spanish-speaking children
L2 Basque words through either L1 translations or pictures. The learners were then tested with a
translation recognition task to see if they had established conceptual links for the newly learned
words. Stimuli of the task were L1-L2 word pairs in three types of relations, namely translation,
semantically related, and unrelated. The learners were asked to decide as fast as possible whether
the L1-L2 word pair on the screen was a correct translation or not. The logic behind the task was
that learners who had developed conceptual links for the L2 words would display the semantic
interference effect, taking longer time and making more errors in rejecting semantically related
pairs than unrelated one. Reaction time (RT) and accuracy data in the study revealed that learners
who received picture explanations showed a greater semantic inference effect than those who
learned through L1 translations, suggesting that learning through pictures was more conducive to
the development of conceptual links for the L2 words than learning through L1 translations. The
comparison of Elgort (2011) and Elgort and Piasecki (2014) offered more direct evidence
regarding the effects of input language on word learning. Both studies followed similar learning
and testing procedures: adult English L2 learners were first introduced to a set of pseudowords;
learners were then given flashcards to take home to study the pseudowords before completing
vocabulary posttests a week later. The flashcards provided the meanings of the pseudowords and
an example sentence. The difference regarding the learning phase between the two studies was
that Elgort (2011) used L2-only flashcards, which displayed the pseudowords along with their
                                                 14


explanations in the L2, while in Elgort and Piasecki (2014), the flashcards were bilingual,
including pseudoword explanations in the L1. Both studies used a semantic priming task to
assess semantic representations of the newly learned words. The idea was that only learners who
had established conceptual links for the words would show semantic priming, i.e., reacting faster
and more accurately to semantically related than unrelated word pairs. Results showed that in
Elgort (2011), regardless of participants’ proficiency level, those who learned through L2-only
flashcards, displayed semantic priming, indicating that participants had developed direct links
between L2 words and concepts. In Elgort and Piasecki (2014), only those with a larger L2
vocabulary size were able to do it. Based on the comparison of the two studies, Elgort and
Piasecki (2014) concluded that L2-only flashcards led to lexical knowledge of higher-quality and
were particularly beneficial for lower-proficiency learners.
        Whether L2 words are connected to concepts directly or through L1 mediation carries
real-world implications for L2 learning and use. van Hell and Kroll (2012) saw the ability to
build direct links between L2 words and concepts a “hallmark” (p.154) of high proficiency. Rice
and Tokowicz (2020) believed that the goal of L2 learning was being able to “think in L2” or
“conceptually mediate the language” (p.455), which meant to bypass L1 mediation and access
concepts directly in L2 word processing. Jiang (2000) expressed a similar opinion, arguing that
direct conceptual links for L2 words contributed to fluent use of the words in communication
whereas access to meaning through L1 was often effortful and lacked automaticity. Lexical
fluency is considered an important aspect of word knowledge (Godfroid, 2019; Nation, 2001)
and allows successful real-time language use (Schmitt, 2008). The importance of direct access to
meanings for L2 words highlights the need to employ vocabulary teaching and learning
approaches that promote the establishment of direct L2 form-meaning links. In the context of
                                                15


gloss language research, the critical question is which type of glosses, i.e., L1 or L2 glosses,
works more effectively for the development of such direct conceptual links and hence for greater
lexical fluency. The psycholinguistic studies reviewed above had a more theoretical focus on the
nature of lexical representations rather than measuring the actual processing fluency of word use.
The current study directly examined how fluently learners were able to access L2 words in
everyday tasks such as reading, which provides pedagogical implications on L2 vocabulary
learning.
        It is interesting to note that the RHM, Jiang (2000), and the RHM-RER have been
interpreted differently with regard to their implications on the effects of input language. While
psycholinguistics studies mostly deduced from the bilingual lexicon models that using L1
translations to learn L2 words would have negative effects, research on language of instruction in
L2 classrooms and on gloss language tended to use the models to support L1 use, arguing that
using L1 is beneficial, especially for lower-level learners, because these learners do not yet have
strong connections between L2 words and concepts (e.g., Kang et al., 2020; Macaro & Lee,
2013; Zhao & Macaro, 2016). Rice and Tokowicz (2020) provided a more comprehensive
interpretation of the RHM, Jiang (2000), and the RHM-RER models: there is more than one way
to learn L2 words and learning through L1 translations may be easier for beginner learners
(Kroll, van Hell, et al., 2010). Such interpretation highlights the roles of learners’ individual
differences, such as their L2 proficiency, in choosing input language. The current study, in
comparing L1 and L2 glosses, took into account learners’ L2 vocabulary size, among other
factors, to provide a fuller picture of gloss language effect in vocabulary learning.
                                                  16


Gloss Language Research
        As the review above on the use of L1 and L2 in L2 learning shows, findings in classroom
studies on language of instruction have supported bilingual teaching, i.e., using both L1 and L2,
while psycholinguistic research indicated that using L2 contributes to the development of direct
conceptual links and hence lexical retrieval fluency for L2 words. The contradiction between
these two lines of research warrants more research on the issue of L1 and L2 use. The current
study attempted to investigate this issue from the perspective of gloss language, i.e., L1 versus
L2 glosses, in vocabulary learning from reading.
        There are several critical differences between gloss language research and each of those
two lines of studies (classroom research on language of instruction and psycholinguistic studies
on input language and bilingual lexicon). In terms of gloss language and language of instruction
research, the language used in glosses is pre-planned, meaning that when it comes to L2 glosses,
material writers can carefully select L2 words that are likely to be understood by learners of the
targeted proficiency level. In comparison, teacher speech, which is often the focus of classroom
research on language of instruction, is more spontaneous. Hence, teachers may not attend to
whether all the L2 words they use can be understood by their students. Second, glosses are
written input while teacher speech is spoken input. Written input is untimed and can be
processed at learners’ own pace; in contrast, spoken input is fleeting, requiring greater attentional
resources on learners’ part (K. M. Kim & Godfroid, 2019). Lastly, L2 glosses are short,
compared to the continuous speech from teachers. It may be more challenging for learners,
especially those in lower proficiency, to process such long speech in L2 whereas short L2 glosses
are less likely to be an issue. The possible use of unfamiliar words, aggravated by the length and
timed nature of spoken input, makes it hard for learners to comprehend and learn from L2
                                                  17


teacher speech, which may be the reason why L1 use is almost always advantageous over
exclusive use of L2 in classroom studies. Regarding research on gloss language and on input
language in psycholinguistics, the major difference is the learning condition: the former took
place in incidental learning while the latter involved intentional learning. The difference in
learning condition could lead to differences in learners’ number of chances to see and engage
with the target words.
        In this section, I start by situating the use of glosses in the context of lexical focus on
form and incidental vocabulary learning. I then present an overview of previous gloss language
studies, before discussing potential variables that may moderate the comparative effectiveness of
L1 and L2 glosses.
Glosses, Lexical Focus on Form, and Incidental Vocabulary Learning
        Glosses are short definitions of words provided to support comprehension or word
learning. Glossing belongs to an L2 instruction approach called lexical focus on form (e.g.,
Laufer, 2005, 2006; Laufer & Girsai, 2008). Focus on form takes place when learners’ attention
is briefly directed to linguistic forms, e.g., grammatical rules, in meaning-focused activities
(Long, 1991, 1996). One way focus on form benefits learning is by allocating learners’ limited
attentional resources to both form and meaning. In meaning-focused activities, learners’ primary
attention is on meaning, e.g., reading comprehension and having conversations. Focus on form
provides opportunities for learners to switch their attention temporarily to forms, which learners
may otherwise not have the cognitive resources to attend to (Loewen, 2005, 2014; VanPatten,
1990). Most focus on form research has concerned the learning of morphosyntax (e.g., Ellis et
al., 2006; Fu & Li, 2022; Sato & Loewen, 2018). When focus on form is applied to vocabulary
                                                   18


learning, i.e., lexical focus on form, the goal is to induce learners’ attention to lexical items and
to help learners establish accurate form-meaning connections for unfamiliar words.
         Before further discussion on the effects of lexical focus on form and specifically,
glossing, on vocabulary learning, it is necessary to make clear what vocabulary learning in
meaning-focused activities is. According to Webb (2019), vocabulary learning in a meaning-
focused activity, e.g., understanding the content rather than learning words, can be called
incidental learning. Incidental learning is often thought of as ‘picking up’ words while doing
something else, such as reading a book or watching TV, i.e., learning as a by-product of
meaning-focused activities (Hulstijn, 2001; Loewen, 2014; Webb, 2019). Note that in such
conceptualization of incidental learning, the core is the aim of the activity and not learners’
behaviors, i.e., whether an activity is intended for vocabulary learning and not whether learners
actually intend to learn new words in the activity. It is in fact hard to rule out intention to learn in
incidental learning (Bruton et al., 2011; e.g., Hulstijn, 2001; Loewen, 2014). Several incidental
vocabulary learning studies revealed that learners tried to memorize words encountered during
reading (e.g., Y. Chen, 2021; Godfroid, Ahn, et al., 2018; Pellicer-Sánchez & Schmitt, 2010).
The percentage of unknown words in the learning materials, the learning context (e.g., at home
vs. in the classroom), and the use of typological enhancement are among many factors that may
influence the presence and degrees of intention in an incidental learning condition (Webb, 2019).
Ender (2016) used the term explicit processing to refer learners’ attempts and strategies to learn
new words, such as meaning inferencing and checking a dictionary, in incidental vocabulary
learning. She argued that the use of these strategies, or the intention to learn, could exist in
meaning-focused activities and did not alter the incidental nature of learning in these activities.
In vocabulary research, besides making the activity a meaning-oriented one, incidental learning
                                                  19


can also be operationalized as a learning condition where learners are not forewarned of a
posttest (Hulstijn, 2003). In the current study, the learning condition was incidental in that
learners were asked to comprehend the reading instead of learning new words and they were not
informed of the posttests after reading. Several terms have been used for incidental vocabulary
learning in previous studies, such as contextual word learning (e.g., Elgort, Perfetti, et al., 2015;
Elgort, Candry, et al., 2018; Elgort, Beliaeva, & Boers, 2020) and vocabulary learning
from/during reading (e.g., Elgort & Warren, 2014). The current study use these terms
interchangeably.
        Going back to lexical focus on form, the reason why such approach is used is because
while L2 learners are able to pick up words incidentally, the process is inefficient, taking
considerable time while yielding limited gains. For example, in Godfroid, Ahn, et al. (2018),
after encountering each target word in text around three times on average (range: 1–23 times), L2
learners learned scored around 30% in the form and meaning recognition posttests and 13% in
the meaning recall test. Other studies showed similar percentages of gains in immediate posttests
(e.g., 21% as measured by a meaning generation task in Elgort & Warren, 2014; 18% by a
translation test in Waring & Takaki, 2003). In delayed posttests, gains dropped to 8% in a
translation posttest a week after learning and to 4% three months later in Waring and Takaki
(2003). Elgort and Warren (2014) suggested that at least 12 encounters with a word were
required for noticeable learning. Eye tracking studies revealed that for novel words to be
processed in a similar manner to familiar words, it took 10 exposures (Pellicer-Sánchez &
Schmitt, 2010) or even over 40 (Elgort, Candry, et al., 2018).
        The low learning gains in incidental conditions are mainly due to (1) lack of noticing of
unfamiliar words, (2) inaccurate meaning inference, and (3) low retention of word knowledge
                                                  20


(Hulstijn et al., 1996; Laufer, 2005). Laufer (2005) elaborated on these reasons and argued that
learners often overestimated their word knowledge and hence either failed to notice unfamiliar
words or did not work on guessing word meanings; even when learners attempted to make
guesses on the meanings of the new words, they did not always succeed because the context did
not provide adequate clues; when learners correctly identified word meanings, they may still not
be able to retain the meanings. Laufer (2005) maintained that incidental learning should not be
the default approach for L2 learners and called for instructional intervention using lexical focus
on form.
        Types of lexical focus on form include glossing (e.g., Khezrlou et al., 2017; Warren et al.,
2018), input enhancement, i.e., highlighting, bolding or underlining lexical items (e.g., Boers et
al., 2017; Choi, 2017; Sonbul & Schmitt, 2013; Toomer & Elgort, 2019), and providing bimodal
input (e.g., Y. Chen, 2021; Malone, 2018; Webb & Chang, 2015, 2022). Although input
enhancement and bimodal input can increase the likelihood of learners noticing new vocabulary
items, learners might still incorrectly infer word meanings, leading to erroneous form-meaning
associations. Glosses enhance word salience while at the same time supply meanings, tackling
two of the three major challenges learners face in incidental vocabulary learning, i.e., lack of
noticing and erroneous meaning inference of unfamiliar words. According to the instance-based
model for the learning of word meanings (Bolger et al., 2008), learners extract a word’s core
meaning, i.e., decontextualized as opposed to context-dependent word knowledge, through
repeated encounters with the word in context; definitions given alongside context can accelerate
this process by providing the core meaning directly.
        Earlier research on glossing focused on the comparison between learning conditions with
and without glosses (e.g., Hulstijn et al., 1996; Jacobs et al., 1994). In general, there is a large
                                                 21


positive effect of glossing on vocabulary learning, as suggested by a few meta-analyses
(Abraham, 2008; Yanagisawa et al., 2020; Zhang & Ma, 2021; see Boers, 2022 for a review).
Recent glossing research has witnessed increasing interest in the effects of gloss type, such as L1
versus L2 glosses (e.g., Choi, 2016; Kang et al., 2020; Ko, 2012) and multimodal versus text-
only glosses (e.g., Boers et al., 2017; Jones, 2013; Ramezanali et al., 2021; Warren et al., 2018).
The current study focused on gloss language, i.e., L1 versus L2 glosses. Gloss language is not
only a relevant topic for language pedagogy, but also provides an interface to examine the
implications of bilingual lexicon models, which make predictions regarding the consequences of
learning through L1 and L2.
Previous Research on Gloss Language
        Gloss language has been a contentious issue in vocabulary learning. From a pedagogical
perspective, Laufer and Shmueli (1997) presented arguments both for and against the use of L1
and L2 glosses: on the one hand, students preferred L1 glosses, which allowed a sense of security
about understanding the meaning of the words; on the other hand, learning through L1
translations may result in inaccurate uses of L2 words because there was not always a one-to-one
correspondence between the L1 and the L2 for a given word (see also Jiang, 2000); further, L2
glosses provided additional exposure to the target language, which was believed to be beneficial
for language learning. Findings from gloss language studies have been mixed. While many found
L1 glosses to be more effective (e.g., Choi, 2016; Jacobs et al., 1994), others have demonstrated
equal effectiveness (Kang et al., 2020; Ko, 2012; Yoshii, 2006) or superiority of L2 glosses (e.g.,
Miyasako, 2002; Shiki, 2008). Three recent meta-analyses on glossing (H. S. Kim et al., 2020;
Yanagisawa et al., 2020; Zhang & Ma, 2021) also yielded contradicting findings. Yanagisawa et
al. (2020), which focused on glossing in general, and H. S. Kim et al. (2020), which included
                                                 22


studies on gloss language only, both showed that L1 glosses were more effective than L2 ones in
both immediate and delayed posttests, albeit H. S. Kim et al.’s comparisons had small effect
sizes (g = .44 for immediate posttests; g = .28 for delayed posttests). Zhang and Ma (2021), in
contrast, found that L2 glosses were more effective in the fixed-effect model, and the random-
effect model showed no significant difference between L1 and L2 glosses. Note that many of the
gloss language studies did not report whether words used in L2 glosses were familiar to learners.
The different degrees of familiarity to words in L2 glosses in these studies may be one of the
reasons for the inconclusive findings.
        From a theoretical perspective, it follows from the Revised Hierarchical Model (RHM),
Jiang (2000), and the Revised Hierarchical Model-Repetition Elaboration Retrieval model that
L2 glosses might be more effective than L1 ones, at least in terms of establishing direct
conceptual links for the L2 words. This prediction, though corroborated by some
psycholinguistic research on intentional word learning, has not been consistently realized in
research that compared L1 and L2 glosses in vocabulary learning from reading as shown by the
abovementioned gloss language studies. The difference in learning condition, i.e., incidental in
gloss language research and intentional in psycholinguistics studies, may have accounted for the
discrepancy in results. In incidental vocabulary learning, learners are supported with context
while in intentional learning, words are usually presented with less contextualization. For
example, in Elgort (2011) and Elgort and Piasecki (2014), an example sentence was provided for
each target word; in Jeong et al. (2010), target words were embedded in short videos showing
real-world scenarios. In comparison, target words in gloss language research often appeared in a
short story, which is much longer than a sentence or a short video. The rich context in gloss
language research might have canceled out the negative effect of learning through the L1 and
                                                 23


allowed learners to connect L2 words with their concepts. Further, intentional learning means
that learners can often review the target words as many times as they like while in most gloss
language studies, learners only saw the target words and their definitions once. This means that
learners in intentional learning can rehearse the link between a target word and its definition
multiple times whereas those in incidental learning have only one opportunity to connect the
target word and its gloss. It is somewhat expected that with only one exposure to the target
words, glosses written in the L1, which is easier to process than the L2, provide a quick and easy
way to establish form-meaning associations for words. Learners in the two learning conditions
also differ in the amount of engagement with target words and word definitions. Learners are
instructed to memorize words in an intentional condition while in incidental learning, the focus is
on the comprehension of text where words are embedded, and learners often fail to notice
unfamiliar words. In addition to the difference in learning condition, previous gloss language
research assessed vocabulary gains using offline measures, which only offer insights into the
product of cognitive processing. Psycholinguistic research, in contrast, used online measures that
gauge the real-time lexical processing. It is possible that the processing differences between
words learned through L1 and L2 glosses cannot not captured by offline measures alone. The
prediction of the bilingual lexicon models that learning through L2 is more beneficial should be
reevaluated in gloss language research with online outcome measures and with the number of
target word encounters and learners’ engagement with the words taken into account.
Potential Factors Moderating the Effects of Gloss Language
        How many words learners are able to pick up in incidental learning is affected by a
number of factors. In this study, I examined three, namely frequency of occurrence (FoO) of
target words, i.e., the number of times a target word appears in text, learners’ L2 vocabulary size,
                                                 24


and learner engagement. In this section, I focus on the first two factors, discussing how they
affect incidental learning as shown by previous research and how they may moderate the effects
of gloss language. Learner engagement, as a moderating variable and an outcome variable in this
study, is reviewed in the next section.
        Frequency of Occurrence. Vocabulary learning is an incremental process and the FoO of
target words plays a critical role in the accumulation of lexical knowledge (Hulstijn, 2001).
According to the instance-based models for word learning (Bolger et al., 2008; Reichle &
Perfetti, 2003), each encounter with a word creates a memory trace that contains the word’s
form, meaning, and the context which the word is embedded in; the initial encounter only results
in incomplete word knowledge; with each subsequent encounter, knowledge accumulated in
previous encounters will be reactivated and eventually with sufficient experiences with the word,
its meaning will be extracted. L2 incidental vocabulary learning studies have demonstrated such
an incremental process empirically, showing that though learners were able to gain some lexical
knowledge after one or two encounters with a word (e.g., C. Chen & Truscott, 2010; Malone,
2018), greater number of encounters, or higher FoOs of target words, led to better word learning
(e.g., Godfroid, Ahn, et al., 2018; Vidal, 2011; Webb, 2007). Words of higher FoOs were more
likely to be recognized and recalled in terms of both form and meaning as measured by offline
posttests (e.g., Elgort & Warren, 2014; Webb, 2007; Webb & Chang, 2015); new words that are
repeatedly encountered were also processed more fluently and in a similar manner to familiar
words as assessed by online tests (e.g., Elgort, Beliaeva, & Boers, 2020; Godfroid, Ahn, et al.,
2018; Pellicer-Sánchez, 2016; Pellicer-Sánchez & Schmitt, 2010).
        FoO has been found to interact with lexical focus on form treatments in incidental
vocabulary learning. In a meta-analysis on the effects of repetition in incidental word learning,
                                                 25


Uchihara et al. (2019) revealed an overall medium effect of FoO (r = .34), but the effect dropped
when learning was assisted with multimodal input of reading-while-listening (r = .28) or viewing
(r = .22), indicating that repetition might be more important in a ‘harsh’ unenhanced learning
condition. Similarly, higher FoO may attenuate the effect of lexical focus on form. In Malone
(2018), for example, there was a significant difference in vocabulary learning gains as measured
by a form-recognition test between the reading-while-listening and reading-only conditions when
target words appeared twice in text. However, such difference became nonsignificant when
learners were exposed to target words four times.
        The effects of FoO have rarely been researched in gloss research. Most studies on
glossing included target words that appeared once in text. One exception, Teng (2020), supported
the general trend that higher FoO led to better word learning. The study adopted a 2 (gloss vs. no
gloss) x 3 (FoOs: 1, 3, & 7) between-subject design: learners saw 15 target words embedded in
text once, three times, or seven times either with or without glosses. Posttests on recognition and
recall revealed positive effects of FoO and glossing. There was also an interaction between FoO
and glossing in that the effects of FoO were greater in the gloss than the no-gloss condition. Choi
(2016), a gloss language study, found that L1 and L2 glosses were equally effective in the
learning of target words that appeared twice as shown by both the immediate and delayed
posttests; for target words with an FoO of four, L1 glosses worked better than L2 ones in the
delayed but not in the immediate posttest. Given the mixed findings, further research is needed to
clarify how FoO may influence the effects of glossing and gloss language.
        L2 Vocabulary Size. L2 vocabulary size is a strong predictor of L2 proficiency (Qian &
Lin, 2019). In this study, I used L2 vocabulary size as a proxy of L2 proficiency but will refrain
from using these two terms interchangeably as proficiency is a multifaceted construct and is not
                                                 26


always equal to vocabulary size. Various other ways have been used to operationalize L2
proficiency in vocabulary studies, such as cloze test scores (e.g., Ko, 2012; Malone, 2018),
automaticity in word retrieval (e.g., Elgort, Perfetti, et al., 2015; Elgort & Piasecki, 2014; Elgort
& Warren, 2014), and academic status (e.g., Zhang & Ma, 2021). In what follows, I use the term
‘proficiency’ to refer broadly to learners’ language level regardless of how it is measured and
save the term ‘vocabulary size’ for when learners are measured with vocabulary size tests.
        In general, in an incidental condition, learners of higher proficiency gained greater word
knowledge, as shown by their better performance in paper-and-pencil posttests (e.g., S. Lee &
Pulido, 2017), ERP (e.g., Elgort, Perfetti, et al., 2015), and reaction time (RT; e.g., B. Chen et
al., 2017; Elgort & Warren, 2014) data. Advanced learners also needed fewer encounters with
target words to learn them (Uchihara et al., 2019). This is because higher-level learners may be
able to better comprehend the text where target words are embedded and are thus more likely to
successfully infer target word meanings based on context, even with few repetitions of the target
words. The positive effect of higher proficiency can also be explained through a resonance
mechanism, which hypothesized that known words are nonselectively activated when relevant
words are read (Myers & O’Brien, 1998). It follows that in incidental word learning, advanced
learners have more known words to be activated when reading a target word, leading to faster
establishment of stronger connections between target words and existing words. When glosses
were provided, however, proficiency did not seem to moderate learning gains, as shown by
Yanagisawa et al.’s (2020) meta-analysis, which was probably due to the lack of need to infer
meanings.
        When it comes to proficiency and gloss language, it is logical to conjecture that L2
glosses are less effective for lower-proficiency than for higher-proficiency learners. Finding L2
                                                  27


glosses challenging to understand, learners with limited proficiency are more likely to
misunderstand the glosses or simply to ignore them (Boers, 2022). Results regarding the role of
L2 proficiency in gloss language, however, have been mixed. Yanagisawa et al. (2020) and
Zhang and Ma (2021)’s meta-analyses on glossing did not find a moderating effect of L2
proficiency on gloss language. In contrast, H. S. Kim et al. (2020)’s meta-analysis on gloss
language found that L1 glosses were more effective for lower-level learners. Inconsistency in
findings may be due to different operationalizations of L2 proficiency and also outcome
measures of vocabulary gains. More research is needed to understand the role of L2 proficiency
in the comparison of L1 and L2 glosses, particularly research that takes into consideration other
variables, e.g., target word FoO and learners’ engagement with glosses, and uses various other
outcome measures. Recall that in Elgort and Piasecki (2014), when being able to view target
words multiple times and being measured with a RT-based semantic priming task, learners of
lower proficiency benefitted more from L2 than from L1 word definitions, opposite to findings
in H. S. Kim et al. (2020).
Learner Engagement in Vocabulary Learning
Defining Engagement
        Although engagement is a common term used in everyday life, it can be an elusive
concept and encompasses many different phenomena. Here, I differentiate between two types of
engagement, engagement as attention and engagement as action. Engagement as attention simply
means paying attention to something, e.g., noticing that a word is unfamiliar. Engagement in
action goes beyond mere noticing and refers to actions taken on something, e.g., looking up a
word in a dictionary. Engagement as action is similar to the concept of involvement in the
Involvement Load Hypothesis for vocabulary learning (Hulstijn & Laufer, 2001; Laufer &
                                                28


Hulstijn, 2001). According to the hypothesis, involvement has three components, namely need,
search, and evaluation. Need concerns how motivated a learner is to do a certain thing in order to
complete a task. The need to check a word’s pronunciation is strong when a learner wants to use
the word in a speaking task. Search is the action taken to figure out the meaning of an unfamiliar
word, such as asking instructors or peers about the word. Evaluation refers to learners’
assessment of how the word fits in its context. Engagement as action depends on engagement as
attention, i.e., attention is the prerequisite of action. It is hard to imagine a learner will look up a
word if they haven’t noticed the word first. In the Involvement Load Hypothesis, search and
evaluation are learners’ actions on unfamiliar words and they are “contingent upon allocating
attention to form-meaning-relationships” (Hulstijn & Laufer, 2001, p. 543). Engagement as
attention, in contrast, can take place without engagement as action. A learner may notice an
unknown word but decide not to do anything with it.
Measuring Engagement
         Engagement in vocabulary learning studies has been gauged with eye tracking, think
alouds, retrospective surveys, and tracking learner behaviors in computer- and mobile- assisted
language learning. Except for retrospective surveys, the other three methods are able to reveal
real-time engagement as learners are processing the learning materials. In what follows, I review
how these methods measure engagement as action and as attention.
         Eye tracking is mostly used to measure engagement as attention. In many studies, total
reading time on a word was used as an index of attention paid to the word (e.g., Godfroid, Boers,
& Housen, 2013; Godfroid, Ahn, et al., 2018; Mohamed, 2018; Pellicer-Sánchez, 2016). Eye
tracking can also reveal engagement as action when the actions are performed on screen. Warren
et al. (2018) examined eye movements to target words and three types of marginal glosses (text,
                                                    29


picture, and multimodal). In this study, eye movements towards target words can be seen as
engagement as attention; eye movements towards marginal glosses indicated engagement as
action: learners took actions to look away from the main text to page margins to consult glosses.
However, eye tracking in this study may not be able to reveal learners’ other actions during
reading, such as inferring word meanings.
        Think-alouds probe concurrent processing by asking participants to articulate their
thoughts while doing something. Ender (2016) used this method to examine the cognitive
processes during incidental vocabulary learning. Based on the think-alouds data, she categorized
participants’ processing strategies into ignoring a word, checking a dictionary, inferring meaning
from context, and inferring meaning plus checking a dictionary. Ignoring indicated engagement
as attention because this category included instances where participants would notice that a word
was unfamiliar but decide not to consult a dictionary. The other three categories can be seen as
engagement as action. Ender treated uncommented words as unattended ones, i.e., without
engagement as attention. However, it was hard to ascertain whether participants underwent
unarticulated processing of those words.
        In retrospective surveys on engagement, learners self-report whether they pay attention to
unfamiliar words and what they do with the words. Elgort and Warren (2014), for example, used
a five-point Likert scale to examine the extent to which learners (1) ignored unfamiliar words,
(2) tried to infer the words’ meanings, and (3) noted down the words and checked the dictionary
later. Like in think-alouds, ignoring here would mean engagement as attention without action
and the other two options would indicate engagement as action. A limitation of retrospective
surveys is that they can only give a general picture of learners’ thoughts during learning but are
not able to tell us how learners process each target word.
                                                 30


        When learning takes place on a computer or a mobile device, learners’ behaviors such as
mouse clicks, mouse movements, and keystrokes can be recorded (see Fischer, 2007 for a review
on tracking in computer-assisted language learning). Vocabulary learning studies have mostly
used tracking data to examine whether and how learners used glosses and online dictionaries
(e.g., Chun & Payne, 2004; Laufer & Hill, 2000; H. Lee et al., 2017; Peters, 2007; Varol &
Erçetin, 2021). Tracking data can reveal the number of times, the frequency, and the duration of
gloss access and dictionary lookups. Tracking data can thus be used to indicate engagement as
action, e.g., what learners do with glosses, but may not be able to say a lot about engagement as
attention — when learners do not click a gloss, they may still have paid attention to the word.
Compared with think-alouds, tracking has the advantage of being unobtrusive. Tracking on a
computer or a phone is more ecologically valid than eye tracking because the former allows
learners to perform a task anywhere anytime as long as they have a digital device where the
tracking program is implemented while the latter mostly requires learners to sit in the lab, which
is not a typical environment learning takes place in reality.
Engagement as a Moderating Factor: What are the Effects of Engagement?
         It is a “commonsense notion” (Schmitt, 2008, p. 338) that more engagement predicts
better learning. Craik and Lockhart’s (1972) Levels of Processing framework is often cited to
support this notion. According to the framework, analysis of stimuli goes from the shallow
processing of physical forms to the deeper levels of meaning extraction and elaboration. Deeper
processing leads to stronger and longer-lasting memory traces. The Involvement Load
Hypothesis mentioned above also argued that higher involvement resulted in greater word
knowledge. The positive relationship between engagement and vocabulary learning is evidenced
in empirical research. Eye-tracking studies have shown that longer reading times, an indication
                                                 31


of greater engagement as attention, resulted in higher vocabulary gains in incidental learning
(e.g., Godfroid, Ahn, et al., 2018; Pellicer-Sánchez, 2016). Greater engagement as action also
leads to better learning. In Ender (2016), for example, only 12% of the ignored target words were
recalled in posttests, compared to 27% for words that were looked up in a dictionary.
Engagement as action not only leads to greater vocabulary gains but also accelerates the learning
process. In Elgort and Warren (2014), learners who tried to infer word meanings needed fewer
encounters with a word to learn it. Similarly, Uchihara et al.’s (2019) meta-analysis on repetition
in vocabulary learning revealed that the effects of target word FoO diminished when learners
used dictionaries, asked questions about words, or took notes.
        In research on glossing, consulting a gloss can be seen as engagement as action. The
amount of engagement in gloss research was usually operationalized as the number of times
learners accessed glosses and the duration learners spent reading the glosses. Findings showed
that the relationship between learners’ engagement with glosses and learning gains was not
straightforward. Laufer and Hill (2000) found no relationship between the number of times
learners accessed word definitions and word knowledge measured by a meaning recall test.
Warren et al. (2018) used eye tracking to measure engagement as action, i.e., learners’ reading
time on glosses. The authors, like Laufer and Hill (2000), found no relationship between time on
gloss and learning. H. Lee et al. (2017)’s results revealed a negative correlation between the
amount of time learners spent on glosses and learning gains and a non-significant correlation
between the frequency of gloss access and learning. Finally, in Peters (2007), greater number of
gloss access led to more learning. This positive relationship was moderated by whether a target
word was relevant to the comprehension questions: the correlation between gloss access and
learning was higher for relevant than for nonrelevant words. The discrepancy in results in these
                                                 32


studies is likely due to various reasons. As Peters (2007) suggested, relevance of a target word
could be one of the reasons. Warren et al. (2018) indicated that gloss type could be another.
While the study did not find a relationship between time on glosses and learning, it showed that
time spent on target words and picture glosses together yielded higher gains than text and
multimodal glosses, which the authors attributed to the mnemonic advantage of picture glosses.
No gloss language research, to my knowledge, has examined how engagement with L1 and L2
glosses may be differentially associated with learning. Such examination will advance our
understanding of both engagement and gloss language in vocabulary learning.
Engagement as an Outcome Variable: What Affects Engagement?
        Engagement may be affected by the learning condition and learners’ characteristics. The
Involvement Load Hypothesis (Hulstijn & Laufer, 2001; Laufer & Hulstijn, 2001) predicted the
engagement level of a learning condition based on whether the condition instigated learners’
need, search, and evaluation. For example, engagement is higher when learners are required to
select the correct word meaning from several options than when they are given the meaning. In
glossing research, many studies have found that types of glosses influenced learner engagement
(Rassaei, 2020; Türk & Erçetin, 2014; Varol & Erçetin, 2021; Warren et al., 2018; cf. H. Lee et
al., 2017). Warren et al. (2018) showed that text glosses were mostly likely to be ignored,
followed by picture and multimodal glosses, though there was no significant difference in
attention paid to the three types of glosses. In terms of learners’ characteristics, Chun and Payne
(2004) found that lower working memory was associated with more gloss lookups, i.e., greater
engagement as action. Varol and Erçetin (2021) explored how gloss type and working memory
interactively affected engagement with glosses. Participants in the study read with one of the
four types of glosses that differed in either content (lexical vs. topical) or position (pop-up vs.
                                                  33


separate windows). Findings revealed no interaction between working memory and gloss type,
with the frequency of gloss access being affected by gloss position only. Gloss language studies
have rarely looked into how learners engage with L1 and L2 glosses. Results in Chun and Payne
(2004) suggested that when learners had the freedom to pick, they accessed L1 glosses more
frequently than L2 glosses. Laufer and Hill (2000) alluded to the possibility that learner
characteristics may influence preference for L1 or L2 glosses. In the study, while Israeli high
school students accessed L1 glosses more, college students in Hong Kong preferred L2 glosses.
Given that the learner characteristic of L2 proficiency has been found to influence the amount of
and learner attitudes towards L1 use in L2 classrooms (e.g., DiCamilla & Antón, 2012; J. H. Lee
& Lo, 2017), it is possible that proficiency level also affects how much learners engage with L1
and L2 glosses.
The Present Study
        The goal of the study was to unpack factors that may interactively and independently
moderate gloss language effect on learners’ engagement with glosses and on vocabulary learning
from reading. Such endeavor contributes to our knowledge of when to use L1 and L2 glosses to
optimize learning and acknowledges the critical role of L1 in L2 learning. Specifically, I
examined how learners’ L2 vocabulary size, engagement with glosses, and target words’ FoO
affected the comparative effectiveness of L1 and L2 glosses in vocabulary learning and retention.
I also explored how learners’ vocabulary size influenced their engagement with L1 and L2
glosses. Because a target word was glossed only at its first occurrence and learners did not know
beforehand how many times a target word would appear in text, FoO was not likely to affect
learners’ engagement with the first encounter of the target word and its gloss. Learning was
measured in terms of both receptive and productive meaning knowledge, and fluency of word
                                                 34


retrieval. The assessment of target words’ retrieval fluency went beyond most previous gloss
language research and was able to shed some indirect light on the prediction of the RHM that L2
input facilitates the establishment of direct conceptual links for L2 words and thus more fluent
word retrieval. Both the short-term and long-term development of word knowledge was
measured, by immediate and delayed posttests respectively. The operationalization of variables
involved in the study is summarized in Table 1.
 Table 1
 Operationalization of Variables
 Variables                              Operationalization                   Measures
 Gloss engagement                         Time on gloss                   Computer logs
 Vocabulary size                    Receptive vocabulary size       Updated Vocabulary Levels
                                                                               Test
 Receptive meaning                    Accuracy of meaning             Meaning matching test
 knowledge                                   matching
 Productive meaning                Accuracy of meaning recall           Meaning recall test
 knowledge
 Fluency of lexical retrieval             Reaction time               Self-paced reading test
 Short-term development              Accuracy of immediate             Immediate posttests
                                             posttests
 Long-term development            Accuracy of delayed posttests          Delayed posttests
        The following research questions guided the study:
                                                 35


RQ1. How does gloss language affect learners’ engagement with glosses, as moderated by
learners’ vocabulary size?
RQ2. How does gloss language affect learning in short-term and long-term receptive and
productive meaning knowledge, and lexical retrieval fluency, as moderated by target word FoO,
learners’ vocabulary size, and learner engagement?
        Previous research on the effects of FoO, engagement, and L2 proficiency justified several
possible interactions between the variables being examined. For RQ1, the interaction between
gloss language and vocabulary size is plausible. For RQ2, the possible interactions include (1)
FoO and gloss language, (2) vocabulary size and gloss language, (3) gloss engagement and gloss
language, (4) FoO, vocabulary size, and gloss language, (5) FoO, gloss engagement, and gloss
language, and (6) vocabulary size, gloss engagement, and gloss language.
                                                36


                                     CHAPTER: 2 METHOD
        This chapter presents methodological details regarding participants, materials, procedure,
test scoring approach, and data analysis. Instructions participant received for each task, i.e.,
reading, exit questionnaire, language background questionnaire, and the three vocabulary
posttests, are included in Appendix A. All instructions were written in participants’ first
language (L1).
Participants
        One hundred and eighteen second language (L2) learners of English completed all the
tasks in the study. The participants were recruited through word-of-mouth and flyers posted on
Wechat, a social media platform widely used in China. Participants were randomly assigned to
either the L1 gloss or the L2 gloss group. Eight participants were excluded from data analysis
due to low reading comprehension score (see Reading Comprehension in the Data Analysis
section). The final sample size included in data analysis was 60 in the L1 gloss group and 50 in
the L2 gloss group.
        All included participants spoke Chinese as their L1 and were studying at university in
China at the time of participation. Participants came from different institutions. Their mean age
was 20.93 years (SD = 2.20, range: 17–30). On average, the participants had had English
classroom instruction in China for 11.59 years (SD = 2.36, range: 6–18) and were at different
levels of degree programs, e.g., undergraduate (n = 91), master’s (n = 18), and PhD (n = 1).
        All participants took the updated Vocabulary Levels Test (Webb et al., 2017; see more
details about the test in the section on L2 vocabulary size) and received a score of 20 or above
for each of the 1000, 2000, and 3000 levels. The threshold of 20 indicated that the participants
had a mastery of words in the three levels (Nation, 1983; but see Webb et al., 2017) and that thus
                                                37


participants were not likely to have difficulty comprehending the reading material and the
glosses (see more details in the Materials section below). Participants’ average score in the
updated Vocabulary Levels Test was 134.90 (SD = 11.03; range: 102 – 150) out of a maximum
score of 150, suggesting that the participants were of intermediate proficiency level. The
Cronbach’s α reliability estimate of the updated Vocabulary Levels Test was .91.
Materials
Reading Material
        I adapted the introduction and the first 13 chapters of the Pearson graded reader The
Client to be the reading material. The graded reader is a thriller, and the suspense in the story
may help engage learners and keep their interest in the content. The purpose of the adaptation
was to increase some of the target words’ frequency of occurrence (FoO). The reading has 9,808
words. An analysis by Vocabprofile on Lextutor (Cobb, n.d.) revealed that 98% of the words in
the reading are from the most frequent 2,000 word families in the BNC/COCA word frequency
list (Nation, 2012). See Appendix B for a complete vocabulary profile of the reading material.
Target Words
        Twenty-four words with an FoO ranging from one to 27 (M = 6.58, SD = 6.13) in the
reading material were replaced by pseudowords to avoid out-of-experiment exposure and prior
knowledge. The 24 words were chosen because of their wide range of FoOs. The target words
were underlined, bolded, and colored in blue to indicate the availability of glosses. The
pseudowords’ parts-of-speech remained unchanged from the original words, which included 17
nouns, four verbs and three adjectives. The ratio of target words was 1.61% in text.
        The pseudowords were selected from the English Lexicon Project (Balota et al., 2007), a
corpus that contains behavioral data, i.e., reaction time (RT) and accuracy, of 40,481 words and
                                                 38


40,481 pseudowords from 816 L1 speakers of English in a lexical decision task and 444 in a
naming task. The pseudowords in the project were generated by changing one or two letters in
real words (Balota et al., 2007). The pseudowords were between four and seven letters long (M =
5.5, SD = .72). This range of length was chosen because pseudowords that are one or two letters
different from real words are more word-like when they are longer (Keuleers & Brysbaert,
2010). The chosen pseudowords also had similar RTs as obtained from the English Lexicon
Project (M = 842.23; SD = 25.95). Because the degree of a pseudoword’s resemblance to a real
word affects its RT (Keuleers & Brysbaert, 2010), comparability in RTs indicated that the chosen
pseudowords were similar in their word-likeness, which reduced the likelihood that some
pseudowords were more salient and were easier to learn than others (Bartolotti & Marian, 2017).
The pseudowords were also comparable in the number of orthographic neighbors and mean
bigram frequency, which are believed to affect word processing. Appendix C presents the
selected pseudowords, their characteristics obtained from the English Lexicon Project, and the
original words they replaced.
Glosses
        Two types of glosses were constructed, namely L2 and L1 glosses. L2 glosses were
modifying definitions of the corresponding real words. L1 glosses contained the L1 translations
of the L2 glosses. To make the L2 glosses comprehensible, all words used in the L2 glosses were
within the most frequent 3,000 word families in Nation’s (2012) BNC/COCA word frequency
list. The two types of glosses were matched in length: L2 glosses were 4.17 words long (SD =
1.83) and L1 glosses 4.54 (SD = 2.17) characters long on average.
        The gloss of a target word appeared in a pop-up window (see more details in the Reading
Interface section) when participants clicked the target word. The pop-up window closed when
                                                39


participants clicked the target word again. Each target word was glossed once on its first
occurrence. Examples for L1 and L2 glosses are provided below. For the full list of glosses, see
Appendix D.
Example: glosses for the target word ‘haron’
L1 gloss: 重要政治人物
L2 gloss: an important politician
Reading Comprehension Questions
        Ten comprehension questions were inserted in the reading material with an interval of
about six pages. The comprehension questions were taken from the Pearson teachers’ resources
for the graded reader and were all closed-ended questions, i.e., multiple choice and true/false
items.
The Reading Interface
        Figure 3 is an example of the reading interface as shown on a laptop. The reading
material was presented across 25 pages. Each page contained around 400 words, except for the
first page, where the introduction part of the reading was presented (132 words), and the last
page (249 words). The text of the reading was in Times New Roman font in 18 pixels in the
color of black. The text of the glosses was in the same font in 15 pixels in white against a black
background. Gloss windows always popped up below target words. Page numbers and the total
pages were displayed on the bottom left corner. Page numbers started from the first page of the
reading or the first page after comprehension questions (whichever applied). The total pages
represented the number of pages participants had to read in total before encountering a
comprehension question. For example, in Figure 3, page 3/5 means that participants are on the
third page of the reading or the third page after comprehension questions, and there are five
                                                 40


pages in total before the next set of comprehension questions come up. Below the page numbers,
the ‘next page’ button allowed participants to proceed to the next screen. Participants could not
go back to previous pages once the click the ‘next page’ button.
Figure 3
Example of the Reading Interface
Exit Questionnaire
        The exit questionnaire (see Appendix E), written in participants’ L1, had 10 questions,
and was administered primarily to probe participants’ gloss access. The first question asked
participants how often they checked glosses. In another question, participants indicated to what
extent they skipped a gloss for the following four reasons: (1) they have guessed the word
meanings; (2) they did not need the gloss to understand the reading; (3) knowing a word’s
meaning was not important; and (4) the gloss was not helpful. In these two questions,
participants gave their responses on a scale from 0 to 100 for these questions. 0 meant that
participants did not check glosses at all, and 100 meant that participants checked all the glosses
in the first question. In the second question, 0 indicated that a given reason was never why
participants skipped glosses, and 100 indicated that a particular reason was always why
                                                  41


participants skipped glosses. Participants were also prompted to give examples of glosses that
had helped them with reading comprehension or word learning.
        The exit questionnaire also included questions about whether participants understood the
glosses, whether the glosses had helped with reading comprehension and word learning, and how
much participants enjoyed the reading. Responses to these questions were made on a 100-point
scale. Participants also indicated whether they had deliberately memorized the target words and
whether they had guessed that there would be vocabulary posttests. These two questions required
yes or no answers.
Language Background Questionnaire
        The language background questionnaire (see Appendix F) was administered to collect
details about their language learning history and use. There were three sections in the
questionnaire. The first section asked participants to provide basic information, including their
name, participant ID, age, gender, and education level. The second part focused on participants’
language proficiency. It asked participants to self-assess their overall English proficiency as well
as their proficiency in reading, listening, writing, and speaking on a 10-point Likert scale. In this
section, participants also provided their scores of standardized tests they had taken (e.g., TOEFL
and IELTS). The last part looked into leaners’ English learning history and current English use:
age of onset of English learning, ways of learning, years of formal language education in the
classroom, current amount of English classroom instruction (hours per week), and experiences of
living or studying in an English-speaking country.
L2 Vocabulary Size
        Participants’ vocabulary size was measured by the updated Vocabulary Levels Test
(Webb et al., 2017). The test assesses learners’ receptive word knowledge at five word frequency
                                                  42


levels from 1,000 to 5,000. Each level has 10 clusters, with six words (three target items and
three distractors) and three word meanings in each cluster (see Figure 4 for an example). Each
cluster is worth three points and each level had a maximum score of 30 points. The maximum
score of this test is 150. In each cluster, test takers are asked to match the words on the right with
the word meanings of the left. Two equivalent versions were originally created, and Form A was
used in the current study (see Webb et al., 2017, Appendix 1).
 Figure 4
 Example of the updated Vocabulary Levels Test
Vocabulary Posttests
        I tested participants’ vocabulary knowledge of the target words with a meaning recall
test, a meaning matching test, and a self-paced reading test. The tests were administered
immediately after reading (i.e., immediate posttests) and two weeks after the second reading
session (i.e., delayed posttests). Each test examined a different aspect of word knowledge: the
meaning recall and matching tests assessed participants’ ability to productively retrieve and
receptively recognize word meanings, respectively, and the self-paced reading test tapped into
how participants retrieve and integrate meaning in real-time reading.
        In the meaning recall test, participants saw a target word and were asked to type its L1
translation equivalent, L2 synonym, or a short definition in either the L1 or the L2. Example test
items are presented in Figure 5. In the meaning matching test (see Figure 6), participants saw the
24 target words on the second half of the screen. The target words’ L1 and L2 glosses, along
                                                    43


with two additional word definitions serving as distractors, were presented on the first half. Each
gloss was numbered, and participants were asked to match the target words with their meanings
by selecting from a drop-down menu the number of the corresponding gloss. For both the
meaning recall and the meaning matching tests, response to each item was mandatory.
Participants were instructed to type a question mark“?” when they did not know the answer in
the meaning recall test.
 Figure 5
 Example of the meaning recall test
                                                44


Figure 6
Example of the meaning matching test
        For the self-paced reading test, I adopted the moving window paradigm (see Figure 7 for
an illustration), which is most widely used and yields results the best correlate with gaze
durations in eye tracking (Jiang, 2013; Just et al., 1982). Each trial in the self-paced reading test
began with an asterisk. The asterisk represented the position where a sentence starts. Participants
                                                 45


proceeded through a trial by pressing the space bar. With each press, a word appeared. In a
moving window paradigm, a subsequent word appears to the right of the preceding word, and the
preceding word disappears upon the presentation of the subsequent word.
 Figure 7
 The moving window paradigm
  *
             He
                          is
                                        doing
                                             * -- -- ----- his
                                                         * -- -- ----- ---
                                                         work.
         24 context-neutral sentences were constructed for the self-paced reading test. Three
versions were created for a sentence, each containing a target word, a real word, or a nonword at
the same position, i.e., the critical position. The critical position and the two words that follow
were never in the sentence-final position to avoid the sentence wrap-up effect, i.e., longer RT to
the last word in a sentence. The nonwords used in the self-paced task, like the target words, were
obtained from the English Lexicon Project. Three counterbalanced lists were constructed such
that each version of a sentence appeared once in each list and all three conditions appeared
across the three lists (see Table 2). Trials were randomized. All of the sentences were followed
by a comprehension question to encourage participants to focus on reading for meaning. Stimuli
                                                   46


characteristics obtained from the English Lexicon Project were summarized in Table 3. Kruskal-
Wallis tests suggested that there was no significant difference between pseudowords, nonwords,
and real words in length (χ2 (2) = 2.79, p = .25, ϵ2 = .04), in the number of orthographic
neighbors (χ2 (2) = 2.59, p = .27, ϵ2 = .04), and in mean bigram frequency (χ2 (2) = 3.24, p = .20,
ϵ2 = .05). There was no significant difference between pseudowords and nonwords in their mean
RTs in the English Lexicon Project as suggested by a Mann Whitney U test (W = 374.00, p = .08,
r = .26). For a list of full stimuli in the self-paced reading test see Appendix G.
 Table 2
 Examples of Self-Paced Reading Stimuli
 Sentence                             List 1                  List 2                  List 3
 Jason saw a (critical        latpin (pseudoword)       police (real word)      royate (nonword)
 position) in front of
 the shop.
 They talked about the         remude (nonword)        valoon (pseudoword)     student (real word)
 (critical position)
 very often during
 dinner.
 He took a photo for           teacher (real word)       persh (nonword)      haron (pseudoword)
 the (critical position)
 and his wife.
                                                    47


 Table 3
 Stimuli Characteristics
 Stimuli type            Length          Orthographic            Mean bigram          Mean RT
                        (letters)           neighbor              frequency
 Pseudowords           5.50 (.72)         1.83 (1.13)          1794.13 (826.40)         842.23
                                                                                       (25.95)
 Nonwords              5.92 (.72)          1.33 (.76)          2028.95 (836.54)         859.89
                                                                                       (29.54)
 Real words           5.62 (1.31)         3.04 (3.44)          1577.85 (679.04)           NA
Procedure
       Participants were first screened by the updated Vocabulary Levels Test and the language
background questionnaire. Those who passed the 1000, 2000, and 3000 levels, i.e., scoring above
66% in each level (Nation, 1983), and who were college or graduate students in China continued
with subsequent tasks. One hundred and eighteen out of the 216 who participated in the
screening procedure were eligible to continue their participation. Eligible participants were
randomly assigned to read with L1 glosses (i.e., the L1 gloss group) or with L2 glosses (i.e., the
L2 gloss group). Given the length of the reading, I administered two reading sessions on two
consecutive days. Participants read 18 out of the 25 pages in the first session and the remaining
seven pages in the second session. Participants could choose to read on a laptop, a tablet, or a
phone. Reading instructions were presented before each session. The reading sessions were
untimed and self-paced.
       After the completion of the second reading session, participants completed the exit
questionnaire, which was followed by surprise (i.e., unannounced beforehand) immediate
                                                48


vocabulary posttests in the order of a self-paced reading test, a meaning recall test, and a
meaning matching test. Delayed posttests were administered two weeks after the second reading
session. The total duration of the study was around 3.5 hours. Participants were asked not to
discuss the study with other people to prevent friends who also participated from knowing about
the posttests. In a debriefing at the end of the study, I offered the .txt version of the graded reader
and discussed the design of the study with participants who were interested. Participants were
given 100 RMB (around 15 US dollars) after they completed the study as compensation for their
time.
        The study was implemented online. Data were collected using Qualtrics and Gorilla
Online Experiment Builder. Table 4 presents the online platforms used and estimated duration
for each task in the study.
                                                   49


 Table 4
 Study Timeline, Task Duration, and Data Collection Platforms
                                Task                         Duration              Platform
 Day 1          Updated Vocabulary Levels Test              30 minutes             Qualtrics
 Day 2                  Reading session I                   90 minutes              Gorilla
 Day 3                 Reading session II                   20 minutes              Gorilla
                       Exit questionnaire                    5 minutes              Gorilla
                     Self-paced reading test                10 minutes              Gorilla
                       Meaning recall test                  10 minutes              Gorilla
                     Meaning matching test                  10 minutes              Gorilla
 Day 17           Delayed posttests (self-paced             30 minutes              Gorilla
                reading, meaning recall, meaning
                             matching)
 Total                                                      205 minutes
Data Analysis
       In this section, I first provide details about the coding of responses and data trimming
procedures for each instrument used in the study. I then describe the data analysis approach I
used to model the data.
Reading Comprehension
       One point was awarded for each correct response to the reading comprehension
questions, with 10 points as the maximum score. The average score for the reading
comprehension is 7.65 (SD = 1.45; range: 1–10), demonstrating adequate comprehension of the
reading material by the majority of participants (see Elgort & Warren, 2014; Godfroid, Ahn, et
                                                  50


al., 2018). Eight participants were excluded from data analysis because they had a reading
comprehension score below 6, which indicated that these participants might not have fully
understood the reading material or were not paying sufficient attention to meaning during
reading.
Time on Gloss
        Time on gloss was recorded as the time difference between when a participant clicked
open a gloss and when a participant clicked the gloss again to close the gloss pop-up window.
Time on gloss referred to the total time spent on one particular gloss. If participants clicked a
gloss multiple times, the time on that gloss would be the summed duration of each click. The
original range of gloss time was between 354 milliseconds (ms) and 2234172 ms (around 37
minutes). The extreme values indicated that the recorded gloss time may not have accurately
reflected participants’ actual engagement time with the gloss, i.e., time that participants actually
spent on reading the gloss. Data cleaning was needed in this case to exclude or adjust gloss time
values that did not reflect participants cognitive processes involved in reading a gloss.
        The cutoffs for data cleaning were determined based on research on reading rates and
research using RT tasks. In a review and meta-analysis on reading rates, Brysbaert (2019)
estimated that the reading rates of L1 speakers of English were around 238 words per minute for
nonfictions (around 250 ms a word) and 260 words per minute for fictions (around 230 ms a
word). L2 speakers read slower, with a reading speed ranging from 139 to 174 words per minute
(around 344 to 430 ms a word). In self-paced reading and lexical decision tasks, the lower cutoff
of RT cleaning is usually set between 100 and 250 ms and the upper cutoff between 2500 and
3000 ms per word (Jiang, 2013; Marsden et al., 2018). With the goal to preserve as many data
points as possible, i.e., minimal data cleaning, 250 and 3000 ms were chosen as the lower and
                                                 51


upper cutoffs to identify outliers in the gloss time data: 250 ms per word was not as slow as the
reading rates of L2 speakers, yet not the fastest reading rates of L1 speakers; and 3000 ms was at
the higher end of the typical upper cutoffs in RT research. Further outliers would be taken care of
in model criticism (see the Mixed-effects Modelling section). Specifically, the upper cutoff for
each gloss was calculated as the number of words in the gloss multiplied by 3000 ms and the
lower cutoff as the number of words in a gloss multiplied by 250 ms. Values of time on a gloss
above the upper cutoff for that gloss were winsorized and replaced by the upper cutoff of that
gloss. Winsorization instead of trimming was used because extremely long gloss time still
reflected some cognitive processing of a gloss and thus should not be completely removed. Such
data cleaning procedure affected 20.64% of the data. Values of time on a gloss below the lower
cutoff were replaced by zero. This was because participants were unlikely to have processed the
gloss within such a short time. This data cleaning procedure affected 11.59% of the data.
Meaning Recall Tests
        Correct or partially correct answers were coded as 1 and others (e.g., incorrect answers
and blanks) as 0. Correct answers were defined as ones that included all the semantic features of
a target word. Partially correct answers either (1) contained some but not all of the semantic
features or (2) included all the semantic features plus some features that were not in the word
meaning. Table 5 gives examples of correct and partially correct answers as determined by the
above method. The total data points for each participant were 24. The reliability estimates
(Cronbach’s α) of the immediate and delayed recall posttests were both .91.
                                                  52


 Table 5
 Examples of Correct and Partially Correct Answers in Meaning Recall
 Correct answer types                                                      Examples
 Correct answers                                                   Likely to hurt someone1
 Partially correct answers (some but not all semantic            Violence; violent; hurt others
 features):
 Partially correct answers (added semantic features):         Domestic violence; people who are
                                                                     likely to abuse others
 1: This is also the L2 gloss given to participants.
Meaning Matching Tests
        A response was deemed correct when it matched the target word with its corresponding
meaning. Correct responses were coded as 1 and incorrect ones as 0. The total data points
collected and analyzed for each participant in this test was 24. The immediate and delayed
matching posttests had a reliability estimate (Cronbach’s α) of .92 and .90 respectively.
Self-paced Reading Tests
        Two types of data were collected from the self-paced reading test: accuracy data from the
comprehension question at the end of each trial and RT data from sentence reading. The
accuracy data from the comprehension questions were used for data cleaning. First, participants
with an accuracy rate of lower than 70% were removed. This affected two participants in the
immediate posttest and six in the delayed posttest. Subsequently, trials where participants did not
correctly answer the comprehension questions were also removed.
        RT data to four regions in the self-paced reading test were examined, namely the critical
position (i.e., where the pseudowords, nonwords, or real words appeared, depending on the
                                                 53


condition), the word before the critical position (henceforth position 0), and the two words that
follow (henceforth position 2 and position 3). RT to position 0 was used for a manipulation
check: no significant difference should be found in RTs at this position in the three conditions
(i.e., pseudoword, nonword, and real word) to make sure that participants had the same starting
point before reaching the critical position. The critical position, positions 2, and 3 were selected
because difficulty of word meaning retrieval and integration may not manifest in RTs to the
critical position but to the immediate word that follows, i.e., the spill-over effect; for L2 learners
in particular, the spill-over effect may also occur at the second word after the critical position
due to lower processing efficiency (Jiang, 2013). The critical RT comparisons were between the
pseudoword and nonword conditions, and between the pseudoword and real word conditions. If
learners were able to retrieve and integrate the meanings of the pseudowords in a reading task,
the RTs to the critical position, position 2, or 3 in the pseudoword condition should be faster than
those in the nonword condition, in which the learners would read a nonword they had never
encountered. The comparison of RTs between the pseudoword and real word conditions would
tell us whether the newly-learned pseudowords could be retrieved as fluently as familiar real
words.
         RT trimming was implemented using the trimr package (version 1.1.1; Grange, 2015) in
R. Figure 8 shows the distribution of RTs in the immediate and delayed self-paced reading
posttests on a natural log scale. The RT distributions in the three conditions were slightly
different: RTs in the real word condition concentrated between 6 and 7 on the log scale while
there were larger proportions of RTs greater than 7 in the nonword and pseudoword conditions.
In this case, using absolute values alone may be inappropriate for cleaning the RT data.
Therefore, I chose to use standard deviation, along with absolute values, to identify outliers.
                                                   54


Specifically, I chose 250 ms as the low cutoff. Responses with RTs below this value were
removed because they were too fast to have accurately reflected the genuine reading process. In
addition, responses with RTs 2.5 SD away from the mean RT of a condition were excluded. The
method of identifying outliers based on SD took into account the different distributions of RTs in
each condition. RT cleaning affected 4.06% and 7.53% of the data in the immediate and delayed
self-paced tests respectively.
Figure 8
Reaction Time Distribution in the Immediate and Delayed Self-Paced Reading
Posttests
Mixed-effects Modelling
        To answer the research questions, I used mixed-effects modelling to analyze gloss time
data recorded during participants’ reading, accuracy data from the meaning matching and recall
                                                55


posttests, and RT data from the self-paced reading test. In all mixed-effects models, I used
treatment coding for the categorical variable of group (L1 and L2 gloss groups), with L1 gloss
group as the reference. All continuous independent variables (i.e., target word FoO, vocabulary
size, and gloss time) were standardized and mean-centered using the scale() function in R to
reduce collinearity.
         A maximal model was first built, which included (1) main effects of theoretical interest;
(2) justifiable interactions between the main effects of interest (see The Present Study section);
and (3) maximal random-effects structure justified by the data (Barr et al., 2013). For details of
the initial maximal models for each analysis, see Appendix H. Two steps were involved in model
selection (Gries, 2021). The first step was to determine the random-effects structure: when there
was a convergence issue, the random-effects structure was simplified by dropping by-subject
random slopes first (Barr et al., 2013), followed by by-item random slopes; then, I continued to
simplify the random-effects structure by dropping elements that accounted for the least
variability (i.e., those having the smallest standard deviation in the random effects table in R
output) one by one. Akaike Information Criterion (AIC) was used to select models with the best
random-effects structure, i.e., the model with the smallest AIC. After the random-effects
structure was determined, the second step was to select the best fixed-effects structure. This step
involved excluding insignificant interactions, followed by dropping interactions that did not
improve model fit. Main effects of theoretical interests did not participate in model selection and
were all kept. The best model in the second step would be the model with the smallest AIC. 95%
Wald confidence intervals for estimates in the final models were calculated through the
confint.merMod() function.
                                                 56


        After model selection, the selected model went through model diagnostics. First,
residuals of the final models were checked for outliers. Observations with a residual greater than
2.5 SDs away from the mean were removed, after which the final models were refitted. Second,
VIFs as a test for multicollinearity of the final models were checked using the performance
package (Lüdecke et al., 2020). A VIF value under 10 would indicate that there was no
significant multicollinearity (Hair et al., 1995).
        RQ1 looked into gloss language effect on learners’ gloss engagement, and how the gloss
language effect on gloss engagement was moderated by learners’ vocabulary size. The gloss time
data contained a large number of zeros (n = 516, representing 19% of the data). Data with such
distribution, i.e., a portion of zeros, plus a continuous non-zero part, is called semicontinuous
data and is common in biomedical and econometric research. Because zeros and non-zero data
points are often viewed as results of two distinct processes, a two-part mixed model has been
proposed to analyze such data (e.g., Olsen & Schafer, 2001; Tu & Zhou, 1999). The two-part
mixed model fits zero and non-zero data separately. Zeros are treated as binary data (i.e., zeros
vs. non-zeros) and are modelled with a generalized linear mixed model. Non-zero data are
viewed as continuous and are analyzed with a linear mixed model. In the current study, zeros and
non-zero data can be seen as representing two processes. The former represented a lack of
processing on a gloss. The latter denoted the actual amount of processing on a gloss. Therefore, I
followed the two-part mixed model and analyzed the data in two ways. The first analysis
examined gloss language effect on whether participants processed a gloss or not, and how
vocabulary size may moderate this gloss language effect. To do so, I created a binary variable
called ‘gloss checking’, where I coded the zeros in the gloss time data as 0, representing ‘no
gloss processing’, and the non-zeros as 1, representing ‘gloss processing’. A generalized linear
                                                   57


mixed-effects model was built, with gloss processing as the dependent variable, and group (L1
vs. L2 gloss) and vocabulary size as the main effects of theoretical interest. In the second
analysis, I removed zeros from the gloss time data. I then logged transformed the data to bring its
distribution closer to normal and fit a linear mixed-effects model to the log transformed gloss
time data that did not contain zero. Group (L1 vs. L2 gloss) and vocabulary size were the main
effects of theoretical interest.
         RQ2 examined the effect of gloss language on short-term and long-term learning, and the
moderating effects of target word FoO, vocabulary size, and engagement on gloss language
effect. To answer RQ2, accuracy data from the meaning matching and recall posttests, and RT
data from the self-paced reading test were analyzed. First, four generalized linear mixed-effects
models were built, each for accuracy data from the meaning matching immediate and delayed
posttests, and meaning recall immediate and delayed posttest. In these models, correct responses
were coded as 1 and incorrect ones as 0. The main effects of theoretical interest in these models
were group (L1 vs. L2 gloss), FoO, vocabulary size, and time on gloss.
         For the self-paced reading tests, data from each group (i.e., L1 and L2 gloss) and from
each test timing (i.e., immediate and delayed) were analyzed separately. Data from each position
(i.e., positions 0, critical & 2) were also analyzed in different models. In other words, separate
analyses were conducted by group, position, and test timing. Based on the descriptive statistics
of the RT data, RTs to position 3 were not analyzed. This was because the descriptive statistics
indicated that the RT difference between conditions was not likely to be found at this position,
i.e., the spill-over effect was not likely to have spread to this position. For each group and each
test timing, a linear mixed-effects model was built for RTs each to the critical position and
position 2 (i.e., the position following the critical position). The main effects of theoretical
                                                  58


interest in these mixed-effects models were condition (i.e., nonword, pseudoword, and real
word), FoO, vocabulary size, and time on gloss. In addition, a manipulation check was
conducted to see if the RT difference between conditions at position 0 was significant. The main
effect of interest was condition in the mixed-effects models for the manipulation check. In all
models for the self-paced reading tests, RTs were log transformed to bring the distribution closer
to normal. In addition, condition was treatment coded, with the pseudoword condition being the
baseline. This allowed the crucial comparisons between the real word and pseudoword
conditions, and between the nonword and the pseudoword conditions to be presented and
interpreted in a more straightforward manner.
                                                59


                                     CHAPTER 3: RESULTS
        This chapter first reports descriptive statistics of the dependent variables, namely time on
gloss, number of correct answers in the meaning matching and meaning recall posttests
(immediate and delayed), and reaction times (RTs) to the four regions of interests in the
immediate and delayed self-paced reading tests. Next, the chapter presents the final mixed-
effects models from the analyses engaged to answer the research questions.
Descriptive Statistics
Time on Gloss
        After gloss time cleaning (see Time on Gloss in the Data Analysis section), the average
time a participant spent on a gloss was 4171 milliseconds (ms) (SD = 5426.67; range: 0 – 33000)
in the first language (L1) gloss group and 6069 ms (SD = 5472.83; range: 0 – 30000) in the
second language (L2) gloss group. Figure 9 plots the time on gloss by group. The histogram
shows that the L1 gloss group had a lot more zeros, indicating that fewer participants spent time
on the glosses in the L1 gloss group than in the L2 gloss group.
Figure 9
Time on Gloss by Group
                                                  60


Meaning Recall Tests
        Table 6 shows the total number of correct responses (maximum = 24) in the meaning
recall immediate and delayed posttests. Participants in both groups correctly recalled the
meanings of less than 50% of the target words. The L1 gloss group did better than the L2 gloss
group in both the immediate and delayed posttests.
 Table 6
 Total Number of Correct Responses in Meaning Recall Posttests by Group
                       Immediate posttest                           Delayed posttest
                  Mean (SD)             95% CI              Mean (SD)                95% CI
 L1 Gloss         9.67 (6.68)        [7.94, 11.39]          8.10 (6.59)           [6.39, 9.80]
 L2 Gloss         7.84 (5.20)         [6.36, 9.32]          6.30 (4.75)           [4.95, 7.65]
Meaning Matching Tests
        Table 7 presents the total number of correct responses (maximum = 24) in the meaning
matching immediate and delayed posttests. Both groups did better in receptive meaning
recognition as measured by the meaning matching tests than in productive meaning recall as
measured by the recall tests. The L1 gloss group correctly recognized around half (51.96%) of
the target words in the immediate posttest but less than half (33.75%) in the delayed posttest. The
L2 gloss group recognized less than half in both the immediate (42.33%) and delayed (37.67%)
posttests. As in the meaning recall tests, the L1 gloss group performed better than the L2 gloss
group in both the immediate and delayed posttests.
                                                 61


 Table 7
 Total Number of Correct Responses in Meaning Matching Posttests by Group
                         Immediate posttest                          Delayed posttest
                    Mean (SD)             95% CI             Mean (SD)                  95% CI
 L1 Gloss          12.47 (6.68)       [10.74, 14.19]        10.38 (6.28)             [8.75, 12.02]
 L2 Gloss          10.16 (6.18)        [8.41, 11.91]         9.04 (5.38)             [7.50, 10.58]
Self-paced Reading Test
        The total number of observations included in the self-paced reading tests analyses was
8612 in the immediate posttest and 8000 in the delayed posttest. Table 8 and Table 9 summarize
the means and standard deviations (in parenthesis) of RTs in the immediate and delayed posttests
respectively. Figure 10 presents RTs of each position, and Figure 11 focuses on RTs to the
critical position and position 2. In both the immediate and delayed posttests, positions 0 and 3
saw no large differences in RTs among the three conditions. That is, the participants started at a
similar place before entering the critical position, where a nonword, a pseudoword, or a real
word was inserted. In addition, the spill-over effect seemed to be at position 2, instead of at
position 3. Focusing on the critical position, in both the immediate and delayed posttests,
participants’ RTs to nonwords and pseudowords were much larger than those to the real words,
indicating processing difficulties when reading unfamiliar or newly learned words; the difference
in RTs to nonwords and pseudowords was small at this position. For position 2, where a spill-
over effect will likely happen, i.e., where processing difficulty is reflected, different RT patterns
can be found for the L1 gloss and L2 gloss groups in the immediate posttest. For the L1 gloss
group, RTs to the pseudowords seemed to be much higher than those to the nonwords. It seems
that participants recovered faster from processing difficulties after reading a nonword than after
                                                   62


reading a pseudoword at the critical position. For \e L2 gloss group, RTs to pseudowords and
nonwords were similar. In the delayed posttests, the L1 gloss group’s RTs at position 2 were
slightly faster in the pseudoword than in the nonword conditions, while the L2 gloss groups read
the stimuli in both conditions at a similar speed. RTs in the real word condition were much lower
than those in the pseudoword and nonword conditions in both groups in the immediate and
delayed posttests. RTs in the delayed posttest were lower in general than those in the immediate
posttest, indicating a degree of familiarity with the stimuli when participants took the test for a
second time.
 Table 8
 Reaction Time (ms) by Position and Condition in Self-paced Reading Immediate Posttest
                                   L1 gloss                                L2 gloss
                        Real     Pseudoword     Nonword         Real     Pseudoword     Nonword
                        word                                    word
 Position 0            536.67       573.53        530.13       603.44       595.15       586.16
                      (226.35)     (379.08)      (252.22)     (402.36)     (349.47)     (281.49)
 Critical              720.08      1194.23       1155.38       783.90      1230.91      1263.56
 position             (513.11)     (949.83)      (861.04)     (546.55)     (852.81)     (844.87)
 Position 2            614.27       797.46        685.72       671.83       799.66       778.96
                      (386.34)     (700.50)      (395.40)     (455.79)     (523.25)     (461.31)
 Position 3            585.54       596.23        584.66       606.38       663.41       634.38
                      (431.37)     (429.66)      (321.19)     (329.01)     (417.57)       (370)
                                                  63


Table 9
Reaction Time (ms) by Position and Condition in Self-paced Reading Delayed Posttest
                               L1 gloss                             L2 gloss
                   Real      Pseudoword    Nonword        Real    Pseudoword    Nonword
                   word                                   word
Position 0        454.86        482.15      474.87       490.51      496.19       470.01
                 (194.56)      (275.26)    (244.25)     (234.91)    (267.76)     (190.64)
Critical          549.20        820.58      818.27       621.48      936.25       892.61
position         (296.34)      (667.65)    (717.82)     (334.93)    (793.12)     (661.65)
Position 2        505.53        635.09      697.29       529.96      685.55       701.12
                 (252.91)      (398.60)    (532.31)     (251.56)    (343.02)     (443.33)
Position 3        472.16        507.98      483.73       506.52      554.95       557.59
                 (219.27)      (238.55)    (178.96)     (261.34)    (301.98)     (326.89)
                                            64


Figure 10
RTs in Self-paced Reading Tests by Group, Position, and Condition
                                            65


 Figure 11
 RT of the Critical Position and Position 2
RQ1
        RQs 1a and b examined gloss language effect on gloss engagement and the potential
moderating role of vocabulary size on the gloss language effect. I first investigated whether gloss
language influenced the likelihood of participants processing a gloss. Table 10 showed that the
L2 gloss group were significantly more likely than the L1 gloss group to process a gloss. While
vocabulary size did not moderate the gloss language effect, it predicted gloss processing:
participants with a larger vocabulary size were more likely to spend time processing a gloss.
                                               66


         The analysis on gloss time showed a different pattern (see Table 11). Neither Group nor
vocabulary size were significant. The L1 gloss and L2 gloss groups spent similar amount of time
on the glosses, i.e., no gloss language effect. Vocabulary size did not have an effect on gloss
time.
  Table 10
  Mixed-effects Model for Gloss Engagement: zero vs. non-zero gloss time data
                                Fixed effects                           Random effects
                                                              By participant          By item
                  Estimate [95% SE         z        P        Variance     SD      Variance    SD
                        CI]
  Intercept          1.52 [.69,     .42 3.60 <.001***           5.67     2.38       1.64     1.28
                       2.35]
  Group (L2         3.80 [2.49,     .67 5.68 <.001***                               1.88     1.37
  gloss)               5.11]
  Vocabulary      .59 [.08, 1.10]   .26 2.27      .02*
  size
                                                 67


 Table 11
 Mixed-effects Model for Gloss Engagement: Time on Gloss
                                 Fixed effects                           Random effects
                                                                 By participant       By item
                 Estimate [95%       SE      t        p          Variance SD       Variance SD
                        CI]
 Intercept          8.24 [8.02,      .11  76.05 <.001***            .26     .51       .16      .42
                       8.45]
 Group (L2        .22 [-.04, .48]    .13   1.64      .11                              .32      .57
 gloss)
 Vocabulary       -.02 [-.12, .08]   .05   -.34      .74
 size
RQ2
        RQs 2a and b asked how gloss language affected learning and how target word frequency
of occurrence (FoO), learners’ vocabulary size, and learner engagement moderated the effects of
gloss language on learning. Learning was measured by three tests, namely meaning matching,
meaning recall, and self-paced reading tests, at two time points, i.e., immediate and delayed. In
the following sections, I present findings by test type and test time.
Meaning Recall Test
        Table 12 and Table 13 present the final mixed-effects model for accuracy data in the
meaning recall immediate and delayed posttests respectively. For RQ2a, in both the immediate
and delayed posttests, the main effect of group was not significant, indicating that when all other
variables were at their mean standardized values, the two gloss language groups did not differ
                                                68


significantly in their performance in the meaning recall posttests. In both the immediate and
delayed posttests, target word FoO and participants’ time on reading the glosses were significant
predictors of learning while the main effect of vocabulary size was not significant.
 Table 12
 Meaning Recall Immediate Posttest Mixed-effects Model
                                Fixed effects                           Random effects
                                                              By participant         By item
                    Estimate       SE       z        p        Varianc      S       Varianc    S
                    [95% CI]                                      e       D           e      D
 Intercept        -.81[-1.48, -    .34      -      .02*         4.29      2.         .85     .9
                       0.14]              2.38                            07                  2
 Group (L2            -.50 [-      .42      -       .24
 gloss)             1.32. .33]            1.18
 FoO                1.32 [.93,     .20    6.61 <.001***
                       1.72]
 Vocabulary             .30        .21    1.42      .15
 size              [-.11, .72]
 Time on                .33        .11    3.08   .002**
 gloss              [.12, .55]
 Group *           -.36 [-.63,     .14      -      .01*
 Time on               -.07]              2.51
 gloss
                                                69


Table 13
Meaning Recall Delayed Posttest Mixed-effects Model
                              Fixed effects                   Random effects
                                                     By participant       By item
                   Estimate       SE     z      p    Variance    SD    Variance SD
                   [95% CI]
Intercept        -1.18 [-1.75,    .29    -  <.001***   3.07     1.75      .60     .77
                      -.61]            4.03
Group (L2            -.51 [-      .36    -     .16
gloss)             1.22, .20]          1.42
FoO                 .90 [.55,     .18 5.02 <.001***
                      1.25]
Vocabulary      .26 [-.23, .75] .25 1.03       .30
size
Time on gloss    .39 [.19, .60]   .10 3.76 <.001***
Group * FoO     .07 [-.17, .30] .12     .55    .58
Group *              -.48 [-      .36    -     .19
Vocabulary         1.20, .23]          1.32
size
FoO *            .26 [.09, .42]   .09 3.00   .003**
Vocabulary
size
                                            70


 Table 13 (cont’d)
 Group * Time         -.35 [-.63,    .14     -      .01*
 on gloss                -.08]             2.50
 FoO * Time on            -.24       .12     -       .05
 gloss               [-.48, .003]          1.93
 Group * FoO *            -.19       .12     -       .10
 Vocabulary           [-.43. .04]          1.64
 size
 Group * FoO *      .42 [.09, .75]   .17 2.47       .01*
 Time on gloss
        For RQ2b, several interactions were found. First, in the immediate recall test, time on
gloss significantly moderated the effect of gloss language. Specifically, the more time
participants spent on a gloss, the larger the difference between the L1 gloss and L2 gloss groups
was, due largely to increase in the accuracy of the L1 gloss group (see Figure 12). Second, in the
delayed posttest, the effect of gloss language was moderated by both time on gloss and target
word FoO. Figure 13 plots the three-way interaction between group, time on gloss, and FoO.
Figure 13a shows the gloss time effect for each FoO range by group. Figure 13b juxtaposes
group performance as affected by gloss time by FoO range. Looking at Figure 13a, for the L1
gloss group, the positive effect of time on gloss was the largest when FoO was low. The positive
effect of time on gloss became smaller when FoO increased and even became negative when
FoO reached its largest values. In contrast, for the L2 gloss group, the effect of time on gloss first
increased as FoO became larger. The gloss time effect became the largest when FoO was at mid-
                                                  71


range (z score: 0.5 – 1.5; raw FoO: 10 – 13). The effect of gloss time then decreased as FoO
increased, but remained positive. Figure 13b indicated that the gloss language effect varied by
time on gloss and FoO. When FoO was relatively low (z score: -1.5 – -0.5; -0.5 – 0.5; raw FoO:
1 – 3; 4 –9), the L1 gloss group outperformed the L2 gloss group, and this gloss language effect
became larger as time on gloss increased. When FoO was higher (z score: 0.5 – 1.5; 1.5 – 2.5;
2.5 – 3.5; raw FoO range: 10 –13; 14 – 19; 20 – 27), the L2 gloss group seemed to be able to
outperform the L1 gloss group when participants spent more time on the glosses. It seems that in
the delayed posttest, the L2 gloss group may recall the meanings of more words if participants
had sufficient encounters with the target words and sufficient reading time on the L2 glosses.
The three-way interaction, however, should be interpreted with caution due to the small number
of target words with a raw FoO above 10 (n = 3).
  Figure 12
  Group * Time on Gloss Interaction (Meaning Recall Immediate Posttest)
                                               72


Figure 13
Group * Time on gloss * FoO Interaction (Meaning Recall Delayed Posttest)
   a
   b
                                            73


Meaning Matching Test
        The final mixed-effects models for the meaning matching immediate and delayed
posttests were presented in Table 14 and Table 15. For RQ2a, in both posttests, results showed
no significant difference between the L1 gloss and L2 gloss groups, i.e., no significant gloss
language effect, when other variables were held at their mean standardized values. FoO,
vocabulary size, and the amount of time spent on the glosses all had positive effects on learning.
For RQ2b, in the immediate posttest, the variable of group did not interact with any other
variables. In the delayed posttest, there was a borderline significance for the interaction between
group and time on gloss (RQ2b). As Figure 14 shows, the more time the L1 gloss group spent on
glosses, the higher the accuracy was, leading to larger group difference, i.e., larger gloss
language effect. The group and time on gloss interaction pattern in the meaning matching
delayed posttest was similar to that in the meaning recall immediate posttest.
                                                 74


Table 14
Meaning Matching Immediate Posttest Mixed-effects Model
                            Fixed effects                       Random effects
                                                        By participant      By item
             Estimate [95%      SE      z       p     Variance      SD   Variance SD
                     CI]
Intercept     .09 [-.51, .68]   .30    .29     .78       3.41      1.85     .69     .83
Group (L2          -.63 [-      .37     -      .09
gloss)           1.36, .11]           1.67
FoO           .98 [.63, 1.33]   .18   5.45 <.001***
Vocabulary    .57 [.07, 1.08]   .26   2.21    .03*
size
Time on        .22 [.08, .35]   .07   3.04   .002*
gloss
Group *            -.71 [-      .38     -      .06
Vocabulary       1.45, .02]           1.90
size
                                           75


Table 15
Meaning Matching Delayed Posttest Mixed-effects Model
                             Fixed effects                     Random effects
                                                      By participant       By item
                  Estimate       SE     z      p      Variance    SD    Variance SD
                 [95% CI]
Intercept            -.38        .25    -     .13       2.27     1.51      .43     .66
                 [-.87, .11]          1.54
Group (L2          -.39 [-       .31    -     .21
gloss)           1.00, .22]           1.26
FoO               .86 [.55,      .16 5.48 <.001***
                    1.16]
Vocabulary     .43 [.01, .85]    .21 2.02    .04*
size
Time on gloss  .21 [.02, .39]    .09 2.15    .03*
Group * FoO          -.02        .12 -.14     .88
                 [-.24, .21]
Group *            -.58 [-       .31    -     .06
Vocabulary       1.20, .03]           1.87
size
FoO *          .19 [.03, .35]    .08 2.37    .02*
Vocabulary
size
                                           76


 Table 15 (cont’d)
 Group * Time        -.26 [-.51,   .13   -     .02*
 on gloss               -.01]          2.02
 FoO * Time on       -.27 [-.51,   .12   -     .02*
 gloss                  -.05]          2.37
 Group * FoO *           -.08      .11 -.69     .49
 Vocabulary          [-.29, .14]
 size
 Group * FoO *     .29 [-.03, .60] .16 1.80     .07
 Time on gloss
  Figure 14
  Group * Time on Gloss Interaction (Meaning Matching Delayed Posttest)
Self-paced Reading Test
       The manipulation check indicated that RTs for position 0 did not differ significantly
between the pseudoword and the nonword conditions, and between the pseudoword and real
                                             77


word conditions for the L1 and L2 gloss groups in both the immediate and delayed posttests (see
details of the analyses in Appendix H). The results of the manipulation check suggested that
participants started at roughly the same place before encountering the pseudoword, nonword, or
real word at the critical position. In other words, effects found at the critical position and/or at
position 2 may be attributed not to pre-existing processing differences at position 0, but to
differences in the processing of the nonwords, pseudowords, and real words at the critical
position. In what follows, I present results of the immediate posttest first, followed by those of
the delayed posttest. For each test time point, I present the results of the L1 gloss group,
including results of the critical position and position 2, which is then followed by the results of
the L2 gloss group.
        Immediate Posttest, L1 Gloss. Table 16 and Table 17 include results for the critical
position and position 2 respectively for the L1 gloss group in the immediate posttest. At the
critical position, RTs for the pseudowords did not differ significantly from those for nonwords,
while the pseudoword RTs were significantly slower than those for the real words. The
comparison of RTs for these three types of stimuli reflected participants’ processing difficulties
of nonwords and pseudowords. There was a significant interaction between condition and time
on gloss, showing that the more time participants spent on a gloss, the larger the RT difference
was between the pseudowords and the real words, owing primarily to the speeding up in reacting
to the real words (see Figure 16).
        At position 2, RTs for the pseudoword condition were significantly higher than those for
both the nonword and the real word conditions. However, when participants encountered a
pseudoword more times in the reading (i.e., higher target word FoO), RTs for the pseudoword
conditions decreased (i.e., significant main effect of FoO). In turn, the RT difference between the
                                                  78


pseudoword and real word conditions decreased (i.e., significant interaction between condition
and FoO). When FoO reached the highest (i.e., 27 times), the RTs for the real word and to the
pseudoword conditions were close (see Figure 15). Time on gloss also interacted with condition,
albeit in a different direction. The more time participants spent on a gloss, the larger the RT
difference was between the pseudoword and the real word conditions, and between the
pseudoword and the nonword conditions, mostly due to the decrease in RTs for the real word
condition and increase in RTs for the nonword condition (see Figure 17).
                                                79


Table 16
Self-paced reading immediate posttest: Critical Position, L1 Gloss Group
                              Fixed effects                          Random effects
                                                             By participant     By item
               Estimate [95%      SE      t        p         Variance SD     Variance SD
                      CI]
Intercept        6.83 [6.71,     .06 111.81 <.001***           .15      .38     .01     .11
                    6.95]
Nonword        .01 [-.06, .08]   .04     .32      .75
Real word      -.41[-.48. -.34]  .04 -11.69    <.001***
FoO           -.003 [-.06, .04] .03     -.12      .90
Vocabulary     -.04 [-.14, .06]  .05    -.75      .46
size
Time on        .04 [-.02, .10]   .03    1.20      .23
gloss
Nonword *     -.004 [-.08, .07] .04     -.09      .92
Time on
gloss
Real word *      -.08 [-.15,      .04   -2.26    .02*
Time on             -.01]
gloss
                                              80


Table 17
Self-paced reading immediate posttest: Position 2, L1 Gloss Group
                               Fixed effects                        Random effects
                                                            By participant     By item
               Estimate [95% SE            t       p        Variance SD     Variance SD
                      CI]
Intercept        6.51 [6.44,      .04 177.30     <.001        .03      .18     .01     .09
                     6.58]                        ***
Nonword           -.08 [-.13,     .03   -2.58    .01*
                     -.02]
Real word         -.21 [-.26,     .03   -7.24  <.001***
                     -.15]
FoO               -.08 [-.14,     .03   -2.91    .01*
                     -.03]
Vocabulary     -.02 [-.07, .04] .03      -.59     .56
size
Time on        .05 [.002, .10]    .02    2.02    .04*
gloss
Nonword *       .07 [.01, .12]    .03    2.32    .02*
FoO
Real word *    .06 [.003, .12]    .03    2.07    .04*
FoO
                                              81


Table 17 (cont’d)
Nonword *      -.05 [-.11, .01] .03  -1.79      .07
Time on
gloss
Real word *       -.07 [-.12,   .03  -2.25     .02*
Time on              -.01]
gloss
Figure 15
Condition * FoO Interaction (Position 2, Immediate Self-paced Reading Test, L1 Gloss
Group
                                            82


Figure 16
Time on Gloss * Condition Interaction (Critical Position, Immediate Self-paced Reading Test,
L1 Gloss Group
                                             83


 Figure 17
 Condition * Time on Gloss Interaction (Position 2, Immediate Self-paced Reading Test, L1
 Gloss Group
        Immediate Posttest, L2 Gloss. Results for the critical position and position 2 are
presented in Table 18 and Table 19 respectively. Like the L1 gloss group, participants in the L2
gloss group reacted significantly slower to the pseudowords than to the real words and reacted
similarly to the nonwords at the critical position, reflecting processing difficulties in reading
nonwords and pseudowords. Unlike the L1 gloss group, RTs of the L2 gloss group to the three
types of stimuli were not moderated by any other variables. RTs at position 2 showed a similar
pattern to those at the critical position. That is, RTs in the pseudoword condition were similar to
those in the nonword condition and were significantly higher than those in the real word
condition. No other variables or interactions were found significant.
                                                    84


Table 18
Self-paced reading immediate posttest: Critical Position, L2 Gloss Group
                                Fixed effects                        Random effects
                                                             By participant     By item
               Estimate [95%      SE       t       p         Variance SD     Variance SD
                      CI]
Intercept         6.91 [6.77,     .07 105.06 <.001***          .15      .38     .01     .11
                     7.04]
Nonword        .01 [-.06, .09]    .04     .39     .70
Real word     -.42 [-.49, -.34] .04 -11.27     <.001***
FoO             -.02[-.07, .04]   .03    -.63     .53
Vocabulary      .03[-.08, .14]    .06     .52     .61
size
Time on         .01[-.03, .04]    .02     .26     .80
gloss
                                              85


 Table 19
 Self-paced reading immediate posttest: Position 2, L2 Gloss Group
                                 Fixed effects                           Random effects
                                                                By participant         By item
                  Estimate [95%      SE       t        P        Variance SD         Variance SD
                        CI]
 Intercept          6.52 [6.43,      .05 142.29 <.001***            .06     .24        .01     .09
                       6.61]
 Nonword         .003 [-.06, .06]    .03     .09      .93
 Real word       -.16 [-.22, -.10]   .03   -5.15  <.001***
 FoO              -.04 [-.08, .01]   .02   -1.53      .14
 Vocabulary       -.01 [-.08, .06]   .04    -.31      .76
 size
 Time on          .02 [-.01, .05]    .02    1.15      .25
 gloss
        Delayed Posttest, L1 Gloss. RTs at the critical position (see Table 20) showed similar
patterns to those in the immediate posttest: significantly faster response to real words than to
pseudowords, and similar RTs to nonwords and pseudowords. A significant three-way
interaction was found for condition, FoO, and time on gloss. Figure 18 plots this interaction, with
each panel representing an FoO range. The interaction indicated that when FoO was low (range
in z scores: -1.5 – -0.5; range in raw FoO: 1 – 3), the more time participants spent on a gloss, the
slower they responded to the nonwords and pseudowords. As FoO increased, the relationship
between time on gloss and RTs changed: more time on a gloss led to faster response to
                                                 86


pseudowords. When FoO was between -0.5 and 0.5 (raw FoO: 4 – 9), RTs to pseudowords got
closer to those to real words as time on gloss increased. As FoO continued to increase (above z-
score FoO of 0.5 and raw FoO of 9), sufficient time spent reading glosses may facilitate the
processing of pseudowords, which eventually became faster than that of the real words. Again,
because there were only three target words with a raw FoO above 10, this three-way interaction
should be interpreted cautiously.
                                                87


Table 20
Self-paced reading delayed posttest: Critical Position, L1 Gloss Group
                               Fixed effects                         Random effects
                                                             By participant     By item
               Estimate [95% SE            T        p        Variance SD     Variance SD
                      CI]
Intercept        6.45 [6.34,      .06 113.50     <.001          .13     .36     .01     .08
                     6.57]                        ***
Nonword              -.004        .04    -.11      .91
                  [-.07, .07]
Real word         -.27 [-.34,     .04   -7.69  <.001***
                     -.20]
FoO            .02 [-.05, .08]    .03     .53      .60
Vocabulary     -.03 [-.13, .07] .05      -.64      .52
size
Time on        -.02 [-.08, .04] .03      -.68      .50
gloss
Nonword *      -.06 [-.13, .02] .04     -1.43      .15
FoO
Real word *    -.03 [-.10, .04] .04      -.76      .45
FoO
Nonword *       .10 [.03, .18]    .04    2.88    .004*
Time on
gloss
                                              88


Table 20 (cont’d)
Real word *           -.001     .04  -.04      .97
Time on            [-.07, .07]
gloss
FoO * Time         -.14 [-.23,  .04 -3.27   .001**
on Gloss              -.06]
Nonword *       .09 [-.03, .20] .06  1.46      .14
FoO * Time
on gloss
Real word *      .14 [.03, .26] .06  2.44     .01*
FoO * Time
on gloss
                                          89


 Figure 18
 Condition * FoO * Time on Gloss Interaction (Critical Position, Delayed Self-
 paced Reading Test, L1 Gloss Group)
                                                       Frequency Range
                                                       [-1.5, 0,5]
                                                       (-0.5. 0.5]
                                                       (0.5, 1.5]
                                                       (1.5, 2.5]
                                                       (2.5, 3.5]
       At position 2, no significant difference was found in RTs between the nonword and the
pseudoword conditions. Again, the RT difference was significant between the pseudoword and
real word conditions. No variables moderated the RT difference among the three conditions (see
Table 21 for results).
                                                90


Table 21
Self-paced reading delayed posttest: Position 2, L1 Gloss Group
                              Fixed effects                         Random effects
                                                            By participant      By item
               Estimate [95% SE           t        p        Variance SD     Variance SD
                      CI]
Intercept        6.31 [6.26,     .04 176.99      <.001        .04      .20     .003     .06
                     6.40]                        ***
Nonword               .06        .03    1.99      .05
                 [.001, .111]
Real word         -.19 [-.25,    .03   -6.90   <.001***
                     -.14]
FoO            -.02 [-.05, .02] .02    -1.01      .33
Vocabulary     -.02 [-.08, .04] .03     -.54      .59
size
Time on              -.006       .02    -.26      .79
gloss             [-.05, .04]
Nonword *             -.05       .03   -1.95      .05
Time on         [-.11. .0004]
gloss
Real word *    .04 [-.01, .10]   .03    1.46      .15
Time on
gloss
                                             91


        Delayed Posttest, L2 Gloss. Like the L1 gloss group, at the critical position, the L2 gloss
group reacted significantly faster to real words than to pseudowords, but the pseudoword RTs
did not differ significantly from nonwords RTs (see Table 22). This RT pattern was moderated
by participants’ vocabulary size. Specially, the larger participants’ vocabulary size was, the
faster they reacted to nonwords, eventually making RTs for nonwords shorter than those for
pseudowords. Vocabulary size did not have a significant effect on RTs for pseudowords (i.e., no
main effect). Figure 19 plots the Vocabulary Size * Condition interaction.
                                                92


Table 22
Self-paced reading delayed posttest: Critical Position, L2 Gloss Group
                               Fixed effects                         Random effects
                                                             By participant      By item
               Estimate [95% SE            t       p         Variance SD     Variance SD
                      CI]
Intercept         6.57 [6.45,     .06 101.42     <.001          .15     .38     .004     .06
                     6.70]                        ***
Nonword        -.02 [-.10, .06] .04      -.45     .65
Real word         -.27 [-.35,     .04   -6.93  <.001***
                     -.19]
FoO            .002 [-.04, .04] .02       .10     .92
Vocabulary      .05 [-.07, .17]   .06     .77     .44
size
Time on         .03 [-.01, .07]   .02    1.41     .16
gloss
Nonword *         -.10 [-.17.     .04   -2.47    .01*
Vocabulary           -.02]
size
Real word *    -.07 [-.14, .01] .04     -1.82     .07
Vocabulary
size
                                              93


 Figure 19
 Condition * Vocabulary Size Interaction (Critical Position, Delayed Self-paced
 Reading Test, L2 Gloss Group
        Table 23 presents results for position 2. At this position, the L2 gloss group showed
significantly shorter RTs when responding to the real word condition than to the pseudoword
condition. RTs were similar in the nonword and the pseudoword conditions. Time on gloss
significantly moderated RT difference among the three conditions (see Figure 20): greater time
on gloss saw slower responses in the nonword and the real word conditions, while RTs in the
pseudoword condition remained relatively constant (i.e., no significant main effect of time on
gloss).
                                                 94


Table 23
Self-paced reading delayed posttest: Position 2, L2 Gloss Group
                                Fixed effects                       Random effects
                                                            By participant      By item
               Estimate [95% SE             t       p       Variance SD     Variance SD
                       CI]
Intercept         6.43 [6.36,      .03 189.26     <.001       .03      .16     .003     .06
                      6.50]                        ***
Nonword               -.007        .03    -.24     .81
                   [-.06, .05]
Real word          -.24 [-.30,     .03   -8.64  <.001***
                      -.19]
FoO            .009 [-.02, .04] .02        .54     .60
Vocabulary     -.01 [-.06, .04] .03       -.41     .69
size
Time on              -.0002        .03    -.01     .99
gloss              [-.05, .05]
Nonword *        .06 [.01. .12]    .03    2.16    .03*
Time on
gloss
Real word *     .06 [.003, .12]    .03    2.09    .04*
Time on
gloss
                                               95


 Figure 20
 Condition * Time on Gloss Interaction (Position 2, Delayed Self-paced Reading Test,
 L2 Gloss Group
        Summary of self-paced reading results. Regardless of position, test timing, and group,
RTs in the pseudoword condition were in general significantly slower than in the real word
condition, indicating processing difficulties during reading of the pseudowords. RTs in the
pseudoword condition were similar to those in the nonword condition, except at position 2 in the
immediate posttest for the L1 gloss group, where RTs in the pseudoword condition were
significantly slower than in the nonword condition. This RT pattern suggested that pseudowords
were read in a similar manner to nonwords, when other variables were at their mean standardized
values.
        The number of times a target word appeared in the reading (i.e., FoO) seemed to have
facilitated the L1 gloss group’s subsequent processing of the pseudowords in the self-paced
                                                96


reading tests: in the immediate posttest at position 2 (see Table 17 and Figure 15 for the
Condition * FoO interaction); and in the delayed posttest at the critical position (see Table 20
and Figure 18). No other variables (i.e., vocabulary size and time on gloss) showed a facilitative
effect on pseudoword processing for the L1 gloss group. The L2 gloss group’s processing of
pseudowords was not affected by any variables included in the analyses.
        Thus, to answer RQ2s a and b, when target words appeared a certain number of times,
and when participants spent sufficient time on glosses, the L1 gloss group was able to process
the pseudowords at a speed similar to real words, indicating fluent retrieval of the pseudowords.
For the L2 gloss group, pseudowords were always processed similarly to nonwords and
significantly slower than to real words, regardless of target word FoO, time on reading the
glosses, and vocabulary size. The difference between the L1 and L2 gloss groups suggested that
there might be a gloss language effect, in that with certain conditions, pseudowords learned
through L1 but not L2 glosses may be retrieved as fluently as real words.
Summary of Key Significant Effects
        Table 24 presents the key significant effects in gloss engagement, meaning recall,
meaning matching, and self-paced reading tests.
                                                97


Table 24
Key Significant Effects
Dependent variables         Significant effects             Interpretation
Gloss processing (binary) Group (L1 gloss vs. L2 L1gloss group was more likely to
                                  gloss)            skip a gloss than the L2 gloss
                                                                group.
                             Vocabulary size      For both groups, learners with a
                                                  larger vocabulary size were less
                                                        likely to skip a gloss.
Immediate meaning recall  Group * Time on gloss      Time on gloss had a larger,
accuracy                                         positive effect on L1 gloss group
                                                   than on the L2 gloss group (see
                                                              Figure 12).
Delayed meaning recall    Group * FoO * Time on    L2 gloss group benefitted more
accuracy                           gloss          from greater time on gloss than
                                                    the L1 gloss group when FoO
                                                      was high (see Figure 13).
Immediate meaning                  FoO            Both groups were more likely to
matching accuracy                                respond to the meaning matching
                                                 items correctly as FoO increased.
                              Time on gloss       Both groups were more likely to
                                                 respond to the meaning matching
                                                  items correctly as time spent on
                                                          glosses increased.
                                      98


Table 24 (cont’d)
                                  Vocabulary size         In both groups, learners with
                                                       larger vocabulary size were more
                                                       likely to respond to the test items
                                                                    correctly.
Delayed meaning matching      Group * Time on gloss        Time on gloss had a larger,
accuracy                                               positive effect on L1 gloss group
                                                        than on the L2 gloss group (see
                                                                   Figure 14).
Immediate self-paced          Condition (pseudoword,         At position 2, RTs in the
reading, L1 gloss group, RTs   nonword, real word) *       pseudoword condition were
                                        FoO             shorter and closer to RTs in the
                                                           real word condition, as FoO
                                                            increased (see Figure 15).
Delayed self-paced reading,  Condition * Time on gloss  At the critical position, RTs for
L1 gloss group, RTs                   * FoO             the pseudoword decreased with
                                                         increasing time on gloss, only
                                                        when FoO was high (see Figure
                                                                       18).
                                           99


                      CHAPTER 4: DISCUSSION AND CONCLUSION
        The goal of the current study was to look closely at the gloss language effect, and the
factors that moderated the gloss language effect on learners’ engagement with glosses and on
vocabulary learning from reading. Specifically, the study attempted to answer the following
questions: How do learners reading first language (L1) and second language (L2) glosses engage
with the glosses (RQ1a)? How does learners’ vocabulary size affect their differential engagement
with L1 and L2 glosses (RQ1b)? How does gloss language affect learning, both immediately and
with a two-week interval after reading, in terms of receptive meaning matching, productive
meaning recall, and fluency of online lexical retrieval (RQ2a)? To what extent is the gloss
language effect on learning moderated by target words’ frequency of occurrence (FoO), learners’
vocabulary size, and learners’ engagement with the glosses (RQ2b)?
        In what follows, I first discuss gloss language effect on learners’ engagement with the
glosses (RQs1a & b). I then talk about findings regarding gloss language effect on learning
(RQs2a & b). In particular, I focus on gloss language effect on the learning of different aspects
of word knowledge, and whether and how the variables of FoO, time on gloss, and vocabulary
size moderated the gloss language effect. Finally, based on the findings, I discuss the
implications for L2 vocabulary pedagogy, bilingual lexicon theories, and gloss language research
methodology. I also reflect on the limitations of the current study and suggest directions for
future research.
Gloss Language Effect on Gloss Engagement
        Gloss engagement was affected by gloss language in that the L1 gloss group was
significantly more likely to ignore glosses than the L2 gloss group. To take a closer look at why
participants ignored a gloss, I did a follow-up analysis on participants’ responses in the exit
                                                 100


questionnaire. Specifically, I looked at Question 5, where participants indicated, on a 100-point
scale, the percentage of time they skipped a gloss due to each of the four reasons given in the
questionnaire. Table 25 summarizes the results for this question. For both groups, most of the
time, they skipped a gloss because they “have guessed the word meaning”, followed by the
reasons “I didn’t need the gloss to understand the reading”, and “I didn’t think knowing word
meanings was important”. Participant seldom ignored a gloss because “the glosses were not
helpful”. Mann Whitney U tests revealed that there was significant group difference in the
responses to the first reason (“I have guessed the word meaning”) (W = 1866.50, p = .02, r
= .23). All other reasons did not see significant differences between the two gloss language
groups.
  Table 25
  Summary of reasons for skipping glosses
  Reasons for gloss skipping                                Mean percentage of time (SD)
                                                        L1 gloss group           L2 gloss group
  “I have guessed the word meaning.”                     33.62 (35.66)           18.04 (27.24)
  “I didn’t need the gloss to understand the             26.52 (34.84)           15.58 (26.83)
  reading.”
  “I didn’t think knowing word meanings                  13.98 (26.83)            7.92 (18.37)
  was important.”
  “The glosses were not helpful.”                        3.63 (13.06)             4.42 (13.11)
        It seems that the L1 gloss group was more successful or perceived themselves to be more
capable of guessing the meanings of unfamiliar words. This may partially explain why the L1
                                                101


gloss group tended to skip glosses more so than the L2 gloss group. One hypothesis for why the
L1 gloss group was more confident in their word guessing ability is that glosses written in the L1
might have given those participants a sense that the target words were easy to understand
because it was easier to map the L1 glosses to familiar L1 words. Glosses written in the L2, on
the other hand, might be harder to map onto familiar words or concepts in the L1. Participants in
the L2 gloss group may thus have thought that the glosses were describing new concepts and that
it was difficult for them to get the meanings of the target words by just guessing.
        Another reason why the L1 gloss group skipped more glosses might be due to the greater
interruption L1 glosses imposed. The findings in Varol and Erçetin (2021) were somewhat
similar to those in the current study regarding gloss engagement: significant difference was
found in gloss access frequency but not gloss time based on the position of a gloss. The authors
in the study attributed the difference in gloss access frequency to one type of gloss being more
interrupting in the reading than the other. In the current study, L1 glosses may interrupt the
reading flow more because participants had to switch back and forth between their L1 and L2. If
L1 glosses caused more interference in reading, the L1 gloss group might have been less likely to
check a gloss for target words that appeared later in the reading. Figure 21 plots the percentage
of participants who checked a gloss by the order of appearance of target words. The vertical
dotted lines represent the beginning of a new page. Right before page 6 (the 20th target word) and
page 15 (the 23rd target word), learners had a break in reading, during which they answered two
comprehension questions. One striking feature in the figure is that the percentage of gloss-
checking learners in the L2 gloss group remained relatively stable throughout reading, while the
gloss checking pattern of the L1 gloss group had more ups and downs. Overall, it seems that the
L1 gloss group showed a tendency of less gloss checking as time went by: the three spikes at the
                                                 102


19th, 22nd, and 24th target words were all lower than the spikes at the 1st and the 9th target words.
In addition, the intervals between the spikes became slightly larger, i.e., longer periods of low
percentages in gloss checking, before the L1 gloss group encountered the first set of
comprehension questions (the 20th target word). The first interval was eight words between the
first and the ninth target words, and the second interval was 10 words between the ninth and the
19th target words. After the first set of comprehension questions, the spikes were more frequent,
i.e., shorter periods of low percentages in gloss checking. It was like the L1 gloss group had a
‘reset’, starting to actively access glosses again after the comprehension questions. However, the
exact reasons why the L1 gloss group behaved this way is not known. Interview and think-aloud
data are needed to understand more accurately the gloss access pattern of the L1 gloss group.
   Figure 21
   Gloss Processing by Target Word Occurrence Order
                                                  103


        Among those who spent enough time processing the glosses, there was no significant
group difference in gloss reading time. This reflects that the L1 and L2 glosses may have
attracted similar amount of attention or have posed a similar amount of reading demand. In other
words, even when glosses were written in the L2, participants may not have found them harder to
understand than those written in the L1. The lack of group difference in gloss time was similar to
the finding in Warren et al. (2018), where no significant difference in total reading time from
eye-tracking data was found among three types of glosses, i.e., text, picture, and multimodal.
Findings in the current study, Varol and Erçetin (2021), and Warren et al. (2018) indicate that
attention or engagement differences for different types of glosses may exist but may be very
small and insignificant. In reality, regardless of the format or modality of the glosses, L2 learners
are likely to pay attention to glosses when available, without being drawn to a particular type of
gloss. It is interesting to note that the lack of difference in gloss reading time did not lead to the
lack of effect of this variable on learning in the current study. The L1 gloss group seemed to
have gotten more out of every second spent with the glosses, resulting in better learning
performance in some posttests. I discuss this in more detail in the next section.
Gloss Language Effect on Learning
        In this section, I first compare gloss language effect on different aspects of learning,
namely the development of receptive and productive word meaning knowledge, and fluency and
of lexical retrieval. I then compare the effects of FoO, time on gloss, and vocabulary size on the
gloss language effect and discuss the different paths of memory consolidation through the L1 and
the L2 glosses.
                                                   104


Aspects of Learning
        The literature review reveals contradictory findings among previous gloss language
studies, with some reporting advantages of L1 or L2 glosses and some finding no language
effect. Many of the previous studies, however, did not control or at least report the comparability
of the L1 and L2 glosses in terms of length, and whether the words used in the L2 glosses were
likely to be known by learners. In the current study, after carefully matching the length of the L1
and L2 glosses, and using L2 words in the glosses likely to be familiar to participants as
indicated by their vocabulary size, it was found that short-term development of receptive
meaning knowledge measured by an immediate meaning matching test was the only aspect of
learning not affected by gloss language. For this aspect of word knowledge, the two gloss
groups’ performance was similar in the posttest, which was positively associated with longer
gloss reading time, higher FoO, and larger vocabulary size. However, neither gloss group
benefitted more from any of the three elements.
        Other aspects of learning showed a gloss language effect under certain circumstances,
revealing a sophisticated picture of how the number of times participants encountered a target
word and the amount of time participants spent on reading glosses jointly shaped the gloss
language effect on learning. For the retention of receptive meaning knowledge and short-term
development of productive meaning knowledge, the L1 gloss group benefitted more from longer
gloss reading time while the L2 gloss group was barely affected by the variable (see Figure 12 &
Figure 14). The interactions of gloss time and gloss language found in the delayed meaning
matching and immediate meaning recall posttests indicate that when being tested on delayed
receptive knowledge retention, which was more demanding than short-term development, and on
productive knowledge, which was arguably harder to develop (e.g., Nation, 2019; Pellicer-
                                                105


Sánchez, 2016), time spent on the L1 was ‘worth more’ than with the L2. It is possible that L1-
L2 pairs (i.e., L1 glosses and L2 word forms) were easier to register in memory than L2-L2 pairs
(i.e., L2 glosses and L2 word forms), and these traces of memory may be effectively enhanced
with time on gloss. Such advantage of the L1 did not show up in the learning of word knowledge
that was easier to obtain, such as short-term receptive meaning knowledge.
         Interestingly, when test demand was even higher, i.e., when the retention of productive
knowledge was assessed, the L2 glosses showed an advantage, under the circumstance of long
gloss reading time and high FoO combined. For this aspect of learning, the L1 gloss group only
benefitted more from long gloss reading time when FoO was low, and this benefit gradually
became smaller as FoO increased. The L2 gloss group eventually overtook the L1 gloss group
with long gloss reading time and high FoO. This three-way interaction of gloss language, FoO,
and gloss time suggests that learning words through the L2 may have an advantage for the
retention of productive word knowledge only when (1) the initial processing of word meanings
was deep (as shown by long gloss time), and (2) the word meanings were subsequently
reinforced enough times through encountering the words again. The interaction also indicated
that it may take more time for the advantage of L2 glosses to show up, i.e., in the delayed but not
immediate posttest. In addition, it seems that when it comes to long-term retention of productive
meaning knowledge, the memory of L1-L2 pairs was not strengthened by subsequent encounters
with the words and was even slightly reduced by those encounters.
         Fluency of lexical retrieval is the hardest aspect of word knowledge to obtain and often
takes longer to develop than declarative word knowledge measured in offline tests. Several
previous studies have shown that even when learners have mastered receptive and/or productive
declarative word knowledge, they may not be able to establish retrieval fluency for the newly
                                                 106


learned words (e.g., Y. Chen, 2021; Elgort & Warren, 2014). Previous gloss language studies
rarely examined the retrieval fluency of newly learned target words. In the current study, albeit
not straightforwardly, a clear gloss language effect can be observed in fluency of retrieval
indicated by the RT patterns in the self-paced reading immediate and delayed posttests. Under no
circumstances in the current study did the L2 gloss group react to the target words in a similar
manner to the real words. The L1 gloss group showed some degree of retrieval fluency in the
immediate posttest for target words with high FoOs and in the delayed posttest for words with a
high FoO and long gloss time combined. Again, findings in the self-paced reading tests show
that when test demand was higher, such as in the delayed posttests, a combination of initial
attention to gloss and subsequent reinforcement through repeated encounters is needed for
learning to take place.
        Findings of lexical retrieval fluency were relevant to the development of the bilingual
lexicon. The Literature Review introduces three bilingual lexicon models, all of which predict
that L2 vocabulary learning through the L1 may reduce the chance of establishing a direct
conceptual link for an L2 word and thus result in worse quality of word knowledge. Although the
current study did not directly test whether the target words were directly linked to their concepts,
fluency of retrieval was examined, which could, to some extent, indirectly reflect the strength of
the links between the target words and their concepts, and thus the quality of word knowledge.
Results show that after reading a fictional story in two days, only participants who learned the
target words through L1 input were able to develop some degree of retrieval fluency, which does
not seem to align with the prediction of the bilingual lexicon models and is in contrast with
findings in intentional vocabulary learning studies in this paradigm. This may have to do with the
differences in the nature of learning in intentional and incidental contexts. Learners encounter L2
                                                 107


words in relatively more varied and richer contexts in incidental learning. The rich context where
the L2 words are embedded may make the differences in learning through the L1 and L2 smaller
in this aspect of word knowledge. In addition, appearances of a word may be more spread out in
incidental learning, which may reduce the benefits of learning through the L2 in creating quality
lexical representations. Unlike in intentional learning, where lexical representations established
through L2 input can be reinforced multiple times within a very short interval, in incidental
learning, reinforcement comes sporadically and may not be in time before the memory created
through L2 input fades.
        However, caution should be taken here not to overly interpret the findings in relation to
bilingual lexicon development. First, fluency of retrieval is not equivalent to direct conceptual
links. Even when target words learned through the L1 were retrieved somewhat fluently, it did
not mean that the target words were directly linked to the concepts. It could be the case that only
the L1-L2 form-form connections were strengthened during learning. This is akin to stage 2 in
the model proposed by Jiang (2000), where fast L1-L2 retrieval is achieved. Second, learning
time and number of exposures in the current study may not have allowed the L2 gloss group to
develop lexical retrieval fluency. Given the time and opportunity of repeated exposure, the L2
gloss group may have performed differently.
        To summarize, findings in the current study reveal intricate gloss language effects on
learning, varied across the aspects of knowledge measured. First, L1 and L2 glosses benefitted
the learning of different aspects of word knowledge under different circumstances: L1 gloss
could contribute more to the retention of receptive meaning knowledge, short-term development
of productive meaning knowledge, and fluency of lexical retrieval; L2 gloss was found to
facilitate the long-term retainment of productive meaning knowledge. These findings also
                                                108


indicate that the benefits of L2 gloss were more likely to be found when the type of knowledge
tested was harder to develop. Second, it seems that it would take longer for the advantage of L2
gloss to show up, i.e., in the delayed posttests. In the same vein, one explanation for why the L2
gloss group failed to retrieve target words as fluently as real words in the self-paced reading
posttests may be that the participants needed longer time to integrate glosses written in the L2.
Effects of Moderating Variables
        Most previous gloss language studies simply compared the L1 and L2 gloss groups,
without considering potential moderating factors. The current study included three variables,
namely target word FoO, time spent on reading glosses, and participants’ vocabulary size. These
variables had differential effects on how L1 and L2 glosses affected learning, moderating the
gloss language effect.
        Vocabulary size was the only variable that had a negligible moderating effect on the
comparison of L1 and L2 glosses. Nor did vocabulary size influence learning gains much. The
main effect of vocabulary size was found only in the meaning matching posttests, both
immediate and delayed. The lack of effect of vocabulary size on the learning of most aspects of
word knowledge aligns with the result in Yanagisawa et al.’s (2020) meta-analysis and was
probably because glosses reduced the need to guess word meanings from context, for which
vocabulary size may be an important factor. The lack of moderating effect of vocabulary size on
gloss language may be attributed to the familiarity of words used in the L2 glosses. All words
used in the L2 glosses were within the 3k frequency range, which the participants were likely to
know based on their vocabulary size test scores. It seems that once a lexical threshold was
reached, or that comprehension of glosses was not an issue, vocabulary size may not have a large
effect on gloss language. An alternative perspective to look at the null moderating effect of
                                                 109


vocabulary size is that knowing more words in general would not facilitate participants in
reading either gloss types or in integrating word meanings the glosses provided. In contrast,
spending more time on glosses and/or encountering the words repeatedly were more likely to
give participants a better chance to learn the target words, which I discuss in the next paragraphs.
        According to the Levels of Processing framework (Craik & Lockhart, 1972), time on
gloss, or the amount of engagement with the glosses, reflects the depth of processing, which
predicts the strength of memory. There were three key findings regarding the moderating effects
of time on gloss. First, as mentioned in the last section, the L1 gloss group benefitted more from
longer gloss reading time than the L2 gloss group (i.e., a group and time on gloss interaction). In
other words, the same level of processing depth through the L1 and the L2 glosses did not result
in the same level of memory strength. Such finding suggests that information processed in the L1
may leave a stronger memory trace. The second key finding further corroborates this argument:
the amount of time on gloss hardly moderated the effects of L2 glosses. It seems that deeper
levels of processing in the L2 did not guarantee improved memory strength of target words.
Relatedly, the third key finding reveals that depth of processing as reflected in the amount of
gloss time facilitated learning through L2 glosses only when FoO was high. In addition, the
positive effect of gloss time on the development of lexical retrieval fluency was also only present
for target words with high FoOs. In these situations, depth of processing was not enough but had
to be paired with subsequent reinforcement through repeated exposure in order to improve
learning. The combined effects of gloss time and FoO are discussed in more detail later.
        The effect of FoO, or repeated exposure to target words, also had to do with memory
strength. As mentioned in the Literature Review, the instance-based models for word learning
(Bolger et al., 2008; Reichle & Perfetti, 2003) hypothesized that the incomplete word knowledge
                                                  110


resulting from the initial encounter with a word is reactivated in subsequent encounters, until full
word knowledge is extracted. FoO can thus be seen to represent memory reactivation and
reinforcement. Unlike depth of processing that is represented by the amount of time on glosses,
FoO, or memory reinforcement, had positive effects on both the L1 and L2 gloss groups, i.e.,
lack of group and FoO interactions, in the learning of productive and receptive meaning
knowledge. It means that, in these aspects of learning, regardless of the initial depth of
processing of the glosses, memory traces of target words learned through the L1 and L2 glosses
would be reinforced to a similar degree. In the short-term development of lexical retrieval
fluency, however, FoO only had an effect on the L1 gloss group. Such finding indicates that
within the current range of FoO, that is, within the current level of memory reinforcement
provided, only memory traces created through the L1 may be reinforced to the extent where
lexical retrieval fluency can be developed. Memory traces created through the L2 may need more
instances of reactivation and reinforcement to reach the same level of strength as those created
through the L1. A possible reason is that memory traces in the L1 might be easier to retrieve and
thus reactivate. FoO was also the only variable that had an impact on short-term lexical retrieval
fluency development. Reading time on glosses on the other hand, did not affect this aspect of
learning. This shows that for the retrieval fluency aspect of word knowledge, repeated
reactivation of a word may be more important than its initial depth of processing.
        The three-way interaction of group, FoO, and time on gloss was found in the meaning
recall delayed posttest for the L2 gloss group and the self-paced reading delayed posttest for the
L1 gloss group. One thing to note in this discussion is that glosses were only provided in the first
appearance of a target word. The indication of this design is that the depth of processing
achieved through spending time on reading the glosses only refers to the depth of initial
                                                111


processing. In other words, gloss time was most pertinent to the initial registration, but not or at
least not directly relevant, to depth of subsequent processing. These three-way interactions show
that initial depth of processing or subsequent memory reactivation alone was insufficient for
some aspects of learning, especially those hard to develop, i.e., in this case long-term retention of
productive knowledge and retrieval fluency. For the L2 gloss group, the depth of processing of
L2 glosses facilitated learning only when the words were subsequently reactivated multiple
times. The need for combined effects of FoO and gloss time may be due to greater difficulty in
memorizing the L2 glosses. Therefore, the initial registration of the L2 glosses needed to be
reactivated in context again and again for learning to take place. In addition, the L2 gloss group’s
learning of this aspect of word knowledge became larger than the L1 gloss group’s. One
hypothesis is that because L2 glosses were harder to memorize, the L2 gloss group deliberately
tried to retrieve the L2 gloss for a target word in each subsequent encounter, resulting in stronger
memory and thus better performance in tests for declarative word knowledge. In comparison, L1
glosses were easier to memorize, giving the L1 gloss group a sense of confidence (see also the
section on Gloss Engagement) that they remembered the glosses well. As a result, the L1 gloss
group may not have consciously (but may be subconsciously so) tried to retrieve the glosses
when seeing the target words.
         The challenge of maintaining lexical retrieval fluency also required both deep initial
processing and subsequent memory reinforcement: only when there were enough opportunities
to reinforce the initial memory traces, would deeper level of initial processing of L1 glosses be
facilitative for fluency retention. Again, the L2 gloss group did not achieve the long-term lexical
retrieval fluency as the L1 gloss group, possibly because the L2 gloss group may take longer
time or need more target word exposures to do so. It is interesting to note that in the meaning
                                                  112


recall delayed posttest, the three-way interaction also reveals that with repeated encounters with
the target words, the effect of initial processing depth on the L1 gloss group was reduced. This
contrasts with results in the self-paced delayed posttest, where FoO and time on gloss together
facilitated word retrieval of the L1 group. One tentative explanation could be that the ease of
processing of L1 glosses, the great depth of processing (i.e., long gloss reading time), and
repeated activation (i.e., high FoOs), created rich and detailed semantic representations, which
became a burden for conscious memory retrieval while facilitating online processing of the target
words, resulting in the differential combined effect of FoO and gloss time on productive
knowledge and retrieval fluency.
Implications
        The current study has pedagogical, theoretical, and methodological implications.
Pedagogically, findings in the current study show that L1 and L2 glosses benefited different
aspects of vocabulary learning. Language instructors, material writers, and ed tech designers
should choose gloss language adaptively, based on the aspects of vocabulary knowledge learners
most need for certain words, and the amount of exposure learners are likely to have to the words.
For example, if learners only need short-term receptive meaning knowledge for some words, L1
and L2 glosses are likely to have similar effects, giving language practitioners more freedom to
choose gloss language, depending on the target reader population’s L1 backgrounds (i.e.,
miscellaneous or homogenous). Or, if the goal is to develop lexical retrieval fluency and the
learning phase is short, L1 glosses are preferred. Second, whichever aspect of word knowledge is
targeted, the key to more successful learning is to encourage learners to engage with the glosses
and to increase the opportunities learners encounter the words. Longer engagement with glosses
may be particularly useful when the glosses are written in the L1. When using L2 glosses is the
                                                 113


preferred choice, e.g., for learners from different L1 backgrounds, repeated exposure is the key
to allow learners reap the benefits of glosses, based on the result that only when FoO was high,
did the L2 gloss group benefit from greater gloss engagement.
         Results in the current study support the bilingual teaching approach, i.e., the use of L1
and L2 together, instead of using the L2 exclusively. Within the time limit of the learning phase,
i.e., two days, participants who read the L1 glosses did better in more aspects of learning than
those who read the L2 glosses, given the right conditions. It could be the case, though, that the
advantages of L2 glosses take longer to show up. In any case, the key message here is that the
use of L1 should not be seen as a barrier and can be used to support L2 learning. Word learning
through the L1 may be more efficient. This was the case even when learners have reached a
certain proficiency threshold to allow them to comprehend the L2 glosses without issues.
         Theoretically, the findings of the current study shed some light on the development of
bilingual lexicon. Within a short learning phase (i.e., two days), L1 glosses, but not L2 glosses,
were more efficient in facilitating the development of lexical retrieval fluency. In other words,
the L2 glosses did not lead to the establishment of direct conceptual links for the newly learned
words. However, it is hard to pinpoint whether retrieval fluency shown by the L1 gloss group
came from the establishment of direct conceptual links of the newly learned words, or merely
from the improved automaticity of L1-mediated access to concepts.
         In terms of methodological implication, the study demonstrates the use of hyperlinks to
track learners’ gloss engagement. Although hyperlinks as a tracking tool have been used in other
research areas (see H. Lee & Lee, 2013), they have seldom been used in gloss language research.
With online data collection getting more common, hyperlinks offer a convenient alternative to
eye tracking in monitoring engagement and attention during learning.
                                                 114


Limitations and Future Directions
        While the study is one of the first to examine potential moderating factors on the gloss
language effect on both learning and gloss engagement, it is not without limitations. First, gloss
time tracked through hyperlinks may not have accurately reflected every participant’s
engagement with glosses. While participants were instructed to click close a gloss when they
finished reading it, some participants may have delayed doing so, resulting in the gloss reading
time recorded being longer than the actual reading time. One solution may be to implement
timed glosses, which appear for a specific amount of time, depending on gloss length. After a
gloss disappears, participants have to re-click the gloss to read it again. Although this way may
reduce the likelihood of participants forgetting to close the gloss, attention tracking using
hyperlinks cannot be compared with eye tracking. But hyperlinks have the advantage of allowing
data collection online. This is a tradeoff researchers need to consider.
        Related to online data collection, one concern regards how to make sure that learners are
paying attention during the experiment. The reading comprehension questions during learning
and the comprehension questions in the self-paced reading tests have served as attention checks,
and participants who did not reach a certain comprehension level were excluded. Still, future
studies could insert more attention checks, such as by asking participants to click on a particular
photo out of many displayed on screen. These non-comprehension-based attention checks can
add a further layer of security against distraction.
        Finally, the distribution of FoO was imbalanced, with more target words in the FoO range
between 1 and 10, and fewer above 10. The smaller number of words with an FoO above 10
means that data points were sparse for this FoO range, leading to less accurate estimates of the
data. The uneven distribution of FoO was due to restrictions in the original graded reader. Future
                                                 115


studies may have to implement more adaptations to increase the number of target words for
certain FoO ranges.
        Besides fixing the above issues, future research would also benefit from examining the
long-term gloss language effect by implementing delayed posttests with a longer interval after
the learning phase. The range of FoO should also be increased. Such designs allow us to gain
insights into how the gloss language effect changes overtime, e.g., L2 gloss may benefit retrieval
fluency more as time goes by and as learners are exposed to words more times. In addition, the
current study shows differences in gloss checking but not gloss reading time. Interviews can be
conducted to further investigate why this may be the case. Finally, replication studies should be
conducted with learners from other L1 backgrounds, proficiency levels, and age. Learners’
familiarity with technology should also be considered, which may have an impact on how
learners engage with glosses on a digital device.
Conclusions
        The study reveals a sophisticated picture of how the L1 and L2 glosses may benefit the
learning of different aspects of word knowledge under various circumstances as influenced by
the number of repeated exposures, gloss engagement, and learners’ vocabulary size. Findings i n
the study bear pedagogical implications for language instructors and material writers as to when
to use which type of gloss, based on the learning context and learners’ needs. More importantly,
the study shows that L1 is an asset L2 learners can take advantage of, instead of something to be
avoided in L2 learning. The study also contributes to theories regarding the bilingual model
lexicons and the use of hyperlinks as a tracking tool alternative to eye tracking.
                                                116


                                          REFERENCES
American Council on the Teaching of Foreign Language (ACTFL). (n.d.). Facilitate Target
        Language Use. https://www.actfl.org/educator-resources/guiding-principles-for-
        language-learning/facilitate-target-language-use
Abraham, L. B. (2008). Computer-mediated glosses in second language reading comprehension
        and vocabulary learning: A meta-analysis. Computer Assisted Language Learning, 21(3),
        199–226. https://doi.org/10.1080/09588220802090246
Altarriba, J., & Mathis, K. M. (1997). Conceptual and lexical development in second language
        acquisition. Journal of Memory and Language, 36(4), 550–568.
        https://doi.org/10.1006/jmla.1997.2493
Antón, M., & DiCamilla, F. (1998). Socio-cognitive functions of L1 collaborative interaction in
        the L2 classroom. The Canadian Modern Language Review, 54(3), 314–342.
        https://doi.org/10.3138/cmlr.54.3.314
Barr, D. J., Levy, R., Scheepers, C., & Tily, H. J. (2013). Random effects structure for
        confirmatory hypothesis testing: Keep it maximal. Journal of Memory and Language,
        68(3), 255–278. https://doi.org/10.1016/j.jml.2012.11.001
Balota, D. A., Yap, M. J., Hutchison, K. A., Cortese, M. J., Kessler, B., Loftis, B., Neely, J. H.,
        Nelson, D. L., Simpson, G. B., & Treiman, R. (2007). The English lexicon project.
        Behavior Research Methods, 39(3), 445–459. https://doi.org/10.3758/BF03193014
Bartolotti, J., & Marian, V. (2017). Bilinguals’ existing languages benefit vocabulary learning in
        a third language. Language Learning, 67(1), 110–140. https://doi.org/10.1111/lang.12200
Boers, F. (2022). Glossing and vocabulary learning. Language Teaching, 55(1), 1–23.
        https://doi.org/10.1017/S0261444821000252
Boers, F., Warren, P., He, L., & Deconinck, J. (2017). Does adding pictures to glosses enhance
        vocabulary uptake from reading? System, 66, 113–129.
        https://doi.org/10.1016/j.system.2017.03.017
Bolger, D. J., Balass, M., Landen, E., & Perfetti, C. A. (2008). Context variation and definitions
        in learning the meanings of words: An instance-based learning approach. Discourse
        Processes, 45(2), 122–159. https://doi.org/10.1080/01638530701792826
Brevik, L. M., & Rindal, U. (2020). Language use in the classroom: Balancing target language
        exposure with the need for other languages. TESOL Quarterly, 54(4), 925–953.
        https://doi.org/10.1002/tesq.564
                                                117


Brooks-Lewis, K. A. (2009). Adult learners’ perceptions of the incorporation of their L1 in
       foreign language teaching and learning. Applied Linguistics, 30(2), 216–235.
       https://doi.org/10.1093/applin/amn051
Brown, A. (2021). Monolingual versus multilingual foreign language teaching: French and
       Arabic at beginning levels. Language Teaching Research, 1362168821990347.
       https://doi.org/10.1177/1362168821990347
Brown, A., & Lally, R. (2019). Immersive versus nonimmersive approaches to TESOL: A
       classroom-based intervention study. TESOL Quarterly, 53(3), 603–629.
       https://doi.org/10.1002/tesq.499
Bruen, J., & Kelly, N. (2017). Using a shared L1 to reduce cognitive overload and anxiety levels
       in the L2 classroom. The Language Learning Journal, 45(3), 368–381.
       https://doi.org/10.1080/09571736.2014.908405
Bruton, A., López, M. G., & Mesa, R. E. (2011). Incidental L2 vocabulary learning: An
       impracticable term? TESOL Quarterly, 45(4), 759–768.
       https://doi.org/10.5054/tq.2011.268061
Brysbaert, M. (2019). How many words do we read per minute? A review and meta-analysis of
       reading rate. Journal of Memory and Language, 109, 104047.
       https://doi.org/10.1016/j.jml.2019.104047
Brysbaert, M., & Duyck, W. (2010). Is it time to leave behind the Revised Hierarchical Model of
       bilingual language processing after fifteen years of service? Bilingualism: Language and
       Cognition, 13(3), 359–371. https://doi.org/10.1017/S1366728909990344
Canagarajah, S. (2011). Translanguaging in the classroom: Emerging issues for research and
       pedagogy. Applied Linguistics Review, 2, 1–28. https://doi.org/10.1515/9783110239331.1
Canagarajah, S. (2013). Translingual Practice: Global Englishes and Cosmopolitan Relations.
       Routledge.
Carless, D. (2007). Student use of the mother tongue in the task-based classroom. ELT Journal,
       62(4), 331–338. https://doi.org/10.1093/elt/ccm090
Celce-Murcia, M. (2014). An overview of language teaching methods and approaches. Teaching
       English as a second or foreign language, 4, 2–14.
Chambers, F. (1991). Promoting use of the target language in the classroom. Language Learning
       Journal, 4(1), 27–31.
Chaudron, C. (1988). Second language classrooms: Research on teaching and learning.
       Cambridge University Press.
                                               118


Chen, B., Ma, T., Liang, L., & Liu, H. (2017). Rapid L2 word learning through high constraint
       sentence context: An event-related potential study. Frontiers in Psychology, 8, 2285.
       https://doi.org/10.3389/fpsyg.2017.02285
Chen, C., & Truscott, J. (2010). The effects of repetition and L1 lexicalization on incidental
       vocabulary acquisition. Applied Linguistics, 31(5), 693–713.
       https://doi.org/10.1093/applin/amq031
Chen, Y. (2021). Comparing incidental vocabulary learning from reading-only and reading-
       while-listening. System, 97, 102442.
Choi, S. (2016). Effects of L1 and L2 glosses on incidental vocabulary acquisition and lexical
       representations. Learning and Individual Differences, 45, 137–143.
       https://doi.org/10.1016/j.lindif.2015.11.018
Choi, S. (2017). Processing and learning of enhanced English collocations: An eye movement
       study. Language Teaching Research, 21(3), 403–426.
       https://doi.org/10.1177/1362168816653271
Chun, D. M., & Payne, J. S. (2004). What makes students click: Working memory and look-up
       behavior. System, 32(4), 481–503. https://doi.org/10.1016/j.system.2004.09.008
Cobb, T. (n.d.). Vocabprofile [Computer program]. Retrieved April 30, 2021, from
       https://www.lextutor.ca/vp/.
Comesaña, M., Perea, M., Piñeiro, A., & Fraga, I. (2009). Vocabulary teaching strategies and
       conceptual representations of words in L2 in children: Evidence with novice learners.
       Journal of Experimental Child Psychology, 104(1), 22–33.
       https://doi.org/10.1016/j.jecp.2008.10.004
Cook, G. (2010). Translation in language teaching: An argument for reassessment. Oxford
       University Press.
Cook, V. (2001). Using the first language in the classroom. The Canadian Modern Language
       Review, 57(3), 402–423. https://doi.org/10.3138/cmlr.57.3.402
Craik, F. I. M., & Lockhart, R. S. (1972). Levels of processing: A framework for memory
       research. Journal of Verbal Learning and Verbal Behavior, 11(6), 671–684.
       https://doi.org/10.1016/S0022-5371(72)80001-X
Creese, A., & Blackledge, A. (2010). Translanguaging in the bilingual classroom: A pedagogy
       for learning and teaching? The Modern Language Journal, 94(1), 103–115.
       https://doi.org/10.1111/j.1540-4781.2009.00986.x
                                               119


Creese, A., & Blackledge, A. (2015). Translanguaging and identity in educational settings.
        Annual Review of Applied Linguistics, 35, 20–35.
        https://doi.org/10.1017/S0267190514000233
Cummins, J. (2007). Rethinking monolingual instructional strategies in multilingual classrooms.
        Canadian Journal of Applied Linguistics, 10(2), 221–240.
De Costa, P. I., Singh, J. G., Milu, E., Wang, X., Fraiberg, S., & Canagarajah, S. (2017).
        Pedagogizing translingual practice: prospects and possibilities. Research in the Teaching
        of English, 51(4), 464–472.
De la Campa, J. C., & Nassaji, H. (2009). The amount, purpose, and reasons for using L1 in L2
        classrooms. Foreign Language Annals, 42(4), 742–759. https://doi.org/10.1111/j.1944-
        9720.2009.01052.x
DiCamilla, F. J., & Antón, M. (2012). Functions of L1 in the collaborative interaction of
        beginning and advanced second language learners. International Journal of Applied
        Linguistics, 22(2), 160–188. https://doi.org/10.1111/j.1473-4192.2011.00302.x
Dijkstra, T., & van Heuven, W. J. B. (2002). The architecture of the bilingual word recognition
        system: From identification to decision. Bilingualism: Language and Cognition, 5(3),
        175–197. https://doi.org/10.1017/S1366728902003012
Dijkstra, T., van Heuven, W. J. B., & Grainger, J. (1998). Simulating cross-language competition
        with the bilingual interactive activation model. Psychologica Belgica, 38(3–4), 177–196.
Duff, P. A., & Polio, C. G. (1990). How much foreign language is there in the foreign language
        classroom? The Modern Language Journal, 74(2), 154–166.
        https://doi.org/10.2307/328119
Elgort, I. (2011). Deliberate learning and vocabulary acquisition in a second language. Language
        Learning, 61(2), 367–413. https://doi.org/10.1111/j.1467-9922.2010.00613.x
Elgort, I., Beliaeva, N., & Boers, F. (2020). Contextual word learning in the first and second
        language: Definition placement and inference error effects on declarative and
        nondeclarative knowledge. Studies in Second Language Acquisition, 42(1), 7–32.
        https://doi.org/10.1017/S0272263119000561
Elgort, I., Candry, S., Boutorwick, T. J., Eyckmans, J., & Brysbaert, M. (2018). Contextual word
        learning with form-focused and meaning-focused elaboration. Applied Linguistics, 39(5),
        646–667. https://doi-org.proxy2.cl.msu.edu/10.1093/applin/amw029
Elgort, I., Perfetti, C. A., Rickles, B., & Stafura, J. Z. (2015). Contextual learning of L2 word
        meanings: Second language proficiency modulates behavioural and event-related brain
        potential (ERP) indicators of learning. Language, Cognition and Neuroscience, 30(5),
        506–528. https://doi.org/10.1080/23273798.2014.942673
                                                  120


Elgort, I., & Piasecki, A. E. (2014). The effect of a bilingual learning mode on the establishment
        of lexical semantic representations in the L2. Bilingualism: Language and Cognition,
        17(3), 572–588. https://doi.org/10.1017/S1366728913000588
Elgort, I., & Warren, P. (2014). L2 vocabulary learning from reading: Explicit and tacit lexical
        knowledge and the role of learner and item variables. Language Learning, 64(2), 365–
        414. https://doi.org/10.1111/lang.12052
Ellis, R., Loewen, S., & Erlam, R. (2006). Implicit and explicit corrective feedback and the
        acquisition of L2 grammar. Studies in Second Language Acquisition, 28(2), 339–368.
        https://doi.org/10.1017/S0272263106060141
Ender, A. (2016). Implicit and explicit cognitive processes in incidental vocabulary acquisition.
        Applied Linguistics, 37(4), 536-560.
Finkbeiner, M., & Nicol, J. (2003). Semantic category effects in second language word learning.
        Applied Psycholinguistics, 24(3), 369–383. https://doi.org/10.1017/S0142716403000195
Fischer, R. (2007). How do we know what students are actually doing? Monitoring students’
        behavior in CALL. Computer Assisted Language Learning, 20(5), 409–442.
        https://doi.org/10.1080/09588220701746013
Fu, M., & Li, S. (2022). The effects of immediate and delayed corrective feedback on L2
        development. Studies in Second Language Acquisition, 44(1), 2–34.
        https://doi.org/10.1017/S0272263120000388
Gánem-Gutiérrez, G. A., & Roehr, K. (2011). Use of L1, metalanguage, and discourse markers:
        L2 learners’ regulation during individual task performance. International Journal of
        Applied Linguistics, 21(3), 297–318. https://doi.org/10.1111/j.1473-4192.2010.00274.x
Grange, J. (2015). trimr: An implementation of common response time trimming methods. R
        package version 1.0.1. https://CRAN.R-project.org/package=trimr
Godfroid, A. (2019). Sensitive measures of vocabulary knowledge and processing: Expanding
        Nation’s framework. In S. Webb (Ed.), The Routledge handbook of vocabulary studies
        (pp. 433–453). Routledge.
Godfroid, A., Ahn, J., Choi, I., Ballard, L., Cui, Y., Johnston, S., Lee, S., Sarkar, A., & Yoon,
        H.-J. (2018). Incidental vocabulary learning in a natural reading context: An eye-tracking
        study. Bilingualism: Language and Cognition, 21(3), 563–584.
        https://doi.org/10.1017/S1366728917000219
Godfroid, A., Boers, F., & Housen, A. (2013). An eye for words: Gauging the role of attention in
        incidental L2 vocabulary acquisition by means of eye-tracking. Studies in Second
        Language Acquisition, 35(3), 483–517. https://doi.org/10.1017/S0272263113000119
                                                 121


Gries, S. T. (2021), (Generalized Linear) Mixed-Effects Modeling: A Learner Corpus Example.
        Language Learning, 71(3), 757–798. https://doi-
        org.proxy2.cl.msu.edu/10.1111/lang.12448
Hair Jr, J. F., Anderson, R. E., Tatham, R. L., & Black, W. C. (1995). Multivariate data analysis
        (3rd ed.). Macmillan.
Hall, G., & Cook, G. (2012). Own-language use in language teaching and learning. Language
        Teaching, 45(3), 271–308. https://doi.org/10.1017/S0261444812000067
Hulstijn, J. H. (2001). Intentional and incidental second language vocabulary learning: A
        reappraisal of elaboration, rehearsal and automaticity. In P. Robinson (Ed.), Cognition
        and second language instruction (1st ed., pp. 258–286). Cambridge University Press.
        https://doi.org/10.1017/CBO9781139524780.011
Hulstijn, J. H. (2003). Incidental and intentional learning. In C. J. Doughty & M. H. Long (Eds.),
        The handbook of second language acquisition (pp. 349–381). Blackwell Publishing Ltd.
        https://doi.org/10.1002/9780470756492.ch12
Hulstijn, J. H., Hollander, M., & Greidanus, T. (1996). Incidental vocabulary learning by
        advanced foreign language students: The influence of marginal glosses, dictionary use,
        and reoccurrence of unknown words. The Modern Language Journal, 80(3), 327–339.
        https://doi.org/10.1111/j.1540-4781.1996.tb01614.x
Hulstijn, J. H., & Laufer, B. (2001). Some empirical evidence for the Involvement Load
        Hypothesis in vocabulary acquisition. Language Learning, 51(3), 539–558.
        https://doi.org/10.1111/0023-8333.00164
Jacobs, G. M., Dufon, P., & Hong, F. C. (1994). L1 and L2 vocabulary glosses in L2 reading
        passages: Their effectiveness for increasing comprehension and vocabulary knowledge.
        Journal of Research in Reading, 17(1), 19–28. https://doi.org/10.1111/j.1467-
        9817.1994.tb00049.x
Jeong, H., Sugiura, M., Sassa, Y., Wakusawa, K., Horie, K., Sato, S., & Kawashima, R. (2010).
        Learning second language vocabulary: Neural dissociation of situation-based learning
        and text-based learning. NeuroImage, 50(2), 802–809.
        https://doi.org/10.1016/j.neuroimage.2009.12.038
Jiang, N. (2000). Lexical representation and development in a second language. Applied
        Linguistics, 21(1), 47–77. https://doi.org/10.1093/applin/21.1.47
Jiang, N. (2013). Conducting reaction time research in second language studies. Routledge.
Jones, L. C. (2013). Supporting listening comprehension and vocabulary acquisition with
        multimedia annotations: The students’ voice. CALICO Journal, 21(1), 41–65.
        https://doi.org/10.1558/cj.v21i1.41-65
                                                122


Just, M. A., Carpenter, P. A., & Woolley, J. D. (1982). Paradigms and processes in reading
        comprehension. Journal of Experimental Psychology: General, 111, 228-238.
Kang, H., Kweon, S.-O., & Choi, S. (2020). Using eye-tracking to examine the role of first and
        second language glosses. Language Teaching Research, 136216882092856.
        https://doi.org/10.1177/1362168820928567
Keuleers, E., & Brysbaert, M. (2010). Wuggy: A multilingual pseudoword generator. Behavior
        Research Methods, 42(3), 627–633. https://doi.org/10.3758/BRM.42.3.627
Khezrlou, S., Ellis, R., & Sadeghi, K. (2017). Effects of computer-assisted glosses on EFL
        learners’ vocabulary acquisition and reading comprehension in three learning conditions.
        System, 65, 104–116. https://doi.org/10.1016/j.system.2017.01.009
Kim, H. S., Lee, J. H., & Lee, H. (2020). The relative effects of L1 and L2 glosses on L2
        learning: A meta-analysis. Language Teaching Research, 136216882098139.
        https://doi.org/10.1177/1362168820981394
Kim, K. M., & Godfroid, A. (2019). Should we listen or read? modality effects in implicit and
        explicit knowledge. The Modern Language Journal, modl.12583.
        https://doi.org/10.1111/modl.12583
Ko, M. H. (2012). Glossing and second language vocabulary learning. TESOL Quarterly, 46(1),
        56–79. https://doi.org/10.1002/tesq.3
Kroll, J. F., Bobb, S. C., & Wodniecka, Z. (2006). Language selectivity is the exception, not the
        rule: Arguments against a fixed locus of language selection in bilingual speech.
        Bilingualism: Language and Cognition, 9(2), 119–135.
        https://doi.org/10.1017/S1366728906002483
Kroll, J. F., & Stewart, E. (1994). Category interference in translation and picture naming:
        evidence for asymmetric connections between bilingual memory representations. Journal
        of Memory and Language, 33(2), 149–174. https://doi.org/10.1006/jmla.1994.1008
Kroll, J. F., van Hell, J. G., Tokowicz, N., & Green, D. W. (2010). The Revised Hierarchical
        Model: A critical review and assessment. Bilingualism: Language and Cognition, 13(3),
        373–381. https://doi.org/10.1017/S136672891000009X
Kubota, R. (2018). Unpacking research and practice in world Englishes and Second Language
        Acquisition. World Englishes, 37(1), 93–105. https://doi.org/10.1111/weng.12305
Laufer, B. (2005). Instructed second language vocabulary learning: The fault in the ‘default
        hypothesis.’ In A. Housen & M. Pierrard (Eds.), Investigations in instructed second
        language acquisition. Mouton de Gruyter. https://doi-
        org.proxy2.cl.msu.edu/10.1515/9783110197372
                                                123


Laufer, B. (2006). Comparing focus on form and focus on formS in second-language vocabulary
        learning. The Canadian Modern Language Review / La Revue Canadienne Des Langues
        Vivantes, 63(1), 149–166. https://doi.org/10.1353/cml.2006.0047
Laufer, B., & Girsai, N. (2008). Form-focused instruction in second language vocabulary
        learning: A case for contrastive analysis and translation. Applied Linguistics, 29(4), 694–
        716. https://doi.org/10.1093/applin/amn018
Laufer, B., & Hill, M. (2000). What lexical information do L2 learners select in a call dictionary
        and how does it affect word retention? Language Learning & Technology, 21.
Laufer, B., & Hulstijn, J. H. (2001). Incidental vocabulary acquisition in a second language: The
        construct of task-induced involvement. Applied Linguistics, 22(1), 1–26.
        https://doi.org/10.1093/applin/22.1.1
Laufer, B., & Shmueli, K. (1997). Memorizing new words: Does teaching have anything to do
        with it?.RELC journal, 28(1), 89-108.
Lee, H., & Lee, J. H. (2013). Implementing glossing in mobile-assisted language learning
        environments: Directions and outlook. Language Learning & Technology, 17, 6–22.
Lee, H., Warschauer, M., & Lee, J. H. (2017). The effects of concordance-based electronic
        glosses on L2 vocabulary learning. Language Learning & Technology, 21(2), 32–51.
Lee, J. H., & Lee, H. (2022). Teachers' verbal lexical explanation for second language
        vocabulary learning: A meta-nalysis. Language Learning, 72(2), 576–612.
Lee, J. H., & Levine, G. S. (2020). The effects of instructor language choice on second language
        vocabulary learning and listening comprehension. Language Teaching Research, 24(2),
        250–272. https://doi.org/10.1177/1362168818770910
Lee, J. H., & Lo, Y. Y. (2017). An exploratory study on the relationships between attitudes
        toward classroom language choice, motivation, and proficiency of EFL learners. System,
        67, 121–131. https://doi.org/10.1016/j.system.2017.04.017
Lee, J. H., & Macaro, E. (2013). Investigating age in the use of L1 or English-only instruction:
        Vocabulary acquisition by Korean EFL learners. The Modern Language Journal, 97(4),
        887–901. https://doi.org/10.1111/j.1540-4781.2013.12044.x
Lee, S., & Pulido, D. (2017). The impact of topic interest, L2 proficiency, and gender on EFL
        incidental vocabulary acquisition through reading. Language Teaching Research, 21(1),
        118–135. https://doi.org/10.1177/1362168816637381
Liu, D., Ahn, G.-S., Baek, K.-S., & Han, N.-O. (2004). South Korean high school english
        teachers’ code switching: Questions and challenges in the drive for maximal use of
        English in teaching. TESOL Quarterly, 38(4), 605. https://doi.org/10.2307/3588282
                                                124


Loewen, S. (2005). Incidental focus on form and second language learning. Studies in Second
       Language Acquisition, 27(03). https://doi.org/10.1017/S0272263105050163
Loewen, S. (2014). Introduction to instructed second language acquisition. Routledge.
Long, M. H. (1991). Focus on form: A design feature in language teaching methodology.
       Foreign Language Research in Cross-Cultural Perspective, 39.
Long, M. H. (1996). The role of the linguistic environment in second language acquisition. In W.
       Ritchie & T. Bhatia (Eds.), Handbook of second language acquisition (pp. 413–468).
       Academic Press.
Lüdecke et al., (2021). performance: An R Package for Assessment, Comparison and Testing of
       Statistical Models. Journal of Open Source Software, 6(60), 3139.
       https://doi.org/10.21105/joss.03139
Ma, L. P. F. (2019). Examining the functions of L1 use through teacher and student interactions
       in an adult migrant English classroom. International Journal of Bilingual Education and
       Bilingualism, 22(4), 386–401. https://doi.org/10.1080/13670050.2016.1257562
Macaro, E. (2001). Analysing student teachers’ codeswitching in foreign language classrooms:
       Theories and decision making. The Modern Language Journal, 85(4), 531–548.
       https://doi.org/10.1111/0026-7902.00124
Macaro, E., & Lee, J. H. (2013). Teacher language background, codeswitching, and english-only
       instruction: Does age make a difference to learners’ attitudes? TESOL Quarterly, 47(4),
       717–742. https://doi.org/10.1002/tesq.74
Macaro, E., Tian, L., & Chu, L. (2020). First and second language use in English medium
       instruction contexts. Language Teaching Research, 24(3), 382–402.
       https://doi.org/10.1177/1362168818783231
Malone, J. (2018). Incidental vocabulary learning in SLA: Effects of frequency, aural
       enhancement, and working memory. Studies in Second Language Acquisition, 40(3),
       651–675. https://doi.org/10.1017/S0272263117000341
Marsden, E., Thompson, S., & Plonsky, L. (2018). A methodological synthesis of self-paced
       reading in second language research. Applied Psycholinguistics, 39(5), 861–904.
       https://doi.org/10.1017/S0142716418000036
Miyasako, N. (2002). Does text-glossing have any effects on incidental vocabulary learning
       through reading for Japanese senior high school students? Language Education &
       Technology, 39, 1–20.
                                                125


Mohamed, A. A. (2018). Exposure frequency in L2 reading: An eye-movement perspective of
        incidental vocabulary learning. Studies in Second Language Acquisition, 40(2), 269–293.
        https://doi.org/10.1017/S0272263117000092
Moore, P. J. (2013). An emergent perspective on the use of the first language in the English-as-a-
        foreign-language classroom. The Modern Language Journal, 97(1), 239–253.
        https://doi.org/10.1111/j.1540-4781.2013.01429.x
Myers, J. L., & O’Brien, E. J. (1998). Accessing the discourse representation during reading.
        Discourse Processes, 26(2–3), 131–157. https://doi.org/10.1080/01638539809545042
Nakatsukasa, K., & Loewen, S. (2015). A teacher’s first language use in form-focused episodes
        in Spanish as a foreign language classroom. Language Teaching Research, 19(2), 133–
        149. https://doi.org/10.1177/1362168814541737
Nation, P. (1983). Testing and teaching vocabulary. Guidelines, 5, 12–25.
Nation, P. (2001). Learning vocabulary in another language. Cambridge University Press.
Nation, P. (2012). The BNC/COCA word family lists.
        https://www.wgtn.ac.nz/data/assets/pdf_file/0005/1857641/about-bnc-coca-vocabulary-
        list.pdf
Nation, P. (2019). The different aspects of vocabulary knowledge. In S. Webb (Ed.), The
        Routledge Handbook of Vocabulary Studies (pp.15–29). Routledge.
        http://doi.org/10.4324/9780429291586-28.
Nilsson, M. (2020). Beliefs and experiences in the English classroom: Perspectives of Swedish
        primary school learners. Studies in Second Language Learning and Teaching, 10(2),
        257–281.
Olsen, M. K., & Schafer, J. L. (2001). A two-part random-effects model for semicontinuous
        longitudinal data. Journal of the American Statistical Association, 96(454), 730–745.
        https://doi.org/10.1198/016214501753168389
Pellicer-Sánchez, A. (2016). Incidental L2 vocabulary acquisition from and while reading: An
        eye-tracking study. Studies in Second Language Acquisition, 38(1), 97–130.
        https://doi.org/10.1017/S0272263115000224
Pellicer-Sánchez, A., & Schmitt, N. (2010). Incidental vocabulary acquisition from an authentic
        novel: Do things fall apart? Reading in a Foreign Language, 22(1), 31–55.
Peters, E. (2007). Manipulating L2 learners’ online dictionary use and its effect on L2 word
        retention. Language Learning & Technology, 11(2), 36–58.
        http://dx.doi.org/10125/44103
                                               126


Phillipson, R. (1992). Linguistic imperialism. OUP Oxford.
Polio, C. G., & Duff, P. A. (1994). Teachers’ language use in university foreign language
        classrooms: A qualitative analysis of English and target language alternation. The
        Modern Language Journal, 78(3), 313–326. https://doi-
        org.proxy2.cl.msu.edu/10.2307/330110
Qian, David. D., & Lin, L. H. F. (2019). The relationship between vocabulary knowledge and
        language proficiency. In S. Webb (Ed.), The Routledge handbook of vocabulary studies.
        Routledge.
Ramezanali, N., Uchihara, T., & Faez, F. (2021). Efficacy of multimodal glossing on second
        language vocabulary learning: A meta-analysis. TESOL Quarterly, 55(1), 105–133.
        https://doi.org/10.1002/tesq.579
Rassaei, E. (2020). Effects of mobile‐mediated dynamic and nondynamic glosses on L2
        vocabulary learning: A sociocultural perspective. The Modern Language Journal, 104(1),
        284–303. https://doi.org/10.1111/modl.12629
Reichle, E. D., & Perfetti, C. A. (2003). Morphology in word identification: A word-experience
        model that accounts for morpheme frequency effects. Scientific Studies of Reading, 7(3),
        219–237. https://doi.org/10.1207/S1532799XSSR0703_2
Rice, C. A., & Tokowicz, N. (2020). A review of laboratory studies of adult second language
        vocabulary training. Studies in Second Language Acquisition, 42(2), 439–470.
        https://doi.org/10.1017/S0272263119000500
Sato, M., & Angulo, I. (2020). The role of L1 use by high-proficiency learners in L2 vocabulary
        development. In W. Suziki & N. Storch (Eds.). Languaging in language learning and
        teaching: A collection of empirical studies (pp.41–66). John Benjamins.
Sato, M., & Loewen, S. (2018). Metacognitive instruction enhances the effectiveness of
        corrective feedback: variable effects of feedback types and linguistic targets:
        Metacognitive instruction and corrective feedback. Language Learning, 68(2), 507–545.
        https://doi.org/10.1111/lang.12283
Schmitt, N. (2008). Instructed second language vocabulary learning. Language Teaching
        Research, 12(3), 329–363. https://doi.org/10.1177/1362168808089921
Scott, V. M., & Fuente, M. J. D. L. (2008). What’s the problem? L2 Learners’ use of the L1
        during consciousness-raising, form-focused tasks. The Modern Language Journal, 92(1),
        100–113. https://doi.org/10.1111/j.1540-4781.2008.00689.x
Shiki, O. (2008). Effects of glosses on incidental vocabulary learning: Which gloss-type works
        better, L1, L2, single choice, or multiple choices for Japanese university students?
        Journal of Inquiry and Research, 87, 39–56.
                                                127


Shin, J.-Y., Dixon, L. Q., & Choi, Y. (2020). An updated review on use of L1 in foreign
        language classrooms. Journal of Multilingual and Multicultural Development, 41(5),
        406–419. https://doi.org/10.1080/01434632.2019.1684928
Sonbul, S., & Schmitt, N. (2013). Explicit and implicit lexical knowledge: acquisition of
        collocations under different input conditions. Language Learning, 63(1), 121–159.
        https://doi.org/10.1111/j.1467-9922.2012.00730.x
Storch, N., & Aldosari, A. (2010). Learners’ use of first language (Arabic) in pair work in an
        EFL class. Language Teaching Research, 14(4), 355–375. https://doi-
        org.proxy2.cl.msu.edu/10.1177/1362168810375362
Swain, M., & Lapkin, S. (2000). Task-based second language learning: The uses of the first
        language. Language teaching research, 4(3), 251–274. https://doi-
        org.proxy2.cl.msu.edu/10.1177/136216880000400304
Teng, F. (2020). Retention of new words learned incidentally from reading: Word exposure
        frequency, L1 marginal glosses, and their combination. Language Teaching Research,
        24(6), 785–812. https://doi.org/10.1177/1362168819829026
Tian, L., & Hennebry, M. (2016). Chinese learners’ perceptions towards teachers’ language use
        in lexical explanations: A comparison between Chinese-only and English-only
        instructions. System, 63, 77–88. https://doi.org/10.1016/j.system.2016.08.005
Tian, L., & Jiang, Y. (2021). L2 proficiency pairing, task type and L1 Use: A mixed-methods
        study on optimal pairing in dyadic task-based peer interaction. Frontiers in Psychology,
        12, 699774. https://doi.org/10.3389/fpsyg.2021.699774
Tian, L., & Macaro, E. (2012). Comparing the effect of teacher codeswitching with English-only
        explanations on the vocabulary acquisition of Chinese university students: A lexical
        focus-on-form study. Language Teaching Research, 16(3), 367–391.
        https://doi.org/10.1177/1362168812436909
Tognini, R., & Oliver, R. (2012). L1 use in primary and secondary foreign language classrooms
        and its contribution to learning. In E.A. Soler & M. Safont-Jordà (Eds.), Discourse and
        language learning across L2 instructional setting (pp. 53–77). Brill.
        https://doi.org/10.1163/9789401208598_005
Toomer, M., & Elgort, I. (2019). The development of implicit and explicit knowledge of
        collocations: A conceptual replication and extension of Sonbul and Schmitt (2013).
        Language Learning, 69(2), 405–439. https://doi.org/10.1111/lang.12335
Tu, W., & Zhou, X.-H. (1999). A Wald test comparing medical costs based on log-normal
        distributions with zero valued costs. Statistics in Medicine, 18(20), 2749–2761.
                                                128


Türk, E., & Erçetin, G. (2014). Effects of interactive versus simultaneous display of multimedia
       glosses on L2 reading comprehension and incidental vocabulary learning. Computer
       Assisted Language Learning, 27(1), 1–25.
       https://doi.org/10.1080/09588221.2012.692384
Turnbull, M. (2001). There is a role for L1 in foreign and second language teaching, but….
       Canadian Modern Language Review, 57(4), 531-540.
Uchihara, T., Webb, S., & Yanagisawa, A. (2019). The effects of repetition on incidental
       vocabulary learning: A meta‐analysis of correlational studies. Language Learning, 69(3),
       559–599. https://doi.org/10.1111/lang.12343
van Hell, J. G., & Kroll, J. F. (2012). Using electrophysiological measures to track the mapping
       of words to concepts in the bilingual brain. Memory, Language, and Bilingualism, 126–
       160. https://doi.org/10.1017/cbo9781139035279.006
VanPatten, B. (1990). Attending to form and content in the input: An experiment in
       consciousness. Studies in Second Language Acquisition, 12(3), 287–301.
       https://doi.org/10.1017/S0272263100009177
Varol, B., & Erçetin, G. (2021). Effects of gloss type, gloss position, and working memory
       capacity on second language comprehension in electronic reading. Computer Assisted
       Language Learning, 34(7), 820–844. https://doi.org/10.1080/09588221.2019.1643738
Vidal, K. (2011). A comparison of the effects of reading and listening on incidental vocabulary
       acquisition. Language Learning, 61(1), 219–258. https://doi.org/10.1111/j.1467-
       9922.2010.00593.x
Vraciu, A., & Pladevall-Ballester, E. (2022). L1 use in peer interaction: Exploring time and
       proficiency pairing effects in primary school EFL. International Journal of Bilingual
       Education and Bilingualism, 25(4), 1433–1450.
       https://doi.org/10.1080/13670050.2020.1767029
Waring, R., & Takaki, M. (2003). At what rate do learners learn and retain new vocabulary from
       reading a graded reader? Reading in a Foreign Language, 15(2), 130–163.
       https://doi.org/10.1177/003368828501600214
Warren, P., Boers, F., Grimshaw, G., & Siyanova-Chanturia, A. (2018). The effect of gloss type
       on learners’ intake of new words during reading: evidence from eye-tracking. Studies in
       Second Language Acquisition, 40(4), 883–906.
       https://doi.org/10.1017/S0272263118000177
Watanabe, Y. (2020). Talking to self while writing: Second-language writers’ languaging
       processes and reflections. In W. Suziki & N. Storch (Eds.). Languaging in language
       learning and teaching: A collection of empirical studies (pp. 197–216). John Benjamins.
                                                 129


Webb, S. (2007). The effects of repetition on vocabulary knowledge. Applied Linguistics, 28(1),
        46–65. https://doi.org/10.1093/applin/aml048
Webb, S. (Ed.). (2019). The Routledge handbook of vocabulary studies. Routledge.
Webb, S., & Chang, A. C. S. (2015). Second language vocabulary learning through extensive
        reading with audio support: How do frequency and distribution of occurrence affect
        learning? Language Teaching Research, 19(6), 667–686.
        https://doi.org/10.1177/1362168814559800
Webb, S., & Chang, A. C. S. (2022). How does mode of input affect the incidental learning of
        collocations? Studies in Second Language Acquisition, 44(1), 35–56.
Webb, S., Sasao, Y., & Ballance, O. (2017). The updated Vocabulary Levels Test: Developing
        and validating two new forms of the VLT. ITL - International Journal of Applied
        Linguistics, 168(1), 33–69. https://doi.org/10.1075/itl.168.1.02web
Xu, J., & Fan, Y. (2021). Task complexity, L2 proficiency and EFL learners’ L1 use in task-
        based peer interaction. Language Teaching Research, 136216882110046.
        https://doi.org/10.1177/13621688211004633
Yanagisawa, A., Webb, S., & Uchihara, T. (2020). How do different forms of glossing contribute
        to L2 vocabulary learning from reading?: A meta-regression analysis. Studies in Second
        Language Acquisition, 42(2), 411–438. https://doi.org/10.1017/S0272263119000688
Yoshii, M. (2006). L1 and L2 glosses: Their effects on incidental vocabulary learning. Language
        Learning and Technology, 10(3), 85–101.
Yu, S., & Lee, I. (2014). An analysis of Chinese EFL students’ use of first and second language
        in peer feedback of L2 writing. System, 47, 28–38.
        https://doi.org/10.1016/j.system.2014.08.007
Zhang, C., & Ma, R. (2021). The effect of textual glosses on L2 vocabulary acquisition: A meta-
        analysis. Language Teaching Research, 136216882110115.
        https://doi.org/10.1177/13621688211011511
Zhao, T., & Macaro, E. (2016). What works better for the learning of concrete and abstract
        words: Teachers’ L1 use or L2-only explanations?: Teachers’ L1 use or L2-only
        explanations. International Journal of Applied Linguistics, 26(1), 75–98.
        https://doi.org/10.1111/ijal.12080
                                                130


                             APPENDIX A: TASK INSTRUCTIONS
A1. Instruction for the Reading Task
下面你将读到一本英文悬疑小说。你可以在手机，平板或者电脑上阅读。阅读小说的目的
是理解小说内容。在读的过程中，你将会回答一些关于小说内容的问题。
You will be reading a thriller in English. You can do the reading on your mobile phone, tablet, or
laptop. The purpose of the reading is to understand the content. You will be tested on your
comprehension during and after reading. You will answer questions regarding the content of the
book.
一些词以蓝色和下划线标记。当你点击这些词时，一个窗口会弹出，窗口中有词的定
义。如需要了解该词词义，请点击该词。看完词义后，请再次点击，关闭窗口。
这些词可能会在阅读中反复出现，但词义窗口链接只会出现一次。
Some words in the reading are colored in blue and underlined. When you click one of those
words, a window will pop up and you will see the definition of the word. You can click those
words if you’d like to. Please click the words again to close the pop-up windows.
第一天阅读 18 页，第二天阅读 7 页。阅读不限时间。阅读中请不要查阅字典等其他资
料。翻页速度较慢时，请不要刷新页面。每页只能阅读一次。翻页后不能再往回读。
You will be reading 18 pages on the first day and 8 pages on the second day. There is no time
limit for the reading. Please do not consult the dictionary or any other resource. When the page is
loading slowing after you click the ‘next page’ button, please do not refresh the page. You can
only read each page once.
                                                 131


A2. Instruction for the Vocabulary Size Test
这是一个词汇量调查。每一道题目的左边是词的定义。请为这些词的定义选择意思相近或
对应的词。例如，下图中第一个词义“land with water all around it” 对应词 island, 则在
island 下打钩。每题有 3 个词义，6 个单词，所以有三个单词无对应词义。请尽量不要空
题。请独立作答，不要查阅字典或咨询他人。一共 5 页，每页 10 题。限时 30 分钟！
This is a test of your vocabulary size. On the left are word definitions. Please match the
definitions with corresponding words on the right. For instance, in the image below the first
word definition “land with water all around it” corresponds to the word ‘island’. In this case, put
a tick under ‘island’. In each question, there are 3 word definitions and 6 words. Therefore, 3 of
the words will not match any of the word definitions. Please try to answer all of the questions.
Do not consult the dictionary or any other resource. There are 5 pages in the test, with each page
containing 10 questions. The time limit is 30 minutes!
A3. Instruction for the Meaning Recall Test
请写出词的意思。可以是词义或者近义词。可以使用中文或英文。不知道答案的，写？，
共 24 题。
Please type down the definitions for the words below. You can write down the exact definition or
the synonym. You can use Chinese or English. If you don’t know the answer to the question,
write ?. 24 items in total.
                                                 132


A4. Instruction for the Meaning Matching Test
这是一个考查词义的配对题。你会看到带编号的 26 个词义和 24 个词。请为词语选择对应
的词义编号。
This is a matching test on word knowledge. You will see 26 numbered word definitions and 24
words. Please match the words and their word meanings.
A5. Instruction for the Self-paced Reading Test
这是一个考查句子理解的测试。你会读到一些句子。一开始，屏幕上只有*号. *号表示句
子开始的位置。请按空格键让测试继续。每按一次空格键出现一个词。句子后面会跟有
阅读理解题。如觉得题目中信息符合句子内容，用鼠标按 Yes，不符合按 No。
This is a sentence comprehension task. You will read some sentences. At first, you will only see
a *. This is where the sentence will start. When you are ready, press the space bar. Every time
you press the space bar, a word of the sentence will appear until the sentence ends. After each
sentence, there will be a statement. If you think the statement matches the content the sentence
you just read, press the Yes button using your mouse. If not, press the No button.
请按照正常速度阅读句子。一开始是练习。练习结束测试正式开始时会有提示。请按空格
键进入测试。
Please read the sentences at your normal pace. You will go through some practice trials at the
beginning. When the practice ends, you will be notified. Press the space bar to continue.
A6. Instruction for the Exit Questionnaire
以下是一个关于刚才完成的任务的调查。各个选项无好坏之分，请如实回答。
Next is a survey about the reading task you just finished. Please answer honestly. There is no
good or bad answer.
                                                133


       APPENDIX B: VOCABULARY PROFILE OF THE READING MATERIAL
Below lists the percentages of words in the reading material from certain frequency levels (e.g.,
K1, K2…). K1 represents the most frequent 1000, i.e., 1k, word families, and K2 represents the
second 1000, and so on.
K1: 95.9%; K2: 98.3%; K3: 98.7%; K4: 99%; K6: 99.4%; K11: 99.5%
                                               134


                  APPENDIX C: TARGET WORD CHARACTERISTICS
Table C1
Target Word Characteristics
 Origin Pseudow Part of Fo     Length       Orthographic Mean bigram  Mean
 al      ords       speech  O  (letters)    neighbor     frequency    RT
 words
 cop       latpin    Noun   19       6             1         2,789.60 854.20
 mafia    valoon     Noun   13       6             1         2,108.00  863
 senato    haron     Noun    9       5             3         2,714.75 874.77
 r
 trailer   corax     Noun   10       5             2         2,107.75 826.88
 murde    caudam     Noun    5       6             1           821.4  804.74
 r
 crawl      elile    Verb    4       5             2         2,332.50 871.36
 fright    ginge     Verb    7       5             4         3,150.75 856.15
 en
 grab      ameld     Verb    5       5             1         1,187.25 805.70
 trial     lodice    Noun    3       6             1         1,561.00 893.26
 whisp     loupe     Verb    2       5             2         1,303.00 851.52
 er
 violen    bitser   Adject   1       6             1         2,398.40 824.19
 t                    ive
 shock     appent    Noun    8       6             1         1,975.20 873.17
                                         135


Table C1 (cont’d)
tail       glunk  Noun    4   5        2   789.75  842.56
pipe
garag    goncho   Noun    2   6        1  1,815.20 855.53
e
whisk    vandier  Noun    9   7        1  2,502.33 806.11
ey
detect    toplin  Noun    5   6        1  2,507.40 854.59
ive
wild     baggod   Adject  1   6        1    539.2  839.55
                   ive
lorry       jact  Noun    1   4        4  908.333  800.90
innoc     agloat  Adject  1   6        1  1,458.00 854.63
ent                ive
body       hoag   Noun   27   4        2  799.333  847.50
gun       tonger  Noun    7   6        2  3,452.80 839.07
guard     lancid  Noun    2   6        2  1,746.40 854.42
custod     claft  Noun    5   5        2    747.5  816.23
y
prison    maive   Noun    8   5        5  1,343.25 803.52
Mean                     6.5 5.5     1.83 1794.13  842.23
                          8
                                 136


Table C1 (cont’d)
SD                6.1 .72     1.13  826.40  25.95
                   3
Range             1-  4-7     1-4  593-3453 800-
                  27                         893
                          137


                             APPENDIX D: GLOSSES
Table D1
Glosses
 Target   Pseudowords L2 glosses                L2 gloss L1 gloss L1 gloss
 words                                          length            length
                                                (words)           (characters)
 cop      latpin      police not in uniform     4        便衣警察     4
 mafia    valoon      a criminal organization   3        犯罪集团     4
 senator  haron       an important politician   3        重要政治     6
                                                         人物
 trailer  corax       a car used as a house for 8        供人居住     8
                      people to live                     的改装车
 murder   caudam      a murder that is planned  6        有预谋的     6
                      ahead                              谋杀
 crawl    elile       move on hands and knees   5        匍匐前进     4
 frighten ginge       make someone nervous      3        使人害怕     4
 grab     ameld       to hold suddenly          3        突然抓住     4
 trial    lodice      the process to decide if  9        审判       2
                      someone committed a
                      crime
 whisper  loupe       speak quietly             2        小声说话     4
                                         138


Table D1 (cont’d)
violent   Bitser  likely to hurt someone    4 有暴力倾 6
                                              向的
shock     Appent  a unpleasant surprise     3 让人不愉 8
                                              快的惊喜
tail pipe Glunk   where car smoke gets out  5 汽车排气 5
                                              管
garage    Goncho  a room to store things    5 杂物存储 5
                                              间
whiskey   Vandier strong alcohol            2 烈酒   2
detective Toplin  police who gather         7 负责搜集 11
                  information about a crime   案件信息
                                              的警察
wild      Baggod  crazy, out of control     4 疯狂，失 5
                                              控的
lorry     Jact    a truck                   2 货车   2
innocent  Agloat  having no knowledge of    5 不知情的 4
                  something
body      Hoag    a dead body               3 死尸   2
gun       Tonger  a small gun               3 小手枪  3
                                     139


Table D1 (cont’d)
guard     lancid  someone who protects   4    安保人员 4
                  people
custody   claft   being kept in prison   4    被关押  3
Table D1 (cont’d)
prison    maive   prison for children    3    儿童看守 3
                                              所
Mean                                     4.17      4.54
SD                                       1.83      2.17
                                     140


                          APPENDIX E: EXIT QUESTIONNAIRE
(English translations do not appear in the actual questionnaire.)
                                                       1. How often did you check the word
                                                       definitions (0% means you didn’t
                                                       check at all, 100% means you
                                                       checked all of them)
                                                       2. What was your purpose when you
                                                       check the word definitions? (e.g.,
                                                       100% for reading comprehension, or
                                                       70% for word learning)
                                                       (1) reading comprehension
                                                       (2) word learning
                                                       (3) other
                                                       3. Are there any word definitions
                                                       that you found particularly helpful for
                                                       reading comprehension? (list the
                                                       corresponding words)
                                                       4. Are there any word definitions that
                                                       you found particularly helpful for
                                                       word learning? (list the
                                                       corresponding words)
                                                       5. If you skipped some word
                                                       definitions, what were the reasons?
                                                       (e.g., 50% of the time, it was because
                                                       you already guessed the meaning, or
                                                       70% of the time because you didn’t
                                                       think the meaning was important). If
                                                       you checked all the word definitions,
                                                       please skip this.
                                                       (1) I have guessed the word meaning.
                                                       (2) I didn’t need the word definitions
                                                       to understand the reading.
                                                       (3) I didn’t think knowing word
                                                       meanings was important.
                                                       (4) The word definitions were not
                                                       helpful.
                                                       (5) Others.
                                               141


    6. Can you understand the word
    definitions? (100% means you
    understood them all, 0% means you
    didn’t understood any).
    7. Do you think the word definitions
    were helpful for reading
    comprehension (0% means not helpful
    at all, 100% means very helpful)
    8. Do you think the word definitions
    were helpful for word learning (0%
    means not helpful at all, 100% means
    very helpful)
    9. During reading, did you try to
    memorize words?
    Yes
    No
    10. During reading, did you anticipate
    word tests afterwards?
    Yes
    No
    Did you enjoy the reading?(0% means
    not at all, 100% means a lot)
142


            APPENDIX F: LANGUAGE BACKGROUND QUESTIONNAIRE
(English translations do not appear in the actual questionnaire.)
姓名 Name
   ________________________________________________________________
实验 ID Participant ID
   ________________________________________________________________
性别 Sex
           男
           女
           不想透露
年龄 Age
   ________________________________________________________________
                                               143


请问您在什么年级？What is your academic status?
     大一 Freshman
     大二 Sophomore
     大三 Junior
     大四 Senior
     研究生 Masters student
     博士生 PhD student
     已经毕业 Graduated
                               144


请问您有在国外上课的经历吗？Have you been to school overseas?
            有 Yes
            无 No
请问您在国外上课的时长为：（例：1 年 3 月）How long have you been to school overseas?
(e.g., 1 year 3 month)
            年 year________________________________________________
            月 month________________________________________________
请问您有在国外居住的经历吗？Have you lived overseas?
            有 Yes
            无 No
您在国外居住的具体情况是：The details of your stay overseas
                                         145


                                    什么时候(例：
                      时长（月）                                           居住或停留的
                                      2016-2018)       使用的语言
                       Duration                                            目的
                                      When (e.g.,     Language used
                       (month)                                        Purpose of stay
                                      2016-2018)
      国家 1
    Country 1
      国家 2
    Country 2
      国家 3
    Country 3
请自我评价您的英语水平 (1=差, 10=非常好) Please self-assess your English proficiency (1 =
poor, 10 = excellent)
                                             0    1 2   3   4   5   6  7   8    9   10
                                阅读 reading
                                          146


 写作 writing
听力 listening
口语 speaking
 总体 overall
            147


请提供您完成过的英语测试的成绩 Please tell us scores of standardized English tests you have
taken
                                                        测试年份（例：2020）
                                   总分 score
                                                         year taken (e.g., 2020)
         四级 CET4
         六级 CET6
         专四 TEM4
         专八 TEM8
您几岁开始学习英语？At what age you start learning English?
    ________________________________________________________________
                                      148


您接受了多少年课堂英语教育？How many years of classroom English learning have you
had?
    ________________________________________________________________
现在，每周您上多少小时的英语课？（没有上英语课请填 0）How many hours of English
instruction are you having? (put 0 if you are not taking any English instruction)
    ________________________________________________________________
您课外学习英语的方式：（可多选）Ways of learning English outside the classroom (you can
choose multiple options)
    ▢          看英语书 Reading English books
    ▢          看英语电影电视 Watching English movies or TV series
    ▢          听英语歌 Listening to English songs
    ▢          用英语对话 Talking to others in English
    ▢          其他 Others________________________________________________
对于您的语言学习背景，请问您还有其他信息想要告诉我吗？
Do you have anything else to say about your English learning?
    ________________________________________________________________
                                               149


           APPENDIX G: STIMULI FOR THE SELF-PACED READING TEST
Table G1
Stimuli
 Sentence (with pseudowords)              Real word Nonword Pseudoword meaning
 Jason saw a latpin in front of the shop. police    royate  police not in uniform
 They talked about the valoon very        student   remude  a criminal
 often during dinner.                                       organization
 He took a photo for the haron and his    teacher   persh   an important
 wife.                                                      Wpolitician
 He designed a corax for his friends.     house     dulpit  a car used as a house
                                                            for people to live
 A caudam took place last year here.      party     pidet   a murder that is
                                                            planned ahead
 He tried to elile while looking at his   walk      arrang  move on hands and
 phone.                                                     knees
 I always ginge him in front of people.   kiss      fattice make someone
                                                            nervous
 He tried to ameld me when I was          touch     premox  hold suddenly
 reading.
 People are waiting to the lodice to      show      paisin  the process to decide
 begin at eight.                                            if someone
                                                            committed a crime
 They always loupe to each other.         smile     gatnip  speak quietly
                                              150


Table G1 (cont’d)
He is a bitser student at school.        helpful  gustre  likely to hurt
                                                          someone
This reminded him of an appent at the    event    feload  an unpleasant
same time last year.                                      surprise
He tried to fix the glunk but he failed. chair    theath  where car smoke gets
                                                          out
He read a piece of news about his        company  ostane  a room to store things
goncho when he was eating.
He buys vandier every three month.       meat     glunter strong alcohol
He shook hands with the toplin when      person   esate   police who gather
they met.                                                 information about a
                                                          crime
He was baggod after he knew that.        happy    phronic crazy, out of control
He saw a jact next to his house.         bird     plunt   a truck
He looks agloat all the time.            confused drimful having no knowledge
                                                          of something
He found a hoag next to the tree.        mouse    vertin  a dead body
He took out a tonger all of a sudden.    knife    sluster a small gun
A lancid is standing in front of the     soldier  scrib   someone who
door.                                                     protects people
The writer chose claft as the topic.     society  epema   being kept in prison
                                             151


Table G1 (cont’d)
There is a maive in the north part of the hospital squan prison for children
city.
                                             152


                  APPENDIX H: MIXED-EFFECTS MODELLING
Table H1
Mixed Models Specifications
Models                                               Specifics
Engagement: gloss       gloss checking ~ group + vtotal + group * vtotal + (1 | participant) +
checking                (1+group+zvtotal | number)
Engagement: gloss       gloss time ~ group + vtotal + group * vtotal + (1 | participant) +
time                    (1+group+zvtotal | number)
Matching Immediate      accuracy ~ group + FoO + vtotal + glosstime + group * FoO * vtotal
                        + group * FoO * glosstime + group * vtotal * glosstime + group *
                        glosstime + (1 + FoO + glosstime | participant) + (1 + group +
                        glosstime + vtotal | number)
Matching Delayed        accuracy ~ group + FoO + vtotal + glosstime + group * FoO * vtotal
                        + group * FoO * glosstime + group * vtotal * glosstime + group *
                        glosstime + (1 + FoO + glosstime | participant) + (1 + group +
                        glosstime + vtotal | number)
Recall Immediate        accuracy ~ group + FoO + vtotal + glosstime + group * FoO * vtotal
                        + group * FoO * glosstime + group * vtotal * glosstime + group *
                        glosstime + (1 + FoO + glosstime | participant) + (1 + group +
                        glosstime + vtotal | number)
Recall Delayed          accuracy ~ group + FoO + vtotal + glosstime + group * FoO * vtotal
                        + group * FoO * glosstime + group * vtotal * glosstime + group *
                        glosstime + (1 + FoO + glosstime | participant) + (1 + group +
                        glosstime + vtotal | number)
                                            153


Table H1 (cont’d)
Self-paced Immediate        rt ~ condition + FoO + vtotal + glosstime + condition * FoO * vtotal
                            + condition * FoO * glosstime + condition * vtotal * glosstime +
                            condition * glosstime + (1 + FoO + glosstime + condition |
                            participant) + (1 + glosstime + vtotal | number)
Self-paced Delayed          rt ~ condition + FoO + vtotal + glosstime + condition * FoO * vtotal
                            + condition * FoO * glosstime + condition * vtotal * glosstime +
                            condition * glosstime + (1 + FoO + glosstime + condition |
                            participant) + (1 + glosstime + vtotal | number)
Note. Specifics refer to R codes for the models. Vtotal: vocabulary size.
Table H2
Manipulation Check for self-paced reading immediate posttest: Position 0, L1 gloss group
                              Fixed effects                             Random effects
                                                                By participant       By item
           Estimate [95% CI]       SE       t         p         Variance SD       Variance SD
Intercept    6.24 [6.17, 6.31]     .04 174.57 <.001***             .04     .19       .01     .10
Nonword      -.04 [-.08, .003]     .02   -1.83       .07
Real          -.03 [-.08, .01]     .02   -1.46       .14
word
                                                154


Table H3
Manipulation Check for self-paced reading immediate posttest: Position 0, L2 gloss group
                            Fixed effects                          Random effects
                                                           By participant         By item
          Estimate [95% CI]      SE       t       p        Variance SD        Variance SD
Intercept  6.28 [6.17, 6.31]     .04 139.64 <.001***         .05      .23        .01      .12
Nonword    -.003 [-.08, .003]    .02    -.11     .91
Real        -.003 [-.08, .01]    .02    -.13     .90
word
Table H4
Manipulation Check for self-paced reading delayed posttest: Position 0, L1 gloss group
                            Fixed effects                           Random effects
                                                           By participant         By item
          Estimate [95% CI]      SE       T       p        Variance SD        Variance SD
Intercept  6.09 [6.02, 6.16]     .03 176.53 <.001***          .03     .18        .01      .10
Nonword      -.01 [-.05, .03]    .02    -.57     .57
Real       -.04 [-.08, -.001]    .02   -1.99     .05
word
                                             155


Table H5
Manipulation Check for self-paced reading delayed posttest: Position 0, L2 gloss group
                           Fixed effects                           Random effects
                                                           By participant         By item
          Estimate [95% CI]     SE       T       p         Variance SD        Variance SD
Intercept  6.12 [6.04, 6.19]    .04 155.23 <.001***          .04      .19        .01      .10
Nonword     -.03 [-.08, .01]    .02   -1.46     .15
Real        -.01 [-.06, .03]    .02    -.62     .54
word
                                            156