A CONSTRUCT VALIDATION STUDY OF IMPLICIT AND TIME SENSITIVE VOCABULARY MEASURES

By

Bronson Hui

A DISSERTATION

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of

Second Language Studies—Doctor of Philosophy

2021

ABSTRACT

A CONSTRUCT VALIDATION STUDY OF IMPLICIT AND TIME SENSITIVE VOCABULARY MEASURES

By

Bronson Hui

Vocabulary researchers have started expanding their assessment toolbox by incorporating timed tasks and psycholinguistic instruments (e.g., priming tasks) to gain insights into lexical development (e.g., Elgort, 2011; Godfroid, 2020b; Nakata & Elgort, 2020; Vandenberghe et al., 2021). These time-sensitive and implicit word measures differ qualitatively from traditional paper- or accuracy-based vocabulary tests and are believed to tap into lexical strength and representations in the mental lexicon (Elgort, 2018; Godfroid, 2020b). As a result, there have been calls to use both traditional (explicit) and these timed and implicit word measures in a complementary manner (e.g., Godfroid, 2020b; Nakata & Elgort, 2020; Vandenberghe et al., 2021). At the same time, researchers must first develop a thorough understanding of how these different types of measures (explicit vs. implicit and timed vs. untimed) relate to each other before they can make informed decisions on their measurement battery. It is thus well-motivated to examine the construct validity of these measures empirically and systematically. In this validation study, I took the first step to fill this research gap by assessing both the predictive and factorial structure validity of these measures. One hundred and forty-five learners of English took part in five vocabulary tasks: (1) a receptive form-meaning task, where they chose an option representing the meaning of the target word embedded in a sentence, (2) a productive form-meaning task, where they produced the target word to fit a sentence context, (3) a computerized Yes-No (reaction time) test, where they indicated if they knew the target word by pressing keys on their keyboard, (4) a masked repetition priming task with lexical decisions, where they judged if a letter string forms a word in English, and (5) a semantic priming task with lexical decisions. Items in all five tests were the same 40 English words sampled across the 2K-5K frequency bands. Data analysis involved item inspection and extraction of person-related parameters based on Rasch and/or mixed-effects models. The measures of person ability obtained from individual tasks were then submitted to confirmatory factor analyses in order to assess the psychometric dimensionality of the measurement battery. The resulting latent factor(s), representing a pure measure of vocabulary under a specific conceptualization, was then used to predict self-reported proficiency to shed light on their predictive validity. With method effects accounted for, the one-factor solution ("Vocabulary Knowledge") produced a good fit and was preferred based on the principle of parsimony for both the implicit vs. explicit and timed vs. untimed distinctions. This result provides evidence for psychometric unidimensionality of these measures as representing a potential unitary construct of vocabulary knowledge. At the same time, the vocabulary construct had the most explanatory power (predictive validity) when conceptualized distinctly as lexical knowledge (measured by untimed tasks) and strength (measured by timed tasks).
Taken together, these results foreground the need for researchers to further specify the nature of the vocabulary construct as well as the operational task features with which it can be assessed empirically. Importantly, I call for more measurement validation work as researchers expand their assessment toolboxes in vocabulary research. Copyright by BRONSON HUI 2021 To the brave and freedom-loving people of Hong Kong v ACKNOWLEDGEMENTS I would like to express my deepest possible gratitude to the following individuals for their support and guidance throughout my PhD career. I am forever indebted to my family for their support and love. This journey would not have been possible without them. I have always been encouraged to adventure and leave my comfort zone. Together, we have done exactly that, sailing away from the place we once called home. Many thanks go to my advisor, Dr. Aline Godfroid, who has been tremendously supportive. She is certainly a role model and a great mentor, guiding me through the challenges in academia. Her insights often mean more work on my end, but the results are always amazing. I must also thank members of my dissertation committee: Drs. Shawn Loewen, Paula Winke, Patti Spinner, Irina Elgort, and Hope Akaeze. The wide range of expertise they have collectively offered has greatly strengthen the investigation in this work. Last but not least, I am grateful for the following financial support: the Language Learning Dissertation Grant, the Gorilla Grant for Graduate Students, the Dissertation Completion Fellowship, the Graduate School’s Research Enhancement Scheme, and the research and conference support from the Second Language Studies (SLS) program. vi TABLE OF CONTENTS LIST OF TABLES ................................................................................................................................ ix LIST OF FIGURES .............................................................................................................................. xi INTRODUCTION ............................................................................................................................... 1 CHAPTER 1: LITERATURE REVIEW ................................................................................................... 4 Conceptualizations of Vocabulary Knowledge ............................................................................ 4 Vocabulary Breadth, Depth, and Strength .............................................................................. 4 Implicit and Explicit Word Knowledge................................................................................... 12 Measures of Vocabulary Knowledge......................................................................................... 16 Traditional and Time-Sensitive Word Measures ................................................................... 17 Timed Lexical Measure of Strength ....................................................................................... 21 Implicit Word Measures ........................................................................................................ 23 Measurement Validation .......................................................................................................... 28 Measurement Validation Studies in Grammar Research ...................................................... 30 Measurement Validation Studies in Vocabulary Research ................................................... 
32 The Present Study ..................................................................................................................... 36 CHAPTER 2: METHODOLOGY ........................................................................................................ 39 Participants................................................................................................................................ 39 Critical Words ............................................................................................................................ 42 Data Collection Platform ........................................................................................................... 44 Measures ................................................................................................................................... 49 Form-Meaning Receptive Test .............................................................................................. 50 Form-Meaning Productive Test ............................................................................................. 52 Yes-No RT Test (Access to the Form-Meaning Link) .............................................................. 53 Masked Repetition Priming (Lexical Representations) ......................................................... 56 Semantic Priming (Semantic Representations) ..................................................................... 63 Self-Reported Proficiency ...................................................................................................... 73 Procedure .................................................................................................................................. 74 Data Analysis ............................................................................................................................. 75 The Form-Meaning Receptive Test ....................................................................................... 76 The Form-Meaning Productive Test ...................................................................................... 79 Yes-No Rt Test........................................................................................................................ 80 Masked Repetition Priming ................................................................................................... 82 Semantic Priming ................................................................................................................... 86 Main Cfas And Sems .............................................................................................................. 86 Analysis Software Packages ................................................................................................... 92 vii CHAPTER 3: RESULTS (INDIVIDUAL TASKS) ................................................................................... 93 The Form-Meaning Receptive Test ........................................................................................... 93 The Form-Meaning Productive Test.......................................................................................... 95 Yes-No RT Test......................................................................................................................... 101 Accuracy Data ...................................................................................................................... 
101 Reaction Time Data ............................................................................................................. 103 Masked Repetition Priming ..................................................................................................... 106 Semantic Priming .................................................................................................................... 107 Summary of Results for Individual Tasks ................................................................................ 109 CHAPTER 4: RESULTS (CFA AND SEM)......................................................................................... 112 RQ1a – Explicit vs. Implicit ...................................................................................................... 112 RQ1b – Knowledge vs. Strength .............................................................................................. 112 RQ2a – Predictive Validity of a Single Vocabulary Construct.................................................. 117 RQ2b – Predictive Validity of Lexical Knowledge and Strength .............................................. 117 Summary of Findings ............................................................................................................... 118 CHAPTER 5: DISCUSSION AND CONCLUSION.............................................................................. 119 The Jury is out but… ................................................................................................................ 119 A Broader, Unitary View of Vocabulary Knowledge ............................................................... 120 What Implicit and Timed Measures Offer............................................................................... 123 Understanding Priming Tasks as Individual Differences Measures ........................................ 125 Alternative and Equivalent Models ......................................................................................... 129 Limitations and Future Directions ........................................................................................... 131 Conclusion ............................................................................................................................... 132 APPENDICES ................................................................................................................................ 134 APPENDIX A THE FORM-MEANING RECEPTIVE TEST ............................................................. 135 APPENDIX B THE FORM-MEANING PRODUCTIVE TEST.......................................................... 150 APPENDIX C STIMULI FOR THE YES-NO RT TEST .................................................................... 153 APPENDIX D STIMULI FOR THE MASKED REPETITION PRIMING TASK ................................... 155 APPENDIX E WORD ASSOCIATION NORMS FOR CRITICAL RELATED TRIALS .......................... 170 APPENDIX F STIMULI FOR THE MASKED SEMANTIC PRIMING TASK ..................................... 173 REFERENCES ................................................................................................................................ 186 viii LIST OF TABLES Table 1 Nation’s (2013) Framework of Vocabulary Knowledge .................................................... 7 Table 2 Desired Sample Sizes Based on Model Fit....................................................................... 
40 Table 3 Demographic Information About the Participants ......................................................... 42 Table 4 List of Critical Words ....................................................................................................... 44 Table 5 Summary of Measures for the Present Research ........................................................... 50 Table 6 Means and Standard Deviations of Reaction Times for Words and Non-words ............ 56 Table 7 Summary of Trial Types in the Masked Repetition Priming Task ................................... 58 Table 8 Means and Standard Deviations of Reaction Times for Critical Words Between Conditions (Repetition Priming) ................................................................................... 60 Table 9 Summary of Mixed Models for Pilot Data - Masked Repetition Priming Task ............... 61 Table 10 Summary of Trial Types in the Masked Semantic Priming Task ................................... 67 Table 11 Means and Standard Deviations of Reaction Time for Learners in the Semantic Priming Task - Piloting ............................................................................................................... 69 Table 12 Means and Standard Deviations of Reaction Time for Native Speakers in the Semantic Priming Task - Piloting .................................................................................................. 71 Table 13 Summary of Mixed Models for Pilot Data – Semantic Priming Task (Native Speaker). 72 Table 14 Summary of Number of Data Points for Each Participant ............................................ 75 Table 15 Summary of Hypothesized CFA Models ........................................................................ 89 Table 16 Summary of Software Packages Used ........................................................................... 92 Table 17 Descriptives for the Form-Meaning Receptive Test...................................................... 93 Table 18 Descriptives for the Form-Meaning Productive Test ..................................................... 96 Table 19 Descriptives for the Yes-No RT Test (Accuracy Data).................................................. 101 ix Table 20 Descriptive Statistics for the Yes-No Test (Reaction Time Data) ................................ 103 Table 21 Descriptive Statistics for the Masked Repetition Priming Task .................................. 106 Table 22 Summary of Mixed Models - Masked Repetition Priming Task .................................. 107 Table 23 Means and Standard Deviations of Reaction Times for Critical Words Between Conditions (Semantic Priming) ................................................................................... 108 Table 24 Summary of Mixed Models - Semantic Priming Task.................................................. 109 Table 25 Summary of Individual Task Results............................................................................. 110 Table 26 Correlation Matrix for the Individual Task Results ...................................................... 111 Table 27 Summary of Confirmatory Factor Analysis and Structural Equation Models ............. 114 Table 28 Model Summary of CFA-M2 ........................................................................................ 115 Table 29 Model Summary of CFA-M3 ........................................................................................ 
116

LIST OF FIGURES

Figure 1 The Modified Hierarchical Model (Pavlenko, 2009) ...................................................... 11
Figure 2 Visualization of CFA-M2 ................................................................................................. 91
Figure 3 Visualization of SEM-M3 ................................................................................................ 91
Figure 4 Person-Item Map for the Receptive Test (40-Item) ....................................................... 98
Figure 5 Person-Item Map for the Receptive Test (35-Item) ....................................................... 99
Figure 6 Person-Item Map for the Productive Test (38-Item) ................................................... 100
Figure 7 Person-Item Map for the Yes-No RT Test - Word Data (33-item) ............................... 104
Figure 8 Person-Item Map for the Yes-No RT Test - Non-Word Data (37-item) ....................... 105

INTRODUCTION

Vocabulary knowledge has been consistently found to be the most important determinant of success in second language use such as reading and listening (e.g., S. Zhang & Zhang, 2020). At the same time, the construct of vocabulary knowledge has been conceptualized in many different ways (e.g., Schmitt, 2014; Yanagisawa & Webb, 2020). These conceptualizations carry different theoretical implications for what it means to know a word. Importantly, each conceptualization requires valid measurement tools for teachers, researchers, and language testers to assess different aspects of learners' word knowledge. Traditionally, vocabulary is often assessed through paper-and-pencil, accuracy-based tests. In a complementary manner, researchers have recently started using reaction-time-based, psycholinguistic tasks in vocabulary studies (e.g., Elgort, 2011; Godfroid, 2020b; Nakata & Elgort, 2020; Vandenberghe et al., 2021). Unlike traditional, explicit vocabulary tests, these implicit and time-sensitive word measures are believed to tap into learners' lexical strength and representations in the mental lexicon, respectively. Due to the qualitative differences between these measures and their traditional counterparts, they can potentially shed new light on the vocabulary construct and the acquisition process. Therefore, the adoption of these measures and the corresponding expansion of the methodological toolbox in vocabulary research bring exciting opportunities. At the same time, the diversification of assessment tools also calls for a thorough understanding of the relationships between the many different measures. Specifically, researchers need to assess the alignment between their conceptualization of the vocabulary construct and the measurement tools they deploy. Simply put, the question is whether tests used to measure vocabulary knowledge adequately represent the construct of vocabulary knowledge as it is conceptualized. If not, researchers need to find new measurement operationalizations to tap into the specific construct(s) in question. Alternatively, the construct might require a reconceptualization based on what can be empirically measured. In addition, researchers should quantify and assess the value of the new insights brought about by these implicit and time-sensitive word measures. Essentially, what can researchers gain from administering these tests on top of traditional vocabulary tasks? For example, to what extent can researchers better explain the individual differences in language performance?
How important is the knowledge measured by these tasks in authentic language use, after considering what is tapped into by traditional measures? Taken together, a construct validation study of vocabulary measurement, which examines psychometric dimensionality and predictive validity, represents an important step forward, offering initial insights into these test-knowledge relationships and providing foundational support for different forms of vocabulary assessment. The present dissertation is such an attempt at construct validation. It is organized in five chapters. Following this introduction, I first provide a narrative review of the literature in Chapter 1, where I highlight the research gaps that motivated the study and present the research question that guided the investigation. In Chapter 2, I detail the methodology used to address the research question, including my participants, critical words, measures, and data analysis procedure. In Chapters 3 and 4, I report the results for the individual tasks and for the overall confirmatory factor analyses (CFA) and structural equation models (SEM), respectively. In the closing Chapter 5, I discuss the findings in terms of methodology and what they mean for the understanding of the vocabulary construct, before drawing a conclusion. In doing so, I hope to draw vocabulary researchers' attention to the validity of the measures that they rely on in their studies and call for more validation research in this research area.

CHAPTER 1: LITERATURE REVIEW

Throughout this chapter, I provide a narrative review of the literature on the conceptualizations of word knowledge, vocabulary measures used by second language (L2) researchers, and measurement validation studies in L2 research. The goal of this chapter is to offer a concise, state-of-the-art overview of the research field, through which I highlight the motivation for the present study. First, I will start with the theoretical conceptualizations of vocabulary knowledge, covering vocabulary size, depth, and strength as well as the distinction between explicit and implicit word knowledge. I will highlight some competing conceptualizations and the practical and theoretical needs to empirically test and differentiate between them. Second, I will then discuss implicit and time-sensitive vocabulary measures, their application in research, as well as their limitations. I will also present three example tests which researchers have used: one timed vocabulary task and two implicit word measures. I stress that, despite the variety of available measures, researchers need to engage in measurement validation work to clarify how these measures may or may not relate to each other and to provide validity evidence for their proper use and interpretations. In the third section, I review the literature on measurement validation studies in grammar research, followed by the two validation studies published thus far in the domain of vocabulary research. At the end of the chapter, I will present my research question for the present dissertation.

Conceptualizations of Vocabulary Knowledge

Vocabulary Breadth, Depth, and Strength

Breadth and depth is one distinction vocabulary researchers make in conceptualizing lexical knowledge (e.g., Anderson & Freebody, 1981; Schmitt, 2014; Yanagisawa & Webb, 2020). Vocabulary breadth refers to the size of one's vocabulary: it is the number of words for which an individual knows "at least some of the significant aspects of meaning" (Anderson & Freebody, 1981, p. 92).
Despite the focus on quantity, the definition inevitably required some specification of the quality of the knowledge (i.e., what it means to know a word). In other words, breadth cannot be easily defined independent of depth. This second, quality dimension of word knowledge has been referred to as the depth of understanding of a word. This distinction between vocabulary breadth (quantity or size) and depth (quality) has allowed researchers to address research questions on, for example, how many words one needs to comprehend a text (e.g., Laufer, 1992), and on the effects of various vocabulary activities on different aspects of word knowledge (e.g., Webb, 2007), as well as the relationship between size and depth (e.g., Schmitt, 2014). In terms of the conceptualizations of vocabulary depth, Yanagisawa and Webb (2020) pointed out that there have been many ways in which depth is defined. Some researchers used depth and quality of knowledge in an interchangeable manner (Anderson & Freebody, 1981; Read, 1993). Offering more specificity to the notion of word knowledge quality, Wesche and Paribakht (1996) differentiate “kinds of knowledge of specific words” and “degrees of such knowledge” (p. 13). In other words, there are various sub-components of word knowledge (e.g., meaning and form). In addition, there is a notion of mastery of such knowledge (e.g., how well one understands the meaning of a word). These two different, but related, conceptualizations of depth represent the key approaches researchers take in examining vocabulary depth. First, conceptualizing depth as component knowledge, researchers break down word knowledge into its different component elements (Henriksen, 1999; Read, 2000; Schmitt, 2014; 5 Yanagisawa & Webb, 2020). To date, Nation's (2013) framework is the most comprehensive and oft-cited listing of the various components involved in knowing a word (see Table 1). For example, knowing a word means knowing its form, meaning, and use, which are broken down into different aspects such as spoken and written word forms. Each of these aspects is further broken down into receptive and productive knowledge. For example, having receptive knowledge of a word’s spoken form is to know (recognize) how the word sounds. More generally, then, the greatest vocabulary depth one can attain, within this framework, is the mastery of all these 18 different components. Due to the comprehensiveness of this framework, it has informed the work of many vocabulary researchers investigating vocabulary depth. For example, in Schmitt’s (2014) review of the conceptualizations of the vocabulary depth in research, six out of seven operationalizations can be mapped more or less directly to a component in Nation’s (2013) framework. They include receptive versus productive mastery, knowledge of multiple word knowledge components, knowledge of polysemous meaning senses, knowledge of derivative forms (word family members), knowledge of collocation, and the degree, and kind of lexical organization (Schmitt, 2014). Despite the huge popularity of Nation’s (2013) framework, one limitation of taking depth exclusively as word component knowledge is that mastery of knowledge is often seen as a binary. For example, does a learner know the spoken form of the word, yes or no? 
In this light, word component knowledge cannot be easily mapped onto mastery of knowledge, which should be viewed as a continuum (i.e., how well a learner knows [a component of] the word), especially for researchers interested in examining the developmental trajectory of vocabulary knowledge.

Table 1
Nation's (2013) Framework of Vocabulary Knowledge

Aspects of vocabulary knowledge      R/P   What it means to master this aspect?
Form
  spoken                              R    What does the word sound like?
                                      P    How is the word pronounced?
  written                             R    What does the word look like?
                                      P    How is the word written and spelled?
  word parts                          R    What parts are recognizable in this word?
                                      P    What word parts are needed to express the meaning?
Meaning
  form and meaning                    R    What meaning does this word form signal?
                                      P    What word form can be used to express this meaning?
  concept and referents               R    What is included in the concept?
                                      P    What items can the concept refer to?
  association                         R    What other words does this make us think of?
                                      P    What other words could we use instead of this one?
Use
  grammatical functions               R    In what patterns does the word occur?
                                      P    In what patterns must we use this word?
  collocations                        R    What words or types of words occur with this one?
                                      P    What words or types of words must we use with this one?
  constraints on use                  R    Where, when, and how often would we expect to meet this word?
                                      P    Where, when, and how often can we use this word?

Note. R = receptive; P = productive.

Another approach to conceptualizing depth is to view it from a developmental perspective, "from no knowledge to fully developed knowledge" (Yanagisawa & Webb, 2020, p. 373). In other words, a learner can be described in terms of how they develop knowledge from partial to precise knowledge of a word's meaning (Henriksen, 1999; Read, 2004). For example, this progress can mean knowing a word (e.g., pretty) is a positive adjective, before knowing its shared meaning with a close synonym (e.g., beautiful). Eventually, the learner understands some subtle differences between the two (e.g., pretty focuses more on the attractive appearance of a person while beautiful also implies a person's positive inner quality). Similarly, lexical development can also be operationalized as progressing from receptive to productive knowledge, whereby a learner may first recognize a word and know its meaning before they can use it appropriately in terms of grammar and meaning (e.g., Henriksen, 1999; Paribakht & Wesche, 1993; Wesche & Paribakht, 1996). While this receptive-productive trajectory has been a common operationalization of development (e.g., Yanagisawa & Webb, 2020), this continuum implies that the end point of development is accurate and grammatical production of the word. Yet one may wonder whether one's lexical development truly stops at word production. If not, what should be the target for learners? One suggestion is that increasing vocabulary fluency (or automaticity) "should be the ultimate goal for most language learners" (Qian & Lin, 2020, p. 68). In Schmitt's (2014) words, "[a] way of thinking about what learners can do with lexical items is how fluently and automatically the items can be used in each of the four skills (reading, writing, listening, and speaking)" (p. 920). Fluent and automatic use requires what researchers refer to as strength of lexical knowledge (Nation & Webb, 2011; Webb, 2012; Yanagisawa & Webb, 2020). Strong lexical knowledge is a prerequisite for efficient retrieval and access (Godfroid, 2020b; Yanagisawa & Webb, 2020).
Similarly, Perfetti (2007) suggested that adding effective practice to lexical knowledge results in "[processing] efficiency: the rapid, low-resources retrieval" (p. 359). In this light, then, lexical strength represents an important element of the quality of word knowledge, and it should be distinguished from component word knowledge described earlier (Yanagisawa & Webb, 2020). Again, it is a difference between how well one can master a specific aspect of a word (e.g., form-meaning mapping) and how many different aspects of a word one knows (e.g., meaning, collocation). This isolated treatment of lexical strength and fluent use from word component knowledge echoes Daller et al. (2007), who proposed a three-dimensional lexical space with breadth, depth, and fluency. Harrington's (2018) notion of lexical facility similarly encapsulates accuracy, speed, and consistency of access and retrieval, all of which are believed to be the cognitive bases of fluent language use (e.g., Segalowitz & Segalowitz, 1993). Perfetti (2007) also explicitly spelled out the processing consequences of high lexical quality, such as processing stability and synchronicity. Most recently, Godfroid (2020b) proposed to formalize the dimension of fluency of use by expanding Nation's (2013) framework of word knowledge with an additional facet of automaticity. Together, these authors rightly pointed out the importance of considering how well one can use their word knowledge in terms of ease of processing. Note, however, that access and retrieval are two language processing operations. As rightly pointed out by Perfetti (2007), efficient access is a processing consequence of strong, high-quality lexical knowledge. It is well documented in the psycholinguistics literature that language processing (e.g., word recognition) can be influenced by many factors other than the quality of lexical representations in the mind. These factors include one's language dominance, differences and similarities between the two languages of a bilingual, frequency of use, and so on (e.g., Kroll & Tokowicz, 2005). Indeed, psycholinguistic models of the bilingual lexicon often distinguish knowledge representations in memory from processing operations. For example, in the Modified Hierarchical Model (Pavlenko, 2009), second language learning is seen as developing and restructuring conceptual categories that can be L1-specific, L2-specific, or shared (see Figure 1). According to this model, one learning task for L2 speakers is then to disambiguate subtle differences at the conceptual level between and within the two languages. According to Pavlenko (2009), this learning is "a gradual process, taking place in implicit memory" (p. 159), which I will return to in the next section. The key here is that conceptual knowledge, together with the lexical, formal knowledge, is represented in memory. Strong, robust representations are easier to access and retrieve during recognition and production. At the same time, these language processing operations are often carried out under the mutual interaction between the mind and the environment. For example, in most word recognition models, when a word is recognized, its representation needs to have a high level of activation that surpasses that of other word candidates (e.g., Marslen-Wilson & Welsh, 1978; McClelland & Elman, 1986).
This activation level during the recognition process is shaped by, for example, both the extent to which the representations are precise (the mind) and whether the language is being used exclusively in the context (the environment). The bottom line here is, in addition to strong lexical knowledge, which vocabulary researchers are most interested in, processing- related factors influencing activation levels during online processing also play important roles in how vocabulary is used in communication. 10 Figure 1 The Modified Hierarchical Model (Pavlenko, 2009) To summarize this section, I have reviewed the contrast between vocabulary size (focusing on knowledge quantity) and depth (emphasizing knowledge quality). Depth in turn should be conceptualized in two distinct ways: first, knowing the different components of a word (e.g., form, meaning, and use); second, developing strength of such knowledge. It is this second way, of strength, that is the fundamental basis of real-life, efficient access and retrieval in authentic communication settings although other factors can also influence processing. On 11 this account, teachers, researchers, and language testers should focus more on the teaching, learning, and measuring of vocabulary strength. At the same time, there are theoretical questions that need to be addressed. For example, what is the exact relationship between word knowledge and lexical strength? To what extent is vocabulary knowledge as strength a separate dimension from word knowledge? If they are separate, how do they develop? Clarifying these questions will help researchers develop a unified theory of word knowledge development in L2 speakers, which in turn will have pedagogical implications. In particular, a solid theoretical understanding will aid teachers when deciding when to conduct classroom activities to promote fluency. For example, if the development of strength follows a separate trajectory, a teacher may wish to start fluency training early to allow different types of knowledge to develop simultaneously. If strength is an extension of the receptive-productive knowledge continuum, a teacher may delay fluency training until students have some solid receptive and productive knowledge first. Implicit and Explicit Word Knowledge In addition to the distinction between lexical knowledge and strength, researchers have also suggested that there are “two types of [lexical] knowledge” (Nakata & Elgort, 2020, p. 6). These two types have been labelled as explicit vs. implicit, declarative vs. procedural (or non- declarative), and available online vs. offline (e.g., Nakata & Elgort, 2020). This distinction echoes a similar contrast in grammar research, where researchers distinguish explicit and implicit knowledge (e.g., Andringa & Rebuschat, 2015; DeKeyser, 2003; N. Ellis, 2005; Hulstijn, 2005). In a seminal paper, R. Ellis (2005) summarized the characterizations of these knowledge types and proposed a number of operational features in tasks that could be used to measure them. One 12 important feature is particularly relevant to the present context: learners’ awareness of the knowledge. In particular, the learner always has awareness of their explicit knowledge (“they know that they know”), but not their implicit knowledge. At the same time, it is crucial to acknowledge that this criterion used to differentiate explicit and implicit grammar knowledge may not be directly applicable to vocabulary research (Sonbul & Schmitt, 2013). 
As far as the form-meaning link is concerned, for instance, it may be difficult to conceive that a learner has no awareness of a word’s meaning. Indeed, the arbitrary pairings between form and meaning are believed to reside in declarative memory along with other information that the learner is conscious of (e.g., Ullman, 2001). Hulstijn (2007) similarly suggested that vocabulary knowledge is largely explicit because of its symbolic nature (i.e., the form-meaning relationship is arbitrary). For both N. Ellis (1994) and R. Ellis (2004), the semantic components of a word’s knowledge are often explicit, while knowledge related to form and use can be either explicit or implicit. Similarly, Sonbul and Schmitt (2013) pointed out the word’s relationship to the broader linguistic system of the knowledge should be factored into the consideration of explicit and implicit vocabulary knowledge. In sum, the semantics associated with vocabulary knowledge has led researchers to suggest that word knowledge is largely explicit in nature. While there may be a lot of truth in the proposition, the picture is far more complex. First, as briefly mentioned, psycholinguistic models of the bilingual lexicon have attempted to incorporate lexical items and concepts in both languages (e.g., Pavlenko, 2009). While a learner can often verbalize a translation equivalent of a word, they might not be aware of the subtly restructured (L2-specific) conceptual representations in memory that results from extensive 13 language use. Indeed, as acknowledged by Pavlenko (2009), the distinction between implicit and explicit word knowledge “has not yet been incorporated into models of the bilingual lexicon” (p. 150). Therefore, concluding that the semantic components of word knowledge is categorically explicit may be oversimplistic. In addition, the view of lexical knowledge being explicit stems from a narrow definition of what it means to know a word (i.e., knowing a word is only about knowing its meaning). Both SLA researchers (e.g., Meara, 2009; Meara, 1992; Nation, 2013) and psycholinguists (e.g., Jiang, 2015; McNamara, 2005) have viewed the semantic components of lexical knowledge as an interconnected network. For example, in Nation’s (2013) framework, mastering meaning association at the productive level means knowing what other words can be used in a given context to convey a similar meaning. Meara (1992, 2009) also views the lexicon as a network structure where items are connected through associations between them. In psycholinguistics, it is found that activation of a lexical item can spread to other items via a connected, semantic network (e.g., McNamara, 2005). On this account, researchers need to differentiate how well a learner knows a particular item (e.g., its meaning) from how well the item in question is connected to other items (or how well it is integrated in the lexicon). It appears that researchers suggesting that vocabulary knowledge is explicit place more emphasis on the former. However, the latter view of knowing a word (i.e., in terms of integration into the lexicon) is perhaps more comprehensive and deserves more attention in vocabulary research. With regard to learners’ awareness of such integration in the lexicon, the issue is also complicated. On the one hand, knowledge of meaning association can be inside a learner’s awareness. For example, a learner can tell that nurse and doctor are related semantically. 
On 14 the other, psycholinguistic research has also shown that certain knowledge can be involved in language processes that are “not available to learners’ conscious control or report” (Elgort, 2018, p. 4). In other words, when a task does not invoke the learner’s awareness of the knowledge, what is being measured might be considered as implicit. For example, in a psycholinguistic experiment, a brief presentation of a word (e.g., nurse) can improve the recognition and/or production of a semantically related word (e.g., doctor) (e.g., Collins & Loftus, 1975; McRae & Boisvert, 1998). Since the presentation of the first word (i.e., the prime) is very brief, learners often report a lack of awareness of its presentation. In this light, the processing of nurse can be outside the awareness of the learner, but it has implications on the recognition of a subsequent item (doctor). Such a task has been used to investigate the extent to which the mental lexicon is interconnected, and more importantly, no awareness of the meaning association needs to be invoked for the effects to be observed. When that is the case, at least some aspects of this meaning association between nurse and doctor may be implicit. Implicit knowledge such as meaning association between word items can be important in actual language use. According to Nakata and Elgort (2020), for example, implicit or tacit word knowledge “is… needed in fluent, low-effort access to contextually relevant meanings during reading” (p. 7). In the context of listening, a well-established semantic network might help listeners predict upcoming information (e.g., Altmann & Kamide, 1999), potentially lessening some processing burden that can cause a breakdown. However, using awareness as the only criterion to differentiate explicit from implicit knowledge can be “problematic and thorny” even in grammar research (Leow, 2001, p. 118). Recent evidence has also shown that different operationalizations of (un)awareness can become unreliable, leading to potentially different 15 conclusions (Maie & DeKeyser, 2020). Therefore, caution should be exercised when defining explicit vs. implicit knowledge in terms of learner awareness. In this section, I reviewed two key distinctions in conceptualizing the vocabulary construct. The first relates to the contrast between lexical knowledge and strength. The former concerns what a learner knows, and the latter represents the degree of mastery. I also discussed the distinction between explicit and implicit word knowledge and the use of (un)awareness as a criterion to distinguish the two. In the current literature, what might be less clear is how lexical strength is related to implicit knowledge, or if they are related at all. However, one potential link between the two might be ease of access. As established above, ease of access is a key characterization of lexical strength (e.g., Daller et al., 2007; Harrington, 2018). Similarly, implicit knowledge can be accessed relatively effortlessly (e.g., R. Ellis, 2005; Nakata & Elgort, 2020). In this light, then, ease of access can serve as a common ground for lexical strength and implicit word knowledge, representing one criterion in the distinction between (explicit) lexical knowledge and word knowledge strength, the latter of which encapsulates implicit knowledge. Given these theoretical conceptualizations and hypothesized dimensionality, I review the measurement of these dimensions of the vocabulary construct below. 
Measures of Vocabulary Knowledge

In this brief overview of vocabulary measures, I will start by discussing two types of "qualitatively different" measures: traditional and time-sensitive measures (Godfroid, 2020b, p. 433). Then, I will discuss a timed measure of lexical strength and two implicit word measures, which can potentially help researchers empirically operationalize constructs such as vocabulary knowledge strength and implicit word knowledge discussed in the previous section.

Traditional and Time-Sensitive Word Measures

There have been a number of related terms that researchers have used to contrast with traditional, paper-based vocabulary tests. These terms include online vs. offline, implicit vs. explicit, and time-sensitive vs. paper-based, but they refer to different sets of tests despite some overlaps. Although the use of these terms is not always consistent in the literature, three criteria can be used to distinguish them: (1) is it a real-time measure?, (2) is there time pressure imposed on the participant?, and (3) does the task invoke awareness of the knowledge? First, Godfroid (2020b) defines online measures as those that tap into "learners' lexical knowledge during language processing (hence the name online) when there are real time restrictions" (p. 433, emphasis added). The author compared online measures to watching sports in real time where "the detailed happenings of the match… unfold" (p. 433). In other words, these tests are characterized by a real-time element and a time-restriction component. Eye tracking is an example of real-time measures in the sense that eye movements are recorded as the participant is engaged in language use such as reading and listening (see Godfroid, 2020a, for an overview). With regard to time pressure, timed tasks, as in a psycholinguistic experiment that collects reaction time data, require participants to respond quickly on a trial-by-trial basis. Time pressure in psycholinguistics is commonly operationalized in the task instructions, which ask participants to respond as fast as possible, and/or in the programming of the experiment, where a given trial will end after a set amount of time. Also, this pressure is intentionally placed on the participant, so they do not have time to reflect on their responses. As a result, researchers can infer the participant's access to knowledge and potentially the nature of that knowledge. On this account, this time pressure serves an explicit goal in the experimental operationalization and hence is different from a time limit in typical language testing contexts. Often test administrators would, on the one hand, standardize engagement by imposing a time limit and, on the other, ensure the vast majority of test takers have the time to complete the assessment without inducing test-irrelevant variance (e.g., variance that results from test anxiety; Denovan & Dagnall, 2019). Finally, implicit measures target knowledge that is involved in language processes that are outside the learner's awareness. As mentioned, it might involve a very brief presentation of linguistic materials such that the participant is not aware of the presentation. A broader term that encapsulates these tests is sensitive measures. In the present dissertation, I use sensitive measures as an umbrella term to refer to vocabulary tests that impose time pressure on the test taker and those that target implicit word knowledge. These tests are contrasted with traditional vocabulary tests that are explicit, offline, and untimed.
As already alluded to, these task features are in sharp contrast with traditional, paper-based, untimed, explicit word measures that have predominated in the field of vocabulary acquisition (Godfroid, 2020b). These traditional tests are of different formats, ranging from multiple-choice questions and translation (first to second or second to first language) tasks to fill-in-the-blank and matching items. They can be designed so as to target different word knowledge components in Nation's (2013) framework. For example, in the 14k Vocabulary Size Test (Nation & Beglar, 2007), participants are presented with a target lexical item and a neutral sentence context from which the meaning cannot be inferred. The task for the participant is to choose the correct, corresponding meaning out of four options (see an example item in the Measures section in Chapter 2). Another example is the productive Vocabulary Levels Test (Laufer & Nation, 1999), where participants are presented with a sentence context for a missing target word. The task for the test taker is to fill in the blank with a word that fits the context (see an example item in the Measures section in Chapter 2). These offline tests have made a significant contribution to vocabulary research. They have allowed researchers to identify, for example, the relationship between language use and vocabulary (e.g., Cheng & Matthews, 2018; Jeon & Yamashita, 2014; Vandergrift & Baker, 2015). For example, Vandergrift and Baker (2015) showed that second language (L2) vocabulary was most directly and strongly associated with L2 listening performance. Similarly, in reading research, Jeon and Yamashita (2014) meta-analyzed 31 independent correlations between L2 reading comprehension and L2 vocabulary knowledge from 29 studies. The authors found a high correlation of .79, 95% CI [.69, .86]. In addition, these offline tests also provide an outcome measure for researchers to understand how different learning conditions (e.g., various types of glossing) can impact vocabulary learning (e.g., H. S. Kim et al., 2020; Ramezanali et al., 2021; Yanagisawa et al., 2020). For example, Yanagisawa et al. (2020) meta-analyzed 359 effect sizes in 42 studies and found an overall advantage of glossing (vs. no glossing), with multiple-choice glosses (where only one of the different senses presented fits the current context) being the most effective. Despite the contribution of these tests, Godfroid (2020b) argued that researchers should consider the face validity of these tests. For example, one should assess the alignment between how vocabulary knowledge is measured and how it is used in real-life communication. L2 listening comprehension is a case in point. Since listeners have relatively less control over the speed of the incoming speech stream (Hui & Godfroid, 2020; K. M. Kim & Godfroid, 2019), the efficiency with which the listener's cognitive processor can manage the flood of information can be a key to successful comprehension (Hui & Godfroid, 2020; Vafaee & Suzuki, 2020). On this account, possessing word knowledge of spoken word form and the form-meaning connection, as can be demonstrated on an offline vocabulary test, is perhaps only one necessary condition for comprehension. Efficiently putting that knowledge to use (i.e., efficiently processing phonological and semantic information) may also be necessary.
If that is the case, when tested without time pressure, learners' ability to point out the meaning of a word (e.g., in the 14k Vocabulary Size Test) may not entirely coincide with their efficient access to such knowledge in authentic communication. Indeed, recent evidence has shown that performance in a timed lexical task can account for some unique variance in L2 listening comprehension at both propositional and discourse levels (Hui & Godfroid, 2020), meaning that efficient access to lexical knowledge carries explanatory power above and beyond vocabulary size measured by offline tests. Similarly, in reading research, Tanabe (2016) reported that reaction times in a computerized vocabulary test, but not vocabulary test scores (i.e., accuracy scores) alone, were a significant predictor of reading comprehension under time-pressured conditions. His results echo the idea that fluent reading requires efficient access to the meaning of words with minimal effort, which can in turn free up cognitive resources for high-level processing (e.g., Elgort et al., 2018; Nakata & Elgort, 2020; Perfetti, 2007). On this account, then, time-sensitive measures that are of a different nature appear to have the potential to play complementary roles in measuring vocabulary in a comprehensive manner. On the contrary, using exclusively offline word measures can cause some aspects of lexical skills to be under-represented and hence introduce bias to the vocabulary construct (e.g., Révész & Brunfaut, 2021). Importantly, these measures, due to their own unique characterization, provide potential empirical operationalizations of such dimensions as lexical strength and implicit word knowledge in the vocabulary construct. Below, I present a brief overview of one timed lexical measure of strength and two implicit lexical tasks.

Timed Lexical Measure of Strength

Yes-No RT Test. Although the Yes-No test format, which requires test takers to select the words they know from a list, dates back to 1929 (Beeckmans et al., 2001), Meara and Buxton (1987) were the first to use this format as a vocabulary test for second language learners. Traditionally, accuracy data on real words and non-words provide measures of vocabulary knowledge and the level of guessing, respectively. In particular, correct yes responses to real words (hits) are used to infer one's vocabulary knowledge, while incorrect yes responses to non-words (false alarms) provide information regarding the test taker's level of guessing. A more reliable estimate of one's lexical knowledge often involves both hit and false alarm measures, whereby adjustment for guessing is factored into the scoring (see Huibregtse et al., 2002, for a comprehensive review). This format of two-option forced choices (i.e., Yes vs. No) is flexible because it can be seen as aligning with how reaction times are typically measured in psycholinguistic research. For example, participants can be asked to indicate whether a sentence presented to them is grammatically acceptable or not by pressing the corresponding Yes or No button on a response pad as quickly and accurately as possible (for a methodological review, see Plonsky et al., 2020). In a lexical decision task, participants decide whether or not a presented letter string forms a word, which is similar to a Yes-No test when programmed to collect reaction times.
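To make the scoring logic just described concrete, the sketch below shows one simple way of turning Yes-No responses into a guessing-adjusted score. It is a minimal illustration in Python; the column names (participant, is_word, said_yes) are assumptions for the example, and the hit-minus-false-alarm difference is only the simplest possible adjustment, which the more principled correction formulas reviewed by Huibregtse et al. (2002) could replace.

```python
import pandas as pd

# Hypothetical trial-level Yes-No data: one row per item judged by a participant.
# "is_word" marks real words vs. non-words; "said_yes" is the test taker's response.
trials = pd.DataFrame({
    "participant": ["P01"] * 8,
    "is_word":  [True, True, True, True, False, False, False, False],
    "said_yes": [True, True, True, False, True, False, False, False],
})

def yes_no_scores(responses):
    words = responses[responses["is_word"]]
    nonwords = responses[~responses["is_word"]]
    hit_rate = words["said_yes"].mean()             # yes responses to real words
    false_alarm_rate = nonwords["said_yes"].mean()  # yes responses to non-words
    return pd.Series({
        "hit_rate": hit_rate,
        "false_alarm_rate": false_alarm_rate,
        # Simplest illustrative adjustment: penalize a yes bias.
        "adjusted_score": hit_rate - false_alarm_rate,
    })

print(trials.groupby("participant").apply(yes_no_scores))
```

When the same task is programmed to record response latencies, reaction times on these trials can be summarized alongside the accuracy-based scores, which is what makes the format attractive as a timed measure.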
Indeed, Harrington (2006) proposed using reaction time data from lexical decisions as vocabulary measures to index both accuracy and speed of access to word knowledge. In the study, the author reported that higher proficiency learners tended to respond faster and more accurately. Similarly, Pellicer-Sánchez and Schmitt (2012) collected and analyzed reaction time data on a Yes-No RT test. The authors compared a reaction time-based scoring approach and a non-word approach, whereby they adjusted for guessing based on responses to non-words. For the reaction time-based approach, the authors established reaction time thresholds (e.g., 577.27 ms for non-native speakers) to discriminate between accurate and inaccurate responses. However, the authors found no clear advantage of using the reaction time approach. In Hui and Godfroid (2020), the authors used an auditory Yes-No RT test to investigate the relationship between lexical strength and second language listening comprehension. They used accuracy, reaction time, and the coefficient of variation (CV, where CV_RT = SD_RT / M_RT; Segalowitz & Segalowitz, 1993) to index vocabulary size, lexical processing speed, and automaticity, respectively. Regression and subsequent mediation analyses showed that accuracy and reaction times on the test predicted L2 listening comprehension at both propositional and discourse levels. To date, use of the Yes-No RT test is still rather limited in SLA research, but such a timed test allows researchers to make inferences about one's efficiency of accessing lexical information. At the same time, it is not entirely clear how reaction time data, which are believed to unveil one's lexical strength, are related to other traditional, paper-based tests, which entirely ignore time pressure as a condition in real-life communication. Pellicer-Sánchez and Schmitt (2012) were the first to try to establish such relationships between timed and untimed tests, but more validation work needs to be conducted to scrutinize the underlying construct(s) that the different measures afforded by the test are tapping into.

Implicit Word Measures

Masked Repetition Priming. The priming paradigm is commonly used in psycholinguistic and bilingualism research (e.g., Trofimovich & McDonough, 2011; VanPatten & Jegerski, 2014). In general terms, the mechanism of the paradigm involves prior exposure to linguistic information (i.e., the prime) influencing (often facilitating) subsequent recognition and/or production of language. In the case of masked repetition priming, the facilitation has been a well-established phenomenon in both L1 and L2 speakers (e.g., Evett & Humphreys, 1981; Forster & Davis, 1984; Gollan et al., 1997; Jiang, 1999). At the behavioral level, participants have been found to respond faster to the target word when it is preceded by an identical prime than when the target word is preceded by an unrelated, non-identical prime. Two aspects of this phenomenon are important: first, the prime is presented only very briefly and is masked (e.g., preceded and/or followed by symbols such as a string of hash signs [#]), often resulting in a reduced visibility of the prime and a lack of awareness by the participant that it is there. Therefore, the processes underlying this phenomenon are considered automatic and operate largely out of the conscious control of the participant. Although the decisions the participant makes about the targets are conscious, the processes that this priming task taps into are not.
Second, masked repetition priming is not found on non-word trials, indicating that the processes which drive this effect operate at the lexical level (e.g., Forster, 1998), as opposed to the sub-lexical level (i.e., letter forms, or spelling). To account for this effect, then, researchers proposed that, in an experimental trial, the prime pre-activates the lexical representation carrying both formal-lexical and semantic information. This pre-activation in turn facilitates the recognition of the target (e.g., Grainger et al., 2003), leading to faster responses. In the context of vocabulary learning, when a learner has acquired a word in the sense of having established a lexical representation, such facilitation (i.e., priming) should be observed. In contrast, if the lexical representation is not robust, or if no lexical representation has been established, such priming may not be observed. Leveraging this phenomenon, researchers can then examine vocabulary knowledge and acquisition via lexical processes that lie outside the learner's awareness. Importantly, this fine-grained measure represents a tool to investigate the extent to which a certain learning condition may shape vocabulary acquisition. For example, when an experimental learning condition is less conducive to learning, reduced (or eliminated) priming can be expected (e.g., Elgort, 2017).

While masked priming has been well documented in the psycholinguistic literature, Elgort (2011) was the first researcher to use the task as a vocabulary measure in second language research. In the study, the author examined the extent to which deliberate, decontextualized word learning can lead to the development of tacit word knowledge. Participants learned 48 pseudowords using flash cards and word lists. After one week of study, they returned to the lab and took part in three priming tasks: a form priming, a masked repetition priming, and a semantic priming task. Focusing on the masked repetition priming task, results showed that participants responded on average 52 ms and 75 ms faster on the related, repetition trials (than on the unrelated trials) for the newly learned pseudowords and the low-frequency real words, respectively. As expected, no facilitation was found for the non-word trials. The robust priming observed for the newly learned pseudowords led the author to conclude that deliberate, decontextualized word learning can result in tacit word knowledge, as measured by the priming task (see also Elgort & Piasecki, 2014). In another study, Elgort (2017) investigated the extent to which the accuracy of initial word meaning inferences under contextual learning conditions has an impact on subsequent word learning outcomes. Participants were first exposed to target words embedded in informative sentences. They were asked to make inferences based on the context, after which they were presented with the correct meaning for verification. During the testing phase, the participants took part in, among other tests, a mixed-modality priming task where the masked prime was presented visually, while the target was presented auditorily. Results showed robust priming in the sense that participants responded faster on the related trials (where the visual and auditory stimuli corresponded) than on the unrelated trials. However, this priming was not moderated by the accuracy of the initial meaning inferences the learner had made in the learning phase when reading the informative sentences.
Based on the lack of a significant interaction, the author concluded that incorrect meaning inferences "appear to be benign as far as the development of implicit knowledge and establishment of lexical representations are concerned" (Elgort, 2017, p. 8). In sum, the masked repetition priming paradigm has been adopted in L2 vocabulary research to gain insights into the extent to which learners have established a robust lexical entry for the word items tested. An observed priming effect is taken as evidence of implicit word knowledge and as an indication that a lexical representation has been established in the mental lexicon.

Semantic Priming. A semantic priming task is another test in the general priming paradigm commonly used in psycholinguistic research. Semantic priming can be used to evaluate the robustness of the semantic representations of the stimuli. As with repetition priming, facilitation (priming, or faster responses) is expected when the prime and the target are semantically related. The idea is that, on related trials, the prime activates a semantic network which overlaps with that of the target. This activation then facilitates the recognition of the target, leading to faster responses. On unrelated trials, since there is no semantic overlap, no priming should be observed. This semantic priming effect is, again, well documented in psycholinguistic research and is taken as evidence that the semantic information of word items is interlinked (e.g., Collins & Loftus, 1975; McRae & Boisvert, 1998). In vocabulary research, then, priming should be observed when the participant (1) has integrated the stimuli into the semantic network, (2) can access these lexical-semantic representations fluently, and (3) processes the primes and targets on the related trials as semantically similar (e.g., Elgort & Warren, 2014).

In Elgort's (2011) study, reviewed previously, the author found a 22-ms semantic priming effect, indicating that the participants had acquired the lexical-semantic representations for the pseudowords that they had learned in a deliberate, decontextualized manner. This level of priming was compared with the 37-ms facilitation when the prime was a known English word. The author suggested that "these representations [for the pseudowords] were probably less stable than those of known L2 words and that their integration into the lexical-semantic memory system of the participants was in its early stages" (Elgort, 2011, p. 394). In another study, learners did not show reliable semantic priming after encountering pseudowords embedded in a text multiple times under incidental word learning conditions (Elgort & Warren, 2014). There was, however, an interaction whereby the degree of priming depended on the age at which the participant had started learning the target language (i.e., English). Specifically, there was "faster and more robust lexicalization… for those who started learning English earlier in life" (Elgort & Warren, 2014, p. 396). Similarly, Bordag et al. (2015) and Chen (2021), in two incidental word learning studies, reported a lack of robust semantic representations after multiple exposures to target novel words. Taken together, these findings suggest that semantic integration represents a high bar for learners as far as word learning is concerned. When learners study words intentionally, semantic priming can be observed.
In contrast, if vocabulary learning takes place under incidental conditions, the integration (if any at all) may not be robust enough to be detected by a semantic priming task. In more general, methodological terms, the semantic priming task has been used as another implicit vocabulary measure to examine the extent to which semantic representations of word items have been established in the mental lexicon.

Taken together, the adoption of these time-sensitive measures in the investigation of lexical strength and implicit word knowledge and learning has expanded the assessment toolbox of vocabulary researchers. Researchers are now in a better position to address questions pertaining to implicit word knowledge and efficient access to word knowledge. Importantly, they can answer the calls from vocabulary scholars to use multiple measurements concurrently to better understand the construct of vocabulary knowledge at a theoretical level (e.g., Milton & Fitzpatrick, 2014; Read, 2020; Schmitt, 2010; Webb, 2005; Yanagisawa & Webb, 2020). At the same time, it is crucial to understand what exactly is being measured by each of these time-sensitive measures. To date, the use of these measures has been advocated through argumentation (e.g., time pressure is present in authentic communication, and so timed tasks need to be administered to add face validity to the measures). Little work has provided empirical evidence that imposing time pressure qualitatively changes the knowledge being measured and/or that using an implicit task can measure what explicit tests cannot tap into. Perhaps more importantly, researchers must understand how exactly these time-sensitive measures can complement traditional ones. For example, might it be the case that these time-sensitive tests simply represent a more difficult set of tasks (or a higher learning target), but ultimately tap into the same dimension of word knowledge as that measured by explicit tests? Alternatively, do these reaction time-based tasks represent measures that are qualitatively different from explicit tests? At a practical level, clarifying how the different tools at researchers' disposal relate to each other will enable more informed decisions regarding what sets of measures to administer. In theoretical terms, engaging in measurement validation work makes conceptual claims empirically testable. Two specific instances are the extent to which lexical strength is a separate dimension of word knowledge and the extent to which explicit and implicit word knowledge are distinct. Below, I review two strands of measurement validation studies conducted in second language research.

Measurement Validation

Researchers relying on quantitative data to address research questions and test hypotheses need valid measurement tools. Validity, in its most general sense, is the extent to which a measure measures what it purports to measure (e.g., Sireci, 2009). At the same time, validity should be conceptualized as the quality of a test that is relevant to the interpretation and use of its results (Messick, 1989). In this light, then, the validation process involves collecting evidence to corroborate or dispute these interpretations and uses (Messick, 1989). In a seminal paper, Messick (1989) foregrounded construct validity in his unified, single view of validity (as opposed to a multifaceted view of validity, such as content and criterion-related validity).
In other words, when interpreting test results, to what extent do these interpretations relate to the construct in question? In order to focus on interpretations of test results, one needs to first understand the inferences test users and researchers make from the assessment (Chapelle, 1999, 2021). Recently, Chapelle (2021) provided a list of seven types of inferences that are commonly made. The most relevant to the present context is domain definition. The question at hand is the extent to which the domain of vocabulary knowledge is "adequately analyzed to create test tasks that elicit relevant performance" (Chapelle, 2021, p. 16). As established above, certain aspects of the vocabulary construct are not well captured by traditional, offline tasks. In that sense, the limits on the domain coverage of these tests cast doubt on the interpretations of their results (e.g., that those scoring high have higher lexical skills) because, for example, lexical strength is under-represented (if represented at all). Now that time-sensitive word measures are available, researchers can expand the domain coverage of the vocabulary construct, and at the same time refine its definition and evaluate the extent to which these time-sensitive measures accurately summarize performance. However, to date, there has been a limited number of construct validation studies involving comparisons among multiple measures in vocabulary research. Therefore, in this section, I will first review a somewhat parallel literature of construct validation studies in grammar research before returning to validation work that has been conducted in the domain of vocabulary.

Measurement Validation Studies in Grammar Research

As mentioned in the section on implicit and explicit word knowledge, many grammar researchers differentiate explicit from implicit knowledge (e.g., Andringa & Rebuschat, 2015; DeKeyser, 2003; N. Ellis, 2005; Hulstijn, 2005). In other words, the construct of grammatical knowledge is believed to be two-dimensional in nature, at least according to these authors. To test this psychological dimensionality, researchers need measures that have a corresponding psychometric dimensionality (Henning, 1992). Put differently, researchers need separate, independent items and/or tests that tap into these two psychological dimensions (i.e., explicit and implicit knowledge). Caution is required when psychological and psychometric dimensionality do not correspond, potentially suggesting that certain dimensions cannot be measured or have not been theorized appropriately. Therefore, identifying this correspondence between psychological and psychometric dimensionality has been the theme of this line of validation studies (e.g., R. Ellis, 2005; R. Ellis & Loewen, 2007; Godfroid & Kim, in press; Gutiérrez, 2013; Spada et al., 2015; Suzuki, 2017; Vafaee et al., 2017). In a seminal paper by R. Ellis (2005), the author administered a total of five tests, two of which were hypothesized to measure explicit knowledge and three of which were designed to tap into implicit knowledge. In the operationalization of the explicit measures, participants made judgments on the grammaticality of sentences presented to them. In this grammaticality judgment task, no time pressure was imposed on the test takers. Participants also completed a meta-linguistic knowledge test where they identified and explained grammatical errors.
For the implicit measures, participants engaged in an oral production task and an elicited imitation task where they orally repeated spoken sentences (some containing grammatical errors) in correct English. Finally, in a separate grammaticality judgement task, they were placed under time pressure as they responded to the stimuli. The results of these tasks were subjected to a principal component analysis, which showed two underlying components in what had been measured by these tasks, aligning well with the initial hypothesis (i.e., the implicit and explicit measures tapping into implicit and explicit knowledge, respectively). Researchers have followed up on the findings reported by R. Ellis (2005) (e.g., Gutiérrez, 2013; Spada et al., 2015; Suzuki, 2017; Vafaee et al., 2017). They often administer a battery of tests and conduct a factor analysis on the test results to assess the correspondence between the psychological (e.g., implicit vs. explicit knowledge) and psychometric dimensionality of the battery (e.g., a two-factor solution in the factor analysis). For example, Gutiérrez (2013) adapted a subset of tasks from R. Ellis (2005): timed and untimed written grammaticality judgement tasks and a meta-linguistic test. The author separately submitted the grammatical and ungrammatical items on the two judgement tasks (four measures), together with the meta-linguistic test, to a confirmatory factor analysis. Results showed that the ungrammatical items from both judgement tasks and the meta-linguistic test loaded onto a factor the author labelled explicit knowledge. The grammatical items, on the other hand, loaded onto another factor called implicit knowledge. Based on these results, the author suggested that the grammaticality of items can determine whether a task is an explicit or implicit measure. In addition, Gutiérrez (2013) found insufficient evidence that time pressure alone plays a similar role in determining whether a task measures explicit versus implicit knowledge.

This line of work has been important to SLA researchers because it provides validity evidence for a particular (set of) task(s). Researchers interested in the development of a certain type of knowledge may take advantage of this literature when deciding on their own measures. In a study by Issa et al. (2020), for example, the authors administered an elicited imitation task and acceptability judgement tasks to measure the grammatical development of their participants during study abroad. The validation groundwork laid by previous research allowed Issa et al. (2020) to claim that they had covered key domains of grammar knowledge in their measurement. Another advantage of these validation studies is that they demonstrated how grammatical knowledge may be modelled at the latent (i.e., unobserved) level, essentially allowing for the measurement of theoretical constructs via real-world measures and for a test of whether or not real-world measures tap into the theoretical constructs. In addition, researchers can use multiple measures to distill a purer measure of grammatical knowledge by accounting for measurement error in individual tests and for the different degrees of strength between the measures and the latent construct (Brown, 2015). Methodological discussions of this literature have also highlighted the principled use of confirmatory factor analysis (R. Ellis & Loewen, 2007; Isemonger, 2007; Vafaee & Kachinske, 2019), informing subsequent researchers of best statistical practices in this line of work.
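To illustrate how such a dimensionality test can be specified, the following R sketch (using the lavaan package) fits a hypothetical one-factor model and a two-factor explicit/implicit model to five observed test scores and compares their fit. The data frame d and the indicator names (mkt, gjt_untimed, gjt_timed, eit, op) are placeholders for illustration only, not the measures or scripts used in any of the studies reviewed.

library(lavaan)

# Hypothetical data frame 'd' with one column per test score
one_factor <- '
  knowledge =~ mkt + gjt_untimed + gjt_timed + eit + op
'
two_factor <- '
  explicit =~ mkt + gjt_untimed
  implicit =~ gjt_timed + eit + op
'

fit1 <- cfa(one_factor, data = d, std.lv = TRUE)  # latent variance fixed to 1
fit2 <- cfa(two_factor, data = d, std.lv = TRUE)  # factors correlated by default

summary(fit2, fit.measures = TRUE, standardized = TRUE)
anova(fit1, fit2)  # chi-square difference test between the nested models

A two-factor solution that fits better than the one-factor solution, together with a factor correlation well below unity, would support treating explicit and implicit knowledge as psychometrically distinct; comparable fit, or a near-perfect factor correlation, would not.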
Building upon the review of validation studies in grammar, I now return to two validation studies in vocabulary research.

Measurement Validation Studies in Vocabulary Research

Although validating vocabulary tests is not new (e.g., Schmitt et al., 2020), there have been only two published validation studies that focus on the construct validity of a battery of tests in L2 vocabulary research: González-Fernández and Schmitt (2020) and Koizumi and In'nami (2020). Using confirmatory factor analyses, these authors tested the extent to which different vocabulary measures can separately tap into different constructs under the word knowledge component approach to vocabulary depth (e.g., Schmitt, 2014). Their approach, therefore, mirrors the one taken by grammar researchers seeking to identify a correspondence between psychological and psychometric dimensionality. In González-Fernández and Schmitt's (2020) study, the authors were interested in how relationships between different word knowledge components should be conceptualized. In other words, to what extent are different aspects of word knowledge distinct? Participants were one hundred and forty-four Spanish learners of English. They completed a total of eight tasks, tapping into four word knowledge components (i.e., form-meaning link, derivatives, multiple meanings, and collocations) at two levels of sensitivity (i.e., recall and recognition). There were a total of 20 target words, repeated across all the tests, which ranged in frequency from the first to the ninth 1000 most frequently occurring words (1K – 9K). The authors initially hypothesized a second-order model where the highest, second-order latent variable represented vocabulary knowledge in a general sense. The indicators of this latent variable were four first-order latent variables, representing the four word knowledge components. Each of these four first-order latent variables had two observed indicators (its respective recall and recognition tests). However, the model-implied variance-covariance matrix did not replicate the characteristics of the data, indicating that the hypothesized model was not an acceptable representation of the data. In other words, a correspondence between psychological and psychometric dimensionality was not found. The authors then revised the model specification such that the eight indicators (four word knowledge components at two levels of sensitivity) loaded onto one factor named Vocabulary Knowledge. In addition, the authors added correlations between the residuals of the recognition and recall tests for each word knowledge component. The authors reported a good fit for this model. Based on the factor loadings (.81 – .93), the authors pointed to "all word knowledge aspects mak[ing] a large contribution to the explanation of the Vocabulary Knowledge construct" (González-Fernández & Schmitt, 2020, p. 497). Although the initially proposed multi-dimensionality of word knowledge based on different word knowledge components was not empirically supported, this study represented the first of its kind, and it has shed important light on this area of research. For example, depth was conceptualized only as word knowledge components (i.e., knowledge of more word components signaling greater depth of knowledge). As the authors alluded to, the relationships between recognition and recall can be further examined from a developmental perspective, essentially investigating the extent to which, for example, receptive and productive knowledge can be distinctly measured.
This follow-up would then conceptualize the construct of depth as a developmental trajectory, in line with what has been discussed previously (e.g., Schmitt, 2014). More generally, this approach to measurement validation can be applied to test competing conceptualizations of vocabulary knowledge. For example, researchers can use this approach to address the long-standing question of the relationship between vocabulary size and depth (e.g., Schmitt, 2014), which was exactly the aim of Koizumi and In'nami's (2020) study. In Koizumi and In'nami's (2020) study, 225 Japanese learners of English took a total of five vocabulary tests: one measuring vocabulary size and four tapping into vocabulary depth, again operationalized as word knowledge components (i.e., word association, polysemy [L1 to L2 and L2 to L1], and collocation). Words differed across the tasks but were all sampled from a wordlist compiled by a local teacher association (i.e., the Japan Association of College English Teachers). The authors further broke down the size test into three frequency levels to be submitted to the confirmatory factor analysis as different indicators. In total, then, there were seven indicators (three from the size test and four from the depth tests). The one-factor model had a latent variable named size and depth, onto which all seven indicators loaded. This model represented a unified, single construct of vocabulary knowledge. The two-factor solution had correlated, but separate, factors for size and depth. The three indicators from the size test loaded onto the size factor, while the four tests of depth loaded onto the depth factor. In terms of the results, both the one- and two-factor solutions produced a good fit, indicating that they were both good representations of the data. The two-factor solution had a better fit as assessed by common fit indices (e.g., CFI, RMSEA, SRMR), AIC, and a chi-square difference test. These results suggested that the measures demonstrated a psychometric dimensionality that echoes the theorization of size and depth. Despite the evidence for distinct dimensions of size and depth, the correlation between the two at the latent level was .945. Since the use of confirmatory factor analysis takes measurement error into account, this correlation estimate is more accurate than the estimates in previous studies that correlated individual size and depth measures (for a review, see Schmitt, 2014). On this account, then, this high correlation may point to a lack of practical significance in differentiating size from depth, given the lack of divergent validity.

Taken together, the two studies reviewed have investigated the dimensionality of vocabulary measures and have focused on vocabulary depth operationalized as multiple word knowledge components. Appropriately for their studies, these authors relied exclusively on explicit word measures. In this light, there is much room to investigate the extent to which the other conceptualizations reviewed in the first section of this chapter (e.g., lexical strength and implicit word knowledge) can be measured in a psychometrically distinct manner. In other words, this line of work will shed important light on how well word measures can be mapped onto the theoretical conceptualizations of vocabulary knowledge as a construct. For example, are traditional and time-sensitive measures of vocabulary psychometrically distinct?
Should vocabulary strength (which underlies fluent use of language) be conceptualized as a separate, independent dimension (e.g., Daller et al., 2007; Harrington, 2018; Yanagisawa & Webb, 2020)? Are explicit and implicit word knowledge (e.g., Elgort, 2011; 2018) distinguishable at the behavioral level? As alluded to earlier, an additional advantage of this line of research is that it will also provide a solid basis for researchers to take advantage of multiple measures in modeling vocabulary knowledge at a latent level. In this regard, it will be useful to understand the predictive validity of the vocabulary construct under different conceptualizations.

The Present Study

In light of the literature reviewed in this chapter, I conducted the present construct validation study of a battery of six vocabulary measures. There were two overall goals: first, to assess the alignment between the psychological dimensionality of word knowledge, as conceptualized in different ways by vocabulary researchers, and the psychometric dimensionality of its measurement, as tested statistically in the present study; second, to examine the predictive validity of the vocabulary construct under different conceptualizations. This study will adduce empirical evidence for the construct validity of time-sensitive measures of vocabulary that are believed to be qualitatively different from traditional explicit tests. This evidence will complement the argumentation approach that researchers have taken to contrast (explicit) knowledge with lexical strength and implicit word knowledge (e.g., Godfroid, 2020b). The predictive validity evidence will also inform researchers which conceptualization is superior in terms of accounting for individual differences in general proficiency. I formulated the following research questions to guide the study:

RQ1a: To what extent do implicit word measures demonstrate a distinct psychometric dimension from explicit word knowledge measures?
RQ1b: To what extent do time-sensitive measures of lexical strength demonstrate a distinct psychometric dimension from untimed word knowledge measures?
RQ2a: How well can the vocabulary construct conceptualized as explicit and implicit knowledge predict self-reported general proficiency?
RQ2b: How well can the vocabulary construct conceptualized as word knowledge and strength predict self-reported general proficiency?

CHAPTER 2: METHODOLOGY

In this chapter, I present the methodology used to address the research questions of the present study. I detail information on the participants, the critical words, the measures, and the procedure. At the end of the chapter, I also present the data analysis plan.

Participants

Given the current research aims, I engaged in sample size planning based on the overall model fit of a hypothesized CFA model (e.g., K. H. Kim, 2005), as opposed to the power required to detect an effect (i.e., a significant regression path) (e.g., Muthén & Muthén, 2002). In the initially hypothesized, two-factor CFA model, I had six observed variables (degrees of freedom available = 6(6 + 1)/2 = 21) and 13 freely estimated parameters. Although one measure was later dropped, which led to a revision of the model specifications (see Measures and Data Analysis below), these numbers meant that the initially hypothesized model had eight degrees of freedom (21 - 13 = 8). I used this information for sample size planning prior to data collection.
The formulas provided by K. H. Kim (2005) suggested that the desirable sample size ranged from 127 to 752, depending on the chosen fit indices, the strength of the factor loadings (λx), and the desired fit and power levels. Table 2 summarizes these recommended sample sizes based on some conventional criteria of model fit (e.g., < .05 for the Root Mean Squared Error of Approximation [RMSEA] and > .95 for the Comparative Fit Index [CFI]; Hu & Bentler, 1999).

Table 2
Desired Sample Sizes Based on Model Fit

Power | RMSEA | CFI | λx | Desired Sample Size (N)
.80 | .05 | — | — | 752
.80 | — | .95 | .80 | 127
.80 | — | .95 | .60 | 429

Note. RMSEA = Root Mean Squared Error of Approximation; CFI = Comparative Fit Index; λx = factor loadings in the factor analysis model.

Given this range of sample sizes, one practical determining factor appeared to be the expected strength of the factor loadings. I then consulted the two previous studies using a similar data analysis approach in this research field (i.e., L2 vocabulary studies): González-Fernández and Schmitt (2020) and Koizumi and In'nami (2020). In both studies, the factor loadings of the final reported model were above 0.80. At the same time, it was important to note that these studies included only accuracy-based measures. In addition to such measures, the present study incorporated a number of reaction time-based measures. These reaction time-based measures, according to L2 grammar research using a similar analytic approach, could have much lower factor loadings despite a satisfactory global fit (e.g., Suzuki, 2017). These low loadings could potentially result from relatively low reliability levels when experimental tasks are used to index individual differences between participants (e.g., Draheim et al., 2019; Rouder & Haaf, 2019). Therefore, it was less than straightforward to form an accurate a priori expectation of the factor loadings at the sample size planning stage. In this regard, this procedure also highlighted the difficulty researchers face in obtaining sufficient, useful information for an accurate sample size plan, especially when a study is the first of its kind (e.g., Brysbaert, 2020). Considering (1) the sample sizes in González-Fernández and Schmitt (2020) (N = 144) and Koizumi and In'nami (2020) (N = 255), (2) the sample size range returned by the sample size planning procedure based on model fit (see Table 2), and (3) practical considerations in terms of available funds and time (e.g., Loewen & Hui, 2021), I planned to recruit 150 participants.

In the end, one hundred and forty-five participants took part in the experiment. They were sampled from the international student population at Michigan State University. The participants were undergraduate and graduate students who majored in various disciplines and were speakers of a variety of first languages (L1s), including but not limited to Mandarin Chinese (32%), Hindi (16%), Vietnamese (6%), Korean (4%), and Marathi (4%). Their demographic information, such as age and length of residence in the US, is presented in Table 3. As also noted by González-Fernández and Schmitt (2020), the analysis required a sample with a reasonably large range of proficiency levels and hence sufficient between-participants variance in the data. Therefore, I strategically included, through different means of recruitment, participants at different levels of study (e.g., freshmen, seniors, and MA and PhD students) and with different lengths of residence in the US.
Table 3
Demographic Information About the Participants

Variable | Mean (SD)
Age | 24.97 (5.37)
Length of residence (in years) | 3.16 (2.87)
Frequency of English use (overall) 1 | 6.44 (2.53)
Frequency of English use (past week) 1 | 6.71 (2.06)
Frequency of English use (past month) 1 | 6.79 (2.11)
Self-rated proficiency 2 | 7.0 (1.69)

Level of current study
Undergraduate | 44%
Graduate (e.g., MA, PhD, or Professional Degrees) | 56%

Notes. 1 Participants self-reported on a sliding scale where 1 represented "Never" and 10 meant "Always"; 2 participants self-reported on a sliding scale where 1 represented "Total beginner" and 10 meant "Native-like".

All participants received monetary compensation for their time. Ethical clearance was obtained from the Institutional Review Board (IRB) according to our university's regulations governing research involving human participants.

Critical Words

In choosing the critical words, I considered two factors: the number of words required and their frequency levels as appropriate for the participants (L2 learners in an American university setting). First, considering the number of critical words, I referred to both González-Fernández and Schmitt (2020) and Koizumi and In'nami (2020). I considered these studies to be relevant because they employed the same statistical analyses to address research questions very similar to those of the present research. In the study by González-Fernández and Schmitt (2020), the authors included 20 target words, while Koizumi and In'nami (2020) had 20 to 40 items depending on the test. Since reaction time-based measures (see Measures below) typically have a low level of reliability, Siegelman et al. (2017) suggested that researchers should increase the number of trials as a way to improve the ability of such tasks to discriminate among individuals of varying ability levels. In addition, a relatively larger number of words would allow room for the removal of unsatisfactory items that demonstrate poor psychometric properties. At the same time, researchers need to avoid exploiting participants' time and effort. Having participants take part in an unnecessarily long experiment can be an ethical issue (e.g., Loewen & Hui, 2021). All considered, I decided to start with 40 critical words.

In terms of the composition of the critical word set, I considered the expected vocabulary size of the sample and hence the expected variability in the data (i.e., the individual differences in participants' lexical proficiency). I took two pieces of information into account: first, non-native Ph.D. students are estimated to have a vocabulary of 9,000 word families (Nation, 2012). Second, with an inclusion criterion of a 9,000-word-family vocabulary size, Godfroid et al. (2018) reported their participants' vocabulary sizes to be between 9,100 and 12,200 word families as measured by the 14k Vocabulary Size Test (VST) (Nation & Beglar, 2007). Given that (1) my sample included both undergraduate and graduate students, (2) the VST is a receptive vocabulary test (i.e., easier than a productive task), (3) reaction time-based effects might be more difficult to detect, and (4) items in an experimental task used as an individual differences measure should vary across different difficulty levels (Siegelman et al., 2017), I decided to include words from the K2 to K5 frequency bands as the critical words for the present study, representing a reasonably wide range with appropriate difficulty levels.
Table 4
List of Critical Words

2nd 1000 Level (K2) | 3rd 1000 Level (K3) | 4th 1000 Level (K4) | 5th 1000 Level (K5)
maintain | soldier | compound | deficit
stone | restore | latter | weep
upset | jug | candid | nun
drawer | scrub | tummy | haunt
patience | dinosaur | quiz | compost
cap | strap | input | cube
pub | pave | crab | miniature
circle | dash | vocabulary | peel
microphone | poverty | remedy | fracture
pro | lonesome | allege | bacterium

After deciding on the frequency bands of the critical words, I initially adopted the forty items between K2 and K5 from the 14k Vocabulary Size Test (Nation & Beglar, 2007), which was developed from the spoken section of the British National Corpus, as the critical words for the present research. During piloting, I found that participants (N = 24) scored especially low on nil (33%) and rove (38%) on the Yes-No RT test (see details below), compared with the overall mean accuracy of 88% (SD = 32%) for the whole set of items. Therefore, I replaced these two words with cap and poverty, both drawn from the same frequency bands as the words they replaced. I present all 40 critical words in Table 4.

Data Collection Platform

All data were collected on Gorilla (www.gorilla.sc), an online platform which can be used for psycholinguistic research. The decision to collect data online was mainly due to the COVID-19 pandemic, which resulted in the (partial) closure of university buildings (hence our lab) and students leaving campus. Since online research, particularly in psychology, has only recently grown in popularity, I considered a number of concerns researchers may have in relation to data quality (e.g., Woods et al., 2015). Here, I also outline the measures that I implemented to mitigate some of these potential problems. First, in terms of the identity of the participants, one immediate reaction to online data collection can be that researchers have no way to verify participants' identities. In the present research, there were two key qualifying criteria for participants: (1) be a non-native speaker of English and (2) be a student at an American university. As in lab-based research, I relied on the participants' self-report of their non-native speaker status. For their student status, I requested that they provide a university email address for communication and for the Gorilla system to send a link to participate in the experiment. It is true that not everyone who possesses a university email address is a student. However, I considered asking participants to provide further identification (e.g., a student card) to be unnecessary because that would mean collecting more personal information (e.g., student number and picture) than necessary for this study. Participants also self-reported their non-native speaker and student status twice: once in a screening survey, and once in Gorilla before the actual experiment (see Procedure below). The second concern was potential attention lapses. In lab-based contexts, there may be a certain level of on-site supervision from the researcher to maintain participants' attention. Such supervision is absent in virtual space. In the worst-case scenario, participants could be merely guessing, randomly responding to experimental items. In the data analysis procedure (see below), I paid close attention to individual participants' accuracy scores. Given the thought put into selecting the critical words, I expected participants to perform reasonably well. At the very least, they should perform above chance levels (50% accuracy when given two forced choices and 25% when given four options in the multiple-choice format).
In tasks that contained non-words (letter strings that do not form a word), I expected a low false alarm rate (incorrect responses to non-words, suggesting guessing) (e.g., < 50% in a binary, forced-choice situation). I used these criteria to exclude participants who either did not pay attention or did not have the proficiency to provide useful information for this study. As for general fatigue, I arranged for participants to take part in the experiment (of five tasks, see Measures below) on two separate days. Each day took approximately 30 – 45 minutes. I also built in breaks within and between tasks to allow participants to rest if needed.

Indeed, there is a small literature comparing online and lab-based data collection (Crump et al., 2013; Germine et al., 2012; Klein et al., 2014; Ruiz et al., 2019). Together, these studies suggested that the two data collection settings do not differ to a considerable extent. For example, Germine et al. (2012) found results in their online replication attempts that were similar to those reported in the initial lab-based studies. Importantly, the tasks involved in that study, such as the Cambridge Face Memory Test and the Forward Digit Span task, were considered to be more vulnerable to lapses in attention. Similarly, comparable results across the two data collection settings were reported by Crump et al. (2013) and Klein et al. (2014), who incorporated both reaction-time and memory tasks in their studies. In second language acquisition research, Ruiz et al. (2019) also reported similar findings between the lab- and web-based versions of their working and declarative memory tasks. While some of these findings may seem encouraging, the only task that Crump et al. (2013) failed to replicate (one in eight tasks) was a masked priming task involving symbols (i.e., arrows pointing in different directions). The general idea is that participants were expected to respond faster to an arrow preceded by another one pointing in the same direction than to an arrow preceded by one pointing in a different direction. The authors suggested that experiments involving brief presentations of elements (e.g., 16, 32 ms) can be less reliable in internet-based research. In fact, Hamrick (2020) outlined a number of challenges facing researchers who use reaction time-based measures, even in lab-based settings. These challenges can influence data quality to varying degrees depending on factors such as equipment, experimental set-up, and instructions to participants, all of which could introduce random variability (noise) into the data, which in turn buries the important signals that researchers look for (e.g., an expected effect of a 20-ms difference in reaction time). The key questions at hand were the accuracy and precision levels of the timing measurements of online data collection platforms and the extent to which such levels would be acceptable. One study that systematically evaluated the accuracy and precision of online experiment platforms was Anwyl-Irvine et al. (2020). The authors investigated the impact of different system set-up combinations (platforms [e.g., Gorilla], browsers [e.g., Google Chrome], and operating systems [e.g., Windows on a laptop]) on display time across 30 different time frame durations and on reaction time recording. Results showed that, for example, Gorilla had a mean visual duration delay of 13.44 ms with a standard deviation of 15.41 ms.
This means that when a stimulus was due to be presented, it actually appeared on the screen, on average, 13.44 ms later. Gorilla ranked third in terms of absolute mean values among the four platforms compared (mean delays ranged from -6.24 ms [stimulus presented before it was due] to 26.02 ms). Note that, if the delay had been consistent, it would have posed less of a problem. However, the standard deviation of 15.41 ms indicated rather large variability given the scale. The more important measure was reaction time recording. Gorilla recorded reaction times as the time between the actual presentation of the stimulus and a response. In a way, then, any (variable) delay in the presentation would not impact the reaction time recording. For reaction time recording, the mean delay for Gorilla was 78.53 ms with a standard deviation of 8.25 ms. This means that the platform detected a response from the robot actuator only after, on average, 78.53 ms had elapsed since the key was struck. Again, Gorilla ranked third in absolute mean values compared with the other systems. Note, however, that the standard deviation was rather small, especially compared with the other platforms, whose standard deviations ranged from 15.27 ms to 28.16 ms. In other words, although there was a general delay, such delays were relatively consistent on Gorilla, at least compared with the other systems tested in the study. Although the authors optimistically concluded that these platforms provided "reasonable accuracy and precision" (p. 1) in terms of display duration and reaction time recording, there was still variability associated with the equipment the participant used. Also, in the context of the present study, some of the delays and the variability in those delays represented potential random noise in the data, which appeared larger than ideal. For example, for the priming tasks (see Measures below), which were most susceptible to reaction time accuracy and precision, I expected a group-level difference of 22 ms to 80 ms, as informed by Elgort (2011). The potential random variability described above (e.g., a standard deviation of 8.25 ms in reaction time recording delays) might then prevent the signal (i.e., the expected priming effects) from being observed. In a way, then, the situation called for strategies to reduce the amount of noise in the data to the extent possible, which in turn meant that the signal could emerge more clearly (e.g., Siegelman et al., 2017). To achieve this, I engaged in item analyses to identify items that elicited random performance given the current experimental set-up. As will be detailed in the data analysis section, removing items that did not elicit priming may render the task a more reliable measure (Siegelman et al., 2017). Furthermore, when random variance was inevitable, I attempted to incorporate such variability into the statistical modeling so that it was properly accounted for (Rouder & Haaf, 2019). Using mixed-effects modeling, variability due to participants (resulting from both differences in their ability and their technological set-up) and due to items can be partitioned to highlight the expected effect (Rouder & Haaf, 2019) (see more discussion in the Data Analysis section).

Measures

In this study, there were a total of five tasks, all of which tapped into the participants' knowledge of the form-meaning link of the critical words. I modeled these tasks after previous research in order to better align the present study with the existing research base and practices.
These tasks were a form-meaning receptive task, a form-meaning productive task, a simple lexical decision task (or Yes-No RT test), a masked repetition priming task, and a semantic priming task. For each task, I describe the knowledge it was intended to measure, the task itself, and the methodological details related to administering the test. I also detail information obtained from the task construction and piloting stages to demonstrate the quality of the instruments. In Table 5, I summarize these tasks, their expected effects, the constructs being measured, and the type of data they afforded for the final data set.

Table 5
Summary of Measures for the Present Research

Task | Expected Effect | Test Construct | Explicit or Implicit | Timed or Untimed | Outcome Variable
Form-Meaning Receptive Test | NA | Knowledge of the form-meaning link at the level of recognition | Explicit | Untimed | Accuracy data
Form-Meaning Production Test | NA | Knowledge of the form-meaning link at the level of recall | Explicit | Untimed | Accuracy data
Yes-No RT Test | NA | Access to the form-meaning link | Explicit | Timed | Accuracy and reaction time data
Masked Repetition Priming Task | Repetition priming (faster responses to identity prime-target pairs) | Lexical representation | Implicit | Timed | Reaction time data
Semantic Priming Task | Semantic priming (faster responses to prime-target pairs that are semantically related) | Semantic representation | Implicit | Timed | Reaction time data

Note. NA = Not Applicable

Form-Meaning Receptive Test

I adopted the thirty-eight items in the K2 to K5 frequency bands from the 14k Vocabulary Size Test (Nation & Beglar, 2007) for this task. For cap and poverty, the two critical words replacing the original, unsatisfactory items identified during piloting, I wrote the test items myself as an experienced teacher of English as a foreign language. This task was designed to measure participants' knowledge of the form-meaning link at the sensitivity level of meaning recognition. Items were clustered according to their frequency levels. At each level, there were ten items. For each item, I presented the target word together with a sentence in which the target word was used, as well as four definition options. Only one of these options correctly described the sense of the target as used in the sentence. The task for the participant was to choose the closest meaning to the target word (see an example item below and the full set of test items in Appendix A). The definitions for the target were always of higher frequency levels (i.e., more common) than the critical word in question. This manipulation was to minimize the probability that the participant would fail to select the answer because they did not know the words in the definitions. Here, I present an example item for the critical word patience:

PATIENCE: He has no patience.
a. will not wait happily
b. has no free time
c. has no faith
d. does not know what is fair

I invited three native speakers of English to attempt the test, although the 14k Vocabulary Size Test (Nation & Beglar, 2007) had been used somewhat widely in vocabulary research (Godfroid et al., 2018; Peters, 2019; Vafaee & Suzuki, 2020) and had been subjected to psychometric validation (Beglar, 2010). The native speakers all obtained a perfect score. In terms of assessing the face validity of the test, participants needed to have established the form-meaning link of the critical word in order to select the correct meaning.
Also, the sentence was written in such a way that guessing the meaning from context was not possible. Finally, the definitions were presented as choices; hence, participants only needed knowledge at the sensitivity level of recognition to complete the task. Taken together, I consider this test to have sufficient face validity as an explicit, untimed measure of the form-meaning link at the level of recognition (i.e., receptive knowledge of the form-meaning link).

Form-Meaning Productive Test

I modeled the format of this test after the productive Vocabulary Levels Test (Laufer & Nation, 1999). This test was designed to measure the participants' productive knowledge of the form-meaning link. In particular, it tested participants' "controlled productive ability" (Laufer & Nation, 1999, p. 36) in the sense that learners needed to produce the target words when given a meaningful, obligatory sentence context. As an experienced teacher of English as a foreign language, I wrote one item for each of the forty critical words. For each item, I supplied a meaningful sentence context as well as the first letter(s) of the critical word. The provision of the first letter(s) was to prevent learners from filling in other legitimate alternatives. In writing these items, I referred to sentence examples in dictionaries, such as the Collins Dictionary (www.collinsdictionary.com). Care was taken not to include words in the sentence context that are of lower frequency than the critical word. This was to minimize cases where participants failed to supply the target because they could not understand the context. Here is an example item for the target word patience (see the full set of items in Appendix B):

In the end, I lost my pat______ and shouted at them.

To gather validity information, I engaged in four rounds of revision with four native speakers of English who were writing consultants at the University's Writing Center. In each round, a native speaker attempted the task. For items that were not attempted correctly and/or were deemed confusing, we discussed ways to modify the context and/or wording to improve the items. After each review, I revised the items according to the feedback and proceeded to the next round of review with another native speaker. In the final, fourth round, the native speaker was able to provide correct answers to all items. In the design of this test, participants needed to have productive knowledge of the target's form-meaning link (i.e., spelling and meaning). Although this assumed understanding of the sentence context, I considered this test to be valid for tapping into such explicit knowledge in an untimed manner.

Yes-No RT Test (Access to the Form-Meaning Link)

I modeled this task after previous studies that implemented a computerized version of Meara's (2010) Yes-No test (e.g., Hui & Godfroid, 2020; Pellicer-Sánchez & Schmitt, 2012). This task was designed to measure access to the form-meaning link under time pressure (i.e., lexical strength). Stimuli included the 40 critical words for the present study. In order to make it a genuine task for participants, another 40 non-words obtained from the ARC Nonword Database (Rastle et al., 2002) were also included. Although there is little consensus on the proportion of non-words, the typical range falls between 25% and 50% (Beeckmans et al., 2001; Pellicer-Sánchez & Schmitt, 2012; X. Zhang et al., 2020). I chose 50% (i.e., one non-word for every real word) in an attempt to reduce guessing.
In total, there were 80 trials (40 real words and 40 non-words). Participants were asked to indicate whether they knew a given word. To be more specific about what it means to know a word for this task, I followed Pellicer-Sánchez and Schmitt (2012), who told participants that a yes response meant that the participant would recognize the word in a text and know its meaning(s). The participants indicated their knowledge by pressing the corresponding keys on their keyboard (the J key representing Yes and the F key representing No). I instructed participants to judge as quickly and as accurately as possible, following conventions in psycholinguistics to guide participants to place equal emphasis on response accuracy and speed so that the accuracy and reaction time data would manifest performance differences somewhat equally (e.g., Draheim et al., 2019). Each trial started with a fixation cross (+) presented for 400 ms, which was followed by the target in lowercase (e.g., patience) until the participant responded. The next trial followed after 100 ms. All trials were randomized by Gorilla. There was a practice block of six items at the beginning. Feedback was provided only in the practice block. All stimuli are presented in Appendix C.

This task afforded both accuracy and reaction time data. I consider this task to be a valid test of access to the form-meaning link because participants needed to know the word meaning in order to respond correctly and because time pressure was imposed on them. In terms of the reaction time data, I made the assumption that "more hesitant and inaccurate responses would be slower, whereas more certain and accurate ones would be faster" (Pellicer-Sánchez & Schmitt, 2012, pp. 492-493). Therefore, shorter reaction times were taken as evidence of more efficient access to the form-meaning link, a manifestation of lexical strength. In other words, I used this task as a measure of lexical strength in the present study. In addition, since the task induces participants' awareness of the knowledge assessed, this test is an explicit test.

I piloted this task with 24 learners of English drawn from the target population. As mentioned in the Critical Words section, the overall accuracy for real words was 88% with a standard deviation of 32%. The by-participant analysis suggested that participants generally performed well on real words. Accuracy ranged from 0.70 to 1.00 with a median of 0.93, suggesting ceiling effects for some participants. However, there was some guessing, as demonstrated in the non-word data. The average false alarm rate (incorrect yes responses to non-words) was 12% with a standard deviation of 33%. The false alarm rate in the by-participant analysis ranged from 0.00 to 0.55 with a median of 0.05. These results underscored the need to account for guessing in the data analysis (Huibregtse et al., 2002; X. Zhang et al., 2020) because some participants guessed more than 50% of the time. The by-item analysis revealed that two words had very low accuracy (0.33 and 0.38), as reported above in the Critical Words section. These two words were then replaced. The remaining word items showed a satisfactory accuracy range from 0.71 to 1.00. There were ceiling effects for 12 items to which all participants responded correctly (accuracy = 1.00), meaning that these items had no ability to discriminate among participants in this sample. For the purpose of computing reliability for this pilot test, I removed these items because they demonstrated no item variance (see the illustrative sketch below).
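The following minimal R sketch illustrates this reliability step, assuming a hypothetical participant-by-item accuracy matrix acc (rows = participants, columns = items, values 0/1); the object name and the use of the psych package are illustrative choices rather than the exact scripts used for the pilot data.

library(psych)

# 'acc' is a hypothetical participant-by-item matrix of 0/1 accuracy scores
item_var <- apply(acc, 2, var)       # item-level variance
acc_kept <- acc[, item_var > 0]      # drop items with no variance (ceiling items)

alpha(acc_kept)$total$raw_alpha      # Cronbach's alpha for the remaining items
omega(acc_kept)$omega.tot            # McDonald's omega (total)

Dropping zero-variance items is necessary here only because such items contribute nothing to between-person covariance; as noted next, the ceiling items themselves were retained in the stimulus list.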
With the resulting data set, Cronbach's alpha was .77, and McDonald's omega was also .77. In order to maintain test equivalency across the five tasks in this study, I kept the items with ceiling effects in the stimulus list. In terms of the reaction time data, I only examined the correct yes responses. I trimmed trials for which the reaction time fell outside the 300 ms – 2500 ms window (e.g., Jiang, 2013). Table 6 summarizes the reaction times in this pilot test. Overall, participants responded descriptively faster to real words than to non-words, which was expected (e.g., Scarborough et al., 1977; Stenneken et al., 2007). Analyzing only the real-word data, the split-half reliability was .85. Taken together, the pilot results suggested that this test worked as intended.

Table 6
Means and Standard Deviations of Reaction Times for Words and Non-words

Stimulus type | Mean RT in milliseconds (SD)
Real words | 778 (295)
Non-words | 846 (290)

Masked Repetition Priming (Lexical Representations)

I modeled this task after the operationalization of the masked repetition priming paradigm in Elgort (2011). As discussed in the Literature Review, this task taps into participants' lexical representations. When the lexical representations in question are established in the mental lexicon, participants are expected to make a faster lexical decision when the target is preceded by an identical prime than by an orthographically or semantically unrelated prime. This facilitation (priming) can only be observed when the target is lexically represented in memory. The general idea is that the word prime pre-activates the lexical representation in question, so that access to it is easier (faster) when the participant sees the target, resulting in a faster response. In contrast, in cases where no lexical representation is established in memory, the prime will have no effect on the response to the target because there is no pre-activation in the absence of a lexical representation.

In this task, each critical word constituted an item. Each item was a duplet (i.e., it consisted of two trials). One trial was in the related condition and the other in the unrelated condition. In the related condition, the prime and the target were identical (e.g., patience–PATIENCE). In the unrelated condition, the prime was changed such that it was not related to the target in form (i.e., no letters were in the same position in the two words) or meaning (e.g., occasion–PATIENCE). I matched the primes in the two conditions on length, part of speech, frequency in Zipf units (according to Brysbaert & New [2009], with a tolerance of +/- 0.15), and character bigram probability (according to Brysbaert & New [2009], with a tolerance of +/- 0.001) using the LexOPS package in R (Taylor et al., 2020). The task for the participant was to make a lexical decision on the target (e.g., to judge whether or not PATIENCE is a word in English). With 40 critical words, there were 80 critical trials (one trial in each of the two conditions). Following Elgort (2011), I further reduced the proportion of related trials in the stimulus list by introducing an additional 80 unrelated word pairs as fillers. This was to minimize the prime validity effect, that is, priming due to a high proportion of repetitions (e.g., Bodner & Masson, 2001). To make the lexical decision a genuine task for the participant, there were also 160 non-word trials, half in the related (repetition) condition and half in the unrelated condition. Table 7 provides a summary of the different trial types. All stimuli are presented in Appendix D.
Table 7
Summary of Trial Types in the Masked Repetition Priming Task
Trial Type   Condition   Prime      Target     Target Type                   No. of Trials
Critical     related     patience   PATIENCE   critical words in the study   40
Critical     unrelated   occasion   PATIENCE   critical words in the study   40
Filler       unrelated   brother    SONG       other real words              80
Nonword      related     snarbs     SNARBS     non-words                     80
Nonword      unrelated   plisc      SNARBS     non-words                     80

In total, there were 320 trials, of which 200 (63%) were unrelated and 120 (37%) were related. Participants took part in all trials. Following Elgort (2011), I used the standard three-field masking paradigm. This means that, following a fixation cross (+) presented for 400 ms, I first presented a forward mask (a string of hash signs [###]). This mask stayed in the center of the screen for 500 ms. Immediately after that, the prime in lowercase (e.g., patience) appeared for 55 ms in the same space as the mask, followed by the target in uppercase letters (e.g., PATIENCE) for 500 ms before turning into a blank screen. This blank screen was displayed until the participant responded. The next trial started after 100 ms. The use of both lower- and uppercase was to ensure that the participant would respond to the target (not the prime) and to rule out the effect of visual overlap between the two. In the present case of a repetition priming task, different capitalization made it clearer to the participant what the target was, although participants were not expected to be aware of the presentation of the prime. I asked participants to judge whether or not the presented letter string (the target) forms a word in English (i.e., to make a lexical decision). As in the Yes-No RT test, participants made their judgements by pressing the corresponding key on their keyboard. I also instructed participants to judge as quickly and as accurately as possible. There was no feedback, except in the practice block of six items. Breaks were inserted approximately every 80 trials to avoid fatigue. All trials were pseudo-randomized by Gorilla. This task afforded reaction time data from which I analyzed the level of priming (facilitation) as a measure of implicit lexical-formal knowledge of the critical words.

I piloted this task with a sample of 24 participants drawn from the target population. The overall accuracy rates for all trials (words and non-words), word trials (critical words and fillers), and critical word trials were 81% (SD = 39%), 85% (SD = 36%), and 88% (SD = 36%), respectively. The false alarm rate (incorrect responses to non-words) was 22% (SD = 42%). From the accuracy data, then, it appeared that participants generally had the proficiency to complete the task, consistent with the pilot results of the Yes-No RT test. However, there was a certain level of guessing with rather large variability between participants, pointing to the need to consider individual participants' false alarm rates to ensure data quality. To demonstrate that this task was able to generate the intended repetition priming effect as an indication of established lexical representations, I focused on two aspects of the reaction time data: first, for the critical trials, participants were expected to respond faster to the target in the related trials than in the unrelated trials (i.e., there should be facilitation as a result of repetition), and second, such priming would not be observed in the non-word data.
As with the Yes-No RT test, I trimmed the reaction time data using the lower and upper thresholds of 300 ms and 2500 ms, respectively (Jiang, 2013). I then computed descriptive statistics of the reaction times in both conditions by trial type (see Table 8).

Table 8
Means and Standard Deviations of Reaction Times for Critical Words Between Conditions (Repetition Priming)
                  Mean RT in Milliseconds (SD)
                  Related                        Unrelated
Critical words    (e.g., patience – PATIENCE)    (e.g., occasion – PATIENCE)
                  623 (222)                      650 (209)
Non-words         (e.g., rects – RECTS)          (e.g., flief – RECTS)
                  720 (197)                      718 (196)

Elgort (2011) reported a 52-ms priming effect for the target words in her study (i.e., pseudowords that her participants had learned recently) and a 75-ms priming effect for the real, low-frequency word trials. In the present pilot data, I obtained a 27-ms priming effect, which was consistent with but somewhat smaller than the previous results. I followed up on this 27-ms difference with mixed-effects modeling, an appropriate statistical approach to handle nested data in psycholinguistic experiments (Baayen et al., 2008). Nested data call for the incorporation of crossed random effects (i.e., each participant provided multiple, correlated data points, and each stimulus elicited multiple observations that autocorrelate) (e.g., Baayen et al., 2008). This simultaneous handling of random effects due to participants and stimuli represents a key advantage of the technique over separate by-participant and by-item analyses where data are aggregated, leading to loss of statistical information. In building the mixed-effects models, the outcome was always a reciprocal transformation of reaction time (i.e., -1/RT). I started with a null model with no fixed effects but only the random intercepts by participants and by items. This null model provided an intra-class correlation of .38 as statistical evidence for the need to account for the random effects. Then, I added condition (related [0] vs. unrelated [1]) as a fixed effect. In terms of the random effect structure, I built a maximal model where all random intercepts and slopes were entered (Barr et al., 2013). The significance of condition indicated that the 27-ms difference was reliable, after controlling for random variability between participants and items. Hence, the current task successfully elicited the targeted priming effects from participants. The model summary is presented in Table 9.

Table 9
Summary of Mixed Models for Pilot Data – Masked Repetition Priming Task
Fixed effects                         m0 (null model): estimate a (SE), t (p)      m1 (maximal): estimate (SE), t (p)
Intercept                             -1.67 (0.05), t = -33.23 (p < .001)          -1.73 (0.06), t = -30.45 (p < .0001)
Condition                                                                          0.11 (0.02), t = 4.80 (p < .001)
Random effects                        m0: Variance (SD); intercept-slope corr.     m1: Variance (SD); intercept-slope corr.
By-participant intercept              0.05 (0.24)                                  0.07 (0.26)
By-item intercept                     0.007 (0.08)                                 0.01 (0.10)
By-participant slope for condition                                                 0.004 (0.07); r = -.77
By-item slope for condition                                                        0.003 (0.06); r = -.83
Residual                              0.10 (0.32)                                  0.10 (0.31)
AIC                                   -22161                                       -22219
Note. a All estimates were multiplied by 1000 for easier reading.

I also conducted by-participant and by-item analyses where reaction times were aggregated across items and participants, respectively. In the by-participant analysis, 18 participants (out of 24) showed a net positive difference between the related and the unrelated conditions (unrelated minus related), indicating some priming at the descriptive level. These differences ranged from 10 ms to 171 ms.
The remaining six participants showed a net negative difference, ranging from -6 ms to -72 ms. In the by-item analysis, 29 items (out of 40) showed a net positive difference between the related and the unrelated conditions (unrelated minus related), indicating some priming at the descriptive level. These differences ranged from 0.2 ms to 190 ms. The remaining 11 items showed a net negative difference, ranging from -2 ms to - 161 ms. These analyses showed that despite an overall effect of priming, there remained a certain amount of variability between participants and between items, highlighting the need for participant- and item-level inspections. I then proceeded to analyzing the non-word data. As mentioned, no priming was expected in the non-word data, because participants should not have these letter strings represented in memory as lexical items. Therefore, the prime should not facilitate the response to the target even when it was repeated. From the descriptive statistics (see Table 8), there was a 2-ms difference between the conditions. In the light of the standard deviations, I concluded that there were no reliable differences between the two conditions, supporting the lexical nature of masked repetition priming. Finally, in order to assess the extent to which participants were aware of the prime, I asked participants in a debriefing survey progressively (1) whether or not they had seen something between the fixation cross and the target, and (2) if so, what they had seen. All participants mentioned the mask (i.e., the hash signs), and no participants reported seeing a word (i.e., the prime). Therefore, the masking of the prime in this task worked as intended, 62 lending further support to this task as an implicit measure of vocabulary knowledge. Taken together the accuracy and the reaction time analyses, this masked priming task appeared to function as intended at the piloting stage. Semantic Priming (Semantic Representations) I modeled this task after Elgort’s (2011) operationalization of the semantic priming paradigm to measure the establishment of semantic representations of words in memory. The general idea behind the priming effects expected in this task is similar to that in masked repetition priming task described above. Specifically, participants were expected to respond faster to a related prime-target trial (e.g., patience–calm) than an unrelated one (e.g., chestnut–calm). This facilitated processing of the target (e.g., calm) can be attributed to the prime pre-activating the overlapping semantic representations of the target. On the other hand, when the prime and the target are unrelated, the activation of the representations associated with the prime does not spread to those associated with the target because prime and target are not meaningfully related or interconnected. Therefore, I made the assumption that if learners show priming in this implicit task, they have integrated the prime into their semantic network of both the prime and the target in memory. In this task, there were 40 critical items. For each item, there were two trials (i.e., a duplet), and each trial appeared in one of two conditions: semantically related and unrelated. In the related condition, the prime and the target were related semantically (e.g., patience- calm), and they were not related in the unrelated condition (e.g., chestnut-calm). The critical words of this present study were primes in the related trials while a matched prime was chosen 63 for the unrelated trials. 
That way, participants responded to the same target (e.g., calm) across conditions, allowing a meaningful comparison. In establishing semantic (un)relatedness, I relied on the web application Snaut, a platform for semantic association evaluation (Mandera et al., 2017). Specifically, I obtained the cosine distance value for each critical trial, which represents the semantic space between the prime and the target (Günther et al., 2016). The lower the value, the smaller the semantic space there is between the two (i.e., the more related). At the item level, the related trial always has a smaller value than the unrelated trials with a mean difference of 0.31 units. At the group level, an independent Welch t-test was performed to test the significance of the difference. The assumption of equal variances was violated (Levene’s test: p = .02), but the assumption of normality in both groups held (Shapiro-Wilk test: p = .071 and .072). Results confirmed that the cosine values for the related trials (M = 0.62, SD = 0.11, 95%CI [0.59, 0.65]) were statistically smaller than those in the unrelated trials (M = 0.93, SD = 0.07, 95%CI [0.90, 0.95]), t (37) = - 15.14, p < .001, d = -3.39. This indicated that the semantic space between the related pairs was smaller (semantically closer) than that between the unrelated pairs, providing objective evidence for the validity of the manipulation. I also took care to ensure that any priming to be observed would be due to the semantic relatedness, and not merely word association between the prime and the target. To measure word association, researchers often rely on databases created from word association tasks, where participants are presented with a stimulus and asked to give the first word that comes to their mind. It is important to note that there can be more than one reason for a particular response to come to the participant’s mind. For example, words can be associated because 64 they often occur together (e.g., new – year). They can also be semantically related (e.g., synonyms). Therefore, in order to strengthen the internal validity of this semantic priming tasks, I followed Elgort (2011) in keeping the associative relationship between the prime and the target low. As a result, the task tapped into participants’ semantic representations to the largest extent possible, rather than their associative knowledge of the critical words. I used a web-based platform (http://rali.iro.umontreal.ca/word-associations/query/) to examine word association in the critical, semantically related trials, according to the Edinburgh Associative Thesaurus (EAT) (Kiss et al., 1973) and University of South Florida Free Association Norms (USF-FAN) (Nelson et al., 1998). I only used the forward association information because it reflected the order of presentation in the present task. For example, given patience (a critical word in the present study), the databases returned 18 tokens of 16 unique responses with the frequency of occurrence ranging from 1 – 2 (e.g., tolerant, waiting). This means that, for example, when given patience, two participants responded tolerant. From this information, I was able to estimate and quantify the strength of forward association between patience and tolerant (2/18 = 0.11). In the case of the patience-calm pair for the present study, no association between these two words is listed in both databases. Overall, 20 of the 40 critical, related pairs (50%) fell into this category. In 11 other pairs, the association was low (< 0.10). 
For seven pairs, the association was moderate (0.10 – 0.40). There were three pairs (latter–former; haunt–ghost; fracture–break) that showed high association (> .40). In addition, I made sure that the frequency of the target (e.g., calm) was of the same or a higher frequency band than the prime word (the critical word in the study, e.g., patience) so that knowledge of the target could be assumed. However, in three prime-target pairs, the target belonged to a lower frequency band than the prime word. They were strap–bra, dinosaur–fossil, and vocabulary–grammar. Given the multiple dimensions of control (semantic relatedness, association, and frequency), these exceptions to the general rules for stimulus selection were considered necessary concessions for the present study. At the same time, as detailed below, I attempted to control for variability specific to individual items through statistical means. If these exceptions had had an effect on the results, they would have been accounted for by the model. I present this association information in Appendix E and the full set of materials in Appendix F.

On top of the 80 critical trials, I added 80 unrelated filler trials to decrease the proportion of related trials in the stimulus list and thus minimize task-taking strategies. I also introduced 80 non-word pairs, each of which was presented twice. These 160 trials balanced the word and non-word trials to make it a genuine task for the participant. In total, then, there were 320 trials. Following Elgort (2011), I presented these trials in a list-wise fashion where participants made a lexical decision for each stimulus. Following a practice block of six items, the presentation list started with a fixation cross presented for 2000 ms. Then, each trial began with a blank screen (200 ms) followed by the stimulus (prime or target). In Table 10, I present a summary of the trial types.

Table 10
Summary of Trial Types in the Semantic Priming Task
Trial Type   Condition   Prime      Target   Target Type                 No. of Trials
Critical     related     patience   calm     target words in the study   40
Critical     unrelated   chestnut   calm     target words in the study   40
Filler       unrelated   brother    song     other real words            80
Nonword      unrelated   plisc      snarbs   non-words                   160

This task went through five rounds of piloting with native and non-native speakers of English. This process resulted in the identification of programming errors, revisions of the prime-target pairs, and a change in the presentation method, all of which are reflected in the description above. Here, I report the last round of piloting involving 18 native speakers and ten participants drawn from the target learner population. These numbers were lower than the number of pilot participants in the first round as reported above. Although this was not optimal, it has also been suggested that a pilot study with a small number of participants can still be informative (Jiang, 2013). Specifically, the author suggested that "[a] general rule of thumb is that you can check the results after you have tested six to seven participants on each presentation list" (Jiang, 2013, p. 31). While Jiang (2013) acknowledged the low statistical power as a result of such a small sample, the author argued that researchers can gain insights into the direction and magnitude of an effect. For example, he wrote that "[running more participants] won't turn a weak -2 ms negative priming effect into a strong +34 ms positive priming effect" (Jiang, 2013, p. 31).
Given the resources available, I took advantage of the information provided by these participants. In the learner data, the overall accuracy rates for all trials (words and non-words), word trials (critical and filler trials), critical trials (prime and target trials), and target trials were 84% (SD = 37%), 84% (SD = 37%), 89% (SD = 32%), and 92% (SD = 28%), respectively. The false alarm rate (incorrect responses to non-words) was 16% (SD = 36%), with one participant at 51%. In addition, it might be noteworthy that the accuracy for the prime trials (M = 86%, SD = 35%) was descriptively lower than that for the target trials (M = 92%, SD = 28%). This result made sense because the prime words in this experiment were the critical words for the study (e.g., patience), and the targets were semantically related words of a higher or a similar frequency level (e.g., calm). The high accuracy of the target trials also suggested that it was reasonable to assume participants' knowledge of the targets. Finally, the accuracy rates for the target trials by condition (i.e., preceded by a related vs. unrelated prime) were similar (91% [SD = 28%] vs. 92% [SD = 27%]). Taken together, these figures signaled that participants generally had proficiency levels sufficient to complete the task and that there was some level of guessing, consistent with the two other reaction time-based measures.

In a list-wise presentation where participants respond to all stimuli (both the prime and the target), there is one additional factor to consider in data preparation: For the current semantic priming measure to be valid, participants need to know (respond correctly to) both the prime (in order to pre-activate the semantic representation) and the target. This general principle applies to both conditions (related vs. unrelated). In other words, in the data preparation, all four responses (to the prime and the target in both conditions) need to be correct for a given item of a given participant to be included for analysis. Although this could be a high bar, this requirement ensured the data quality in the reaction time analysis. As with the other reaction time tasks, reaction times were trimmed using a lower threshold of 300 ms and an upper threshold of 2500 ms. As a result of these criteria, one participant, who had a 51% false alarm rate, had no data left in the resulting data set; the other participants had 16 – 37 items (out of 40 items) in the data set. Different items had data from 1 to 9 participants (out of 9). In Table 11, I present the descriptive statistics for reaction times in both conditions.

Table 11
Means and Standard Deviations of Reaction Times for Learners in the Semantic Priming Task – Piloting
    Mean RT in Milliseconds (SD)
    Related                     Unrelated
    (e.g., patience – calm)     (e.g., chestnut – calm)
    601 (170)                   600 (164)

Although there was a lack of priming effects in either direction at the group level, I also inspected the by-participant analysis. Out of the nine participants, six showed a net positive difference in reaction times in the expected direction, ranging from 3.24 ms to 33.24 ms. The other three participants showed a net negative difference (reverse priming), ranging from -4.51 ms to -86.97 ms. In Elgort (2011), the author reported a 22-ms difference for her target words (i.e., pseudowords that her participants had recently learned) and a 37-ms difference for the real, low-frequency word trials. Using Elgort's (2011) results as a reference, four participants (out of nine) showed signs of priming.
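These by-participant (and, below, by-item) differences were obtained by aggregating the retained reaction times within each condition and subtracting the related mean from the unrelated mean. The following is a minimal tidyverse sketch of that computation; the data frame and column names (sem_rt, participant, item, condition, rt) are assumptions for illustration rather than the actual scripts used in the study.

```r
library(tidyverse)

# 'sem_rt' is a hypothetical long-format data frame holding the retained target
# trials after the accuracy and 300-2500 ms trimming criteria: one row per trial,
# with participant, item, condition (related/unrelated), and rt (in ms).
by_participant_priming <- sem_rt %>%
  group_by(participant, condition) %>%
  summarise(mean_rt = mean(rt), .groups = "drop") %>%
  pivot_wider(names_from = condition, values_from = mean_rt) %>%
  mutate(priming = unrelated - related)  # positive values indicate facilitation

# The analogous by-item summary replaces 'participant' with 'item' in group_by().
```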
I also conducted a by-item analysis, although each item had at most nine participants' reaction times to aggregate from, potentially making the results less reliable. Out of 40 items, 23 showed a net positive difference in the expected direction, ranging from 4.17 ms to 122.64 ms. The remaining 17 items had a negative reaction time difference ranging from -1.66 ms to -163.30 ms. Using the figures reported by Elgort (2011), 15 items (out of 40) showed signs of eliciting priming. I did not engage in inferential statistics because of the sampling errors associated with only nine participants.

These results for learners were not optimal, but they were somewhat expected upon reflection. First, based on Elgort (2011), the expected effect size of the semantic priming task (i.e., a 22-ms difference) was smaller than that in the masked repetition priming task (i.e., a 50-ms difference), making it more difficult to detect any semantic priming reliably. Online data collection also inevitably introduced random variability to the data, worsening the situation. Finally, the sampling errors associated with the small sample size for this piloting could have prevented any trustworthy signals from emerging.

In the native speaker data, the overall accuracy rates for all trials (words and non-words), word trials (critical and filler trials), critical trials (prime and target trials), and target trials were 86% (SD = 34%), 90% (SD = 30%), 93% (SD = 25%), and 97% (SD = 18%), respectively. The false alarm rate (incorrect responses to non-words) was 21% (SD = 41%), with one participant at 51%. Again, the accuracy for the prime trials (M = 90%, SD = 30%) was descriptively lower than that for the target trials (M = 97%, SD = 18%), confirming again that it was reasonable to assume participants' knowledge of the targets. In terms of the reaction times, I present the descriptive statistics in both conditions in Table 12.

Table 12
Means and Standard Deviations of Reaction Times for Native Speakers in the Semantic Priming Task – Piloting
    Mean RT in Milliseconds (SD)
    Related                     Unrelated
    (e.g., patience – calm)     (e.g., chestnut – calm)
    554 (161)                   578 (168)

The mixed-effects model with the maximal random effects structure suggested a significant fixed effect of condition, indicating that the native speakers showed reliable priming at the group level. I present the model summary in Table 13. I took this result as evidence that the experiment was properly set up and that the manipulation of materials (i.e., semantic relatedness) was successful. However, it was important to bear in mind that native speakers and learners can behave differently in an experiment.

Table 13
Summary of Mixed Models for Pilot Data – Semantic Priming Task (Native Speakers)
Fixed effects                         m0 (null model): estimate a (SE), t (p)      m1 (maximal): estimate (SE), t (p)
Intercept                             -1.91 (0.06), t = -34.38 (p < .001)          -1.94 (0.06), t = -34.91 (p < .0001)
Condition                                                                          0.07 (0.03), t = 2.66 (p = .02)
Random effects                        m0: Variance (SD); intercept-slope corr.     m1: Variance (SD); intercept-slope corr.
By-participant intercept              0.05 (0.22)                                  0.05 (0.21)
By-item intercept                     0.003 (0.06)                                 0.01 (0.09)
By-participant slope for condition                                                 0.003 (0.05); r = .25
By-item slope for condition                                                        0.01 (0.08); r = -.93
Residual                              0.10 (0.32)                                  0.10 (0.32)
AIC                                   666                                          653
Note. a All estimates were multiplied by 1000 for easier reading.

Overall, the results from native speakers and learners offered a mixed message in that priming was only observed in the native speaker data.
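For transparency, the following is a minimal lme4/lmerTest sketch of the null and maximal models summarized in Tables 9 and 13. The data frame and column names (rt_data, rt, condition, participant, item) are assumptions for illustration; the packages are those listed later in Table 16.

```r
library(lme4)
library(lmerTest)     # p values for the fixed effects
library(performance)  # icc()

# 'rt_data' is a hypothetical long-format data frame with one row per correct trial:
# rt (ms), condition (related = 0 / unrelated = 1), participant, and item.
rt_data <- subset(rt_data, rt > 300 & rt < 2500)  # trimming window
rt_data$inv_rt <- -1 / rt_data$rt                 # reciprocal transformation of RT
                                                  # (tabled estimates were rescaled by 1000 for reporting)

# m0: null model with crossed random intercepts by participants and by items
m0 <- lmer(inv_rt ~ 1 + (1 | participant) + (1 | item), data = rt_data)
icc(m0)  # intra-class correlation from the null model

# m1: maximal model adding condition as a fixed effect plus by-participant and
# by-item random slopes for condition (Barr et al., 2013)
m1 <- lmer(inv_rt ~ condition + (1 + condition | participant) + (1 + condition | item),
           data = rt_data)
summary(m1)  # the fixed effect of condition indexes the priming effect
AIC(m0, m1)
```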
At the same time, I decided to proceed to the main data collection, partly because of the resources available (time and funds) and partly because of a few reasons for optimism. In the main data collection, I had planned to have more than 100 learner participants to the extent that resources allowed. This sample size would reduce the sampling errors, providing more meaningful evidence. I had also planned to incorporate random effects associated with both participants and items, putting me in a position to inspect the variability in eliciting priming between different items. This information would become crucial in assessing which item(s) consistently elicited random responses. Removal of these items could not only improve data quality but also reduce the noise (random variability) in the data set to allow the genuine effects to emerge (e.g., Siegelman et al., 2017) (see more discussion in Data Analysis). Despite the optimism, overall priming was still not observed in the main round of data collection with 145 participants, even after item screening (see Chapter 3 for details). This result cast doubt on what was being measured by the task. Therefore, I decided to drop this measure from the present study. I will report the results for this task in Chapter 3 and return to discussing this decision in Chapter 5.

Self-reported Proficiency

Participants self-reported their perceived general proficiency levels on a 10-level Likert scale where 1 was labelled as "Total beginner" and 10 was labelled as "Native-like." The decision to use a self-assessment questionnaire item was largely motivated by the resources available for the present research. Although it was not a formal assessment, the self-reported rating represented a proxy for their language performance. In a recent meta-analysis involving 67 primary studies and 68,500 participants, Li and Zhang (2021) found an overall, moderate correlation of .47 (p < .01) between self-assessment and language performance, confirming the value and validity of self-assessment, especially when resources present a limitation for researchers to administer a formal language proficiency test.

Procedure

I collected data online through Gorilla, as reported above. Interested participants first completed a screening survey where they read information about the study (e.g., procedure, potential risks, payment, and so on) and expressed consent to participate. They also reported their status as (1) a non-native speaker of English and (2) a current student at an American university. When participants fulfilled these two inclusion criteria, I then entered their university email into Gorilla, which sent a message to the participants, directing them to log into the system. Before the start of the experiment, they read the study's information and offered consent again. They also provided their demographic information (e.g., age, length of residence in the US). The whole experiment consisted of a battery of five tests (see Measures above), administered on two separate days at least 48 hours apart. In determining the order of administration, I considered the extent to which a given task might have an effect on subsequent ones. This was an important consideration given that multiple testing of the same critical words appeared to be "unavoidable" in this line of research (González-Fernández & Schmitt, 2020, p. 23). I also considered the task requirement (production vs.
recognition), the proportion of the critical words in the whole stimulus set, and the potential impact of fatigue on data quality and on participant attrition. I decided that the three time-sensitive tasks generating reaction time data should precede the paper-based tests because the former would be more sensitive to any potential effects of fatigue. Then, the form-meaning productive task should precede the form-meaning receptive task. Among these three psycholinguistic tasks, I counter-balanced the order to cancel out any potential ordering effects. On Day 1, then, participants took part in the three reaction time-based tasks. Forty-eight hours later, the system sent a reminder email to each participant, who was then able to log in and complete the two paper-based tests. At the end of the study, participants entered their payment information (e.g., electronic payment account).

Data Analysis

In this section, I detail the data analysis plan. I will first discuss the analyses for specific tasks with the ultimate goal of obtaining the most accurate and reliable measures for each task. Then, I describe the overall CFA data analysis to address the research question. Table 14 summarizes the number of data points for each participant in each of the five tasks.

Table 14
Summary of Number of Data Points for Each Participant
Task                             No. of Data Points Analyzed                    Details
Form-Meaning Receptive Test      40 accuracy data points                        One observation for each critical word (k = 40)
Form-Meaning Productive Test     40 accuracy data points                        One observation for each critical word (k = 40)
Yes-No RT Test                   80 accuracy and 40 reaction time data points   One accuracy and one reaction time data point for each critical word (k = 40), and one accuracy data point for each non-word (k = 40)
Masked Repetition Priming Task   80 reaction time data points                   Two reaction time data points for each critical word (k = 40)
Semantic Priming Task            80 reaction time data points                   Two reaction time data points for each critical word (k = 40)

The Form-Meaning Receptive Test

This task afforded accuracy data for all 40 critical words. Coding was conducted automatically through matching the key and the option chosen by the participant. To this end, I used the programming language R to minimize human errors in the process. I adopted the suggested key in the Vocabulary Size Test as the expected answers. I coded responses that matched the key as 1 and those that did not match as 0. Since all items required a response, there were no missing data. The system also did not allow selection of more than one option for most items (38 out of 40). However, due to a programming error, participants were able to choose multiple answers for two items. In these cases, the participant scored 0 (incorrect) for that item because they did not follow the test instructions, regardless of whether the correct option was chosen or not. This scenario represented 0.07% of the data. I then conducted a by-participant analysis to screen out participants who might not have paid sufficient attention and/or did not have the proficiency levels to take part in the study meaningfully. In particular, when a participant scored below 25% (chance level given four options), I coded their data for this task as missing. After the by-participant analysis, I inspected the items. I first removed items that had no item variance (i.e., those that all participants responded to (in)correctly) because these items could not discriminate participants' ability.
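To make these scoring, screening, and item-analysis steps concrete, the following is a minimal R sketch using the psych and eRm packages listed later in Table 16. The object names (such as scores) are assumptions for illustration, not the actual analysis scripts.

```r
library(psych)  # alpha(), omega()
library(eRm)    # RM(), person.parameter(), itemfit()

# 'scores' is a hypothetical participant-by-item matrix of 0/1 accuracy codes
# (one column per critical word), produced by matching responses against the key.

# By-participant screening: treat below-chance performance (< 25%) as missing.
prop_correct <- rowMeans(scores, na.rm = TRUE)
scores[prop_correct < .25, ] <- NA

# Drop items with no variance (items that everyone answered identically).
scores <- scores[, apply(scores, 2, sd, na.rm = TRUE) > 0]

# Instrument reliability.
psych::alpha(scores)  # Cronbach's alpha
psych::omega(scores)  # McDonald's omega

# Dichotomous Rasch model: P(correct) = exp(theta - b) / (1 + exp(theta - b)),
# where theta is person ability and b is item difficulty.
rasch_fit  <- RM(scores[rowSums(!is.na(scores)) > 0, ])  # drop all-missing rows
person_par <- person.parameter(rasch_fit)
itemfit(person_par)         # infit/outfit mean squares for item screening
theta <- coef(person_par)   # person ability estimates retained for the CFA
```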
Then, I started by examining the instrument reliability in terms of both Cronbach’s alpha and McDonald’s omega (e.g., McNeish, 2018; Raykov & Marcoulides, 2019). I also submitted the data to a dichotomous basic Rasch analysis (Rasch, 1960) with an aim to identify unsatisfactory items. 76 Rasch analysis is a commonly used statistical technique in psychological science (e.g., Müller, 2020) and language testing (e.g., Aryadoust et al., 2020; McNamara & Knoch, 2012) particularly to evaluate the psychometric properties of assessment items. A Rasch model predicts the probability of a learner answering an item correctly. To do so, it estimates parameters for both learner ability and item difficulty on the same scale. When the difficulty level of an item coincides with a person’s ability level, the person has a .50 chance of answering it correctly. When one has a higher ability in the construct that the test is set out to measure, one performs better on the test because one has a high probability of correctly answering more items across levels of difficulty (see Aryadoust et al., 2020 for an overview). At first sight, this might seem intuitive, but a couple of statistical assumptions are often made: First, all items are assumed to measure one single construct. This assumption of unidimensionality has sometimes been regarded as “too stringent” for language assessments because constructs, such as reading and listening, are fundamentally multidimensional (Aryadoust et al., 2020, p. 4). Another assumption that might not be easily met is local independence, meaning that all items measuring a unidimensional latent trait (i.e., person ability) are correlated with each other only because of the trait. Once the trait is controlled for, there should no longer be correlation left between these items. In statistical terms, the residuals (errors) associated with the items when regressed on learner ability should not correlate with those of other items (e.g., Aryadoust et al., 2020). Despite some rather strong assumptions, a Rasch model provides useful information about the learners and the items. First, by taking into account item difficulty, estimates of learner ability can be more accurate than using a sum score that ignores item characteristics. 77 Second, a Rasch model provides statistics for evaluation of the measurement. For example, infit and outfit metrics helps researchers identify items that elicit erratic responses both near or far from the learner ability (i.e., on- and off-target responses). In particular, the mean square index, summarizing the standardized residuals based on the estimates of response probabilities, can offer insight into the amount of noise (random variability) in the data for each item. In particular, when the mean square value is too low, this indicates that an item is overfitted (too little error), potentially signaling the redundancy of the item (Wright & Linacre, 1994). In contrast, an underfitted item (too much error) reveals itself by having a high mean square value, potentially due to random lucky guesses (Wright & Linacre, 1994). Using basic Rasch models, I inspected these fit statistics for each item. There are two ways to evaluate item fit: One can use rule-of-thumb critical values, or one can conduct a formal test and compare the resulting fit statistics against a normal distribution (Müller, 2020). Although any rule-of-thumbs are almost always controversial among statisticians, I adopted this former approach because it is more straightforward for applied researchers. 
Also, the exact distribution of the fit statistics is not yet very well known; therefore, conclusions from a formal test may or may not be appropriate (Müller, 2020). When determining the appropriate critical values for the present context, I first considered the scale of these item fit statistics. Both infit and outfit statistics are on a scale between 0 to positive infinity, with 1 being the ideal value indicating no or little distortion in the measurement system (Wright & Linacre, 1994). A value higher than 1 indicates underfitted items, while a value below 1 signals overfitting (redundancy) (Wright & Linacre, 1994). Wright and Linacre (1994) suggested that researchers should first focus on outfit (off-target responses) before infit (on-target responses) statistics, and on high 78 values (underfitting) before low (overfitting) values. In terms of an appropriate range, 0.5 to 1.5 is most commonly used in language testing contexts (Aryadoust et al., 2020). But, the authors also recommended 0.7 as the lower bound for low-stakes dichotomous tests. For the upper bound, the authors suggested following the equation provided by Smith et al. (1998) which takes the total number of items into consideration. In the case of 40 items, for example, the threshold can be set at 1.95 (1 + 6 / √40 = 1.95). I kept these guidelines in mind when inspecting the items, while also considering the test reliability measures. I then excluded items that were considered not psychometrically satisfactory. I then repeated the analysis and saved the learner ability parameters from the refitted Rasch model for further CFA analysis. The Form-Meaning Productive Test This test afforded 40 accuracy data points for each participant. I coded each correct answer as 1 and incorrect answers as 0. I followed Laufer and Nation (1999) in ignoring “[m]inor spelling… and grammatical mistakes” (p. 38 – 39). In terms of grammatical mistakes, I only accepted mistakes in inflections (e.g., plural or tenses) and capitalization. Incorrect parts of speech (i.e., derivational errors) were marked as incorrect. From all responses, I first summarized the unique responses for all items with an aim to create a coding scheme that specifies acceptable alternative responses. Due to the subjectivity involved, I invited a second rater, who is an experienced teacher of English as a second language, to discuss and co- construct the coding scheme with me. Then, I used the programming language R to match participants’ response with this coding scheme. I followed the same analytic procedure as described above. In short, I first conducted a by-participant analysis, followed by an assessment of test reliability. Then, I inspected item characteristics with a Rash analysis. I saved the learner 79 parameters from the Rasch model built upon a final data set after any item exclusion for further analysis. Yes-No RT Test This test provided both accuracy and reaction time data automatically logged by Gorilla. I first inspected the false alarm (incorrect responses to non-words) rates by participants. Any participants who had a false alarm rate higher than 0.50 were deemed as either not paying sufficient attention and/or did not have the proficiency levels to complete the study meaningfully. I coded their data as missing for the CFA. In terms of item quality, I followed X. Zhang et al. (2020) in fitting two separate Rasch models to the real and non-word data. Unsatisfactory items were removed for further analysis. 
I also computed reliability estimates for both data sets. The inspection procedure was the same as described above.

In analyzing the accuracy data, I consulted the literature on the scoring of the traditional, paper-based Yes-No test (e.g., Huibregtse et al., 2002; X. Zhang et al., 2020). I first computed the hit (correct responses to real words) and false alarm (incorrect responses to non-words) rates for each individual. As alluded to in the Measures section, it was important to take guessing into account when computing an index to reflect one's ability. I used individuals' false alarm rates (guesses on non-words) as an operational measure of guessing on real words (X. Zhang et al., 2020; cf. Stubbe, 2012). In the literature on scoring a paper-based Yes-No test, different formulas have been proposed to calculate a measure to index ability. These formulas include the simple hits-minus-false-alarms rule, the correction for guessing formula, the delta m, and the index of signal detection (see a review in Huibregtse et al., 2002). On the one hand, there appeared to be little consensus as to the best-performing scoring method (Pellicer-Sánchez & Schmitt, 2012), and few meaningful differences were found in Mochida and Harrington (2006). On the other, a recent study by X. Zhang et al. (2020) reported a range of correlations between Yes-No Test scores adjusted by these different formulas and two reference tests: an MC Vocabulary Size Test and a translation task. The overall correlations ranged from .289 (when using delta m) to .621 (when using the index of signal detection). An additional consideration in their study was the false alarm rates observed in the sample. The authors computed correlations separately for participants with high and low false alarm rates. As expected, the group with a high false alarm rate showed much weaker correlations (.297 - .530) than the group with a low false alarm rate (.636 - .708). This was because participants in the high false alarm group were guessing more randomly; hence, there was more noise in their data, weakening the correlations. In both the group and overall analyses, the index of signal detection had the strongest correlations with the vocabulary size test, which was essentially the form-meaning receptive test in the present study. For this reason, I chose this formula to compute individuals' performance for further CFA analysis:

I_SDT = 1 − [4h(1 − f) − 2(h − f)(1 + h − f)] / [4h(1 − f) − (h − f)(1 + h − f)]

where h is the hit rate and f is the false alarm rate.

In analyzing the reaction time data, I included only the correct, real-word trials, following standard practice in psycholinguistic research, because recognition of words and non-words involves different processes. I further trimmed the spuriously short (< 300 ms) and long (> 2500 ms) reaction times (Jiang, 2013) because they do not reflect the lexical processing under investigation. I then computed a mean reaction time for each participant for further analysis.

Masked Repetition Priming

Although the priming effects were the primary focus of this task, I first inspected the false alarm rate of individual participants to exclude anyone who guessed on more than 50% of the non-words. With regard to the priming effects, the present task has mostly been used in experimental contexts where a group-level effect is examined (e.g., Draheim et al., 2019; Rouder & Haaf, 2019; Siegelman et al., 2017).
Psychometric properties of experimental tasks, such as this one, have recently raised concerns in terms of their level of reliability (Draheim et al., 2019). In more general terms, Siegelman et al. (2017) discussed three design features of experimental tasks that make them perform somewhat unreliably: first, there are often a small number of trials, leading to little room for between-participant variance, which is an important analytical component in individual differences research. Second, participants oftentimes perform at chance levels on some, if not most, items. When that is the case, any expected effects, especially those with small effect sizes, can be buried in the random variability (noise) of the data. Finally, most of the items often have similar levels of difficulty, which limits the ability of the tasks to differentiate participants’ ability. More directly relevant to the present task is the discussion of the same issue in reaction time-based research in behavioral science by Draheim et al. (2019). In general, researchers often use a difference in reaction time to demonstrate an effect. In the present task, for example, participants were expected to respond faster when the target was preceded by a related, repetition prime than when preceded by an unrelated prime. Then, the difference in 82 reaction time could be computed as that in the unrelated condition minus that in the related condition, whereby I expected a positive net value (i.e., faster responses in the related trial). Draheim et al. (2019) pointed out that the primary issue with using a difference reaction time is that “as the correlation between the two component scores increases [reaction times in each condition], the reliability of the resulting difference score decreases” (p. 511, emphasis original). It is because subtraction removes systematic variance due to the participant’s characteristics, hence increasing the proportion of error variance in the data (Draheim et al., 2019). For these reasons, the authors concluded that “difference scores are poorly suited for this purpose [of individual differences research]” (Draheim et al., 2019, p. 513). To mitigate the potential issue at hand (i.e., a potential low reliability), I adopted three strategies: first, I minimized random variability in the data through an item analysis. Second, I engaged in mixed-effects modeling at the trial level (Rouder & Haaf, 2019). Finally, when constructing the mixed models, I engaged in model criticism to maximize model fit (Baayen & Milin, 2010) as a means to further reduce random variability in the data. As pointed out by Siegelman et al. (2017), individual items might elicit random performance from participants. Although this task had been piloted and the results were encouraging, the reliability of results might still benefit from removal of unsatisfactory items (reducing noise in the data). However, there have not been explicit guidelines that psycholinguists rely on in terms of item inspection, in part because researchers are often interested in group-level effects. Therefore, to inspect item quality, I drew on the recommendations of using mixed-effects modeling in analyzing these data by Rouder and Haaf (2019). 83 Mixed-effects modeling is a multi-equation technique which has the ability to model and account for dependency between observations due to a nested data structure (e.g., Gelman & Hill, 2007; Hox et al., 2018). 
A nested data set is one where lower-level observations correlate with each other because they are associated with a higher-level unit. The typical textbook example in educational contexts is one where students (level-1 units) are nested within classes (level-2 units). In psycholinguistic research, one participant contributes multiple data points to the data set (as they respond to multiple items); at the same time, each item elicits multiple responses from the sample of participants. In that case, the data can be said to be cross-nested, and random effects by participants and by items should be incorporated simultaneously in the data analysis in order to account for the dependency (e.g., Baayen et al., 2008). Conceptually, the mixed model estimates an intercept and a slope value (as well as the correlation between them) for each participant and item (level-2 units). The intercept value represents the mean (transformed) reaction time for the participant or item in the reference condition (i.e., the related condition for the present study). The slope value represented the simple effect of condition (i.e., the difference in reaction time between the related and unrelation conditions) for a particular participant or item as deviated from the fixed effects estimate. Rouder and Haaf (2019) demonstrated that estimates associated with participants tend to be more accurate and reliable when the item-related variability is incorporated in the model as random effects. Put differently, after accounting for item characteristics, researchers can more accurately, reliably model an effect at an individual’s level. To do that, one can fit a maximal model where all relevant random intercepts and slopes (both by participants and by items) are entered and allowed to vary (e.g., Barr et al., 2013). This analytical approach also 84 lends itself as tool for a model-based item inspection. In the case of a maximal model, the random slopes for condition by items represent the effects of condition for a specific item. Then, one can inspect the random slopes by items to identify (and potentially remove) which items elicit random responses from the sample of participants. However, excessive item removal could result in an undesired decrease in reliability. Therefore, caution was exercised in balancing identifying and removing potential noise in the data and retaining as much statistical information as possible. In addition, Baayen and Milin (2010) recommended that researchers analyzing reaction time data should engage in model criticism (see Godfroid [2020a] for a similar recommendation for analyzing reading time data obtained from eye tracking). Model criticism is a strategy to treat outliers in reaction time data. Researchers first fit a mixed model to the data. From this initial model, researchers remove observations with “an absolute standardized residual exceeding 2.5 standard deviations” (Baayen & Milin, 2010, p. 17). This strategy is compared favorably with traditional outlier removal procedures (e.g., removing data points that are beyond 2.5 SD of the mean before a model is fitted). The authors demonstrated the advantage of model criticism in terms of the number of observations needed to be removed (or ability to retain as much information as possible) and model fit (i.e., R2 values). The key rationale behind this strategy is that removal of outliers is more principled when informed by a model, taking all relevant conditional effects into account in a parsimonious manner. 
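A minimal sketch of this model criticism step, assuming a fitted lme4 model such as the maximal model m1 sketched earlier (the object names are again illustrative), might look as follows.

```r
# Model criticism (Baayen & Milin, 2010): refit the model after removing
# observations whose absolute standardized residual exceeds 2.5.
keep <- abs(resid(m1, scaled = TRUE)) < 2.5
m1_criticized <- update(m1, data = model.frame(m1)[keep, ])

# Random-effect estimates from the refitted model:
# by-item condition slopes (deviations from the fixed effect) for item inspection,
# and by-participant slopes as the individual-differences measure for this task.
ranef(m1_criticized)$item
ranef(m1_criticized)$participant
```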
On this account, this model-based analytic procedure has the potential to address the issue of reliability associated with experimental tasks. This is because a better model fit (as a result of engaging in model criticism) means less random variability in the data, potentially leading to more accurate 85 estimates of the parameters associated with the participants, which represent the individual differences measure for the present task. Taken together, the ultimate goal here was to minimize random errors in the model from which I saved the calibrated participant-related estimates as a measure for this task. This was achieved by an item inspection procedure and use of mixed-effects models for which I engaged in model-based outlier removal. In analyzing the data, I only included correct responses and the trimmed reaction times using the same thresholds as other reaction time tasks (i.e., 300 ms < RT < 2500 ms). Then, I constructed mixed effects models according to the modeling procedure reported in the Measures section. Then, I engaged in model criticism and refitted the model with outliers removed. From the refitted model, I inspected the random slopes for each item. I also split the data into two halves and repeated the same procedure to obtain split-half reliability for the by- participant random slopes. In an iterative manner, I assessed the impact of item removal on the split-half reliability and decided on the final data set (i.e., to remove the item in question or not). I returned to the mixed model constructed from this final data set and saved the by- participants slopes for further CFA analysis. Semantic Priming The data analysis procedure was identical to that for the masked repetition priming. Main CFAs and SEMs Using the results of the individual tasks, I engaged in a confirmatory factor analysis to address the first research questions (RQ1a and RQ1b). Confirmatory factor analysis is often contrasted with exploratory factor analysis, both of which can be regarded as a member of a 86 family of techniques known as factor analysis (e.g., Loewen & Gönülal, 2015). The main use of factor analysis is to lay bare the underlying relationships between a set of (observed) variables. These relationships are believed to be driven by a more parsimonious number of latent (unobserved) variables, known as factors (e.g., Brown, 2015). By specifying the factor structure (e.g., a one- vs. two-factor solution) as informed by prior evidence and theory, researchers are in a position to test the psychometric dimensionality of a set of measures, lending itself an appropriate tool for construct validation research (R. Ellis & Loewen, 2007; Isemonger, 2007; Vafaee & Kachinske, 2019). In addition, researchers can take advantage of the modeling flexibility in CFA to evaluate construct validity in the light of different assessment methods by introducing (and partially out) method effects in the model (Brown, 2015). In terms of model specification, I initially had a total of six measures: (1) the form- meaning receptive test (Recep), (2) the form-mean productive test (Prod), (3) the Yes-No RT test – accuracy data (Yes-No-Acc), (4) the Yes-No RT test – reaction time data (Yes-No-RT), (5) the masked repetition priming task (RepPrim), and (6) the semantic priming task (SemPrim). As reported, I dropped the semantic priming task (see results in Chapter 3 and discussion in Chapter 5) because overall priming was not observed. Therefore, I had a total of five indicators in the model. 
In the variance-covariance matrix, then, there were 15 pieces of information (five variances on the diagonal and 10 covariances [4 + 3 + 2 + 1] between the indicators off the diagonal). For identification purposes, the CFA models could have at most 14 freely estimated parameters so that the model is over-identified for model fit evaluation (i.e., assessing the extent to which the model is an acceptable representation of the data).

In order to address RQ1a, the extent to which there were two separate dimensions of lexical knowledge based on awareness (explicit vs. implicit), I initially planned to fit a two-factor model (one factor labelled as explicit and another labelled as implicit). In this model, the four explicit measures (Recep, Prod, Yes-No-Acc, and Yes-No-RT) would load onto the explicit factor, while the two priming tasks would load onto the implicit factor. This two-factor model would be compared against the rival one-factor model where all six measures loaded onto one single vocabulary factor. However, the semantic priming task was dropped as reported in the Measures section. This decision led to identification issues with the implicit factor because it then had only one indicator remaining. As a result, two models were tested to shed light on the research question, but they were limited in terms of offering conclusive evidence. In model CFA-M1, all four explicit measures loaded onto a single latent variable in a one-factor model. I labelled the factor as vocabulary. Then, in CFA-M2, I added repetition priming to examine potential evidence of unidimensionality (i.e., to what extent this implicit measure can be placed on the same psychometric dimension as the other explicit tasks). In this model, I allowed the residuals of the mean RT and the repetition priming measure to correlate because both measures were reaction time-based. This additional parameter represented a multitrait–multimethod approach to partial out any method effects (e.g., Brown, 2015). Although a direct model comparison between CFA-M1 and CFA-M2 would be inappropriate (because the two models had different sets of indicators), the model fit of these two models could still provide some useful information. Specifically, if CFA-M2 could not produce a good fit, there would then be evidence against a unidimensionality view. To address RQ1b, I fitted a two-factor model (CFA-M3) where the untimed tasks (Recep and Prod) loaded onto a knowledge factor and the timed measures (Yes-No-Acc, Yes-No-RT, and RepPrim) loaded onto a strength factor. Again, method effects were accounted for by allowing the residual terms of the two reaction time-based measures to correlate. The rival model to this one was the one-factor CFA-M2 described above. I summarize the hypothesized CFA models in Table 15.

Table 15
Summary of Hypothesized CFA Models

Explicit vs. implicit
Model                  Vocabulary                                    Method Effects
CFA-M1 (one factor)    Recep; Prod; Yes-No-Acc; Yes-No-RT
CFA-M2 (one factor)    Recep; Prod; Yes-No-Acc; Yes-No-RT; RepPrim   Yes-No-RT ~~ RepPrim

Timed vs. untimed
Model                  Vocabulary                                    Knowledge      Strength                          Method Effects
CFA-M2 (one factor)    Recep; Prod; Yes-No-Acc; Yes-No-RT; RepPrim                                                    Yes-No-RT ~~ RepPrim
CFA-M3 (two factors)                                                 Recep; Prod    Yes-No-Acc; Yes-No-RT; RepPrim    Yes-No-RT ~~ RepPrim

In terms of RQ2, concerning the predictive validity of the different vocabulary constructs, I used a full structural equation modelling (SEM) approach where I regressed self-reported proficiency on the CFA (measurement) models described above.
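To illustrate how such specifications translate into analysis code, the following is a minimal lavaan sketch of CFA-M2, CFA-M3, and a proficiency regression of the kind described next. The data frame and indicator names (task_scores, Recep, Prod, YesNoAcc, YesNoRT, RepPrim, Prof) are shorthand assumptions rather than the actual variable names used in the study.

```r
library(lavaan)

# CFA-M2: one vocabulary factor, with a residual correlation between the two
# reaction time-based indicators to absorb shared method variance.
cfa_m2 <- '
  vocabulary =~ Recep + Prod + YesNoAcc + YesNoRT + RepPrim
  YesNoRT ~~ RepPrim
'

# CFA-M3: knowledge (untimed) vs. strength (timed) factors, same method effect.
cfa_m3 <- '
  knowledge =~ Recep + Prod
  strength  =~ YesNoAcc + YesNoRT + RepPrim
  YesNoRT ~~ RepPrim
'

# A structural extension in which the two factors predict self-reported proficiency.
sem_m3 <- paste(cfa_m3, 'Prof ~ knowledge + strength', sep = "\n")

# Robust maximum likelihood with case-wise (full-information) handling of missing data.
fit_cfa_m2 <- cfa(cfa_m2, data = task_scores, estimator = "MLR", missing = "fiml")
fit_sem_m3 <- sem(sem_m3, data = task_scores, estimator = "MLR", missing = "fiml")

fitMeasures(fit_cfa_m2, c("chisq", "df", "pvalue", "rmsea", "cfi", "srmr"))
summary(fit_sem_m3, standardized = TRUE, rsquare = TRUE)
```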
Specifically, in SEM-M1a, I used CFA-M1 to predict self-reported proficiency. I included the repetition priming measure as an observed variable (an additional predictor) in SEM-M1b. As the next step, I fitted SEM-M2, where I used CFA-M2 to predict self-reported proficiency. With regard to the knowledge vs. strength distinction, I fitted SEM-M3, where the two factors in CFA-M3 predicted self-reported proficiency. Finally, all five measures were used as observed variables to predict proficiency in SEM-M4 (i.e., the same as a multiple regression model where the outcome was proficiency, and the five measures were predictors).

In estimating all CFA and SEM models, I used case-wise (full information) maximum likelihood estimation with robust (Huber-White) standard error correction. The case-wise estimation allows a participant's data to contribute to the model even when they have missing data for a specific task, and the robust standard errors mitigate the impact of non-normality in the data (e.g., Brown, 2015; Kline, 2015). To address the first research questions, I relied on evidence of good model fit. I examined both global and local fit (for the CFA models). For global fit, I used the following fit statistics: model chi-square with its degrees of freedom and associated p value (> .05), Root Mean Square Error of Approximation (RMSEA) and its confidence intervals (< .05), Comparative Fit Index (CFI) (> .95), and Standardized Root Mean Square Residual (SRMR) (< .08) (e.g., Hu & Bentler, 1999). In addition to the global indices, I also inspected the factor loadings, the standardized residuals (< |1.96|), and the modification indices (< 3.84) to assess potential local misfit. For the second research question, I inspected the R² values for the self-reported proficiency measure. These values represented the explanatory power of a given conceptualization of the vocabulary construct in accounting for the outcome (i.e., proficiency). For illustration, I visualize CFA-M2 and SEM-M3 in Figure 2 and Figure 3, respectively.

Figure 2
Visualization of CFA-M2

Figure 3
Visualization of SEM-M3

Analysis Software Packages

In Table 16, I list all software packages used in data wrangling and analysis:

Table 16
Summary of Software Packages Used
Tasks                      Package Name (Version)   Reference
General use                R (4.0.2)                R Core Team (2020)
General use                RStudio (1.3.1093)       RStudio Team (2020)
Data wrangling             tidyverse (1.30)         Wickham et al. (2019)
Data visualization         ggplot2 (3.3.3)          Wickham (2016)
Summary and reliability    psych (2.0.12)           Revelle (2020)
Mixed-effects modeling     lme4 (1.1-26)            Bates et al. (2015)
                           lmerTest (3.1-3)         Kuznetsova et al. (2017)
                           performance (0.7.0)      Lüdecke et al. (2020)
                           brms (2.14.4)            Bürkner (2018)
Rasch analysis             eRm (1.0-2)              Mair and Hatzinger (2007)
CFA                        lavaan (0.6-7)           Rosseel (2012)

CHAPTER 3: RESULTS (INDIVIDUAL TASKS)

In this chapter, I report the results of the data analysis of the five individual tasks. These results also represent the data preparation process towards the main CFA and SEM, which will be covered in the next chapter. As mentioned, the goal of these analyses was to maximize the accuracy and reliability of the measure indexing participants' performance on each test.

The Form-Meaning Receptive Test

Based on the by-participant analysis, one participant was excluded based on below-chance performance (chance level: 25%). At the same time, the analysis suggested a ceiling effect, creating a left-skewed distribution (see the upper panel of Figure 4). I present the descriptive statistics in Table 17.
Table 17
Descriptives for the Form-Meaning Receptive Test (N = 144)

                        Mean (SD)       Range     [95% CIs]
Total Score (K = 40)    37.27 (3.95)    15 – 40   [36.62, 37.92]
Total Score (K = 35)    32.34 (3.81)    11 – 35   [31.71, 32.97]

I first inspected the item variances before computing reliability measures. All items had some variance. With this full 40-item data set, Cronbach's alpha was .90, and McDonald's omega was .91. At first sight, these figures suggested very good reliability, but caution was exercised because of the ceiling effect observed. In other words, many items could potentially be redundant, causing an undesired inflation of the reliability (Wright & Linacre, 1994). With this in mind, I proceeded to the dichotomous basic Rasch analysis. I present the person-item map in Figure 4. The upper panel represents the left-skewed distribution of the person ability parameters, indicating the ceiling effect. In the lower panel, item difficulty is visualized in relation to the amount of person ability required to answer the item correctly (x-axis). Therefore, items whose data points are plotted toward the left of the graph are relatively easy items (or items that require less person ability to answer correctly), while the difficult items have their data points toward the right-hand side of the graph. Due to the ceiling effect, most test takers had an ability higher than the item difficulty of most items. At the same time, there was still a range of item difficulty, with some potentially redundant items (being too easy) clustering in the top right corner of the panel.

In terms of the item fit statistics, given 40 items in this data set, the upper threshold was 1.95 (1 + 6/√40 = 1.95), following Smith et al. (1998). Inspecting the outfit statistics first, following recommendations by Wright and Linacre (1994), one item (stone) was underfitted, with its outfit statistic exceeding the threshold of 1.95 (outfit = 2.67), potentially indicating that this item was often guessed correctly. In terms of the lower bound of the statistics (signaling overfitting [or redundancy]), I first adopted the commonly used threshold of 0.5 (Aryadoust et al., 2020). Four items were below 0.10. They corresponded to the items clustering in the top right corner of the person-item map in Figure 4. Another 13 items had values that ranged from 0.11 to 0.49. As mentioned, these items with a low item fit statistic represented redundant items. However, they did not distort the measurement system (Wright & Linacre, 1994). I also inspected the infit statistics, which showed no additional issues.

Since item removal generally results in a lower test reliability, I examined the impact of removing different numbers of items. I first removed stone, based on its high outfit value, together with the four items with overly low outfit values. In total, then, five items were removed, leaving 35 items on the test. With this data set, Cronbach's alpha was .89, and McDonald's omega was .90. I fitted the Rasch model to this 35-item data set again. Outfit values ranged from 0.11 to 1.92. Twelve items fell outside the critical threshold range of 0.5 to 2.01, but only on the lower end. I then assessed the extent to which further removal of these items would be beneficial. Reliability estimates for the resulting 23-item test dropped to .79 for both Cronbach's alpha and McDonald's omega. In addition, the total proportion of variance explained by the Rasch model dipped slightly to 37.46% (down from 39.94% in the 35-item test).
Although a reliability of .79 for the 23-item test can still be deemed acceptable, I decided to use the data set with 35 items because it had a better Rasch model fit. I present the person-item map for the Rasch analysis with this data set in Figure 5. At the same time, I acknowledge that some items with relatively low outfit statistics could inflate the reliability of the 35-item test. I saved the person ability parameters, which correlated with the raw scores at .90, for further analysis.

The Form-Meaning Productive Test
In total, there were 1,086 unique responses across all 40 items. To construct a coding scheme, a second rater and I assessed the acceptability of each of these responses separately with reference to the general rule (i.e., only minor spelling and inflectional errors were to be accepted). The initial agreement rate was 97.4%. Cohen's Kappa (interrater reliability) was .91 after accounting for chance agreement. We revisited all instances of disagreement. For all but four disagreements, we were able to self-correct the coding and reach an agreement without a discussion. The four occasions that required a discussion involved participants putting in more than one word and mistakes concerning the -ed vs. -ing forms (e.g., paved vs. paving). We resolved these issues, and the resulting coding scheme was implemented in the scoring of the test using the programming language R.

Table 18
Descriptives for the Form-Meaning Productive Test (N = 138)

                        Mean (SD)       Range     [95% CIs]
Total Score (K = 40)    29.05 (6.95)    11 – 40   [27.9, 30.2]
Total Score (K = 38)    27.19 (6.92)    9 – 38    [26.02, 28.35]

The by-participant analysis suggested that seven participants scored very low on the test (raw scores of 0 to 8 out of 40 items). I excluded them because their attention and/or proficiency levels were questionable. Then, I generated the descriptives presented in Table 18. All items had some item variance for the computation of reliability measures. Cronbach's alpha and McDonald's omega were both at .88.

From the Rasch analysis, two items (upset and cap) were underfitted, with outfit statistics exceeding the upper threshold (given 40 items, 1 + 6/√40 = 1.95) (Smith et al., 1998). Two items (microphone and crab) were identified as redundant, with outfit statistics ranging from 0.38 to 0.45. As with the other tasks, I only removed the underfitted items (k = 2), resulting in a 38-item data set. I then repeated the accuracy analysis. Cronbach's alpha did not change, while McDonald's omega slightly increased to .89. No item fit issues were found in the refitted Rasch model (outfit statistics ranged from 0.37 to 1.61, with the model accounting for 75% of the variance). I saved the person ability parameters for further analysis. I present the person-item map in Figure 6.

Figure 4 Person-Item Map for the Receptive Test (40-Item)
Figure 5 Person-Item Map for the Receptive Test (35-Item)
Figure 6 Person-Item Map for the Productive Test (38-Item)
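The dichotomous Rasch analyses for the receptive and productive tests (and, below, for the Yes-No accuracy data) followed the same general workflow. A minimal sketch of that workflow in eRm, the Rasch package used in this study, is given below; the 0/1 response matrix resp (participants in rows, items in columns) and the specific calls shown are illustrative assumptions rather than the original analysis script.

```r
library(eRm)

# Fit a dichotomous Rasch model to the person-by-item response matrix
rasch_fit <- RM(resp)

# Person ability parameters, later saved as the task measure for the CFA/SEM
pp <- person.parameter(rasch_fit)
ability <- coef(pp)

# Item fit: outfit (and infit) mean squares, screened against the
# sample-size-based thresholds (e.g., 1 + 6 / sqrt(k) for the upper bound)
fit_stats <- itemfit(pp)
fit_stats$i.outfitMSQ

# Person-item map, as visualized in Figures 4-6
plotPImap(rasch_fit, sorted = TRUE)
```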
Yes-No RT Test
Accuracy Data
Based on the by-participant analysis, seven participants (out of 145) were excluded based on a false alarm rate larger than 0.50, indicating a high level of guessing on the non-words (false alarm rates ranged from 0.58 to 0.90). As with the form-meaning tests, the by-participant analysis suggested a ceiling effect, creating a left-skewed distribution in both the word and non-word data. I present the descriptive statistics in Table 19.

Table 19
Descriptives for the Yes-No RT Test (Accuracy Data) (N = 138)

                                  Mean (SD)       Range     [95% CIs]
Total Score (Word, K = 38)        34.46 (4.31)    16 – 38   [33.73, 35.18]
Total Score (Word, K = 33)        29.65 (4.15)    13 – 33   [28.95, 30.35]
Total Score (Non-word, K = 40)    36.24 (4.41)    20 – 40   [35.50, 36.98]
Total Score (Non-word, K = 37)    33.30 (4.29)    17 – 37   [32.58, 34.03]

Before computing the reliability measures, I removed two items from the word data (input and vocabulary). They had no item variance in that all participants responded to them correctly. No non-words needed removal on the basis of a lack of item variance. With the word and non-word data, Cronbach's alpha and McDonald's omega estimates were all at .86.

In the first round of Rasch analysis with the word data, three items (restore, drawer, and maintain) were underfitted, with outfit statistics at 2.09, 2.16, and 4.90, exceeding the upper threshold (given 38 items, 1 + 6/√38 = 1.97) (Smith et al., 1998). One item (stone) was overly redundant, with an outfit statistic far under the 0.5 threshold at 0.09. It represented an item that was too easy for participants (see also the person-item map in Figure 7). An additional two items (cap and peel) were also identified as redundant, with outfit statistics ranging from 0.28 to 0.48. As with the other tasks, I only removed the underfitted (k = 3) and overly redundant (k = 1) items, resulting in a 34-item data set for real words. I then repeated the accuracy analysis. With these 34 items, both Cronbach's alpha and McDonald's omega were at .86 (the same as in the previous round). The Rasch model further flagged dash as underfitted (outfit statistic = 2.25, which was larger than the threshold of 2.03 given 34 items). Removing it resulted in a slight dip in Cronbach's alpha to .85, while McDonald's omega stayed at .86. Therefore, I settled on this 33-item data set. The Rasch model constructed with this data set showed acceptable item fit, with outfit statistics ranging from 0.33 to 1.81. The total proportion of variance explained by the Rasch model was 51%. I then used this 33-item data set to compute a hit rate (correct responses to words), which was then a basis for the index of signal detection to be submitted to further analysis.

In terms of the non-word data, no item needed removal due to a lack of variance. The initial reliability estimates were both at .86. Rasch analysis indicated that one item (skoign) was underfitted (outfit = 2.60). Seven items could be considered redundant (outfit ranging from 0.26 to 0.49). Since these outfit statistics were not overly low, I adopted a 0.3 cut-off (an admittedly arbitrary value) to balance retaining as many items as possible against reducing noise in the data set. As a result, two items (prarns and zolved) were additionally removed. After removal, the two reliability measures dipped slightly to .85 (from .86 with the full set of 40 non-words). I refitted the Rasch model, and all items had acceptable fit (outfits: 0.35 – 1.57). The total proportion of variance explained by the Rasch model was 51%. I then used this 37-item data set to compute a false alarm rate (incorrect responses to non-words), which was then the basis for the index of signal detection to be submitted to further analysis.
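For illustration only, the sketch below shows one common way of combining a hit rate and a false alarm rate from a Yes-No task into a signal detection index in R. The specific index adopted in this dissertation is described in the Measures chapter; d-prime with a standard correction for proportions of 0 or 1 is used here purely as an assumed example, and the data frame yn and its column names are hypothetical.

```r
# Hypothetical per-participant data frame 'yn' with columns:
# hits, n_words (correct responses to words / number of word items),
# fas, n_nonwords (incorrect responses to non-words / number of non-words)
compute_dprime <- function(hits, n_words, fas, n_nonwords) {
  # Adjust extreme proportions to avoid infinite z-scores
  hit_rate <- pmin(pmax(hits / n_words, 1 / (2 * n_words)), 1 - 1 / (2 * n_words))
  fa_rate  <- pmin(pmax(fas / n_nonwords, 1 / (2 * n_nonwords)), 1 - 1 / (2 * n_nonwords))
  qnorm(hit_rate) - qnorm(fa_rate)
}

yn$dprime <- with(yn, compute_dprime(hits, n_words, fas, n_nonwords))
```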
Reaction Time Data
The full word data set (K = 40) was used for this analysis. Before the data analysis, the seven participants who had a false alarm rate larger than 0.50 were excluded, as discussed above. Retaining only correct responses resulted in a loss of 489 observations (9%), echoing the accuracy data reported above. Trimming of reaction times (i.e., 300 ms < RT < 2500 ms) removed 149 observations (3%). After this procedure, one participant had no data left. In Table 20, I present the descriptive statistics for the reaction times. I further divided the data set into two halves (evenly across frequency bands) to compute a split-half reliability. The correlation between the mean raw reaction times of the two halves was .87. I then recombined the data set and submitted the overall mean reaction times for further analysis.

Table 20
Descriptive Statistics for the Yes-No Test (Reaction Time Data) (N = 137, K = 40)

                      Mean (SD)    [95% CIs]
Reaction Time (ms)    755 (309)    [746, 763]

Figure 7 Person-Item Map for the Yes-No RT Test - Word Data (33-item)
Figure 8 Person-Item Map for the Yes-No RT Test - Non-Word Data (37-item)

Masked Repetition Priming
In this task, 16 participants had a false alarm rate larger than 0.50, ranging from 0.53 to 0.91. These values suggested that they were largely guessing on the non-words and/or did not have the proficiency levels that this study targeted. For this reason, I removed them from further analysis. The overall accuracy rates for all trials (words and non-words), word trials (critical and filler trials), and critical trials were 82% (SD = 38%), 85% (SD = 36%), and 90% (SD = 31%), respectively. The overall false alarm rate (incorrect responses to non-words) was 21% (SD = 41%).

In terms of item screening, I modeled the trimmed (300 ms < RT < 2500 ms), reciprocally transformed reaction times (-1/RT) using a mixed-effects model with maximal random effects (Barr et al., 2013) and with condition (related [0] vs. unrelated [1]) as the fixed effect. The iterative process of item inspection and model fitting suggested that the split-half reliability for the by-participant random slopes peaked when all items were retained. Since the goal here was to obtain a reliable measure of learner ability, I decided to use the full, 40-item data set to build the final model for this task. Descriptive statistics and the model summary are presented in Table 21 and Table 22.

Table 21
Descriptive Statistics for the Masked Repetition Priming Task

          Mean RT in ms (SD)                                                          Split-half Reliability for
          Related (e.g., maintain – MAINTAIN)   Unrelated (e.g., announce – MAINTAIN)   By-Participant Random Slopes
K = 40    654 (232)                             686 (231)                               .962
K = 39    642 (204)                             681 (211)                               .961

Table 22
Summary of Mixed Models – Masked Repetition Priming Task (m1, K = 40)

Fixed effects                         Estimate (SE) a    t (p)
Intercept                             -1.66 (0.03)       -58.24 (< .001)
Condition                             0.10 (0.01)        11.85 (< .001)

Random effects                        Variance (SD)      Intercept-slope correlation
By-participant intercept              0.08 (0.29)
By-item intercept                     0.01 (0.08)
By-participant slope for condition    0.001 (0.03)       -.97
By-item slope for condition           0.001 (0.03)       -.18
Residual                              0.10 (0.2)

Number of observations: 8271
Note. a All estimates were multiplied by 1000 for easier reading.
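A minimal sketch of this mixed-effects analysis in lme4 is given below. The data frame prime, its column names, and the extraction of by-participant random slopes as the individual differences measure mirror the analysis described above but are illustrative assumptions, not the original script.

```r
library(lme4)

# Trim reaction times and apply the reciprocal transformation (-1/RT);
# condition is coded related = 0, unrelated = 1
prime <- subset(prime, rt > 300 & rt < 2500)
prime$inv_rt <- -1 / prime$rt
prime$cond   <- ifelse(prime$condition == "related", 0, 1)

# Maximal random-effects structure (Barr et al., 2013)
m1 <- lmer(inv_rt ~ cond + (1 + cond | participant) + (1 + cond | item),
           data = prime)

# The by-participant random slopes for condition index each learner's
# priming effect and serve as the task measure submitted to the CFA/SEM
slopes <- ranef(m1)$participant[, "cond"]
```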
Semantic Priming
The by-participant accuracy analysis showed that 25 participants had a false alarm rate (incorrect responses to non-words) higher than 0.50, ranging from 0.51 to 0.98. I removed the data of these participants because their high level of guessing on the non-words cast doubt on their attention levels during the experiment and/or on whether their proficiency levels were suitable for the task. The overall accuracy rates for all trials (words and non-words), word trials (critical and filler trials), critical trials (prime and target trials), and target trials were 85% (SD = 36%), 87% (SD = 33%), 93% (SD = 26%), and 96% (SD = 19%), respectively. The false alarm rate (incorrect responses to non-words) was 21% (SD = 41%). In addition, it is noteworthy that the accuracy for the prime trials (M = 89%, SD = 32%) was descriptively lower than that for the target trials (M = 96%, SD = 19%), which was expected because the prime words were the critical words in the study. Therefore, the priming effect can be attributed to knowledge of the prime words (i.e., the critical words of the study) and did not depend on participants' knowledge of the target words in the experiment, which elicited ceiling-level accuracy.

In terms of item and reliability inspection, the full data set resulted in the highest level of reliability. I used this data set to build the mixed model from which I saved the by-participant random slopes as the measure of this task. Descriptive statistics and the model summary are presented in Table 23 and Table 24. Overall priming was not observed, meaning that, as a group, these participants did not appear to have robust semantic networks for this whole set of stimuli. Despite the high split-half reliability for the by-participant random slopes, the lack of priming cast doubt on what was being measured. For this reason, I discarded this measure in the following analysis.

Table 23
Means and Standard Deviations of Reaction Times for Critical Words Between Conditions (Semantic Priming)

            Mean RT in ms (SD)                                                    Split-half Reliability for
            Related (e.g., patience – calm)   Unrelated (e.g., chestnut – calm)   By-Participant Random Slopes
K = 40      615 (219)                         622 (218)                           .94
K = 39 a    595 (172)                         608 (185)                           .85
K = 38 b    605 (187)                         612 (192)                           .90

Note. a Removing the item that elicited the most reverse priming; b removing the two items that elicited the most priming.

Table 24
Summary of Mixed Models – Semantic Priming Task (m1, K = 40)

Fixed effects                         Estimate (SE) a    t (p)
Intercept                             -1.78 (0.03)       -55.33 (< .001)
Condition                             0.02 (0.02)        1.03 (.31)

Random effects                        Variance (SD)      Intercept-slope correlation
By-participant intercept              0.08 (0.27)
By-item intercept                     0.02 (0.12)
By-participant slope for condition    0.001 (0.04)       .46
By-item slope for condition           0.02 (0.13)        -.64
Residual                              0.10 (0.31)

Number of observations: 7094
Note. a All estimates were multiplied by 1000 for easier reading.

Summary of Results for Individual Tasks
In Table 25, I summarize the results for the individual tasks, including the number of items and participants included in the final data set, the descriptive statistics, and the reliability estimates. I also present the correlation matrix for the task results in Table 26. These variables were standardized before being submitted to the confirmatory factor analysis.
Table 25
Summary of Individual Task Results

Task                             No. of Items               No. of Participants   Measure                                         Mean (SD) a       Reliability
Form-Meaning Receptive Test      35                         144                   Person-ability parameters from Rasch analysis   3.39 (1.27)       α = .89; ω = .90
Form-Meaning Productive Test     38                         138                   Person-ability parameters from Rasch analysis   1.48 (1.30)       α = .88; ω = .89
Yes-No RT Test (Accuracy)        word = 33; non-word = 37   138                   Index of Signal Detection Theory                0.71 (0.22)       α = .85, ω = .86 (word); α = .85, ω = .85 (non-word)
Yes-No RT Test (Reaction Time)   40                         137                   Mean Reaction Time (ms)                         772 (189)         Split-half = .87
Masked Repetition Priming        40                         129                   By-participant random slopes                    1.5e-04 (0.03)    Split-half = .96
Semantic Priming                 40                         120                   By-participant random slopes                    7.4e-05 (0.02)    Split-half = .94

Note. a Scientific (e) notation is used where applicable; for example, 1.2e-02 = 1.2 × 10^-2 = 0.012.

Table 26
Correlation Matrix for the Individual Task Results

            recep    prod    YesNoAcc   meanRT   RepPrim   SemPrim   prof
recep        1.00
prod         0.57    1.00
YesNoAcc     0.39    0.59    1.00
meanRT      -0.36   -0.39   -0.36       1.00
RepPrim      0.35    0.28    0.28      -0.55     1.00
SemPrim     -0.17   -0.05   -0.11       0.39    -0.59      1.00
prof         0.24    0.36    0.38      -0.24     0.25     -0.18     1.00

Notes. recep = Form-Meaning Receptive Test; prod = Form-Meaning Productive Test; YesNoAcc = Yes-No RT Test (Accuracy); meanRT = Yes-No RT Test (Reaction Time); RepPrim = Masked Repetition Priming; SemPrim = Semantic Priming; prof = self-rated proficiency.

CHAPTER 4: RESULTS (CFA AND SEM)

In this chapter, I report the results of the confirmatory factor analyses and structural equation models, which addressed my research questions. I will start by presenting the global model fit for all hypothesized models (i.e., the extent to which each model is an acceptable representation of the data), followed by an assessment of local fit.

RQ1a – Explicit vs. Implicit
In Table 27, I present the level of fit for all the hypothesized models. Both one-factor solutions (with [CFA-M2] and without [CFA-M1] repetition priming) produced a good fit, suggesting that lexical (implicit) knowledge measured by the repetition priming task can be placed on the same psychometric dimension as knowledge assessed by the other, explicit tasks. Since these two models included different indicators, it was not appropriate to compare model fit between them. In terms of local fit, both models had standardized factor loadings ranging from |0.50| to |0.91|, no standardized error variances larger than 1.96 (values ranged from 0.17 to 0.79), and no modification indices larger than 3.84 (values ranged from 0.03 to 2.70). I present the model summary of CFA-M2 in Table 28.

RQ1b – Knowledge vs. Strength
Similar to the rival one-factor model (CFA-M2), the two-factor solution (CFA-M3) also produced a good fit (see the fit indices in Table 27), suggesting that lexical strength as assessed by timed tasks can be a psychometrically distinct dimension from lexical knowledge measured by untimed vocabulary tests. However, a χ2 difference test revealed no significant difference in the χ2 statistics (p = .243). For parsimony, the one-factor solution should be favored. Other fit indices pointed to the same conclusion in that both the AIC and BIC had a slightly lower value for the one-factor model than for the two-factor solution (AIC: 1732 vs. 1731; BIC: 1782 vs. 1779). In addition, the correlation between the two factors in CFA-M3 was .915. This high correlation indicated that the knowledge measured by timed and untimed tasks was strongly related despite the potentially distinct dimensions.
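For reference, the model comparison reported here (the χ2 difference test plus AIC/BIC) can be obtained in lavaan roughly as sketched below; fit_m2 and fit_m3 refer to the fitted CFA-M2 and CFA-M3 objects from the earlier sketch and are assumptions of this illustration.

```r
# Chi-square difference test between the one- and two-factor solutions
# (with MLR estimation, lavaan applies a scaled difference test)
lavTestLRT(fit_m2, fit_m3)

# Information criteria for the two solutions
AIC(fit_m2); AIC(fit_m3)
BIC(fit_m2); BIC(fit_m3)
```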
In terms of local fit, no issues were found in the two-factor model. Standardized factor loadings ranged from |0.47| to |0.92|, no standardized error variances were larger than 1.96 (values ranged from 0.16 to 0.78), and no modification indices were larger than 3.84 (values ranged from 0.01 to 2.88). I present the model summary of CFA-M3 in Table 29.

Table 27
Summary of Confirmatory Factor Analysis and Structural Equation Models

Model                                             df    χ2 (p)            RMSEA [95% CI]          CFI     SRMR    Fit     R2 for Proficiency
Threshold for good fit                                  p > .05           < .05                   > .95   < .08
CFA-M1 (one factor with four explicit indicators) 2     2.409 (.300)      0.038 [0.000, 0.174]    0.998   0.019   Good
CFA-M2 (one factor with all indicators)           4     4.915 (.296)      0.040 [0.000, 0.137]    0.996   0.023   Good
CFA-M3 (two factors: knowledge and strength)      3     3.552 (.314)      0.036 [0.000, 0.149]    0.998   0.022   Good
SEM-M1a (CFA-M1 predicting proficiency)           5     4.609 (.465)      0.000 [0.000, 0.111]    1.000   0.022   Good    0.261
SEM-M1b (CFA-M1 plus RepPrim predicting
  proficiency)                                    9     60.569 (< .001)   0.212 [0.164, 0.265]    0.763   0.166   Poor
SEM-M2 (CFA-M2 predicting proficiency)            8     7.651 (.468)      0.000 [0.000, 0.095]    1.000   0.025   Good    0.271
SEM-M3 (CFA-M3 predicting proficiency)            6     4.063 (.668)      0.000 [0.000, 0.086]    1.000   0.019   Good    0.321
SEM-M4 (all five indicators predicting
  proficiency)                                    NA                                                              NA      0.190

Table 28
Model Summary of CFA-M2

Latent variables:          Estimate    Std.Err    z-value    P(>|z|)    Std.all
vocab =~
  recep                    1.000                                        0.767
  prod                     1.264       0.135      9.341      0.000      0.900
  YesNoAcc                 0.963       0.116      8.309      0.000      0.730
  meanRT                   -0.666      0.142      -4.703     0.000      -0.503
  repprim                  0.611       0.117      5.243      0.000      0.456

Covariances:
  meanRT ~~ repprim        -0.375      0.088      -4.280     0.000      -0.471

Intercepts:
  recep                    0.000       0.083      0.000      1.000      0.000
  prod                     -0.092      0.089      -1.029     0.303      -0.086
  YesNoAcc                 -0.036      0.086      -0.416     0.677      -0.036
  meanRT                   0.036       0.090      0.402      0.688      0.036
  repprim                  -0.054      0.092      -0.592     0.554      -0.053
  vocab                    0.000                                        0.000

Variances:
  recep                    0.409       0.060      6.806      0.000      0.411
  prod                     0.218       0.072      3.038      0.002      0.190
  YesNoAcc                 0.477       0.067      7.125      0.000      0.468
  meanRT                   0.764       0.094      8.113      0.000      0.747
  repprim                  0.829       0.136      6.080      0.000      0.792
  vocab                    0.584       0.129      4.537      0.000      1.000

Notes. recep = Form-Meaning Receptive Test; prod = Form-Meaning Productive Test; YesNoAcc = Yes-No RT Test (Accuracy); meanRT = Yes-No RT Test (Reaction Time); repprim = Masked Repetition Priming

Table 29
Model Summary of CFA-M3

Latent variables:          Estimate    Std.Err    z-value    P(>|z|)    Std.all
knowledge =~
  recep                    1.000                                        0.763
  prod                     1.301       0.152      8.577      0.000      0.918
strength =~
  YesNoAcc                 1.000                                        0.781
  meanRT                   -0.681      0.130      -5.224     0.000      -0.531
  repprim                  0.606       0.126      4.824      0.000      0.467

Covariances:
  meanRT ~~ repprim        -0.354      0.09       -3.950     0.000      -0.458
  knowledge ~~ strength    0.548       0.103      5.317      0.000      0.915

Intercepts:
  recep                    0.000       0.083      0.000      1.000      0.000
  prod                     -0.095      0.090      -1.051     0.293      -0.088
  YesNoAcc                 -0.035      0.086      -0.409     0.682      -0.035
  meanRT                   0.036       0.090      0.399      0.690      0.035
  repprim                  -0.051      0.092      -0.561     0.575      -0.050
  knowledge                0.000                                        0.000
  strength                 0.000                                        0.000

Variances:
  recep                    0.415       0.062      6.704      0.000      0.418
  prod                     0.183       0.079      2.307      0.021      0.157
  YesNoAcc                 0.396       0.090      4.397      0.000      0.389
  meanRT                   0.735       0.093      7.938      0.000      0.718
  repprim                  0.816       0.140      5.813      0.000      0.782
  knowledge                0.578       0.131      4.407      0.000      1.000
  strength                 0.622       0.135      4.609      0.000      1.000
Notes. recep = Form-Meaning Receptive Test; prod = Form-Meaning Productive Test; YesNoAcc = Yes-No RT Test (Accuracy); meanRT = Yes-No RT Test (Reaction Time); repprim = Masked Repetition Priming

RQ2a – Predictive Validity of a Single Vocabulary Construct
When vocabulary was used as a single construct (with and without repetition priming) to predict self-rated proficiency, both SEMs (SEM-M1a and SEM-M2) produced a good fit (see Table 27). The regression path from vocabulary to proficiency was also significant in both models (b = 0.680, SE = 0.115, p < .001 in SEM-M1a; b = 0.680, SE = 0.114, p < .001 in SEM-M2). In terms of the explanatory power of these models in accounting for proficiency, the addition of repetition priming to the vocabulary construct increased the R2 value for self-rated proficiency from 0.261 in SEM-M1a to 0.271 in SEM-M2. This result suggested that the repetition priming task did not explain much additional variance in proficiency. At the same time, this figure still compared favorably with that of the multiple regression model where the five measures served as the predictors (SEM-M4, R2 = 0.190). Interestingly, when repetition priming was treated as a separate observed variable (predictor), the model (SEM-M1b) produced a poor fit, indicating model misspecification (i.e., relationships between variables in the data were not well reflected in the specification of the model). In other words, repetition priming was related to self-rated proficiency only when (or to a larger extent when) it was incorporated into a single vocabulary construct.

RQ2b – Predictive Validity of Lexical Knowledge and Strength
SEM-M3, in which lexical knowledge and strength predicted proficiency, produced a good fit (see Table 27). Neither regression path was significant (knowledge -> proficiency: b = -0.167, SE = 0.637, p = .79; strength -> proficiency: b = 0.858, SE = 0.662, p = .195). This result indicated that neither lexical knowledge nor strength could account for unique variance in proficiency above and beyond the other. However, the R2 value for self-rated proficiency increased from 0.271 in SEM-M2 (with the one-factor solution as the predictor) to 0.321 when knowledge and strength were treated as separate factors. In other words, the overall explanatory power was larger when timed and untimed measures were conceptualized as distinct constructs.

Summary of Findings
To summarize, I found psychometric evidence that all five measures can belong to a single dimension, suggesting a unidimensional view of the vocabulary construct. At the same time, there are also signs indicating that lexical strength (as measured by timed tasks) can represent a distinct construct from lexical knowledge (as assessed by untimed tests). When vocabulary is modeled as these separate constructs, its explanatory power for self-rated proficiency is strongest.

CHAPTER 5: DISCUSSION AND CONCLUSION

In this chapter, I discuss the findings in relation to the research questions as well as methodological issues pertaining to the data analysis procedure. I will also suggest directions for further research before drawing a conclusion to close the present dissertation.

The Jury Is Out but…
In this dissertation, I set out to validate six word measures by examining (1) the alignment between their psychological (e.g., explicit vs. implicit word knowledge) and psychometric dimensionality as well as (2) their predictive validity under different conceptualizations of the vocabulary construct.
The present research represents the very first attempt in the field at extending vocabulary construct validation research to the domain of implicit vs. explicit and time-sensitive vs. non-time-sensitive word measures. In light of the present results, a straightforward answer is unlikely. On the contrary, the findings invite more research questions than they address. In terms of the distinction between implicit and explicit word knowledge, the one-factor solution (CFA-M2) suggests that the repetition priming task can be placed on the same psychometric dimension as the other, explicit vocabulary measures, supporting a unidimensional view of the vocabulary construct. However, in the absence of a two-factor model, the current findings remain silent on the extent to which there can be more than one dimension as far as implicit and explicit word knowledge is concerned. As for the distinction between lexical knowledge and strength (as measured by untimed and timed tasks, respectively), the picture is more complex. On the one hand, the two-factor model (CFA-M3) produces a good fit, suggesting that timed tasks can be considered psychometrically distinct from untimed tasks. On the other hand, the high correlation of .92 between the two factors signals caution, as it casts doubt on their discriminant validity. In addition, the rival one-factor model (CFA-M2) fits the data equally well, indicating that a one-factor representation of the data is also accurate and acceptable. For parsimony, the one-factor solution is preferred. Taken together, the present data set has provided evidence for a unitary view of vocabulary knowledge as judged by the factor structure. This view appears to hold for both the implicit vs. explicit and the knowledge vs. strength distinctions.

A Broader, Unitary View of Vocabulary Knowledge
The good-fitting one-factor model (CFA-M2) points to the legitimacy of viewing vocabulary knowledge as a unitary construct, provided this finding is consistently replicated in the future. Although the present study is the first to include reaction time-based measures, the idea of a unitary view of vocabulary is not new. Focusing exclusively on accuracy measures of word component knowledge, González-Fernández and Schmitt (2020) also reported a one-factor model, suggesting that one should view vocabulary as a single construct. In a similar vein, although Koizumi and In'nami (2020) reported a two-factor model for vocabulary size and depth, the two factors were highly correlated at .945. It is reasonable, then, to suggest that this high correlation signals a lack of discriminant validity between the two factors. Indeed, the authors' one-factor model also produced a good fit, although the two-factor model fitted statistically better. In the present study, despite suggestions that timed measures would tap into distinct dimensions of lexical knowledge, I found no overwhelming evidence for this view in the present data set.

In qualifying their results, González-Fernández and Schmitt (2020) stress the interconnection between different word knowledge components. They also suggest that "no aspect is learned in a way that is detached from the other aspects" (González-Fernández & Schmitt, 2020, p. 498). Put differently, then, development in one aspect of word knowledge is likely to facilitate that of the other aspects.
Similar claims may apply to the present context, where explicit and implicit word knowledge, as well as lexical knowledge and strength, should be viewed as intimately connected. Although the present study remains silent on the potential developmental trajectories of individual aspects of the vocabulary construct, a unitary view can lead some to believe that they are acquired somewhat similarly. At the least, acquisition of a certain aspect (e.g., explicit knowledge) should play a facilitating role in the learning of other aspects (e.g., implicit knowledge). Indeed, Elgort (2011) showed that intentional word learning can result in acquisition of word meaning as measured by implicit tasks. A similar role of direct instruction in the development of implicit collocation knowledge is also reported by Toomer and Elgort (2019). Note that this view of similar developmental pathways for explicit and implicit word knowledge, as well as for lexical knowledge and strength, differs from how implicit knowledge of grammar is believed to be acquired (i.e., dominantly through exposure, with explicit instruction playing only an indirect role). These differences highlight that, despite some parallelism, vocabulary and grammar research do not always mirror each other as far as explicit and implicit knowledge and learning are concerned.

Despite the unitary view supported by the good fit of the one-factor solution, the association between the factor and the reaction time-based indicators is only of moderate strength: standardized factor loadings were in the |.45| to |.50| range. This level of association signals that the factor has broader coverage than what the individual tasks measure, and hence a deviation from a close alignment between the factor and the reaction time measures. Essentially, as one loads more indicators onto a factor, more sub-domain coverage is achieved. At the same time, the addition of indicators also changes the nature of the latent variable. This shift can reduce the strength of association between the factor and individual indicators because the factor's wider coverage often means that it is further away from what a specific task measures. On this account, the five measures in the present study together cover multiple sub-domains of word knowledge that are all important to the conceptualization of vocabulary. Ignoring any of these aspects might result in an overly simplistic, narrow view of lexical knowledge. Therefore, researchers should answer the call for the use of measures of different natures (e.g., Elgort, 2018; Godfroid, 2020b; Vandenberghe et al., 2021). In doing so, more aspects of word knowledge can be accounted for in the vocabulary construct (see also the discussion below on modeling vocabulary at a latent level). For example, as argued in Chapter 1, vocabulary tests administered without time pressure suffer from a lack of face validity (e.g., Godfroid, 2020b; Hui & Godfroid, 2020). To complement these untimed measures, researchers should start using more time-pressured tasks, especially given that such tasks were shown in the present study to provide information beyond traditional tests. This idea of obtaining further insights through implicit and timed measures leads the present discussion to the findings of the second research question, regarding the predictive validity of the different conceptualizations of the vocabulary construct.
What Implicit and Timed Measures Offer
Given that paper- and accuracy-based vocabulary tests, which are relatively easy to administer, are already widely used in L2 research, a key question is what additional information implicit and/or timed measures can offer researchers. This issue bears practical implications for research practices, in addition to the theoretical discussion of the vocabulary construct (i.e., its psychological dimensionality). Given limited time and funds, should researchers only use explicit, untimed vocabulary tests? What value do implicit and timed measures add that is worth the resources?

From my data, the addition of an implicit measure to the vocabulary construct (from CFA-M1 to CFA-M2) seems to increase the explanatory power for (self-reported) proficiency only to a very small extent. One straightforward interpretation may be that the repetition priming task is not providing much additional (statistical) information. At the same time, it is good to bear in mind that the repetition priming task targets the establishment of lexical representations. In other words, it is a task that examines the extent to which there is an entry in the mental lexicon for a given letter string. Even when detected, the lexical entries may or may not contain sufficiently enriched information (e.g., semantics) that can be put into authentic language use. On this account, the explanatory power of a vocabulary construct may also be related to the demands of the tasks in the battery as a whole. Indeed, language use (e.g., reading and listening) is often more highly correlated with a recall test than with a recognition test (e.g., S. Zhang & Zhang, 2020), with the former placing greater knowledge demands on the learner. Therefore, it appears that the implicitness (or explicitness) of a task does not have a large impact upon the explanatory power of the vocabulary construct.

Another finding is that the level of predictive validity of the same set of measures also depends on the factor structure of the vocabulary construct. In particular, when vocabulary is conceptualized separately as lexical knowledge and strength (see CFA-M3), it has the most explanatory power in accounting for the variance in proficiency. In this light, although the CFA results point to a one-factor model for parsimony, a two-factor solution based on time pressure can be more useful for researchers because it carries the kind of statistical information that can explain individual differences in proficiency among learners. The principle of parsimony should remain an important consideration, but whether it should be the only criterion deserves more deliberation. Potentially, researchers should take predictive validity into account when conceptualizing the vocabulary construct. In addition, the fact that different conceptualizations of the vocabulary construct have various levels of predictive power has implications for construct validation research in grammar. Much, if not all, of that literature focuses on the factor structure of a battery of tests: a winning model is chosen based on model fit and/or the principle of parsimony. Examining the predictive validity of these conceptualizations could lead to fruitful insights in grammar research as well.

Finally, it is worth noting that when these word measures are modeled at an observed level (SEM-M4), the model has the least explanatory power. This result highlights the importance of modeling vocabulary at a latent level.
First, CFA and SEM allow researchers to achieve more comprehensive sub-domain coverage. In the present study, CFA-M2, for example, has a single factor labelled vocabulary. This factor has five indicators. In other words, this underlying construct of vocabulary covers all aspects that these five measures collectively tap into. With more comprehensive coverage and the flexibility to specify the model according to different theoretical conceptualizations, vocabulary knowledge as a latent variable can capture a learner's lexical proficiency to a fuller extent, as shown in the present findings. Therefore, modeling vocabulary as a latent variable goes one step beyond the mere use of multiple vocabulary measures in a study, which researchers have called for (e.g., Read, 2020). Further, González-Fernández and Schmitt (2020) point out that using latent variables to examine vocabulary allows researchers to purify their vocabulary measure because relationships between variables (both latent and observed) are examined free of measurement error; therefore, the representation of vocabulary knowledge is believed to be more accurate. Having a relatively pure measure of vocabulary is very important, especially when vocabulary is used to predict other outcomes of interest. When measurement errors are not properly modeled, they can bias the regression coefficients that most researchers are interested in. Last but not least, using SEM to study vocabulary gives researchers more flexibility to model the many different relationships related to vocabulary than regression analyses, which can handle only one outcome. For example, one can investigate the extent to which a particular instruction method leads to larger vocabulary growth, which in turn may enhance one's reading or listening performance. In this case, researchers can build a mediation model where treatment is the predictor, vocabulary is the mediator, and language performance is the outcome, as sketched below. It is through modeling these sophisticated relationships that researchers gain insights into how vocabulary is learned and the ramifications of successful lexical development.
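As an illustration of this mediation idea, such a model could be specified in lavaan roughly as follows. The variable names (treatment, reading, and the vocabulary indicators) and the data set are hypothetical; this is a schematic sketch of the general approach, not an analysis from the present study.

```r
library(lavaan)

# Hypothetical mediation model: treatment -> vocabulary (latent) -> reading
med_model <- '
  vocab =~ recep + prod + YesNoAcc + meanRT + repprim

  # a path: treatment predicts latent vocabulary
  vocab ~ a * treatment

  # b path and direct effect (c-prime)
  reading ~ b * vocab + cprime * treatment

  # defined parameters: indirect and total effects
  indirect := a * b
  total    := cprime + a * b
'

fit_med <- sem(med_model, data = dat, estimator = "MLR", missing = "ml")
summary(fit_med, standardized = TRUE)
```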
Understanding Priming Tasks as Individual Differences Measures
In addition to the discussion concerning the research questions, there are a couple of methodological notes. The first relates to the use of priming tasks in this line of research. As discussed in Chapter 2, priming tasks have often been used in psycholinguistic experiments where researchers are interested in group-level effects. At the group level, if learners show priming, researchers conclude that the learners have established the relevant representations in memory. Therefore, results are often interpreted in a binary fashion: either there is priming or there is no priming. In the present study, the priming tasks were used as individual differences measures. That is, an individual's level of priming was used to index performance relative to the sample. This use of priming as an individual differences measure brought about two key issues: first, how the level and direction of priming are related to one's overall lexical knowledge; and second, how reliable priming tasks are when used in individual differences research.

When the level of priming is used as a continuous variable, it is necessary to understand what a high level of priming means. In other words, do researchers expect more skilled learners to demonstrate larger priming effects, for example? Or might the opposite be predicted? These questions may not be very intuitive because priming tasks have mostly been used in contexts where researchers see priming as binary. Indeed, my data show that, for the masked repetition priming task, higher levels of priming are associated with more skilled learners. This is manifested in two statistical estimates. First, in the analysis of the masked repetition priming task (Table 22), the mixed model estimated a negative correlation between the by-participant random intercepts and slopes at -.97. This means that when a participant has a low intercept value (representing overall faster responses in the related condition), they tend to have a high slope value (larger priming). Another estimate signaling that more priming is better comes from the CFA results. From Table 28, for example, it can be seen that the standardized factor loading for the task is positive at 0.46, indicating that a learner scoring high on the factor (vocabulary knowledge) also scores high on the repetition priming task (larger priming).

However, the same observations do not apply to the semantic priming task. On the contrary, the opposite is true: for the semantic priming task, the more skilled a learner is, the less priming (or even reverse priming) is observed. Evidence can be found in the correlation matrix (Table 26), where the correlation between the mean reaction time in the Yes-No RT test and semantic priming is positive at .39, indicating that when a learner is slow (less skilled, as manifested in a high RT), they demonstrate more priming. The semantic priming task also correlates negatively with the repetition priming task at -.59. In the mixed-effects model (Table 24), the correlation between the by-participant random intercepts and slopes is positive at .46, suggesting that when a learner is slow in the related condition (larger RT), they show more semantic priming. This is an unexpected finding that deserves further investigation. In addition, since overall priming was not observed, I was not confident including this task in the further analyses because what was being measured is not entirely clear.

In Elgort's (2011) study, the author notes that when the prime "is not fully acquired, the semantic priming effect may be inhibitory [reverse priming], whereas if the primes are acquired, the effect is facilitatory [priming]" (p. 394, my additions in brackets). Likewise, Bordag et al. (2015) took reverse priming (inhibition) as evidence of "memory traces of the new semantic representation" (p. 372). The lack of overall semantic priming for the present sample, therefore, may be a manifestation of opposite effects cancelling each other out. That is, there was reverse priming for those who did not fully acquire all critical words, and there was some priming for those who did. Finally, more proficient learners may also be more engaged in processing the semantics of the words, which may cause them to respond more slowly. Taken together, researchers should first clarify the characteristics of the semantic priming task by, for example, examining and confirming the fuller developmental trajectory of lexical knowledge as measured by semantic priming. Potentially, learners may first show no effects, followed by reverse priming (i.e., negative differences between reaction times in the related and the unrelated conditions), and then priming (a positive difference).
This U-shaped trajectory may somewhat resemble that of lexical processing stability, where learners' processing initially becomes less stable when new representations are established, before it becomes more stable again (Hui, 2020). In addition, when researchers use a semantic priming task as an individual differences measure, they need to specify the direction of effects a priori; otherwise, claims of learning may be unfalsifiable, as both priming and reverse priming can be taken as evidence for learning. Importantly, this non-linear trajectory of semantic priming can make the measure less reliable, as noted by Elgort (2011).

Lastly, one initial concern about using priming tasks as individual differences measures was their potentially low reliability (Draheim et al., 2019). To mitigate this problem, I employed three strategies, as discussed in Chapter 2: first, I used mixed-effects models to account for item-related variability (Rouder & Haaf, 2019). Second, I engaged in model criticism to treat outliers in fitting the mixed models (Baayen & Milin, 2010). Finally, I engaged in item inspection to attempt to remove random variability from the data. The first two strategies appear to be very useful in that the reliability levels of the two priming tasks are unexpectedly satisfactory (see Table 25); they are indeed on par with the accuracy-based measures. On the other hand, as shown in Table 21 and Table 23, removing items does not always result in a (much) higher level of reliability. This success in achieving relatively high reliability for priming measures has methodological implications for researchers using priming tasks as individual differences measures. Potentially, one can adopt these strategies and compare their impact on reliability with the aim of identifying an optimal way to analyze priming data. If this route proves successful, one can assess whether the same set of strategies can be applied to other processing-time data, such as reading times obtained from eye tracking (Staub, 2021).

Alternative and Equivalent Models
Another methodological note concerns the fact that the good-fitting one- and two-factor models (CFA-M2 and CFA-M3) are statistically indistinguishable, suggesting that they are almost equally valid representations of the data. This scenario exemplifies a feature of confirmatory factor analysis (and indeed of the structural equation modeling [SEM] framework in general), namely the existence of alternative and equivalent models. Brown (2015) defines equivalent solutions as "different model specifications produc[ing] identical goodness of fit (with the same number of df) and predicted covariance matrices (Σ) in any given data set" (p. 180). Since the fit of the models in the present case is not truly identical in statistical terms, I consider them to be alternative models. Indeed, alternative and equivalent solutions are not uncommon. In an oft-cited paper, MacCallum et al. (1993) reviewed 20 articles using structural equation modeling that were published in a psychology journal between 1988 and 1991. MacCallum and colleagues (1993) found that all of them could have examined three or more equivalent models; the median number of potential equivalent models was three, with a range from three to 33,925. These numbers highlight the extent to which there exist potential alternative explanations for what researchers find.
In terms of dealing with potential alternative and equivalent models, Brown (2015) suggests first ruling out theoretically implausible models, such as those that can be fitted but do not make practical sense (e.g., using data at Time 2 to predict performance at Time 1). In other cases, closely examining the contender models may have "considerable heuristic value" (Brown, 2015, p. 183). For example, researchers may not be aware of theoretical models that could also be plausible. Therefore, testing, comparing, and reporting alternative and equivalent models should be appreciated in any scientific endeavor. Having said that, when there are competing theories, one approach researchers take may be to adjudicate between them through rigorous empirical work. I myself initially took this approach when designing the present study, but I have also come to realize the limitations of having to decide on a final, best-fitting "winner" model. In particular, one could easily overlook what other potentially alternative and equivalent (or even less well-fitting) models have to offer. On this account, it is unfortunate that researchers often do not acknowledge the existence of potential alternative and equivalent models. MacCallum et al. (1993) reported that none of the articles they reviewed explicitly recognized the possibility of equivalent models, let alone tested and compared them. However, vocabulary researchers using confirmatory factor analysis (or SEM) have done a much better job of acknowledging alternative and equivalent models. For example, Koizumi and In'nami (2020) explicitly tested and compared different models as rivals of each other. González-Fernández and Schmitt (2020) also wrote that "[they] cannot claim that [their best fitting model] is the only valid statistical representation of vocabulary knowledge, but it is the model that best fit [their] data, with its particular measures and participants" (p. 504, emphasis original). Taken together, I argue that researchers should take full advantage of alternative and equivalent models when they are found. In particular, these models allow researchers to examine the strengths and limitations of different conceptualizations, moving beyond a simplistic adjudication of which one might be the winner.

Limitations and Future Directions
First and foremost, this study needs to be replicated. In general terms, replication research can help researchers assess the extent to which reported findings are reliable. Types of replication research range from exact replications, where subsequent researchers repeat the study without any intended changes to the methodology, to partial replications, where one principled change is made, and to conceptual replications, where more changes are made (Marsden et al., 2018). While exact replications offer the most comparability between the initial and subsequent studies, partial replications can help researchers assess the generalizability of findings to other contexts, for instance. Therefore, the research field will benefit from researchers replicating the present study with, for instance, a different population (e.g., EFL learners) and/or with a different set of critical words and/or tasks. Second, to the extent that the data allow, researchers should test more alternative and equivalent models to thoroughly understand the relationships between different aspects of vocabulary.
For example, when a good two-factor model is available, one can assess whether there is yet another, second-order factor (e.g., vocabulary) governing the two first-order factors (e.g., explicit and implicit word knowledge, or lexical knowledge and strength). Perhaps a bi-factor model, where each indicator loads onto both a general factor (e.g., proficiency) and a construct factor (e.g., explicit word knowledge), can reveal the extent to which proficiency as a single factor drives performance on the indicators, in addition to explicit and implicit word knowledge. All these avenues present themselves as future directions for research.

Conclusion
The present study is one of the very first steps toward shedding light on the relationships between explicit and implicit, as well as timed and untimed, word measures. Using a battery of five vocabulary measures, I found evidence for a broad, unitary view of vocabulary knowledge. While these measures demonstrate psychometric unidimensionality, they represent a range of aspects of word knowledge that are collectively important to the construct of vocabulary. The initially hypothesized distinct dimension of lexical strength is not well supported by the present data, but this conceptualization offers the strongest predictive validity in explaining self-reported proficiency. Methodologically, I also demonstrated how priming data can be analyzed to achieve a high level of reliability, a psychometric property that is particularly important for individual differences research.

The construct validity of measures plays an important role in (dis)confirming theoretical conceptualizations of vocabulary as a construct. It is a prerequisite for language scientists who use quantitative methods to understand vocabulary learning and teaching. The dimensionality of word knowledge is as important as how researchers believe the knowledge is acquired and its implications for actual language performance. For example, could one ace all measures by exclusively studying word cards? Would naturalistic exposure (e.g., study abroad) promote implicit word knowledge more than explicit knowledge? What does it mean to have strong implicit word knowledge? Would strong implicit knowledge facilitate language processing, such as the prediction that takes place during listening? To address these questions, researchers need to draw on measurement research such as this piece to make informed decisions about which vocabulary measures are appropriate for their research needs. In more general terms, vocabulary knowledge and learning should be modeled at the latent level to capture a more comprehensive, principled view of the construct.

APPENDICES

APPENDIX A
THE FORM-MEANING RECEPTIVE TEST

Instructions
Choose the closest meaning to the key word in the question. Here is an example.
SEE: They saw it.
a. cut
b. waited for
c. looked at
d. started
The answer is (c).

K2 Word Level
1. MAINTAIN: Can they maintain it? a. keep it as it is b. make it larger c. get a better one than it d. get it
2. STONE: He sat on a stone. a. hard thing b. kind of chair c. soft thing on the floor d. part of a tree
3. UPSET: I am upset. a. tired b. famous c. rich d. unhappy
4. DRAWER: The drawer was empty. a. sliding box b. place where cars are kept c. cupboard to keep things cold d. animal house
5. PATIENCE: He has no patience. a. will not wait happily b. has no free time c. has no faith d. does not know what is fair
6. CAP: They are talking about the cap. a. cover for letters b. kind of hat c. place to live inside a tall building d.
food grown in garden 7. PUB: They went to the pub. a. place where people drink and talk b. place that looks after money c. large building with many shops d. building for swimming 137 8.CIRCLE: Make a circle. a. rough picture b. space with nothing in it c. round shape d. large hole 9. MICROPONE: Please use the microphone. a. machine for making food hot b. machine that makes sounds louder c. machine that makes things look bigger d. small telephone that can be carried around 10.PRO: He's a pro. a. someone who is employed to find out important secrets b. a stupid person c. someone who writes for a newspaper 138 d. someone who is paid for playing sport etc. K3 Word Level 1.SOLDIER: He is a soldier. a. person in a business b. student c. person who uses metal d. person in the army 2.RESTORE: It has been restored. a. said again b. given to a different person c. given a lower price d. made like new again 3. JUG: He was holding a jug. a. a container for pouring liquids 139 b. an informal discussion c. a soft cap d. a weapon that explodes 4. SCRUB: He is scrubbing it. a. cutting shallow lines into it b. repairing it c. rubbing it hard to clean it d. drawing simple pictures of it 5.DINOSAUR: The children were pretending to be dinosaurs. a. robbers who work at sea b. very small creatures with human form but with wings c. large creatures with wings that breathe fire d. animals that lived a long time ago Q16. STRAP: He broke the strap. 140 a. promise b. top cover c. shallow dish for food d. strip of material for holding things together Q17. PAVE: It was paved. a. prevented from going through b. divided c. given gold edges d. covered with a hard surface Q18. DASH: They dashed over it. a. moved quickly b. moved slowly c. fought d. looked quickly 141 Q19. POVERTY: Poverty is a topic of discussion. a. having little money b. history c. useful thing d. action Q20. LONESOME: He felt lonesome. a. ungrateful b. very tired c. lonely d. full of energy K4 Word Level Q31. COMPOUND: They made a new compound. a. agreement b. thing made of two or more parts c. group of people forming a business d. guess based on past experience 142 Q32. LATTER: I agree with the latter. a. man from the church b. reason given c. last one d. answer 33. CANDID: Please be candid. a. be careful b. show sympathy c. show fairness to both sides d. say what you really think Q34. TUMMY: Look at my tummy. a. cloth to cover the head b. stomach c. small furry animal d. thumb Q35. QUIZ: We made a quiz. a. thing to hold arrows 143 b. serious mistake c. set of questions d. box for birds to make nests in Q36. INPUT: We need more input. a. information, power, etc. put into something b. workers c. artificial filling for a hole in wood d. money Q37. CRAB: Do you like crabs? a. sea creatures that walk sideways b. very thin small cakes c. tight, hard collars d. large black insects that sing at night Q28.VOCABULARY: You will need more vocabulary. a. words b. skill c. money d. guns 144 Q29. REMEDY: We found a good remedy. a. way to fix a problem b. place to eat in public c. way to prepare food d. rule about numbers Q30.ALLEGE: They alleged it. a. claimed it without proof b. stole the ideas for it from someone else c. provided facts to prove it d. argued against the facts that supported it K5 Word Level Q31. DEFICIT: The company had a large deficit. a. spent a lot more money than it earned b. went down a lot in value c. had a plan for its spending that used a lot of money d. had a lot of money in the bank 145 Q32. WEEP: He wept. a. finished his course b. 
32. WEEP: He wept. a. finished his course b. cried c. died d. worried
33. NUN: We saw a nun. a. long thin creature that lives in the earth b. terrible accident c. woman following a strict religious life d. unexplained bright light in the sky
34. HAUNT: The house is haunted. a. full of ornaments b. rented c. empty d. full of ghosts
35. COMPOST: We need some compost. a. strong support b. help to feel better c. hard stuff made of stones and sand stuck together d. rotted plant material
36. CUBE: I need one more cube. a. sharp thing used for joining things b. solid square block c. tall cup with no saucer d. piece of stiff paper folded in half
37. MINIATURE: It is a miniature. a. a very small thing of its kind b. an instrument to look at small objects c. a very small living creature d. a small line to join letters in handwriting
38. PEEL: Shall I peel it? a. let it sit in water for a long time b. take the skin off it c. make it white d. cut it into thin pieces
39. FRACTURE: They found a fracture. a. break b. small piece c. short coat d. rare jewel
40. BACTERIUM: They didn't find a single bacterium. a. small living thing causing disease b. plant with red or orange flowers c. animal that carries water on its back d. thing that has been stolen and sold to a shop

APPENDIX B

THE FORM-MEANING PRODUCTIVE TEST

Critical Words Prompts
1. maintain One needs to exercise regularly to mai________ their fitness.
2. stone When I was young, my father taught me how to skip a st____ across the pond.
3. upset I didn't mean to up______ him and make him cry – it was just a bit of fun.
4. drawer If you pull out the dr_______, you can find the knives and forks.
5. patience In the end, I lost my pat______ and shouted at them.
6. cap He likes wearing a baseball c___ because it can block the sunshine.
7. pub They like to go to the pu___ for a drink every Friday.
8. circle If you want to teach a child to draw a sun, you can start by drawing a cir____.
9. microphone People cannot hear well. Could you speak into the micr_________?
10. pro She won nine games out of the ten that she played. She is surely a pr___.
11. soldier At that point, the sol_____ in full uniform opened fire on the car.
12. restore Although the house had a lot of damage, it has now been res__.
13. jug She filled the ju____ up with milk.
14. scrub He scr___ the dishes clean with a sponge.
15. dinosaur Din______ are a type of large animal that became extinct long ago.
16. strap How can I adjust the str______ on this helmet? It's too tight.
17. pave Because of the holes on the road, they pa_____ it again.
18. dash The dog ran off, and she da______ after him.
19. poverty Many people here live in po______, making very little money for their living.
20. lonesome Since there is no one around, he feels lones______.
21. compound A com______ word is formed by putting two words together, like classroom is from the words class and room.
22. latter Faced with two plans, she preferred the la____.
23. candid If you say what you think, you are a can____ person.
24. tummy Asian children are told not to have cold drinks or their tum______ would hurt.
25. quiz There was a pop qu____ in math at school today.
26. input You don't need to thank me. My inp_____ into the project was just one idea.
27. crab We went to a seafood restaurant, and we ordered cr___.
28. vocabulary He needs to learn more words. A larger vo_______ can help with understanding the language.
29. remedy Hot soup is the best rem____ for the common cold.
30.
allege Without any evidence, they all____ that man killed his neighbor. 31. deficit The company is spending more than it makes. The owner is expecting a de______ in the budget. 151 32. weep After hearing the sad story, he wanted to we____. 33. nun He was taught by Catholic n______ at school. That’s why he knows so much about the religion. 34. haunt He was so scared walking out of the ha_____ house. 35. compost Dead plants are used to create com_____ soil for gardening. 36. cube Many Americans like to have ice cu____ in their water. 37. miniature This one is too big. I like the mini______ version. 38. peel She doesn’t like the skins. Would you p_____ these potatoes? 39. fracture After the accident, the doctor saw a bone fra______ in his x-ray film. 40. bacterum After cleaning this with alcohol, I don’t think we can find a single bact______. 152 APPENDIX C STIMULI FOR THE YES-NO RT TEST ItemNo Item WordType ItemNo Item WordType 01 maintain word 41 yimb nonword 02 stone word 42 snarbs nonword 03 upset word 43 ghieved nonword 04 drawer word 44 knilms nonword 05 patience word 45 mulb nonword 06 cap word 46 shists nonword 07 pub word 47 thraifs nonword 08 circle word 48 psuse nonword 09 microphone word 49 dause nonword 10 pro word 50 zolved nonword 11 soldier word 51 yooks nonword 12 restore word 52 trensed nonword 13 jug word 53 twouche nonword 14 scrub word 54 flarred nonword 15 dinosaur word 55 rourned nonword 16 strap word 56 chift nonword 17 pave word 57 brisps nonword 18 dash word 58 graun nonword 19 poverty word 59 gwoothed nonword 153 20 lonesome word 60 spreathed nonword 21 compound word 61 flormed nonword 22 latter word 62 prarns nonword 23 candid word 63 shrut nonword 24 tummy word 64 skoign nonword 25 quiz word 65 broined nonword 26 input word 66 jeg nonword 27 crab word 67 skobbed nonword 28 vocabulary word 68 gnoathe nonword 29 remedy word 69 clersed nonword 30 allege word 70 spermed nonword 31 deficit word 71 blossed nonword 32 weep word 72 chead nonword 33 nun word 73 gwoaked nonword 34 haunt word 74 strake nonword 35 compost word 75 zatts nonword 36 cube word 76 melch nonword 37 miniature word 77 shrect nonword 38 peel word 78 throaves nonword 39 fracture word 79 thwagued nonword 40 bacterium word 80 plisc nonword 154 APPENDIX D STIMULI FOR THE MASKED REPETITION PRIMING TASK ItemNo Prime Target TargetType TrialType Condition C01 maintain MAINTAIN word critical related C01 announce MAINTAIN word critical unrelated C02 stone STONE word critical related C02 chest STONE word critical unrelated C03 upset UPSET word critical related C03 major UPSET word critical unrelated C04 drawer DRAWER word critical related C04 actors DRAWER word critical unrelated C05 patience PATIENCE word critical related C05 occasion PATIENCE word critical unrelated C06 nil CAP word critical related C06 lop CAP word critical unrelated C07 pub PUB word critical related C07 gap PUB word critical unrelated C08 circle CIRCLE word critical related C08 effect CIRCLE word critical unrelated C09 microphone MICROPHONE word critical related C09 suspension MICROPHONE word critical unrelated C10 pro PRO word critical related 155 C10 shy PRO word critical unrelated C11 soldier SOLDIER word critical related C11 account SOLDIER word critical unrelated C12 restore RESTORE word critical related C12 cherish RESTORE word critical unrelated C13 jug JUG word critical related C13 icy JUG word critical unrelated C14 scrub SCRUB word critical related C14 draft SCRUB word critical unrelated C15 dinosaur DINOSAUR word critical related 
C15 triangle DINOSAUR word critical unrelated C16 strap STRAP word critical related C16 pause STRAP word critical unrelated C17 pave PAVE word critical related C17 loom PAVE word critical unrelated C18 dash DASH word critical related C18 memo DASH word critical unrelated C19 poverty POVERTY word critical related C19 dentist POVERTY word critical unrelated C20 lonesome LONESOME word critical related C20 concrete LONESOME word critical unrelated C21 compound COMPOUND word critical related 156 C21 sympathy COMPOUND word critical unrelated C22 latter LATTER word critical related C22 unkind LATTER word critical unrelated C23 candid CANDID word critical related C23 unreal CANDID word critical unrelated C24 tummy TUMMY word critical related C24 poems TUMMY word critical unrelated C25 quiz QUIZ word critical related C25 lump QUIZ word critical unrelated C26 input INPUT word critical related C26 stool INPUT word critical unrelated C27 crab CRAB word critical related C27 bolt CRAB word critical unrelated C28 vocabulary VOCABULARY word critical related C28 psychiatry VOCABULARY word critical unrelated C29 remedy REMEDY word critical related C29 crater REMEDY word critical unrelated C30 allege ALLEGE word critical related C30 pitted ALLEGE word critical unrelated C31 deficit DEFICIT word critical related C31 surname DEFICIT word critical unrelated C32 weep WEEP word critical related 157 C32 bark WEEP word critical unrelated C33 nun NUN word critical related C33 dip NUN word critical unrelated C34 haunt HAUNT word critical related C34 unite HAUNT word critical unrelated C35 compost COMPOST word critical related C35 leakage COMPOST word critical unrelated C36 cube CUBE word critical related C36 polo CUBE word critical unrelated C37 miniature MINIATURE word critical related C37 incorrect MINIATURE word critical unrelated C38 peel PEEL word critical related C38 echo PEEL word critical unrelated C39 fracture FRACTURE word critical related C39 database FRACTURE word critical unrelated C40 bacterium BACTERIUM word critical related C40 aggregate BACTERIUM word critical unrelated F01 package WOLF word filler unrelated F02 lion LAKE word filler unrelated F03 stem SIGN word filler unrelated F04 weapon TRIPOD word filler unrelated F05 soil BRAKE word filler unrelated 158 F06 scab WINDOW word filler unrelated F07 coast SOCCER word filler unrelated F08 umpire MERCURY word filler unrelated F09 string FEET word filler unrelated F10 vest COURT word filler unrelated F11 swimming WINK word filler unrelated F12 clothes WICK word filler unrelated F13 womb PLATTER word filler unrelated F14 willow RING word filler unrelated F15 rope NOSE word filler unrelated F16 toilet MOON word filler unrelated F17 wine WIND word filler unrelated F18 sand TONGUE word filler unrelated F19 rock LOOT word filler unrelated F20 cockpit MULE word filler unrelated F21 veil RASH word filler unrelated F22 home WOOD word filler unrelated F23 tunnel ACCORDION word filler unrelated F24 cider CHALK word filler unrelated F25 plum HOOD word filler unrelated F26 panties BURRO word filler unrelated F27 candy NOTE word filler unrelated 159 F28 coral MENU word filler unrelated F29 sneeze REED word filler unrelated F30 king SHIP word filler unrelated F31 lung CHEST word filler unrelated F32 roof LUMP word filler unrelated F33 cabinet CEDAR word filler unrelated F34 lime MOSS word filler unrelated F35 brownie MONARCH word filler unrelated F36 star KNEE word filler unrelated F37 wren SHOE word filler unrelated F38 hump MOLE word filler unrelated F39 pork 
MANSION word filler unrelated F40 nail SULTAN word filler unrelated F41 soda TIRE word filler unrelated F42 temple RUBY word filler unrelated F43 sofa MEAT word filler unrelated F44 tube CHART word filler unrelated F45 mustard STREET word filler unrelated F46 plug CREAM word filler unrelated F47 ambulance TOWN word filler unrelated F48 sketch RAIL word filler unrelated F49 triangle SPIDER word filler unrelated 160 F50 supper WOODLAND word filler unrelated F51 yellow MILK word filler unrelated F52 channel PEAR word filler unrelated F53 lamb PINE word filler unrelated F54 pill KISS word filler unrelated F55 pope PUMP word filler unrelated F56 lice TROMBONE word filler unrelated F57 twig CHEEK word filler unrelated F58 lawn COUCH word filler unrelated F59 jail SISTER word filler unrelated F60 walnut TOOL word filler unrelated F61 tortoise VIOLET word filler unrelated F62 sycamore LIMB word filler unrelated F63 aluminium PASSAGE word filler unrelated F64 cigar URCHIN word filler unrelated F65 bloom TAIL word filler unrelated F66 oatmeal WOOL word filler unrelated F67 ceiling SIXPENCE word filler unrelated F68 scorpion POLE word filler unrelated F69 lily TURTLE word filler unrelated F70 rain WALL word filler unrelated F71 sunset TANK word filler unrelated 161 F72 stable CHAIN word filler unrelated F73 brother SONG word filler unrelated F74 china SINK word filler unrelated F75 pool SURF word filler unrelated F76 hook CRANE word filler unrelated F77 crowd ROOT word filler unrelated F78 meal LOCK word filler unrelated F79 ramp JEEP word filler unrelated F80 hoof CHOIR word filler unrelated N01 snarbs SNARBS non non related N01 plisc SNARBS non non unrelated N02 knilms KNILMS non non related N02 lymphs KNILMS non non unrelated N03 shists SHISTS non non related N03 ficed SHISTS non non unrelated N04 zolved ZOLVED non non related N04 rhoiled ZOLVED non non unrelated N05 ghieved GHIEVED non non related N05 shrut GHIEVED non non unrelated N06 brisps BRISPS non non related N06 skoign BRISPS non non unrelated N07 prarns PRARNS non non related 162 N07 frifth PRARNS non non unrelated N08 dause DAUSE non non related N08 blunch DAUSE non non unrelated N09 graun GRAUN non non related N09 clersed GRAUN non non unrelated N10 shrut SHRUT non non related N10 gnewth SHRUT non non unrelated N11 skoign SKOIGN non non related N11 zolved SKOIGN non non unrelated N12 trensed TRENSED non non related N12 yooks TRENSED non non unrelated N13 twouche TWOUCHE non non related N13 stroop TWOUCHE non non unrelated N14 flarred FLARRED non non related N14 zatched FLARRED non non unrelated N15 strake STRAKE non non related N15 shaifs STRAKE non non unrelated N16 shrect SHRECT non non related N16 stelks SHRECT non non unrelated N17 broined BROINED non non related N17 flarred BROINED non non unrelated N18 zatts ZATTS non non related 163 N18 cronked ZATTS non non unrelated N19 yimb YIMB non non related N19 thwerge YIMB non non unrelated N20 psuse PSUSE non non related N20 clike PSUSE non non unrelated N21 flunes FLUNES non non related N21 thwagued FLUNES non non unrelated N22 yooks YOOKS non non related N22 ghieved YOOKS non non unrelated N23 driped DRIPED non non related N23 strake DRIPED non non unrelated N24 spreathed SPREATHED non non related N24 broined SPREATHED non non unrelated N25 prives PRIVES non non related N25 cless PRIVES non non unrelated N26 gwilns GWILNS non non related N26 chift GWILNS non non unrelated N27 chift CHIFT non non related N27 chead CHIFT non non unrelated N28 chead CHEAD non non related N28 prowse CHEAD 
non non unrelated N29 melch MELCH non non related 164 N29 thraifs MELCH non non unrelated N30 spreat SPREAT non non related N30 phroaps SPREAT non non unrelated N31 vaides VAIDES non non related N31 spreathed VAIDES non non unrelated N32 plisc PLISC non non related N32 snarbs PLISC non non unrelated N33 rhorts RHORTS non non related N33 psuse RHORTS non non unrelated N34 skobbed SKOBBED non non related N34 knilms SKOBBED non non unrelated N35 jumbed JUMBED non non related N35 prives JUMBED non non unrelated N36 snance SNANCE non non related N36 farge SNANCE non non unrelated N37 clersed CLERSED non non related N37 spreat CLERSED non non unrelated N38 stroop STROOP non non related N38 truiff STROOP non non unrelated N39 drongs DRONGS non non related N39 pised DRONGS non non unrelated N40 wrukes WRUKES non non related 165 N40 snooks WRUKES non non unrelated N41 blunch BLUNCH non non related N41 janns BLUNCH non non unrelated N42 stelks STELKS non non related N42 prathed STELKS non non unrelated N43 gwoaked GWOAKED non non related N43 trensed GWOAKED non non unrelated N44 thraifs THRAIFS non non related N44 shrect THRAIFS non non unrelated N45 rourned ROURNED non non related N45 gurve ROURNED non non unrelated N46 dradge DRADGE non non related N46 gwoothed DRADGE non non unrelated N47 rhoiled RHOILED non non related N47 vaides RHOILED non non unrelated N48 psith PSITH non non related N48 zatts PSITH non non unrelated N49 truiff TRUIFF non non related N49 psith TRUIFF non non unrelated N50 gnewth GNEWTH non non related N50 rhean GNEWTH non non unrelated N51 frifth FRIFTH non non related 166 N51 flunes FRIFTH non non unrelated N52 thwagued THWAGUED non non related N52 driped THWAGUED non non unrelated N53 rhean RHEAN non non related N53 gwoaked RHEAN non non unrelated N54 jeg JEG non non related N54 gwilns JEG non non unrelated N55 dweem DWEEM non non related N55 rhorts DWEEM non non unrelated N56 zatched ZATCHED non non related N56 skurs ZATCHED non non unrelated N57 zorns ZORNS non non related N57 dradge ZORNS non non unrelated N58 snooks SNOOKS non non related N58 ghurrs SNOOKS non non unrelated N59 thwerge THWERGE non non related N59 snance THWERGE non non unrelated N60 prathed PRATHED non non related N60 dweem PRATHED non non unrelated N61 janns JANNS non non related N61 jumbed JANNS non non unrelated N62 skurs SKURS non non related 167 N62 jeg SKURS non non unrelated N63 phroaps PHROAPS non non related N63 zorns PHROAPS non non unrelated N64 gwoothed GWOOTHED non non related N64 brisps GWOOTHED non non unrelated N65 plurrs PLURRS non non related N65 flormed PLURRS non non unrelated N66 wharked WHARKED non non related N66 melch WHARKED non non unrelated N67 cless CLESS non non related N67 shists CLESS non non unrelated N68 frelt FRELT non non related N68 wrukes FRELT non non unrelated N69 prowse PROWSE non non related N69 skobbed PROWSE non non unrelated N70 gurve GURVE non non related N70 wharked GURVE non non unrelated N71 farge FARGE non non related N71 prarns FARGE non non unrelated N72 shaifs SHAIFS non non related N72 zulbs SHAIFS non non unrelated N73 cronked CRONKED non non related 168 N73 rourned CRONKED non non unrelated N74 zulbs ZULBS non non related N74 plurrs ZULBS non non unrelated N75 ghurrs GHURRS non non related N75 graun GHURRS non non unrelated N76 clike CLIKE non non related N76 yimb CLIKE non non unrelated N77 lymphs LYMPHS non non related N77 drongs LYMPHS non non unrelated N78 flormed FLORMED non non related N78 twouche FLORMED non non unrelated N79 pised PISED non 
non related N79 frelt PISED non non unrelated N80 ficed FICED non non related N80 dause FICED non non unrelated 169 APPENDIX E WORD ASSOCIATION NORMS FOR CRITICAL RELATED TRIALS ItemNo Prime Target EAT USF-FAN C01 maintain keep C02 stone brick 0.02 C03 upset worried C04 drawer desk 0.05 0.14 C05 patience calm 0.05 0.17 C06 cap head 0.19 0.11 C07 pub bar 0.05 0.29 C08 circle ball C09 microphone voice 0.04 C10 pro champion C11 soldier military 0.07 C12 restore build 0.01 0.04 C13 jug pouring 0.01 C14 scrub cleaner C15 dinosaur fossil C16 strap bra 0.08 0.06 C17 pave road C18 dash race 0.01 C19 poverty hunger 0.05 170 C20 lonesome sad 0.03 C21 compound laboratory C22 latter former 0.90 C23 candid honest C24 tummy belly 0.02 C25 quiz exam C26 input entry C27 crab pinch C28 vocabulary grammar C29 remedy solution 0.02 0.09 C30 allege accuse C31 deficit budget C32 weep sorrow C33 nun church C34 haunt ghost 0.50 0.32 C35 compost soil C36 cube square C37 miniature tiny C38 peel banana 0.01 0.13 C39 fracture break 0.35 0.62 C40 bacterium disease 0.08 0.13 171 Note. EAT_BW = Edinburgh Associative Thesaurus (Backward Association); EAT_FW = Edinburgh Associative Thesaurus (Forward Association); USF-FAN_BW = University of South Florida Free Association Norms (Backward Association); USF-FAN_FW = University of South Florida Free Association Norms (Forward Association) 172 APPENDIX F STIMULI FOR THE MASKED SEMANTIC PRIMING TASK ItemNo Prime Target TargetType TrialType Condition Cosine C01 maintain keep word critical related 0.55 C01 announce keep word critical unrelated 0.85 C02 stone brick word critical related 0.61 C02 chest brick word critical unrelated 0.89 C03 upset WORRIED word critical related 0.34 C03 major WORRIED word critical unrelated 0.83 C04 drawer DESK word critical related 0.47 C04 actors DESK word critical unrelated 0.91 C05 patience calm word critical related 0.68 C05 chestnut calm word critical unrelated 0.95 C06 cap head word critical related 0.68 C06 lop head word critical unrelated 0.86 C07 pub bar word critical related 0.52 C07 gap bar word critical unrelated 0.98 C08 circle ball word critical related 0.74 C08 effect ball word critical unrelated 0.93 C09 microphone voice word critical related 0.56 173 C09 suspension voice word critical unrelated 0.95 C10 pro champion word critical related 0.67 C10 shy champion word critical unrelated 0.90 C11 soldier military word critical related 0.55 C11 account military word critical unrelated 0.98 C12 restore build word critical related 0.63 C12 cherish build word critical unrelated 0.85 C13 jug word critical related 0.69 C13 icy word critical unrelated 0.79 C14 scrub CLEANER word critical related 0.71 C14 draft CLEANER word critical unrelated 0.95 C15 dinosaur fossil word critical related 0.48 C15 triangle fossil word critical unrelated 1.08 C16 strap bra word critical related 0.64 C16 essay bra word critical unrelated 0.91 C17 pave road word critical related 0.76 C17 loom road word critical unrelated 0.92 C18 dash race word critical related 0.80 C18 memo race word critical unrelated 0.94 C19 poverty hunger word critical related 0.54 C19 dentist hunger word critical unrelated 0.95 C20 lonesome SAD word critical related 0.61 174 C20 concrete SAD word critical unrelated 0.87 C21 compound LABORATORY word critical related 0.71 C21 sympathy LABORATORY word critical unrelated 0.99 C22 latter FORMER word critical related 0.67 C22 unkind FORMER word critical unrelated 0.92 C23 candid HONEST word critical related 0.59 C23 unreal HONEST word critical 
unrelated 0.82 C24 tummy belly word critical related 0.54 C24 poems belly word critical unrelated 0.90 C25 quiz exam word critical related 0.70 C25 lump exam word critical unrelated 0.86 C26 input entry word critical related 0.85 C26 stool entry word critical unrelated 1.02 C27 crab pinch word critical related 0.88 C27 haze pinch word critical unrelated 0.88 C28 vocabulary grammar word critical related 0.52 C28 psychiatry grammar word critical unrelated 0.94 C29 remedy solution word critical related 0.59 C29 crater solution word critical unrelated 0.94 C30 allege ACCUSE word critical related 0.56 C30 pitted ACCUSE word critical unrelated 0.85 C31 deficit BUDGET word critical related 0.59 175 C31 surname BUDGET word critical unrelated 1.06 C32 weep SORROW word critical related 0.47 C32 bark SORROW word critical unrelated 0.83 C33 nun CHURCH word critical related 0.65 C33 dip CHURCH word critical unrelated 0.95 C34 haunt GHOST word critical related 0.55 C34 unite GHOST word critical unrelated 0.90 C35 compost SOIL word critical related 0.55 C35 leakage SOIL word critical unrelated 0.90 C36 cube SQUARE word critical related 0.70 C36 polo SQUARE word critical unrelated 1.01 C37 miniature TINY word critical related 0.59 C37 incorrect TINY word critical unrelated 0.97 C38 peel banana word critical related 0.56 C38 echo banana word critical unrelated 0.93 C39 fracture BREAK word critical related 0.70 C39 database BREAK word critical unrelated 1.02 C40 bacterium disease word critical related 0.62 C40 aggregate disease word critical unrelated 1.06 F01 home phial word filler unrelated 0.86 F02 garbage bow word filler unrelated 0.86 176 F03 suds cell word filler unrelated 0.98 F04 bronze trail word filler unrelated 0.98 F05 ramp slush word filler unrelated 0.86 F06 latch camp word filler unrelated 1.02 F07 flood servant word filler unrelated 0.95 F08 steam carpet word filler unrelated 0.87 F09 chicken gingham word filler unrelated 0.79 F10 truck emerald word filler unrelated 1.03 F11 farmyard casement word filler unrelated 0.91 F12 stub ticket word filler unrelated 0.75 F13 branch squire word filler unrelated 1.01 F14 oboe bank word filler unrelated 0.99 F15 building cologne word filler unrelated 0.86 F16 abscess bill word filler unrelated 0.87 F17 aunt lettuce word filler unrelated 0.87 F18 pollen mixer word filler unrelated 0.96 F19 bourbon rope word filler unrelated 0.87 F20 pedal hoe word filler unrelated 0.88 F21 fellow nickel word filler unrelated 0.80 177 F22 boy soap word filler unrelated 0.80 F23 tweezer horn word filler unrelated 0.93 F24 lane crow word filler unrelated 0.82 F25 head velvet word filler unrelated 0.83 F26 rake velvet word filler unrelated 0.85 F27 necklace prairie word filler unrelated 0.89 F28 dance sleet word filler unrelated 0.90 F29 men forearm word filler unrelated 0.88 F30 pocket shutter word filler unrelated 0.89 F31 coke missile word filler unrelated 0.88 F32 blade worker word filler unrelated 0.90 F33 material kernel word filler unrelated 0.91 F34 dweller bread word filler unrelated 0.79 F35 tennis hood word filler unrelated 0.89 F36 fan stable word filler unrelated 1.00 F37 ship cloak word filler unrelated 0.84 F38 soup shell word filler unrelated 0.79 F39 bagpipe naval word filler unrelated 0.93 F40 spire student word filler unrelated 1.02 178 F41 seaman paint word filler unrelated 0.97 F42 thong pole word filler unrelated 0.80 F43 ammonia malaria word filler unrelated 0.95 F44 trapeze brownie word filler unrelated 0.93 F45 harvest lumber word filler unrelated 
0.82 F46 china yew word filler unrelated 0.86 F47 shield kerchief word filler unrelated 0.82 F48 straw music word filler unrelated 0.95 F49 vault cinnamon word filler unrelated 0.99 F50 blouse mantle word filler unrelated 0.76 F51 mole ether word filler unrelated 0.89 F52 drill novel word filler unrelated 0.99 F53 male child word filler unrelated 0.76 F54 quarter doctor word filler unrelated 0.92 F55 bible cottage word filler unrelated 0.94 F56 epistle pancreas word filler unrelated 0.89 F57 boulder hog word filler unrelated 0.93 F58 shed bristle word filler unrelated 0.97 F59 nun larch word filler unrelated 0.95 179 F60 corpse sofa word filler unrelated 0.77 F61 slime lynx word filler unrelated 0.98 F62 plane land word filler unrelated 0.64 F63 machine referee word filler unrelated 0.97 F64 wife guest word filler unrelated 0.72 F65 puddle walrus word filler unrelated 0.95 F66 mineral ankle word filler unrelated 0.93 F67 emperor crumb word filler unrelated 0.90 F68 elephant cafe word filler unrelated 0.92 F69 rash brim word filler unrelated 1.02 F70 oil crowd word filler unrelated 0.96 F71 ramrod corridor word filler unrelated 0.81 F72 drain rung word filler unrelated 0.93 F73 dump pea word filler unrelated 0.92 F74 banana beak word filler unrelated 0.82 F75 toy umpire word filler unrelated 0.90 F76 cage tailor word filler unrelated 0.90 F77 statue dough word filler unrelated 0.88 F78 diving body word filler unrelated 0.87 180 F79 janitor halter word filler unrelated 0.99 F80 weapon magician word filler unrelated 0.77 N01 shryst dweased non non na na N02 juits whilns non non na na N03 spurved theethe non non na na N04 smighs phleuds non non na na N05 thralked glurch non non na na N06 pseague skiln non non na na N07 cloob skaved non non na na N08 guins pheeped non non na na N09 thriest phrirts non non na na N10 dweet speemed non non na na N11 tord jalt non non na na N12 swant pralled non non na na N13 snam gheche non non na na N14 sproque blibbed non non na na N15 plulf blild non non na na N16 swants gooms non non na na N17 psurnt ufts non non na na 181 N18 strurse farged non non na na N19 clitch wrursts non non na na N20 smarps kolts non non na na N21 plowl frouts non non na na N22 gwaths skeld non non na na N23 dwerd chowth non non na na N24 snawse cloot non non na na N25 swegg thafes non non na na N26 cumbed blulls non non na na N27 swerch drithed non non na na N28 phromped tudged non non na na N29 sneefs keph non non na na N30 sliche shrur non non na na N31 gidge sporde non non na na N32 spreese stumes non non na na N33 yamps rast non non na na N34 prage ghaitch non non na na N35 thourged dwagged non non na na N36 thrilth zimes non non na na 182 N37 yighed yoal non non na na N38 phrombed wared non non na na N39 zirms stroards non non na na N40 thogs myed non non na na N41 trume wheamed non non na na N42 shilch chur non non na na N43 spriel tudd non non na na N44 glalc smaunch non non na na N45 knund ormed non non na na N46 frant flusk non non na na N47 tafts fomb non non na na N48 plaired rooths non non na na N49 scrauve snerts non non na na N50 momes spocs non non na na N51 rhurb deich non non na na N52 ghised hersed non non na na N53 knawn wrass non non na na N54 croints kear non non na na N55 blaled loamed non non na na 183 N56 taive gwoured non non na na N57 shrarps dwades non non na na N58 pryp pryths non non na na N59 theezed ghooze non non na na N60 clyes whoy non non na na N61 crisk yeathed non non na na N62 cloothe splurf non non na na N63 thweined chowls non non na na N64 
trowse thwelt non non na na N65 skuned ghinked non non na na N66 skask troots non non na na N67 gnanch smuids non non na na N68 dwek clake non non na na N69 ghoathed twirped non non na na N70 knirr gneke non non na na N71 zarc splee non non na na N72 walds stupe non non na na N73 shroved skulged non non na na N74 spligned sperk non non na na 184 N75 gief truv non non na na N76 wrarned stryth non non na na N77 shoign luilds non non na na N78 toathed knuch non non na na N79 kands cagues non non na na N80 maunt flevved non non na na 185 REFERENCES 186 REFERENCES Altmann, G. T. M., & Kamide, Y. (1999). Incremental interpretation at verbs: Restricting the domain of subsequent reference. Cognition, 73(3), 247–264. https://doi.org/10.1016/S0010-0277(99)00059-1 Anderson, R. C., & Freebody, P. (1981). Vocabulary knowledge. In J. T. Guthrie (Ed.), Comprehension and teaching: Research reviews (pp. 77–117). International Reading Association. Andringa, S., & Rebuschat, P. (2015). New directions in the study of implicit and explicit learning: An introduction. Studies in Second Language Acquisition, 37(2), 185–196. https://doi.org/10.1017/S027226311500008X Anwyl-Irvine, A., Dalmaijer, E. S., Hodges, N., & Evershed, J. K. (2020). Realistic precision and accuracy of online experiment platforms, web browsers, and devices. Behavior Research Methods. https://doi.org/10.3758/s13428-020-01501-5 Aryadoust, V., Ng, L. Y., & Sayama, H. (2020). A comprehensive review of Rasch measurement in language assessment: Recommendations and guidelines for research. Language Testing, 026553222092748. https://doi.org/10.1177/0265532220927487 Baayen, R. H., Davidson, D. J., & Bates, D. M. (2008). Mixed-effects modeling with crossed random effects for subjects and items. Journal of Memory and Language, 59(4), 390– 412. https://doi.org/10.1016/j.jml.2007.12.005 Baayen, R. H., & Milin, P. (2010). Analyzing reaction times. International Journal of Psychological Research, 3(2), 12–28. https://doi.org/10.21500/20112084.807 Barr, D. J., Levy, R., Scheepers, C., & Tily, H. J. (2013). Random effects structure for confirmatory hypothesis testing: Keep it maximal. Journal of Memory and Language, 68(3), 255–278. https://doi.org/10.1016/j.jml.2012.11.001 Bates, D., Mächler, M., Bolker, B., & Walker, S. (2015). Fitting linear mixed-effects models using lme4. Journal of Statistical Software, 67(1), 1–48. https://doi.org/10.18637/jss.v067.i01 Beeckmans, R., Eyckmans, J., Janssens, V., Dufranne, M., & Van de Velde, H. (2001). Examining the Yes/No vocabulary test: Some methodological issues in theory and practice. Language Testing, 18(3), 235–274. https://doi.org/10.1177/026553220101800301 Beglar, D. (2010). A Rasch-based validation of the Vocabulary Size Test. Language Testing, 27(1), 101–118. https://doi.org/10.1177/0265532209340194 187 Bodner, G. E., & Masson, M. E. J. (2001). Prime validity affects masked repetition priming: Evidence for an episodic resource account of priming. Journal of Memory and Language, 45(4), 616–647. https://doi.org/10.1006/jmla.2001.2791 Bordag, D., Kirschenbaum, A., Tschirner, E., & Opitz, A. (2015). Incidental acquisition of new words during reading in L2: Inference of meaning and its integration in the L2 mental lexicon. Bilingualism: Language and Cognition, 18(3), 372–390. https://doi.org/10.1017/S1366728914000078 Brown, T. A. (2015). Confirmatory factor analysis for applied research (Second edition). The Guilford Press. Brysbaert, M. (2020). 
Power considerations in bilingualism research: Time to step up our game. Bilingualism: Language and Cognition, Advance online publication. https://doi.org/10.1017/S1366728920000437 Brysbaert, M., & New, B. (2009). Moving beyond Kučera and Francis: A critical evaluation of current word frequency norms and the introduction of a new and improved word frequency measure for American English. Behavior Research Methods, 41(4), 977–990. https://doi.org/10.3758/BRM.41.4.977 Bürkner, P.-C. (2018). Advanced Bayesian multilevel modeling with the R package brms. The R Journal, 10(1), 395–411. https://doi.org/10.32614/RJ-2018-017 Chapelle, C. A. (1999). Validity in language assessment. Annual Review of Applied Linguistics, 19, 254–272. https://doi.org/10.1017/S0267190599190135 Chapelle, C. A. (2021). Validity in language assessment. In P. Winke & T. Brunfaut (Eds.), The Routledge handbook of second language acquisition and language testing (pp. 11–20). Routledge. Chen, Y. (2021). Comparing incidental vocabulary learning from reading-only and reading-while- listening. System, 97, 102442. https://doi.org/10.1016/j.system.2020.102442 Cheng, J., & Matthews, J. (2018). The relationship between three measures of L2 vocabulary knowledge and L2 listening and reading. Language Testing, 35(1), 3–25. https://doi.org/10.1177/0265532216676851 Collins, A. M., & Loftus, E. F. (1975). A spreading-activation theory of semantic processing. Psychological Review, 82(6), 407–428. https://doi.org/10.1037/0033-295X.82.6.407 Crump, M. J. C., McDonnell, J. V., & Gureckis, T. M. (2013). Evaluating Amazon’s Mechanical Turk as a tool for experimental behavioral research. PLoS ONE, 8(3), e57410. https://doi.org/10.1371/journal.pone.0057410 188 Daller, H., Milton, J., & Treffers-Daller, J. (2007). Editor’s introduction. In H. Daller, J. Milton, & J. Treffers-Daller (Eds.), Modeling and assessing vocabulary knowledge (pp. 1–32). Cambridge University Press. DeKeyser, R. (2003). Implicit and explicit learning. In C. J. Doughty & Ml. H. Long (Eds.), Handbook of second language acquisition (pp. 313–348). Blackwell Publishing Ltd. Denovan, A., & Dagnall, N. (2019). Development and evaluation of the Chronic Time Pressure Inventory. Frontiers in Psychology, 10, 2717. https://doi.org/10.3389/fpsyg.2019.02717 Draheim, C., Mashburn, C. A., Martin, J. D., & Engle, R. W. (2019). Reaction time in differential and developmental research: A review and commentary on the problems and alternatives. Psychological Bulletin, 145(5), 508–535. https://doi.org/10.1037/bul0000192 Elgort, I. (2011). Deliberate learning and vocabulary acquisition in a second language. Language Learning, 61(2), 367–413. https://doi.org/10.1111/j.1467-9922.2010.00613.x Elgort, I. (2017). Incorrect inferences and contextual word learning in English as a second language. Journal of the European Second Language Association, 1(1), 1. https://doi.org/10.22599/jesla.3 Elgort, I., Brysbaert, M., Stevens, M., & Van Assche, E. (2018). Contextual word learning during reading in a second language: An eye-movement study. Studies in Second Language Acquisition, 40(2), 341–366. https://doi.org/10.1017/S0272263117000109 Elgort, I., & Piasecki, A. E. (2014). The effect of a bilingual learning mode on the establishment of lexical semantic representations in the L2. Bilingualism: Language and Cognition, 17(3), 572–588. https://doi.org/10.1017/S1366728913000588 Elgort, I., & Warren, P. (2014). 
L2 vocabulary learning from reading: Explicit and tacit lexical knowledge and the role of learner and item variables: L2 vocabulary learning from reading. Language Learning, 64(2), 365–414. https://doi.org/10.1111/lang.12052 Ellis, N. C. (1994). Consciousness in second language learning: Psychological perspectives on the role of conscious processes in vocabulary acquisition. AILA Review, 11, 37–56. Ellis, N. C. (2005). At the interface: Dynamic interactions of explicit and implicit language knowledge. Studies in Second Language Acquisition, 27(02), 305–352. https://doi.org/10.1017/S027226310505014X Ellis, R. (2004). The definition and measurement of L2 explicit knowledge. Language Learning, 54(2), 227–275. https://doi.org/10.1111/j.1467-9922.2004.00255.x 189 Ellis, R. (2005). Measuring implicit and explicit knowledge of a second language: A psychometric study. Studies in Second Language Acquisition, 27(02), 141–172. https://doi.org/10.1017/S0272263105050096 Ellis, R., & Loewen, S. (2007). Confirming the operational definitions of explicit and implicit knowledge in Ellis (2005): Responding to Isemonger. Studies in Second Language Acquisition, 29(01). https://doi.org/10.1017/S0272263107070052 Evett, L. J., & Humphreys, G. W. (1981). The use of abstract graphemic information in lexical access. The Quarterly Journal of Experimental Psychology, 33(4), 325–350. https://doi.org/10.1080/14640748108400797 Forster, K. I. (1998). The pros and cons of masked priming. Journal of Psycholinguistic Research, 27(2), 203–233. https://doi.org/10.1023/A:1023202116609 Forster, K. I., & Davis, C. (1984). Repetition priming and frequency attenuation in lexical access. Journal of Experimental Psychology: Learning, Memory, and Cognition, 10(4), 680–698. https://doi.org/10.1037/0278-7393.10.4.680 Gelman, A., & Hill, J. (2007). Data Analysis Using Regression and Multilevel/Hierarchical Models. Cambridge University Press. Germine, L., Nakayama, K., Duchaine, B. C., Chabris, C. F., Chatterjee, G., & Wilmer, J. B. (2012). Is the Web as good as the lab? Comparable performance from Web and lab in cognitive/perceptual experiments. Psychonomic Bulletin & Review, 19(5), 847–857. https://doi.org/10.3758/s13423-012-0296-9 Godfroid, A. (2020a). Eye Tracking in Second Language Acquisition and Bilingualism: A Research Synthesis and Methodological Guide (1st ed.). Routledge. https://doi.org/10.4324/9781315775616 Godfroid, A. (2020b). Sensitive measures of vocabulary knowledge and processing: Expanding Nation’s framework. In S. Webb (Ed.), The Routledge handbook of vocabulary studies (pp. 433–453). https://doi.org/10.4324/9780429291586-28 Godfroid, A., Ahn, J., Choi, I., Ballard, L., Cui, Y., Johnston, S., Lee, S., Sarkar, A., & Yoon, H. J. (2018). Incidental vocabulary learning in a natural reading context: An eye-tracking study. Bilingualism, 21(3), 563–584. https://doi.org/10.1017/S1366728917000219 Gollan, T. H., Forster, K. I., & Frost, R. (1997). Translation priming with different scripts: Masked priming with cognates and noncognates in hebrew-english bilinguals. Journal of Experimental Psychology: Learning, Memory, and Cognition, 23(5), 1122–1139. https://doi.org/10.1037/0278-7393.23.5.1122 190 González-Fernández, B., & Schmitt, N. (2020). Word knowledge: Exploring the relationships and order of acquisition of vocabulary knowledge components. Applied Linguistics, 41(4), 481–505. https://doi.org/10.1093/applin/amy057 Grainger, J., Diependaele, K., Spinelli, E., Ferrand, L., & Farioli, F. (2003). 
Masked repetition and phonological priming within and across modalities. Journal of Experimental Psychology: Learning, Memory, and Cognition, 29(6), 1256–1269. https://doi.org/10.1037/0278- 7393.29.6.1256 Günther, F., Dudschig, C., & Kaup, B. (2016). Latent semantic analysis cosines as a cognitive similarity measure: Evidence from priming studies. Quarterly Journal of Experimental Psychology, 69(4), 626–653. https://doi.org/10.1080/17470218.2015.1038280 Gutiérrez, X. (2013). The construct validity of grammaticality judgment tests as measures of implicit and explicit knowledge. Studies in Second Language Acquisition, 35(3), 423–449. https://doi.org/10.1017/S0272263113000041 Harrington, M. (2006). The lexical decision task as a measure of L2 lexical proficiency. EUROSLA Yearbook, 6(1), 147–168. https://doi.org/10.1075/eurosla.6.10har Harrington, M. (2018). Lexical facility: Size, recognition speed and consistency as dimensions of second language vocabulary knowledge. Palgrave Macmillan. Henning, G. (1992). Dimensionality and construct validity of language tests. Language Testing, 9(1), 1–11. https://doi.org/10.1177/026553229200900102 Henriksen, B. (1999). Three dimensions of vocabulary development. Studies in Second Language Acquisition, 21(2), 303–317. https://doi.org/10.1017/S0272263199002089 Hox, J., J., Moerbeek, M., & van de Schoot, R. (2018). Multilevel Analysis Techniques and Applications. Routledge. Hu, L., & Bentler, P. M. (1999). Cutoff criteria for fit indexes in covariance structure analysis: Conventional criteria versus new alternatives. Structural Equation Modeling: A Multidisciplinary Journal, 6(1), 1–55. https://doi.org/10.1080/10705519909540118 Hui, B. (2020). Processing variability in intentional and incidental word learning: An extension of Solovyeva and Dekeyser (2018). Studies in Second Language Acquisition, 42(2), 327–357. https://doi.org/10.1017/S0272263119000603 Hui, B., & Godfroid, A. (2020). Testing the role of processing speed and automaticity in second language listening. Applied Psycholinguistics, Advance online publication. https://doi.org/10.1017/S0142716420000193 191 Huibregtse, I., Admiraal, W., & Meara, P. (2002). Scores on a yes-no vocabulary test: Correction for guessing and response style. Language Testing, 19(3), 227–245. https://doi.org/10.1191/0265532202lt229oa Hulstijn, J. (2005). Theoretical and empirical issues in the study of implicit and explicit second- language learning: Introduction. Studies in Second Language Acquisition, 27(02). https://doi.org/10.1017/S0272263105050084 Hulstijn, J. (2007). Psycholinguistic perspectives on language and its acquisition. In J. Cummins & C. Davison (Eds.), The international handbook of English language teaching (pp. 783– 796). Springer. Isemonger, I. M. (2007). Operational definitions of explicit and implicit knowledge: Response to R. Ellis (2005) and some recommendations for future research in this area. Studies in Second Language Acquisition, 29(01). https://doi.org/10.1017/S0272263107070040 Issa, B. I., Faretta–Stutenberg, M., & Bowden, H. W. (2020). Grammatical and lexical development during short‐term study abroad: Exploring l2 contact and initial proficiency. The Modern Language Journal, 104(4), 860–879. https://doi.org/10.1111/modl.12677 Jeon, E. H., & Yamashita, J. (2014). L2 reading comprehension and its correlates: A meta- analysis. Language Learning, 64(1), 160–212. https://doi.org/10.1111/lang.12034 Jiang, N. (1999). Testing processing explanations for the asymmetry in masked cross-language priming. 
Bilingualism: Language and Cognition, 2(1), 59–75. https://doi.org/10.1017/S1366728999000152 Jiang, N. (2013). Conducting reaction time research in second language studies. Routledge. https://doi.org/10.4324/9780203146255 Jiang, N. (2015). Six decades of research on lexical representation and processing in bilinguals. In J. W. Schwieter (Ed.), The Cambridge Handbook of Bilingual Processing (pp. 29–84). Cambridge University Press. https://doi.org/10.1017/CBO9781107447257.002 Kim, H. S., Lee, J. H., & Lee, H. (2020). The relative effects of L1 and L2 glosses on L2 learning: A meta-analysis. Language Teaching Research, 136216882098139. https://doi.org/10.1177/1362168820981394 Kim, K. H. (2005). The relation among fit indexes, power, and sample size in structural equation modeling. Structural Equation Modeling: A Multidisciplinary Journal, 12(3), 368–390. https://doi.org/10.1207/s15328007sem1203_2 Kim, K. M., & Godfroid, A. (2019). Should we listen or read? Modality effects in implicit and explicit knowledge. Modern Language Journal, 103(3), 648–664. https://doi.org/10.1111/modl.12583 192 Kiss, G. R., Armstrong, C., Milroy, R., & Piper, J. (1973). An associative thesaurus of English and its computer analysis. In A. J. Aitken, R. W. Baileu, & N. Hamilton-Smith (Eds.), The computer and literary studies (pp. 153–165). Edinburgh University Press. Klein, R. A., Ratliff, K. A., Vianello, M., Adams, R. B., Bahník, Š., Bernstein, M. J., Bocian, K., Brandt, M. J., Brooks, B., Brumbaugh, C. C., Cemalcilar, Z., Chandler, J., Cheong, W., Davis, W. E., Devos, T., Eisner, M., Frankowska, N., Furrow, D., Galliani, E. M., … Nosek, B. A. (2014). Investigating variation in replicability: A “Many Labs” replication project. Social Psychology, 45(3), 142–152. https://doi.org/10.1027/1864-9335/a000178 Kline, R. B. (2015). Principles and practice of structural equation modeling (4th ed.). The Guilford Press. Koizumi, R., & In’nami, Y. (2020). Structural equation modeling of vocabulary size and depth using conventional and Bayesian methods. Frontiers in Psychology, 11, 618. https://doi.org/10.3389/fpsyg.2020.00618 Kroll, J., & Tokowicz, N. (2005). Models of bilingual representation and processing: Looking back and to the future. In J. Kroll & A. M. B. De Groot (Eds.), Handbook of bilingualism: Psycholinguistic approaches (pp. 531–553). Oxford University Press. Kuznetsova, A., Brockhoff, P. B., & Christensen, R. H. B. (2017). lmerTest package: Tests in linear mixed effects models. Journal of Statistical Software, 82(13), 1–26. https://doi.org/10.18637/jss.v082.i13 Laufer, B. (1992). How much lexis is necessary for reading comprehension? In P. J. L. Arnaud & H. Béjoint (Eds.), Vocabulary and Applied Linguistics (pp. 126–132). Palgrave Macmillan UK. https://doi.org/10.1007/978-1-349-12396-4_12 Laufer, B., & Nation, P. (1999). A vocabulary-size test of controlled productive ability. Language Testing, 16(1), 33–51. https://doi.org/10.1177/026553229901600103 Leow, R. P. (2001). Attention, awareness, and foreign language behavior. Language Learning, 51, 113–155. https://doi.org/10.1111/j.1467-1770.2001.tb00016.x Li, M., & Zhang, X. (2021). A meta-analysis of self-assessment and language performance in language testing and assessment. Language Testing, 38(2), 189–218. https://doi.org/10.1177/0265532220932481 Loewen, S., & Gönülal, T. (2015). Exploratory factor analysis and primcipal componetns analysis. In L. Plonsky (Ed.), Advancing quantitative methods in second language research (pp. 182–212). Routledge. 
Loewen, S., & Hui, B. (2021). Small samples in instructed second language acquisition research. The Modern Language Journal, 105(1), 187–193. https://doi.org/10.1111/modl.12700 193 Lüdecke, D., Makowski, D., Waggoner, P., & Patil, I. (2020). performance: Assessment of Regression Models Performance. CRAN. https://doi.org/10.5281/zenodo.3952174 MacCallum, R. C., Wegener, D. T., Uchino, B. N., & Fabrigar, L. R. (1993). The problem of equivalent models in applications of covariance structure analysis. Psychological Bulletin, 114(1), 185–199. https://doi.org/10.1037/0033-2909.114.1.185 Maie, R., & DeKeyser, R. M. (2020). Conflicting evidence of explicit and implicit knowledge from objective and subjective measures. Studies in Second Language Acquisition, 42(2), 359– 382. https://doi.org/10.1017/S0272263119000615 Mair, P., & Hatzinger, R. (2007). Extended Rasch modeling: The eRm package for the application of IRT models in R. Journal of Statistical Software, 20(9). http://www.jstatsoft.org/v20/i09 Mandera, P., Keuleers, E., & Brysbaert, M. (2017). Explaining human performance in psycholinguistic tasks with models of semantic similarity based on prediction and counting: A review and empirical validation. Journal of Memory and Language, 92, 57– 78. https://doi.org/10.1016/j.jml.2016.04.001 Marsden, E., Morgan-Short, K., Thompson, S., & Abugaber, D. (2018). Replication in second language research: Narrative and systematic reviews and recommendations for the field: replication in second language research. Language Learning, 68(2), 321–391. https://doi.org/10.1111/lang.12286 Marslen-Wilson, W. D., & Welsh, A. (1978). Processing interactions and lexical access during word recognition in continuous speech. Cognitive Psychology, 10(1), 29–63. https://doi.org/10.1016/0010-0285(78)90018-X McClelland, J. L., & Elman, J. L. (1986). The TRACE model of speech perception. Cognitive Psychology, 18(1), 1–86. https://doi.org/10.1016/0010-0285(86)90015-0 McNamara, T., & Knoch, U. (2012). The Rasch wars: The emergence of Rasch measurement in language testing. Language Testing, 29(4), 555–576. https://doi.org/10.1177/0265532211430367 McNamara, T. P. (2005). Semantic priming: Perspectives from memory and word recognition. Psychology Press. http://site.ebrary.com/id/10163350 McNeish, D. (2018). Thanks coefficient alpha, we’ll take it from here. Psychological Methods, 23(3), 412–433. http://dx.doi.org.proxy1.cl.msu.edu/10.1037/met0000144 McRae, K., & Boisvert, S. (1998). Automatic semantic similarity priming. Journal of Experimental Psychology: Learning, Memory, and Cognition, 24(3), 558–572. https://doi.org/10.1037/0278-7393.24.3.558 194 Meara, P. (1992). Network structures and vocabulary acquisition in a foreign language. In P. J. L. Arnaud & H. Béjoint (Eds.), Vocabulary and applied linguistics (pp. 62–70). Palgrave Macmillan UK. https://doi.org/10.1007/978-1-349-12396-4_6 Meara, P. (2009). Connected words: Word associations and second language vocabulary acquisition. John Benjamins Pub. Co. Meara, P. (2010). EFL vocabulary tests (2nd ed.). Centre for Applied Language Studies, University College Swansea. Meara, P., & Buxton, B. (1987). An alternative to multiple choice vocabulary tests. Language Testing, 4(2), 142–151. Messick, S. (1989). Meaning and values in test validation: The science and ethics of assessment. Educational Researcher, 18(2), 5–11. Milton, J., & Fitzpatrick, T. (2014). Dimensions of vocabulary knowledge. Palgrave. Mochida, A., & Harrington, M. (2006). 
The Yes/No test as a measure of receptive vocabulary knowledge. Language Testing, 23(1), 73–98. https://doi.org/10.1191/0265532206lt321oa Müller, M. (2020). Item fit statistics for Rasch analysis: Can we trust them? Journal of Statistical Distributions and Applications, 7(1), 5. https://doi.org/10.1186/s40488-020-00108-7 Muthén, L. K., & Muthén, B. O. (2002). How to use a Monte Carlo study to decide on sample size and determine power. Structural Equation Modeling: A Multidisciplinary Journal, 9(4), 599–620. https://doi.org/10.1207/S15328007SEM0904_8 Nakata, T., & Elgort, I. (2020). Effects of spacing on contextual vocabulary learning: Spacing facilitates the acquisition of explicit, but not tacit, vocabulary knowledge. Second Language Research, 026765832092776. https://doi.org/10.1177/0267658320927764 Nation, I. S. P. (2012). The BNC/COCA word family lists. Unpublished paper. Available at: Http://www.victoria.ac.nz/lals/about/staff/paul-nation. Nation, I. S. P. (2013a). Learning vocabulary in another language (2nd ed.). Cambridge University Press. Nation, I. S. P. (2013b). Learning vocabulary in another language (Second Edition). Cambridge University Press. Nation, I. S. P., & Beglar, D. (2007). A vocabulary size test. The Language Teacher, 31, 9–13. Nation, I. S. P., & Webb, S. A. (2011). Researching and analyzing vocabulary (1st ed). Heinle, Cengage Learning. 195 Nelson, D. L., McEvoy, C. L., & Schreiber, T. A. (1998). The University of South Florida word association, rhyme, and word fragment norms. http://www.usf.edu/FreeAssociation/ Paribakht, T. S., & Wesche, M. B. (1993). Reading comprehension and second language development in a comprehension-based ESL program. TESL Canada Journal, 11(1), 09. https://doi.org/10.18806/tesl.v11i1.623 Pavlenko, A. (2009). Conceptual representation in the bilingual lexicon and second language vocabulary learning. In A. Pavlenko (Ed.), The bilingual mental lexicon (pp. 125–160). Multilingual Matters. https://doi.org/10.21832/9781847691262-008 Pellicer-Sánchez, A., & Schmitt, N. (2012). Scoring Yes–No vocabulary tests: Reaction time vs. nonword approaches. Language Testing, 29(4), 489–509. https://doi.org/10.1177/0265532212438053 Perfetti, C. (2007). Reading ability: Lexical quality to comprehension. Scientific Studies of Reading, 11(4), 357–383. https://doi.org/10.1080/10888430701530730 Peters, E. (2019). The effect of imagery and on‐screen text on foreign language vocabulary learning from audiovisual input. TESOL Quarterly, 53(4), 1008–1032. https://doi.org/10.1002/tesq.531 Plonsky, L., Marsden, E., Crowther, D., Gass, S. M., & Spinner, P. (2020). A methodological synthesis and meta-analysis of judgment tasks in second language research. Second Language Research, 36(4), 583–621. https://doi.org/10.1177/0267658319828413 Qian, D. D., & Lin, L. H. F. (2020). The relationship between vocabulary knowledge and language proficiency. In S. Webb (Ed.), The Routledge Handbook of Vocabulary Studies (1st ed., pp. 66–80). Routledge. https://doi.org/10.4324/9780429291586-5 R Core Team. (2020). R: A language and environment for statistical computing. https://www.R- project.org/ Ramezanali, N., Uchihara, T., & Faez, F. (2021). Efficacy of multimodal glossing on second language vocabulary learning: A meta-analysis. TESOL Quarterly, n/a(n/a), Advance online publication. https://doi.org/10.1002/tesq.579 Rasch, G. (1960). Probabilistic models for some intelligence and attainment tests. Danish Institute for Educational Research. Rastle, K., Harrington, J., & Coltheart, M. 
(2002). 358,534 nonwords: The ARC nonword database. The Quarterly Journal of Experimental Psychology Section A, 55(4), 1339– 1362. https://doi.org/10.1080/02724980244000099 196 Raykov, T., & Marcoulides, G. A. (2019). Thanks coefficient alpha, we still need you! Educational and Psychological Measurement, 79(1), 200–210. https://doi.org/10.1177/0013164417725127 Read, J. (1993). The development of a new measure of L2 vocabulary knowledge. Language Testing, 10(3), 355–371. https://doi.org/10.1177/026553229301000308 Read, J. (2004). Plumbing the depths: How should the construct of vocabulary knowledge be defined? In P. Bogaards & B. Laufer (Eds.), Vocabulary in a second language (pp. 209– 227). John Benjamins. Read, J. (2020). Key issues in measuring vocabulary knowledge. In S. Webb (Ed.), The Routledge handbook of vocabulary studies (1st ed., pp. 545–560). Routledge. https://doi.org/10.4324/9780429291586-34 Revelle, W. (2020). psych: Procedures for Psychological, Psychometric, and Personality Research. Northwestern University. https://CRAN.R-project.org/package=psych Révész, A., & Brunfaut, T. (2021). Validating assessments for research purposes. In P. Winke & T. Brunfaut (Eds.), The routledge handbook of second language acquisition and language testing (pp. 21–32). Routledge. https://doi.org/10.4324/9781351034784 Rosseel, Y. (2012). lavaan: An R Package for Structural Equation Modeling. Journal of Statistical Software, 48(2), 1–36. Rouder, J. N., & Haaf, J. M. (2019). A psychometrics of individual differences in experimental tasks. Psychonomic Bulletin and Review, 26, 452–467. https://doi.org/10.3758/s13423- 018-1558-y RStudio Team. (2020). RStudio: Integrated Development Environment for R. http://www.rstudio.com/ Ruiz, S., Chen, X., Rebuschat, P., & Meurers, D. (2019). Measuring individual differences in cognitive abilities in the lab and on the web. PLOS ONE, 14(12), e0226217. https://doi.org/10.1371/journal.pone.0226217 Scarborough, D. L., Cortese, C., & Scarborough, H. S. (1977). Frequency and repetition effects in lexical memory. Journal of Experimental Psychology: Human Perception and Performance, 3(1), 1–17. https://doi.org/10.1037/0096-1523.3.1.1 Schmitt, N. (2010). Researching vocabulary: A vocabulary research manual. Palgrave. Schmitt, N. (2014). Size and depth of vocabulary knowledge: What the research shows. Language Learning, 64(4), 913–951. https://doi.org/10.1111/lang.12077 197 Schmitt, N., Nation, P., & Kremmel, B. (2020). Moving the field of vocabulary assessment forward: The need for more rigorous test development and validation. Language Teaching, 53(1), 109–120. https://doi.org/10.1017/S0261444819000326 Segalowitz, N. S., & Segalowitz, S. J. (1993). Skilled performance, practice, and the differentiation of speed-up from automatization effects: Evidence from second language word recognition. Applied Psycholinguistics, 14(3), 369–385. https://doi.org/10.1017/S0142716400010845 Siegelman, N., Bogaerts, L., & Frost, R. (2017). Measuring individual differences in statistical learning: Current pitfalls and possible solutions. Behavior Research Methods, 49(2), 418– 432. https://doi.org/10.3758/s13428-016-0719-z Sireci, S. G. (2009). Packing and unpacking sources of validity evidence: History repeats itself again. In R. W. Lissitz (Ed.), The concept of validity: Revisions, new directions, and applications (pp. 19–37). IAP Information Age Publishing. Smith, R. M., Schumacker, R. E., & Bush, M. J. (1998). Using item mean squares to evaluate fit to the Rasch model. 
Journal of Outcome Measurement, 2(1), 66–78. Sonbul, S., & Schmitt, N. (2013). Explicit and implicit lexical knowledge: Acquisition of collocations under different input conditions. Language Learning, 63(1), 121–159. https://doi.org/10.1111/j.1467-9922.2012.00730.x Spada, N., Shiu, J. L.-J., & Tomita, Y. (2015). Validating an elicited imitation task as a measure of implicit knowledge: Comparisons with other validation studies. Language Learning, 65(3), 723–751. https://doi.org/10.1111/lang.12129 Staub, A. (2021). How reliable are individual differences in eye movements in reading? Journal of Memory and Language, 116, 104190. https://doi.org/10.1016/j.jml.2020.104190 Stenneken, P., Conrad, M., & Jacobs, A. M. (2007). Processing of syllables in production and recognition tasks. Journal of Psycholinguistic Research, 36(1), 65–78. https://doi.org/10.1007/s10936-006-9033-8 Stubbe, R. (2012). Do pseudoword false alarm rates and overestimation rates in Yes/No vocabulary tests change with Japanese university students’ English ability levels? Language Testing, 29(4), 471–488. https://doi.org/10.1177/0265532211433033 Suzuki, Y. (2017). Validity of new measures of implicit knowledge: Distinguishing implicit knowledge from automatized explicit knowledge. Applied Psycholinguistics, 38(5), 1229– 1261. https://doi.org/10.1017/S014271641700011X Tanabe, M. (2016). Measuring second language vocabulary knowledge using a temporal method. Reading in a Foreign Language, 28(1), 118–142. 198 Taylor, J. E., Beith, A., & Sereno, S. C. (2020). LexOPS: An R package and user interface for the controlled generation of word stimuli. Behavior Research Methods, 52(6), 2372–2382. https://doi.org/10.3758/s13428-020-01389-1 Toomer, M., & Elgort, I. (2019). The development of implicit and explicit knowledge of collocations: A conceptual replication and extension of Sonbul and Schmitt (2013). Language Learning, 69(2), 405–439. https://doi.org/10.1111/lang.12335 Trofimovich, P., & McDonough, K. (Eds.). (2011). Applying priming methods to L2 learning, teaching and research: Insights from psycholinguistics. John Benjamins Pub. Co. Ullman, M. T. (2001). The Declarative/Procedural Model of Lexicon and Grammar. Journal of Psycholinguistic Research, 30(1), 37–69. https://doi.org/10.1023/A:1005204207369 Vafaee, P., & Kachinske, I. (2019). The inadequate use of confirmatory factor analysis in second language acquisition validation studies. Studies in Applied Linguistics and TESOL, Vol. 19 No. 2 (2019). https://doi.org/10.7916/SALT.V19I2.4184 Vafaee, P., & Suzuki, Y. (2020). The relative significance of syntactic knowledge and vocabulary knowledge in second language listening ability. Studies in Second Language Acquisition, 42(2), 383–410. https://doi.org/10.1017/S0272263119000676 Vafaee, P., Suzuki, Y., & Kachisnke, I. (2017). Validating grammaticality judgment tests: Evidence from two new psycholinguistic measures. Studies in Second Language Acquisition, 39(1), 59–95. https://doi.org/10.1017/S0272263115000455 Vandergrift, L., & Baker, S. (2015). Learner variables in second language listening comprehension: An exploratory path analysis. Language Learning, 65(2), 390–416. https://doi.org/10.1111/lang.12105 VanPatten, B., & Jegerski, J. (Eds.). (2014). Research methods in second language psycholinguistics. Routledge. Webb, S. (2005). Receptive and productive vocabulary learning: The effects of reading and writing on word knowledge. Studies in Second Language Acquisition, 27(01), 33–52. https://doi.org/10.1017/S0272263105050023 Webb, S. 
(2007). The effects of repetition on vocabulary knowledge. Applied Linguistics, 28(1), 46–65. https://doi.org/10.1093/applin/aml048 Webb, S. (2012). Depth of vocabulary knowledge. In C. A. Chapelle (Ed.), The encyclopedia of Applied Linguistics (pp. 1656–1663). Blackwell Publishing Ltd. https://doi.org/10.1002/9781405198431.wbeal1325 199 Wesche, M., & Paribakht, T. S. (1996). Assessing second language vocabulary knowledge: Depth versus breadth. Canadian Modern Language Review, 53(1), 13–40. https://doi.org/10.3138/cmlr.53.1.13 Wickham, H. (2016). ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag New York. https://ggplot2.tidyverse.org Wickham, H., Averick, M., Bryan, J., Chang, W., McGowan, L., François, R., Grolemund, G., Hayes, A., Henry, L., Hester, J., Kuhn, M., Pedersen, T., Miller, E., Bache, S., Müller, K., Ooms, J., Robinson, D., Seidel, D., Spinu, V., … Yutani, H. (2019). Welcome to the Tidyverse. Journal of Open Source Software, 4(43), 1686. https://doi.org/10.21105/joss.01686 Woods, A. T., Velasco, C., Levitan, C. A., Wan, X., & Spence, C. (2015). Conducting perception research over the internet: A tutorial review. PeerJ, 3, e1058. https://doi.org/10.7717/peerj.1058 Wright, B. D., & Linacre, J. M. (1994). Reasonable mean-square fit values. Rasch Measurement Transactions, 8, 370–371. Yanagisawa, A., & Webb, S. (2020). Measuring depth of vocabulary knowledge. In S. Webb (Ed.), The Routledge Handbook of Vocabulary Studies (1st ed., pp. 371–386). Routledge. https://doi.org/10.4324/9780429291586-24 Yanagisawa, A., Webb, S., & Uchihara, T. (2020). How do different forms of glossing contribute to L2 vocabulary learning from reading?: A meta-regression analysis. Studies in Second Language Acquisition, 42(2), 411–438. https://doi.org/10.1017/S0272263119000688 Zhang, S., & Zhang, X. (2020). The relationship between vocabulary knowledge and L2 reading/listening comprehension: A meta-analysis. Language Teaching Research, 136216882091399. https://doi.org/10.1177/1362168820913998 Zhang, X., Liu, J., & Ai, H. (2020). Pseudowords and guessing in the Yes/No format vocabulary test. Language Testing, 37(1), 6–30. https://doi.org/10.1177/0265532219862265 200