LEARNING WORDS UNDER INCIDENTAL AND INTENTIONAL LEARNING CONDITIONS: AN EYE-TRACKING STUDY

By

Ina Choi

A DISSERTATION

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of

Second Language Studies – Doctor of Philosophy

2018

ABSTRACT

The present study investigated the cognitive processes of vocabulary learning under incidental and intentional conditions using eye-tracking. It aimed to determine the extent to which intentionality and time restrictions are associated with vocabulary learning, as well as the mechanism through which these relationships are mediated by attention, controlling for the effects of word length, predictability, and part of speech of the target words. Forty-four high-intermediate L2 English learners were randomly assigned to one of three groups: no test announcement with time restriction (Group 1: Incidental Timed), test announcement without time restriction (Group 2: Intentional Untimed), and test announcement with time restriction (Group 3: Intentional Timed). The participants read a reading passage of 1,100 words twice while their eyes were being tracked. Twelve low-frequency English words in the text served as targets for word learning. To measure noticing accurately, two eye-tracking measures were used: total fixation duration and the difference between the observed and expected duration. After reading, participants received three surprise vocabulary tests in the following order: form recognition, meaning recall, and meaning recognition. The descriptive statistics confirmed a pattern of incremental vocabulary development, with the highest scores on form recognition, followed by meaning recognition, and then meaning recall.
Eye-movement data showed that the Intentional Untimed and Intentional Timed groups (Groups 2 and 3) spent similar amounts of time on target words, while the Incidental Timed group (Group 1) paid significantly less attention to targets than the other groups. More importantly, a multivariate multilevel mediation model demonstrated the importance of attention in predicting learning success. The effects of the test announcement and the time limit were completely mediated by the total reading time on target words. The results further support the hypothesis that intentional and incidental learning differ quantitatively by showing that the effect of the test announcement was significant for the total reading time, but not for the extra attentional processing time.

Copyright by INA CHOI 2018

ACKNOWLEDGEMENTS

I would like to express my deepest gratitude and sincere appreciation to my advisor, Dr. Aline Godfroid, for the support and guidance she has provided during the entire process of writing my dissertation. I am deeply indebted to her for her extreme patience and constant encouragement, which have helped me stay on track and keep pushing forward when the going got tough. I would like to extend my gratitude to my committee members, Dr. Susan Gass, Dr. Shawn Loewen, and Dr. Paula Winke, for their understanding, support, and feedback during this project. My appreciation also goes to Unhee Ju and Hope Akaeze for their valuable advice on the statistical analysis and to Jennifer Majorana for reading my dissertation. I am grateful for the financial support provided for the dissertation by the Journal of Language Learning with a Language Learning Dissertation Grant and by the College of Arts and Letters at Michigan State University with a Dissertation Completion Fellowship. I am especially thankful for my colleagues and friends who have cheered me on and helped me stay steady through these many years: Yaqiong Cui, Talip Gonulal, Jihyun Park, and Lorena Valmori.
The relationships and memories that I have developed with you all will always be priceless to me, and I will cherish them for life. Most importantly, none of this would have been possible without the love and encouragement of my family, including Yoon-A and In-hyuck. My special thanks go to my mother for her unconditional love, faith, and confidence in me. My most important source of support and strength has been my husband, Sunil. This is truly an accomplishment that belongs to both of us, and I would never have achieved it without your support, your love, and your belief in me.

TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES
INTRODUCTION
CHAPTER 1: REVIEW OF THE LITERATURE
1.1 Attention and awareness in Second Language Acquisition
1.2 Involvement Load Hypothesis
1.3 Understanding incidental vocabulary learning
1.4 Hulstijn’s methodological operationalization
1.5 Understanding eye-tracking methodology
1.6 Eye-tracking and second language vocabulary acquisition
1.7 Conclusion
CHAPTER 2: THE CURRENT STUDY
2.1 Overview of the research design
2.2 Materials
2.2.1 Experimental text
2.2.2 Target words
2.2.3 Language background questionnaire
2.2.4 Prescreening vocabulary test
2.2.5 Reading proficiency test
2.2.6 Comprehension test
2.2.7 Vocabulary tests
2.3 Participants
2.4 Procedure
2.4.1 Apparatus
2.4.2 Pretests
2.4.3 Reading experiment
2.4.4 Posttests
2.4.5 Word predictability ratings
2.5 Analysis
2.5.1 Definition of variables
2.5.2 Eye-tracking data preparation
2.5.3 Data structure
2.5.4 Multivariate Multilevel Mediation Analysis (MMMA)
2.5.4.1 Path analysis
2.5.4.2 Mediation analysis
2.5.4.2.1 Multilevel Mediation Analysis
2.5.4.2.2 Multilevel Mediation Analysis with Multiple Outcomes
2.5.5 Statistical analysis
CHAPTER 3: RESULTS
3.1 Pretests
3.2 Time on the reading task by group
3.3 Vocabulary test results by group
3.4 Eye fixations by group
3.4.1 Comparison between Session 1 and Session 2
3.4.2 Summed Total Reading Time and Summed DOE
3.5 Multivariate Multilevel Mediation Model Results
3.5.1 Model comparisons
3.5.2 Final Multivariate Multilevel Mediation Model
3.5.3 Effect of Test Announcement and Time Limit
3.5.3.1 Effect on eye-tracking measures
3.5.3.2 Indirect effect on learning
3.5.4 Effect of Word Length, Predictability, Part of Speech
3.5.4.1 Effect on eye-tracking measures
3.5.4.2 Indirect on learning
3.5.5 Effect of summed total reading time and DOE
3.5.6 Comparative strength
CHAPTER 4: DISCUSSION AND CONCLUSION
4.1 Offline vocabulary post-test measures
4.2 Online eye-tracking measures
4.3 Looking at many variables combined: The multilevel multivariate mediation model
4.4 Methodological contribution to SLA
4.5 Limitations and future research
4.6 Pedagogical implication
4.7 Conclusion
APPENDICES
Appendix A Experimental text
Appendix B Language background questionnaire
Appendix C Sample of prescreening vocabulary test
Appendix D Sample of reading proficiency test
Appendix E Comprehension test
Appendix F Form recognition test
Appendix G Meaning recognition test
Appendix H Meaning recall test
Appendix I Instruction by condition
REFERENCES

LIST OF TABLES

Table 1 Group descriptions
Table 2 Vocabulary Profiles for Experimental Text
Table 3 New Vocabulary Levels Test Results
Table 4 Readability Assessment of Experimental Text
Table 5 Target Words
Table 6 Procedure of the Study
Table 7 Average Scores on Two Pretests by Group
Table 8 Average Time on Reading Task by Group
Table 9 Average Scores on Three Vocabulary Post-test Measures by Group
Table 10 Mean Fixation Count, Mean Total Reading Time, and Mean DOE for the First and Second Session
Table 11 Average Summed Total Reading Time and Summed DOE by Group
Table 12 Model Fit Comparisons
Table 13 Effects of Predictors on Total Reading Time and DOE
Table 14 Effects of Total Reading Time and DOE on Vocabulary Learning
Table 15 Indirect Effects of Predictors on Vocabulary Learning

LIST OF FIGURES

Figure 1 Data structure
Figure 2 Path diagram for a basic single-mediator model
Figure 3 Path diagrams for a partial and full mediation model
Figure 4 Performance on the three vocabulary post-test measures
Figure 5 Mean total reading time by target words and groups
Figure 6 Mean DOE by target words and groups
Figure 7 Alternative path model 1
Figure 8 Alternative path model 2
Figure 9 Alternative path model 3
Figure 10 Final path model 4
Figure 11 Final path model 4 with all dependent variables included
Figure 12 Path model 5 for DOE
INTRODUCTION

Building vocabulary knowledge is the most basic and essential element of language learning and language use. According to Nation (2001), language learners need to know at least 6,000 word families to understand spoken language and 8,000 word families to understand written language. Explicit instruction in language classes alone cannot account for learners reaching this threshold. Instead, extensive reading serves as a key to increasing the size of learners’ vocabulary. Extensive reading often involves a high level of incidental learning because learners do not read with the intention of learning lexical items, but they may pick up words incidentally in the process.

Researchers have long recognized that incidental and intentional vocabulary learning differ in their effectiveness, but it is unclear whether such differences reflect quantitative or qualitative differences in the underlying cognitive processes. Because intentional learning often involves more time on task and draws more attention to targets than incidental learning, incidental learners may simply be learning under less favorable conditions; the facilitative effect of intentional learning may therefore simply be due to the longer time spent on the targeted lexical forms. Using eye-tracking, the current study addresses these concerns by investigating the cognitive processes of vocabulary learning under incidental and intentional conditions. Results are expected to inform researchers and practitioners as to whether intentional and incidental learning differ qualitatively or only quantitatively, contributing to a growing body of research on second language vocabulary learning.

CHAPTER 1: REVIEW OF THE LITERATURE

1.1 Attention and awareness in Second Language Acquisition

The constructs of noticing, attention, and awareness have been explored in Second Language Acquisition (SLA) since the 1980s (e.g., Hulstijn, 1989).
However, the first serious discussions and analyses of noticing emerged during the 1990s with Richard Schmidt’s noticing hypothesis (Schmidt, 1990, 1994, 1995, 2001, 2010), which has proved a powerful concept in cognitively oriented SLA research ever since. It has served as a theoretical framework for interpreting various pedagogical phenomena, including input and interaction (e.g., Gass, 1997; Long, 1996), Focus on Form (e.g., Doughty & Williams, 1998; R. Ellis, 2002; Williams, 2005), and implicit and explicit language learning (e.g., N. Ellis, 1994). According to the noticing hypothesis, noticing is “the necessary and sufficient condition for the conversion of input to intake” (Schmidt, 1990, p. 129). The hypothesis is based on Schmidt’s own experience of learning Portuguese in Brazil (Schmidt & Frota, 1986). Analyzing his journals and recordings of his conversations with Brazilian interlocutors, Schmidt and Frota found that the linguistic features that he was able to incorporate into his speech were generally those that he had consciously noticed in the input. Forms that were present in the input but had not been noticed were not used in production. Based on these findings, Schmidt maintained that noticing is a prerequisite for learning to take place and is necessarily a conscious process.

This early version of the noticing hypothesis (Schmidt, 1990) has been criticized for several reasons. First, the term “intake” was not clearly defined. Godfroid, Housen, and Boers (2010) contended, “this characterization does not specify exactly what intake is, other than that it is the product of noticing and an intermediary step in the acquisition process” (p. 170). Second, researchers disagreed with the premise that noticing is a necessary condition for learning.
For example, in Gass, Svetics, and Lemelin’s (2003) study, Italian learners in the non-focused-attention group showed greater gains than those in the focused-attention group. Third, Schmidt’s (1995) idea that unconscious learning does not exist has been contested by a series of studies and reviews (e.g., Hama & Leow, 2010; Leow et al., 2008; Rosa & Leow, 2004; Williams, 2005). Williams (2005) reported that learning without awareness can occur, whereas Robinson (1995) suggested that both focal attention and awareness are required for the representation of novel linguistic forms.

In later publications, Schmidt (2001) weakened his argument, proposing that noticing is at least a facilitative, if not a necessary and sufficient, condition for L2 development. What is important for him, however, may be the fact that “more awareness leads to more learning” (p. 8) rather than whether awareness is necessary or not. Because of this theoretical confusion, cognitive psychologists retired the term “noticing” and started adopting the constructs of “attention” and “awareness” (Godfroid, Boers, & Housen, 2013). “Noticing” now serves as an umbrella term that involves both attention and awareness. Schmidt also invoked these two concepts in a recent publication, saying “the idea that SLA is largely driven by what learners pay attention to and become aware of in target language input seems the essence of common sense” (Schmidt, 2010, p. 721). Schmidt (2001) also highlighted the role of attention, stating that “There is no doubt that attended learning is far superior, and for all practical purposes, attention is necessary for all aspects of second language learning” (pp. 1-2).

In the field of SLA, many researchers have examined the role of awareness in noticing studies through various techniques such as think-alouds, underlining, and stimulated recall interviews (e.g., Godfroid & Spino, 2015; Hanaoka, 2007; Izumi & Bigelow, 2000).
The current study, however, focuses on noticing as attention by adopting eye-tracking methodology.

1.2 Involvement Load Hypothesis

Among the various approaches to vocabulary teaching, Laufer and Hulstijn (2001) formulated the Involvement Load Hypothesis, which claims that the degree of involvement is a key to better retention of unknown words. Involvement is regarded as a combined motivational and cognitive construct with three main components: need, search, and evaluation. According to Laufer and Hulstijn (2001), need is the need to achieve; it indicates how much students need the word to complete the task. Search is the attempt to find the meaning; it concerns whether students have to look for the meaning or whether it is given directly. Evaluation refers to assessing whether a meaning is correct, that is, whether students have to compare different meanings and decide on the correct one. The need and evaluation components can each be rated on three levels (0 to 2), depending on whether learners are intrinsically or externally motivated and whether the degree of cognitive processing is moderate or strong. The search component can be either absent (0) or present (1). By summing these three components, any task can be rated from a minimum of 0 to a maximum of 5. According to the Involvement Load Hypothesis, tasks with higher ratings are more effective vocabulary tasks.

Most studies referring to the hypothesis have appeared in the areas of incidental vocabulary learning and task-based learning. Word retention was found to be longer when words were looked up in a dictionary than when words were explained (Cho & Krashen, 1994). Marginal glosses were also found to be facilitative relative to a control group (Hulstijn, Hollander, & Greidanus, 1996).
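The scoring scheme of the Involvement Load Hypothesis described in this section amounts to simple addition. The Python sketch below is illustrative only: the names `TaskRating` and `involvement_index` and the two example tasks are hypothetical and are not taken from Laufer and Hulstijn (2001) or any other study. It assumes the component ranges given above (need and evaluation from 0 to 2, search 0 or 1).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskRating:
    """Component ratings for one vocabulary task (hypothetical structure)."""
    need: int        # 0 = absent, 1 = moderate, 2 = strong
    search: int      # 0 = absent, 1 = present
    evaluation: int  # 0 = absent, 1 = moderate, 2 = strong


def involvement_index(rating: TaskRating) -> int:
    """Sum the three components into an index between 0 and 5."""
    if rating.need not in (0, 1, 2):
        raise ValueError("need must be 0, 1, or 2")
    if rating.search not in (0, 1):
        raise ValueError("search must be 0 or 1")
    if rating.evaluation not in (0, 1, 2):
        raise ValueError("evaluation must be 0, 1, or 2")
    return rating.need + rating.search + rating.evaluation


# Illustrative ratings only; they do not describe tasks from any study.
example_tasks = {
    "task_a": TaskRating(need=1, search=0, evaluation=0),
    "task_b": TaskRating(need=1, search=1, evaluation=2),
}

for name, rating in example_tasks.items():
    print(name, involvement_index(rating))
```

Because the index is an unweighted sum, a task rated strong on need and a task rated strong on evaluation receive the same contribution of 2 from that component.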
Findings from Joe’s (1995, 1998) research were consistent with the previous research in that engagement in tasks enhanced the acquisition of vocabulary. A series of experiments by Laufer (2005, 2006) showed that filling in blanks in sentences with the target words led to better retention than reading a text for comprehension. All in all, the results of this research seem to suggest that varying degrees of involvement load have some effect on vocabulary learning, as the Involvement Load Hypothesis predicts.

While the studies discussed above were interpreted in light of the involvement load explanation, Hulstijn and Laufer (2001) and Kim (2008) were designed to test the Involvement Load Hypothesis directly and empirically. Hulstijn and Laufer (2001) explored the effects of three different tasks with a total of 225 English language learners in Israel and the Netherlands. The three tasks were designed to represent three different involvement loads: (1) reading comprehension with marginal glosses, (2) comprehension plus filling in target words, and (3) composition writing with target words. Participants were expected to learn ten target words incidentally through the tasks. Unexpected vocabulary tests were administered right after the completion of the task and 1-2 weeks later to measure short-term and long-term retention of the words. The results of Hulstijn and Laufer (2001) differed between the two groups of participants (Israel and the Netherlands). For the Israeli participants, the findings were in accordance with the Involvement Load Hypothesis. The composition task yielded the highest scores, the reading plus fill-in-the-blank task yielded lower scores, and the reading with marginal glosses task yielded the lowest scores.
In the experiment in the Netherlands, however, scores from the fill-in-the-blank task and the reading with glosses task were not significantly different, although participants performed significantly better in the composition task than in the two other tasks.

The main concern with Hulstijn and Laufer’s (2001) study lies in the fact that the time assigned to each task was different: 40-45 min for reading plus marginal glosses, 50-55 min for fill-in-the-blank, and 70-80 min for composition. Although the authors stated that time on task is considered an inherent property of a task, it is unclear whether the difference in scores among the three tasks should be attributed to time or to involvement load. It is possible that the composition task yielded higher scores because participants had more time to learn and remember the target vocabulary. Also, when operationalizing the levels of involvement load (need, search, and evaluation), the hypothesis assumes that each component contributes equally to involvement load. For example, the impact of strong need on involvement load might differ from the impact of strong evaluation, although both are represented by the same number (2) in the involvement load index. However, the ecological generalizability of the study is fairly high. The experiments were conducted during normal classes, and participants were randomly selected and assigned to each task in two different countries. Considering that most studies in the SLA field have a limited number of participants, Hulstijn and Laufer’s (2001) research can be considered to have a comparatively large number of participants (97 and 128 in the two countries).

Kim (2008) also explored the effect of task involvement load on second language vocabulary learning, including participants with two different levels of proficiency (Experiment 1) and two different types of tasks with the same involvement load (Experiment 2).
Sixty-four participants were recruited for Experiment 1 and forty participants for Experiment 2. Following Hulstijn and Laufer (2001), three similar tasks were designed to operationalize different levels of involvement load (reading, gap-fill, and composition), and ten words served as targets. The difference from Hulstijn and Laufer’s (2001) study was that the time was the same for all three tasks. The results of Experiments 1 and 2 were in line with the Involvement Load Hypothesis, revealing that tasks with the same involvement load induce similar outcomes in vocabulary learning and that higher involvement loads result in greater vocabulary gains over time.

In Kim’s (2008) study, the Vocabulary Knowledge Scale (VKS) was used to measure participants’ short-term and long-term retention of target words. The VKS, developed by Wesche and Paribakht (1996), is a self-report scale containing five stages of differing degrees of knowledge. Wesche and Paribakht’s stages are listed below:

1. I don't remember having seen this word before.
2. I have seen this word before but I don't know what it means.
3. I have seen this word before and I think it means _____________________.
4. I know this word. It means _____________________.
5. I can use this word in a sentence. For example, _____________________________.

In the VKS, the five stages are considered a progression of word learning, based on the assumption that a word that was successfully produced (stage 5) has been learned better than a word that was recognized. However, the use of the VKS has been contested in recent years for several reasons (see Schmitt, 2010 for details).
Meara (1996) claimed that a single, unidimensional scale may not accurately represent the lexical development of the targeted words; Read (2000) noted that the increments between stages cannot be assumed to represent equal intervals; and Schmitt (2010) stated that the self-report data at stages 1 and 2 are not reliable and do not directly represent learners’ vocabulary knowledge. Schmitt (2010) admits, however, “no current scale gives a full account of the incremental path of mastery of a lexical item, and perhaps acquisition is too complex to be so described” (p. 224).

1.3 Understanding incidental vocabulary learning

Use of the notions incidental and intentional dates back to the early twentieth century (Hulstijn, 2003). Since the end of the century, these constructs have received increasing interest in the SLA field, specifically in the domain of vocabulary (e.g., Ellis, R., 1994; Gass, 1999; Godfroid et al., 2017; Huckin & Coady, 1999; Hulstijn, 2001, 2003; Laufer, 2005; Rieder, 2003; Schmitt, 2008, 2010; Pellicer-Sánchez & Schmitt, 2010). Despite their popularity, Hulstijn (2003) commented, “incidental learning has often been rather loosely interpreted in common terms, not firmly rooted in a particular theory” (p. 357). Interpretations of incidental vocabulary learning in the existing literature can be categorized in one of two ways: classroom-oriented and attention-oriented (Sok, 2014).

First, in the classroom-oriented interpretation, incidental and intentional learning are explained within the frame of classroom instruction. Incidental learning refers to the learning that occurs when the pedagogical purpose of instruction is not language, whereas intentional learning is described as the type of learning that is designed and intended to focus on the formal information being learnt.
Content-based instruction (CBI), the learning of language while studying content-matter subjects, is a good illustration of how incidental learning is viewed from the classroom-oriented perspective. Grabe and Stoller (1997) report theoretical and experimental support for CBI in second language acquisition research, stating that “language is best acquired incidentally through extensive exposure to comprehensible input in content-based classrooms” (p. 6).

The classroom-oriented distinction between incidental and intentional learning can be related to the two major types of form-focused instruction (FFI): Focus on Form (FonF) and Focus on Forms (FonFs). Although there is still debate on the definition and operationalization of the two instructional practices (Loewen, 2011), FonF is generally considered a teaching approach that puts the primary focus on communication and meaning, while occasional, incidental attention to linguistic forms is provided when the need arises. In contrast, FonFs refers to lessons in which language features are taught or practiced in isolation without contextual connections, which seems to coincide with the construct of intentional learning. Laufer (2006), for example, compared the effectiveness of FonF and FonFs approaches in vocabulary learning with 158 English learners in Israel. Participants in the FonF condition were invited to read a 165-word text and encouraged to use a bilingual dictionary when needed, while participants in the FonFs condition studied a list of target words with their meanings and explanations in English and completed two word-focused exercises. A surprise vocabulary test was administered to the participants in both groups, and their scores were subsequently analyzed by the researcher. Laufer found that the FonFs group scored significantly higher than the FonF group.
Based on her findings, Laufer claimed that FonFs is indispensable to vocabulary instruction due to the nature of lexical competence; it “has major importance in any learning context that cannot recreate the input conditions of first-language acquisition” (p. 162).

Another example of a study that investigated incidental learning from a pedagogically oriented perspective is Coll’s (2002) study of 40 low-intermediate English language learners in a hypermedia-assisted learning environment. The participants were exposed to a set of multimedia lessons, including chemistry-related video segments. Various comprehension tools (e.g., L1 translations of questions and answers, L2 video transcript, translation of transcript sentences, etc.) were provided to make sure that the participants learned the meaning of the words in a contextualized form. The researcher recorded and analyzed the participants’ actions to find out which tools were used more frequently, and administered pre- and post-treatment vocabulary tests to evaluate the learning gains. Coll concluded that hypermedia-based instruction can be an effective way to enhance learners’ retention of words when it is associated with word-related activities. Incidental learning was thus operationalized in the context of the study as follows: “vocabulary is learned incidentally when the learning focus is on listening comprehension training” (p. 266). Coll added that “vocabulary was not taught explicitly, but rather implicitly by providing the learner with verbal as well as visual input” (p. 268).

On the other hand, other researchers have taken an attention-oriented interpretation of incidental and intentional learning. This approach views incidental vocabulary learning as the absence of conscious intention (Barcroft, 2004), directly contrasted with intentional vocabulary learning, which refers to any activity geared toward committing lexical information to memory (Hulstijn, 2001, p. 271).
Thus, incidental learning is often described as learning which accrues as a “by-product” (Schmitt, 2010, p. 29) or as the unplanned picking-up of vocabulary within an activity where meaning is the primary focus. This psycholinguistic approach carries the underlying assumption that learners’ attention is drawn to meaning during incidental learning and to form during intentional learning.

One of the earliest studies that looked at incidental learning using an experimental setup was Saragi, Nation, and Meister (1978). In this study, 20 native English speakers were asked to read Anthony Burgess’s A Clockwork Orange, in which 241 Russian slang words (nadsat) were embedded. To keep the purpose of the experiment hidden, the researchers forewarned the participants of a comprehension and literary criticism test afterwards. Some questions were also presented to students while reading to ensure that the students read the text for comprehension. In other words, through the forewarning of posttests and the while-reading questions, the researchers directed learners’ intention and attention to the content of the story rather than to the individual vocabulary items. Still, results revealed an average of 76% learning gains of the 90 Russian slang words used in the novel. The surprisingly high learning gains, however, could not be reproduced in replication studies such as Pitts, White, and Krashen (1989) and Hulstijn (1992), where second language learners recorded less than 10% gains of new words. Horst, Cobb, and Meara (1998) explained that the case of native English speakers learning Russian words from an L1 context does not accurately represent the L2 learning condition.

Another study that subscribed to the attention-oriented definition of incidental learning was Waring and Takaki (2003).
Waring and Takaki were concerned with the rate at which learners learn and retain new words from reading a graded reader and with the effect of frequency of exposure on incidental vocabulary learning. As in Saragi et al.’s (1978) study, reading was the main activity for their participants, 15 university students in Japan. Participants were told to “read the story as usual and enjoy it” (p. 141), and were also informed that they would be tested after reading without being told what kind of test it would be. Three different types of vocabulary tests (word-form recognition, prompted meaning recognition, and unprompted meaning recognition) were conducted immediately after reading, and again one and three months later to measure retention. However, the researchers did not administer a comprehension test to confirm that participants’ attention was directed toward meaning, nor did they take any measure to prevent participants from paying deliberate attention to target words. In addition, using artificial words as targets, as Waring and Takaki did, may invite more attention because of the nonwords’ saliency. In turn, this may lead participants to expect that the following test would be vocabulary-related. These limitations weaken the authors’ claim that the learning gains stem from incidental learning through leisure reading, suggesting that it might be insufficient to simply assume that participants in an experimental setting would read as naturally as they do in real life.

Although the classroom-oriented and attention-oriented perspectives on incidental vocabulary learning are not identical, we can say that vocabulary learnt as a by-product of other activities implies learning without the intention to learn the lexical items. For example, a learner may pick up some words while reading a text for comprehension, and the learnt words can be regarded as a by-product.
However, it is difficult to investigate what the learner actually does when encountering new lexical items while reading. That is, the learner may have intentionally and voluntarily tried to infer the meanings of certain words while reading for pleasure, or may simply have paid attention to the unknown items without the intention to learn. In other words, it is unclear whether some degree of intentionality is involved in a supposedly incidental condition (Bruton, Garcia Lopez, & Esquiliche Mesa, 2011; Godfroid et al., 2017). In a similar vein, Huckin and Coady (1999) stated that incidental learning is not entirely incidental, as the learner must pay at least some attention to individual words (p. 190). According to Barcroft (2004), moreover, vocabulary learning is neither purely incidental nor purely intentional in a real-world context (p. 201), so incidental and intentional learning should be viewed as a continuum from highly incidental to highly intentional. Several researchers (e.g., Gass, 1999; Hulstijn, 2001, 2003) have also pointed out the lack of consensus over the constructs of incidental and intentional learning in terms of attention and awareness, coupled with an ill-informed understanding of the terms “incidental” and “intentional,” all of which may lead to misguided pedagogical implications (Hulstijn, 2001, p. 261). These controversies regarding the role of attention in incidental learning have led many researchers to prefer the method-oriented definition of incidental learning, which will be introduced and discussed in the next section.

1.4 Hulstijn’s methodological operationalization

Due to the lack of a finely established theory and of a satisfactory operationalization of intent and learning (Hulstijn, 2003), researchers in SLA have attempted to operationalize intentional and incidental learning experimentally.
On this view, intentional and incidental learning are distinguished simply on the basis of the presence or absence of an explicit instruction to learn (Hulstijn, 2003), assuming that forewarning of a posttest invites more intentional learning. The authors who adopted the classroom-oriented and attention-oriented interpretations in the previous section also designed incidental learning environments by conducting unexpected vocabulary tests after meaning-oriented activities such as reading or listening. However, the method-oriented definition is mainly concerned with the presence or absence of an instruction to learn the vocabulary. That is, there can be posttests, but the explicit information about the posttests is the only criterion for distinguishing the two types of learning within a study (Sok, 2014). Accordingly, whether learners’ attention is drawn to meaning or form is neither the major concern of nor an assumption behind this approach.

Barcroft (2009) and Peters, Hulstijn, Sercu, and Lutjeharms (2009) adopted this methodological operationalization of incidental and intentional learning. Barcroft (2009) compared incidental and intentional conditions in relation to the effects of synonym generation on L2 vocabulary learning. Spanish-speaking learners of English were asked to read an English text with 10 target words and their translations in parentheses next to each target word. Participants were divided into four groups according to whether an announcement of a pending vocabulary posttest was made before reading (i.e., intentional vs. incidental learning group) and whether participants were asked to write a synonym of the target words. Barcroft found that intentional learning yielded higher L2 word-form learning from reading than incidental learning. In addition, negative effects of synonym generation on L2 word-form learning were found in both incidental and intentional conditions.
Peters and her colleagues (2009) investigated how three techniques (vocabulary test announcement, task-induced word relevance, and vocabulary task) affected learners’ behavior of looking up words in an online dictionary and what the effects on subsequent word retention were. The three techniques were designed to enhance learners’ attention to lexical items a) by informing the learners about a pending vocabulary test to be given after the reading, b) by having them complete comprehension questions, and c) by giving them an additional vocabulary task. The results indicated that announcing the type of test to be taken after reading positively affected learners’ performance on the form recognition test but not on the meaning recall tests. More interestingly, however, the authors found stronger effects on overall vocabulary learning when target words were relevant to the comprehension questions. In other words, manipulating students’ attention through a test announcement does not appear to be as influential in vocabulary learning as increasing target words’ salience through external intervention. In another study, Peters (2006) examined whether learners’ performance is influenced by four different task instructions (announcement of a vocabulary test, combined with announcement of a comprehension test). Results revealed that participants approached the experiment similarly regardless of task instructions, which indicates that a test announcement by itself cannot control learners’ behavior or cognitive processing.

The problem of distinguishing between incidental and intentional learning is illustrated in Pellicer-Sánchez and Schmitt’s (2010) study. Although none of the participants in the study were informed of the post-reading vocabulary tests before reading a novel, the participant who received the highest score on the vocabulary tests reported that she had expected that knowledge of the foreign words would be examined after reading.
Consequently, she paid more attention to the target words by underlining unknown words in the novel and revisiting the words after finishing the reading. This anecdotal evidence shows that learners may actually intend to learn some words even when they are supposedly not induced to learn lexical items (Bruton, Garcia Lopez, & Esquiliche Mesa, 2011). Likewise, informing participants of a vocabulary test before reading does not necessarily lead learners to learn the target items during reading.

Considering the limitations of the methodological operationalization of incidental and intentional learning, Hulstijn (2001, 2003) commented that the usage of the terms should not be expanded to the understanding of learners’ attention but should instead be limited to explaining and discussing experimental procedures. The reason for this suggestion is that the concept underlying the distinction between incidental and intentional learning cannot yet be completely explained on a theoretical level, although the distinction is fairly straightforward in operational terms. Several earlier researchers asserted that the dichotomous distinction between incidental and intentional learning is not valid (Postman, 1964, as cited in Hulstijn, 2001), or that complete incidental learning in an absolute sense does not exist (McGeoch, 1942, as cited in Hulstijn, 2001), and thus that the distinction should be viewed with regard to the degree of attention (R. Ellis, 1994). These issues can now be addressed with eye-tracking methodology, which is a tool for measuring the amount of attention.

1.5 Understanding eye-tracking methodology

Eye-tracking refers to the recording of an individual’s eye movements. Eye movements have been a useful data source in cognitive research since the mid-seventies for investigating underlying processes during scene perception, reading, and visual search (Rayner, 1998, 2009).
Research adopting eye-tracking technology is based on the assumption that eye gaze (overt attention) provides information about cognitive processes (covert attention). According to this assumption of an eye-mind link (Reichle, Pollatsek, & Rayner, 2012), an individual’s cognition is the primary driver of when and where the eyes move (duration and location). In other words, increased processing demands are associated with longer processing times, and longer processing times are believed to be reflected in longer fixation durations or a larger number of fixations.

Eye movements provide several important advantages as a measure of reading behavior relative to measuring overall reading times. Specifically, monitoring eye movements yields multiple types of eye-movement data (e.g., fixations, saccades, regressions). Eye movements also reflect text features (e.g., word length, predictability) and individual reader differences (e.g., age, proficiency). However, the most favorable aspect of the eye-tracking method is that it produces “a good moment-to-moment indication of cognitive processes during reading” (Rayner, 2009, p. 1461).

To assess learners’ cognitive activities during performance of a certain language task, several online and offline methods have been used in SLA, including think-alouds, self-recording, note taking, and underlining. A well-known problem with these measures is reactivity (Bowles, 2010; Fox, Ericsson, & Best, 2011). Reactivity describes the phenomenon whereby research treatments or instruments alter participants’ performance or behavior. For example, in Sachs and Polio’s (2007) study, a think-aloud protocol was employed to examine learners’ attentional processes in relation to written feedback on an L2 writing revision task. The results showed that learners who were not required to make verbal reports produced significantly more accurate revisions than those who were instructed to speak their thoughts aloud while processing the feedback.
Sachs and Polio acknowledged that the think-alouds were reactive in their study, concluding that SLA researchers should implement and interpret think-aloud protocols with caution. Bowles (2010) conducted a meta-analysis of 12 reactivity studies and concluded that think-alouds have a very small effect on task performance, but that the effect may depend on tasks and subjects. Reviewing the reactivity issues in the SLA literature, Godfroid, Boers, and Housen (2013) proposed that the eye-tracking technique can provide a more sensitive measure of the amount and locus of attention during processing. Eye-tracking methodology is beneficial in reading research in that it makes it possible to investigate readers’ ongoing cognitive processing without significantly altering the original characteristics of either the task or the presentation of the stimuli (Dussias, 2010). Since eye movements occur naturally as a part of reading, recording eye movements does not alter the thought processes. Although forehead and chin rests, screen layout, and font size and type may influence the reading process in eye-tracking studies (Godfroid & Spino, 2015), researchers seem to be in agreement that it is “probably the closest experimental operationalization of natural reading” (Van Assche, Drieghe, Duyck, Welvaert, & Hartsuiker, 2011, p. 93).

1.6 Eye-tracking and second language vocabulary acquisition

Eye-movement recordings in second language research have gained popularity as techniques to investigate various aspects of SLA theories (for an overview, see Conklin & Pellicer-Sánchez, 2016; Dussias, 2010; Frenck-Mestre, 2005; Siyanova-Chanturia & Roberts, 2013; Winke, Godfroid, & Gass, 2013). Godfroid and Schmidtke (2013) and Godfroid, Boers, and Housen (2013) claimed that eye-movement registration is a valuable tool in helping researchers examine theoretical models of language learners’ minds and understand the cognitive processes of L2 development.
Since eye-tracking methodology has been integrated into SLA research quite recently, only a handful of studies have attempted to link eye-movement behavior and second language vocabulary acquisition. Godfroid, Housen, and Boers (2010) and Godfroid, Boers, and Housen (2013) introduced eye tracking to L2 vocabulary studies and provided evidence for the noticing hypothesis in relation to second language vocabulary. The authors investigated the role of attention in incidental vocabulary learning through four different input conditions. Participants’ eye movements were monitored as they read 20 English texts that included 12 target words, whereby contextual support for inferring the meanings of target words was manipulated within subjects. After the reading task, participants took a surprise vocabulary test to measure their learning. The authors found that participants fixated longer on novel words than on known words, regardless of whether the novel words were presented with appositive contextual cues. The results also support Schmidt’s noticing hypothesis in that more attention to a novel word led to better recognition of that word on the posttest. Mohamed (2017) later extended this finding to meaning recognition and meaning recall. In a similar vein, Godfroid and Schmidtke (2013) analyzed verbal reports from participants, showing that awareness (verbal reports) and attention (eye fixations) are closely related. Depending on the level of awareness, eye fixation durations were found to vary. Overall, there seems to be some evidence that increased fixation during later stages of processing (i.e., second pass time and total reading time) is a positive indicator of learning success.

Several studies have specifically looked at other factors affecting eye fixation times during reading, such as frequency, familiarity, predictability, word length, and part of speech (e.g., Elgort & Warren, 2014; Godfroid et al.,
2017; Mohamed, 2017; Pellicer-Sánchez, 2015; Waring & Takaki, 2003; Webb, 2007). Similar decreasing patterns in reading times across repeated exposures were observed by Pellicer-Sánchez (2015), who examined three components of vocabulary knowledge (i.e., form recognition, meaning recognition, and meaning recall) acquired incidentally from reading. In line with previous studies (Pellicer-Sánchez & Schmitt, 2010), receptive aspects of vocabulary knowledge were found to be easier to acquire than productive knowledge. Another major finding of the study was that L2 learners needed at least eight encounters to read unknown words in a fashion similar to how they read known words. Results also showed that participants who had longer reading times scored higher on the meaning recall test, again highlighting the important role of attention in vocabulary learning.

Regarding the role of frequency of exposure to target words in reading, Godfroid and colleagues (2017) extended the eye-movement investigation to a longer, more authentic reading passage and highlighted the role of word repetition in vocabulary learning. Thirty-five advanced English language learners and 19 native speakers of English read an English novel, A Thousand Splendid Suns, containing Dari words. After reading, participants performed a comprehension test and three surprise vocabulary tests. The number of exposures was found to be a predictor for all three types of vocabulary knowledge. The results further showed that test scores increased and processing time decreased as the readers encountered the targets more often in the text.

While there is sufficient evidence to demonstrate that frequency of exposure plays a beneficial role in vocabulary processing and acquisition, the effect of contextual clues seems to be inconclusive outside the realm of eye-tracking studies.
For example, Zahar, Cobb, and Spada (2001) and Schwanenflugel, Stahl, and McFalls (1997) stated that the role of contextual support in assisting vocabulary learning was unclear. On the other hand, Webb (2008) found that contextual richness positively influenced meaning-related word learning rather than form-related word learning. This view was further supported by Hu (2013), who argued that the quality of context supports form-meaning connection and grammatical features, whereas repetition is associated with the knowledge of form. Although there has been little agreement on the effect of context on learning, eye-tracking researchers have shown that high contextual predictability invites higher skipping rates and reduced processing time, as measured by different types of eye fixations (e.g., Calvo & Meseguer, 2002; Brysbaert, Drieghe, & Vitu, 2005; Drieghe, Brysbaert, Desmet, & De Baecke, 2004; Rayner, Ashby, Pollatsek, & Reichle, 2004).

In L2 vocabulary research, Mohamed (2017) recorded 42 English language learners’ eye movements as they read a graded reader, Goodbye Mr. Hollywood, containing 20 pseudowords and 20 known words. The targets varied in the number of occurrences and the level of predictability. After reading, participants were asked to answer comprehension questions and take vocabulary posttests. The results showed that words with rich context clues required less processing time. Mohamed also found evidence for the role of contextual support in meaning recognition and recall, which is consistent with Webb’s (2008) and Hu’s (2013) findings. Another noteworthy finding is that context predictability played a more important role in later encounters than in early encounters with target words. Mohamed (2017) explained that new, unknown words required more repetition to be recognized before participants were able to utilize the context clues to infer the meanings.
So far, no previous study has directly compared incidental and intentional learning using eye-tracking technology.

In the field of psychology, there is already a large volume of studies on the role of word length in fixation durations (e.g., Rayner, Sereno, & Raney, 1996; Schilling, Rayner, & Chumbley, 1998). Rayner (2009) simply stated, “As word length increases, the probability of fixating a word increases” (p. 1461). In the field of SLA, Godfroid and her colleagues (2017) included word length and part of speech as control variables. They found that word length negatively affected all types of vocabulary learning and that the meanings of nouns were learnt better than those of other words.

1.7 Conclusion

Incidental and intentional learning are the two key mechanisms through which language learners build up their lexical repertoire. As this review has shown, many previous scholars have investigated incidental and intentional vocabulary learning with regard to their effectiveness, although they have conceptualized and operationalized incidental and intentional learning in different ways. Moreover, the use of eye-tracking technology has been growing in recent years, but most studies have focused only on incidental learning. Consequently, the extent to which incidental and intentional learning differ in terms of the underlying cognitive processes remains unclear. Therefore, in the current study, by using eye-tracking methodology, I aim to answer the question of whether the distinction between intentional and incidental learning reflects a quantitative difference in reading time or a qualitative difference in the degree of learners’ intentionality.
CHAPTER 2: THE CURRENT STUDY

2.1 Overview of the research design

The current study is a between-subjects design with three conditions: no test announcement with time restriction (Group 1: Timed Incidental), test announcement without time restriction (Group 2: Untimed Intentional), and test announcement with time restriction (Group 3: Timed Intentional). Group 1 performed the reading task without instructions to learn lexical items and was then given three vocabulary tests with no prior announcement. Groups 2 and 3, on the other hand, were told in advance that they would be tested on vocabulary knowledge. However, Group 2 was told to complete the reading task at their own pace, whereas Groups 1 and 3 were asked to finish the reading in a limited time. The purpose of including the time restriction is to find out whether the beneficial effect of intentional learning results from the longer time learners spend on the task. If intentional and incidental learning differ not in the cognitive processes but in the amount of time allotted for the task, Group 3 and Group 1 would perform in a similar manner. While the independent variables are the presence or absence of a test announcement and of a time restriction, the dependent variables include test scores and participants’ eye fixations on target lexical items. See Table 1 for an overview of the three groups.

Table 1
Group descriptions

                     Group 1              Group 2                 Group 3
                     (Timed Incidental)   (Untimed Intentional)   (Timed Intentional)
Test announcement    No                   Yes                     Yes
Time restriction     Yes                  No                      Yes

2.2 Materials

2.2.1 Experimental text.

I adapted and modified a passage, “Smart Cars, Intelligent Highways,” from the textbook World Class Readings 3 Student Book: A Reading Skills Text (Rogers, 2005). I used two tools to examine whether the reading would be appropriate for the participants’ proficiency level: vocabulary profiles and readability.
A vocabulary profile indicates how large a vocabulary is needed to read a text and which words readers are unlikely to know, while a readability index reveals how complex a passage is to read based on sentence length and other factors.

First, I used Compleat Web Vocabulary Profiler, an online research tool developed by Tom Cobb at the University of Quebec at Montreal (http://www.lextutor.ca/vp/eng/). This tool breaks down a text into 25 frequency bands and provides the percentage of lexical coverage in each band, based on the Corpus of Contemporary American English (COCA: Davies, 2008) and the British National Corpus (BNC) (Nation, 2005). For example, the K-1 band includes the most frequent 1000 words of English and K-2 includes the second most frequent 1000 words of English (i.e., 1001-2000). Before entering the text into the tool, I re-categorized proper nouns and target words as high-frequency words for an accurate analysis. Overall, K-1 to K-2 words composed approximately 90% of the text, and that figure rose to approximately 94% when K-3 words were included (see Table 2). In other words, if a reader knew the first 3000 words of English, he or she would know approximately 94% of the running words in this passage. Previous research has reported that learners need to know at least 90 to 95% of the words in a text (Hu & Nation, 2000; Laufer, 1997; Stahl, 1999) to be able to infer and generate the meaning of unknown words. As the participants achieved 87.24% accuracy on average on the 3K level of the Levels Test, the lexical demand of the reading text appears to have been suitable for them. Table 3 reports the results of the Levels Test.

Table 2
Vocabulary Profiles for Experimental Text

Band      Number of tokens    Cumulative tokens (%)
K-1       890                 77.32
K-2       137                 89.22
K-3       59                  94.35
K-4       17                  95.83
K-5       16                  97.22
Others    29                  100
Total     1151
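As an illustration of how the cumulative coverage figures in Table 2 follow from the token counts, the short Python sketch below recomputes them. The names `band_tokens` and `cumulative_coverage` are illustrative assumptions and are not part of the Compleat Web Vocabulary Profiler; the counts and the total of 1151 tokens are copied from Table 2.

```python
from itertools import accumulate

# Token counts per frequency band, copied from Table 2.
# The dictionary and function names are hypothetical, chosen for illustration.
band_tokens = {"K-1": 890, "K-2": 137, "K-3": 59, "K-4": 17, "K-5": 16}
total_tokens = 1151  # total number of tokens reported in Table 2


def cumulative_coverage(counts, total):
    """Map each band to the cumulative percentage of tokens covered so far."""
    running_totals = accumulate(counts.values())
    return {band: 100 * n / total for band, n in zip(counts, running_totals)}


coverage = cumulative_coverage(band_tokens, total_tokens)
for band, pct in coverage.items():
    print(f"{band}: {pct:.2f}%")
```

Under these assumptions, the recomputed values should agree with the cumulative percentages in Table 2 to within rounding in the second decimal place, which is the basis for the statement that knowledge of the first 3000 words covers roughly 94% of the passage.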
Table 3
New Vocabulary Levels Test Results

Part      Mean % (SD)
Part 1    96.06 (0.75)
Part 2    90.08 (1.40)
Part 3    87.24 (1.81)
Part 4    72.57 (2.19)
Part 5    59.62 (3.32)
Part 6    72.32 (5.53)

Second, to estimate the degree of difficulty in reading the text, I adopted three readability measures: the Flesch-Kincaid Grade Level, the SMOG Index, and the Flesch Reading Ease. These traditional readability formulas are based on word and sentence lengths and approximate the age or number of years of education needed to understand the text. According to the Flesch-Kincaid Grade Level and the SMOG Index, approximately nine to ten years of education in the United States is required to be able to read and comprehend the experimental text without difficulty. Next, in the Flesch Reading Ease test, possible scores range from 100 (indicating the easiest) to 0 (indicating the most difficult). The experimental reading passage scored 54.7 on this test, denoting that it can be considered comprehensible for 10th to 12th grade students. Thus, the text seems suitable for high-intermediate learners of English.

Table 4
Readability Assessment of Experimental Text

Measure                               Score
Flesch-Kincaid Grade Level (0–12)     9.9
The SMOG Index                        9.3
Flesch Reading Ease (0–100)           54.7

2.2.2 Target words.

To ensure lack of previous knowledge of the target items, 12 words in the text were replaced with low-frequency words. These words were composed of three nouns, five verbs, and four adjectives. Each word occurred once in the passage.

Table 5
Target Words

Part of speech    Target word     Definition        Frequency level
nouns             gizmo           device            K-13
                  fatality        death             K-4
                  calamity        accident          K-9
adjectives        perilous        dangerous         K-5
                  incessant       constant          K-8
                  bewildering     distracting       K-5
                  staggering      overwhelming      K-4
verbs             apprise         inform            K-14
                  decipher        figure out        K-7
                  succumb         die               K-6
                  chauffeur       drive             K-8
                  sip             drink a little    K-4

All of the target words belonged to the range of K-4 to K-14 level based on Compleat Web Vocabulary Profiler (Cobb, 2013).
Table 5 displays the list of the target words and meanings.

2.2.3 Language background questionnaire.

A paper-based background questionnaire was prepared for the participants. The questionnaire elicited basic information about participants’ gender, age, year in college, major, English language use, and English/second language learning experience. In addition, participants provided standardized test scores (e.g., TOEFL, IELTS, and/or TOEIC scores) and self-rated their speaking, writing, reading, and listening proficiency levels (see Appendix B).

2.2.4 Prescreening vocabulary test.

As a prescreening measure, I adopted the New Vocabulary Levels Test (NVLT), developed by McLean and Kramer (2015): www.lextutor.ca/tests/. The NVLT is a diagnostic tool for measuring learners’ receptive knowledge of the most frequent 5,000 word families. The test is composed of five 24-item parts in a multiple-choice format, one part each representing the 1000-, 2000-, 3000-, 4000-, and 5000-word level. Additionally, the sixth part includes thirty items to measure academic word knowledge. The prescreening vocabulary test served two purposes: 1) to measure participants’ vocabulary knowledge and 2) to control for participants’ pre-existing knowledge of the target words. For the former, I used the first three parts of the NVLT. For the latter, I randomly added the 12 target words to Part 6 of the NVLT, resulting in the 30 original items and 12 newly added items. The total score on the first five parts of the NVLT was used to ensure participants in each group had similar amounts of vocabulary knowledge. In sum, the final version of the prescreening vocabulary test comprised 174 items in total, 24 items for the first five parts and 54 items for the last part.

2.2.5 Reading proficiency test.

A reading proficiency test was administered to control for a possible effect of reading ability on participants’ learning in the main experiment.
I used the 2013 Sample Test Materials of the Examination for the Certificate of Competency in English (ECCE) developed by Cambridge Michigan Language Assessments (CAMLA). A standardized test for high-intermediate English language learners, the ECCE is divided into four sections: speaking, listening, GVR (grammar/vocabulary/reading), and writing. For the purpose of the current study and because of time limitations, participants took Part 1 of the reading section only. It included two reading passages, each followed by five multiple-choice comprehension check questions. One point was assigned for correct answers, and zero points were given for incorrect answers. The cut-off score for inclusion in the study was set at six points. The reliability coefficient (Cronbach’s α) obtained for all participants was .73 for the 10 items. This was considered to indicate acceptable test reliability (Field, 2013).

2.2.6 Comprehension test.

A set of paper-based comprehension questions was developed to ensure that participants actually read the text and to measure participants’ understanding of the text. The test included 10 statements with three possible answers: true, false, and I don’t know. I piloted the initial version on ten native speakers and five advanced non-native speakers of English. Based on their performance and opinions, some items were revised. One point was assigned for correct answers, and zero points were given for incorrect and “I don’t know” answers. See Appendix E for a copy of the comprehension tests.

2.2.7 Vocabulary tests.

In this study, learning was operationalized as the ability to recognize and produce the form and meaning of target words. To measure the different aspects of participants’ vocabulary learning, three vocabulary tests were designed and administered in the following order: form recognition, meaning recall, meaning recognition (see Appendices X, Y, and Z for each test).
All three tests assessed participants’ knowledge of the 12 target words, so the maximum score on each test was 12. Items on each test were presented in a random order.

Form recognition test. A form recognition test assessed participants’ ability to recognize the target words. The test contained 12 items, each with five options: one target word, three distractors, and “I don’t know.” The distractors were selected randomly from low-frequency words. Participants were asked to circle one word they remembered seeing in the reading for each item and were encouraged not to guess. Participants earned one point for correct answers and zero points for incorrect and “I don’t know” answers. The reliability coefficient of the test was good, α = .79.

[Example] Circle words you saw in the reading. If you do not know the answer, do not guess. There is a penalty for wrong answers.
(a) veer (b) distend (c) cardinal (d) sip (e) I don’t know.

Meaning recall test. A translation-type meaning recall test measured the learners’ ability to recall the meanings of the 12 target words. Participants were asked to write down anything they remembered about the meaning of each word. When asked, I allowed them to write in their native language. Participants earned one point for the correct meanings, close synonyms, and related words and zero points for irrelevant answers or blanks. Half marks were not allocated. Spelling and grammar mistakes in responses were ignored because the test was intended to measure knowledge of meanings of words. The reliability coefficient of the test was good, α = .72.

[Example] For each word, write down anything you can remember about its meaning.
sip ________________________

Meaning recognition test. A meaning recognition test was administered to examine participants’ ability to identify the meanings of target items. The test is regarded as easier than the meaning recall test because it is receptive in nature.
Participants were instructed to circle one of five possible options: one correct answer, three possible but incorrect choices, and “I don’t know.” The three distractors were selected to match the target words in terms of the part of speech. For example, if the target word was a concrete noun, all three distractors were concrete nouns as well. Distractors were also chosen to not be close in meaning to the correct answer so that partial knowledge could be demonstrated. I awarded one point for correct answers and zero points if the correct meaning was not circled. The reliability coefficient of the test was high, α = .81.

[Example] Circle the correct meaning of each given word. If you do not know the meaning, please circle “I don’t know”.
sip (1) brew (2) order (3) serve (4) drink (5) I don’t know

2.3 Participants

Participants in the study were non-native speakers of English and came from two sources within a large Midwestern university: 47 were high-intermediate learners of English enrolled in Level 4 or 5 classes of a five-level pre-university English program, where they took IEP or EAP classes, separately. Thirty-three participants were regularly matriculated students. From an original pool of 80 potential participants, 36 participants were excluded for the following reasons: (1) having recognized three or more target words during the prescreening vocabulary test (n = 2), (2) failing to achieve an overall accuracy of 60% or more in the reading proficiency test (n = 2), (3) failing to attend the second session of the experiment (n = 1), (4) having technical errors during the experiment, including the unsuccessful calibration of the eye-tracker and the unexpected shutdown of the eye-tracking program (n = 5), (5) having produced inaccurate eye-tracking data (n = 26).
The final sample of 44 participants (28 females and 16 males) consisted mostly of freshmen (n = 39) with a few sophomores (n = 4) and a single graduate student (n = 1), pursuing 32 different academic specializations, e.g., music, advertising, business administration, computer science, and engineering. A majority of the participants were from China (n = 34), two came from Saudi Arabia, and eight from other parts of the world, which reflects the population of the intensive English program at the university where the data were collected. Their ages ranged from 17 to 30 (M = 19.43, SD = 2.43). Based on their self-rated English proficiency level on a 5-point scale, they were more confident in their reading (M = 3.62, SD = 0.78) and listening skills (M = 3.38, SD = 0.76) than they were in speaking (M = 2.67, SD = 0.57) and writing (M = 3.16, SD = 2.67).

2.4 Procedure

2.4.1 Apparatus.

The eye-tracking reading task was programmed using Experiment Builder and performed through EyeLink 1000, a desk-mounted eyetracker (SR Research Ltd., http://www.sr-research.com/). The eyetracker sampled gaze data 1,000 times per second. However, fixations below 120 milliseconds were eliminated from analysis because such short fixations are considered not to reflect cognitive processing of words (Ashby, Rayner, & Clifton, 2005; Reichle, Rayner, & Pollatsek, 2003). The full experiment consisted of 5 screens for practice texts, 17 screens for the main text, and 8 screens for instructions and a break page. For the text, each screen contained an average of 66.88 words (SD = 10.31) and included between five and ten lines of text. The entire text was presented in regular Consolas font, size 18, double-spaced. The position of target words on screens was controlled so that no target word appeared in the first or last line or at the left or right edge of the screen.
Participants were seated in front of a computer monitor with their head placed against a chin and forehead rest to ensure the highest levels of accuracy and spatial resolution. Participants used the spacebar on the keyboard to move from one screen to the next. Participants were not allowed to go back to previous screens while reading. The eye tracker was calibrated at the beginning of the experiment and after the return from breaks, and drift correction was performed at the beginning of each screen.

2.4.2 Pretests.

The first session took place a week prior to the main reading experiment. In this session, two to three participants came to the office together. After signing the consent form and completing the background questionnaire, they took the New Vocabulary Levels Test (NVLT) and the reading proficiency test as prescreening measures. Participants who met the following three conditions qualified to continue to the next session of the study: (a) achieving 90% or higher accuracy in Part 1 and Part 2 of the NVLT, (b) recognizing two or fewer target words in Part 6 of the NVLT, and (c) achieving an overall accuracy of 60% or higher on the reading proficiency test. The first session took about an hour for the participants to complete. Test takers who did not meet the eligibility criteria were given $10 for their time.

2.4.3 Reading experiment.

In the second session, participants came to the eye-tracking laboratory individually and completed the main reading experiment and posttests. I started by giving directions on the eye-tracking experiment procedure and told all the participants they would answer 10 questions to check their understanding of the story after reading. Depending on their randomly assigned group, participants received different instructions on time restrictions and vocabulary tests.
Participants in Group 1 (Timed Incidental) were asked to read each slide in 30 seconds without any announcement of vocabulary tests at the end. Instead, they were asked to comprehend the reading in order to be able to answer comprehension check questions. On the other hand, participants in Group 2 (Untimed Intentional) were instructed to read at their own pace and encouraged to learn some unknown words because vocabulary tests would follow; however, instructions did not specify on which words they would be tested. Participants in Group 3 (Timed Intentional) were also forewarned of vocabulary tests after reading, but they were told to finish reading each slide in 30 seconds. I also told Groups 1 and 3 that slides were programmed to transition to the next slide automatically after 30 seconds although, in reality, each slide remained on screen until participants hit the space bar to advance to the next page. I adopted perceived time pressure rather than a real time limit to avoid losing data from slow readers. Because reading speed varies across individuals, a real 30-second limit might not have been enough for slower readers to complete the reading task. Pilot test results showed that an average of 28.83 seconds was spent on each slide, suggesting that 30 seconds was sufficient time for a typical reader.

Participants were calibrated with a standard nine-point grid for the right eye. Once the calibration was successful, they were instructed to fixate on a dot in the upper left corner of the monitor to start reading. If the eye tracker identified a fixation on the fixation spot, the reading text appeared. This procedure is called drift correction, and it took place at the beginning of each screen. After reading five slides for practice, they read 17 slides for the reading passage on the screen while their right eye was being tracked. The break page was inserted after seven slides of the main text.
Because vocabulary research indicates participants generally need multiple exposures to words to learn them, I had the participants re-read the main text. Again, they had a short break after the seventh slide. The reading experiment took a maximum of 30 minutes.

2.4.4 Posttests.

Immediately after participants completed the eye-tracking reading task, they took the comprehension tests first and the three vocabulary tests afterward. To minimize the transfer effect from preceding vocabulary tests, the participants took the vocabulary tests in the following order: the form recognition test, the meaning recall test, and the meaning recognition test. It took an average of 10 to 15 minutes for participants to finish the posttests. Participants received $25 for attending both Sessions 1 and 2. Table 6 illustrates the procedure of the study.

Table 6
Procedure of the Study

Session 1                                 Session 2
1. Consent form                           1. Reading
2. Background questionnaire               2. Re-reading
3. New Vocabulary Levels Test (NVLT)      3. Comprehension test
4. Reading proficiency test               4. Three vocabulary tests

2.4.5 Word predictability ratings.

Fifty-one native English speakers who did not participate in the main experiment performed a cloze predictability task to assess the degree of difficulty in guessing the meanings of the targets from context. They were provided with the reading passage ‘Smart Cars, Intelligent Highways’ with the target words deleted. On a separate sheet of paper, the raters were then asked to supply as many words as possible to fill in each blank and rate each case on a 5-point scale ranging from very easy to guess (1) to very difficult to guess (5). All raters were undergraduate students at Michigan State University. I calculated the percentage of correct answers to each item and used the percentage as a continuous variable in the analysis. If one of the supplied answers was correct, the answer was counted as correct.
For example, for the target word ‘incessant’, ‘never-ending’ and ‘constant’ were graded as correct, but ‘busy’ and ‘terrible’ were graded as incorrect. Semantically, syntactically, and contextually appropriate words were regarded as correct answers and spelling mistakes were ignored. The mean rating for each target word was also calculated; however, to avoid any issues of collinearity in the model, I excluded the rating variable from the analysis. The predictability task took about 30 to 40 minutes to complete and raters received $10 for their participation.

2.5 Analysis

2.5.1 Definition of variables.

Term (Acronym): Definition
Test Announcement (TA): Whether the participant received an announcement of vocabulary posttests prior to reading
Time Limit (TL): Whether the participants were told that a time limit was set for reading
Word Length (WL): The number of letters in a word
Predictability (PD): The correct answers expressed as a proportion of the total number of responses for an item on the cloze predictability task
Part of Speech (PoS): Whether the word is a verb or not; 1 for verbs and 0 for non-verbs (i.e., nouns and adjectives)
Total Reading Time (TRT): Summation of the duration across all fixations on the target word and across the two reading sessions
Summed Total Reading Time (STRT): Total Reading Time (TRT) aggregated by participant
DOE: The difference between observed total reading time and expected eye-fixation durations on the target word
Fixation count: The number of overall fixations
Form Recognition (FoReco): Whether the subject correctly recognized the form of words
Meaning Recognition (MeReco): Whether the subject correctly recognized the meaning of words
Meaning Recall (MeReca): Whether the subject correctly recalled the meaning of words

2.5.2 Eye-tracking data preparation.

The purpose of cleaning the data was to scan for unusual events and deal with those events in an appropriate manner.
I first filtered out fixations shorter than 120 milliseconds as these fixations are less likely to be associated with readers’ cognitive processes (Ashby, Rayner, & Clifton, 2005; Reichle, Rayner, & Pollatsek, 2003). It is also common practice to exclude fixations longer than 800 milliseconds. However, considering that participants were English language learners and some of them were aware of the vocabulary tests that would follow the reading, I did not remove the long fixations, as longer fixations could have been made intentionally. I also manually inspected each trial fixation by fixation, looking for inconsistencies in the data. For example, when fixations were off the line of text, I moved the fixations either up or down depending on which line the fixation was intended for. A total of 80 data files were collected and 61 data files were used in the main analysis with 19 data files excluded for various reasons (see section 2.3).

There are many eye-movement measures including first fixation duration, first-pass reading time, regression path duration, and total reading time. In the current study, I used two eye-tracking measurements: total reading time (TRT) and the difference between observed and expected fixation duration (DOE). First, total reading time, the sum of all fixations on the target word, indicates how much total time a reader spent in the region during the entire course of reading. Total reading time is considered a late measure that reflects late cognitive processes such as text comprehension and information reanalysis (Roberts & Siyanova-Chanturia, 2013). Considering that the predictor variable of the current study, Test Announcement, would primarily affect late eye-movement measures, and that the aim of the study is to uncover the associations with vocabulary learning, I concluded that total reading time is suitable to represent the amount of attention paid to the targets in the current study.
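The filtering and summation described above can be sketched in a few lines. This is a minimal illustration and not the script used in the study: the function name `total_reading_time`, the `(interest_area_id, duration_ms)` record layout, and the example durations are all hypothetical. Only the 120 ms lower threshold and the decision to keep long fixations come from the text.

```python
MIN_FIXATION_MS = 120  # lower threshold reported in the text; no upper cut-off is applied


def total_reading_time(fixations, target_id, min_ms=MIN_FIXATION_MS):
    """Sum the durations (ms) of all retained fixations on one target word.

    `fixations` is a hypothetical list of (interest_area_id, duration_ms)
    pairs pooled across both readings of the text.
    """
    return sum(
        duration
        for area, duration in fixations
        if area == target_id and duration >= min_ms
    )


# Hypothetical fixation record for one participant
fixations = [("sip", 95), ("sip", 240), ("the", 180), ("sip", 310), ("sip", 1020)]
trt = total_reading_time(fixations, "sip")  # 240 + 310 + 1020; the 95 ms fixation is dropped
```

Note that the 1020 ms fixation is retained, mirroring the choice not to trim fixations longer than 800 milliseconds.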
Second, the difference between observed and expected fixation duration (DOE) was calculated following Indrarathne and Kormos (2016, 2017). The procedure for obtaining the DOE value is as follows:

1) Extract the total reading time for the whole page for each participant by summing up all fixation durations on all words within the page.

2) Calculate the expected fixation duration based on the proportion of the number of syllables that the target word has in relation to the number of syllables on the whole page where the particular target word occurs. The expected fixation duration of a target word for a participant is

\[
\text{Expected fixation duration} = \frac{\text{No. of syllables of target word} \times \text{total reading time of the whole page}}{\text{No. of syllables of the whole page}}
\]

3) Subtract the expected fixation duration from the observed total reading time for each target word for each participant.

The difference between the observed and expected fixation durations (DOE) is regarded as an indication of noticing because it measures “extra attentional processing load” (Indrarathne & Kormos, 2016, p. 6) of target words.

2.5.3. Data structure.

The data have a hierarchically clustered structure whereby target words were nested within subjects and each subject provided multiple observations. That is, repeated observations were made on the same individual. Specifically, 61 participants provided eye-fixation data for 12 target words each and provided three types of vocabulary test results for each word. An important feature of the data is that predictors reside at different levels of the data structure. Word Length, Predictability, and Part of Speech are measured at Level 1 because they are characteristics of target words, while Test Announcement and Time Limit are measured at Level 2 because they are the treatments given to subjects. Therefore, the number of data points at Level 1 is 732 (61 participants × 12 target words) and the sample size at Level 2 is 61. Figure 1 shows an example of the data structure of the current study.
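The DOE procedure in section 2.5.2 reduces to one proportion and one subtraction per target word and participant. The sketch below is an illustration under stated assumptions, not the computation used in the study: the function name `expected_and_doe` and all example values (syllable counts and reading times) are hypothetical.

```python
def expected_and_doe(target_syllables, page_syllables, page_trt_ms, observed_trt_ms):
    """Return (expected fixation duration, DOE) in milliseconds for one target word."""
    if page_syllables <= 0:
        raise ValueError("page_syllables must be positive")
    # Step 2: the target's share of the page reading time, proportional to its syllables
    expected = target_syllables * page_trt_ms / page_syllables
    # Step 3: observed total reading time minus the expected duration
    doe = observed_trt_ms - expected
    return expected, doe


# Hypothetical case: a 3-syllable target on a 100-syllable page read in 25 seconds,
# with 1200 ms of observed total reading time on the target.
expected, doe = expected_and_doe(3, 100, 25000.0, 1200.0)
```

A positive DOE means the reader spent more time on the target than its length alone would predict; a negative value means less.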
Figure 1 Data structure.

2.5.4 Multivariate Multilevel Mediation Analysis (MMMA).

2.5.4.1 Path analysis.

For the current study, I adopted a path analysis, which represents a special case of structural equation modeling (SEM) (Marcoulides & Schumacker, 1996). Both path analysis and SEM are extensions of multiple regression that estimate the relationships among variables and can accommodate nested data with predictors at different levels and multiple outcomes. In addition, both analyses are useful ways to examine how the effect of an independent variable (X) on an outcome (Y) is mediated through an intervening variable (M). However, path analysis describes the relationships between observed or directly measured variables, while SEM deals with latent or unobserved constructs. An observed variable is measurable, such as a test score or time on task, whereas a latent variable is a variable that cannot be measured directly, for instance, motivation or attitude. Thus, latent variables are inferred indirectly from the variances and covariances in a set of observed variables. Although a general case of a mediation analysis with multiple outcomes and/or multiple mediators is commonly undertaken within an SEM framework, the statistical approach used in the current study is more closely related to path analysis because it does not involve a latent variable measurement model.

2.5.4.2. Mediation analysis.

As diagrammed in Figure 2, mediation occurs through an added variable that affects the causal relationship of X to Y, describing the mechanism of how the predictor (X) causes the mediator (M), and the mediator (M) causes the outcome (Y). Rectangles indicate observed variables, and each straight line represents a causal relation with an arrowhead at one end, pointing from predictor to outcome. Mediation analysis distinguishes three types of effects: direct, indirect, and total effect.
The direct effect refers to the influence the predictor variable has directly on the outcome variable, the indirect effect refers to the pathway from the predictor to the outcome through the mediator, and the total effect denotes the aggregated effect of the direct and indirect effects on the outcome. In Figure 2, the paths a and b represent the indirect effect of X on Y, the path c’ represents the direct effect of X on Y, and c = ab + c’ is the total effect of X on Y. The diagram on the top represents the total effect of X on Y and the diagram on the bottom shows the mediated effect of X on Y through M.

Figure 2 Path diagram for a basic single-mediator model.

The regression equations of these two diagrams are as follows:

\[
Y = i_1 + cX + e_1
\]
\[
M = i_2 + aX + e_2
\]
\[
Y = i_3 + c'X + bM + e_3
\]

Coefficient c denotes the total effect of X on Y, coefficient a represents the effect of X on M, coefficient b quantifies the effect of M on Y adjusted for X, and coefficient c’ is the direct effect of X on Y that is not transmitted through M. $i_1$, $i_2$, and $i_3$ are the intercepts and $e_1$, $e_2$, and $e_3$ are the residuals.

Mediation can either be partial or complete. Figure 3 shows the diagrams for a partial and a full mediation model. Partial mediation is the case in which an independent variable has both direct and indirect effects on a dependent variable. If Test Announcement has a direct significant impact on the test scores and it also has a significant impact on Total Reading Time, which has a significant impact on the test scores, this is known as a case of partial mediation. Complete or full mediation is the case in which the total effect of an independent variable on a dependent variable is transmitted through mediators. If Test Announcement does not have a direct impact on test scores, but it has a significant effect on Total Reading Time, which also has a significant impact on test scores, this is known as a case of full mediation.
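The relation c = ab + c’ can be checked numerically for the simplest case. The sketch below assumes continuous variables and ordinary least squares, where the identity holds exactly; it is not the logistic multilevel model fitted in this study, for which the decomposition is only approximate. The function names and the eight data points are hypothetical.

```python
def centered_cross(u, v):
    """Sum of cross-products of deviations from the means of u and v."""
    mean_u, mean_v = sum(u) / len(u), sum(v) / len(v)
    return sum((ui - mean_u) * (vi - mean_v) for ui, vi in zip(u, v))


def single_mediator_paths(x, m, y):
    """Closed-form OLS estimates for the three single-mediator regressions."""
    sxx, smm = centered_cross(x, x), centered_cross(m, m)
    sxm, sxy, smy = centered_cross(x, m), centered_cross(x, y), centered_cross(m, y)
    det = sxx * smm - sxm ** 2  # nonzero as long as X and M are not collinear
    return {
        "c": sxy / sxx,                              # Y on X: total effect
        "a": sxm / sxx,                              # M on X
        "b": (smy * sxx - sxy * sxm) / det,          # Y on M, adjusted for X
        "c_prime": (sxy * smm - smy * sxm) / det,    # Y on X, adjusted for M
    }


# Hypothetical data: x is a 0/1 treatment, m a mediator, y a continuous outcome
x = [0, 0, 0, 0, 1, 1, 1, 1]
m = [2.0, 3.0, 2.5, 3.5, 4.0, 5.0, 4.5, 6.0]
y = [1.0, 2.0, 1.5, 2.0, 3.0, 3.5, 3.0, 4.5]

paths = single_mediator_paths(x, m, y)
indirect = paths["a"] * paths["b"]
total_from_parts = indirect + paths["c_prime"]  # equals paths["c"] up to rounding
```

In this toy example, a direct effect `c_prime` near zero alongside a nonzero `indirect` would correspond to the full-mediation pattern; a clearly nonzero `c_prime` would correspond to partial mediation.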
In the case of a full mediation, the mediator fully explains the association between the predictor and the outcome.

Figure 3 Path diagrams for a partial and full mediation model.

A mediation analysis can also be done through ordinary least squares (OLS) and logistic regression. However, Bryan, Schmiege, and Broaddus (2007, p. 366) summarized the benefits of path analysis using the SEM framework for testing questions of mediation, compared with the OLS and logistic regression approach:

(1) testing of direct, indirect, and total effects simultaneously
(2) testing of complicated mediation models with multiple mediators and/or dependent variables
(3) testing of particular indirect effects within the mediation models
(4) the ease of correcting for missing data and non-normality in data

Considering that the current study includes five independent variables, three dependent outcomes, multilevel data with some missing values, and a non-normal distribution for several variables, I used path analysis within the SEM framework to answer the questions related to mediation.

2.5.4.2.1. Multilevel Mediation Analysis.

In multilevel modeling, different types of models exist depending on the data structure. When all variables are measured at Level 1, the model is referred to as a 1-1-1 mediation model (Krull & MacKinnon, 1999, 2001). Models in which the predictor and the mediator are assessed at Level 2 and outcome variables are assessed at Level 1 are called 2-2-1 mediation models. For example, one could hypothesize that the relationship between classmates’ language skill (a Level 2 predictor) and individuals’ language skill (a Level 1 outcome) is mediated by the effect of classroom quality (Level 2 mediator). In the 2-1-1 mediation model, a Level 2 predictor influences a Level 1 mediator, which then affects a Level 1 outcome.
An example of this mediation would be that instructional practices (Level 2 predictor) affect individuals’ motivation (Level 1 mediator), which in turn affects learning outcomes (Level 1 outcome).

In the current study, total reading time denotes a mediator, learning gains from three vocabulary tests serve as outcome variables, and Test Announcement, Time Limit, Word Length, Predictability, and Part of Speech are predictors. That is, it is hypothesized that Test Announcement, Time Limit, and three lexical factors (X) predict processing time (M), which in turn affects learning of unknown words (Y). One of the complications of the study is that predictors lie at different levels, as described in the previous part. More specifically, Test Announcement and Time Limit are measured at Level 2 while Word Length, Predictability, and Part of Speech are measured at Level 1. Therefore, I merged the 2-1-1 and 1-1-1 mediation models to accommodate all predictors at two levels in the present study. In addition, the existence of the 1-1 linkage in the model invites special attention because the between-subjects effect and the within-subjects effect of the mediator (Total Reading Time and DOE) on the outcome variable need to be examined separately, according to Preacher, Zyphur, and Zhang (2010).

In the current study, for instance, Test Announcement is the Level 2 independent variable, with one group receiving the announcement and another not receiving the announcement. Total Reading Time and DOE are the Level 1 mediators and learning outcomes are the Level 1 dependent variables. In this case, Test Announcement only varies between groups, whereas both Total Reading Time (or DOE) and test scores vary both within and between groups.
That is, target words differ from each other within a person in their eye fixations and learning outcomes (within-subjects effect), and there are differences between persons in eye fixations and learning outcomes (between-subjects effect). When one estimates the influence of Test Announcement on eye fixations, Test Announcement influences individual target words but does so for the person as a whole, making the effect of Test Announcement a between-subjects effect. Because Test Announcement was provided to the person without differential application across target words within the person, it cannot account for within-person differences of any kind. This is not to say that Test Announcement has no impact on Level 1 vocabulary learning. It does, but only because each target word belongs to a participant who either did or did not receive the announcement of the upcoming posttests. For the same reason, Test Announcement can affect vocabulary gains only at the level of the person. Test Announcement cannot account for differences within a participant in reading patterns and vocabulary learning, because it was applied equally to all target words within a participant. Therefore, the indirect effect of Test Announcement on vocabulary learning through Total Reading Time may function only through the between-group variance in the mediator (Total Reading Time) and the dependent variables (learning gains).
The idea is supported by Preacher, Zyphur, and Zhang (2010), who state that in a mediation model for 2-1-1 data, “when the b effect (the effect of the mediator to the outcome variable) estimate conflates the Within and Between effects, the indirect effect that necessarily operates between groups is confounded with the within-group portion of the conflated b effect” (p. 211).

In order to disentangle between-subjects and within-subjects effects, I adopted a strategy called unconflated multilevel modeling (UMM) (Hedeker & Gibbons, 2006; MacKinnon, 2008; Preacher, Zhang, & Zyphur, 2011; Preacher, Zyphur, & Zhang, 2010). Following UMM, I replaced the Level 1 mediator with two mediators: a group-mean centered total reading time (i.e., deviations from group means) at the within-subjects level, and the group mean of total reading time at the between-subjects level. Here, group-mean centering subtracts the individual’s group mean of total reading time from an individual’s total reading time for each target word. I followed the same procedure for the DOE value. In this way, the within- and between-subjects effects of the model are no longer conflated, because they are not combined into a single estimate (Preacher, Zyphur, & Zhang, 2010). However, the main drawback of this approach is that using the group mean as a proxy for the between-subjects effect introduces bias in the between-subjects effect for the predictor, which in turn also contributes to biased indirect effects at the between level. To solve this problem, Preacher et al. (2010) suggest using a multilevel structural equation modeling (MSEM) approach in investigating multilevel mediation. This approach allows for separate estimation of the within- and between-subjects components of the model, so the direct and indirect effects at each level can be examined separately.
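The group-mean centering step of UMM can be sketched as follows. This is an illustration only: it assumes that the "group" in group-mean centering is the participant (the cluster in which target words are nested), and the function name `split_within_between`, the participant labels, and the reading times are hypothetical.

```python
from collections import defaultdict


def split_within_between(records):
    """Split a Level 1 mediator into within- and between-subjects components.

    `records` is a list of (participant_id, trt_ms) pairs, one per target word.
    Returns a list of (participant_id, within, between), where `between` is the
    participant's mean and `within` is the deviation from that mean.
    """
    by_person = defaultdict(list)
    for pid, trt in records:
        by_person[pid].append(trt)
    means = {pid: sum(values) / len(values) for pid, values in by_person.items()}
    return [(pid, trt - means[pid], means[pid]) for pid, trt in records]


# Hypothetical total reading times (ms) for two participants, two target words each
records = [("p1", 400.0), ("p1", 600.0), ("p2", 900.0), ("p2", 1100.0)]
components = split_within_between(records)
```

By construction, the within component sums to zero for every participant, so it carries no between-subjects information, while the between component is constant within a participant.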
Although MSEM is the most advanced approach to mediation in nested data, further studies are needed to show empirically that the MSEM method is superior in decreasing bias in indirect effects and in estimating those effects in an absolute sense (Preacher et al., 2011). I first attempted to run the multilevel mediation model using the MSEM approach, but the model did not converge properly because the between-subjects variances were too small to support the MSEM model. I therefore chose UMM to assess the multilevel phenomenon of the present study, considering that UMM also allows for decomposition of Level 1 and Level 2 effects and is known to be more valid than traditional multilevel statistical methods, including multilevel modeling and multiple linear regression (Bauer, Preacher, & Gil, 2006).

2.5.4.2.2. Multilevel Mediation Analysis with Multiple Outcomes.

Following other recent work on incidental vocabulary learning, the study includes multiple dependent variables: Form Recognition, Meaning Recognition, and Meaning Recall. Instead of running the multilevel model three times, once for each outcome variable, I analyzed the data using a multivariate multilevel model, an extension of the multilevel model that accommodates multiple outcomes (Baldwin, Imel, Braithwaite, & Atkins, 2014). Snijders and Bosker (2012) explained that the multivariate approach is more rigorous than the univariate approach, especially if the dependent variables are correlated. This approach decreases the probability of Type I error, which would otherwise be inflated when carrying out separate tests for multiple dependent variables.

2.5.5. Statistical analysis.

A multivariate multilevel mediation model was fitted using Mplus 8 (Muthén & Muthén, 2017) to evaluate the possible relationships among clustered data with multiple dependent variables simultaneously.
As the data set included 1.89% missing values, I handled the missing data using the full-information maximum likelihood (FIML; Enders, 2010) estimator implemented in Mplus. Because the outcome variables are binary (either correct or incorrect), a logistic regression model was fitted using the robust maximum likelihood (MLR) estimator with the LINK option. The MLR estimator has the benefit of accounting for non-normality in the measures (Muthén & Muthén, 2017). To include random intercepts, random slopes, and random variances in the multilevel analyses, the TWOLEVEL RANDOM option was selected for the type of analysis. Because I group-mean-centered the mediators (Total Reading Time and ΔOE), I fixed the within-participant intercept to zero to take it out of the model. Also, the residual variances of Meaning Recognition and Meaning Recall were fixed at 0 because they were close to 0 (.002 and .009, respectively). Mplus uses binary logistic regression for all multilevel analyses with a categorical outcome variable; thus, the estimates for paths from predictors to the dependent variables are logit regression coefficients (b). For ease of interpretation, I also report the exponentiated b coefficient (exp(b)), which is an odds ratio, for the final model (model 1d). An odds ratio greater than 1 implies a positive relationship. Put differently, a positive coefficient indicates that the probability of the categorical outcome occurring (here, the probability of a correct answer on the vocabulary tests) increases as the predictor value increases. In contrast, an odds ratio smaller than 1 implies a negative relationship: the likelihood of a correct test answer increases as the predictor value decreases.
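The link between a logit coefficient and its odds ratio can be shown with a short Python sketch. The coefficients below are hypothetical values chosen for illustration, not estimates from the present models, and the helper names are assumptions of this example.

```python
import math


def odds_ratio(b):
    """Convert a logit regression coefficient b into an odds ratio exp(b)."""
    return math.exp(b)


def probability_from_logit(logit):
    """Inverse-logit: map a log-odds value to a probability."""
    return 1.0 / (1.0 + math.exp(-logit))


# Hypothetical coefficients chosen for illustration only.
for b in (0.5, 0.0, -0.5):
    print(f"b = {b:+.1f}  ->  odds ratio = {odds_ratio(b):.3f}")

# Effect of a one-unit increase in the predictor when the baseline logit is 0
# (a 50% chance of a correct answer) and b = 0.5.
baseline = probability_from_logit(0.0)
raised = probability_from_logit(0.0 + 0.5)
print(f"P(correct): {baseline:.3f} -> {raised:.3f}")
```

A positive b gives an odds ratio above 1, b = 0 gives exactly 1, and a negative b gives an odds ratio below 1. Note that the odds ratio is constant across baseline levels, whereas the change in probability depends on where the baseline sits.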
To compare the relative strength of the effect of each independent variable on the dependent variables, standardized coefficients (β) were additionally computed using Bayesian estimation, because Mplus does not provide standardized estimates for multilevel analyses with the current MLR estimator. Standardized coefficients (β) are reported along with unstandardized coefficients (b) in the path diagrams.

CHAPTER 3: RESULTS

The results presented in this chapter are organized by research question.

1. What is the effect of test announcement and time restrictions on the acquisition of receptive and productive knowledge of word form and meaning, as measured in vocabulary posttests?
2. What is the effect of test announcement and time restrictions on eye-fixation times on novel words?
3. What are the interrelationships between intentionality (test announcement), time pressure, attention (eye-fixation duration), and vocabulary learning (test scores)?

I first look at the comparability of the groups at the pretest before examining the vocabulary posttests and processing time by group. Then, I examine whether the effect of test announcement and time limit on vocabulary learning was mediated by eye fixations. It was hypothesized that the intentional learning condition would produce longer processing times, and that longer fixations on target words would enhance the initial stages of the acquisition process, such as recognizing the word form or inferring the meaning of the word; that is, I hypothesized total reading time to be a mediating variable. For the statistical analysis, the alpha level was set at .05 (α = .05).

3.1 Pretests

Table 7 displays the means and standard deviations derived from participants’ performance on the reading proficiency test and the New Vocabulary Levels Test (NVLT).
The average score on the reading proficiency test was highest for the Timed Intentional group (Group 3) (M = 8.60, SD = 1.00), followed by the Timed Incidental group (Group 1) (M = 8.15, SD = 1.23), and lastly the Untimed Intentional group (Group 2) (M = 7.90, SD = 1.18). The average score on the first five parts of the New Vocabulary Levels Test (NVLT) was 81.85 out of 120 (SD = 13.29) for the Timed Incidental group (Group 1), 81.80 (SD = 10.28) for the Timed Intentional group (Group 3), and finally 80.24 (SD = 12.87) for the Untimed Intentional group (Group 2).

Table 7
Average Scores on Two Pretests by Group

Group                                    Reading Proficiency Test    New Vocabulary Levels Test (NVLT)
                                         Mean (SD)                   Mean (SD)
Group 1 (Timed Incidental, n = 16)       8.31 (1.20)                 82.29 (11.70)
Group 2 (Untimed Intentional, n = 14)    7.64 (1.15)                 80.50 (13.07)
Group 3 (Timed Intentional, n = 14)      8.64 (1.15)                 83.00 (12.55)

Note. Summed scores from first five parts of the NVLT were reported and analyzed. The maximum score is 12 for the reading proficiency test and 100 for the NVLT.

To ensure that the three groups were comparable in terms of their English proficiency level, a series of one-way analyses of variance (ANOVA) were conducted on the reading proficiency test and the New Vocabulary Levels Test (NVLT). Participants with prior knowledge of three or more lexical items were excluded from the final analysis, so I did not perform a statistical test on prior knowledge of the target lexical items across groups. First, a one-way ANOVA was conducted with Group as the independent variable and Reading test scores as the dependent variable. The results showed that there were no significant differences between groups, F(2, 43) = 2.676, p = .081, indicating that participants started out at a similar reading proficiency level. Second, a one-way ANOVA was run with Group as the independent variable and total scores from the five parts of the New Vocabulary Levels Test (NVLT) as the dependent variable.
Results confirmed that the three groups did not differ statistically with respect to their vocabulary level, F(2, 43) = .157, p = .856. Taken together, these two tests established the comparability of the three groups in the study.

3.2 Time on the reading task by group

To measure time on reading accurately, I calculated the average of all fixations on the text for each session. Table 8 indicates that the Untimed Intentional group (Group 2) yielded the longest average reading time (M = 515.00, SD = 79.38), followed by the Timed Intentional group (Group 3) (M = 394.06, SD = 79.88), and lastly the Timed Incidental group (Group 1) (M = 313.60, SD = 94.66).

Table 8
Average Time on Reading Task by Group

Group                                    Mean      SD
Group 1 (Timed Incidental, n = 16)       313.60    94.66
Group 2 (Untimed Intentional, n = 14)    515.00    79.38
Group 3 (Timed Intentional, n = 14)      394.06    79.88

Note. Times are given in seconds.

An initial inspection of the eye-tracking data revealed that the assumptions of a parametric test, including normality, homogeneity of variance, and independence of observations, were met (Field, 2009). Therefore, a one-way ANOVA was conducted to compare the time taken to complete the reading among the groups. The comparison showed a statistically significant difference between the three groups, F(2, 43) = 20.867, p < .001. Tukey’s post hoc tests revealed that the Timed Incidental group took significantly less time to complete the reading than the Untimed Intentional group (mean difference = 201.40, 95% CI [123.36, 279.45], p < .001). The Timed Incidental group and the Timed Intentional group also differed significantly (mean difference = 80.46, 95% CI [2.42, 158.51], p = .041). Likewise, there was a statistically significant difference between the Untimed and Timed Intentional groups (mean difference = 120.94, 95% CI [40.33, 201.54], p = .002).
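As a rough check on the omnibus test for reading time, the F statistic of a one-way ANOVA can be recomputed from group sizes, means, and standard deviations alone. The Python sketch below does this with the rounded summary values printed in Table 8; the function name `anova_from_summary` is hypothetical, and because the inputs are rounded, the result is only approximate.

```python
def anova_from_summary(groups):
    """One-way ANOVA F statistic from per-group (n, mean, sd) summaries."""
    total_n = sum(n for n, _m, _sd in groups)
    grand_mean = sum(n * m for n, m, _sd in groups) / total_n

    # Between-groups and within-groups sums of squares.
    ss_between = sum(n * (m - grand_mean) ** 2 for n, m, _sd in groups)
    ss_within = sum((n - 1) * sd ** 2 for n, _m, sd in groups)

    df_between = len(groups) - 1
    df_within = total_n - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# (n, mean, SD) of reading time in seconds, as printed in Table 8.
table8 = [
    (16, 313.60, 94.66),  # Group 1, Timed Incidental
    (14, 515.00, 79.38),  # Group 2, Untimed Intentional
    (14, 394.06, 79.88),  # Group 3, Timed Intentional
]

f_stat, df_between, df_within = anova_from_summary(table8)
print(f"F({df_between}, {df_within}) = {f_stat:.2f}")
```

With these inputs the sketch yields an F of about 20.87, in line with the reported value of 20.867. The within-groups degrees of freedom implied by the group sizes in Table 8 are N - k = 44 - 3 = 41.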
3.3 Vocabulary test results by group

To compare the groups with respect to their test scores, I used each participant’s summed score over the 12 target words for the analysis. Descriptive statistics for the participants’ performance on the three vocabulary tests are presented in Table 9. Overall, regardless of which group they belonged to, participants earned the highest scores on the form recognition test (46.75% accuracy on average) and the lowest scores on the meaning recall test (10.08% accuracy on average). Comparing the groups, the Untimed Intentional group (Group 2) recorded the highest gains on all three tests, followed by the Timed Intentional group (Group 3) and finally the Timed Incidental group (Group 1). Figure 4 illustrates the results on the vocabulary tests by group.

Table 9
Average Scores on Three Vocabulary Post-test Measures by Group

Group                                    Form Recognition    Meaning Recognition    Meaning Recall
                                         Mean (SD)           Mean (SD)              Mean (SD)
Group 1 (Timed Incidental, n = 16)       5.25 (3.07)         2.94 (1.18)            .97 (1.26)
Group 2 (Untimed Intentional, n = 14)    6.29 (2.92)         4.29 (1.98)            1.48 (1.80)
Group 3 (Timed Intentional, n = 14)      5.29 (2.87)         3.50 (1.02)            1.18 (1.40)

Note. The maximum score of each test is 12.

Figure 4. Performance on the three vocabulary post-test measures.

To explore potential group differences in vocabulary test performance, a 3 (Vocabulary Test) x 3 (Group) mixed-design ANOVA was performed. Before conducting the statistical test, I inspected the learning data to check the statistical assumptions, following Larson-Hall’s guidelines (2011). Scores on the form recognition test were normally distributed and variances were largely equal, whereas the meaning recognition and meaning recall test scores were found to violate both the normality and equal variances assumptions across the groups. Therefore, a log transformation was performed on each set of test scores.
The results showed that there was no significant interaction between Vocabulary Test and Group, F(4, 64) = 1.118, p = .356, ηp² = .065. The main effect of Group was not significant, F(2, 32) = .189, p = .829, ηp² = .012, indicating that the groups’ learning gains did not differ significantly from each other across the different types of vocabulary tests. The main effect of Test was significant, F(2, 64) = 66.281, p < .001, ηp² = .674, indicating that test scores differed strongly and significantly between Vocabulary Tests. Post hoc tests using the Bonferroni correction revealed that Form Recognition was significantly different from Meaning Recognition (p = .021) and Meaning Recall (p < .001), and that Meaning Recognition was significantly different from Meaning Recall (p < .001). These findings indicate that the participants performed in a parallel manner regardless of the group they belonged to.

Table 10
Mean Fixation Count, Mean Total Reading Time, and Mean ΔOE for the First and Second Session

2nd reading session Fixation count Mean TRT (SD) 1st reading session Target words Fixation count apprise bewildering calamity chauffeur decipher fatalities gizmos incessant 2.98 (2.51) 4.13 (2.77) 3.7 (2.71) 3.8 (2.53) 3.26 (2.26) 3.54 (2.65) 2.77 (2.30) 3.61 (2.36) 1 2 3 4 5 6 7 8 Mean TRT (SD) 745 (703.92) Mean ΔOE (SD) 182 (367.91) 799 (654.51) 16 (413.92) 1003 (774.21) -44 (462.98) 880 (529.55) 352 (401.53) 772 (600.39) 36 (388.18) 868 (705.04) -41 (503.57) 635 (534.88) -4 (358.28) 880 (653.68) 95 (474.89) 3.26 (3.46) 3.23 (2.80) 2.9 (2.30) 2.85 (1.89) 2.54 (2.03) 2.87 (2.61) 2.48 (1.76) 2.62 (2.17) Mean ΔOE (SD) 302 (369.62) 24 (599.42) -17 (454.43) 262 (367.92) -8 (396.17) 40 (534.83) 840 (1091.92) 786 (735.04) 763 (624.98) 654 (444.7) 609 (485.33) 733 (682.56) 645 (536.12) 172 (384.10) 652 (696.68) 31 (373.86) Table 10 (cont’d) 9 perilous 10 sipped 11 staggering 3.9 (3.00) 917 (730.64) 50 (402.37) 4.02 (2.36) 4.39 (2.49) 954 (612.32) 605 (441.00) 1019 (594.64)
174 (452.67) 3.41 (2.09) 2.33 (1.51) 3.97 (3.14) 835 (584.99) 559 (420.00) 946 (820.49) 162 (400.11) 360 (324.41) 278 (521.84) 12 succumb 3.36 (2.57) (461.23) (546.05) (570.08) (414.38) 802 295 2.38 (2.18) 569 190

Note. TRT = Total Reading Time in milliseconds (the sum of all fixations on the target word).

3.4 Eye fixations by group

3.4.1 Comparison between Session 1 and Session 2.

Since participants read the same text twice for the purpose of comprehension, total reading times on targets from the first and second readings are both reported. Table 10 shows that in both reading sessions, the target word “staggering” elicited the longest mean total reading time, whereas the target word “gizmos” in the first session and “sipped” in the second session elicited the shortest mean total reading time. As expected, total reading times in the second session were shorter than those in the first session. As shown in Figures 5 and 6, reading-time patterns for each target word were generally similar between the first and second reading sessions, although these patterns appeared to differ across the three groups.

Figure 5. Mean total reading time by target words and groups.

Figure 6. Mean ΔOE by target words and groups.

3.4.2. Summed Total Reading Time and Summed ΔOE

In the second research question, I asked whether the attention that participants paid to the target words differed by group. First, mean scores for both variables, summed total reading time (TRT) and the summed difference between the observed and expected TFD (ΔOE), were computed. The descriptive statistics presented in Table 11 show that the Untimed Intentional group had the highest TRT (M = 1871, SD = 599.52) and ΔOE values (M = 220, SD = 544.77). This group is followed by the Timed Intentional group (M = 1669, SD = 609.2 for TRT; M = 220, SD = 544.77 for ΔOE), while the lowest values were recorded for the participants in the Timed Incidental group (M = 1206, SD = 405.03 for TRT; M = 157, SD = 465.75 for ΔOE).
Table 11
Average Summed Total Reading Time and Summed ΔOE by Group

Group                                    Mean Summed Total Reading Time (SD)    Mean Summed ΔOE (SD)
Group 1 (Timed Incidental, n = 16)       1162 (634.40)                          138 (493.03)
Group 2 (Untimed Intentional, n = 14)    1871 (599.52)                          369 (801.27)
Group 3 (Timed Intentional, n = 14)      1352 (658.06)                          153 (601.61)

Note. Times are given in milliseconds.

As both eye-tracking measures, summed total reading time and the summed difference between the observed and expected TFD (ΔOE), were positively skewed, a log transformation was performed (Larson-Hall, 2010) and 25 outliers were excluded to improve the normality of the data. The transformed data satisfied the assumptions for parametric analysis. The one-way ANOVA results revealed a statistically significant difference in summed total reading time between groups, F(2, 707) = 25.804, p < .001, ηp² = .06. Comparisons using Tukey’s contrast revealed a statistically significant difference between the Timed Incidental group and the Untimed Intentional group (mean difference = .148, 95% CI [.09, .20], p < .001, d = .42) and between the Timed Incidental group and the Timed Intentional group (mean difference = .106, 95% CI [.05, .26], p < .001, d = .60). However, there was no statistically significant difference between the Untimed Intentional group and the Timed Intentional group (mean difference = .042, 95% CI [-.01, .10], p = .144, d = .17). In summary, these results suggest that participants in the intentional-learning mode spent a comparable amount of time on target words regardless of time restrictions. Participants in the incidental-learning mode spent significantly less time on target words than those in the intentional-learning mode. The same analyses were conducted on the ΔOE data; the results showed no significant effect of group on ΔOE, F(2, 683) = 1.557, p = .211, ηp² = .04.
3.5 Multivariate Multilevel Mediation Model Results

The primary purpose of analyzing the ΔOE data is to compare the effect of Test Announcement on total reading time with its effect on the ΔOE value. Therefore, I first tested alternative models and found the best-fitting model to describe the relations among predictors, vocabulary knowledge, and total reading time. Next, using the same model, I estimated the relations among predictors, vocabulary knowledge, and the ΔOE value to investigate the differential effects on the variables.

3.5.1 Model comparisons.

To identify the most parsimonious and well-fitting model, I removed one path at a time and evaluated changes in fit across models. This approach is known as theoretically driven model testing and is consistent with common practice (Kline, 2016). I began by testing the full model in Figure 7 and continued by testing theoretically plausible alternatives. For the sake of clarity and succinctness, I report four representative models, including the first full model and the final model. Table 12 presents fit statistics for all models that were considered in determining the most appropriate model. The models presented in Figures 7 to 10 illustrate the sequential process implemented to examine the direct and indirect contributions of intention and time limit to vocabulary learning via total reading time. For convenience and clarity of presentation, the path diagrams are presented separately by dependent variable, even though all the estimates were examined simultaneously within a single multivariate model. That is, Model 1a, Model 1b, and Model 1c are not three separate univariate analyses, but three parts of one multivariate analysis. The final path model (Model 4) is presented once again with all the dependent variables included (see Figure 11). In Figures 7 to 11, standardized coefficients are presented on the left and unstandardized coefficients are presented on the right.
Solid and dashed lines represent significant and nonsignificant effects, respectively. A series of alternative models were tested to find the best-fitting model to describe the relationship among pedagogical interventions, attention, and learning outcomes. The first is a partial mediation model where the relationship of all five independent variables (i.e., Test announcement, Time limit, Word length, Predictability, and Part of speech) to vocabulary learning is partially mediated by total reading time (Model 1). In the second model, total reading time completely mediates the relation between the independent variables on Level 1 (i.e., Word length, Predictability, Part of speech) and vocabulary learning whereas relations of the other variables on Level 2 (i.e., Test announcement, Time limit) and vocabulary learning are partially mediated by total reading time (Model 2). In the third model, total reading time partially mediates the relationship between the independent variables on Level 1 (i.e., Word length, Predictability, Part of speech) and vocabulary learning, whereas relationships of the other variables on Level 2 (i.e., Test announcement, Time limit) and vocabulary learning are completely mediated by total reading time (Model 3). The fourth model is a fully mediated model in which only total reading time is hypothesized to have a direct relationship with vocabulary learning, completely mediating the relationship of each independent variable (i.e., Test announcement, Time limit, Word length, Predictability, Part of speech) to vocabulary learning (Model 4). Alternative models were evaluated using the relative fit statistics of Akaike information criterion (AIC), Bayesian information criterion (BIC), and sample-size adjusted Bayesian Information Criteria (aBIC) (see Table 12).
As Mplus does not produce the degrees of freedom of multilevel models in which dependent variables are categorical, the other fit indices (such as chi-square statistics, the comparative fit index, and the Tucker-Lewis index) cannot be considered. For AIC, BIC, and aBIC, models with lower values are preferred. According to Raftery’s (1995) guidelines, a BIC difference of over 10 implies “very strong” evidence in favor of the model with the smaller BIC; a difference of 6–10 is “strong,” 2–6 is “positive,” and 0–2 is “weak” evidence. As a counterpart, Burnham and Anderson (2002) proposed some rules of thumb for AIC differences, ΔAIC_i = AIC_i – AIC_min, in which AIC_min is the minimum AIC value (i.e., that of the best model) over all models considered; these rules are especially useful for nested models. A larger AIC difference indicates stronger evidence against model i being the best model in the set of models of interest. The support for model i being the best model given the data is “essentially none” for a difference greater than 10, “considerably less” for 4–7, and “substantial” for 0–2.

Table 12
Model Fit Comparisons

        Model 1     Model 2     Model 3     Model 4
AIC     5213.684    4211.410    4206.940    4204.646
BIC     4379.132    4335.496    4344.814    4301.158
aBIC    4264.820    4249.496    4249.554    4234.476

Note. AIC = Akaike Information Criteria; BIC = Bayesian Information Criteria; aBIC = sample-size adjusted Bayesian Information Criteria.

Based on the results, the preferred model for the data at hand is the fourth one (Model 4), which is the fully mediated model. This model has the lowest values on all three criteria (AIC, BIC, and aBIC). The absolute difference in BIC between Model 4 and the next best-fitting model (Model 3) is 43.656 (= 4344.814 - 4301.158), providing very strong evidence that Model 4 is favored. However, the difference in AIC for Model 3 is 2.294 (= 4206.940 - 4204.646), providing substantial evidence for continuing to consider the alternative model.
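The differences discussed above can be recomputed directly from the fit statistics. The following Python sketch is illustrative only: the AIC and BIC values are copied as printed in Table 12, while the function names and the label for the reference model are assumptions of this example.

```python
# Fit statistics copied as printed in Table 12.
fit = {
    "Model 1": {"AIC": 5213.684, "BIC": 4379.132},
    "Model 2": {"AIC": 4211.410, "BIC": 4335.496},
    "Model 3": {"AIC": 4206.940, "BIC": 4344.814},
    "Model 4": {"AIC": 4204.646, "BIC": 4301.158},
}


def deltas(fit_table, criterion):
    """Difference between each model's criterion value and the minimum."""
    best = min(row[criterion] for row in fit_table.values())
    return {
        name: round(row[criterion] - best, 3)
        for name, row in fit_table.items()
    }


def raftery_label(delta_bic):
    """Verbal label for a BIC difference, following Raftery's (1995) bands."""
    if delta_bic == 0:
        return "reference (lowest BIC)"
    if delta_bic > 10:
        return "very strong"
    if delta_bic > 6:
        return "strong"
    if delta_bic > 2:
        return "positive"
    return "weak"


delta_aic = deltas(fit, "AIC")
delta_bic = deltas(fit, "BIC")
for name in fit:
    print(name, delta_aic[name], delta_bic[name], raftery_label(delta_bic[name]))
```

For Model 3, the sketch reproduces the two differences reported in the text: 2.294 for AIC and 43.656 for BIC, the latter falling in the “very strong” band.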
The results are summarized in Table 12 above.

3.5.2 Final Multivariate Multilevel Mediation Model

The final model is a fully mediated model in which only total reading time has direct relations to vocabulary learning, completely mediating the relation of each independent variable (i.e., Test Announcement, Time Limit, Word Length, Predictability, Part of Speech) to vocabulary learning (Model 4). I will focus on the final model by analyzing and reporting each path between variables. Because I group-mean-centered total reading time, I fixed the intercepts at the within-subjects level to zero to take them out of the model. Intercepts of total reading time at the between-subjects level were estimated as b = 1.409, SE = .204, p < .001, exp(b) = 2.384. Equations corresponding to this model are as follows:

i = target words (Level 1)
j = participants (Level 2)

HELNDE++HOI FGHI=PGGH+PG8HQRE+PG=HQKE+SGHI F8HI=P8GH F=HI=P=GH F>HI=P>GH YDE=βGE+β8EV