TESTING THE THREE-STAGE MODEL OF SECOND LANGUAGE SKILL ACQUISITION

By Ryo Maie

A DISSERTATION Submitted to Michigan State University in partial fulfillment of the requirements for the degree of Second Language Studies—Doctor of Philosophy 2022

ABSTRACT

Skill acquisition theorists conceptualize second language (L2) learning as the acquisition of a set of perceptual, cognitive, and motor skills. The dominant view in skill acquisition theory is to regard L2 skill acquisition as a three-stage process “from initial representation of knowledge through initial changes in behavior to eventual fluent, spontaneous, largely effortless, and highly skilled behavior” (DeKeyser, 2020, p. 83). While there is indirect evidence that indicates the existence of such developmental stages, the number and the nature of those stages are often assumed a priori, and whether or not these stages actually exist remains untested. My dissertation study was designed to test and validate the three-stage model of L2 skill acquisition derived from cognitive psychological research, namely, the cognitive, associative, and autonomous stages (Fitts & Posner, 1967), each of which draws on distinct cognitive processes for learning. Sixty-five adult learners deliberately learned and practiced a miniature language based on Japanese, called Mini-Nihongo, for a total of 1,056 practice trials. The participants also took a battery of tests on three dimensions of cognitive abilities that are known to be active at each stage of skill acquisition: declarative memory, procedural memory, and psychomotor ability (Ackerman, 1988, 1992; Anderson, 1982). Comprehension practice took place in the form of a sentence-picture matching task, and production practice was implemented in the form of a productive maze task. Accuracy, reaction time (RT), and the coefficient of variability (CV) of RT were analyzed as the dependent variables.
There were six tests of cognitive abilities: the Continuous Visual Memory Task and LLAMA-B for declarative memory ability, an alternating serial reaction time task and a statistical learning task for procedural memory ability, and the alternating serial reaction time task and a two-choice RT task for psychomotor ability. I analyzed the data from the language practice and the battery of cognitive tests in two steps. First, I fitted a series of hidden Markov models (HMMs) to the RT data that represented different hypotheses regarding the number of skill acquisition stages (i.e., one, two, or three stages). This first step of the analysis revealed that the acquisition of comprehension skills can be best conceptualized as a three-stage process, whereas the acquisition of production skills encompassed two stages. Based on the best-fitting HMMs and the corresponding number of learning stages, I then utilized a series of generalized linear mixed models to investigate the nature of the identified skill acquisition stages. Specifically, I examined whether the three dependent variables (accuracy, RT, and the CV) could be predicted by the three dimensions of cognitive abilities at each stage of skill acquisition. The results showed that different cognitive abilities variably predicted learning at each stage, with the trends largely consistent with the general skill acquisition theory (DeKeyser, 2020; Lyster & Sato, 2013; Y. Suzuki, 2022). Overall, the findings of the study lend support to the three-stage model of L2 skill acquisition, but its proposed mechanisms may have to be revised to suit the specific cognitive processes involved in L2 learning.
In addition, when applying the skill acquisition theory (or variants thereof) to L2 learning, one may have to analyze the theory not only at the level of learning mechanisms and processes (e.g., declarative learning, proceduralization, and production tuning) but also at the level of cognitive processing (e.g., lexical and grammatical de/encoding, syntactic parsing, and monitoring) that is specific to L2 learning.

Copyright by RYO MAIE 2022

This dissertation is dedicated to my parents. Words cannot describe how lucky I am to be your son.

ACKNOWLEDGEMENTS

I could not have come this far without the guidance, support, and friendship I received from many individuals. First and foremost, I would like to express my sincere gratitude to my Ph.D. supervisor, Professor Aline Godfroid. Aline has been an unrivaled inspiration for me throughout my study at Michigan State University (MSU). She is my role model as a researcher in second language acquisition (SLA), and I cannot believe how lucky I am to be her student. This dissertation would have never been possible without her patient guidance and caring support. Professor Michael Long, the founder of SLA, once named Aline as someone who would lead the field of SLA for the coming decades. Now, receiving a Ph.D. and graduating from MSU as her advisee, I have never believed that more. I would also like to thank my committee members, Professors Shawn Loewen, Koen Van Gorp, Paula Winke, and Phillip Hamrick. Dr. Shawn Loewen was the director when I entered the Second Language Studies (SLS) program and continued to support me throughout my study at MSU. He is also the only birdwatcher I could talk to in Michigan, and I cherish every moment I had with him, both serious and comical. I am also grateful to Shawn for giving Kiyo and me the name: Dynamic Duo. Dr. Koen Van Gorp is someone whom I talked to when it came to task-based language teaching (TBLT), even though I probably did not work on it enough to call myself a TBLT researcher.
Koen was also the instructor of my first course at MSU (LLT807: Language Teaching Methods), and it is still my best class at MSU. Dr. Paula Winke is the current director of the SLS program, and she is the one who helped me through many of my professional and administrative issues. I first met/contacted Paula when I was a master’s student at the University of Maryland. I was looking for a working memory task I could use for my master’s thesis. I still remember the exact moment when I got a reply from Paula. I was astonished because someone like her, established in SLA, could be so kind and approachable. Paula’s students always talk highly of her, and there is no wonder why that is the case. Lastly, Dr. Phillip Hamrick is a respected psycholinguist and cognitive scientist who kindly agreed to serve on my committee. I first found Dr. Hamrick’s name when I read his doctoral dissertation and later his article in Language Learning (Hamrick, 2014) for my master’s thesis study. I later met him at the Second Language Research Forum 2018 in Montréal. He kindly asked me a question at my presentation. I followed him after the talk, and I am glad that I gathered the courage to talk to him because he is now a member of my dissertation committee. I am also indebted to many individuals in the SLS program and other parts of the world for the professional and emotional support they provided me as my colleagues and friends (in alphabetical order): Masaki Eguchi, Curtis Green-Eneix, Bronson Hui, Robert Randez, and Kiyotaka Suga. I also want to thank Professors Robert DeKeyser and Yuichi Suzuki for their inspiration and help throughout my graduate study at the University of Maryland and MSU. Finally, my arigato goes to my family, Masaru, Yumiko, and Yusuke, for their endless support and care. I am forever indebted to them.
This dissertation study was funded by the National Science Foundation Doctoral Dissertation Improvement Grant (Award Number: 2140704), NFMLTA-MLJ Dissertation Writing Support Grants, the Dissertation Completion Fellowship from the College of Arts and Letters at MSU, and the SLS Dissertation Participant funding. Completing this dissertation would have been difficult without their financial support. In addition, I would like to thank Dr. Caitlin Tenison of Educational Testing Service for providing me with the Python code to run the hidden Markov modeling analysis and for answering my queries regarding the analysis.

TABLE OF CONTENTS

INTRODUCTION
CHAPTER 1: REVIEW OF THE LITERATURE
CHAPTER 2: THE CURRENT STUDY
CHAPTER 3: RESULTS
CHAPTER 4: DISCUSSION
CHAPTER 5: CONCLUSION AND LIMITATIONS
REFERENCES
APPENDIX A: DESCRIPTIVE STATISTICS OF THE ANALYZED VARIABLES
APPENDIX B: RESULTS OF THE REGRESSION ANALYSIS

INTRODUCTION

Second language (L2) learning is often conceptualized as the acquisition of a set of perceptual, cognitive, and motor skills (see DeKeyser, 2020; Lyster & Sato, 2013; Y. Suzuki, 2022 for reviews).
In this view, mastering L2 skills is equated with acquiring skills in other non-linguistic domains, such as typing, driving a car, or solving arithmetic problems. The current research base in the field of second language acquisition (SLA) has provided ample evidence for the parallel nature of L2 learning and skill acquisition with respect to both the process and the product of learning (e.g., De Jong, 2005; DeKeyser, 1997; Ferman, Olshtain, Schechtman, & Karni, 2009; Robinson, 1997; Robinson & Ha, 1993). This collection of evidence suggests that L2 learning can and should be accounted for by the same domain-general learning mechanisms that apply to the learning of other non-linguistic skills (e.g., perceptual, motor, and cognitive skills). L2 researchers to date have turned to neighboring fields, mainly cognitive psychology, for accounts of L2 skill acquisition processes using domain-general cognitive mechanisms (e.g., memory, attention, and problem-solving ability). The current dominant view is to regard L2 skill acquisition as a three-stage process: learning progresses from the cognitive, through the associative, to the autonomous stage (Fitts, 1964; Fitts & Posner, 1967); or through the declarative, transitional, and procedural stages (Anderson, 1982, 1983b, 2007). At present, there is indirect evidence that indicates the existence of such developmental stages in L2 learning (Ferman et al., 2009; Pili-Moss, Brill-Schuetz, Faretta-Stutenberg, & Morgan-Short, 2020), yet the number and the nature of the stages are assumed a priori and are not themselves the object of research. As an interdisciplinary enterprise, the field of SLA has benefited from cross-fertilization with other scientific disciplines, but any theories adopted into L2 research must be tested for their validity vis-à-vis L2 learning.
This is critical because the three-stage model (or part thereof) is already represented in many subdomains of SLA (e.g., language instruction: DeKeyser, 1998, 2001; Lyster & Sato, 2013; language assessment: ACTFL, 2012; Council of Europe, 2020), but the model itself is currently only theoretical and lacks empirical support. In the oft-cited review of skill acquisition research in L2 learning, DeKeyser (2020) lamented this fact: “More importantly for our purposes here, not much research in the field of second language learning has explicitly set out to gather data from second language learners to test (a specific variant of) Skill Acquisition Theory” (p. 88). Against this backdrop, this dissertation brings together three lines of research in SLA and cognitive psychology to provide the first direct evidence for (or against) the influential three-stage model of L2 skill acquisition. The first line of research concerns a collection of SLA studies that investigated the parallel nature of L2 learning and skill acquisition by documenting how people develop accuracy and fluency in a novel language as a function of practice (e.g., De Jong, 2005; DeKeyser, 1997; Ferman et al., 2009; Robinson, 1997; Robinson & Ha, 1993). The second line of research focuses on the study of individual differences in skill acquisition to identify what cognitive abilities underlie learning during skill acquisition in general (Ackerman, 1987, 1988, 1990, 1992; Ackerman & Cianciolo, 2000; Ackerman, Kanfer, & Goff, 1995) and L2 skill acquisition in particular (Li, 2017; Maie, 2021; Pili-Moss et al., 2020; Y. Suzuki, 2018). Lastly, the third line of research involves using cognitive modeling to mathematically model skill acquisition processes and detect distinct phases of learning during skill acquisition (Tenison & Anderson, 2016; Tenison, Fincham, & Anderson, 2016).
By taking conceptual and methodological insights from each of the three lines of research, this dissertation sets out to test the number and the nature of skill acquisition stages in the context of L2 learning.

Before moving forward, I would like to clarify the terminology that will be used throughout the dissertation. First, the term L2 learning will refer to any kind of development in L2 knowledge or performance without recourse to a specific learning process or mechanism involved. Hence, the term covers L2 development in any approach, be it formal, usage-based, or skill acquisition approaches. When discussing L2 learning in general, I will thus prefer the word learn over acquire. However, I will use the term skill acquisition to refer to the entire process of mastering skills because it is generally preferred over the other terms such as skill learning or development. Hence, skill acquisition in L2 learning will specifically be called second language skill acquisition or L2 skill acquisition, and the general theory of skill acquisition (without recourse to any specific models of process or mechanism) will be the skill acquisition theory. Lastly, I will use the term second language acquisition or SLA to refer to the scientific research on how people learn and use any additional language after one’s first language; hence, I will use the term second language or L2 to mean any languages other than one’s first language.

CHAPTER 1: REVIEW OF THE LITERATURE

In this chapter, I review the literature on skill acquisition as it relates to L2 learning. I will first provide an overview of skill acquisition processes and discuss phenomena that are almost universally observed in skill acquisition research (Section 1.1). I will then define the concept of automaticity and automatization as the end product of skill acquisition and review some methods that have been proposed to operationalize automaticity/automatization (Section 1.2).
Theoretical models of skill acquisition will then be introduced, including the three-stage model and its rival models, with particular attention paid to how each model accounts for the skill acquisition phenomena and the development of automaticity (Section 1.3). I will then review a recent line of research in cognitive psychology that attempted to pit the rival theoretical models against each other by modeling skill acquisition processes using cognitive modeling (Section 1.4). After clarifying the underpinning concepts and theories of skill acquisition, I will review the literature on L2 skill acquisition, focusing on evidence that shows the parallel nature of skill acquisition in general and L2 learning and on research that investigates the role of cognitive individual differences in (L2) skill acquisition (Section 1.5). Finally, I will point out the fundamental problems in the L2 skill acquisition literature and discuss how the present dissertation research aims to fill the gaps. Because the overall theme of the dissertation is to investigate the applicability of cognitive skill acquisition theory in the context of L2 skill acquisition, in Sections 1.1–1.4, I will primarily draw from the cognitive psychology literature to review the basic phenomena of skill acquisition, the definition and measurement of automaticity/automatization, and theoretical models of skill acquisition.

1.1 Skill Acquisition and the Associated Phenomena

Skill acquisition is the process of learning skills to advanced proficiency “from initial representation of knowledge through initial changes in behavior to eventual fluent, spontaneous, largely effortless, and highly skilled behavior” (DeKeyser, 2020, p. 83).
The process of skill acquisition has been well documented in a variety of domains, ranging from perceptual (e.g., Kolers, 1975; Neisser, Novik, & Lazar, 1963) and motor skills (e.g., Card, English, & Burr, 1978; Crossman, 1959; Snoddy, 1926) to complex cognitive routines (e.g., Anderson, 1983a; Card, Moran, & Newell, 1980; Compton & Logan, 1991; Neves & Anderson, 1981). Although the interpretation of the process varies in detail from one researcher to another, there is a general consensus that (a) extended practice is necessary to achieve full mastery of the skill and that (b) the end state of skill acquisition is automaticity (see Section 1.2 for more discussion of automaticity) (see VanLehn, 1996; DeKeyser, 2001, 2020, for a review in cognitive skill acquisition and in L2 skill acquisition, respectively). Additionally, the current stockpile of skill acquisition research has shown that two phenomena are almost universally observed in learning of any skill. These phenomena are (a) the power-law of practice and (b) the skill specificity. Due to the ubiquity of the phenomena, every theoretical model of skill acquisition is expected to account for how proposed learning mechanisms give rise to the phenomena.

1.1.1 The Power-Law of Practice

The power-law of practice is a scientific law of learning that states that the time it takes one to perform a skill decreases with the amount of practice, and the decrease follows a specific non-linear curve defined by a power function. Figure 1.1 (left panel) provides an example of a power function applied to skill acquisition data. The speedup in performance following the power law is typified by an initial short period of rapid decrease in performance times followed by a gradual and slow process of fine-tuning skill performance to reach the asymptotic level of performance.
The seminal article by Newell and Rosenbloom (1981) illustrated that the same power function applies to a great variety of skills and domains, qualifying the phenomenon as a scientific law. Although the exact form of the power function is a subject of debate, its basic form can be expressed in the following formula:

$T = I + \beta N^{-\alpha}$

where $T$ is the time to perform a skill, $I$ is the asymptote (i.e., the psycho-physical limit of speed one can achieve after an infinite amount of practice), $\beta$ is the difference between the initial trial and the asymptote, and therefore how much one can speed up (after an infinite amount of practice), $N$ is the number of practice trials, and $\alpha$ is the learning rate that controls how fast one can speed up. The minus sign on the exponent ($\alpha$) produces the decelerating decrease in performance times. One mathematical corollary of the power function is that plotting the logarithm of performance times ($T$) against the logarithm of practice trials ($N$) yields a straight line; log($T$) ~ log($N$) produces $R^2 = 1.0$ (exactly so in the noise-free case when the asymptote $I$ is zero or is first subtracted from $T$). Hence, the linear regression of log($T$) (as the dependent variable) on log($N$) (as the predictor) serves as a litmus test of whether one’s dataset conforms to the power-law of practice. This is why Newell and Rosenbloom (1981) alternatively called the phenomenon the log-log linear law. Some researchers showed that the accuracy of performance (Anderson, 1995) and the standard deviation of performance times (Logan, 1988, 1992, 2002) also follow the power-law of practice. However, evidence is currently too limited to draw any conclusions.

Figure 1.1. An example of the power function fit to skill acquisition data.
Note. The left panel shows data on the original scale and the right panel shows the data with performance times and practice trials transformed to their natural logarithms.
The dataset was simulated from a power function: 𝑇 = 200 + 800𝑁 !$.& + N(0, 30), where $T$ is performance times, $N$ is the number of practice trials, and N(0, 30) is a normal distribution with the mean of 0 and the standard deviation of 30 to add sampling error. For this dataset, $R^2 = .992$.

One recurring criticism against the power-law of practice is that it often does not apply to individual data points and hence its ubiquity may be an artifact of data averaging (Adi-Japha et al., 2008; Gallistel et al., 2004; Haider & Frensch, 2002; Heathcote et al., 2000; Myung et al., 2000). For instance, Adi-Japha et al. (2008) pointed out that applying a power function to aggregated data may conceal important patterns in individuals’ gains. Heathcote et al. (2000) revealed that an exponential function fit better than a power function for all 7,910 learning series (from 475 participants) of performance times. The exponential function takes the form of $T = I + \beta e^{-\alpha N}$; hence, the number of practice trials ($N$) appears in the exponent of $e$ (i.e., Euler’s number, ≈ 2.71828) rather than as the base raised to the exponent $-\alpha$, as in the power function. Heathcote et al. further proposed what they called the APEX function, which incorporated an additional pre-experimental practice parameter. One reason for the power function’s inability to explain individual data is that raw data are (more) prone to inherent variability (than when they are aggregated). However, the variability results not only from systematic effects caused by the underlying learning mechanisms but also from pure performance or sampling error that has nothing to do with the process of skill acquisition. While deviations from the power function due to the systematic effects are worthy of close theoretical and empirical attention, it is unwise to question the power-law of practice if the deviations result from sampling error.
Recently, some researchers showed that when sampling error is incorporated in a model (coupled with different power functions for distinct learning phases; see below), the power function and the exponential function become nearly equivalent in how well they fit individual and item-level data (but the APEX function shows substantially poorer fit) (e.g., Tenison & Anderson, 2016). Consequently, the compatibility of the power function with raw data is still an open question at best. Another issue surrounding the power-law of practice is the number of power functions required to account for skill acquisition data. Initially, Newell and Rosenbloom (1981) and other trailblazing research (e.g., Anderson, 1981; Logan, 1988; MacKay, 1982) only considered using a single power function. However, Rickard (1997, 1999) later demonstrated that two power functions better account for data (see Delaney et al., 1998 for the same finding). As described in Section 1.3 (Models of Skill Acquisition), the number of power functions crucially depends on whether a given theory of skill acquisition describes the speedup in skill execution as a quantitative change in the underlying cognitive process (i.e., improvement of the same process) or a qualitative change (i.e., shifting to more efficient processes). The idea of connecting the power-law of practice to the underlying cognitive mechanisms is crucial because “[the power-law] has captured little attention, especially theoretical attention, in basic cognitive or experimental psychology, though it is sometimes used as the form for displaying data” (Newell & Rosenbloom, 1981, p. 2). It is thus incumbent on skill acquisition researchers to theorize and investigate what its shape and ubiquity signify in terms of psychological mechanisms.
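The log-log check and the contrast between the power and exponential functions discussed in this section can be illustrated with a minimal sketch. It is not the analysis code used in this dissertation. The function names, the 200 simulated trials, the learning rates (0.5 for the power function, 0.05 for the exponential function), and the random seed are all assumptions chosen for illustration; the asymptote of 200, the gain of 800, and the error standard deviation of 30 mirror the simulation in Figure 1.1.

```python
import math
import random


def power_law(n, asymptote, beta, alpha):
    """Power function: T = I + beta * N^(-alpha)."""
    return asymptote + beta * n ** (-alpha)


def exponential_law(n, asymptote, beta, alpha):
    """Exponential function: T = I + beta * e^(-alpha * N)."""
    return asymptote + beta * math.exp(-alpha * n)


def r_squared(xs, ys):
    """R^2 of an ordinary least-squares regression of ys on xs."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy ** 2 / (sxx * syy)


def loglog_r2(trials, times, asymptote=0.0):
    """R^2 of log(T - asymptote) regressed on log(N)."""
    log_n = [math.log(n) for n in trials]
    log_t = [math.log(t - asymptote) for t in times]
    return r_squared(log_n, log_t)


# Illustrative parameter values; both learning rates are assumptions.
ASYMPTOTE, BETA = 200.0, 800.0
trials = list(range(1, 201))

clean_power = [power_law(n, ASYMPTOTE, BETA, 0.5) for n in trials]
clean_exp = [exponential_law(n, ASYMPTOTE, BETA, 0.05) for n in trials]

# Add normally distributed error (SD = 30) to mimic trial-level noise.
rng = random.Random(1)
noisy_power = [t + rng.gauss(0, 30) for t in clean_power]

print("power, asymptote removed:", loglog_r2(trials, clean_power, ASYMPTOTE))
print("power, raw log(T):", loglog_r2(trials, clean_power))
print("power, raw log(T), noisy:", loglog_r2(trials, noisy_power))
print("exponential, asymptote removed:", loglog_r2(trials, clean_exp, ASYMPTOTE))
```

In the noise-free case, the regression of log(T − I) on log(N) is exactly linear for the power function, whereas leaving a non-zero asymptote in T, adding sampling error, or generating the data from the exponential function each pulls the log-log fit away from a perfect straight line. A high log-log R² on its own is therefore suggestive rather than decisive evidence for the power function.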
1.1.2 The Skill Specificity

The skill specificity refers to the negative correlation between the amount of practice and the generalizability of a skill; when the skill is highly practiced through a specific task or in a specific domain (e.g., visual vs. auditory), it becomes less available for transfer to another task or a different domain. The classic experiment on the transfer of learning by Thorndike and Woodworth (1901) (see also Thorndike, 1906) was the first to document that skills may not transfer well from one task to another unless the tasks share “identical elements” (e.g., content and procedure). In skill acquisition research, a corpus of evidence has demonstrated the specificity of extensively practiced skills (see Healy & Bourne, 1995; Rogoff & Lave, 1984). Singley and Anderson (1989) provided the most impressive demonstration of the specificity of cognitive skills to date (and also an extensive literature review on the issue of transfer), showing that the skills of reading and writing computer programs, when highly practiced, give rise to an overwhelming directional asymmetry: performing the same skill in the reverse direction (i.e., writing computer programs when one has extensively practiced reading them, or vice versa) leads to more errors and slower performance (see also Anderson & Fincham, 1994). As with the power-law of practice, the ubiquity of the skill specificity demands theoretical explanations. The specificity of skill is not without counterevidence, however. Ackerman (1990) noted that two independent camps have investigated the issue of transfer in skill acquisition: a group of traditional transfer-of-training experiments (see Adams, 1987; Singley & Anderson, 1989 for reviews) and a group of individual-differences studies (see Marteniuk, 1974 for a review).
While the former includes research reviewed above and examines whether training of a skill through a criterion task or in a specific domain influences performance on a transfer task or in a different domain, the latter group of studies investigates how individual differences in performing the criterion task can be predicted by individual differences in abilities that are theoretically related to the target skill. The underlying argument behind the individual-differences paradigm is “difficulties in discovering abilities that predict individual differences at highly skilled levels of performance” (Ackerman, 1990, p. 883); that is, if learning a skill is specific to a task or domain, abilities that are related to the performance of the skill should not predict the asymptotic performance of the skill (see also Fleishman, 1972; Marteniuk, 1974). Contrary to this prediction, several studies show that some cognitive abilities indeed predict performance of a skill even at the asymptotic level (Ackerman, 1987, 1988, 1990, 1992; Adams, 1957, 1987), therefore casting doubt on the skill specificity phenomenon. However, a caveat in interpreting these findings is that the correlational design of the individual-differences paradigm does not directly speak to transfer. In this light, findings from the traditional transfer-of-training experiments bear more empirical credibility. Nonetheless, Ackerman and colleagues provided a useful theoretical model of individual differences in skill acquisition (Ackerman, 1987, 1988, 1990, 1992; Ackerman & Cianciolo, 2000; Ackerman et al., 1995), which will be reviewed as one of the main theoretical bases for my dissertation study (Section 1.5.2).

1.2 Automaticity and Automatization

The concept of automaticity (or an automatic process) abounds in one’s everyday life.
A sliding door opens automatically in response to an overhead motion sensor, a vehicle with an automatic transmission shifts gears by itself, and a modern robotic vacuum cleaner operates on its own. Humans also display an amazing degree of automaticity. One does not (or is not able to) think about the process of standing or moving one’s hands (i.e., innate automaticity). Even when carrying out complex skills such as using a smartphone, driving a car, or speaking the first language, as long as one is highly experienced, the underlying perceptual, cognitive, and motor processes do not even come to mind (i.e., acquired automaticity). The word “automatic” originates from the Greek word (adjective) automatos: “self-acting” or “acting on its own”. Therefore, automaticity implies that the process operates without any intervention from an active agent. Although the omnipresence of automaticity seemingly attests that its concept is well understood by scientists and the general public alike, cognitive psychologists have diverged regarding the specific features that characterize automaticity.

1.2.1 Concepts of Automaticity and Automatization

The first systematic attempt to tackle the concept of automaticity was Shiffrin and Schneider’s (1977) (also Schneider & Shiffrin, 1977) dichotomy of controlled and automatic processing (i.e., the dual-process theory). The theory was innovative in that the researchers tackled the construct of automaticity as a problem of attention: a process is said to be automatic if it takes place “without the necessity of active control or attention by the subject” (Shiffrin & Schneider, 1977, pp. 155–156). An automatic process is thus a cognitive activity that does not expend (or barely requires) one’s attentional resources in working memory, whereas a controlled process entails attentional processing that calls for one’s cognitive resources.
Learning is considered a transition from a controlled to an automatic process (see Schneider & Chein, 2003 for a more recent treatment of the dual-process theory). In L2 learning, Johnson (1996) also defined automaticity from the same perspective, as “the ability to get things right when no attention is available for getting them right” (p. 137). The consequence of theorizing the dichotomy with respect to attention is that automatic and controlled processes can be juxtaposed with each other in terms of their functional characteristics. While an automatic process is parallel, ballistic, and effortless, a controlled process is serial, controllable, and effortful. In the context of visual search tasks (i.e., searching for a target from a display with other similar objects), Schneider and Shiffrin (1977) operationalized automatic processing as load-independent processing; if the process of visually searching for a target is automatic (and hence does not require attentional resources), it should operate regardless of how much information needs to be processed at the same time. Schneider and Shiffrin found that automaticity only developed in a specific condition called “consistent-mapping” (as opposed to “variable-mapping”) in which a stimulus (e.g., a visual image) always appeared as a target in the experiment but never as a distractor. As Segalowitz (2003) discussed, the role of consistent stimulus-response experiences has an important implication for language learning because a linguistic unit (stimulus) (e.g., a word or a morpheme) may not exclusively map onto a single semantic referent (response). One caveat of Shiffrin and Schneider’s (1977) dichotomy is that it presents automaticity as a binary phenomenon, automatic or not automatic. However, many cognitive psychologists now view automaticity as being on a continuum, and the transition from controlled to automatic processing is investigated as a gradual (rather than an abrupt) process.
Along with domain-independent processing, researchers have also proposed many additional features of automaticity: (a) fast, (b) parallel, (c) effortless, (d) ballistic, (e) result of consistent practice, (f) unimpeded by (little interference from) a secondary task, (g) unconscious, and (h) based on memory retrieval. This large collection of features does not mean that cognitive psychologists agree on the nature of automaticity. Moors and De Houwer (2006) (also Moors, 2016) discussed the field’s inability to reach a consensus concerning which set of the features one should use to define the construct of automaticity. As a case in point, Posner and Snyder (1975) claimed three features (ballistic, little interference from a secondary task, and unconscious), whereas MacKay (1982) proposed four features (fast, effortless, little interference, and unconscious). Schneider and Detweiler (1988) mentioned only one feature (little interference), but Anderson (1992) listed five features (fast, little attention, ballistic, consistent practice, and little interference) (see DeKeyser, 2001, p. 128 for a very similar but more extensive illustration). Given the researchers’ wide disagreement, providing a componential explanation of automaticity that “unpacks the components of automaticity and specifies the relations among them” (Moors, 2016, p. 264) may be an intractable problem. Rather, the aim of this dissertation is to seek a mechanistic explanation of automaticity that “specifies the low-level processes underlying automatization” (Moors, 2016, p. 264). The word automatization is also a multi-sense term.
Cognitive psychologists have largely used the term at three different levels of generality with respect to the developmental continuum of automaticity (see Figure 1.2 for illustration): (a) the entire process of developing automaticity, from initial slow(er) performance through the initial sharp drop in performance times to the gradual fine-tuning of the skill and increasing speed to reach the asymptotic performance (see Figure 1.1, left panel); (b) the initial drop in performance times only; or (c) the gradual fine-tuning of the skill only. Currently, the interpretation of the term depends on the underlying mechanisms posed to explain the development of automaticity (see Section 1.3). In the three-stage model, the initial slower performance is due to the interpretive application of declarative knowledge, and the abrupt reduction in performance times is attributed to proceduralization (Anderson, 1982, 1983b, 2007; see Section 1.3.1). In this dissertation, I will thus reserve the term automatization to refer to the flatter part of the curve where one gradually fine-tunes the performance to reach the asymptotic level (the right panel in Figure 1.2). This characterization of automatization is also consistent with how the term is defined in the SLA literature (DeKeyser, 2020; Y. Suzuki, 2022).

Figure 1.2. The three meanings of automatization. Note. The left panel shows automatization as the entire process of developing automaticity, the middle panel shows automatization as the initial sharp drop in performance time, and the right panel shows automatization as the gradual fine-tuning of the skill performance.

One issue that surrounds the mechanistic explanation of automaticity is whether it arises from a quantitative change in the underlying mechanism (i.e., improvement of the same process), a qualitative change (i.e., shifting to more efficient processes), or the combination of both.
Rival models of skill acquisition offer competing proposals on this point (see Section 1.3). As reviewed in the next section, theorizing the exact mechanism of developing automaticity is crucial as it has implications for how the construct of automaticity can be measured.

1.2.2 Measuring Automaticity

Establishing reliable measurements of automaticity is crucial for understanding the process and the mechanism of developing automaticity. It is possible that one operationalizes all potential features of automaticity (see Section 1.2.1) and observes how indices operationalizing the features change as a result of practice. However, since cognitive psychologists disagree as to the best set of features that characterize automaticity (Moors, 2016; Moors & De Houwer, 2006), this method may not be an empirically realistic plan. Traditionally, skill acquisition researchers have focused on the speed aspect of skill performance (i.e., an automatic process is fast). The most common method is to track the reaction time (RT) of performance and examine how it declines as a function of practice. Given the robust findings on the power-law of practice, observing that participants have reached an asymptotic level of performance (i.e., reaching almost the end of the power-law curve; see Figure 1.1) likely indicates that the participants have achieved automaticity. Obviously, this method captures only one aspect of automaticity as it focuses solely on how fast one carries out the skill. The most innovative development in the measurement of automaticity comes from a research program in the field of SLA led by Norman Segalowitz and colleagues (Phillips, Segalowitz, O’Brien, & Yamasaki, 2004; Segalowitz & Freed, 2004; Segalowitz & Segalowitz, 1993; Segalowitz, Segalowitz, & Wood, 1998; Segalowitz, Watson, & Segalowitz, 1995; see Segalowitz, 2010 for a review).
Segalowitz and Segalowitz (1993) asserted that automaticity requires performance not only to be fast but also to be carried out with stability (i.e., less variability in the speed of performance). This proposition is based on the belief that automaticity results from “qualitative changes in the functioning of the underlying processes through a restructuring effect”, whereby component processes underlying the skill performance “become organized differently; for selected, inefficient processes to drop out; for new, more efficient processes to replace older, less efficient ones; or for some mixture of all these possibilities to occur” (Segalowitz & Segalowitz, 1993, pp. 373–374) (see also McLaughlin, Rossman, & McLeod, 1983; McLeod & McLaughlin, 1986 for theoretical reviews, especially in L2 contexts). Operationally, Segalowitz and Segalowitz proposed the coefficient of variability (CV) of RT (i.e., the standard deviation of RT divided by the mean of the RT) as a measure of processing stability and suggested that a researcher must examine reductions in both RT and the CV. This is because when learners simply speed up their performance, only RT is expected to drop (but not CV), and the standard deviation (SD) of the RT decreases at the same rate (due to the mathematical nature of smaller numbers being associated with smaller variability). However, when one’s performance becomes more stable, and hence more automatic, not only RT but also the CV would decrease, and because the SD (of RT) decreases disproportionately to the rate of the RT decrease, RT and CV will positively correlate. Therefore, the evidence of automaticity in this paradigm requires that (a) RT and (b) CV both decrease in some meaningful way (e.g., statistical significance), and that (c) RT and CV positively correlate with each other.
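The logic of the CV can be made concrete with a small numerical sketch. The RT values below are hypothetical and chosen only for illustration; they are not data from any of the cited studies. The sketch contrasts pure speedup (every RT shrinks by the same proportion, so the CV is unchanged) with a restructuring-like change (the mean drops to the same level, but variability shrinks disproportionately, so the CV falls).

```python
from statistics import mean, stdev


def coefficient_of_variability(rts):
    """Return the CV of a list of RTs: SD of RT divided by mean RT."""
    return stdev(rts) / mean(rts)


# Hypothetical RTs (in ms) from an early practice block.
early = [900.0, 1100.0, 1000.0, 1300.0, 700.0]

# Scenario A (assumed): pure speedup. Each RT is multiplied by the same
# factor, so the SD falls at the same rate as the mean and the CV stays put.
sped_up = [0.6 * rt for rt in early]

# Scenario B (assumed): restructuring. The mean falls to the same level as in
# Scenario A, but the SD falls much more steeply, so the CV decreases.
restructured = [580.0, 620.0, 600.0, 640.0, 560.0]

cv_early = coefficient_of_variability(early)
cv_sped_up = coefficient_of_variability(sped_up)
cv_restructured = coefficient_of_variability(restructured)

print(f"early:        mean={mean(early):.0f}  CV={cv_early:.3f}")
print(f"sped up:      mean={mean(sped_up):.0f}  CV={cv_sped_up:.3f}")
print(f"restructured: mean={mean(restructured):.0f}  CV={cv_restructured:.3f}")
```

Both later blocks are equally fast on average, so mean RT alone cannot distinguish them; only the CV separates the two scenarios.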
Currently, the CV has been most widely used to investigate the automaticity of lexical processing, measured through such tasks as a lexical decision task (e.g., Segalowitz & Segalowitz, 1993; Segalowitz et al., 1998; Elgort, 2011) and a semantic classification task (e.g., Phillips et al., 2004; Segalowitz & Freed, 2004; Hui, 2020; Maie & Godfroid, 2022). An issue, however, is whether the CV can also be a valid measure of automaticity at the sentence-level performance. While the first validation attempt by Hulstijn, Van Gelderen, and Schoonen (2009) showed that the validity of the CV may be limited to lexical-processing tasks only, a conceptual replication by Lim and Godfroid (2015) showed that the CV can be as useful in the study of sentence processing (using sentence construction and verification tasks) as in the study of lexical processing. More recent findings seem to support the conclusion of Lim and Godfroid (see Y. Suzuki, 2018; Pili-Moss et al., 2020; but cf. McManus & Marsden, 2019). Segalowitz and Segalowitz (1993) and subsequent research called the CV a measure of “automatization”, but they neither explicated the exact mechanism of automatization nor specified what level on the continuum of developing automaticity (see Figure 1.2) the reduction in the CV corresponds to. Theoretical models of skill acquisition (Sections 1.3.1 and 1.3.2) largely argue that the putative qualitative changes in the underlying processes occur at the initial level of development, causing the early abrupt reduction in performance times. In Section 1.2.1, I defined the term automatization to refer to the latter (and flatter) end of the power-law curve. Hence, the definition of automatization as used by Segalowitz and Segalowitz (1993) does not match what is considered automatization in this dissertation. Using the terminology from the three-stage model of skill acquisition (Anderson, 1982, 1983b, 2007), the CV is a measure of proceduralization rather than automatization.
One last issue surrounding the use of the CV concerns how it may decrease as a function of practice. Does it linearly decrease with the amount of practice, or does the decrease follow some specific mathematical function such as a power function? A common finding in skill acquisition research is that RT tends to deviate from the power-law curve at the very initial level of development (e.g., Delaney et al., 1998; Logan, 1988; Newell & Rosenbloom, 1981; Neves & Anderson, 1981; Rickard, 1997). This is primarily because initial performance is slow and hence subject to higher variability. Intuitively, it is thus possible to predict that the CV (as a measure of performance stability and variability) may not decrease smoothly in the early phase of learning. In a similar vein, Solovyeva and DeKeyser (2018) focused on cases where the CV initially increases rather than decreases as a function of practice. They argued that the increase could be due to the addition of a new knowledge representation: “If reductions in the CV result from the elimination of (inefficient) component processes or their restructuring, increases in the CV should signal the opposite—the addition of component processes or new representations” (p. 228). Later, Hui (2020) also demonstrated that the CV can initially increase before it starts to decrease as an index of automaticity. However, a recent longitudinal study of practice (using an artificial language) by Pili-Moss et al. (2020) found that the CV simply decreased without the initial increase. At present, the issue is far from being settled. Nonetheless, the CV has the potential to index the changes in the functioning of the underlying processes as long as its use is accompanied by a detailed analysis of how it changes as a function of practice.

1.3 Models of Skill Acquisition

In cognitive psychology, several models of skill acquisition propose competing explanations of how learners achieve automaticity through repeated practice.
Crucially, the models are classified according to whether they are based on a rule-based approach or an item-based approach. In this section, I first review the three-stage model as the leading theory of the rule-based approach. Specifically, I draw on Fitts and Posner’s (1967) (also see Fitts, 1964) three-stage model of learning and Anderson’s Adaptive Control of Thought (ACT) theory (Anderson, 1982, 1983b, 1992, 2007; Anderson et al., 2004, 2019) to describe how the three-stage model provides a mechanistic explanation of skill acquisition and the development of automaticity (Section 1.3.1). Rival models are then introduced, and I specifically focus on the Race model (Compton & Logan, 1991; Logan, 1988, 2002) as a major theoretical model from the item-based approach, and the Component Power Laws (CMPL) theory (Bajic & Rickard, 2011; Rickard, 1997, 1999, 2004) as a model that combines the rule-based approach and the item-based approach (Section 1.3.2). The three theoretical models advance rival hypotheses regarding the number and the nature of skill acquisition stages and how they account for the phenomena widely observed in skill acquisition research (Section 1.3.3).

1.3.1 The Three-Stage Model

In theorizing the process of skill acquisition, Fitts (1964) made the first observation that learning a skill seems to consist of three phases, with transitions between stages caused by “gradual shifts in the factor structure of skills, or in the nature of processes (strategies and tactics, executive routines and subroutines) employed” (p. 261). This proposition of the “gradual shifts” assumed that each learning phase involved distinct cognitive processes. Later, Fitts and Posner (1967) summarized the three stages as the cognitive stage, the associative stage, and the autonomous stage.
Although Fitts and Posner originally proposed the three stages within the context of perceptual-motor skill learning, they believed that the same theory applies to the learning of cognitive and linguistic skills as well (see Fitts, 1964, p. 243). The cognitive stage is characterized by initial slow and controlled performance in which learners must encode the skill into some crude form that can produce the target behavior. Executing a skill in this stage is mentally taxing because the encoding process heavily relies on one’s working memory. Hence, learners are commonly observed to use working memory to verbally rehearse information required to execute the skill (i.e., verbal mediation). The necessity of this verbal mediation drops out in the associative stage because learners develop direct behavioral routines to execute the skill, and this direct route significantly reduces one’s time on task. However, the newly developed procedure can be error-prone and variable in its application, so it requires further practice to be applied more correctly and efficiently. Eventually, one achieves the autonomous stage after a long period of practice applying the direct procedure. At this point, the execution of the skill becomes automatic, which means that the skill can be performed effortlessly and simultaneously with another task that is cognitively demanding. The three-stage model by Fitts and Posner (1967) remained agnostic on the specific psychological mechanisms responsible for learning during each stage. Anderson’s Adaptive Control of Thought (ACT) theory (Anderson, 1982, 1983b) and its computer-implemented cognitive architecture, Adaptive Control of Thought-Rational (ACT-R: Anderson, 2007; Anderson et al., 2004, 2019; Anderson & Fincham, 1994; Anderson & Lebiere, 1998), extended the three-stage model by adding cognitive explanations of how learners come to master skills through the three stages.
A cognitive architecture such as ACT-R is a theory of how human cognition learns and organizes knowledge to produce intelligent behaviors (see Anderson, 2007 for a more extensive treatment of how ACT-R simulates human cognition). In particular, ACT-R is specifically designed to account for the acquisition of cognitive skills. In ACT-R, knowledge is represented as either declarative knowledge or procedural knowledge. Declarative knowledge is the knowledge of factual information (i.e., episodic and semantic knowledge), whereas procedural knowledge is the knowledge of how to perform a given skill. ACT-R implements declarative knowledge as “chunks” of information (Miller, 1956), and procedural knowledge is represented as sets of production rules. A production rule is a primitive rule in the form of a condition-action pair (or an IF-THEN conditional, see Table 1.1 below), which encodes a cognitive contingency such that when(ever) the condition is met, the action is performed (Anderson, 1982). As Anderson (1983b) explained, “[i]f there is one term that is most central to the ACT theory, it is ‘production’”. The central role of production rules makes ACT-R a rule-based approach. Skill acquisition in ACT-R begins with learners gaining declarative knowledge about a skill through receiving instruction or observing others acting out the skill. At this initial stage, the only way to execute the skill is by engaging general skill-independent procedures (i.e., general-purpose production rules), which interpret declarative knowledge and produce the target behavior at a rudimentary level.
According to Anderson (1983b), there are at least three ways in which declarative knowledge can be interpretively used by the general-purpose production rules: by (a) faithfully following declarative information that takes the form of direct instruction (e.g., a recipe); (b) using general problem-solving algorithms that will work out the skill using general knowledge within a domain (e.g., for unknown arithmetic problems, using general knowledge of mathematics); and (c) using analogy-forming procedures that map a declarative representation of a previously observed behavior onto a new behavior. The variety of ways in which declarative knowledge can be used to guide a behavior creates the flexibility of how human cognition learns in different learning environments. The flexibility of learning at this stage is also crucial because a skill becomes hard to modify after it is extensively practiced (i.e., skill specificity; see Section 1.1.2). This phase of learning corresponds to the cognitive stage in Fitts and Posner’s model. While the interpretation of declarative knowledge has the advantage of flexibility, it is often a slow and costly process because it requires declarative information to be retrieved from long-term memory and maintained in working memory. As a result, learners develop skill-specific procedures, or procedural knowledge, to optimize their performance. These procedures are still production rules, but they are skill-specific so that they can be applied directly, without the mediation of interpretive productions. ACT-R implements this process through knowledge compilation, a process subsuming two mechanisms: composition and proceduralization (Anderson, 1986, 1987; Neves & Anderson, 1981). Composition collapses sequences of production rules into one larger production, and proceduralization creates a novel procedure that is specific to the skill being practiced.
The new, proceduralized production no longer requires the interpretation of declarative knowledge because the production directly incorporates declarative knowledge in its procedure. Table 1.1 illustrates the process of knowledge compilation for learning how to produce the word broke as the irregular past tense form of the word break (cf. Taatgen & Anderson, 2002). Declarative knowledge here is the fact that (a) some words take a specific irregular form for the past tense and (b) the past tense form of break is broke.

Table 1.1. Production rules to produce the past tense form of the word break.

Interpretive use of declarative knowledge
P1: IF the goal is to produce a past tense of a word
    AND there is a specific form for the past tense of the word
    THEN set the answer of the goal to retrieving the specific form.
P2: IF the goal is to retrieve a specific form for the past tense of a word
    AND the word is break
    THEN set the answer of the goal to retrieving the past tense form of break.
P3: IF the goal is to retrieve the past tense form of break
    AND the past tense form of break is broke
    THEN set the answer of the goal to retrieving broke.

Macro-production created by composition
P4: IF the goal is to produce a past tense of a word
    AND there is a specific form for the past tense of the word
    AND the word is break
    AND the past tense form of the word is broke
    THEN set the answer of the goal to retrieving broke.

Skill-specific production created by proceduralization
P5: IF the goal is to produce the past tense form of the word break
    THEN set the answer of the goal to retrieving broke.

Initially, producing the past tense form broke requires three production rules that need to be executed serially (i.e., P1 → P2 → P3). However, composition collapses the three productions into one larger macro-production, and this process greatly reduces one’s time on task because what used to take three productions is now done by a single production.
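The productions in Table 1.1 can be mimicked with a toy interpreter. The sketch below is not ACT-R and does not use any ACT-R software; the dictionary encoding of declarative chunks, the goal labels, and the helper names (retrieve, run) are all hypothetical simplifications introduced for illustration. It counts how many productions fire and how many declarative retrievals are made under the interpretive rules (P1 to P3), the composed macro-production (P4), and the proceduralized production (P5).

```python
# Hypothetical encoding of the two declarative facts as chunks.
DECLARATIVE = {
    ("has_specific_past_form", "break"): True,
    ("past_form", "break"): "broke",
}


def retrieve(state, key):
    """Look up a declarative chunk and count the retrieval."""
    state["retrievals"] += 1
    return DECLARATIVE.get(key)


def p1(s):
    if s["goal"] == "produce_past" and retrieve(s, ("has_specific_past_form", s["word"])):
        s["goal"] = "retrieve_specific_form"
        return True
    return False


def p2(s):
    if s["goal"] == "retrieve_specific_form" and s["word"] == "break":
        s["goal"] = "retrieve_past_of_break"
        return True
    return False


def p3(s):
    if s["goal"] == "retrieve_past_of_break":
        s["answer"] = retrieve(s, ("past_form", "break"))
        s["goal"] = "done"
        return True
    return False


def p4(s):
    # Macro-production: one firing, but both declarative facts are still consulted.
    if (s["goal"] == "produce_past" and s["word"] == "break"
            and retrieve(s, ("has_specific_past_form", "break"))):
        s["answer"] = retrieve(s, ("past_form", "break"))
        s["goal"] = "done"
        return True
    return False


def p5(s):
    # Proceduralized production: the declarative facts are built into the rule.
    if s["goal"] == "produce_past" and s["word"] == "break":
        s["answer"] = "broke"
        s["goal"] = "done"
        return True
    return False


def run(productions, word):
    """Fire matching productions until the goal is done; return (answer, firings, retrievals)."""
    s = {"goal": "produce_past", "word": word, "answer": None, "retrievals": 0}
    firings = 0
    while s["goal"] != "done":
        for production in productions:
            if production(s):
                firings += 1
                break
        else:
            break  # no production matched, so stop
    return s["answer"], firings, s["retrievals"]


interpretive = run([p1, p2, p3], "break")
composed = run([p4], "break")
proceduralized = run([p5], "break")
print(interpretive, composed, proceduralized)
```

In this toy version, composition reduces the number of firings from three to one, and proceduralization additionally removes the declarative retrievals.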
Proceduralization further takes the macro-production and develops a novel production that is skill-specific (producing broke as the past tense of break). Notice that the two pieces of declarative knowledge, (a) there is a specific form for the past tense of the word and (b) the past tense form of the word is broke, have dropped out in the new production. This is how proceduralization develops procedural knowledge based on declarative knowledge, thereby obviating the necessity of learners’ reliance on working memory. Assuming that automatic behaviors are ones that do not entail attention (as one aspect of automaticity: Schneider & Chein, 2003; Shiffrin & Schneider, 1977), proceduralization is what drives automaticity. The process of knowledge compilation is a transitional stage during which one relies on both declarative and procedural knowledge. Hence, this phase of learning corresponds to the associative stage in Fitts and Posner’s model. Later, compiled productions undergo the process of tuning. Production tuning is a piecemeal process that increases or lowers the probability of a given production rule being chosen as the method for the skill (see also Rumelhart & Norman, 1978 for a review of tuning as part of the fundamental human learning mechanisms). Often, there are multiple ways to carry out the same skill, and learners must search for the best method to perform the task. To achieve the optimal balance between the cost and benefit of finding the solution, the search process must be both correct and fast. Anderson, Kline, and Beasley (1980) proposed three subprocesses (of tuning) to make this possible: generalization, discrimination, and strengthening. While generalization makes a production rule broader in its applicability, discrimination makes the scope of the production narrower. Strengthening controls the probability of selecting among different production rules such that more useful rules are strengthened and poorer rules are weakened.
Although its process is more gradual than that of knowledge compilation, tuning also increases one’s speed in performing the skill. Note that knowledge compilation is a qualitative change in the underlying cognitive processes, whereas tuning is a quantitative change that incrementally optimizes the same process. Additionally, ACT-R also allows declarative strengthening, which increases the accuracy and the speed of retrieving declarative knowledge per se. In Fitts and Posner’s model, this phase corresponds to the autonomous stage. Before concluding the review of the three-stage model, it is worth clarifying the distinction between the concept of within-stage speedup and between-stage speedup. When one increases the speed to perform a skill, an individual can become faster either within or between stages. The within-stage speedup results from improving the same psychological process; that is, by a continual quantitative change in the underlying mechanism. In contrast, the between-stage speedup is caused by qualitative changes by shifting the underlying cognitive process (or strategy) to a more efficient process. The major prediction of the three-stage model is that a majority of the performance speedup is ascribed to between-stage speedup (see Tenison & Anderson, 2016 for empirical support). This is because the between-stage speedup, such as knowledge compilation, not only increases the speed of performance but also makes its procedures more efficient.

1.3.2 Rival Models of Skill Acquisition

The Race model is an item-based approach to the mechanism of skill acquisition and the development of automaticity that has been cited as widely as the ACT theory in the cognitive psychology literature. The model was proposed by a cognitive psychologist, Gordon Logan, as part of the instance theory of automatization (with a larger sense than defined in this dissertation) (Compton & Logan, 1991; Lassaline & Logan, 1993; Logan, 1988, 1992, 2002).
The Race model claims that each experience of carrying out a skill becomes encoded in memory as an “instance”. Although the model does not explicate what constitutes an instance, it is considered a type of memory trace that contains contextual and task-specific cues that are relevant to the skill. The more instances learners accrue in memory, the faster their performance becomes. Initially, learners perform a skill using general problem-solving algorithms (e.g., directly calculating an answer for an arithmetic problem), but they begin to rely more on retrieving past solutions as the number of instances increases with practice. The model assumes that algorithm and retrieval run in parallel; initially, algorithm strategies outpace retrieval, but the gradual accumulation of instances (with themselves racing against each other) speeds up the retrieval process, which eventually takes over. A skill is thus considered automatic “when it is based on single-step direct-access retrieval of past solutions from memory” (Logan, 1988, p. 493). Note that the model does not posit any speedup in the use of algorithms. Rather, any reductions in performance latency must be attributed to a single mechanism of accumulating instances. Because the same cognitive state (i.e., the parallel execution of algorithmic processing and memory retrieval) is maintained throughout the entire process of skill acquisition, the Race model is a one-stage model that conceptualizes the development of automaticity as a quantitative change (see also Section 1.3.3 for conceptualizing the Race model as a one-stage model). While the Race model accounts for the practice-induced speedup using the process of accruing instances, the Component Power Laws (CMPL) Theory offers a slightly different mechanism of skill acquisition (Rickard, 1997, 1999, 2004; see also Bajic & Rickard, 2011 for a recent treatment of the theory).
The CMPL theory maintains that the transition from algorithms to memory retrieval drives the development of automaticity. However, it does not assume that the two processes run in parallel. Rather, learners choose between the two from the outset, either applying algorithms or retrieving the answer from memory. More importantly, the CMPL theory suggests that the speedup not only applies to the memory retrieval process (as in the Race model) but also to the execution of algorithmic processing. In the CMPL theory, the speedup in skill execution is driven by two forces. On the one hand, learners learn to behave more efficiently by shifting the task strategy from algorithms to direct memory retrieval (between-stage speedup). On the other hand, algorithmic processing and the retrieval process can themselves be accelerated (within-stage speedup). This is why Rickard (1997, 1999) proposed that two power functions, one for algorithm use and the other for memory retrieval, better account for experimental data (see Section 1.1.1). This makes the CMPL theory a two-stage model. In summary, existing models of skill acquisition offer some insights into how skill acquisition may occur in the learning of cognitive skills. ACT-R, the Race model, and the CMPL theory all share the assumption that skills are initially performed using general-purpose mechanisms, but later, skill-specific procedures (or specific memory instances) take over as learners accumulate experience using the skills. The Race model conceives the development of automaticity as a single quantitative change (i.e., a single stage). The CMPL theory conceptualizes two stages, with the first stage characterized by performance based on general algorithms and the second stage marked by direct memory retrieval.
ACT-R, on the other hand, advocates three stages, with the first stage characterized by reliance on declarative knowledge (the cognitive stage), the second marked by a mixture of both declarative and procedural knowledge (the associative stage), and the third dominated by procedural knowledge (the autonomous stage). Qualitative changes in the underlying cognitive processes drive the progression through the skill acquisition stages; for the CMPL theory, it is the shift from problem-solving algorithms to memory retrieval, and for ACT-R, the first shift is caused by knowledge compilation (by which procedural knowledge is developed) and the second is marked by the end of knowledge compilation, at which point only the long process of production tuning remains. Acceleration in skill execution is driven by two forces. On the one hand, learners accelerate within stages by gradually refining the same process. On the other hand, they can optimize the performance between stages by making qualitative changes to the underlying cognitive processes. The latter not only increases the speed of performance but also makes its procedures more efficient.

1.3.3 The Explanatory Power and Empirical Predictions of the Skill Acquisition Models

The power-law of practice and the skill specificity are two empirical phenomena that are widely observed in skill acquisition research (see Section 1.1). The three theoretical models reviewed previously provide unique explanations of how the proposed learning mechanisms give rise to the phenomena. The Race model holds that the power-law of practice results from two counteracting factors (see Logan, 1988, p. 496). As learners accumulate instances in memory, there are more opportunities to achieve extreme values of low performance times, thereby creating the speedup in performance. At the same time, the more extreme values they observe, the lower the likelihood of achieving even faster performance times, so the speedup decelerates.
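The two counteracting factors can be illustrated with a small simulation of racing instances. The sketch below rests on assumptions that are not stated in the text above: retrieval times of individual instances are treated as independent draws from a Weibull distribution, the scale and shape values are arbitrary, and the algorithm is left out so that only the retrieval race is shown. Under these assumptions, the fastest of n instances is itself Weibull-distributed with its scale multiplied by n^(-1/shape), so the mean retrieval time is a power function of the number of instances.

```python
import math
import random

SCALE_MS = 1000.0  # hypothetical Weibull scale for a single instance
SHAPE = 2.0        # hypothetical Weibull shape


def simulated_mean_rt(n_instances, n_trials=4000, seed=0):
    """Mean winning time when n_instances race and the fastest one responds."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_trials):
        total += min(rng.weibullvariate(SCALE_MS, SHAPE) for _ in range(n_instances))
    return total / n_trials


def analytic_mean_rt(n_instances):
    """Expected minimum of n i.i.d. Weibull draws (scale shrinks by n ** (-1 / shape))."""
    return SCALE_MS * n_instances ** (-1.0 / SHAPE) * math.gamma(1.0 + 1.0 / SHAPE)


instance_counts = [1, 2, 4, 8, 16]
simulated = [simulated_mean_rt(n) for n in instance_counts]
analytic = [analytic_mean_rt(n) for n in instance_counts]

for n, sim, ana in zip(instance_counts, simulated, analytic):
    print(f"instances={n:2d}  simulated={sim:7.1f} ms  analytic={ana:7.1f} ms")
```

Each doubling of the number of instances shortens the expected winning time, but by a smaller absolute amount than the previous doubling, which is the decelerating speedup described above.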
Since the CMPL theory is basically an extension of the Race model, it makes the same prediction for the power-law of practice. Interestingly, neither the Race model nor the CMPL theory provides an explicit description of why extensively practiced skills become less available for transfer. In this light, ACT-R offers higher explanatory power because it accounts for both the power-law of practice and the skill specificity. Concerning the power-law of practice, Anderson and Fincham (1994) stated that the initial reduction in performance times “would reflect the compilation of the production rule [i.e., knowledge compilation], and the remaining power-law learning would reflect the accumulation of production strength [i.e., production tuning]” (p. 1324). Knowledge compilation leads to abrupt changes in one’s performance times because it fundamentally restructures the underlying mechanism of performance (see Section 1.3.1 and Table 1.1). In contrast, production tuning is a pure optimization process, so its benefit diminishes as the performance gradually approaches the optimal level. ACT-R further explains that the skill specificity results from proceduralization and production tuning (Anderson & Fincham, 1994; Anderson & Lebiere, 1998; Singley & Anderson, 1989). Once production rules are proceduralized, they become specific to the skill being practiced. Furthermore, production tuning adjusts the applicability of the production rules so that they apply in specific conditions or contexts only. More importantly for the aim of this dissertation, the three theoretical models propose opposing hypotheses regarding the number and the nature of skill acquisition stages. Figure 1.3 summarizes how each model accounts for the practice-induced speedup in relation to the number and the nature of skill acquisition stages.
Notice that the number of stages, and hence the number of distinct cognitive states, corresponds to the number of power functions required to account for the performance speedup (see Section 1.1.1; Delaney et al., 1998; Rickard, 1997, 1999; Tenison & Anderson, 2016 for findings that each cognitive phase requires separate power functions). The Race model (the top panel) attributes any reductions in performance times to the mechanism of instance accumulation. Since there is only one mechanism improving the same cognitive state (i.e., the parallel implementation of problem-solving algorithms and memory retrieval), the model predicts that a single power function best fits experimental data. Hence, any speedup occurs within a single stage as a continuous quantitative change in the underlying mechanism (Logan, 1988, 1992).

Figure 1.3. Model predictions of practice-induced speedup. Note. The top panel shows the prediction from the Race model, the middle panel shows the CMPL theory, and the bottom panel shows ACT-R.

In contrast, the CMPL theory (the middle panel) allows for the speedup both within and between stages. Algorithmic processing and memory retrieval themselves can be accelerated (i.e., within-stage speedup), but more speedup is expected by shifting the cognitive processes (i.e., between-stage speedup). Therefore, the CMPL theory incorporates both quantitative and qualitative changes in the underlying mechanism. Two power functions, one for algorithmic processing and the other for memory retrieval, are hypothesized to best account for latency data (Rickard, 1997, 1999). Lastly, ACT-R (and Fitts and Posner’s theory of learning) (the bottom panel in Figure 1.3) predicts that skill acquisition is a three-stage process. Each phase (i.e., the cognitive, associative, and autonomous stage) requires separate power functions to account for the practice-induced speedup.
Learners can increase the speed of performance both within and between stages, but the largest drops in performance times take place between stages, when learners transition from the cognitive stage to the associative stage and from the associative to the autonomous stage (Anderson, 1982, 1983b; Tenison & Anderson, 2016).

Testing these model-based implications in empirical research is crucial for skill acquisition research because it allows researchers to pit rival theoretical models against each other. Recently, a line of research in cognitive psychology has attempted to test the model predictions within the framework of cognitive modeling (i.e., using computer and mathematical simulations to model human cognition). In particular, Tenison and Anderson (2016) (see also Tenison et al., 2016) conducted an innovative study to investigate the tenability of the three models within cognitive skill acquisition. In the next section, I turn to this line of research because it provides a model approach for how theoretical models of skill acquisition can be tested in L2 contexts.

1.4 Testing the Models of Skill Acquisition

Figure 1.3 in Section 1.3.3 showed that testing rival models of skill acquisition boils down to the question of whether the practice-induced speedup is caused by quantitative changes or qualitative changes in the cognitive processes underlying skill performance. The number of learning stages (or the number of distinct cognitive states) corresponds to the number of power functions required to account for latency data. However, what exactly does it mean to fit a power function to each stage of learning? Analytically, the following updated form of the power function can be used to test the number of learning stages:

$T_{i,j} = I + \beta_i j^{-\alpha}$

In addition to the basic form of the power function discussed in Section 1.1.1 (i.e., $T = I + \beta N^{-\alpha}$), this formula incorporates two subscripts, $i$ for each learning stage and $j$ for each practice trial.
Hence, $T_{i,j}$ denotes the time to perform a skill in stage $i$ on trial $j$, and $\beta_i$ indicates the slope of the power function specific to each learning stage. The number of learning stages is thus represented as the number of power-function slopes that are estimated. For instance, the Race model requires a single slope ($\beta$) because it assumes that skill acquisition takes place within a single stage. In contrast, the CMPL theory and ACT-R claim multiple learning stages, each of which requires a different slope parameter. In (linear) regression modeling, this is analogous to having varying slopes for learning stages or treating the stages as separate dummy-coded predictor variables (e.g., $y = a + b_{\text{stage1}} + b_{\text{stage2}} + b_{\text{stage3}} + \varepsilon$). In this light, testing the three theoretical models entails evaluating the plausibility of different power-function models that incorporate varying numbers of slope parameters.

One criticism against the power-law of practice is that the power function may not apply to individual data (see Section 1.1.1). Traditionally, skill acquisition researchers have applied a power function to aggregated data after averaging raw data across individual participants and items (e.g., Anderson, 1981; Logan, 1988; MacKay, 1982; Newell & Rosenbloom, 1981). However, this practice of (only) fitting a single function over the entire sample makes the unrealistic assumption that individual learners become faster at the same rate and that every learner transitions to subsequent stages after the same amount of practice. The ideal method thus requires a type of statistical model that enables researchers to fit a power function to individual data points while at the same time evaluating the feasibility of the power-function model at the group level.
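As a concrete illustration of the stage-specific power function described above, the following Python sketch computes a predicted performance time from an intercept, one slope per stage, and a shared exponent. The function name, the parameter values, and the choice to index the trial cumulatively across stages are assumptions made for illustration only; they are not taken from the models or data discussed in this dissertation.

```python
def predicted_time(trial, stage, intercept, slopes, alpha):
    """Predicted time T[i, j] = I + beta_i * j ** (-alpha).

    trial     -- practice trial j (1-based; assumed cumulative across stages)
    stage     -- index i of the learning stage (0-based)
    intercept -- asymptote I shared by all stages
    slopes    -- list of stage-specific slopes beta_i
    alpha     -- learning-rate exponent, assumed to be shared by all stages
    """
    return intercept + slopes[stage] * trial ** (-alpha)


# Hypothetical parameter values, chosen only to make the arithmetic easy to follow.
one_stage = [2000.0]          # a single slope, as a one-stage account would require
two_stage = [2000.0, 800.0]   # one slope per stage, as a two-stage account would require

print(predicted_time(4, 0, 500.0, one_stage, 0.5))  # trial 4 under the single slope
print(predicted_time(4, 1, 500.0, two_stage, 0.5))  # trial 4 after moving to stage 2
```

Under these made-up values, moving to a stage with a smaller slope lowers the predicted time on the same trial, which is the between-stage drop that a single-slope function cannot express.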
Furthermore, unaggregated data are inherently subject to (more) variability, so the method must be able to detect whether the trial-by-trial speedup is due to sheer performance (or sampling) error or to systematic learning effects triggered by practice. One candidate means to avoid modeling aggregated data is to apply a class of multilevel power-function models that incorporate varying intercepts and slopes for participants and items (e.g., mixed-effects models) (see Gelman & Hill, 2007 for how). The estimation of participant- and item-specific parameters makes it possible to test whether the form of the power function dictated by theoretical models makes sense at every level of analysis (i.e., participants, items, and the entire group); hence, it enables researchers to examine the number of skill acquisition stages at the individual and the group level within one coherent model. However, one critical limitation of the method is that it provides too little information regarding how fast individual learners move through learning stages or which stage they may reside in after a given number of practice trials.

In a recent study of cognitive skill acquisition, Tenison and Anderson (2016) took advantage of a technique called hidden Markov modeling to overcome the limitation. The hidden Markov model (HMM) is a stochastic time-series model consisting of a Markov chain, a mathematical system that represents a sequence of states. A Markov chain makes an assumption that a given state is only dependent on the previous state (i.e., the Markov assumption). The HMM is a special type of Markov chain that treats the actual states as hidden (and hence latent) but whose probability can be estimated based on observed data.
It is most commonly used as a stochastic pattern-recognition method in computational linguistics, such as research in speech recognition, where the HMM is used to segment and identify words based on temporal structures in speech as well as a database of various sounds in a variety of languages (Rabiner, 1989; see Chan, Verspoor, & Vahtrick, 2015 for using an HMM in L2 research). Figure 1.4 provides a visual illustration of the structure of an HMM applied to skill acquisition data (see Section 2.3.2 for the mathematical rendition of the HMM used in this dissertation study). In this example, Markov states represent two learning stages. Using a vector of RTs as the dependent variable (RT1–RT5), the HMM provides two types of likelihood: (a) transition probability, the probability of individuals eventually moving from one state to another (i.e., from State 1 to State 2), and (b) emission probability, the probability of a data point being generated by a given state. Assuming two learning stages, the transition probability in Figure 1.4 means that learners eventually transition to the second stage with a probability of .96. In contrast, the emission probability of RT5 in State 2, for example, is .90, which means that, with a probability of .90, learners have already transitioned to the second stage at the fifth practice trial. Of particular interest to the current discussion is the emission probability because it can be used as the estimate of which learning stage participants reside in after a particular number of practice trials. The estimation of the HMM parameters considers every trajectory of learners transitioning to subsequent stages after every number of practice trials and provides the relative likelihood of the trajectories given the data.

Figure 1.4. The illustration of a hidden Markov model applied to skill acquisition data.

In the study, Tenison and Anderson (2016) examined how learners developed fluency in an arithmetic task called a Pyramid problem.
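To make the logic of the HMM described above more tangible, the following Python sketch computes, for a single latency series, the probability that a learner is in each hidden stage on each trial, using the standard forward-backward algorithm. It is a deliberately simplified sketch and not the HMM used by Tenison and Anderson (2016) or in this dissertation: it assumes that learners start in the first stage, that stages can only be traversed in order, and that RTs in each stage follow a normal distribution with a fixed mean and standard deviation rather than a power function. All function names and numeric values are hypothetical.

```python
import math


def normal_pdf(x, mean, sd):
    """Density of a normal distribution (the assumed emission model)."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def stage_posteriors(rts, means, sds, stay_probs):
    """Return P(stage | all RTs) for every trial of one latency series.

    rts        -- observed RTs for one learner on one item
    means, sds -- assumed emission parameters, one pair per stage
    stay_probs -- probability of remaining in each non-final stage
    """
    n_states, n_trials = len(means), len(rts)

    # Left-to-right transitions: stay, or advance by one stage; last stage absorbs.
    trans = [[0.0] * n_states for _ in range(n_states)]
    for k in range(n_states - 1):
        trans[k][k] = stay_probs[k]
        trans[k][k + 1] = 1.0 - stay_probs[k]
    trans[-1][-1] = 1.0

    emit = [[normal_pdf(rt, means[k], sds[k]) for k in range(n_states)] for rt in rts]

    # Forward pass (rescaled on every trial to avoid numerical underflow).
    first = [emit[0][0]] + [0.0] * (n_states - 1)   # learners start in stage 1
    forward = [[v / sum(first) for v in first]]
    for t in range(1, n_trials):
        row = [emit[t][k] * sum(forward[t - 1][j] * trans[j][k] for j in range(n_states))
               for k in range(n_states)]
        forward.append([v / sum(row) for v in row])

    # Backward pass (also rescaled).
    backward = [[1.0] * n_states for _ in range(n_trials)]
    for t in range(n_trials - 2, -1, -1):
        row = [sum(trans[k][j] * emit[t + 1][j] * backward[t + 1][j] for j in range(n_states))
               for k in range(n_states)]
        backward[t] = [v / sum(row) for v in row]

    # Combine both passes and normalize per trial.
    posteriors = []
    for t in range(n_trials):
        row = [forward[t][k] * backward[t][k] for k in range(n_states)]
        posteriors.append([v / sum(row) for v in row])
    return posteriors


# Hypothetical latency series with an abrupt drop after the third trial.
example = stage_posteriors([2000, 1900, 2100, 900, 850, 800],
                           means=[2000.0, 850.0], sds=[150.0, 150.0], stay_probs=[0.8])
for trial, row in enumerate(example, start=1):
    print(trial, [round(p, 3) for p in row])
```

In an actual analysis, the emission and transition parameters would be estimated from the data, and models with different numbers of states would be compared on their fit rather than fixed in advance as they are here.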
A typical item of the Pyramid problem takes the form of “base$height” where the base is the starting number, and the height indicates the number of terms to be summed, with each term being one less than the previous one. For instance, the answer for the problem 8$3 can be found as 8 + 7 + 6 = 21. In the study, participants (N = 23) practiced solving 21 unique problems over six blocks of 36 trials. Of the item set, three items were chosen as practice problems that were repeated 36 times, while the remaining 18 problems were practiced only twice as novel problems. The researchers applied an HMM to the latency history of each individual participant on each practice item. As detailed in Section 2.3.2, the HMM adopted by Tenison and Anderson (2016) was a unique HMM that yielded the probability of each participant being in a given stage by considering how consistent the data are with a power function that incorporated varying numbers of slope parameters (e.g., the CMPL theory requires two slopes). Since the number of slopes corresponds to the number of learning stages (see Section 1.3.3), the number of HMM states also corresponds to the number of slopes in the power function. Comparing HMMs that assumed one to five states, Tenison and Anderson found that the three-state HMM showed the best fit, while the one- and two-state HMMs were substantially less consistent with the data.

Tenison and Anderson (2016) hypothesized that the three stages identified by the HMM analysis corresponded to the cognitive, the associative, and the autonomous stage, following the three-stage theory by Fitts and Posner (1967).
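The direct calculation that a Pyramid problem calls for can be written out in a few lines of Python. The function name is hypothetical; the rule itself (start at the base and sum as many terms as the height, each one less than the last) follows the description of the task given above.

```python
def pyramid(base, height):
    """Solve a Pyramid problem written as base$height.

    Sums `height` terms, starting at `base` and decreasing by one each time.
    """
    return sum(base - step for step in range(height))


print(pyramid(8, 3))  # 8 + 7 + 6 = 21, the worked example from the text
```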
Specifically, they drew on the learning mechanisms proposed by ACT-R and suggested that the first stage involved a sequence of direct calculations to find the answer, the second stage involved an effortful retrieval of the past solution from memory, and the third stage involved an automatic recognition of the problem such that the solution is produced as a reflex. To validate the prediction, Tenison and Anderson collected neural signatures using functional magnetic resonance imaging (fMRI) and examined how brain activation patterns changed as the participants transitioned from the first to the second stage and from the second to the third stage. The researchers conducted a region of interest analysis on specific regions that were hypothesized to be engaged in skill acquisition of the Pyramid problem: (a) the left horizontal intraparietal sulcus (HIPS) for numerical processing in the first stage, (b) the left lateral inferior prefrontal cortex (LIPFC) for effortful retrieval in the second stage, and (c) the left fusiform gyrus for visual recognition of stimuli and the left motor cortex for motor responding in the third stage. A regression analysis of the fMRI data showed that the left HIPS was more strongly activated when the participants were in the first stage than in the second stage, the left LIPFC was more actively involved in the second stage than in the third stage, and the left fusiform gyrus and the left motor cortex were most strongly activated in the third stage. These findings, combined with the results of the HMM analysis, provided convincing evidence that cognitive skill acquisition (at least of the Pyramid problem) is a three-stage process and that the cognitive process involved in each of these three stages is consistent with the predictions from Fitts and Posner (1967) and ACT-R (Anderson, 2007; Anderson et al., 2004, 2019; Anderson & Lebiere, 1998).
The study by Tenison and Anderson (2016) is one of the rare studies in cognitive psychology that attempted to formally test rival theoretical models of skill acquisition using an appropriate analytical method to overcome the issues that pervaded previous research (e.g., applying a power function to individual data points and determining whether the speedup of performance is due to mere performance error or to systematic learning effects). In SLA, the skill acquisition theory has been one of the major theoretical approaches to explaining the process of L2 learning (DeKeyser, 2020; Lyster & Sato, 2013; Y. Suzuki, 2022). While researchers often cite and adopt various theoretical models of skill acquisition to account for L2 phenomena, the validity of the models is rarely tested vis-à-vis L2 learning. This gap in the literature thus calls for an empirical study that adopts the research design and the analytical method of Tenison and Anderson (2016) in L2 learning contexts. The aim of my dissertation study is to do just that. However, before introducing the study, it is worthwhile to review the available literature on skill acquisition in L2 learning. The literature on L2 skill acquisition mostly concerns (a) providing evidence for the parallel nature of L2 learning and skill acquisition in general and (b) investigating individual differences in L2 skill acquisition and their cognitive correlates.

1.5 Second Language Skill Acquisition

It is clear from the outset that L2 learning can be more complex than typical skill acquisition studied in cognitive psychology (e.g., arithmetic tasks such as the Pyramid problem) because successful performance in L2 requires the coordination of multiple levels of linguistic knowledge (e.g., phonology, vocabulary, morphosyntax, and pragmatics). Furthermore, L2 learning can take place in varying conditions (e.g., classroom learning through instruction vs.
naturalistic learning from mere exposure or usage), which can engage different learning processes (e.g., intentional vs. incidental learning). It is thus no surprise that different approaches to L2 learning espouse different models of skill acquisition. For instance, from a usage-based perspective, N. Ellis (2002) accounted for the relationship between automaticity and frequency effects in language processing using Logan’s (1988) instance theory (of which the Race model is a part). Presumably, this is because the theory is an instance-based approach, which is compatible with the primary role of item-based learning in usage-based linguistics. In contrast, DeKeyser’s (1997) seminal study on automatization of L2 morphosyntax and his accounts of L2 skill acquisition are based on the concepts in ACT-R (e.g., production rules and declarative/procedural knowledge) (see DeKeyser, 2020). Individual models (or parts thereof), therefore, should be useful to different extents, depending on different conditions and linguistic targets of learning (DeKeyser, 2001). The instance theory, for instance, may be useful for explaining incidental learning of vocabulary items, whereas ACT-R may be informative in accounting for learning of morphosyntax under instructed settings.

To date, there has been considerable evidence in SLA that points to the parallel nature of L2 learning and (cognitive) skill acquisition (e.g., De Jong, 2005; DeKeyser, 1997; Ferman et al., 2009; Li & DeKeyser, 2017; Robinson, 1997; Robinson & Ha, 1993; Y. Suzuki & Sunada, 2020). In this section, I review and discuss the existing literature on L2 skill acquisition. In Section 1.5.1, I focus on evidence for conceptualizing L2 learning as a type of skill acquisition. I achieve this aim by specifically discussing available evidence for the power-law of practice and the skill specificity in L2 contexts. In Section 1.5.2, I focus on individual differences in L2 skill
acquisition and what kinds of cognitive correlates predict the individual differences. Such research has the potential of revealing what cognitive mechanisms underlie L2 skill acquisition.

1.5.1 Evidence for Second Language Skill Acquisition

The most extensive evidence for L2 skill acquisition comes from the seminal study by DeKeyser (1997). In the study, participants (N = 61) longitudinally learned and practiced how to comprehend and produce an artificial language called Autopractan for over eight weeks. DeKeyser specifically posed three hypotheses to find evidence for L2 skill acquisition: (a) RT and error rates (of performance) would decrease as the result of practice; (b) the decrease in RT and error rates would follow a power function; (c) resulting competence would become skill-specific such that performing in a reverse condition (i.e., comprehending the language when one extensively practiced production or vice versa) leads to more errors and slower performances. Prior to practice, participants were taught vocabulary and grammar rules of the language by means of a direct presentation, accompanied by an explicit explanation of how sentences in Autopractan can be constructed using the learned words and four case markers. Subsequently, the participants engaged in 15 sessions of comprehension and production practice (1,440 trials in total, with 720 trials each for comprehension and production practice). An observation of RT and error data showed that the decrease in RT clearly followed the power-law of practice ($R^2_{\log(T)\log(N)}$ = .974 and .966 for comprehension and production, respectively), even though the improvement in error rates was less consistent with the power function ($R^2_{\log(T)\log(N)}$ = .613 and .651 for comprehension and production, respectively).
Additionally, practice led to linguistic competence that was skill-specific in that when learners were tested during the final practice session, they made more errors and performed noticeably slower in the opposite mode of language use (i.e., comprehension vs. production). Interestingly, DeKeyser also implemented a within-participant manipulation of single-task and dual-task conditions in the practice tasks so that in half of the practice trials, the participants engaged in a secondary task while simultaneously carrying out the primary task of language practice. As reviewed in Section 1.2.1, one indication of automaticity is that learners experience little interference from performing a secondary (cognitively demanding) task. The results showed that participants initially performed slower and made more errors in the dual-task condition, but the difference between the two conditions disappeared in the final practice session. This finding, together with the evidence for the power-law of practice and the skill specificity, led DeKeyser to conclude that “the learning of second language grammar rules can proceed very much in the same way that learning in other cognitive domains, from geometry to computer programming, has been shown to take place” (p. 214).

Since DeKeyser (1997), many L2 studies have attested to the power-law of practice and the skill specificity, but there is an imbalance in the supporting evidence. Far fewer studies have explicitly tested the power-law of practice than the skill specificity (e.g., N. Ellis & Schmidt, 1998; Ferman et al., 2009; Robinson, 1997), probably because the power-law, being a scientific law, is already well accepted in SLA, or because the exact form of how learners increase the speed or the accuracy of performance is sometimes not of primary interest to L2 researchers.
Despite the limitation, Table 1.2 summarizes the available evidence for the power-law of practice in L2 learning, especially concerning the speed of performance (i.e., log(RT) ~ log(practice)). Although the exact fit of a given power function varies from one study to another, it is clear that when learners are provided with a sufficient amount of practice (i.e., excluding Robinson, 1997, who only had 55 practice trials, which is usually not enough for automatization to take place), the applicability of the power function is robust and consistent. Furthermore, the evidence seems to hold across different types of linguistic target, whether it is the entire language, morphosyntactic structures, or vocabulary items.

Table 1.2. Summary of previous L2 research on the power-law of practice.

Study                       N    R2          Trials   Target
DeKeyser (1997)             61   .96-.97     720      Language
Robinson (1997)             60   .12         55       Grammar
N. Ellis & Schmidt (1998)   7    .74-.97     344      Language
Ferman et al. (2009)        8    .80-.95     1904     Language
Cornillie et al. (2017)     23   .97-.98*    264      Grammar
Hui (2020)                  35   .98*        160      Vocabulary
Pili-Moss et al. (2020)     14   .92*        720      Language
Maie (2021)                 40   .94         320      Vocabulary

Note. N indicates the sample size in the study. * indicates a reanalysis of open-access data or descriptive statistics reported in the article. When the study had multiple target grammatical structures or modes of language use (i.e., comprehension and production), the number of practice trials was divided by the number of targets; however, this was not applied to Hui (2020) and Maie (2021), who studied skill acquisition in learning of 16 vocabulary words. “Language” in the rightmost column indicates that the participants did not possess any knowledge of the language and hence learned and practiced the entire language from scratch.
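The R2 values summarized in Table 1.2 come from regressing log(RT) on log(practice). The following Python sketch shows one minimal way such a fit could be computed with ordinary least squares. It is an illustration under stated assumptions rather than the procedure used in any of the studies above: the function name is hypothetical, the data are simulated, and a log-log regression of this kind implicitly assumes that the asymptote of the power function is zero.

```python
import math


def fit_power_law(trials, rts):
    """Fit log(RT) = intercept + slope * log(trial) by ordinary least squares.

    Returns (slope, intercept, r_squared) on the log-log scale.
    """
    xs = [math.log(n) for n in trials]
    ys = [math.log(rt) for rt in rts]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_squared = (sxy * sxy) / (sxx * syy)
    return slope, intercept, r_squared


# Simulated, noise-free latencies that follow RT = 2000 * trial ** (-0.4).
trials = list(range(1, 21))
rts = [2000.0 * n ** (-0.4) for n in trials]
slope, intercept, r_squared = fit_power_law(trials, rts)
print(slope, math.exp(intercept), r_squared)
```

With real latency data the fit would be less than perfect, and the resulting value is comparable to the entries in the table only if the data were aggregated in the same way as in the original study.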
Conversely, the skill specificity has been captured in many L2 studies, mostly in terms of how the learned product of practice transfers to performance in another task or in another domain (e.g., Allen, 2000; De Jong, 2005; DeKeyser & Sokalski, 1996; Keating & Farley, 2008; Li & DeKeyser, 2017; Morgan-Short & Bowden, 2006; Y. Suzuki & Sunada, 2020; Toth, 2006; VanPatten & Cadierno, 1993a, b). This line of research can be traced back to the classic debate on the superiority of comprehension versus production practice in facilitating L2 learning. VanPatten and Cadierno (1993a, b) and Cadierno (1995) initially showed that comprehension practice (labeled as processing instruction in the original studies) led to increases in both comprehension and production skills, while production practice (traditional instruction in the original studies) only led to development in production skills (see also VanPatten, 2020 for a discussion). Later, DeKeyser and Sokalski (1996) pointed out methodological limitations in the previous studies, including an instructional design that unfairly favored input practice groups and a highly narrow operationalization of production practice that was implemented as a type of mechanical practice (i.e., practice that can be completed without understanding semantically what one is saying) (see DeKeyser, Salaberry, Robinson, & Harrington, 2002 for a more theoretical discussion). DeKeyser and Sokalski showed that when such methodological limitations are taken into account, both practice types give rise to the skill specificity: comprehension practice is most useful in developing comprehension skills and production practice is most effective in developing production skills. A later meta-analysis by Shintani, Li, and Ellis (2013) points to the same finding.
In summary, there is copious evidence for the power-law of practice and the skill specificity in L2 learning, hence showing the plausibility of conceptualizing L2 learning as a type of skill acquisition. What is lacking, however, is empirical research that tests specific models of skill acquisition in L2 contexts. In his review of L2 skill acquisition research, DeKeyser (2020) alluded to this fact: “More importantly for our purposes here, not much research in the field of second language learning has explicitly set out to gather data from second language learners to test (a specific variant of) Skill Acquisition Theory” (p. 88).

In the seminal study, DeKeyser (1997) held the ACT theory as “[t]he most widely accepted theory on how automaticity is brought about” (p. 196). This view is consistent with contemporary reviews of L2 skill acquisition research (DeKeyser, 2020; Lyster & Sato, 2013; Y. Suzuki, 2022). In discussing the external validity of the ACT theory, Anderson (1983b) believed that “all higher-level cognitive functions are achieved by the same underlying architecture” (p. 261). This claim makes language no exception. In fact, Anderson (1983b, Chapter 7) provided examples of how the learning principles in ACT-R can be used to explain the process of first language (L1) acquisition, especially focusing on how children come to learn and produce syntactic structures. However, as DeKeyser (1997, 2001) pointed out, the application of the ACT theory to L1 acquisition can be controversial because the theory maintains that all learning must start out from declarative knowledge (see, however, Anderson & Fincham, 1994, who conceded that not all knowledge needs to start out in declarative form). In contrast, L2 learning is (far) more likely to involve declarative learning at the initial levels of development (see DeKeyser, 1994, 2017, for discussion), especially when learners are adults and targets of learning are explicitly taught through instruction.
In this light, L2 learning affords greater compatibility with ACT-R and serves as a better place to test its learning theory.

Recently, L2 researchers have investigated individual differences in L2 skill acquisition and their cognitive correlates (e.g., Li, 2017; Maie, 2021; Pili-Moss et al., 2020; Y. Suzuki, 2018). Of particular interest to this dissertation study is a group of studies that investigated the role of declarative and procedural memory ability in L2 skill acquisition. Studying the relationship between individual differences in L2 skill acquisition and declarative and procedural memory ability has the potential of revealing whether those cognitive abilities are involved in L2 skill acquisition and hence testing the applicability of the ACT-R learning mechanisms in L2 learning (see DeKeyser, 2012 for a rationale).

1.5.2 Individual Differences in Second Language Skill Acquisition

The different learning mechanisms proposed by the Race model, the CMPL theory, and ACT-R entail different predictions for cognitive abilities that are active at each stage of skill acquisition. The Race model predicts that skill acquisition is a one-stage process, and the development of automaticity results from the accumulation of instances. Although the Race model does not fully define what constitutes an “instance”, it certainly assumes that an instance is “represented as a processing episode” (Logan, 1988, p. 495). Hence, learning must be controlled by cognitive processes that are dependent on declarative memory, given that declarative memory subsumes episodic memory (see Eichenbaum, 2017; Squire & Wixted, 2011). The same reasoning also extends to the CMPL theory in that the transition from the first to the second stage is dependent on declarative memory.
On the other hand, ACT-R presupposes three stages, with the transition from the first to the second stage controlled by declarative memory and the transition from the second to the third stage controlled by procedural memory. Declarative memory is a long-term memory system that stores factual information such as episodic and semantic knowledge. Procedural memory, by contrast, is one of the nondeclarative memory systems specializing in encoding, storing, and retrieving procedures for performing various types of skills. While declarative learning is conceptualized as (primarily) conscious, rapid, and categorical, procedural learning is known to be unconscious, incremental, and probabilistic (see Eichenbaum, 2017; Squire & Wixted, 2011).

Drawing on Fitts and Posner’s three-stage theory of learning, Ackerman (1988, 1992) made the first attempt in the cognitive psychological literature to theorize and empirically test the role of (cognitive) individual differences in skill acquisition. Ackerman posited that three sets of cognitive abilities underlie each stage of skill acquisition:

1. The cognitive stage is associated with demands on general intellectual abilities.
2. The associative stage is associated with demands on perceptual speed abilities.
3. The autonomous stage is associated with demands on psychomotor abilities.

General intellectual abilities are cognitive abilities that pertain to declarative, effortful, and attentional learning, such as general intelligence, declarative memory, and working memory. Perceptual speed abilities are used when one must develop a simple procedure for skill performance, including procedural memory, statistical learning ability, and classical conditioning. Finally, psychomotor abilities represent “individual differences predominantly in the speed of responding to test items with little or no cognitive processing demands” (Ackerman, 1988, p. 291).
The psychomotor abilities differ from the perceptual speed abilities in that they only concern psychophysical limitations in the subject’s motor programming, which are largely independent of information processing. In a series of experiments, Ackerman and his colleagues examined the degree of correlation between the latency of skill execution and the three sets of abilities at each block of practice. They found evidence for a dynamic interplay of the cognitive abilities and how each becomes more or less dominant depending on the learning stages (e.g., Ackerman, 1987, 1988, 1992; Ackerman & Cianciolo, 2000; Ackerman et al., 1995).

In the field of SLA, only a few studies of individual differences have explicitly adopted the skill acquisition theory as the primary theoretical framework (see Li, 2017; Maie, 2021; Pili-Moss et al., 2020; Y. Suzuki, 2018). However, there is an independent line of research that has investigated the role of two long-term memory systems, declarative and procedural memory, in accounting for individual differences in L2 achievement (e.g., Faretta-Stutenberg & Morgan-Short, 2018; Hamrick, 2015; Morgan-Short et al., 2014; Pili-Moss et al., 2020; see Buffington & Morgan-Short, 2019 for a review). Recent SLA research has collectively shown that L2 learning (especially for adults) is initially supported by declarative memory and later dominated by procedural memory (see Hamrick et al., 2018 for a meta-analysis). Although this is consistent with what ACT-R would predict, these studies were conducted independently of the L2 skill acquisition research. Rather, Ullman’s (2004, 2016, 2020) declarative and procedural (D/P) model served as the guiding framework. The D/P model is a neurobiologically motivated model of language, which claims that declarative and procedural memory underlie the learning, representation, and retrieval of different types of linguistic knowledge. Specifically, declarative
memory supports learning and using L2 lexical items across all levels of proficiency, but for grammar, it is only responsible for the initial stages of learning, and procedural memory takes over as learners practice and develop proficiency. There is recent evidence in SLA that the use of vocabulary knowledge also involves procedural memory (Maie, 2021), but the general idea of initial reliance on declarative memory and the later involvement of procedural memory is consistent with the learning mechanisms in ACT-R.

While many have already examined the role of declarative and procedural memory in L2 achievement, only a handful have investigated their role in developing L2 automaticity (Li, 2017; Maie, 2021; Pili-Moss et al., 2020; Y. Suzuki, 2018). The available evidence, however, seems to favor the prediction from ACT-R and the D/P model. For example, Pili-Moss et al. (2020) investigated how the two memory systems predicted the accuracy and automaticity of comprehending and producing a newly learned artificial language over extensive practice sessions (see also Morgan-Short et al., 2014 for details). Concerning automaticity (operationalized as the CV of RT), the researchers found a three-way interaction effect of practice sessions, declarative memory, and procedural memory, such that when the practice sessions were divided into three stages, procedural memory was predictive of automaticity at the two later stages, but this was contingent on whether learners had a higher declarative learning ability. Given the declarative-procedural transition proposed in ACT-R and the D/P model, this result seems reasonable, as the transition becomes impossible if learners lack any declarative knowledge to proceduralize. However, this study did not collect RT data for production practice and only addressed skill acquisition specific to comprehension skills.
The researchers, furthermore, divided practice blocks into three stages on an arbitrary basis, which precluded them from contrasting claims about the number and the nature of skill acquisition stages.

The current dominant view in SLA is to regard L2 skill acquisition as a three-stage process. However, the theory must be formally tested in L2 contexts before one can put trust in the skill acquisition theory (and variants thereof) as applied to L2 learning. This is critical because the three-stage model (or part thereof) is already represented in many subdomains of SLA (e.g., language instruction: DeKeyser, 1998, 2001; Lyster & Sato, 2013; language assessment: ACTFL, 2012; Council of Europe, 2020), but the model itself is currently only theoretical and lacks empirical support. In this dissertation research, I investigated the number and the nature of skill acquisition stages in L2 learning. The number of skill acquisition stages was tested by adopting the analysis methodology of Tenison and Anderson (2016) in an empirical study of L2 learning that longitudinally examined how L2 learners developed accuracy and fluency in a novel language as a result of practice. The nature of stages, or distinct cognitive states active at each stage of learning, was tested by adopting the research design of those L2 studies that investigated the role of cognitive individual differences in L2 skill acquisition (e.g., Maie, 2021; Pili-Moss et al., 2020; Y. Suzuki, 2018). Following the theoretical model of individual differences in cognitive skill acquisition by Ackerman (1988, 1992) and the previous research on the role of declarative and procedural memory in L2 learning, I selected declarative memory, procedural memory, and psychomotor ability as the candidate cognitive abilities.
Hence, by bringing together the three lines of research in SLA and cognitive psychology, the study attempted to provide the first direct evidence for (or against) the influential three-stage model of L2 skill acquisition. In the next chapter, I describe the research and analytical design of the study, including the research questions and hypotheses, methods, procedure, and analysis.

CHAPTER 2: THE CURRENT STUDY

2.1 Research Questions and Hypotheses

The overarching goal of the study was to test the validity of the three-stage model of skill acquisition in the context of L2 learning. If, as Anderson (1983b) claimed, “language is cut from the same cloth as the other cognitive processes” (p. 261), the process and the mechanisms of L2 learning must be explained by the cognitive models of skill acquisition reviewed in the last chapter, including the three-stage model. To investigate this claim, I conducted a language learning experiment in which participants deliberately learned and practiced a novel foreign language for an extended period of time. I asked the following questions for the study:

Research Question 1: How many stages of skill acquisition, each characterized by distinct cognitive states for learning and consolidation, do L2 learners go through while learning and practicing a novel miniature language for an extended period of time?

Research Question 2: Which cognitive abilities, declarative memory, procedural memory, and psychomotor ability, are implicated at each stage of skill acquisition?

Research Question 3: Do the results for the number (RQ1) and the nature (RQ2) of skill acquisition stages in L2 learning differ between comprehension and production of the target language?

I hypothesized that L2 learning consists of three stages, given the dominant view of L2 skill acquisition as a three-stage process (RQ1) (DeKeyser, 2020; Lyster & Sato, 2013; Y. Suzuki, 2022).
Assuming that the three-stage model best encapsulates L2 skill acquisition, I predicted that individual differences in declarative memory, procedural memory, and psychomotor ability would manifest themselves at the first, second, and third stage of learning, respectively (RQ2). This prediction follows Ackerman’s (1988, 1992) theory and the predictions based on ACT-R and the D/P model. Furthermore, because learners in this study were trained on an entire language, beginning with explicit-deductive instruction on vocabulary items and morphosyntactic rules, I expected that the learning mechanisms proposed in ACT-R would show the highest consistency with how learners learn to comprehend and produce the target language (DeKeyser, 1997, 2001). Although there is good evidence that procedural knowledge for comprehension skills does not transfer well to production skills (or vice versa) (De Jong, 2005; Li & DeKeyser, 2017; Y. Suzuki & Sunada, 2020), the mechanisms underlying L2 skill acquisition must be identical regardless of the mode of language use. Hence, the same (or at least similar) results should be observed for comprehension and production of the language (RQ3).

2.2 Methods

2.2.1 Participants

Seventy-three participants whose L1 was English were recruited to participate in the study. In total, eight participants were excluded from the analysis (i.e., 10.96% attrition) because they either did not complete the entire study (i.e., six data collection sessions) or provided responses that were psychophysically implausible for the experimental tasks at hand. For instance, one participant produced responses with RT lower than 300 milliseconds (ms) in 71.30% (1,004/1,408 trials) of the final session of production practice. Given that this participant’s mean accuracy was 54.75%, I deemed that the participant did not pay close attention to the task, because even L1 speakers, whose processing is highly automatized, take at least 300 ms to recognize a word and then manually produce a response (Jiang, 2012).
After discarding data from such participants, the final sample consisted of 65 participants (46 female, 14 male, and 5 non-binary or not specified) with a mean age of 20.35 years (SD = 2.61, Min = 18, Max = 30). Additionally, five participants were excluded from the production data because they did not properly perform the production tasks in terms of the accuracy and the speed of performance (while their performance on comprehension tasks did not pose a problem). Hence, the dataset for production practice consisted of only 60 participants. The sample size is comparable to that of L2 skill acquisition research in general (e.g., De Jong, 2005 with N = 59; DeKeyser, 1997 with N = 61; Robinson, 1997 with N = 60).

Due to the nature of the target language, I only invited those participants who had not learned or studied any case-marking languages, such as German, Greek, Korean, Russian, and Turkish. On average, the participants knew 1.2 additional languages (SD = 0.79, Min = 0, Max = 4), including Spanish (n = 46), French (n = 11), Mandarin Chinese (n = 4), American Sign Language (n = 4), Arabic (n = 3), Italian (n = 3), Portuguese (n = 2), Albanian (n = 1), Latin (n = 1), Punjabi (n = 1), and Yiddish (n = 1).

All participants were recruited through the Registrar Data Request System (https://reg.msu.edu/Forms/DataRequest/DataRequest.aspx) at Michigan State University, which distributed recruitment emails to eligible participants. The participants contacted the researcher to indicate their interest in participating in the study, and after confirming that they were indeed eligible, I sent them a URL link to access the experiment (because the study was held online: see Section 2.2.3). The participants received 60 U.S. dollars upon the completion of the entire study.

2.2.2 Language

Because the study faced a limited budget and resources, it was necessary to find a linguistic target that could be trained to automaticity within a reasonable duration of time.
I chose a miniature version of Japanese, called Mini-Nihongo (translated as “Mini-Japanese”), that was originally constructed by Mueller and colleagues (Mueller, 2006; Mueller et al., 2005, 2007). The researchers have shown consistent evidence that grammatical and semantic violations in Mini-Nihongo elicit ERP (i.e., event-related potential) signatures that are identical to those that L1 Japanese speakers show when comprehending Japanese. The language thus preserves an appropriate level of complexity and naturalness in its grammatical and semantic system while at the same time allowing L2 learners to be trained to advanced proficiency within a relatively brief period of time. The reason for using part of a natural language (i.e., Japanese) rather than an artificial language such as Autopractan (DeKeyser, 1997) or Brocanto2 (e.g., Pili-Moss et al., 2020) was that learning a natural language (or part thereof) affords some practical value, which in turn may motivate the participants to engage in the study.

Figure 2.1 illustrates the entire structure of Mini-Nihongo adopted in the study. Most features of the language in the original form were kept, except that (a) two verbs (out of four), tsukitobasu (push away) and oiharau (chase away), were replaced with tsukamaeru (capture) and otozureru (visit) because the original verbs conveyed meanings that were hard to represent with pictures or overlapped with those of the other two verbs; (b) an adjective modifier, akai (red), was removed to equalize the length of the sentences across the entire item set; and (c) a temporal adverbial, tokoro desu (about to take place), was dropped because it added unnecessary complexity to the target language.

Figure 2.1. The entire structure of Mini-Nihongo.

Mini-Nihongo used in this study consisted of five grammatical categories: four nouns, four verbs, two numerals, two numerical classifiers, and three postpositions (see Figure 2.1).
Although Japanese in general allows scrambling of word order, I only used the Subject-Object-Verb order, which is canonical in Japanese. Hence, a sentence in Mini-Nihongo always contained two noun phrases followed by a main verb. The first noun phrase corresponded to the grammatical subject and the second to the object. A noun phrase consisted of a case-marked head noun that was modified by a numeral and a classifier. In Japanese, number is not marked morphologically and hence must be conveyed by numerical classifiers. The choice between two classifiers depended on whether the noun they marked was a bird (hato, kamo) or another type of small animal (nezumi, neko). The postposition -ga was the nominative marker, -o was the accusative marker, and -no was the genitive marker. Numerals and classifiers must be marked by the genitive marker in order to connect to the head noun. The entire structure afforded 256 unique sentences, with four examples listed below. Each sentence was matched with a colored picture that conveyed the meaning of the sentence. Because each practice session consisted of 128 trials for comprehension and production practice, I divided the stimulus list into two sets (List A and List B) and counterbalanced the order of the stimuli across the mode of language use and the number of practice sessions. In comprehension practice, the participants thus saw sentences from List A for the first and the third session of practice and from List B for the second and the fourth session. In production practice, the participants conversely saw sentences from List B for the first and the third session and from List A for the second and the fourth session. All the stimuli and the corresponding pictures can be found at: https://osf.io/x9u6h/.

1. Ichi wa no hato ga ni hiki no nezumi o tsukamaeru.
   one [bird] [gen.] pigeon [nom.] two [small-animal] [gen.] mouse [acc.] capture.
   A pigeon captures two mice.

2. Ni hiki no neko ga ichi wa no kamo o tobikoeru.
   two [small-animal] [gen.] cat [nom.] one [bird] [gen.] duck [acc.] jump over.
   Two cats jump over a duck.

3. Ni wa no kamo ga ichi hiki no nezumi o otozureru.
   two [bird] [gen.] duck [nom.] one [small-animal] [gen.] mouse [acc.] visit.
   Two ducks visit a mouse.

4. Ichi hiki no nezumi ga ni wa no hato o oikakeru.
   one [small-animal] [gen.] mouse [nom.] two [bird] [gen.] pigeon [acc.] chase.
   A mouse chases two pigeons.

2.2.3 General Procedure

The general procedure of the study is summarized in Table 2.1. The study consisted of six sessions of data collection (Day 1–Day 6). In principle, the participants were required to complete the study over six consecutive days, but a two-day interval was allowed in case of emergency (see Pili-Moss et al., 2020 for the same range of intervals between study sessions). On average, the participants completed the study in 6.13 days (SD = 0.39). Because the entire experiment was held online on Gorilla™ (https://app.gorilla.sc/), every procedural instruction within and between tasks was implemented as video instruction so that the participants fully understood what they were supposed to do (and not supposed to do).

Table 2.1. The procedure of the entire study.

Day (total time)      Task                                         Min.
Day 1 (39 minutes)    1. Background questionnaire                     1
                      2. Two-choice response task                     3
                      3. Alternating serial reaction time task       15
                      4. Statistical learning task                   20
Day 2 (60 minutes)    1. Continuous visual memory task               10
                      2. LLAMA-B                                     10
                      3. Explicit instruction of Mini-Nihongo        20
                      4. Vocabulary and grammar tests                 5
                      5. Warmup practice of Mini-Nihongo             15
Day 3 (70 minutes)    1. Vocabulary and grammar tests                 5
                      2. Comprehension practice                      20
                      3. Production practice                         45
Day 4 (65 minutes)    1. Vocabulary and grammar tests                 5
                      2. Production practice                         40
                      3. Comprehension practice                      20
Day 5 (60 minutes)    1. Vocabulary and grammar tests                 5
                      2. Comprehension practice                      20
                      3. Production practice                         35
Day 6 (55 minutes)    1. Vocabulary and grammar tests                 5
                      2. Production practice                         30
                      3. Comprehension practice                      20

Upon logging into the study, the participants were guided to the first session of the study (Day 1). They first completed an IRB-approved consent form and filled out a background questionnaire that asked about their email, age, gender, and knowledge of L1 and L2. The questionnaire is available at https://osf.io/zr9j8/. In Day 1, the participants only completed tasks of psychomotor ability and procedural memory capacity, in the order of a two-choice reaction time task (2CRT), an alternating serial reaction time task (ASRT), and a statistical learning task (SL). See Section 2.2.6 for each individual cognitive task. After completing SL, the participants were reminded that they must come back to the study on the next day for Day 2 and that they would receive a reminder email.

In Day 2, the participants first took two tasks of declarative memory, the Continuous Visual Memory Task (CVMT) and LLAMA-B, in that order. Afterward, they received explicit instruction of Mini-Nihongo by watching a 19-minute video that described and quizzed about the vocabulary and grammar structure of the language (see Section 2.2.4 for the content of the instruction). Vocabulary and grammar knowledge tests subsequently followed the instruction, which ascertained that the participants indeed developed explicit, declarative knowledge of the language. The participants were then guided to warmup practice, the purpose of which was to familiarize them with the format of comprehension and production tasks. Day 3–Day 6 had an identical structure: the vocabulary and grammar knowledge tests were administered first, followed by the comprehension and production practice tasks. In Day 3 and Day 5, comprehension practice preceded production practice, but in Day 4 and Day 6, the order was reversed.
At the beginning and end of every study session, the participants also completed a self-checklist to report that (a) they are/were in a quiet room, (b) to the best of their knowledge, they have/had a reliable internet connection, and (c) they will not/did not step away from the computer during a task. The purpose of the checklist was to remind the participants of the importance of following the criteria, given that the study was offered online. The participants were informed that if they failed to comply with the requirements (e.g., stepping away from the computer during a task for one hour), they might not be able to continue with the study. The checklist can be found at https://osf.io/69eap.

2.2.4 Language Training

The three-stage model (represented by ACT-R) holds declarative memory as the fundamental mechanism at the initial level of learning (e.g., Anderson, 1982, 1983b, 2007). It was thus important that the participants developed declarative knowledge of the target language before engaging in comprehension and production practice. To achieve this aim, the participants received explicit instruction on vocabulary and grammar rules of Mini-Nihongo in the form of a 19-minute video (https://osf.io/vh6ap/). The instruction began with a slide showing that Mini-Nihongo is comprised of five vocabulary categories: four animal words (nouns), four action words (verbs), two words of number, two words of animal class (classifiers), and three words of grammar category (case markers). The video subsequently presented nouns of Mini-Nihongo twice by directly presenting word-picture pairs. After the first presentation, the participants were given 30 seconds to memorize the nouns. This step was followed by a mini-exercise (four items) that asked the participants to match the words with corresponding pictures.
The first presentation of the nouns was accompanied by their English equivalent (neko = cat), but the translation was removed in the second presentation to ensure that the participants associated the words with the pictures rather than with the L1 equivalents. The same form of presentation-exercise-presentation cycle took place for the verbs and the number words. The instruction subsequently introduced two classifiers, -wa and -hiki, explaining that (a) the former is used to indicate a bird class and the latter to indicate a small animal class and that (b) these two words are used when combining a noun (e.g., neko) with a number word (e.g., ichi). A small exercise (four items) followed the explanation, which asked the participants to choose either -wa or -hiki depending on the picture presented. For instance, the participants chose -wa when they were presented with a picture of two pigeons. Lastly, three case markers, -ga, -o, and -no, were presented with a description that these words (a) attach to the end of a noun and (b) indicate the subject (-ga) or the object (-o) of a sentence or the status of possession (-no), just like the English word “of” or the possessive construction “John’s”. Afterward, all five vocabulary categories of Mini-Nihongo were presented once again, and the participants were told to review the content for one minute.

The participants then learned the phrasal and sentence structures of Mini-Nihongo. For the (noun) phrase structure, the instruction first presented all eight renditions of the noun phrase structure (i.e., ichi/ni + wa/hiki + no + hato/kamo/nezumi/neko) twice and asked the participants to figure out the ordering of the words. The participants were then presented with a rule that a noun phrase (NP) in Mini-Nihongo takes the word order of [number] + [animal class] + [possessive marker] + [noun].
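The noun phrase rule, together with the fixed S-O-V sentence frame described in Section 2.2.2, can be illustrated with a short enumeration. The following Python sketch is only an illustration and is not part of the study materials: the variable and function names are hypothetical, and it assumes that the subject and the object may be any of the eight noun phrases, which is the reading under which the structure yields the 256 sentences reported in Section 2.2.2.

```python
from itertools import product

# Lexicon taken from Section 2.2.2; each noun is mapped to its classifier
# (wa for birds, hiki for other small animals).
NOUNS = {"hato": "wa", "kamo": "wa", "nezumi": "hiki", "neko": "hiki"}
NUMERALS = ["ichi", "ni"]
VERBS = ["tsukamaeru", "otozureru", "tobikoeru", "oikakeru"]


def noun_phrase(numeral: str, noun: str) -> list[str]:
    """[number] + [animal class] + [possessive marker] + [noun]."""
    return [numeral, NOUNS[noun], "no", noun]


def sentence(subj: list[str], obj: list[str], verb: str) -> str:
    """Strict S-O-V order: NP + ga, NP + o, then the verb (eleven words)."""
    return " ".join(subj + ["ga"] + obj + ["o", verb])


# Assumption: every noun phrase can fill either the subject or the object slot.
noun_phrases = [noun_phrase(num, noun) for num, noun in product(NUMERALS, NOUNS)]
sentences = [sentence(s, o, v) for s, o, v in product(noun_phrases, noun_phrases, VERBS)]

print(len(noun_phrases))  # 8 noun phrases
print(len(sentences))     # 256 sentences
```

Under these assumptions, the count works out as 8 subject phrases × 8 object phrases × 4 verbs = 256, and every generated sentence has eleven words.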
To deepen the participants’ understanding of the rule, two examples were provided, showing that, for instance, a noun phrase ichi wa no hato corresponds to [one] [bird] [of] [pigeon]. The participants then worked on a small exercise (four items) that asked them to reorder the provided words to match them with a picture. For instance, the participants saw a picture of a duck and reordered wa/no/ichi/kamo to ichi wa no kamo. Lastly, the participants learned how to create a sentence in Mini-Nihongo. From the outset, they were explicitly told that (a) the language has the strict S-O-V word order or NP + NP + Verb and that (b) the subject of the sentence is marked by -ga and the object by -o. The participants took 30 seconds to process and memorize the rule. This step was followed by an exercise (four items) that asked the participants to reorder eleven provided words to create a sentence that corresponded to the picture presented. At the end of the instruction, the participants were reminded that it was important to review the vocabulary and grammar rules of the language because they would be tested at the beginning of every subsequent study session.

After the instruction as well as at the beginning of every subsequent session, the participants were tested on their declarative knowledge of vocabulary and grammar rules of Mini-Nihongo. The vocabulary test dealt with the nouns and verbs of the language and was implemented as a picture-word matching task. The participants saw a picture and two words presented together and chose the word that conveyed the meaning of the picture. Each noun was paired with every other noun as distractors (12 combinations) and every verb was paired with every other verb (12 combinations), which made up a total of 24 trials (i.e., each word was tested three times). The grammar test was a metalinguistic knowledge test in a fill-in-the-blank format.
The participants were presented with nine metalinguistic statements (randomly ordered) that described a morphosyntactic rule of Mini-Nihongo with some portion of the statement left blank. They were asked to choose an answer from two options. Figure 2.2 shows the outlook of the vocabulary and the grammar knowledge test, and Figure 2.3 summarizes the participants’ scores on the two tests. Immediately after the explicit instruction (in Day 2), the participants had already developed solid declarative knowledge of vocabulary (M = .95, SD = .06, Min = .71, Max = 1.00) and grammar of the language (M = .89, SD = .11, Min = .56, Max = 1.00). Note that although the mean on the grammar test was lower than .90, the test only contained nine items; hence, missing even one item would make a participant’s score lower than .90 (i.e., 8/9 = .89). One participant scored .56 on the grammar test in Day 2, but s/he showed a steady improvement through Day 3–Day 6, with scores of .67, .89, .89, and 1.00, respectively.

Figure 2.2. The outlook of the vocabulary (left) and grammar (right) test.

Figure 2.3. The participants’ scores on the vocabulary and grammar test. Note. The error bars show 95% confidence intervals.

2.2.5 Language Practice

In Day 3–Day 6, the participants engaged in comprehension and production practice of Mini-Nihongo after taking the vocabulary and grammar knowledge tests. The comprehension task was designed as a sentence-picture matching task in which the participants saw a sentence with two pictures and chose which picture corresponded to the sentence by pressing either the S-key or the K-key on the keyboard. Figure 2.4 shows the outlook of the task.
Although DeKeyser (1997) implemented a four-picture format (instead of two) for his comprehension task to ensure that the participants fully read the sentence before making a decision, I deemed that this design was not feasible for my study because doing so for Mini-Nihongo would have required at least six picture options, which would likely have been confusing and too demanding for the participants. In this study, the two options were chosen by contrasting the pictures in terms of either (a) the subject noun, (b) the number on the subject, (c) the object noun, (d) the number on the object, or (e) the verb. Note that the word order and the case markers could not be tested directly because their positions were fixed in Mini-Nihongo. However, making a correct decision in the task required the participants to understand and process those features; for instance, the participants had to recollect the word order (i.e., S-O-V) and the case-marking rules (i.e., -ga marks the subject and -o marks the object) to understand which noun phrase in the sentence corresponded to the subject or the object. The presentation of test items was randomized, with the five critical features evenly distributed throughout the task. In addition, the position of the correct answer (versus the distractor) was randomized. Each trial began with a fixation cross that was presented for 250 milliseconds (ms), after which the test item appeared. The participants received correctness feedback throughout the task.

Figure 2.4. The outlook of the comprehension practice task.

The production task was implemented as a maze task (see Forster, Guerrera, & Elliot, 2009 for a review). A maze task, first introduced by Freedman and Forster (1985), is an online measure of incremental sentence processing that asks test-takers to build a sentence by choosing from a series of two alternative options as if they were going through a maze.
For instance, in a five-frame trial, one sees four two-alternative options: A / bird * play / is * our / think * singing / a song * beautiful. Figure 2.5 shows the outlook of the maze task adopted in the current study. At each frame, the participants selected the continuation that best represented the pictured event. Since the numerals and classifiers had only two variants, one was always the distractor for the other (i.e., ichi and ni, wa and hiki). Each noun, verb, and case marker was paired with every other word from the same category with equal frequency. The position of the correct answer (versus the distractor) was randomized except for the number words and the classifiers because they only had two options. As in the comprehension task, trials in the production task began with a fixation cross that remained on the screen for 250 ms. Note that because Mini-Nihongo sentences consisted of eleven words, a single trial always consisted of a collection of 11 responses. For each response, the participants received correctness feedback.

Figure 2.5. The outlook of the production practice task.

Traditional maze tasks (as used in psycholinguistic research) do not directly pertain to production skills. However, the way the task was adopted in the study rendered it (more) applicable to assessing production skills because the participants chose one of the two options to match a picture prompt rather than choosing according to whether a given option is grammatically correct or incorrect. This feature contrasts with ordinary maze tasks, for which distractors are chosen such that they are always grammatically and semantically impossible. In Figure 2.5, both ichi and ni (in the first frame) are grammatically plausible, but the latter does not match the content of the picture. Hence, the maze task in this study can be considered a type of controlled production task. In recent L2 research, Y.
Suzuki and Sunada (2018) successfully adopted a maze task to gauge L2 speakers’ sentence processing speed and automaticity. Furthermore, a large-scale study by S. Suzuki and Kormos (2022) demonstrated that RT measured by a maze task correlated well with L2 speakers’ oral fluency. Through a structural equation modeling analysis, they found that RT in the maze task was the best indicator of learners’ general processing speed (as part of cognitive fluency), which, in turn, was the best predictor of an aspect of (utterance) fluency termed speed fluency (indicated by articulation rate and mean length of run). Based on the findings, together with how the task was designed in the study, I deemed that the maze task was a useful measure of how the participants developed production skills. Adopting a maze task for production practice also obviated the need to consider when to start the timer of RT measurement, an issue often debated in L2 psycholinguistic research (Jiang, 2012). Before the participants engaged in the main practice sessions in Day 3–Day 6, they were guided to initial warmup practice (in Day 2). This phase served as a familiarization period during which the participants became used to the format of the comprehension and production tasks. The warmup began with trials on individual words (i.e., nouns and verbs) and noun phrases ([number] [animal class] [possessive] [noun]), but the task then progressed to full sentences. When practicing on individual words or noun phrases for comprehension, the participants saw a word (or a noun phrase) together with two pictures and responded by choosing the picture that matched the word (or the noun phrase). For production practice, they saw a single picture and constructed a sentence (or chose a word/noun phrase) through the maze task (see Figure 2.5). 
For both comprehension and production warmup practice, there were 24 word-level trials (eight content words repeated three times), eight phrase-level trials, and 16 sentence-level trials. Identical items were used for both comprehension and production tasks. The entire item set for the warmup practice tasks can be found at: https://osf.io/x9u6h/.

In each main practice session (Day 3–Day 6), the participants practiced comprehending and producing Mini-Nihongo sentences for 128 trials (i.e., 256 trials in total) in 8 blocks of 16 trials. After each block, the participants were allowed to take a 3-to-5-minute break. Combining all four practice sessions in addition to the warmup practice (32 trials), the participants thus practiced Mini-Nihongo for a total of 1,056 trials in 33 blocks. All participants practiced on the same list of items, but the order of presentation was randomized within the same list. In each block, the participants also encountered a surprise attention-check trial where they were asked to press the SPACE bar to continue with the task. This trial was intended to maintain the participants’ focus on the task and to detect whether a participant was mindlessly pressing the response keys (i.e., S-key or K-key) without paying attention to the task.

2.2.6 Cognitive Individual Differences

In this study, I focused on declarative memory, procedural memory, and psychomotor ability as three dimensions of cognitive abilities that have been theorized to underlie the acquisition of cognitive skills (Ackerman, 1987, 1988, 1990, 1992; Ackerman & Cianciolo, 2000; Ackerman et al., 1995). By extension, I predicted that they also play pivotal roles in L2 skill acquisition (Li, 2017; Maie, 2021; Pili-Moss et al., 2020; Y. Suzuki, 2018).
There were two tasks for each ability dimension: the Continuous Visual Memory Task (CVMT) and LLAMA-B for declarative memory capacity, an alternating serial reaction time task (ASRT15) and a statistical learning task (SL) for procedural memory capacity, and a two-choice reaction time task (2CRT) and the first block of ASRT (ASRT1) for psychomotor ability. Within each dimension, I chose one domain-general (non-linguistic) and one domain-specific (linguistic) task. Table 2.2 summarizes the cognitive ability tasks.

Table 2.2. The summary of the cognitive ability tasks.

Task           Ability        Domain
1. CVMT        Declarative    General
2. LLAMA-B     Declarative    Specific
3. ASRT15      Procedural     General
4. SL          Procedural     Specific
5. ASRT1       Psychomotor    General
6. 2CRT        Psychomotor    Specific

2.2.6.1 The Continuous Visual Memory Task

CVMT is a test of one’s ability for nonverbal declarative learning using a visual recognition paradigm (Trahan & Larrabee, 1988). The original CVMT consists of four phases (practice, acquisition, delayed recognition, and visual discrimination), but I only adopted the first two phases, which is typical of how L2 researchers have used the task (see Buffington & Morgan-Short, 2019). During the task, the participants saw a series of complex abstract designs, and they were tested on their ability to recognize seven target designs that were repeated among the other distractor designs. The task began with 11 practice trials (the practice phase) followed by 112 test trials (the acquisition phase). Of the 112 test trials, 49 were the seven target items that were presented seven times, and the remaining 63 trials were distractors that appeared only once. The order of presentation was the same as that of the original version of the task. The participants indicated whether they had seen the design (old) or not (new) in the sequence by pressing the S-key (old) or the K-key (new). Each design was only visible for two seconds, but the participants were able to respond any time later.
Figure 2.6 shows the outlook of the task.

Figure 2.6. The outlook of the Continuous Visual Memory Task.

In CVMT, learning ability is quantified using d-prime scores. d-prime is a sensitivity index that operationalizes one’s ability to discriminate signals from noise. In the current study, the statistic operationalized the participants’ ability to discriminate old items from new items. I calculated d-prime scores by subtracting the z-score for the proportion of new items that were incorrectly labeled as old items (i.e., false alarms) from the z-score for the proportion of old items that were correctly labeled as old items (i.e., hits). The effective limit of d-prime scores is ±4.65. The internal consistency of the task based on the Kuder-Richardson Formula 20 (KR-20) was .72 [.63, .82].

2.2.6.2 LLAMA-B

LLAMA-B is a vocabulary learning subtest within the LLAMA language aptitude tests (Meara & Rogers, 2019). In L2 research, it has been used as a domain-specific (or language-based) measure of learners’ declarative memory ability (e.g., Hamrick, 2015; Saito et al., 2022). The task assesses one’s ability to learn the name of unfamiliar objects in two phases: the study phase and the testing phase (see Figure 2.7). In the study phase, the participants were presented with 20 unfamiliar objects with their names presented right beneath the objects. The participants were given 2 minutes to memorize the association between the names and the objects. The original version of the task implements a unique graphical user interface that allows test takers to move a cursor over an object to see its name. However, this feature was not available in Gorilla; instead, I presented an array of objects with their names together in a single screen (see Figure 2.7, the left panel). This presentation format was similar to that of Part V of the Modern Language Aptitude Test, a conceptual model of LLAMA-B (see Bokander & Bylund, 2022; Rogers et al., 2017 for reviews).
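As an aside on the CVMT scoring in Section 2.2.6.1, the d-prime index can be sketched in a few lines of Python. This is a minimal illustration rather than the scoring script used in the study: the function name and its parameters are hypothetical, and clipping the hit and false-alarm rates to .99 and .01 is an assumption, chosen because it reproduces the stated effective limit of ±4.65.

```python
from statistics import NormalDist


def d_prime(hits: int, n_old: int, false_alarms: int, n_new: int,
            floor: float = 0.01, ceiling: float = 0.99) -> float:
    """Return z(hit rate) - z(false-alarm rate).

    Rates are clipped to [floor, ceiling] (an assumed convention) so that
    perfect or empty cells do not produce infinite z-scores.
    """
    z = NormalDist().inv_cdf  # inverse of the standard normal CDF
    hit_rate = min(max(hits / n_old, floor), ceiling)
    fa_rate = min(max(false_alarms / n_new, floor), ceiling)
    return z(hit_rate) - z(fa_rate)


# Hypothetical counts, for illustration only.
print(round(d_prime(hits=49, n_old=49, false_alarms=0, n_new=63), 2))  # 4.65
```

With the assumed clipping, a participant who recognizes every old item and never mislabels a new one reaches the ceiling of 4.65, while a participant whose hit rate equals the false-alarm rate scores 0.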
In the testing phase, the participants were tested on how many of the associations they were able to recollect. For each trial, an object name appeared at the bottom of the screen (see Figure 2.7, the right panel). The participants indicated which object corresponded to the name by clicking the picture of the object. The task instructions explicitly stated that the participants were not allowed to take any notes, and that if they could not find the object or did not know the answer, they could guess by clicking an object at random. The testing phase consisted of 20 items with no time limit imposed on each item. I used raw scores as the participants’ declarative (and vocabulary) learning scores. The internal consistency of the task based on KR-20 was .76 [.67, .84]. LLAMA-B is publicly available at: https://www.lognostics.co.uk/tools/LLAMA_3/index.htm.

Figure 2.7. The outlook of LLAMA-B. Note. The left panel shows the learning phase, and the right panel shows the testing phase.

2.2.6.3 Alternating Serial Reaction Time Task

The ASRT examines one’s ability for implicit sequence learning (e.g., Howard & Howard, 1997; Nemeth et al., 2010). In L2 research, it is one of the most popular tasks for assessing learners’ procedural learning ability (e.g., Godfroid & Kim, 2021; Morgan-Short et al., 2014; Pili-Moss et al., 2020). As depicted in Figure 2.8, the participants saw an array of four circles, one of which was filled with an orange bird on each trial. The sequence in which the circles were filled followed a second-order conditional rule where pattern trials were interleaved with random trials. In this study, the participants were exposed to the same sequence of 1r4r3r2, where r corresponded to a random location. The participants responded as quickly and accurately as possible by pressing the corresponding keys on the keyboard ([z] for 1, [x] for 2, [> .] for 3, and [? /] for 4, using their left middle and index fingers and the right index and middle fingers, respectively). The participants had to press the correct key to proceed to the next trial; in other words, the task did not proceed unless they produced a correct response. L2 researchers have varied in the number of learning trials they use in the task. For instance, Godfroid and Kim (2021) used 10 blocks of learning trials whereas Morgan-Short et al. (2014) (and also Pili-Moss et al., 2020) implemented 20 blocks. In this study, I chose a middle ground and exposed the participants to 15 blocks of learning trials. Each block consisted of 88 trials, the first eight of which were random practice trials. In total, the participants went through 600 pattern trials and 720 random trials (including the practice trials).

Figure 2.8. The outlook of the alternating serial reaction time task.

Learning in ASRT is often quantified by taking the difference in RT between pattern and random trials. This method of scoring yields two (overlapping) measures depending on whether one uses all learning blocks (e.g., Buffington et al., 2021; Morgan-Short et al., 2014; Pili-Moss et al., 2020) or only the final block (e.g., Godfroid & Kim, 2021). I adopted the latter measure because Godfroid and Kim (2021) provided evidence that it is a reliable predictor of procedural (or implicit) knowledge analyzed through structural equation modeling. I took the mean of each participant’s RT over the final block (Block 15) and subtracted the mean on the pattern trials from the mean on the random trials. Any responses that had an RT lower than 100 ms or that fell outside the range of the individual’s mean ±3SD were removed from the analysis (1.6% of the dataset). Typically, L2 researchers estimate reliability coefficients for ASRT by splitting raw data (i.e., item-level responses) into two random halves and correcting the correlation between the two halves with the Spearman-Brown prophecy formula.
However, this method is not appropriate because the actual learning scores are calculated based on aggregated means rather than raw data. I thus simply took the mean of the participants’ RT on the pattern and the random trials and calculated Cronbach’s alpha with two items (i.e., RT on the pattern and the random trials). The internal consistency of the scores was α = .98 [.98, .99]. Finally, I calculated the mean of the participants’ RT in Block 1 as a measure of their psychomotor ability (ASRT1). The internal consistency of the scores based on Cronbach’s alpha (using the mean RT on the pattern and the random trials) was .98 [.98, .99].

2.2.6.4 Statistical Learning Task

A statistical learning task based on language(-like) stimuli typically examines one’s ability to learn either adjacent (e.g., Aslin, Saffran, & Newport, 1998; Saffran, 2002; Thompson & Newport, 2007) or non-adjacent dependencies (e.g., Gómez, 2002; Gómez & Maye, 2005; Newport & Aslin, 2004). However, Romberg and Saffran (2013) pointed out that learning a language often involves the simultaneous learning of both adjacent (e.g., collocation) and non-adjacent relationships (e.g., morphosyntax); therefore, investigating the learning of adjacent and non-adjacent dependencies at the same time is most conducive to examining a statistical learning ability relevant for language learning. In this study, I thus adopted the statistical learning task used in Romberg and Saffran (2013, Experiment 1). The target stimuli followed those of Gómez (2002), a list of three-word phrases in the form of A-X-B. There were three A words (pel, vot, and dak), three B words (rud, jic, and tood), and sixteen X words (balip, benez, deecha, fengle, gensim, gople, hiftam, kicey, loga, malsig, plizet, puser, roose, skiger, suleb, and vamey).
Crucially, each A word was paired with a B word as a categorical non-adjacent dependency frame (pel_rud, vot_jic, and dak_tood), but the relationships between A words and X words and between X words and B words were probabilistic, as shown in Figure 2.9. Specifically, the X words were grouped into four groups (of four words): XED, XHP, XLP, and Xunattested. XED were evenly distributed words that occurred with the same probability in each A_B frame. For each of the other X words, there was one A_B frame for which it was XHP (high probability words), one frame for which it was XLP (low probability words), and one frame for which it was Xunattested (unattested). In each frame, XHP words occurred four times more frequently than XLP words, and XLP and XED words were equally frequent. Xunattested words were not instantiated in the frame. The entire stimulus set can be found at: https://osf.io/cdy8b.

Figure 2.9. The stimulus structure in the statistical learning task adopted from Romberg and Saffran (2013).

In Romberg and Saffran (2013), the modality of the task was auditory following Gómez (2002). However, I adapted the task using visual stimuli to make the task consistent with the other tasks in the current study (including the practice tasks), which were all visually based. One issue associated with visual statistical learning tasks is that participants tend to perform better in the visual mode. For instance, Onnis, Christiansen, Chater and Gómez (2002) showed that participants, when exposed to a list of stimuli containing non-adjacent dependencies, on average scored more than 10% higher in the visual mode than in the auditory mode. This difference was likely due to the fact that the visual presentation of stimuli makes the target rule (whether hidden or not) more salient than auditory presentation does.
This feature of the visual modality can be problematic for statistical learning tasks because statistical learning is, by nature, an implicit process (see Monaghan, Schoetensack, & Rebuschat, 2019 for a review). To circumvent the issue, I piloted the task with four applied linguists who were familiar with some type of statistical learning task. The underlying logic was that if these linguists did not consciously identify the underlying rules (i.e., adjacent and non-adjacent dependencies), naïve participants in my study (who were not linguists) would be unlikely to notice the rules. In the task, the participants were exposed to a list of 72 three-word phrases (i.e., A-X-B) for four repetitions (288 trials in total). The interstimulus interval between phrases was set at 750 ms, following Romberg and Saffran (2013). Subsequently, a recognition test assessed to what extent the participants had developed knowledge of adjacent and non-adjacent dependencies. In the test, the participants saw two phrases in a sequence and decided which one of the two seemed more familiar, that is, which one they had encountered during the familiarization phase. There were 30 items: 12 for non-adjacent dependencies, 12 for adjacent dependencies, and 6 for checking whether the participants paid attention during the familiarization phase. Both options in the non-adjacent dependency trials contained X words that were evenly distributed across the three A_B frames (i.e., XED) because the adjacent relationships between A words and X words and between X words and B words could give rise to construct-irrelevant variance that has nothing to do with measuring one’s ability to learn the non-adjacent dependencies. For the adjacent dependency trials, I always contrasted XHP and Xunattested to test the knowledge of the adjacent dependencies. The attention-check trials contained X words that the participants never encountered during the familiarization period (chila, coomo, nilbo, taspu, wadim, and wiffle).
The participants were tested in three blocks, in the order of the non-adjacent dependency trials, the adjacent dependency trials, and the attention-check trials (see Romberg & Saffran, 2013, pp. 7–9). However, the order of presentation was randomized within each block. The test stimuli can be found at https://osf.io/cdy8b.

After completing the recognition test, the pilot participants received a three-part questionnaire that asked them to (a) verbally report any rules they noticed during the task (i.e., retrospective verbal reports), (b) judge the familiarity of six phrases that contained non-adjacent dependencies (three correct and three incorrect A_B combinations) along with their confidence in each judgment (i.e., confidence ratings), and (c) rate the familiarity of six phrases that contained adjacent dependencies (three high-frequency and three low-frequency phrases). For both confidence and familiarity ratings, the participants used a scale of 1 to 5, with 5 indicating higher confidence or familiarity. Figure 2.10 summarizes the mean confidence and familiarity ratings of the pilot participants. All four pilot participants rated correct phrases more confidently (M = 3.75, SD = 0.86) than incorrect phrases (M = 3.66, SD = 0.88) and high-frequency phrases as being more familiar (M = 3.91, SD = 1.16) than low-frequency phrases (M = 3.66, SD = 1.33). However, the differences were minimal, and considering the variance associated with each mean, it was more likely that the participants rated all phrases equivalently. Two of the four pilot participants stated in the retrospective report that the first and the third words always constituted a pair, but none of them (even after 288 trials) provided an explicit example of the A_B frames (non-adjacent) or touched upon the probabilistic relationship between X words and A/B words (adjacent).
These results met the so-called zero-correlation criterion for unconscious knowledge (Dienes, Altmann, Kwan, & Goode, 1995; Dienes & Scott, 2005). However, the results also need to be interpreted with caution given that the methodological reliability of confidence ratings has been questioned (e.g., Maie & DeKeyser, 2020) and that learners tend to underreport what they consciously know in retrospective reports (e.g., Hama & Leow, 2010).

Figure 2.10. The mean confidence/familiarity ratings of the pilot participants. Note. The error bars show 95% confidence intervals.

Romberg and Saffran (2013) did not provide an estimate of reliability for the statistical learning task. I thus calculated KR-20 for the whole test as well as for each section: KR-20 = .70 [.60, .81] for the whole test, .70 [.60, .81] for the non-adjacent dependency trials, and .44 [.25, .64] for the adjacent dependency trials. The reliability of the adjacent-dependency section seemed low but was still on par with the reliability of procedural learning (or memory) tasks often used in L2 research (but see Perruchet, 2021 for a critical comment on the issue). I examined item-level statistics of the adjacent-dependency section to improve its psychometric quality. Specifically, I examined how removing each item changed the reliability coefficient. As a result, I excluded two items (Items 18 and 21) because removing them increased the overall internal consistency of the section. The correlation between the non-adjacent (k = 12) and adjacent sections (k = 10) was r = .32 [.09, .52], p = .007, suggesting that the abilities to learn categorical non-adjacent dependencies and probabilistic adjacent dependencies coincide with each other only to a small extent. However, scores in each section correlated strongly with scores on the entire test: r = .86 [.79, .91], p < .001 for the non-adjacent dependency section and r = .75 [.62, .84], p < .001 for the adjacent dependency section.
In the main analysis, I thus adopted the participants’ raw scores on the entire test (k = 22) because the total scores on the test captured the participants’ scores in each section reasonably well.

2.2.6.5 Two-choice Reaction Time Task

Choice reaction time tasks are primarily used to investigate psychomotor functioning in humans and animals (see Smith, 1968; Trueman, Brooks, & Dunnett, 2021 for reviews). In contrast to simple reaction time tasks, during which one reacts to a single stimulus associated with only one response type (e.g., clicking as soon as one detects the word bird), choice reaction time tasks involve multiple stimuli, each requiring a separate response (e.g., pressing the S-key when one sees the word apple and the K-key when one sees the word orange). The task thus entails not only rapid identification of target stimuli but also accurate categorization of them depending on the responses assigned to each. Ackerman (1988, 1992) used choice reaction time tasks as measures of learners’ psychomotor processing speed. In this study, I adopted a two-choice RT task (2CRT), which is the most common format of the task. The participants were randomly presented with either the word falcon or the word eagle and asked to press the S-key as soon as they recognized falcon and the K-key as soon as they recognized eagle. There were 50 experimental trials (25 trials for each word) preceded by 10 practice trials with the words cat and dog. Each trial began with a fixation cross, which subsequently turned into the stimulus. The mean accuracy of the participants’ responses was .95, SD = .03, 95% CI [.88, 1.00]. Responses with RT outside the range of the individual’s mean RT ±3SD were removed from the analysis (5.63% of the dataset). I took an individual participant’s mean RT for correctly responded trials as the individual difference score.
The split-half reliability of the scores (divided into items for falcon versus eagle) was r = .94 [.90, .96].

2.3 Analysis

In this section, I describe the statistical analysis conducted to answer the research questions of the study (see Section 2.1 for the research questions and hypotheses). I will first describe how dependent and independent variables were defined and how the dataset was processed to identify and replace outlying data points. I will then review the conceptual and mathematical backgrounds of hidden Markov modeling, especially with reference to the specific hidden Markov model (HMM) adopted from Tenison and Anderson (2016). The HMM analysis was used to test the number of skill acquisition stages the participants underwent while practicing Mini-Nihongo (RQ1). Finally, I will lay out the details of the regression models that were specified to investigate which cognitive individual difference variables predicted learning at each stage of skill acquisition identified by the HMM analysis (RQ2).

2.3.1 Measurement

In this study, I focused on three dependent variables: accuracy, RT, and the coefficient of variability (CV) of RT. Table 2.3 presents the operational definitions of accuracy and RT as observed in each practice trial. CV was computed for each participant at the block level by dividing the standard deviation of RT by the corresponding mean. According to the three-stage model of skill acquisition, the first stage of learning (i.e., the cognitive stage) involves reliance on declarative memory and general cognitive abilities such as problem-solving skills. In this stage of learning, it is important that learners develop accuracy of performance so they can proceduralize a correct set of behaviors (DeKeyser, 2015; Lyster & Sato, 2013; Y. Suzuki, 2022). I predicted that accuracy would show the earliest sign of learning and hence would be correlated with the participant’s declarative memory capacity.
CV, on the other hand, was used to operationalize the degree of proceduralization (Segalowitz & Segalowitz, 1993; see also Section 1.2.2), which is known as a necessary process for learners to proceed to the second stage (i.e., the associative stage). CV should thus be predicted by one’s procedural memory capacity, especially during the second stage. Lastly, after reaching the asymptotic level of performance (i.e., the autonomous stage), learners can only be distinguished in terms of their psychophysical limitations in generating motor responses. Hence, they were expected to differ only in the RT of their performance, which would be predicted by their psychomotor ability.

Table 2.3. The operational definition of accuracy and RT.

Comprehension
  Accuracy: Whether or not participants chose the correct picture out of two options (0 or 1)
  RT: The time from the onset of a stimulus to the participant’s response (seconds)
Production
  Accuracy: Whether or not participants chose the correct option out of two pictures across the entire sentence. Each trial consisted of 11 word-level responses, and hence the accuracy was calculated as $\frac{1}{11}\sum_{k=1}^{11} x_k$, with $x_k \in \{0, 1\}$ (0–1.0)
  RT: The sum of RT across 11 word-level decisions, with each response involving choosing an option out of two words (seconds)

Prior to the statistical analysis, I processed the RT data in two steps. At the trial level, I first winsorized 2.5% of the data from either end of the distribution and replaced the excluded values with the trial mean (aggregated over the participants). This step was necessary to filter out extremely fast or slow responses, given the structure of the comprehension and production practice tasks. Mean replacement is often criticized for its potential to overly inflate confidence in the mean as a summary of data (because it increases the number of data points on the mean). I avoided the issue by using the trial mean over the entire sample rather than the mean of each participant.
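The first processing step can be illustrated with a short sketch. It is not the script used in the study: it assumes that the 2.5% cut-offs are taken from the distribution of RT across participants on a given trial, and that the replacement value is the mean of the retained (non-extreme) RTs on that trial. The function name and the toy data are hypothetical.

```python
from statistics import fmean, quantiles


def replace_extremes_with_trial_mean(trial_rts, tail_percent=2.5):
    """Replace RTs in the lower/upper tail of one trial with the trial mean.

    trial_rts holds one RT per participant for a single practice trial.
    """
    # With tail_percent=2.5, quantiles() returns cut points in 2.5% steps, so
    # the first and last cut points are the 2.5th and 97.5th percentiles.
    n_groups = round(100 / tail_percent)
    cuts = quantiles(trial_rts, n=n_groups)
    lower, upper = cuts[0], cuts[-1]

    retained = [rt for rt in trial_rts if lower <= rt <= upper]
    trial_mean = fmean(retained)  # assumption: mean of the retained values
    return [rt if lower <= rt <= upper else trial_mean for rt in trial_rts]


# Toy data: 40 hypothetical RTs (in seconds) observed on one trial.
toy_trial = [float(v) for v in range(1, 41)]
cleaned = replace_extremes_with_trial_mean(toy_trial)
```

Because the replacement value is computed across participants, no individual participant’s own mean gains extra weight from the replaced data points.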
Figures 2.11 and 2.12 show the histogram of RT for each participant on comprehension and production practice after the winsorization was applied. Note that RT was transformed to its logarithm because it only takes positive values, which often makes its distribution positively skewed. Mean replacement was preferred over simply removing the data points because the hidden Markov modeling analysis (see Section 2.3.3) required a complete dataset; it was also preferred over replacing with the corresponding boundary value (the 2.5% and 97.5% points of the distribution) because the analysis tends to be sensitive to extreme values. I applied the winsorization method because there is no consensus regarding the lower boundary of how long L1 or L2 speakers take to comprehend or produce a sentence in general. Subsequently, I computed the mean and standard deviation of RT for each participant in each block and removed any data points that were outside the range of the individual mean ±3SD. The trimmed values were replaced with the respective block mean of the participant. Finally, I removed data points on the first trial of each practice session (Trials 17, 145, 273, and 401) because the participants tended to perform unusually slowly (or more slowly than expected) on those trials. See Figure 2.13 for the issue at hand (especially for comprehension data). When learners show this kind of regression between study sessions, it is unclear whether this is due to forgetting of the target skill or because they simply become less familiar with the experimental task at hand. As Figure 2.13 shows, the initial slowness was only observed in the first trials, and the participants recovered their speed from the second trial onwards. This suggested that the regressions were most likely due to the fact that the participants simply needed some time to refamiliarize themselves with the experimental tasks.
In skill acquisition research, this phenomenon is referred to as a warmup decrement (Adams, 1961; see also Anderson & Fincham, 1994, Experiment 2).

Figure 2.11. Histogram of reaction time data for comprehension practice. Note. RT was transformed to its logarithm.

Figure 2.12. Histogram of reaction time data for production practice. Note. RT was transformed to its logarithm.

Figure 2.13. RT data before the removal of the first trials. Note. The dotted lines show the first trial of each practice session.

The first and the second step of data cleaning in total affected 6.54% and 5.60% of the dataset for comprehension and production practice, respectively. Figure 2.14 shows the changes from the original dataset (RT) after the first step (RT.A) and the second step (RT.B). Those data points that did not coincide with their values in the previous step are marked in red. The denser concentrations of data points in the left panels (than in the right panels) show that the first step affected more data (6.15% and 5.01%) than the second step (0.39% and 0.58%). Furthermore, the fact that the discrepancies between RT.A and RT.B (the right panels) were mostly below the straight line indicated that the second step mostly affected data points that were beyond the individual mean +3SD (rather than −3SD).

Figure 2.14. The summary of changes in the dataset following data processing. Note. Those data points that do not coincide with their previous step are marked in red.

Table 2.4 summarizes variables that were used as independent (or predictor) variables in the study. In the regression analysis (Section 2.3.3), I subtracted 1 from Trial and Block so that the first trial or block corresponded to the intercept of the model. All variables from the cognitive tests were transformed to their z-scores to make the intercept and their coefficients (easily) interpretable. The state occupancy (Stage) was dummy-coded with the first stage as the baseline.
Lastly, scores from the declarative memory measures (CVMT and LLAMA-B) and the psychomotor ability measures were combined into corresponding factor scores. See Section 3.1.2 for (a) the convergent validity evidence of CVMT and LLAMA-B to operationalize one’s declarative memory capacity and of 2CRT and ASRT1 to quantify one’s psychomotor ability and (b) the procedure of exploratory factor analysis to extract the factor scores.

Table 2.4. The operational definition of independent variables.

Variable    Definition
Trial       The number of practice trials (1–524)
Block       The number of practice blocks (1–33)
Stage       Which learning stage the participants resided in, identified by the HMM analysis (First, Second, and Third)
CVMT        d-prime score (−4.65 to 4.65); see Section 2.2.6.1
LLAMA-B     Percentile of accurate responses across the test (0–100)
ASRT15      The difference between the mean RT on random trials and that on pattern trials in Block 15
SL          Raw accuracy score across the test (1–22)
2CRT        The mean RT across the task
ASRT1       The mean RT in the first block of the task

2.3.3 Hidden Markov Modeling

The hidden Markov model (HMM) is an extension of the Markov chain, a stochastic model that represents a sequence of random variables called Markov states (Rabiner, 1989). A Markov chain assumes that the future state depends only on the current state, and that past states do not influence the future state except via the current state (i.e., the Markov assumption). In the current study, the Markov states represented learning stages defined as distinct cognitive states the participants went through while practicing how to comprehend and produce the target language. In this dissertation, I will use the term state to refer to the hypothesized learning stage encoded in the model and stage to refer to the actual learning stage whose ontological reality can be assumed.
The HMM treats the actual states as hidden, but their probability can be estimated based on observed data by assuming that the hidden states produced the observed data. It is a type of machine learning model often utilized in computational linguistics and natural language processing (see Jurafsky & Martin, 2021, Appendix Chapter A for a review). For instance, the HMM can be used to develop an automated speech recognition system by identifying words (states) based on acoustic information (observed data); it can also be used to tag lexical items automatically by categorizing words into their respective parts of speech. In the current study, the HMM estimated the underlying learning stages based on an array of RT obtained through comprehension and production practice trials. In cognitive psychology, Anderson and colleagues have shown that the HMM analysis can be applied and generalized to a wide variety of cognitive tasks, and when tested with datasets where the true number of processing stages (or developmental stages) is known, the model recovers the stages with a reasonable degree of accuracy (e.g., Anderson, 2012; Anderson et al., 2018; Anderson & Fincham, 2014; Anderson, Zhang, Borst, & Walsh, 2016; Borst, Ghuman, & Anderson, 2016; Tenison, Fincham, & Anderson, 2016).

Following Tenison and Anderson (2016), I fitted a series of HMMs to RT data at the trial level and estimated the probability of each participant residing in a given state on each practice trial (but see below for an issue in parameter estimation). Informed by the models of skill acquisition (Section 1.3), I compared three HMMs, with one, two, and three states, by examining how well each model fit the data. The one-, two-, and three-state models corresponded to the predictions of the Race model, the CMPL theory, and ACT-R, respectively. The HMM adopted in this study was a complex model in which a three-parameter power function was embedded.
Note that Tenison and Anderson (2016) also tested a three-parameter exponential function and the APEX function (Heathcote et al., 2000), but I only focused on the power function because (a) the power law of practice is already well established in skill acquisition research, (b) Tenison and Anderson showed that the APEX function tended to provide worse fits than the power or the exponential function, and (c) as shown in Section 3.1.1, the power function was noticeably more compatible with the current dataset than the exponential function. As introduced in Section 1.4, the three-parameter power function can be denoted as

\[ RT_{i,j} = I + \beta_i \, j^{-\alpha}, \]

where the reaction time on practice trial $j$ within State $i$ ($RT_{i,j}$) was modeled as a function of the intercept ($I$), the slope per state ($\beta_i$), and the learning rate parameter ($\alpha$). Following Tenison and Anderson (2016), I assumed that the intercept and the exponent were constant across learning states and estimated different values of the slope for each state. Note that the number of parameters was not exactly three in the model because estimating a slope per state meant that the model required one extra parameter for each additional learning state. Additionally, the HMM estimated transition probability, the probability of individuals eventually moving from one state to another (see Section 1.4). There were $i - 1$ transition parameters for each $i$-state model; for instance, the three-state model required two transition parameters, one for transitioning from the first to the second state and the other from the second to the third state. In total, the entire HMM estimated $2i + 1$ parameters for an $i$-state model (one intercept, one exponent, $i$ slopes, and $i - 1$ transition probabilities; the one-state model involved no state transition). I estimated the values of the parameters that maximized the probability of obtaining the current dataset. Specifically, the probability of a sequence of RT for each participant (524 trials) was estimated using the following formula:

\[
\Pr(j, i) = \sum_{k=1}^{N-j} \left[ (1 - \pi_{i,i+1})^{k-1} \, \pi_{i,i+1} \left( \prod_{m=1}^{k} g\!\left( RT_{j+m-1}, 3, \frac{\widehat{RT}_{i,m}}{3} \right) \right) \Pr(j + k, i + 1) \right] + (1 - \pi_{i,i+1})^{N-j} \left[ \prod_{m=1}^{N-j+1} g\!\left( RT_{j+m-1}, 3, \frac{\widehat{RT}_{i,m}}{3} \right) \right]
\]

$\Pr(j, i)$ denotes the probability of RT from trial $j$ to the last trial ($N$) given that the participant entered State $i$ on trial $j$. The equation consists of two parts, one within the summation sign ($\Sigma$) and the other outside. The former calculates the probability of trials in which the participant transitions to the next state, and the latter concerns trials in which the participant remains in the same state until the last trial. $\pi_{i,i+1}$ is the transition probability from State $i$ to State $i + 1$, and $(1 - \pi_{i,i+1})^{k-1} \pi_{i,i+1}$ denotes the probability of spending $k$ trials in State $i$. This part of the equation allowed the model to consider every possible number of trials in each specified state and to choose the number of trials that maximized the likelihood of obtaining the observed data. In other words, the model considered every possible way of partitioning a sequence of RT data (524 trials) into specified sets dictated by the learning states. Hence, the HMM jointly estimated the power-function parameters, the transition probability (or probabilities), and the number of practice trials within each state.

The probability of spending $k$ trials in State $i$, $(1 - \pi_{i,i+1})^{k-1} \pi_{i,i+1}$, was then multiplied by $g(RT_{j+m-1}, 3, \widehat{RT}_{i,m}/3)$, the probability of the observed RT (i.e., $RT_{j+m-1}$) on trial $j + m - 1$ given the predicted latency from the power function, $\widehat{RT}_{i,m}$, for the $m$th trial in State $i$. $g(\cdot)$ means that the probability was computed based on a gamma distribution, which made it possible to explicitly incorporate the variability among the RT data as part of the model. The gamma distribution is a family of continuous probability distributions within the exponential family, and it is often used in psychology to model the distribution of RT data (see Palmer et al., 2011 for example).
Following Tenison and Anderson (2016), I set the shape parameter of the distribution to 3 and the scale parameter to $\frac{1}{3}\widehat{RT}_{i,m}$. With these settings, the mean of the distribution equals the predicted RT and its standard deviation is proportional to the predicted RT; that is, when RT decreased due to practice, so did the variability of RT. The last part of the equation within the summation sign, $\Pr(j + k, i + 1)$, is the probability of RT from trial $j + k$ to the last trial given that the participant enters the next state (State $i + 1$) on trial $j + k$. Lastly, the part of the equation outside the summation sign deals with cases in which the participant stays in the same state until the last trial. $(1 - \pi_{i,i+1})^{N-j}$ is obtained by subtracting the transition probability ($\pi_{i,i+1}$) from 1 and raising the result to the power of $N - j$; it is the probability of not transitioning between states from trial $j$ to the last trial ($N$). Note that when the transition probability is 0, the entire equation reduces to what is outside the summation sign, the probability of trials where the participant remains in the same state until the last trial.

In the current analysis, applying the HMMs to trial-level RT data led to an issue of scalability; that is, it was computationally infeasible to calculate the probability of each participant residing in a given state at each trial. This was because participants in this study had substantially more practice trials (524 trials) than those in the original study (36 trials). On inspection, the problem lay not in the models per se but in the estimation method (the expectation maximization algorithm; see below), which could not converge on a solution. In principle, there were other approaches to estimating the parameters (e.g., naïve Bayes), but those methods have not been used in skill acquisition research and hence were not available at the time of data analysis. To resolve the computational issue, I chose to find the lowest level of data aggregation required to make the computation tractable.
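Aggregation of this kind amounts to averaging RT over consecutive, fixed-size bins of trials. The sketch below is illustrative only: the function name and the data are hypothetical, the bin sizes of 4 and 6 are used as examples, and it assumes that a final bin with fewer trials than the bin size is simply averaged over the trials it contains, a detail the text does not specify.

```python
from statistics import fmean


def bin_means(rts, bin_size):
    """Average a participant's trial-level RTs over consecutive bins."""
    return [
        fmean(rts[start:start + bin_size])
        for start in range(0, len(rts), bin_size)
    ]


# Hypothetical sequence of 524 trial-level RTs that speed up with practice.
trial_rts = [1.0 + 3.0 * (t ** -0.5) for t in range(1, 525)]
bins_of_4 = bin_means(trial_rts, 4)
bins_of_6 = bin_means(trial_rts, 6)
print(len(bins_of_4), len(bins_of_6))  # 131 88
```

Smaller bins preserve more of the early, rapid change in RT, which is why the lowest workable level of aggregation is preferable.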
The most obvious candidate was to analyze the RT data at the block level, which involved aggregating every 16 trials of practice (33 blocks). However, DeKeyser (1997) showed that in his study of L2 skill acquisition (morphosyntax), proceduralization could have been complete as early as the first 16 trials of practice. This meant that aggregating 16 practice trials into one data point ran the risk of missing the first (very brief) stage of skill acquisition. Instead, I took an exploratory approach, in which I aggregated every 2 to 16 trials of practice and searched for the lowest level of data aggregation that allowed the HMM algorithm to run. This was at the level of four trials (524 / 4 = 131 bins) for comprehension practice and at the level of six trials (514 / 6 ≈ 88 bins) for production practice. Hence, I averaged the RT data within every 4-trial bin for comprehension practice and every 6-trial bin for production practice, and used the resulting dataset to fit the HMMs. I acknowledge that it would have been ideal to analyze the raw (trial-level) data, but as discussed in Section 3.2, this data averaging was unlikely to affect the results, especially with regard to the number of HMM states most consistent with the data. I therefore assumed that, within each bin of four or six practice trials, participants were in the same stage of skill acquisition.

Although HMM states represented learning stages in this study, the mapping between the two was not completely one-to-one, because the current HMM defined a distinct state for each practice trial (or rather, each 4- or 6-trial bin) within a learning stage; that is, the model considered every trajectory of state transitions that was possible at each trial (or bin). Figure 2.15 illustrates the difference between a typical HMM (Figure 2.15a) and the model adopted from Tenison and Anderson (2016) (Figure 2.15b).
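The bin averaging described above can be illustrated with a short Python sketch. The function name and the handling of a trailing partial bin (averaged over whatever trials it contains) are assumptions made for the example, not details reported for the study.

```python
import numpy as np


def bin_means(rts, bin_size):
    """Average consecutive RTs in bins of `bin_size` trials.

    A trailing partial bin (when the number of trials is not a multiple
    of `bin_size`) is averaged over the trials it contains; this choice
    is an assumption of the sketch.
    """
    rts = np.asarray(rts, dtype=float)
    starts = range(0, len(rts), bin_size)
    return np.array([rts[s:s + bin_size].mean() for s in starts])


# Example: 524 simulated trials averaged in 4-trial bins
rng = np.random.default_rng(0)
simulated_rts = rng.gamma(shape=3.0, scale=0.5, size=524)
binned = bin_means(simulated_rts, 4)
print(len(binned))
```

Each element of `binned` then serves as one observation in the sequence passed to the HMM, in place of the raw trial-level RTs.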
At each trial, participants in the current model had two options: they either proceeded to the next trial of the current state or transitioned to the first trial of the next state. The sole reason for adopting this complex model was to circumvent the fact that models with within-state speedup violate the Markov assumption when there is only one Markov state for each learning state. Due to this complexity, if there were $i$ states, the HMMs had $131i$ states for comprehension practice and $88i$ states for production practice (if no aggregation had been applied, there would have been $524i$ states). All parameters associated with the HMMs and the power function were estimated using the expectation maximization algorithm (Rabiner, 1989). The HMM fitting was done in Spyder (Version 5.1.5; https://www.spyder-ide.org/), an open-source development environment for writing and running Python code. The code for the analysis was provided by Dr. Caitlin Tenison, the corresponding author of Tenison and Anderson (2016).

Figure 2.15. The difference between a typical HMM and the model in the current study.

After fitting the HMMs with differing numbers of states (i.e., one to three states), I compared the models based on the Akaike Information Criterion corrected for small sample sizes (AICc) and the Bayesian Information Criterion (BIC). Although Tenison and Anderson (2016) relied solely on BIC to compare the competing HMMs, the use of BIC assumes that the true model exists in the set of candidate models, which is rarely the case in (most) psychological research. In contrast, AIC(c) does not make such an assumption; it only estimates the information lost when a researcher's model is used to approximate the true model (although AIC has the drawback of tending to prefer overly complex models, which is not the case for BIC) (see Burnham & Anderson, 2002; Kass & Raftery, 1995, for discussions advocating AIC or BIC).
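To make the expanded state space described above concrete, the following Python sketch enumerates one Markov state per (learning state, within-state bin) pair and lists the two moves available from each. The function names and the tuple representation are hypothetical and are used only for illustration.

```python
def expanded_states(n_learning_states, n_bins):
    """One Markov state per (learning state, within-state bin) pair."""
    return [
        (s, m)
        for s in range(1, n_learning_states + 1)
        for m in range(1, n_bins + 1)
    ]


def successors(state, n_learning_states, n_bins):
    """Moves available from a state: stay in the learning state and advance
    one bin, or jump to the first bin of the next learning state."""
    s, m = state
    moves = []
    if m < n_bins:
        moves.append((s, m + 1))
    if s < n_learning_states:
        moves.append((s + 1, 1))
    return moves


# Three learning states over 131 bins give 131 * 3 Markov states
print(len(expanded_states(3, 131)))
print(successors((1, 5), 3, 131))
```

Because the within-state position is part of the state, the predicted RT at each step depends only on the current Markov state, which is what the Markov assumption requires.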
I chose AICc over AIC because AIC is only valid for large samples; a common rule of thumb is to use AICc whenever $n/k < 40$, where $n$ is the sample size and $k$ is the number of parameters estimated (Burnham & Anderson, 2002). AICc and BIC were defined as:

\[
\mathrm{AICc} = -2 \log L + 2k + \frac{2k(k + 1)}{n - k - 1}
\]

\[
\mathrm{BIC} = -2 \log L + k \log(n)
\]

where $\log L$ is the log likelihood of the observed data under the model, $k$ is the number of parameters in the model, and $n$ is the sample size. Because the values of AICc or BIC per se do not indicate how well the best model compares to its rival models, I calculated the so-called Akaike weight and the BIC model weight, which can be interpreted as the conditional probability of a model given the other candidate models in the set (with a value between 0 and 1). I followed the formulation of the weights summarized in Wagenmakers and Farrell (2004):

\[
w_i(\text{index}) = \frac{\exp\left[ -\frac{1}{2} \Delta_i(\text{index}) \right]}{\sum_{k=1}^{K} \exp\left[ -\frac{1}{2} \Delta_k(\text{index}) \right]}
\]

where $\Delta_i(\text{index})$ is the difference between the AICc or BIC of the best model and that of the model in focus, and the sum in the denominator runs over the $K$ candidate models (in this formula, $k$ indexes models rather than parameters). The primary purpose of using both AICc and BIC was to gather and triangulate multiple sources of information for (or against) the competing HMMs.

One caveat of the current HMM is that it did not allow for cases in which the participants regressed to a previous learning state. However, as is clear in both cognitive psychology and SLA, skill acquisition theory is also a theory of skill retention (Kim, Ritter, & Koubek, 2013; Li & DeKeyser, 2017; Y. Suzuki & Sunada, 2019). Fitting a power function to RT data thus assumed a smooth and continuous decrease in RT, which does not hold particularly when a skill must be learned over a long period of time. While I was willing to make such an idealization for the purpose of the current study, any study that spans multiple sessions or days necessarily invites some degree of forgetting on the participants' side.
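The information criteria and model weights defined above can be computed as in the following Python sketch. The function names are hypothetical, and the log likelihoods, parameter counts, and sample size in the example are made-up values chosen only to show the mechanics; they are not results from the study.

```python
import math


def aicc(log_lik, k, n):
    """AIC with the small-sample correction term 2k(k + 1) / (n - k - 1)."""
    return -2.0 * log_lik + 2.0 * k + (2.0 * k * (k + 1)) / (n - k - 1)


def bic(log_lik, k, n):
    """Bayesian Information Criterion."""
    return -2.0 * log_lik + k * math.log(n)


def model_weights(scores):
    """Akaike or BIC weights: exp(-delta / 2), normalized over all models,
    where delta is each model's score minus the best (lowest) score."""
    best = min(scores)
    relative = [math.exp(-0.5 * (score - best)) for score in scores]
    total = sum(relative)
    return [value / total for value in relative]


# Illustrative (made-up) fits for one-, two-, and three-state models
n_obs = 131
fits = [(-420.0, 3), (-400.0, 6), (-395.0, 9)]  # (log likelihood, k)
aicc_scores = [aicc(ll, k, n_obs) for ll, k in fits]
bic_scores = [bic(ll, k, n_obs) for ll, k in fits]
print(model_weights(aicc_scores))
print(model_weights(bic_scores))
```

The weights in each list sum to 1, so a weight close to 1 indicates that one model is strongly favored over the other candidates under that criterion.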
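2.3.3 Regression Modeling

Regression modeling answered the second research question of the study by investigating which cognitive individual difference variables predicted learning at each stage of skill acquisition. I modeled three dependent variables, accuracy, CV, and RT, by fitting generalized linear mixed models (GLMMs) using Bayesian inference. I used the software R (Version 4.1.2; R Core Team, 2022) and the R package brms (Version 2.16.3; Bürkner, 2017), which is an R front end to Stan (Version 2.21.3; Stan Development Team), a probabilistic programming language for Bayesian inference and optimization. In Bayesian analysis, prior knowledge in the form of probability distributions is combined with observed data to create posterior distributions (see Gelman et al., 2013; Kruschke, 2015; McElreath, 2020, for general reviews). In the simplest case (a normal prior with normally distributed data), the posterior mean is the precision-weighted average of the prior mean and the observed data. Maie and Godfroid (2022), Murakami and Ellis (2022), and Saito et al. (2020) provide recent examples of Bayesian data analysis in SLA.

Below, I list the mathematical details of the GLMMs that were applied to the current dependent variables. In sum, accuracy from comprehension practice was modeled with a binomial GLMM, but the same variable from production practice was modeled using a zero-one inflated beta GLMM. Regardless of the mode of language practice, RT and CV were modeled with normal GLMMs. These regression models are often alternatively called logistic, zero-one inflated beta, and linear mixed-effects models, respectively.

Binomial model for accuracy (comprehension)

\[
\begin{aligned}
y_i &\sim \mathrm{Binomial}(n, p_i), \text{ where } i = 1, 2, \ldots, n \text{ and indicates participants} \\
p_i &= \frac{1}{1 + e^{-X\beta_i}} \\
X\beta_i &= \alpha + \alpha_{\text{subject}[i]} + \alpha_{\text{item}[j]} + (\beta_{\text{Trial}} + \beta_{\text{subject}[i]})\, x_{\text{Trial}} + \beta x_{\text{Stage2}} + \beta x_{\text{Stage3}} \\
&\quad + \beta x_{\text{Declarative}} + \beta x_{\text{ASRT15}} + \beta x_{\text{SL}} + \beta x_{\text{Psychomotor}} \\
&\quad + \beta x_{\text{Stage2:Declarative}} + \beta x_{\text{Stage2:ASRT15}} + \beta x_{\text{Stage2:SL}} + \beta x_{\text{Stage2:Psychomotor}} \\
&\quad + \beta x_{\text{Stage3:Declarative}} + \beta x_{\text{Stage3:ASRT15}} + \beta x_{\text{Stage3:SL}} + \beta x_{\text{Stage3:Psychomotor}} \\
\begin{pmatrix} \alpha_{\text{subject}} \\ \beta_{\text{subject}} \end{pmatrix} &\sim \mathrm{MVNormal}\left( \begin{pmatrix} 0 \\ 0 \end{pmatrix}, \mathbf{S}_{\text{subject}} \right) \\
\mathbf{S}_{\text{subject}} &= \begin{pmatrix} \sigma^2_{\alpha_{\text{subject}}} & \sigma_{\alpha_{\text{subject}}} \sigma_{\beta_{\text{subject,Trial}}} \rho \\ \sigma_{\beta_{\text{subject,Trial}}} \sigma_{\alpha_{\text{subject}}} \rho & \sigma^2_{\beta_{\text{subject,Trial}}} \end{pmatrix} \\
\alpha &\sim \mathrm{Normal}(1, 5) \\
\text{all } \beta\text{s} &\sim \mathrm{Normal}(0, 1) \\
\sigma_{\alpha_{\text{subject}}} &\sim \mathrm{HalfCauchy}(1) \\
\sigma_{\beta_{\text{subject,Trial}}} &\sim \mathrm{HalfCauchy}(1) \\
\rho &\sim \mathrm{LKJ}(1)
\end{aligned}
\]

Zero-one inflated beta model for accuracy (production)

\[
\begin{aligned}
y_i &\sim \mathrm{Beta}(\upsilon_i, \omega_i), \text{ where } i = 1, 2, \ldots, n \text{ and indicates participants} \\
\mu_i &= \frac{\upsilon_i}{\upsilon_i + \omega_i} \\
\mu_i &= \frac{1}{1 + e^{-X\beta_i}} \\
X\beta_i &= \alpha + \alpha_{\text{subject}[i]} + \alpha_{\text{item}[j]} + \beta x_{\text{Stage2}} \\
&\quad + \beta x_{\text{Declarative}} + \beta x_{\text{ASRT15}} + \beta x_{\text{SL}} + \beta x_{\text{Psychomotor}} \\
&\quad + \beta x_{\text{Stage2:Declarative}} + \beta x_{\text{Stage2:ASRT15}} + \beta x_{\text{Stage2:SL}} + \beta x_{\text{Stage2:Psychomotor}} \\
\alpha_{\text{subject}} &\sim \mathrm{Normal}(0, \sigma^2_{\alpha_{\text{subject}}}) \\
\alpha &\sim \mathrm{Normal}(1, 5) \\
\text{all } \beta\text{s} &\sim \mathrm{Normal}(0, 1) \\
\sigma_{\alpha_{\text{subject}}} &\sim \mathrm{HalfCauchy}(1)
\end{aligned}
\]

Normal model for CV (comprehension, production)

\[
\begin{aligned}
y_i &\sim \mathrm{Normal}(\mu, \sigma^2), \text{ where } i = 1, 2, \ldots, n \text{ and indicates participants} \\
\mu &= \alpha + \alpha_{\text{subject}[i]} + \alpha_{\text{item}[j]} + (\beta_{\text{Trial}} + \beta_{\text{subject}[i]})\, x_{\text{Trial}} + \beta x_{\text{Stage2}} + \beta x_{\text{Stage3}} \\
&\quad + \beta x_{\text{Declarative}} + \beta x_{\text{ASRT15}} + \beta x_{\text{SL}} + \beta x_{\text{Psychomotor}} \\
&\quad + \beta x_{\text{Stage2:Declarative}} + \beta x_{\text{Stage2:ASRT15}} + \beta x_{\text{Stage2:SL}} + \beta x_{\text{Stage2:Psychomotor}} \\
&\quad + \beta x_{\text{Stage3:Declarative}} + \beta x_{\text{Stage3:ASRT15}} + \beta x_{\text{Stage3:SL}} + \beta x_{\text{Stage3:Psychomotor}} \\
\begin{pmatrix} \alpha_{\text{subject}} \\ \beta_{\text{subject}} \end{pmatrix} &\sim \mathrm{MVNormal}\left( \begin{pmatrix} 0 \\ 0 \end{pmatrix}, \mathbf{S}_{\text{subject}} \right) \\
\mathbf{S}_{\text{subject}} &= \begin{pmatrix} \sigma^2_{\alpha_{\text{subject}}} & \sigma_{\alpha_{\text{subject}}} \sigma_{\beta_{\text{subject,Trial}}} \rho \\ \sigma_{\beta_{\text{subject,Trial}}} \sigma_{\alpha_{\text{subject}}} \rho & \sigma^2_{\beta_{\text{subject,Trial}}} \end{pmatrix} \\
\alpha &\sim \mathrm{Normal}(0, 1) \text{ for comprehension and production} \\
\text{all } \beta\text{s} &\sim \mathrm{Normal}(0, 1) \text{ for comprehension and production} \\
\sigma_{\alpha_{\text{subject}}} &\sim \mathrm{HalfCauchy}(1) \text{ for comprehension and production} \\
\sigma_{\beta_{\text{subject,Trial}}} &\sim \mathrm{HalfCauchy}(1) \text{ for comprehension and production} \\
\rho &\sim \mathrm{LKJ}(1) \text{ for comprehension and production}
\end{aligned}
\]

Normal model for RT (comprehension, production)

\[
\begin{aligned}
\log(y_i) &\sim \mathrm{Normal}(\mu, \sigma^2), \text{ where } i = 1, 2, \ldots, n \text{ and indicates participants} \\
\mu &= \alpha + \alpha_{\text{subject}[i]} + \alpha_{\text{item}[j]} + (\beta_{\text{Trial}} + \beta_{\text{subject}[i]})\, x_{\text{Trial}} + \beta x_{\text{Stage2}} + \beta x_{\text{Stage3}} \\
&\quad + \beta x_{\text{Declarative}} + \beta x_{\text{ASRT15}} + \beta x_{\text{SL}} + \beta x_{\text{Psychomotor}} \\
&\quad + \beta x_{\text{Stage2:Declarative}} + \beta x_{\text{Stage2:ASRT15}} + \beta x_{\text{Stage2:SL}} + \beta x_{\text{Stage2:Psychomotor}} \\
&\quad + \beta x_{\text{Stage3:Declarative}} + \beta x_{\text{Stage3:ASRT15}} + \beta x_{\text{Stage3:SL}} + \beta x_{\text{Stage3:Psychomotor}} \\
\begin{pmatrix} \alpha_{\text{subject}} \\ \beta_{\text{subject}} \end{pmatrix} &\sim \mathrm{MVNormal}\left( \begin{pmatrix} 0 \\ 0 \end{pmatrix}, \mathbf{S}_{\text{subject}} \right) \\
\mathbf{S}_{\text{subject}} &= \begin{pmatrix} \sigma^2_{\alpha_{\text{subject}}} & \sigma_{\alpha_{\text{subject}}} \sigma_{\beta_{\text{subject,Trial}}} \rho \\ \sigma_{\beta_{\text{subject,Trial}}} \sigma_{\alpha_{\text{subject}}} \rho & \sigma^2_{\beta_{\text{subject,Trial}}} \end{pmatrix} \\
\alpha &\sim \mathrm{Normal}(0, 2) \text{ for comprehension and } \alpha \sim \mathrm{Normal}(0, 3) \text{ for production} \\
\text{all } \beta\text{s} &\sim \mathrm{Normal}(0, 1) \text{ for comprehension and production} \\
\sigma_{\alpha_{\text{subject}}} &\sim \mathrm{HalfCauchy}(1) \text{ for comprehension and production} \\
\sigma_{\beta_{\text{subject,Trial}}} &\sim \mathrm{HalfCauchy}(1) \text{ for comprehension and production} \\
\rho &\sim \mathrm{LKJ}(1) \text{ for comprehension and production}
\end{aligned}
\]

2.3.3.1 Accuracy (comprehension)

The probability of choosing a correct answer, that is, $p_i(y_i = 1)$, was modeled using a binomial distribution with the number of trials, $n$, set at 524 trials. A probability is by nature bounded between 0 and 1 and hence was transformed to its logit (i.e., log odds), which ranges from $-\infty$ to $+\infty$, by the logit link function so that the independent variables were related to the logit of the probability in a linear manner. The predictor variables included a main effect of residing in a given skill acquisition stage, that is, $\beta x_{\text{Stage2}}$ and $\beta x_{\text{Stage3}}$ (Stage 1 corresponded to the intercept), and a main effect of declarative memory capacity ($\beta x_{\text{Declarative}}$), ASRT15 ($\beta x_{\text{ASRT15}}$), SL ($\beta x_{\text{SL}}$), and psychomotor ability ($\beta x_{\text{Psychomotor}}$). I let stage interact with each cognitive variable, so there was a two-way interaction of Stage 2 with declarative memory capacity ($\beta x_{\text{Stage2:Declarative}}$), ASRT15 ($\beta x_{\text{Stage2:ASRT15}}$), SL ($\beta x_{\text{Stage2:SL}}$), and psychomotor ability ($\beta x_{\text{Stage2:Psychomotor}}$), and a two-way interaction of Stage 3 with declarative memory capacity ($\beta x_{\text{Stage3:Declarative}}$), ASRT15 ($\beta x_{\text{Stage3:ASRT15}}$), SL ($\beta x_{\text{Stage3:SL}}$), and psychomotor ability ($\beta x_{\text{Stage3:Psychomotor}}$). Although I was primarily interested in how each cognitive variable predicted the logit-transformed accuracy rate at each learning stage, I added a main effect of Trial to account for changes in the accuracy rate as a function of practice trials.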
Random effects consisted of the maximal structure allowed by the experimental design. Hence, I estimated varying intercepts and varying slopes of Trial for individual participants. Note that random effects for items could not be incorporated because there were only two repetitions of the same items across the four practice sessions. Random effects were estimated as if they were drawn from a multivariate normal distribution with a mean of 0 and with the covariance given by the variance-covariance matrix $\mathbf{S}_{\text{subject}}$. The content of this matrix is given in the model specification above. A multivariate normal distribution is a multi-dimensional extension of a normal distribution. This specification is the same as that of GLMMs implemented in the lme4 package (Bates et al., 2015).

Because there was no a priori information regarding how the participants would increase accuracy (in the specific experimental tasks adopted in the current study) as a function of practice and how the cognitive individual difference variables would predict the participants' accuracy rate, I used weakly informative priors so that the data would overwhelm the priors when there was (at least some) meaningful information in the dataset. Weakly informative priors can be defined as a class of prior distributions "that are explicitly designed to encode information that applies to a general class of problems without taking full advantage of problem-specific knowledge" (Gelman, Simpson, & Betancourt, 2017, p. 3). A weakly informative prior is made intentionally weak so that it has little influence on the posterior distribution but still regularizes extreme values in the posterior samples. For the intercept ($\alpha$), I used a normal prior distribution with a mean of 1 and a standard deviation of 5.
This prior expected that, aggregating over the cognitive variables, the mean accuracy rate of the participants at Stage 1 (and Trial 1) would be $\mathrm{logit}^{-1}(1) = .731$, but that it could be anywhere between .018 and .998 ($= \mathrm{logit}^{-1}[1 - 5]$ and $\mathrm{logit}^{-1}[1 + 5]$) within ±1 SD of uncertainty. For the regression coefficients, I set the prior distribution as a normal distribution with a mean of 0 and a standard deviation of 1. This prior expected that when the mean accuracy rate was .731, a one standard deviation increase in CVMT, for instance, was associated with an increase of 15 percentage points in accuracy ($\mathrm{logit}^{-1}[1 + 1] - \mathrm{logit}^{-1}[1] = .150$). For the random-effects parameters, that is, $\sigma_{\alpha_{\text{subject}}}$ and $\sigma_{\beta_{\text{subject,Trial}}}$, I used a half-Cauchy distribution with its scale parameter set at 1. The half-Cauchy distribution is the positive half of a Cauchy distribution, which can be derived by fixing the degrees-of-freedom parameter of a Student's t distribution at 1 (note that as df → ∞, the t distribution → the normal distribution). I chose a half-Cauchy distribution because Gelman (2006) showed that with complex models such as GLMMs, standard deviation parameters can often be better approximated by a half-Cauchy distribution than by an inverse-gamma or non-informative prior distribution (which are traditionally used to model error parameters). In addition, I used the Lewandowski-Kurowicka-Joe (LKJ) correlation distribution (Lewandowski, Kurowicka, & Joe, 2009) as the prior distribution for the correlation between the random-effects parameters. I set its shape parameter to 1 so that the distribution was (almost) uniform and allowed for any value of the correlation between the varying intercepts and slopes for individual participants.

I estimated the posterior distribution of the model parameters through a Markov chain Monte Carlo (MCMC) simulation consisting of four chains of 5,000 iterations each (with 1,000 warmup iterations).
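The probability-scale implications of the intercept and slope priors discussed above can be checked with a few lines of Python. This is an illustrative sketch; the helper name `inv_logit` and the variable names are chosen for the example.

```python
import math


def inv_logit(x):
    """Inverse of the logit link: maps a log-odds value to a probability."""
    return 1.0 / (1.0 + math.exp(-x))


# Intercept prior Normal(1, 5): prior mean and +/- 1 SD on the probability scale
prior_center = inv_logit(1.0)
prior_lower = inv_logit(1.0 - 5.0)
prior_upper = inv_logit(1.0 + 5.0)

# Slope prior Normal(0, 1): change in accuracy for a +1 SD coefficient,
# evaluated at the prior mean of the intercept
slope_shift = inv_logit(1.0 + 1.0) - inv_logit(1.0)

print(round(prior_center, 3), round(prior_lower, 3), round(prior_upper, 3))
print(round(slope_shift, 3))
```

Because the inverse logit is nonlinear, the same one-unit change on the logit scale corresponds to a smaller change in probability when the baseline accuracy is closer to 0 or 1.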
Stan implements the No-U-Turn Sampler as its MCMC algorithm, which is an extension of Hamiltonian Monte Carlo (Hoffman & Gelman, 2014). To check whether each MCMC chain converged on model parameters with a stationary distribution, I monitored whether the value of $\hat{R}$ associated with each parameter (as a convergence index) was within the range of $1 \le \hat{R} \le 1.1$ (Gelman & Rubin, 1992). I adopted the expected a posteriori estimate (i.e., the mean of the posterior distribution) and 95% credible intervals (CrI; i.e., highest posterior density intervals) as the point and interval estimates of the model coefficients, respectively. The procedure of parameter estimation was the same throughout the regression analysis, and hence it will not be described further in this dissertation unless changes were made.

2.3.3.2 Accuracy (production)

The zero-one inflated beta regression is an extension of the beta regression, which, in addition to handling variables that are bounded between 0 and 1 (as the beta regression does), deals with data that have many data points at or near 0 or 1 (see Ospina & Ferrari, 2012, for a review of a general class of zero-or-one inflated beta models). It models a dependent variable with a beta distribution, a continuous probability distribution defined on the interval between 0 and 1. It takes two shape parameters, $\upsilon$ and $\omega$, whose relative values determine the shape of the distribution. The mean of the distribution, 𝜇, can be found as 𝜇 = Y