SYNTACTIC COMPLEXITY AS A PREDICTOR OF SECOND LANGUAGE WRITING PROFICIENCY AND WRITING QUALITY

By

Ji-Hyun Park

A DISSERTATION

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of Second Language Studies—Doctor of Philosophy

2017

ABSTRACT

SYNTACTIC COMPLEXITY AS A PREDICTOR OF SECOND LANGUAGE WRITING PROFICIENCY AND WRITING QUALITY

By

Ji-Hyun Park

Syntactic (i.e., grammatical) complexity refers to the range and the degree of sophistication of the forms that appear in language production (Ortega, 2003). This concept has long been regarded as an important construct of language proficiency and has been actively investigated in the field of second language (L2) writing. Syntactic complexity is multidimensional in nature, and a variety of measures tap into different dimensions of the construct. Widely used measures of complexity (e.g., mean length of T-unit and the number of clauses per T-unit) capture a relative degree of sophistication, but they do not measure participants' command of a range of diverse syntactic structures. In contrast, in L2 assessment, grammatical knowledge is often evaluated in terms of both syntactic elaboration and structural variety (Rimmer, 2006). To address this gap, the present study proposes new measures that tap into the diversity dimension of syntactic complexity: types and type/token frequency of verb-argument constructions (VACs).

The present study investigates whether the proposed diversity measures of syntactic complexity, in combination with currently used measures of elaboration, accurately predict L2 written proficiency and writing quality. The specific research questions that guide the study are as follows: (1) Does the syntactic complexity of Korean EFL learners' writing production, as measured by various quantitative complexity measures, function as an indicator of proficiency? In addition, does adding diversity measures increase the predictive power of syntactic complexity in discriminating proficiency levels? (2) How do different syntactic complexity measures relate to subjective ratings of writing quality judged by human raters? Which measure(s) best predict writing quality? (3) How do raters interpret the notion of syntactic complexity that appears on the Language Use scale of a given analytic writing rubric?

Essays were collected from 390 Korean EFL learners and analyzed using corpus analytic tools. Fourteen elaboration measures were calculated using the Syntactic Complexity Analyzer, an automated computational tool developed by Lu (2010). For the diversity measures, all instances of VACs in the participants' essays were retrieved and analyzed using a part-of-speech tagging tool and a concordance tool. Thirteen VAC patterns (e.g., verb + direct object, verb + indirect object + direct object, verb + direct object + object predicative) and their sub-patterns were identified based on findings in usage-based approaches to grammar, namely, construction grammar and corpus-based descriptive grammar. Then the distribution and the number of VAC types and the type/token frequency of VACs were examined. Participants' proficiency levels were independently measured by a cloze test, and the quality of their essays was evaluated by human raters.

The empirical results of the study indicated that measures of syntactic complexity functioned as predictors that discriminated among different proficiency levels, and that adding diversity measures of complexity increased the predictive power.
The diversity measures were also found to be strong predictors of human-rated writing quality, which lends support to the use of the diversity measures in this area of research. Qualitative data obtained from the rater interviews showed that notions of grammatical complexity as interpreted by raters generally overlap with the notion of syntactic complexity in SLA. However, variability was found in the interpretations across raters.

Copyright by
JI-HYUN PARK
2017

ACKNOWLEDGMENTS

First of all, I would like to sincerely thank my advisor, Dr. Charlene Polio, for her guidance during my years in the SLS program. I am very thankful for her extraordinary support and encouragement throughout the dissertation project. My special thanks go to the committee members, Dr. Susan Gass, Dr. Patti Spinner, Dr. Paula Winke, and Dr. Aline Godfroid, for their time reading my dissertation and their helpful comments.

I also want to thank my advisors at Seoul National University, Professors Hyun-Kwon Yang and Hyunkee Ahn, for being great mentors through my first years of research. Their commitment and dedication to teaching and research have always been an inspiration to me.

This dissertation would not have been possible without the help of a number of people: the instructors in Korea who helped with data collection, the students who participated in the study, and the raters who read the student essays. I also gratefully acknowledge the financial support from a Dissertation Completion Fellowship from the College of Arts and Letters and a Special College Research Abroad Money award from the Graduate School at Michigan State University.

I also need to thank my colleagues and friends who helped make my time in the doctoral program more bearable and enjoyable. Special thanks go to Ina Choi, Yaqiong Cui, Talip Gonulal, Lorena Valmori, Susie Kim, and Unhee Ju. I also thank Magda Tigchelaar for reading my dissertation.

Last but not least, I am grateful to my family for their love and support. I thank my parents and parents-in-law, who have always believed in me and supported my decisions in every way. My most heartfelt gratitude goes to my husband, Jongwon Shin. Without his encouragement, care, and unfailing love, this work would not have been completed.
v TABLE OF CONTENTS LIST OF TABLES…………………………………………………………………...……...…...ix LIST OF FIGURES……………………………………………………………...……...……....xi INTRODUCTION…………………………………………………………………...……...….…1 CHAPTER 1: BACKGROUND……….…………………………………………………….……3 1.1 Complexity in SLA research…………………………………………………….….……3 1.2 Measures of syntactic complexity in SLA research…………………………..…….……6 1.2.1 Review of syntactic complexity measures in L2 writing studies (2009-2016)....… 12 1.3 Syntactic complexity and L2 writing………………………….………………………..20 1.3.1 Complexity measures as performance descriptors………………….……………...20 1.3.1.1 Effects of a pedagogical intervention……………………………………21 1.3.1.2 Effects of task- and genre-related variation……………………………...22 1.3.1.3 Effects of first language (L1) ……………………………………………23 1.3.2 Complexity measures as indices of development and proficiency………………...24 1.3.2.1 Development in writing over time……………………………………….24 1.3.2.2 Texts written by learners across proficiency levels……………………...27 1.4 Grammatical complexity: L2 assessment……………………………………………….28 1.4.1 Assessment of grammar performance …………………………………………….28 1.4.2 Relationship between syntactic complexity measures and human ratings………...32 1.5 Summary………………………………………………………………………………..34 1.6 Proposed measures of syntactic complexity to capture syntactic diversity/ variety.…...35 CHAPTER 2: THE CURRENT STUDY………………………………………………………..40 2.1 Research questions and hypotheses……………………………………………………..40 2.2 Participants…………………………………………..……………………………….…43 2.2.1 Korean learners of English………………………..…………………..……………43 2.2.2 Raters…………………………………………………………………………..…..45 2.3 Instruments……………………………………………………………………………...46 2.3.1 Writing tasks…………………………………………………………………….....46 2.3.2 English proficiency test (C-test) …………………………………………………..47 2.3.3 Language learning background questionnaire……………………………………..49 2.3.4 Rater background questionnaire…………………………………………….……..50 2.3.5 Rating rubric……………………………………………………………………….50 2.4 Procedures………………………………………………………………………………50 2.4.1 Korean learners of English……………………………………………….………..50 2.4.2 Raters………………………………………………………………………………51 2.5 Data analysis…………………………………………………………………………….53 2.5.1 Quantitative analysis………………………………………………..……………...53 2.5.1.1 Proficiency test………………………………………….………………..53 vi 2.5.1.2 Subjective ratings…………………………………………….…………..54 2.5.1.3 Syntactic complexity: Elaboration measures………………...…………..54 2.5.1.4 Syntactic complexity: Diversity measures………………......…………..55 2.5.2 Qualitative analysis………………………………………………………….…….58 2.6 Statistical analysis…………………………………….………………….……………..59 CHAPTER 3: RESULTS………………………………………………………………………...61 3.1 Preliminary results………………………………………………………………….…...61 3.1.1 English proficiency test (C-test) and proficiency level placement………….……..61 3.1.2 Subjective ratings on essays……………………………………………………….62 3.1.3 Relationship between proficiency test scores and subjective ratings…….………..63 3.2 ANOVAs and discriminant function analyses (DFAs): Research question 1…………..64 3.2.1 ANOVAs…………………………………………………………………………...64 3.2.2 DFAs……………………………………………………………..………………..66 3.2.2.1 Variable selection…………………………………………….…………..67 3.2.2.2 Discriminant function analyses: Elaboration and diversity measures…...68 3.3 Correlation and regression analyses: Research question 2……………………….…….77 3.3.1 Correlations…………………………………………………………………….…77 3.3.2 Regression analyses……………………………………………………………….78 3.3.2.1 Variable selection……………………………………………………….78 3.3.2.2 Relationship between syntactic complexity indices and Total score…...79 3.3.2.3 Relationship between syntactic 
complexity indices and Language Use score…………………………………………………………………………….81 3.4 Rater interview results: Research question 3………………………………….………..81 3.4.1 Overall rating process……………………………………………………………...81 3.4.1.1 Rating sequence………………………………………………………….81 3.4.1.2 Lack of information provided by the rating scale………………………..82 3.4.2 Rating process for Language Use………………………………………………….84 3.4.2.1 Balancing between accuracy and complexity……………………………85 3.4.2.2 Criteria not specified in the rubric……………………………………….87 3.4.3 Perceptions of the language use section of the rubric…………………………..….88 3.4.3.1 Tension between accuracy and complexity……………………………...88 3.4.3.2 Overlap with other categories of the rubric……………………………...89 3.4.3.3 Vagueness of descriptors………………………………………………...90 CHAPTER 4: DISCUSSION…………………………………………………………………….95 4.1 Research question 1: Syntactic complexity and proficiency……………………………95 4.2 Research question 2: Grammatical complexity and writing quality……………………100 4.3 Research question 3: Human raters’ perceptions of the Language Use section of an analytic rubric…………………………………………………………………….………..106 CHAPTER 5: CONCLUSION…………………………………………………………………112 5.1 Summary of findings…………………………………………………………………..112 5.2 Implications……………………………………………………………………………114 5.2.1 Research implications…………………………………………………………….114 5.2.2 Practical implications…………………………………………………………….115 vii 5.3 Limitations and future research………………………………………………………..116 APPENDICES………………………………………………………………………….………118 Appendix A English Proficiency Test (C-test)……….…………………....……….….…..119 Appendix B Table 24 C-test: Item Facilities and Item Discriminations……………….…..121 Appendix C Language Learning Background Questionnaire (for college students)…...….123 Appendix D Language Learning Background Questionnaire (for high school students)….124 Appendix E Language Learning Background Questionnaire in Korean (for college students)…………………………………………………………………………………....125 Appendix F Language Learning Background Questionnaire in Korean (for high school students)………………………………………………….…………………………...……126 Appendix G Rater Background Questionnaire ……………………….……….….…….…127 Appendix H Table 25 Rating rubric…………………….……….……..………….….……129 REFERENCES………………………………………………………………………....………130 viii LIST OF TABLES Table 1 Wolfe-Quintero et al.’s (1998) inventory of grammatical complexity measures…..…….7 Table 2 Inventory of (possible) grammatical complexity measures (adapted from Bulté & Housen, 2012 and Ortega, 2003)………………………………………………………….……...10 Table 3 Inventory of grammatical complexity measures in L2 writing studies (2009-2016)…..14 Table 4 References to syntactic complexity in rating scales for writing…………………………29 Table 5 Verb argument constructions…………...………………………………………………37 Table 6 Verb complementation types (Quirk et al., 1985)……………………………………...38 Table 7 Korean participants’ demographic and learning background…………………………..44 Table 8 Raters’ teaching and rating background………………………………………………..45 Table 9 C-test reliability……..………………………………………………………………….49 Table 10 Verb-argument structures…………………….……………………………………….56 Table 11 Descriptive statistics: C-test and subjective ratings on the writing task………………62 Table 12 Inter-rater reliability (ICCs)…………………………………………………...………63 Table 13 Proficiency-level effect on syntactic complexity measures (One-way ANOVAs)…...65 Table 14 Post-hoc pairwise comparisons (p values) between each proficiency level…………..66 Table 15 Bivariate correlations between syntactic complexity measures………………………68 Table 16 Relationship output for individual 
predictor variables and functions………………...70 Table 17 Group centroids……………………………………………………………………….71 Table 18 Prediction of group membership according to three discriminant analyses…………..75 Table 19 Correlations between syntactic complexity measures and subjective ratings………...78 Table 20 Multiple regression analyses: Model summary……………………………………….79 ix Table 21 Standard regression coefficients…………………………………………………….80 Table 22 Language Use section of the rubric…………………………………………………..85 Table 23 Raters’ interpretations of the descriptors in the rubric……………………………….92 Table 24 C-test: Item Facilities and Item Discriminations…………………….……..……….121 Table 25 Rating rubric…………………………….…………………………….……..……….129 x LIST OF FIGURES Figure 1 Cases and group centroids for two discriminant functions: 5 elaboration measures….72 Figure 2 Cases and group centroids for two discriminant functions: 2 diversity measures…….73 Figure 3 Cases and group centroids for two discriminant functions: 2 diversity measures…….74 xi INTRODUCTION What it means to be a proficient language user and how to describe and measure learners’ proficiency are two major questions that have been at the core of many studies in second language acquisition (SLA) and applied linguistics (Housen & Kuiken, 2009). There is now a shared belief among researchers and practitioners that second language (L2) proficiency, both oral and written, is a multidimensional rather than unitary construct. This multidimensionality has been captured by three constructs, namely, complexity, accuracy, and fluency (Ellis, 2003; Housen, Kuiken, & Vedder, 2012; Norris & Ortega, 2009; Skehan, 1998), which have become recognized as “principal and basic dimensions of L2 performance, proficiency, and development” (Bulté & Housen, 2014, p.13). Originating from L1 research and first introduced by Skehan (1998) in an L2 model, these three constructs have emerged as research variables in the field of SLA over the past 25 years (Housen & Kuiken, 2009). Among the three constructs, complexity—especially syntactic complexity (also called grammatical complexity)—has a long history in the research on L2 writing development (Biber, Gray, & Poonpon, 2011). Often defined as “the range of forms that surface in language production and the degree of sophistication of such forms” (Ortega, 2003, p. 492), complexity has been recognized as an important construct in L2 writing teaching and research. Researchers have assumed that learner language becomes more complex as learners progress and have viewed increased complexity as an indication of language development or proficiency. Accordingly, establishing and scrutinizing measures of syntactic complexity has become common. Developing objective methods to assess language proficiency has been one of the main goals in the L2 assessment field. Grammatical competence is one aspect of communicative 1 competence (Bachman, 1990; Canale & Swain, 1980) and is central to describing test-taker performance (Rimmer, 2006). In addition, in describing grammatical competence, grammatical complexity (complexity of form and structure) is considered to be crucial (Rimmer, 2006). For example, rubrics used to rate the speaking or writing performance of L2 learners (e.g., TOEFL writing rubrics, IELTS writing band descriptors) often illustrate the use of a variety of syntactic structures or sentence forms as a measure of test-takers’ language use. 
Although research in both SLA and L2 assessment pursue a similar goal, few attempts have been made to compare and contrast how the construct of syntactic (grammatical) complexity is interpreted and operationalized in each field. In the present study, I attempt to build a connection between the two fields so that findings and practices in these areas can inform each other. Specifically, this study aims to study how syntactic or grammatical complexity has been operationalized in each field, critically review the measures of complexity and examine the relationship between measures in the two fields, and propose new measures to fill the gap. In addition, I investigate whether the proposed measures, together with conventional complexity measures that have been used in SLA, can be indicative of L2 writing proficiency and writing quality as judged by human raters. 2 CHAPTER 1: BACKGROUND In this chapter, I first examine how syntactic complexity has been defined in SLA and L2 writing research. This is followed by a summary of syntactic complexity measures that have been frequently used in SLA. I then introduce previous studies that employed these measures to identify how and for what purposes these measures have been used in SLA. I focus in particular on studies that investigated the relationship between syntactic complexity and L2 development and proficiency levels. The following section contains a review of studies on L2 assessment. I examine how grammatical development has been viewed and measured, and how the test-takers’ performance is interpreted in relation to these measures. Then I review studies that investigate the link between the syntactic complexity measures used in the SLA field and writing performance assessed by human raters. I note that currently used measures do not capture the diversity dimension of complexity and that a mismatch exists between the interpretations of the construct in SLA and in L2 assessment. At the end of this chapter, I propose new measures of syntactic complexity, informed by findings in usage-based linguistics, in an attempt to fill the gap. 1.1 Complexity in SLA research Research on complexity and complex systems has flourished since the 1990s in various disciplines such as the natural, social, and psychological sciences as well as language sciences (Bulté & Housen, 2014). Although no consensus has yet emerged on the definition of complexity, this construct is commonly understood across the disciplines as a property or entity in terms of “(1) the number and the nature of the discrete components that the entity consists of, and (2) the number and the nature of the relationship between the constituent components” (Bulté & Housen, 2012, p.22). For example, in the language sciences, including SLA and 3 applied linguistics, complexity is often defined in terms of the number and the nature of language components and the combinations thereof, as reflected in some traditional working definition of complexity such as “using a wide range of structures and vocabulary” (Lennon, 1990, p.390) or “[t]he extent to which the language produced in performing a task is elaborate and varied” (Ellis, 2003, p.340). Dictionaries define complexity as “(1) the quality or state of not being simple: the quality or state of being complex; (2) a part of something that is complicated or hard to understand (Merriam-Webster Dictionary).” In the field of SLA, researchers have acknowledged these two meanings of complexity by distinguishing absolute and relative complexity (Pallotti, 2015). 
This distinction is also referred to as objective and subjective. Absolute or objective complexity refers to learner-independent linguistic properties, while relative or subjective complexity is a language-user or learner-dependent concept related to learners’ cognitive abilities. Bulté and Housen used the term (cognitive) difficulty to refer to the latter concept (subjective or relative complexity) and reserved the term complexity for L2 linguistic complexity. Many researchers have pointed out the difficulty in defining complexity in SLA studies (Housen & Kuiken, 2009; Pallotti, 2015; Vyatkina, Hirschmann & Golcher, 2015). Several reasons account for this. First, the term complexity has been used to refer to both features of a communicative task that learners perform (task complexity) and language produced by learners in the field (L2 complexity). L2 complexity can be, again, interpreted as either absolute (also called objective) complexity or relative complexity (subjective complexity, cognitive complexity, or difficulty), as described above. In addition, complexity can be observed in various language subsystems such as vocabulary, morphology and syntax, which makes it hard to treat as a single construct. 4 Researchers have often failed to capture the complex and multi-faceted nature of the construct when defining complexity, and used very general and vague terms in defining and operationalizing complexity (Bulté & Housen, 2012). Recently, in an attempt to advance the understanding of the construct, several researchers have tried to describe complexity from a more comprehensive and systematic perspective (e.g., Bulté & Housen 2012; Norris & Ortega, 2009; Ortega, 2012; Pallotti, 2009). One of the most recent attempts to conceptualize the notion of complexity in SLA is the work by Bulté and Housen (2012), who classified components of complexity at several levels. In their taxonomic model of L2 complexity, the authors first distinguished difficulty from complexity, and further categorized complexity. In a broad sense, (L2) complexity consists of linguistic, discourse-interactional, and propositional complexity, the latter two of which have received relatively less attention in SLA studies. Linguistic complexity can be approached either globally (system complexity) or at the level of local structures (structure complexity). System complexity refers to the linguistic repertoire that learners have in their L2 system. In other words, it involves the range, diversity or variety of different structures that learners use. Structure complexity refers to the depth or sophistication of individual structures, either in a formal or functional sense. Both system and structure complexity can be evaluated at different domains of language: lexis, morphology, syntax and phonology, and subdomains of each (see Bulté and Housen, 2012, for more discussion of the model.) In SLA studies, L2 complexity often refers to linguistic complexity, and lexical and syntactic complexity have been studied as two of its major components. While acknowledging the meaning of complexity in a broad sense and the various components of the construct, the present study focuses on syntactic complexity. Syntactic complexity is also called grammatical complexity in the literature. 
Grammatical complexity is sometimes interpreted in a broader sense that involves not only syntactic but also morphological and phonological complexity (Bulté & Housen, 2012), but the two terms (syntactic and grammatical complexity) are often used interchangeably because morphological and phonological complexity have rarely been investigated in SLA research. The following are some definitions of syntactic or grammatical complexity used in the previous L2 literature: "progressively more elaborate language" and "a greater variety of syntactic patterning" (Foster & Skehan, 1996, p. 303); "a wide variety of both basic and sophisticated structures are available and can be accessed quickly" (Wolfe-Quintero, Inagaki & Kim, 1998, p. 69); and "the range of forms that surface in language production and the degree of sophistication of such forms" (Ortega, 2003, p. 492). As evident in these definitions, previous researchers have related syntactic complexity to the forms of linguistic structures and have understood the construct in terms of (1) the range or variety and (2) the degree of elaborateness of those structures. In terms of Bulté and Housen's taxonomic model, the scope of these definitions covers syntactic complexity in a formal sense, encompassing both system and structure complexity. The present study is also concerned with complexity in this sense.

1.2 Measures of syntactic complexity in SLA research

Bulté and Housen (2012) proposed that the construct of linguistic complexity be examined at three levels. First, researchers need to establish what the construct is at the theoretical level. Then, researchers can consider how the construct is observable in language performance at the observational level. Finally, they address quantifiable measures of performance at the lowest, operational level. As mentioned above, the present study concerns grammatical complexity (focusing on syntactic complexity) in both systemic and structural senses, which are observed through grammatical diversity and sophistication, respectively. In the rest of this section, I review how the construct has been operationalized through quantitative measures in the SLA and L2 writing literature. I begin by introducing measures reviewed in three research syntheses (Bulté & Housen, 2012; Ortega, 2003; Wolfe-Quintero, Inagaki, & Kim, 1998).

Wolfe-Quintero et al. comprehensively reviewed measures of grammatical complexity and explored the relationship between the measures and second language development in writing. They examined 32 studies on L2 writing published between 1974 and 1996 and categorized the grammatical complexity measures used in these studies into three types: frequencies, ratios, and indices (see Table 1).¹

Table 1
Wolfe-Quintero et al.'s (1998) inventory of grammatical complexity measures

Frequencies
Reduced clauses, Preposed adjectives, Dependent clauses, Pronouns, Passives, Articles, Passive sentences, Connectors, Adverbial clauses, Transitional connectors, Adjective clauses, Subordinating connectors, Nominal clauses, Coordinating connectors, Prepositional phrases

Note. *originally categorized as fluency measures by the authors; # = number

¹ Although the authors classified length-based measures such as clause, sentence, and T-unit length as fluency measures, I have included them as complexity measures, following the more conventional view that length-based measures address syntactic complexity (Ortega, 2003).
Table 1 (cont'd)

Ratios (Measure: Formula)
Clause length (MLC)*: # of words / # of clauses
Sentence length (MLS)*: # of words / # of sentences
T-unit length (MLT)*: # of words / # of T-units
T-unit complexity ratio (C/T): # of clauses / # of T-units
Sentence complexity ratio (C/S): # of clauses / # of sentences
Clauses per error-free T-unit (C/EFT): # of clauses / # of error-free T-units
Dependent clause ratio (DC/C): # of dependent clauses / # of clauses
Dependent clauses per T-unit (DC/T): # of dependent clauses / # of T-units
Adverbial clauses per T-unit (AdvC/T): # of adverbial clauses / # of T-units
Complex T-unit ratio (CT/T): # of complex T-units / # of T-units
Sentence coordination ratio (T/S): # of T-units / # of sentences
Coordinate clauses per T-unit (CC/T): # of coordinate clauses / # of T-units
Coordinate phrases per T-unit (CP/T): # of phrases with coordinators / # of T-units
Dependent infinitives per T-unit (DI/T): # of dependent infinitives / # of T-units
Complex nominals per T-unit (CN/T): # of nominals / # of T-units
Passives per T-unit (P/T): # of passives / # of T-units
Passives per clause (P/C): # of passives / # of clauses
Passives per sentence (P/S): # of passives / # of sentences

Indices (Measure: Formula)
Coordination index: # of independent clause coordination / # of combined clauses
Complexity formula: score of weighted structures / # of sentences
Complexity index: sum of T-unit scores / # of T-units

Note. *originally categorized as fluency measures by the authors; # = number

Measures that take the form of frequency simply count the number of specific structures. In ratio measures, the occurrence, frequency, or length of one type of unit is expressed in relation to another type of unit. For example, some may count the number of occurrences of passive structures in a sample essay, while others may count how many times the structure occurs per sentence. The former exemplifies a frequency measure, and the latter represents a ratio measure. Some of the commonly used base units for ratio measures are clauses, T-units, and sentences. These base units are defined as follows. Clauses refer to a "structure with a subject and a finite verb" (Hunt, 1965, p. 15) and are of various types such as independent clauses, main clauses, and adjective, adverbial, and nominal clauses (Cooper, 1976; Hunt, 1965). The last three types are dependent clauses, which are "instances of relativization and subordination" (Homburg, 1984, p. 92). T-units consist of a main clause and "any subordinate clause or non-clausal structure that is attached to or embedded" (Hunt, 1970, p. 189). Lastly, sentences are defined as "a group of words delimited with a punctuation mark" (Wolfe-Quintero et al., 1998, p. 84). In index measures, various structures are weighted based on their syntactic complexity. For example, in the complexity formula measure, different scores (0, 1, and 2) are assigned to grammatical structures according to their complexity or difficulty.

Synthesizing the results of the studies, Wolfe-Quintero and her colleagues concluded that T-unit length (MLT), clause length (MLC), the T-unit complexity ratio (C/T), the dependent clause ratio (DC/C), and dependent clauses per T-unit (DC/T) were the best measures for L2 writing development.
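To make the arithmetic behind these ratio measures concrete, the minimal sketch below computes three of the recommended indices (MLT, C/T, and DC/C) from raw counts of words, clauses, dependent clauses, and T-units. It is an illustration only: the function name and the counts shown are hypothetical, and automated tools such as Lu's (2010) Syntactic Complexity Analyzer compute these and related indices directly from parsed text.

```python
# Illustrative sketch (not the analyzer used in this study): compute three
# ratio measures from Table 1, given raw unit counts for one essay.

def ratio_measures(words: int, clauses: int, dependent_clauses: int, t_units: int) -> dict:
    """Return MLT, C/T, and DC/C for a single text, given unit counts."""
    return {
        "MLT": words / t_units,               # mean length of T-unit
        "C/T": clauses / t_units,             # clauses per T-unit
        "DC/C": dependent_clauses / clauses,  # dependent clauses per clause
    }

# Hypothetical counts for a 250-word essay with 30 clauses,
# 12 of them dependent, distributed over 20 T-units.
print(ratio_measures(words=250, clauses=30, dependent_clauses=12, t_units=20))
# {'MLT': 12.5, 'C/T': 1.5, 'DC/C': 0.4}
```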
Ortega (2003) and Bulté and Housen (2012) performed research syntheses similar to that of Wolfe-Quintero et al. Ortega reviewed 21 cross-sectional studies and five longitudinal studies on college-level L2 writing. More recently, Bulté and Housen (2012) reviewed 40 task-based L2 learning studies (published between 1995 and 2008). Five of the studies (Ellis & Yuan, 2004, 2005; Ishikawa, 2007; Révész, 2008; Storch & Wigglesworth, 2007) investigated learners' performance in written tasks. They classified syntactic complexity measures into overall measures, measures at the sentential, clausal, and phrasal levels, and frequency measures of specific structures. The inventory of measures identified in the two syntheses is presented in Table 2.

Table 2
Inventory of (possible) grammatical complexity measures (adapted from Bulté & Housen, 2012, and Ortega, 2003)

Overall
Mean length of T-unit (MLT), Mean length of C-unit, Mean length of turn, Mean length of AS-unit, Mean length of utterance, Mean length of sentence (MLS), S-nodes/ T-unit, S-nodes/ AS-unit

Sentential—Coordination
Coordinated clauses/ Clauses, T-units/ Sentences (T/S)

Sentential—Subordination
Clauses/ AS-unit, Clauses/ C-unit, Clauses/ T-unit (C/T), Dependent clauses/ Clause (DC/C), # of subordinated clauses, Subordinate clauses/ Clauses (SC/C), Subordinate clauses/ Dependent clauses (SC/DC), Subordinate clauses/ T-unit (SC/T), Relative clauses/ T-unit (RC/T), Verb phrases/ T-unit (VP/T)

Subsentential (Clausal + Phrasal)
Mean length of clause (MLC), S-nodes/ Clause

Clausal
Syntactic arguments/ Clause*

Phrasal
Dependents/ (noun, verb) phrase*

Other (± syntactic sophistication)
Frequency of passive forms, Frequency of infinitival phrases, Frequency of conjoined forms, Frequency of Wh-clauses, Frequency of imperatives, Frequency of auxiliaries, Frequency of comparatives, Frequency of conditionals

Note. * indicates possible measures that have not been used in the literature. # = number

According to Ortega (2003), the six most frequently used measures were sentence length (MLS), MLT, MLC, T-units per sentence (T/S), C/T, and DC/C. Among the five studies on L2 writing performance investigated by Bulté and Housen (2012), C/T was the most popular measure, followed by MLT and MLC: clauses per T-unit (C/T) was employed in four studies, and MLT and MLC were employed in two studies. Other measures used in these studies were the frequency of passive forms and several subordination measures such as DC/C and subordinate clauses per clause (SC/C), per dependent clause (SC/DC), and per T-unit (SC/T). Overall, the studies reviewed in these two research syntheses used predominantly length-based measures and measures of the amount of subordination, while uses of other measures were limited.

Bulté and Housen (2012) pointed out potential problems related to this trend. First, length-based measures such as MLT and MLS can be elevated in many different ways, for example, through the addition of another clause via coordination or subordination, or of another nominal, adjectival, or adverbial phrase. Therefore, these measures can only capture overall or generic syntactic complexity. Subordination measures also have limitations, but in a different sense. They only tap into complexity at the sentential level and, thus, may not capture the full trajectory of L2 development. In addition, researchers have not accounted for different types of subordination. For example, a complement clause following a verb (as in "I think that…") and more difficult structures such as an object relative clause have not been treated separately in the literature.
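A finer-grained treatment of subordination can, in principle, be approximated with off-the-shelf NLP tools. The sketch below is my own illustration (not a tool used in the cited studies or in the present study): it uses spaCy dependency labels to tally complement, relative, and adverbial clauses separately instead of collapsing them into a single dependent-clause count. The label-to-clause-type mapping is an assumption, and parser accuracy on learner texts would require manual validation.

```python
# Illustrative sketch: separate counts for different finite subordinate clause
# types, using spaCy dependency labels (requires the en_core_web_sm model).
import spacy
from collections import Counter

nlp = spacy.load("en_core_web_sm")

# Assumed mapping from dependency labels to clause types (for illustration only).
CLAUSE_TYPES = {
    "ccomp": "complement clause",  # e.g., "I think that she left."
    "relcl": "relative clause",    # e.g., "the book that she wrote"
    "advcl": "adverbial clause",   # e.g., "She left because it was late."
}

def subordinate_clause_profile(text: str) -> Counter:
    """Count finite subordinate clauses by type in a learner text."""
    doc = nlp(text)
    return Counter(CLAUSE_TYPES[tok.dep_] for tok in doc if tok.dep_ in CLAUSE_TYPES)

sample = ("I think that the essay is effective because the writer, "
          "who studied abroad, uses sentences that vary in length.")
print(subordinate_clause_profile(sample))
```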
These problems were also identified by Norris and Ortega (2009), who called for the use of more specific measures of coordination and phrasal complexity as well as global measures. They also argued for the use of multiple measures that tap into multiple dimensions of complexity. However, according to Bulté and Housen (2012), few researchers have employed multiple measures in a single study. To summarize, many of the measures whose validities were confirmed by WolfeQuintero et al. (1998) have been popularly used in SLA studies in recent years. These measures include global length-based measures (e.g., MLT, MLS) and measures of subordination (e.g., C/T, DC/C, and DC/T). Some new measures that attend to specific structures such as relative clauses and infinitival clauses have emerged. However, measures are still lacking for the examination of complexity at the clausal and phrasal level. 1.2.1 Review of syntactic complexity measures in L2 writing studies (2009-2016) The inventory of measures introduced in the previous analyses covers most of the measures employed in L2 writing studies published until 2008. In order to see a more recent trend in the field, I searched for measures of syntactic complexity used in 27 empirical studies on L2 writing published after 2009. The categorization of measures followed the previous reviews 12 (Bulté and Housen, 2012; Ortega, 2003; Wolfe-Quintero et al., 1998). The results are summarized in Table 3. This inventory shows that ratio measures are still used more frequently than frequency measures. About half of the studies employed at least one measure of overall complexity such as MLS or MLT (14 studies) and a subordination measure such as C/T, DC/T or DC/C (16 studies). Such a prevalence of ratio measures is understandable considering that frequency measures are affected by text length, which makes them less valid than objective measures, as Wolfe-Quintero et al. (1998) pointed out. Some researchers overcame this disadvantage of frequency measures by using normed frequencies or relative frequencies (e.g., Spoelman & Verspoor, 2010; Verspoor, Schmid, & Xu, 2012). Index measures were rarely used. One of the noticeable trends was the increased use of specific measures. Amount of coordination was investigated both at the sentential and clausal levels. In addition, many researchers tried to capture complexity at the phrasal level, especially for nominal phrases. For example, Bulté and Housen (2014) and Spoelman and Verspoor (2010) calculated the mean length of noun phrases. Some researchers looked into the occurrence of complex nominals per T-unit (CN/T) or per clause (CN/C). Crossley and McNamara (2011, 2014) and Guo, Crossley, and McNamara (2013) indirectly calculated the length of nominals in subject positions by measuring the mean number of words before the main verb. 13 Table 3 Inventory of grammatical complexity measures in L2 writing studies (2009-2016) Measure Study Frequencies Subordination # of embedded (dependent, Spoelman & Verspoor (2010) subordinate) clauses Guo, Crossley & McNamara (2013) Normalized subordinating Vyatkina (2012) conjunctions per 100 words Coordination Normalized coordinating Vyatkina (2012) conjunctions per 100 words Specific # of verb phrases Crossley & McNamara (2014) structures Part of speech (POS) tags Guo, Crossley & McNamara (2013) Incidence of negation, Crossley & McNamara (2014) prepositional phrases, subject relative clauses, that verb complements, S-bars, and infinitives Normed frequencies of 78 Asención-Delaney et al. 
(2011) grammatical features (e.g., different types of nouns, adjectives, and verbs, etc.) Syntactic similarity (measured by Crossley & McNamara (2011, 2014) the uniformity and consistency of Guo, Crossley & McNamara (2013) syntactic constructions in the text, Mazgutova & Kormos (2015) using phrasal and syntactic categories) Frequencies of modifiers Vyatkina et al. (2015) Distribution of types of sentences Spoelman & Verspoor (2010) (fragment, simple, compound, Verspoor, Schmid, & Xu. (2012) complex, compound-complex) Note. # = number 14 Table 3 (cont’d) Measure Study Frequencies Specific Distribution of types of DC Verspoor, Schmid, & Xu. (2012) structures (finite-adverbial, nominal, relative vs. nonfinite) Distribution of types of VP Verspoor, Schmid, & Xu. (2012) constructions (present, tense, past tense, present perfect, etc.) Types of NPs Crossley & McNamara (2014) Mean length of sentence (MLS) Ai & Lu (2013) Ratios Overall Bulté & Housen (2014) Lu (2011) Vyatkina (2012) Yoon & Polio (2016) Mean length of T-unit (MLT) Ai & Lu (2013) Bulté & Housen (2014) Danzak (2011) Gyllastad et al. (2014) Lu (2011) Mazgutova & Kormos (2015) Verspoor, Schmid, & Xu (2012) Yoon & Polio (2016) Length of production unit Ai & Lu (2013) Mean # of high-level constituents Crossley & McNamara (2011) (sentences and embedded Guo, Crossley & McNamara (2013) sentence constituents) per words in sentences Note. # = number 15 Table 3 (cont’d) Measure Study Ratios Sentential- Ratio of finite verb units per Vyatkina (2012) Subordination sentence (VP/S) & coordination Clauses per sentence (C/S) Lu (2011) Simple sentence ratio (SSR) Bulté & Housen (2014) Compound sentence ratio (CdSR) Bulté & Housen (2014) Complex sentence ratio (CxSR) Bulté & Housen (2014) Compound-complex sentence Bulté & Housen (2014) ratio (CdCxSR) Sentential- Clauses per T-unit (C/T) Benevento & Storch (2011) Subordination Larsen-Freeman (2006) Llanes & Munoz (2013) Serrano, Llanes & Tragant (2011) Serrano, Tragant & Llanes (2012) Storch (2009) Lu (2011) Yoon & Polio (2016) # of embedded subordinate Ai & Lu (2013) clauses per T-unit (SC/T): Danzak (2011) Dependent clauses per T-unit Frear & Bitchener (2015) (DC/ T) Gyllstad et al. (2014) Guo, Crossley & McNamara (2013); Storch (2009) Lu (2011) Mazgutova & Kormos (2015) Yoon & Polio (2016) Note. # = number 16 Table 3 (cont’d) Measure Study Ratios Sentential- Adjectival DC/T Frear & Bitchener (2015) Subordination Nominal DC/T Frear & Bitchener (2015) Adverbial DC/T Frear & Bitchener (2015) Complex T-units per T-unit Lu (2011) (CT/T) Dependent clauses per clause Ai & Lu (2013) (DC/C) Lu (2011) Yoon & Polio (2016) Sentence structure: proportion of Norrby & Hakansson (2007) subordinate clauses: Subclause Bulté & Housen (2014) ratio (SCR) Sentential- Verb phrases (VPs) per T-unit Lu (2011) (VP/T) Yoon & Polio (2016) T-units per sentence (T/S) Ai & Lu (2013) Coordination Lu (2011) Coordinate clause ratio (CCR) Bulté & Housen (2014) Subsentential Words per finite verb-unit Vyatkina (2012) (Clausal Mean length of finite clause Bulté & Housen (2014) +Phrasal) (MLCfin) Mean length of clause (MLC) Ai & Lu (2013) Gyllastad et al. (2014) Lu (2011) Vyatkina (2013) Yoon & Polio (2016) Clausal- Coordinate phrases per clause Ai & Lu (2013) Coordination (CP/C) Lu (2011) Vyatkina (2013) Note. 
# = number 17 Table 3 (cont’d) Measure Study Ratios Clausal- Coordinate phrases per T-unit Ai & Lu (2013) Coordination (CP/T) Lu (2011) Phrasal Mean length of noun phrase Bulté & Housen (2014) (MLNP) Spoelman & Verspoor (2010) Complex nominals per clause Ai & Lu (2013) (CN/C) Lu (2011) Vyatkina (2013) Yoon & Polio (2016) Complex nominals per T-unit Ai & Lu (2013) (CN/T) Lu (2011) Yoon & Polio (2016) Nonfinite VP per clause Vyatkina (2012) Mean # of words before the main Crossley & McNamara (2011, 2014) verb Guo, Crossley & McNamara (2013) Mean # of complex nominals in Mazgutova & Kormos (2015) subject position # of modifiers per NP Crossley & McNamara (2014) Guo, Crossley & McNamara (2013) Mazgutova & Kormos, 2015 Note. # = number In addition, most researchers tended to employ more than one measure in their studies, though some instructed SLA researchers still employed one representative measure of complexity (e.g., Frear & Bitchener, 2015; Mazgutova & Kormos, 2015). Employing more than one measure is a desirable trend because complexity is a multidimensional concept that can only be captured by multiple measures. This effort seems to have been accelerated by advances in technology. Many previous studies focused on a small number of measures or analyzed small 18 amounts of data due to the labor-intensiveness of manual analyses (Lu, 2011). Researchers can now automatically compute a variety of syntactic complexity measures (partially) by using recently-developed tools such as computerized profiling (Long, Fey & Channell, 2008), CohMetrix (e.g., Guo, Crossley & McNamara, 2013), D-Level Analyzer (Lu, 2009) or Syntactic Complexity Analyzer (Lu, 2010). For example, several studies reviewed above (Ai & Lu, 2013; Lu, 2011; Yoon & Polio, 2016) used Syntactic Complexity Analyzer to automatically compute a number of syntactic complexity measures that have been popularly used in L2 development studies. However, Norris and Ortega (2009) also cautioned that care should be taken when employing more than one measure due to a potential problem of redundancy. Some measures tap into almost identical characteristics of texts even though they look different. For example, C/T and DC/T measure the same trait. MLT and MLS are also quite similar. These measures are likely to be highly correlated with each other, which in turn may violate assumptions for multivariate statistical analyses (Norris & Ortega, 2009). Finally, syntactic complexity is often defined in terms of the range and the degree of elaborateness of syntactic structures. Widely used measures of syntactic complexity mostly capture the degree of elaboration (by using length measures and subordination measures) but give less attention to the degree of variation (in other words, diversity), though this dimension has been widely investigated in terms of lexical complexity. Although Norris and Ortega reported signs of researchers’ interest in measuring complexity as structural diversity in their research synthesis published in 2009, diversity of syntactic structures seems to remain a relatively infrequent concern compared to other dimensions of complexity. Among the studies published after 2009, I found only seven studies out of 27 in which researchers attended to the dimension of variety (Asención-Delaney et al., 2011; Crossley & McNamara, 2011, 2014; Guo, 19 Crossley & McNamara, 2013; Spoelman & Verspoor, 2010; Verspoor et al., 2012; Vyatkina et al., 2015). They did so by calculating either frequencies or distributions of various grammatical structures. 
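To illustrate what such a frequency- or distribution-based treatment of structural variety involves, the minimal sketch below (my own simplified example, not drawn from any of the studies cited) takes a list of structure labels that have already been coded for a single text and reports the distribution of types, the number of distinct types, and a simple type/token ratio; the label inventory shown is hypothetical.

```python
# Illustrative sketch: a distribution-based view of structural diversity.
# Assumes each sentence in a text has already been coded with a structure
# label (the labels below are hypothetical examples).
from collections import Counter

def structure_diversity(labels: list) -> dict:
    """Summarize the distribution of coded structure types in one text."""
    counts = Counter(labels)
    tokens = sum(counts.values())
    return {
        "distribution": dict(counts),         # frequency of each structure type
        "types": len(counts),                 # number of distinct types used
        "type_token_ratio": len(counts) / tokens,
    }

coded_text = ["simple", "simple", "compound", "complex",
              "simple", "complex", "compound-complex"]
print(structure_diversity(coded_text))
# {'distribution': {'simple': 3, 'compound': 1, 'complex': 2,
#  'compound-complex': 1}, 'types': 4, 'type_token_ratio': 0.571...}
```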
However, the selection of grammatical structures varied from study to study, and the researchers did not specify the rationale behind the inventory of grammatical structures they investigated; thus, the validity of these measures is still unexplored. 1.3 Syntactic complexity and L2 writing Complexity measures have been used in SLA research to describe L2 learners’ performance and measure their proficiency or progress in language learning (Housen & Kuiken, 2009). In L2 writing research specifically, syntactic complexity has been employed for the following purposes: “(a) to evaluate the effects of a pedagogical intervention on the development of grammar, writing ability, or both; (b) to investigate task-related variation in L2 writing; and (c) to assess differences in L2 texts written by learners across proficiency levels and over time” (Ortega, 2012, p.128). In the following two sections (1.3.1 and 1.3.2), I review studies on L2 writing that included syntactic complexity as a research variable, in accordance with these purposes. The first section reviews studies that investigated influences of external factors on writing performance or ability. The second section reviews the literature on syntactic complexity across proficiency levels or its change over time. 1.3.1 Complexity measures as performance descriptors Numerous researchers have investigated syntactic complexity as a way to assess the influences of learning conditions on L2 writing proficiency. Their work is mostly in the area of instructed SLA research. In these studies, the purpose of measuring complexity was to scrutinize “how and why language competencies develop for specific learners and target languages, in response to particular tasks, teaching, and other stimuli” (Norris & Ortega, 2009, p. 557). In 20 other words, researchers were interested in how language learners’ performance changes under different learning conditions, and complexity measures were employed as dependent variables to describe their performance. In addition to syntactic complexity, most studies also examined constructs such as accuracy, fluency, and global quality. In most cases, researchers employed one or two measures that represent each construct. 1.3.1.1 Effects of a pedagogical intervention Some researchers have measured the syntactic complexity of L2 learners’ written texts to investigate the effects of an intervention or learning context on their writing skill (e.g., Benevento & Storch, 2011; Casanave, 1994; Ishikawa, 1995; Serrano, Llanes, & Tragant, 2011; Serrano, Tragant, & Llanes, 2012; Shang, 2007; Stockwell & Harrington, 2003; Storch, 2009; Storch & Tapper, 2009). Shang (2007), for example, employed a pretest-posttest design to investigate whether EFL students benefit from practice in writing emails. He found that practice in writing and sending emails improved students’ overall sentence complexity as well as grammatical accuracy in subsequent email writing. Storch (2009), also employing a pretestposttest design, investigated the impact of studying in a L2-medium university over a semester on the writing of students. The results showed that students’ writing did not improve significantly in terms of syntactic complexity measured by C/T and DC/C, whereas analytic writing scores significantly increased by the end of the semester. Serrano, Llanes, and Tragant (2011) and Serrano, Tragant, and Llanes (2012) were interested in the effects of studying abroad. 
In the former study, the authors compared the effects of learning in an international and two domestic (intensive and semi-intensive) contexts. Syntactic complexity was measured by C/T and was not found to be different across the three learning contexts. The latter study tracked the English language development of 14 Spanish learners who studied at a UK university for over a 21 year. Students’ written production was examined three times over the year, and the authors found significant increase in C/T measure over time. Llanes and Munoz (2013) examined the effects of the learning context and its interaction with the age of learners. The authors found significant main effects of learning context and age as well as an interaction effect on syntactic complexity. In all these studies, the authors assumed that an increase in syntactic complexity reflected a learning gain, although several of these works did not find a significant increase in this construct. 1.3.1.2 Effects of task- and genre-related variation Researchers who were interested in task-based language learning (TBLL) investigated how task variations affected language learners’ performance. Some were interested in the relationship between task complexity and learners’ writing performance (e.g., Frear & Bitchener, 2015; Ishikawa, 2007; Kormos & Trebits, 2012; Kuiken & Vedder, 2007), while others examined effects of manipulating task conditions such as types of planning (e.g., Ellis & Yuan, 2004 ; Storch & Wigglesworth, 2007). These researchers employed syntactic complexity measures together with fluency and accuracy measures to describe learners’ language production. Ellis and Yuan (2004) examined how different task planning conditions influence the language that learners use to perform the task. They found that pre-task planning resulted in greater syntactic variety, while online planning contributed to higher accuracy. Frear and Bitchener (2015) replicated Kuiken and Vedder’s (2007) study, which investigated the relationship between cognitive task complexity and linguistic complexity, employing more finegrained measures of syntactic complexity. They found no significant effect of increased task complexity on the ratio of dependent clauses to T-units (DC/T) as a whole as was found in the Kuiken and Vedder study. However, when dependent clauses of different types were examined 22 separately, they could see varied effects of increased task complexity on complexity measures. For example, the ratio of adverbial clauses to T-units significantly decreased when task complexity increased, while the ratio of adjectival clauses to T-units remained the same. Some researchers have studied the effects of different writing-task genres or registers and compared writers’ performance in terms of syntactic complexity. For example, AsenciónDelaney and Collentine (2011) conducted a multidimensional analysis of a written L2 Spanish corpus in order to investigate how learners’ language differs in various types of discourse. They factor analyzed various lexical and grammatical features and found different linguistic complexity measures factored together differently depending on the types of stylistic variations: narrative versus expository. Lu (2011) investigated the effect of genre on syntactic complexity measures in his cross-sectional study. Comparing argumentative and narrative essays written by Chinese learners of English, he found that learners produced more complex structures in argumentative essays than in narrative essays. 
Yoon and Polio (2016) also examined genre differences in their longitudinal study on ESL students’ writing development and found similar results to Lu’s. One interesting finding was that genre effects were found on the phrase-level measures but not on clause-level measures. 1.3.1.3 Effects of first language (L1) Owing to their interest in the influence of first language (L1) on L2 writing, Crossley and McNamara (2011) compared the writings of learners with different L1 backgrounds. Looking at various linguistic features, including syntactic complexity, they found that L2 learners were homogenous and that the differences between the L1 and L2 writings were attributed to limited linguistic resources rather than cultural or L1 differences. Lu and Ai (2015) focused on syntactic 23 complexity and explored this construct in more depth. They found varied patterns in multiple dimensions of syntactic complexity among learners with different L1 backgrounds. 1.3.2 Complexity measures as indices of development and proficiency Some researchers have placed syntactic complexity as a primary focus of investigation in their studies. They have attempted to confirm whether syntactic complexity measures stand as valid and reliable indices of second language development or global proficiency in the target language (Lu, 2011). Researchers have investigated how complexity measures change across different proficiency levels (e.g., Lu, 2011) or over time (e.g., Hunt, 1970; Stockwell, 2005; Norrby, 2007). The following sections review these studies. 1.3.2.1 Development in writing over time I have already introduced some studies in the previous section that investigated changes in learner language over time in specific learning contexts or pedagogical interventions (e.g., Casanave, 1994; Benevento & Storch, 2011; Stockwell & Harrington, 2003; Storch, 2009). Here I have included studies in which the construct of complexity was the primary focus of investigation rather than being employed as a way to measure the influence of external factors. Some researchers have used relatively large corpus data sets to investigate L2 writing development. Bulté and Housen (2014) focused on short-term development in L2 linguistic complexity (both syntactic and lexical). Analyzing essays written by 45 ESL students in the beginning and at the end of the semester in terms of ten syntactic complexity and three lexical diversity measures, they found that not all measures manifested changes over the course of a semester. Significant gains were evident in the length-based measures (MLS and MLT), clause coordination (compound sentence ratio and coordinate clause ratio) and phrasal elaboration (i.e., mean length of noun phrase [MLNP]), but not in subordination measures (i.e., complex sentence 24 ratio, compound-complex sentence ratio, and subclause ratio). The result was contrary to Norris and Ortega’s (2009) model of syntactic complexity development, which proposed that syntactic sophistication occurs initially through clausal coordination, is then realized through subordination at the intermediate level, and at a more advanced stage, is achieved predominantly by means of clausal and phrasal elaboration rather than subordination at the sentence level. Crossley and McNamara (2014) used the same corpora as the Bulté and Housen study and conducted a similar study employing a different set of syntactic complexity indices. 
They used 11 Coh-Metrix indices that measure “syntactic variety, syntactic transformations (e.g., negations and questions), syntactic embeddings, incidence of phrase types, and phrase length” (p.5). They found significant changes in learners’ texts over the observed period. The texts contained more noun phrases than verb phrases and a greater number of phrasal modifications at the end of the semester. The syntactic similarity score decreased significantly, which indicated that students used a wider variety of syntactic constructions after a semester of study. Yoon and Polio (2016) used self-compiled corpus data collected every two to three weeks throughout a semester to investigate learner language development over time. They found a statistically significant but weak change in the MLS measure over time. Interesting to note were the interaction effects of genre and time on MLT and on one subordination measure (C/T). MLT was longer and C/T was larger in argumentative essays than in narrative essays at the beginning of the semester. However, the differences between the two genres decreased over time: increases in the measures were found only in narrative essays. Overall, they did not find strong indication of development in terms of syntactic complexity over the course of a semester. Other researchers included a small number of participants in their studies and focused on their individual trajectories in L2 writing development. Vyatkina (2013) observed two novice 25 learners of German over a more extended period of time: 19 time points over the course of four semesters. She found that the learners’ development of syntactic complexity followed a similar pattern initially, but then the learning paths diverged in the last two semesters. While one learner relied on coordination to lengthen sentences, the other used more complex clausal structures. Based on the results, she argued for the importance of employing both global and specific measures of complexity. Some researchers investigated learner development within the Dynamic Systems Theory framework (Larsen-Freeman, 2006; Spoelman & Verspoor, 2010). These researchers were interested in how constructs of language proficiency interact with each other. They emphasized variability between the learners as well as variation within the learner in the development of these constructs. Larsen-Freeman (2006) observed five Chinese learners of English over six months and investigated how fluency, grammatical complexity, accuracy, and vocabulary complexity emerged and developed in their oral and written performance. She found the individual development trajectories to be very different from one another, while at the same time, the whole group seemed to make progress in general. For example, she found one of the participants focused on lexical complexity throughout the observation period, while others focused more on grammatical complexity. Spoelman and Verspoor (2010) conducted a longitudinal study of a beginning Dutch learner of Finnish. They focused on different complexity measures at the word, phrase, and sentence levels and investigated how these measures developed in relation to one another. To capture dynamic developmental processes, they analyzed the interactions among variables. They found that word complexity and sentence complexity grew together, but NP complexity and sentence complexity developed alternately in a competitive manner. 
1.3.2.2 Texts written by learners across proficiency levels Wolfe-Quintero and her colleagues (1998) evaluated the results of studies that investigated the relationship between L2 proficiency levels and syntactic complexity measures. Proficiency levels were mostly defined by school level, program level, or a holistic rating of learner writing performance. The researchers reported that two length-based measures, MLC and MLT, and three measures of subordination, C/T, DC/C, and DC/T, generally showed a positive linear relationship to proficiency levels. Mixed results were reported for coordination measures such as the number of T-units per sentence (T/S). Some studies found that the more frequent use of specific structures such as reduced clauses (Homburg, 1984; Monroe, 1975) or passive sentences (Kameen, 1979) was an indication of proficiency levels. Lu (2011) used a corpus of college-level second language writing at various proficiency levels in evaluating the computational tool he created. He calculated 14 measures using the L2 Syntactic Complexity Analyzer and compared the values across three proficiency levels, which were defined by institutional level. He found that six measures linearly increased along the three proficiency levels. These measures were MLC, MLT, CP/C, CP/T, CN/C and CN/T. Gyllstad, Granfeldt, Bernardini and Kallkvist (2014) also found that some measures discriminate certain levels better than others. They reported that MLC was a better measure for advanced-level writing. Verspoor, Schmid, and Xu (2012) investigated 64 variables related to constructions, chunks, lexicon, and accuracy in the writings of L2 learners at various proficiency levels in order to search for more reliable indices of written language development. They were interested in which measures can discriminate among proficiency levels, which were predefined by holistic writing scores. They found that MLT was a medium discriminator and that more dependent clauses were used as the proficiency level increased. One thing to note in these studies is how proficiency level was operationalized. In Lu’s study, naturally occurring groups were used to determine proficiency levels, while Verspoor et al. (2012) assessed writing samples holistically and grouped learners based on the scores. This inconsistency in measures of proficiency makes it hard to compare findings across studies. In addition, although Norris and Ortega (2003) observed that operationalizing proficiency levels in terms of holistic ratings provided more homogeneous findings than naturally occurring classes or groups (p.502), cautious interpretation is required when proficiency is measured in this way due to the inherent relationship between quantitative complexity measures and holistic scores. 1.4 Grammatical complexity: L2 assessment In this section, I examine how grammatical competence has been interpreted and operationalized in assessing L2 writing performance in an attempt to compare the ways grammatical complexity has been viewed in the fields of language assessment and SLA. The section also contains a review of studies that employed syntactic complexity measures used in SLA research in investigating testing-related issues. 1.4.1 Assessment of grammar performance Scholars have viewed language proficiency as a many-faceted skill, and many of them have identified grammar as one distinct component of language competence (e.g., Canale & Swain, 1980; Bachman, 1990).
However, the assessment of grammatical knowledge has remained relatively neglected in the language-testing field (Purpura, 2004, p.4). Purpura (2004) made one of the first attempts to investigate comprehensively the construct of grammatical knowledge in the testing context (Zandi, 2014). He proposed a general model of grammar in which he distinguished between grammatical knowledge, ability, and performance. According to Purpura, grammatical knowledge indicates learners’ mental representations of informational structures related to grammatical form and meaning, and grammatical ability incorporates both grammatical knowledge and strategic competence for using the knowledge. It is grammatical ability about which assessors attempt to make inferences in testing. These inferences can be made on the basis of grammatical performance, which is the “observable manifestation of grammatical ability” (Purpura, 2004, p.87). Rimmer (2006) identified two measurable dimensions of test-takers’ grammar performance: accuracy and range. Accuracy is defined as “control of structures and freedom from error”. Range refers to “the variety of grammatical structures that test-takers employ” (p.498), and it concerns the number of different structures and their degree of complexity. Rimmer’s notion of range thus incorporated both variety and elaboration in grammatical structures and can be understood as an equivalent concept to syntactic complexity in SLA research.

Table 4 References to syntactic complexity in rating scales for writing

TOEFL Independent Writing (holistic scale)
5 • displays consistent facility in the use of language, demonstrating syntactic variety, appropriate word choice, and idiomaticity, though it may have minor lexical or grammatical errors
4 • displays facility in the use of language, demonstrating syntactic variety and range of vocabulary, though it will probably have occasional noticeable minor errors in structure, word form, or use of idiomatic language that do not interfere with meaning
3 • may display accurate but limited range of syntactic structures and vocabulary
2 • an accumulation of errors in sentence structure and/or usage
1 • serious and frequent errors in sentence structure or usage

IELTS Writing band descriptors: Task 1 (Grammatical range and accuracy)
9 • uses a wide range of structures with full flexibility and accuracy; rare minor errors occur only as ‘slips’
8 • uses a wide range of structures • the majority of sentences are error-free • makes only very occasional errors or inappropriacies
7 • uses a variety of complex structures • produces frequent error-free sentences • has good control of grammar and punctuation
6 • uses a mix of simple and complex sentence forms • makes some errors in grammar and punctuation but they rarely reduce communication
5 • uses only a limited range of structures • attempts complex sentences but these tend to be less accurate than simple sentences • may make frequent grammatical errors and punctuation may be faulty; errors can cause some difficulty for the reader
4 • uses only a very limited range of structures with only rare use of subordinate clauses • some structures are accurate but errors predominate, and punctuation is often faulty
3 • attempts sentence forms but errors in grammar and punctuation predominate and distort the meaning
2 • cannot use sentence forms except in memorized phrases
1 • cannot use sentence forms at all
Jacobs et al.’s ESL Composition Profile (Language Use)
25-22 • EXCELLENT TO VERY GOOD: effective complex constructions; few errors of agreement, tense, number, word order/function, articles, pronouns, prepositions
21-18 • GOOD TO AVERAGE: minor problems in complex constructions; errors of agreement, tense, number, word order/function, articles, pronouns, prepositions, but meaning seldom obscured
17-11 • FAIR TO POOR: major problems in simple/complex constructions; errors of word order/function, articles, pronouns, prepositions, and/or fragments and run-ons; meaning confused or obscured
10-5 • VERY POOR: virtually no mastery of sentence construction rules; not enough to evaluate

This notion of grammatical performance is manifested in rating scales that are used to evaluate learners’ language performance. Table 4 shows how the construct is illustrated in some widely used rating scales for assessing the writing performance of L2 learners. For example, in the holistic rating scale used for the TOEFL (https://www.ets.org/toefl) independent writing task, test-takers’ language use is evaluated in terms of consistency in using a variety of structures accurately. IELTS (https://www.ielts.org/) uses an analytic rating scale that consists of four subscales: task achievement, coherence and cohesion, lexical resources, and grammatical range and accuracy. According to the descriptors in the grammatical range and accuracy section, use of a wide range of structures and use of complex sentences are indications of advanced proficiency. The last example is the language use section of the ESL Composition Profile created by Jacobs, Hartfiel, Hughey, and Wormuth (1981). The descriptor in this rating scale also refers to the use of complex versus simple constructions in describing writers’ performance. Overall, learners’ language use is evaluated in terms of the ability to use a variety of structures and complex sentences accurately in a given writing task. Both diversity and degree of sophistication are addressed in assessing L2 writing performance, while syntactic complexity has been mostly captured by measures of depth or sophistication of structures in SLA studies. 1.4.2 Relationship between syntactic complexity measures and human ratings Recently, there have been some attempts to link human raters’ perceptions of writing quality and linguistic features of texts represented by syntactic complexity measures used in the field of SLA. Crossley and McNamara (2014) investigated the relationship between the indices of syntactic complexity that are sensitive to L2 development and human ratings of language use in L2 writing. They computed various indices using Coh-Metrix, ran correlation analyses to identify measures that are related to human ratings, and then conducted regression analyses in order to examine whether these indices could be predictive of the subjective ratings. They found that, in addition to the production of all clause types (e.g., matrix, coordinating and embedded clauses), the incidences of infinitives and that verb complements were strong predictors of higher ratings of writing quality. An interesting finding was that there was a mismatch between syntactic complexity measures that developed in L2 writing over a semester and those that predicted overall writing quality (as measured by the total writing scores and language use scores).
Although the development in learner language over the semester was characterized by more reliance on nominal style and phrasal modifications, raters’ judgments of writing quality were not strongly predicted by these features. Similar results were found by Bulté and Housen (2014). They investigated whether human raters’ judgments of writing performance based on an analytic rating scale are related to syntactic complexity measures. They found Language Use scores correlated with most of the syntactic complexity measures they examined: MLS, MLT, the simple sentence ratio (SSR), the complex sentence ratio (CxSR), the compound-complex sentence ratio (CdCxSR), the subclause ratio (SCR), MLC, and mean length of noun phrase (MLNP). Most of these measures were the ones found to be correlated with the overall writing scores as well. However, these measures were not necessarily development-sensitive. For example, a measure of subordination, CxSR, was significantly correlated with writing quality, but it did not increase over time. Conversely, clausal coordination measures significantly increased over time, while they were not significantly correlated with the subjective ratings of writing quality. Guo, Crossley, and McNamara (2013) were interested in whether the independent and integrated writing tasks of TOEFL elicit similar performances from L2 writers. They investigated which linguistic features, such as syntactic complexity, predict overall writing scores given by human raters and to what extent such features predict the scores. Their results did not provide evidence that the syntactic complexity indices they investigated (i.e., number of words before the main verb, number of higher-level constituents per word, number of modifiers per noun phrase, syntactic similarity, and number of embedded clauses) can be predictive of writing scores given by human raters. The authors attributed this result to the test-takers’ proficiency level: TOEFL test-takers are generally assumed to be advanced learners of English, and syntactic complexity indices are not strong discriminators of proficiency among learners at this level, as maintained by Norris and Ortega (2009). Overall, previous studies have reported mixed results regarding whether syntactic complexity measures have a relationship with subjective ratings by human raters. The results are far from conclusive, as the measures examined varied from study to study. In addition, some commonly used measures in SLA such as DC/C and C/T remain to be investigated. 1.5 Summary Syntactic (i.e., grammatical) complexity refers to the range and the degree of sophistication of the forms that appear in language production (Ortega, 2003). SLA and L2 writing researchers have employed the construct in order to describe learners’ performance and to assess changes in learner language over time or across proficiency levels. Grammatical complexity has also been an important factor in the L2 assessment field. The construct is considered crucial in describing grammatical competence; for example, rating rubrics often utilize the complexity of structures as a descriptor of the writing performance of test takers. However, how the construct is measured in assessing L2 (writing) performance does not coincide with the ways it is conventionally operationalized in SLA and L2 writing research.
As Polio (2001) noted, “the various measures of complexity… indicate that variety does not enter in the equations,… yet the terms complex sentences and variety of structures often appear as part of other components on analytic scales” (p.96). Even after a decade, the degrees of sophistication or elaboration of language structures are used to measure the complexity of writing performance in SLA, while the diversity and complexity of structures used by test-takers are also considered in human raters’ subjective evaluations of L2 performance. Addressing the gap, there have been some recent efforts to attend to the structural diversity dimension of syntactic complexity in SLA studies (Asención-Delaney et al., 2011; Crossley & McNamara, 2011, 2014; Guo, Crossley & McNamara, 2013; Spoelman & Verspoor, 2010; Verspoor et al., 2012; Vyatkina et al., 2015). In addition, some researchers have attempted to link the measures used in L2 writing studies and writing quality by investigating the relationship between the two. Adding to these previous attempts, the present study aims to fill the gap in the literature by proposing a way to tap into the diversity dimension of syntactic complexity. The following section describes the diversity measures that I am proposing. 1.6 Proposed measures of syntactic complexity to capture syntactic diversity/variety In the current study, I propose a way to approach the diversity dimension of syntactic complexity from the verb-argument construction perspective. I test whether diverse use of verb-argument constructions (VACs) can be an indicator of L2 writing proficiency and quality. Verb-argument structures and their contribution to sentence form and meaning have been at the center of many sentence processing models from both theoretical linguistics and psycholinguistics through the years (Bencini & Goldberg, 2000). In addition, the syntactic configuration of verbs is known to pose challenges to children in their native language acquisition (Alishahi & Stevenson, 2008) as well as to second language learners (Gries & Wulff, 2009). In addition, there is some empirical evidence to show that syntactic constructions can be predictive of writing proficiency. Hinkel (2003) quantitatively analyzed L1 and L2 academic texts and found that the prevalence of simple constructions such as be-copula was characteristic of non-native students’ writing. She concluded that non-native students’ productive range of grammar was relatively small. Jarvis, Grant, Bikowski, and Ferris (2003) compared the linguistic features of higher-rated and lower-rated ESL compositions and found more frequent use of stative be verb constructions in lower-rated compositions and of passive constructions in higher-rated compositions. Therefore, I believe that VACs constitute an appropriate domain of grammar for the study of syntactic proficiency. As a measure of syntactic diversity, I computed the number of VAC types and the corrected type-token ratio of VACs (VAC CTTR) and used them as two diversity measures in the current study, following conventions in lexical diversity studies. In examining lexical diversity, many researchers have used the corrected type-token ratio instead of the traditional type-token ratio, as a way to lessen the effect of the length variation of the sample essays. The corrected type-token ratio is calculated by dividing the number of word types by the square root of twice the number of tokens (Carroll, 1964). VAC CTTR in the present study is computed in the same way.
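To make the computation concrete, the following is a minimal sketch (not the scripts used in the study) of how the two diversity measures could be obtained from the list of VAC codes assigned to one essay; the VAC labels and the function name are hypothetical.

```python
import math
from collections import Counter

def corrected_ttr(items):
    """Corrected type-token ratio: types divided by the square root of
    twice the number of tokens (Carroll, 1964)."""
    tokens = len(items)
    if tokens == 0:
        return 0.0
    return len(set(items)) / math.sqrt(2 * tokens)

# Hypothetical VAC codes assigned to the main verbs of a single essay.
vacs = ["V+NP", "V+NP", "V+(that)clause", "V", "V+NP+to-infinitive", "V+NP"]

vac_types = len(set(vacs))       # number of VAC types (here, 4)
vac_cttr = corrected_ttr(vacs)   # VAC CTTR (here, 4 / sqrt(12), about 1.15)
print(vac_types, round(vac_cttr, 2), Counter(vacs))
```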
In identifying a set of verb-argument structures, I rely on findings in corpus-informed linguistics: 1) construction grammar and 2) corpus-based descriptive grammar. Linguists with constructionist approaches see knowledge of language as consisting of constructions, which are defined as learned pairings of form and meaning at different levels of generality. Words and idioms are constructions, and VACs are constructions at a more abstract level (Goldberg & Suttle, 2010). Goldberg (1995) studied constructions that correspond to basic sentence types, which she believed to reflect basic event types that humans experience. The list of VACs studied by Goldberg (1995) and other researchers in the field (e.g., Bencini & Goldberg, 2000; Ellis & Ferreira-Junior, 2009a, 2009b) is provided in Table 5. Based on corpus findings, Biber, Johansson, Leech, Conrad, Finegan and Quirk (1999) described major clause patterns that are comparable to VAC types identified by constructionists. Some of these major patterns can be further divided in terms of complementation types. This further classification relies on the work of Quirk, Leech, Svartvik, and Greenbaum (1985) (see Table 6).

Table 5 Verb argument constructions
No. | Clause pattern (Biber et al., 1999) | Construction label (Goldberg, 1995, and others) | Example
1 | Subject—verb phrase | | The sun is shining.
2 | Subject—verb phrase—obligatory adverbial (SVPP) | Intransitive motion | My office is in the next building.
3 | Subject—verb phrase—subject predicative (SVC(AP)) | Intransitive resultative | Your dinner seems ready.
4 | Subject—verb phrase—direct object (SVO) | Transitive | That lecture bored me.
5 | Subject—verb phrase—prepositional object | | He is looking after the dog.
6 | Subject—verb phrase—indirect object—direct object (SVOO) | Ditransitive | I must send my parents an anniversary card.
7 | Subject—verb phrase—direct object—prepositional object (SVOPP) | Dative | I must send an anniversary card to my parents.
8 | Subject—verb phrase—direct object—object predicative (SVOC(AP)) | Resultative | You made him angry.
9 | Subject—verb phrase—direct object—obligatory adverbial (SVOPP) | Caused motion | You can put the dish on the table.
10 | Passive | Passive construction | My bicycle is broken.
11 | Existential there | there construction | There are books on the table.
12 | Extraposition | | It was a good idea to leave early.
13 | Cleft | | It was my mom who called me.

Table 6 Verb complementation types (Quirk et al., 1985)
Type | Variants | Example
Copular (SVC & SVA) | SV C (Adjective) | The girl seemed restless.
Copular (SVC & SVA) | SV C (Nominal) | William is my friend.
Copular (SVC & SVA) | SV Adverbial | The kitchen is downstairs.
Monotransitive | SV O (NP) with passive | Tom caught the ball.
Monotransitive | SV O (NP) without passive | Paul lacks confidence.
Monotransitive | SV O (that-clause) | I think that we have met.
Monotransitive | SV O (wh-clause) | Can you guess what she said?
Monotransitive | SV O (wh-infinitive) | I learned how to sail a boat.
Monotransitive | SV O (to-infinitive -S) | We’ve decided to move house.
Monotransitive | SV O (ing -S) | She enjoys playing squash.
Monotransitive | SV O (to-infinitive + S) | They want us to help.
Monotransitive | SV O (ing + S) | I hate the children quarreling.
Complex transitive (SVOC & SVOA) | SVO C (Adjective) | That music drives me mad.
Complex transitive (SVOC & SVOA) | SVO C (Nominal) | They named the ship ‘Zeus.’
Complex transitive (SVOC & SVOA) | SVO C (Adverbial) | I left the key at home.
Complex transitive (SVOC & SVOA) | SVO C (to-infinitive) | They knew him to be a spy.
Complex transitive (SVOC & SVOA) | SVO C (bare infinitive) | I saw her leave the room.
Complex transitive (SVOC & SVOA) | SVO C (-ing clause) | I heard someone shouting.
Complex transitive (SVOC & SVOA) | SVO C (-ed clause) | I got the watch repaired.
Ditransitive | SVO O (NP) | They offered her some food.
Ditransitive | SVO AdvP: Dative | Please say something to us.
Ditransitive | SVO O (that-clause) | They told me that I was ill.
Ditransitive | SVO O (wh-clause) | He asked me what time it was.
Ditransitive | SVO O (wh-infinitive) | Mary showed us what to do.
Ditransitive | SVO O (to-infinitive) | I advised Mark to see a doctor.

The coding for verb-argument structures that appeared in participants’ essays was conducted against the above lists. The coding procedure is described in more detail in Chapter 2.

CHAPTER 2: THE CURRENT STUDY 2.1 Research questions and hypotheses The current study was guided by three research questions. The first considers the role of syntactic complexity as an index of L2 proficiency. (1) Does the syntactic complexity of Korean EFL learners’ writing production, as measured by various quantitative complexity measures, function as an indicator of proficiency? In addition, does adding diversity measures increase the predictive power of syntactic complexity in discriminating proficiency levels? Previous studies have reported mixed results on how syntactic complexity measures are related to different proficiency levels or developmental stages. Some researchers indicated that syntactic complexity measures change according to proficiency levels (e.g., Verspoor, Schmid, & Xu, 2012; Lu, 2011) or over time (e.g., Vyatkina, 2013), while others reported that proficiency or time was not a significant predictor of variation for syntactic complexity features (e.g., Biber, Gray, and Staples, 2014; Yoon & Polio, 2016). Researchers also reported that some measures are better indicators of proficiency levels or development than others. Lu (2011) tested 14 syntactic complexity measures that are also used in the present study. His results indicated that only six of the measures linearly progressed along three proficiency levels and significantly differentiated between them. These measures were mean length of clause (MLC), mean length of T-unit (MLT), coordinate phrases per clause and per T-unit (CP/C and CP/T), and complex nominals per clause and per T-unit (CN/C and CN/T). In addition, post hoc tests revealed that only one of them significantly differentiated among all three levels. The other measures discriminated between levels two and three only. Gyllstad, Granfeldt, Bernardini and Kallkvist (2014) also found that some measures discriminate certain levels better than others. They reported that MLC was a better measure for advanced-level writing. Building on the previous research on the change of individual complexity indices across proficiency levels, I attempt to investigate whether these measures, individually and as a group, can be predictive of different proficiency levels. I also expect to find that adding diversity measures increases the predictive power. The second research question targets the link between L2 syntactic complexity and the assessment of L2 writing quality. (2) How do different syntactic complexity measures relate to subjective ratings of writing quality judged by human raters? Which measure(s) best predict writing quality? Some researchers found that complexity measures are positively correlated with writing quality judged by human raters (Bulté & Housen, 2014; Crossley & McNamara, 2014; Kuiken & Vedder, 2014). Bulté and Housen (2014) reported that seven out of ten complexity measures they employed (i.e., mean length of sentence (MLS), MLT, simple sentence ratio (SSR), complex sentence ratio (CxSR), subclause ratio (SCR), mean length of finite clause (MLCfin), and mean length of noun phrase [MLNP]) showed significant positive correlations with overall writing quality.
To measure writing quality, the authors used the Language Use score in addition to the mean total score of the five rating scales of an analytic rubric. Crossley and McNamara (2014) found that, in addition to the production of all clause types (e.g., matrix, coordinating and embedded clauses), the incidences of infinitives and that verb complements were strong predictors of higher ratings of writing quality. In other words, more diverse syntactic structures were related to higher ratings. In line with these results, I predict that Language Use scores will be highly correlated with many of the syntactic complexity measures tested, and that measures representing the diversity dimension will explain these scores better than elaboration measures. By testing numerous measures that represent various dimensions of syntactic complexity, including the proposed measures of syntactic diversity, the results of the present study are expected to add findings to the literature. The third research question investigates how the construct of syntactic complexity is interpreted in the L2 assessment field. As discussed in the previous chapter, SLA researchers use rather objective, quantitative indices to measure the level of complexity, while in the L2 assessment field it is assessed through human raters’ evaluation of the criterion. In addition to investigating the relationship between the two approaches to measurement with the second research question, I looked into how raters reach their judgments of the level of complexity through rater interviews. (3) How do raters interpret the notion of syntactic complexity that appears on the Language Use scale of a given analytic writing rubric? To my knowledge, no previous studies have directly asked how raters interpret the descriptors in the Language Use scale of an analytic rubric in relation to the notion of syntactic complexity used in SLA. Relevant is research on raters’ cognitive processes in relation to their response to a rating scale for writing assessment (Barkaoui, 2010; Cumming, Kantor, & Powers, 2002; Knoch, 2009; Lumley, 2002; Winke & Lim, 2015). These researchers investigated raters’ decision-making processes, and many of them found variability in rater scoring behaviors. For example, Lumley (2005) reported that his raters often failed to make decisions based on a common interpretation of the scale contents and resorted to different strategies, resulting in variability in the rating process. Knoch (2009) also reported that her raters used various coping strategies to deal with difficulty in deciding on a score, such as assigning a global score in a holistic rather than an analytic way, or disregarding descriptors. Using an eye-tracker, Winke and Lim (2015) found results similar to Knoch’s: raters used the rating rubric in a systematic (from left to right) way and often disregarded descriptors, which suggested that raters take a more holistic approach than the rubric designers perhaps would have surmised. Winke and Lim also suggested that a rather low interrater reliability estimate for the Language Use section of the rubric showed that raters had trouble in interpreting or applying the Language Use section. I predict that in my study raters will have difficulties in interpreting the Language Use section of the rubric, and that their interpretations of the descriptors and resulting rating processes will vary.
I expect the interview data will provide some insights into how the notion of syntactic complexity is interpreted in the L2 assessment field. 2.2 Participants 2.2.1 Korean learners of English The primary data were collected from Korean learners of English at two different institutional levels. I recruited a total of 187 high school students from four high schools and 203 college students from three universities in South Korea. Their proficiency levels varied and were evaluated by an independent English proficiency test, which is described in detail in the instrument section. The high school participants were enrolled in four high schools in two provinces in Korea. Three schools, namely schools A, B, and C, were located in Seoul, and the other (school D) was located in Gyeongsang province. Schools A and B were boys’ high schools, and School C was a girls’ high school. School D was a co-ed school. The number of students recruited from each school was 17, 21, 89, and 60 (27 male and 33 female students), respectively.

Table 7 Korean participants’ demographic and learning background
Secondary (N = 187)
- Gender: 65 male, 122 female
- Grade: 3 freshmen, 184 juniors
- Major: N/A
- Years studying English: M = 10.1, SD = 2.38, minimum 2, maximum 16
- Months in English-speaking countries: 164 none; 9 (0-6 months); 4 (7-12 months); 2 (13-18 months); 3 (19-24 months); 2 (2 years); 1 (2.5 years); 1 (4 years); 1 (7 years)
College (N = 203)
- Gender: 85 male, 118 female
- Grade: 14 freshmen, 27 sophomores, 70 juniors, 92 seniors
- Major: 47 social science, 44 engineering, 35 humanities and language, 34 education, 29 science and medical, 9 business, 5 arts
- Years studying English: M = 14.86, SD = 3.23, minimum 5, maximum 21
- Months in English-speaking countries: 154 none; 21 (1-6 months); 2 (13-18 months); 4 (19-24 months); 1 (2.5 years); 1 (4 years); 1 (5 years); 1 (10 years)

The college students were enrolled in three universities in different provinces. The majority (146 participants) were from Seoul National University (70 male students and 76 female students). Twenty-eight participants were recruited from Pusan National University (12 male and 16 female students) and 29 were recruited from Gyeongin National University of Education (4 male and 25 female students). They were pursuing a variety of majors such as science and engineering (e.g., electrical engineering, medical science, and animal science), social science (e.g., economics, politics, and sociology), humanities and language (e.g., linguistics, history, and English literature), and education. About half of the students were in their senior year, and the other half were mostly juniors followed by sophomores and freshmen (see Table 7 for more demographic information). 2.2.2 Raters A group of raters also participated in the study. I recruited a total of seven raters at Michigan State University. Four of them were faculty members in the English Language Center, and three of them were Ph.D. students in the Second Language Studies program at the university. They were all native speakers of English and had varying amounts of teaching and rating experience (see Table 8). The raters were asked to score the writings of the Korean learners based on an analytic rubric, and they participated in a 20-minute interview.

Table 8 Raters’ teaching and rating background
Rater | Occupation | ESL/EFL teaching experience | ESL/EFL composition rating experience | Ability to evaluate ESL/EFL compositions (self-evaluation)
Rater 1 | ESL instructor | 27 years | Yes | Expert
Rater 2 | ESL instructor | 35 years | Yes | Expert
Rater 3 | ESL instructor | 5 years | Yes | Expert
Rater 4 | ESL instructor | 5 years | Yes | Competent
Rater 5 | Ph.D. student | 7 years | No | Competent
Rater 6 | Ph.D. student | 6 years | Yes | Competent
Rater 7 | Ph.D. student | 4 years | Yes | Novice
2.3 Instruments 2.3.1 Writing tasks As some previous studies have found genre effects on syntactic complexity measures (e.g., Asención-Delaney & Collentine, 2011), I used writing prompts of different genres to elicit writers’ use of a variety of linguistic features. Two genres were used: an argumentative and a narrative writing task. Care was taken to select topics that were suitable for both the high school and college student groups. First, I searched for writing prompts that were originally designed as timed writing tasks for young adult learners of English. After the initial selection of potential prompts, I asked two high school teachers in Korea to remove any prompts that were socioculturally unknown or irrelevant for either group of learners and to recommend ones that would be interesting to students. The argumentative essay prompt was adopted from MSU-CELP exam preparation materials. MSU-CELP is an English language examination developed by the English Language Center at Michigan State University. The exam aims to assess English language ability in four areas, namely writing, listening, reading, and speaking, at the C2 level of the Common European Framework of Reference (CEFR). As the test was developed for EFL learners and targets not only adult college learners but also high school learners, it seemed reasonable to adopt one of the writing prompts developed for the exam and administer it to the participants of the current study. I chose several writing prompts from the exam preparation materials that are published online and open to public access (http://www.msu-exams.gr/swift.jsp?CMRCode=1807P3P4S) to create an initial list of possible prompts and selected one. The prompt for the narrative writing task was adopted from Yoon and Polio (2016). Both prompts were edited in an attempt to make the two prompts comparable in length. The final versions of the prompts used in the study are as follows.
- Argumentative writing task (adapted from MSU-CELP Practice Test 2). Teachers sometimes require students to work together on specific projects. Each student then gets a grade based on the group’s success. Some students are quite happy to receive a grade based on the work of the group, while others feel that being graded as part of a group is not fair. What is your opinion about being graded as part of a group? Be sure to support your opinion with examples, reasons, and explanations.
- Narrative writing task (adapted from the MSU Corpus). Think about a particularly good or bad teacher or professor that you had. Tell a story about your experience with that teacher. Be sure to fully develop your story by including relevant examples and specific details.
2.3.2 English proficiency test (C-test) A C-test, which is a type of cloze test, served as an independent measure of the English language proficiency of the Korean participants in this study. The strength of cloze tests lies in their practicality: the test is short and paper-based and is not constrained by time or space limits, and it was therefore chosen as a global proficiency measure in the current study. The main purpose of using the C-test in the current study was to group the Korean EFL learners into different proficiency levels. Cloze tests have been used in language research for a long time. In a cloze test, test-takers are given a passage with a number of deleted single words replaced with blanks and are asked to fill in the blanks.
Although the issue of which abilities cloze tests actually measure still remains unresolved (Tremblay, 2011), many researchers have found evidence supporting the reliability of the test (Bachman, 1985; Tremblay, 2011). The C-test was developed in an attempt to resolve difficulties with scoring cloze tests objectively. Since Raatz and Klein-Braley (1981) introduced a new deletion technique that deletes the second half of every second word in a text and coined the term C-test, researchers have examined various deletion rates and deletion patterns. For example, Sigott and Kobrel tested different deletion patterns, such as deleting 2/3 of the words or leaving the first letter only, in an attempt to increase the test difficulty (as cited in Babaii & Ansary, 2001, p.212). For the present project and another project, a colleague and I developed a 45-item C-test with a first-letter deletion pattern; it can be found in Appendix A. The test consisted of three texts taken from online articles and was designed to present texts at various comprehensibility levels. The texts were of varied length and structural and lexical complexity. The first sentence of each passage was left intact in order to provide an introduction to the passage. Then, we deleted roughly every 8th word except for its first letter and replaced the deleted portion with a blank. We moved a blank to the preceding or following word when the same word would otherwise have been deleted repeatedly, or when we judged that the moved blank would contribute more to the overall quality of the test than the original one. We piloted the initial version on three native speakers and six advanced learners of English and then revised the instrument based on the test-takers’ responses and opinions. We also pilot-tested the revised version with 13 native speakers of English, and they scored between 76% and 96% (Mean = 84.79, SD = 6.34). Their responses were used as acceptable answers. After the administration of the C-test to the 390 Korean participants, the reliability and the discriminability of the test were examined. First, the reliability of the test was estimated using Cronbach’s alpha. The Cronbach’s alpha value of the entire test was .94, and the values for the three texts were all above .70, which was interpreted as high. The reliability values are presented in Table 9.

Table 9 C-test reliability
Section | N of items | Cronbach’s alpha | Cronbach’s alpha based on standardized items
ALL | 45 | .94 | .94
Text 1 | 11 | .75 | .76
Text 2 | 14 | .83 | .82
Text 3 | 20 | .92 | .92

In order to examine the level of difficulty of the test items, item facility (IF) and item discrimination (ID) were checked. IF is an estimate of item difficulty, and ID is a measure of how well a given item discriminates between test-takers with high and low ability (Carr, 2011, pp.269-270). The ID values were all positive, and most of them had a value over .20, which has been found acceptable by experts (Nelson, 2000). Three items had an ID score below .20 and were therefore removed from the analysis. The IF value for about half of the items (24 items) was between .30 and .70, which is a normally used target range in practice (Carr, 2011, p.270). Eighteen out of 45 items had a value lower than .30, which is interpreted as marking difficult items, and three items were considered easy (i.e., IF above .70). These easy and difficult items were not removed from the analysis as long as their ID values were above .20, as the purpose of the test was to discriminate among participants. Item facility and item discrimination values by item are reported in Appendix B.
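As an illustration of the two item statistics just described (not the analysis scripts used in the study), item facility can be computed as the proportion of correct responses to an item, and item discrimination, in one common formulation, as the difference in item facility between higher- and lower-scoring groups of test-takers. The response matrix and the upper/lower-third split below are hypothetical.

```python
import numpy as np

def item_stats(responses, group_frac=1/3):
    """Item facility (proportion correct) and upper-lower item discrimination
    for a 0/1 response matrix (rows = test-takers, columns = items).
    This is one common formulation; the study's exact computation may differ."""
    responses = np.asarray(responses)
    order = np.argsort(responses.sum(axis=1))        # rank test-takers by total score
    n_group = max(1, int(len(order) * group_frac))   # size of the upper/lower groups
    lower, upper = order[:n_group], order[-n_group:]
    item_facility = responses.mean(axis=0)
    item_discrimination = responses[upper].mean(axis=0) - responses[lower].mean(axis=0)
    return item_facility, item_discrimination

# Hypothetical responses: 6 test-takers x 4 items (1 = correct, 0 = incorrect).
data = [[1, 1, 0, 1],
        [1, 0, 0, 1],
        [1, 1, 1, 1],
        [0, 0, 0, 1],
        [1, 1, 0, 0],
        [0, 0, 0, 0]]
IF, ID = item_stats(data)
print(IF.round(2), ID.round(2))
```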
2.3.3 Language learning background questionnaire A questionnaire (adopted from Kim, 2014) asked the Korean participants to provide information regarding their gender, year in school, age of first exposure to English, experience living in English-speaking countries, and their perceived proficiency level in speaking, writing, reading, and listening (see Appendices C and D; Korean translations in Appendices E and F). 2.3.4 Rater background questionnaire At the end of the rating session, I asked the raters to fill in a questionnaire asking about their teaching and composition-rating experience, their perceived competency in rating student essays, and their familiarity with any language other than English. I adapted a rater background questionnaire used by Winke and Gass (2013) (see Appendix G). 2.3.5 Rating rubric Human raters scored the writings on an analytic rating scale developed by Polio (2013), which can be found in Appendix H. Polio revised an analytic scale adapted from Jacobs et al. (1981), which was based on the evaluation of experienced ESL instructors and targets content, organization, vocabulary, language use, and mechanics. The revised scale consists of the same five components, but the descriptors were changed in accordance with raters’ comments and perceptions of the scale (see Connor-Linton & Polio, 2014, p.4, for more information on the revision process of the scale). Polio (2013) reported that the revised scale was more reliable and valid. Four of the five subscales—Content, Organization, Vocabulary, and Language Use—are on a scale from zero to 20, and the Mechanics scale ranges from zero to 10, for a total score of 90. 2.4 Procedures In this section, I report the procedures for each participant group. 2.4.1 Korean learners of English The main data collection was conducted in Korea in the summer of 2015. Korean EFL students were asked to complete one of the two essay writing tasks. High school students completed the experiment in a regular English classroom, and their English teachers administered the procedures.
The guiding rater read through the descriptors and band levels in the scale briefly, and any questions were resolved through discussion. After reading through the rating scale together, the raters were given a number of sample essays. They rated one sample essay individually and shared what score they gave to the essay and the rationale behind the rating. They continued the discussion until they reached an agreement on the scores, and then moved to the next sample essay. In total, they rated and discussed four sample essays. The norming session lasted for 90 minutes. After the norming session, I gave each rater a packet of argumentative essays. In the packet were 10 essays that were common to all raters and 27 essays that were unique to each individual rater. A rating scale and a scoring sheet were also included. I asked the raters to finish rating within a week, and everyone returned the packet to me in time. Then I distributed a packet containing the same number (37) of narrative essays to the raters and gave them a week to finish scoring. Within a week after all the raters had returned the scores for the narrative essays, I met them individually for a retrospective interview. During the interview, I asked about the raters’ overall rating procedure, their global impression of the rating rubric, and how they interpreted the descriptors in the Language Use section of the scale. The interview lasted for 15 to 20 minutes. The interview employed a semi-structured format: it began with a set of common interview questions but deviated from them or added more in order to pursue topics arising in the course of the interview (Friedman, 2012). The following are the interview questions that were common to all the interviewees.
- Can you walk me through your overall rating process (and specifically the rating of language use)?
- What did you think about the language use section of the rating rubric?
- (Showing sample essays that each rater rated) What were you thinking / what affected you when you were scoring this essay?
- How did you interpret the wording of the rubric? Could you give me some examples?
  • (errors in) complex structures
  • (errors in) morphology
  • (frequent use of / minimal use of / no attempt at) complex sentences
  • (excellent / good / little / no) sentence variety
- Do you see differences between complex sentences and complex structures?
2.5 Data analysis 2.5.1 Quantitative analysis The essays were evaluated by means of human ratings and by a number of quantitative measures gauging L2 syntactic complexity. For the computational analyses of syntactic complexity measures and the subjective ratings, the essays were typed and saved as individual text files on a computer. 2.5.1.1 Proficiency test According to Brown (1980), various methods have been developed for the scoring of cloze tests, some of which are exact-answer, acceptable-answer, clozentropy, and multiple-choice scoring (Brown, 1980, p.311). Exact-answer scoring counts only the exact words used in the original text as correct, while acceptable-answer scoring counts all contextually acceptable answers as correct. Clozentropy is a refined version of the acceptable-answer scoring method; it takes the frequency of acceptable answers in a native-speaker pretest into account. In the multiple-choice scoring method, test-takers are given a set of alternative answers and asked to choose the correct one.
For the present study, the multiple-choice scoring method was not an option, as I asked participants to write down answers rather than to choose one from given options. Among the other options, I chose the acceptable-answer scoring method, as it has been reported to be a more reliable measure in ESL contexts than an exact-word criterion (Oller, 1972). A participant’s response for each blank was counted as correct when it was contextually and grammatically acceptable and started with the given letter. Each correct answer received one point. I did not employ partial-credit scoring, meaning that grammatically and contextually unacceptable answers were counted as zero. The original total score was 45, but after checking the item discrimination (ID) values, I discarded three items from the analysis. As a result, the maximum possible score became 42. 2.5.1.2 Subjective ratings Each essay was given five scores based on the analytic rating scale. The sum of these five scores (i.e., the Total score) and the score for the Language Use category were used as a rater’s judgment of the writing quality. 2.5.1.3 Syntactic complexity: Elaboration measures I computed 14 syntactic complexity measures using an automated tool developed by Lu (2011), namely, the Syntactic Complexity Analyzer. The Syntactic Complexity Analyzer is a computational system for the automatic analysis of syntactic complexity. The system takes written texts as input and computes 14 measures of syntactic complexity that were selected based on the research syntheses by Wolfe-Quintero et al. (1998) and Ortega (2003). (Refer to Lu (2010) for a detailed description of the system.) These measures consist of: 1) three measures of length of production units (mean length of clause [MLC], mean length of sentence [MLS], and mean length of T-unit [MLT]), 2) a sentence complexity ratio (number of clauses per sentence [C/S]), 3) four subordination ratios (T-unit complexity ratio [C/T], complex T-unit ratio [CT/T], dependent clause ratio [DC/C], and dependent clauses per T-unit [DC/T]), 4) three coordination measures (coordinate phrases per clause [CP/C], coordinate phrases per T-unit [CP/T], and sentence coordination ratio [T/S]), and 5) three measures that consider the relationship between particular structures and larger production units (complex nominals per clause [CN/C], complex nominals per T-unit [CN/T], and verb phrases per T-unit [VP/T]). 2.5.1.4 Syntactic complexity: Diversity measures In addition to measuring how elaborate the structures in the essays were, I also investigated how diverse the verb-argument constructions (VACs) used in the essays were. The coding procedure for the diversity measures was semi-automated. Two corpus tools were used to identify the verb-argument structures in the data set. First, the essays were part-of-speech (POS) tagged by TagAnt (Anthony, 2014). All essays that were saved as text files were entered into the program, and the program tagged every word with a POS code. Next, a concordance tool called AntConc (Anthony, 2014) was used to identify the instances of verbs and generate the concordance lines for these verbs. Concordance searches for all verbs resulted in 15,298 hits. I manually filtered the retrieved concordances to keep only the lines containing a main verb. A main verb in the present study was operationalized as a tensed verb in a finite clause. When the tense was marked on an auxiliary verb, the following content verb was viewed as a main verb.
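The filtering rule described above can be illustrated with a short sketch. This is not the TagAnt/AntConc workflow used in the study (which involved manual checking of concordance lines); it is a rough approximation using NLTK’s off-the-shelf POS tagger, and the tag set and auxiliary list are simplifications.

```python
# Rough illustration of the main-verb rule, not the study's actual procedure.
import nltk  # assumes the 'punkt' and 'averaged_perceptron_tagger' data are installed

FINITE_TAGS = {"VBD", "VBP", "VBZ"}   # past tense, non-3rd and 3rd person present
AUXILIARIES = {"be", "is", "am", "are", "was", "were", "do", "does", "did",
               "have", "has", "had"}

def main_verbs(sentence):
    """Keep tensed verbs; when tense is marked on an auxiliary, take the
    following verb-tagged word as the main verb (copular be is kept if no
    content verb follows)."""
    tagged = nltk.pos_tag(nltk.word_tokenize(sentence))
    verbs = []
    for i, (word, tag) in enumerate(tagged):
        if tag not in FINITE_TAGS:
            continue
        if word.lower() in AUXILIARIES:
            nxt = next(((w, t) for w, t in tagged[i + 1:] if t.startswith("VB")), None)
            verbs.append(nxt if nxt else (word, tag))
        else:
            verbs.append((word, tag))
    return verbs

print(main_verbs("She has finished the essay and seems happy."))
# [('finished', 'VBN'), ('seems', 'VBZ')]
```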
Through this process, the instances of auxiliary verbs, gerunds, and to-infinitive or bare infinitive verbs were deleted from the database. Consequently, 9,135 hits remained for the analysis. The concordance lines were exported to an Excel sheet for verb-argument structure coding. The coding procedures were as follows. First, I identified the linear structure of each line using the POS tags. Second, phrase structures were identified by grouping constituents together. Then, verb-argument structure codes were assigned. Based on the summaries of English sentence structures from a corpus-based grammar and construction grammar perspective, I began coding with the distribution of 11 verb-argument structures: (1) verb, (2) verb + obligatory adverbial, (3) verb + subjective predicative, (4) verb + direct object, (5) verb + prepositional object, (6) verb + indirect object + direct object, (7) verb + direct object + prepositional object, (8) verb + direct object + object predicative, (9) verb + direct object + obligatory adverbial, (10) passive construction, and (11) there construction. The coding process was iterative, as the set of verb-argument structures identified evolved through repeated coding and grouping. Cleft and extraposition constructions and sub-types of several sentence structures emerged during this procedure. The final coding was conducted against the list of verb-argument constructions identified in previous studies. In the end, a total of 39 verb-argument types were identified (Table 10).

Table 10 Verb-argument structures (examples are quoted verbatim from the learner data)
1. V (Intransitive): So I cried.
2. V + Obligatory adverbial (Copular): …when I was in high school,
3. V + Subjective predicative (Copular)
   - V + AdjP: His class is so interesting that..
   - V + NP: Although I am a student, …
   - V + CP: My opinion is that group project is unfair.
   - V + PP: My opinion is that group project is unfair.
   - V + to infinitive: Second important thing is to evaluate each other.
4. V + Direct object (Transitive)
   - V + past participle: …he will not get even punished…
   - V + present participle: The workload will keep rising.
   - V + NP: …and then build up my opinion.
   - V + (that) clause: I believe that it is necessary…
   - V + wh clause: Most students didn’t care what we are doing.
   - V + wh to infinitive: N/A
   - V + to infinitive: My class began to laugh at me.
   - V + present participle: He didn’t give up teaching me.
   - V + [NP + to infinitive]: He wanted me to know what was wrong.
   - V + [NP + V-ingP]: N/A
   - V + “CP”: She said, “You are right.”
   - V + so: N/A
5. V + Prepositional object (Transitive, prepositional verb): Everyone agreed with it.
6. V + Indirect object + Direct object (Ditransitive)
   - V + NP + NP: She brought us some snacks.
   - V + NP + that clause: I’ve never told him I had been there.
   - V + NP + wh clause: A’s mother asked me why I hit him.
   - V + NP + wh to infinitive: He showed us how to live our lives.
   - V + NP + to infinitive: Lots of professors ask students to work together on their projects.
   - V + NP + “CP”: I asked myself, “Did I have very big fault?”
7. V + Direct object + Prepositional object (Dative): The professor gave lots of articles to students.
8. V + Direct object + Object predicative (Complex transitive; resultative)
   - V + NP + AdjP: …and it drive one crazy.
   - V + NP + NP: …so I call them professors.
   - V + NP + Adverbial: …I regard his students as his younger brother.
   - V + NP + to infinitive: …he encouraged me to study hard.
   - V + NP + bare infinitive: That simple word made me cry.
   - V + NP + V-edP: I had made this machine broken.
   - V + NP + V-ingP: I saw some groups having problem..
9. V + Direct object + Obligatory adverbial (Caused-motion): I still keep that email in my mail box.
10. Existential there: There are several reasons.
11. Passive: Most of the work is only achieved with multiple people.
12. Extraposition: But, it was too difficult to have any question about that.
13. Cleft: It is not a teacher but a textbook or US drama that make my English mostly grow.
Note. V = verb; AdjP = adjective phrase; NP = noun phrase; CP = complementizer phrase (i.e., clause); PP = prepositional phrase.

After the coding, the number of each VAC type used by each participant in each essay (i.e., subject) and the corrected type-token ratio (CTTR) were calculated. 2.5.2 Qualitative analysis Recordings of the rater interviews were analyzed in a qualitative manner. The norming session and the seven individual rater interviews were audio-recorded and transcribed. Following Chapelle and Duff’s (2003) and Baralt’s (2012) guidelines, the data analysis followed an iterative and cyclical process. First, I started the coding process by transcribing the audio-recorded interviews broadly. Then I read the transcripts several times and took notes. I segmented the data and coded each segment, assisted by the program NVivo 9. After that, I grouped related codes together. Through this procedure, I identified themes that are particularly relevant to the issue of syntactic complexity and language use. 2.6 Statistical analysis Quantitative measures were entered into and analyzed with SPSS version 23. In this section, I report which statistical analyses were used to answer the research questions. First, I investigated the use of syntactic complexity as an index of L2 written language proficiency. The first research question asked whether the syntactic complexity of Korean EFL learners’ writing production, as measured by various quantitative complexity measures, functions as an indicator of different proficiency levels. In other words, it asked whether syntactic complexity measures can be used to distinguish between proficiency levels. To answer this question, the writings of the Korean EFL students were first divided into three groups according to their English proficiency test scores. Then I conducted a series of one-way analyses of variance (ANOVAs) and a discriminant function analysis (DFA). Through the ANOVAs, I investigated whether a significant trend across the three proficiency levels existed for a particular complexity index (Homburg, 1984, p.97). Post hoc analyses were conducted to reveal where the difference occurred, if any. DFA investigates “the extent to which a set of measured variables can distinguish—“discriminate”—between members of different groups or distinct levels of another, nominal or possibly ordinal, variable” (Norris, 2015, p.306). It also provides information regarding which measure best discriminates among groups (Homburg, 1984, p.98).
In the present study, DFA was conducted to determine whether the selected sets of syntactic complexity measures can predict proficiency group membership and to find the best predictor. In addition, I compared the predictive power of different sets of syntactic complexity measures (elaboration measures, diversity measures, and the combination of all measures) by running three separate discriminant analyses. Cross-validation was done by splitting the data into two halves, running one analysis on one half and running the second analysis on the other. The classification accuracy of the two analyses was then compared (Norris, 2015). The second research question asks whether complexity measures are associated with the quality of writing and subjective ratings of language use. I performed correlations in order to investigate the relationship between each complexity measure and two writing quality scores, the Total and Language Use scores. Multiple regression analyses were carried out to investigate the relationship between a group of calculated complexity measures and the subjective ratings given by human raters.

CHAPTER 3: RESULTS In this chapter, I first summarize preliminary results obtained from the English proficiency test and the writing task. Descriptive statistics for the C-test scores and the subjective ratings given on the student essays are presented. I also report the results of the proficiency group division based on the C-test scores and the inter-rater reliability for the subjective ratings given by human raters. In the following sections, the results are organized by research question. I first look at the relationship between syntactic complexity measures and English proficiency. Next, I turn to the relationship between the measures and subjective ratings given by human raters. Lastly, I report the results from the rater interviews on raters’ perceptions regarding the notion of syntactic complexity manifested in the rating scale. 3.1 Preliminary results 3.1.1 English proficiency test (C-test) and proficiency-level placement The Korean students (N = 390) were divided into three proficiency groups based on the result of the C-test. Those who scored over 60% on the test were considered the advanced-level learners (n = 94). Eighty-one students in this group reported that they had taken one of three proficiency tests: TOEIC, TOEFL, or the Test of English Proficiency developed by Seoul National University (TEPS). Their average score (converted to a TOEFL iBT scale based on a TEPS versus TOEFL conversion table [http://www.teps.or.kr/Teps/Public/conversion_table.aspx]) was 108.74 (SD = 8.57). Students who scored between 30% and 60% (n = 143) were placed into a high-intermediate group. Ninety of them had an average TOEFL iBT score of 98.24 (SD = 1.52). Lastly, students who scored below 30% (n = 153) were placed into a low-intermediate group. Only 13 of them reported an independent proficiency test score, and their average score was 89.77 (SD = 8.93). The descriptive statistics for the essay ratings are shown in Table 11. The mean C-test scores increased as the proficiency levels went up, regardless of the prompt. The results of a factorial ANOVA revealed a statistically significant main effect of proficiency (F(2, 384) = 1341.84, p < .001, ηp2 = .88). A post-hoc Tukey HSD showed statistically significant differences between all three proficiency groups.
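Purely for illustration (the study’s analyses were run in SPSS 23), the proficiency-by-genre factorial ANOVA whose results are reported immediately above and below could be set up along these lines with statsmodels; the data file and the column names (ctest, level, genre) are hypothetical.

```python
# Illustrative sketch of a two-way (proficiency x genre) ANOVA on C-test scores.
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

df = pd.read_csv("scores.csv")  # hypothetical file: one row per participant

model = ols("ctest ~ C(level) * C(genre)", data=df).fit()
print(sm.stats.anova_lm(model, typ=2))  # main effects of level and genre, and their interaction
```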
However, neither a main effect of genre (F(1, 384) = .51, p = .48, ηp2 = .00) nor an interaction effect between proficiency level and genre (F(2, 384) = .02, p = .98, ηp2 = .00) was identified. These results indicate that proficiency was distributed comparably across the two genres.

Table 11
Descriptive statistics: C-test and subjective ratings on the writing task

Proficiency         Prompt          N     C-test M (SD)    Language Use M (SD)   Total M (SD)
Low-intermediate    Argumentative   74    6.07 (3.37)      7.63 (2.86)           34.57 (12.98)
                    Narrative       79    5.71 (3.65)      7.49 (3.04)           34.95 (14.29)
                    Total           153   5.88 (3.51)      7.56 (2.95)           34.77 (13.63)
High-intermediate   Argumentative   75    18.64 (3.71)     11.79 (2.35)          53.61 (10.60)
                    Narrative       68    18.37 (3.51)     11.93 (2.85)          55.32 (10.83)
                    Total           143   18.51 (3.61)     11.86 (2.59)          54.42 (10.70)
Advanced            Argumentative   46    29.78 (3.69)     14.67 (2.39)          67.32 (10.63)
                    Narrative       48    29.63 (3.46)     13.96 (2.91)          65.64 (11.53)
                    Total           94    29.70 (3.56)     14.30 (2.68)          66.46 (11.07)
Note. Language Use and Total are subjective ratings on the writing task.

3.1.2 Subjective ratings on essays
The essays were rated on an analytic rating scale by seven native speakers of English. Twenty of the 390 essays (ten narrative and ten argumentative essays) were evaluated by all seven raters, and the remaining essays were scored by either one or two raters. The average scores of the 390 essays on the two subjective rating scales, Language Use and Total (the sum of five rating scales, including Language Use), were 10.76 (SD = 3.88) and 49.61 (SD = 17.54), respectively. Mean ratings by genre and proficiency group are presented in Table 11. In order to check inter-rater reliability, I calculated intra-class correlation coefficients (ICCs) on the scores of the common essays rated by all seven raters. I used a two-way mixed effects model with an absolute agreement definition, which assumes that each subject (essay) was rated by two or more raters and that these raters were the only raters participating in the study (Landers, 2015). The average-rater ICCs for the Total score and each of the subsections of the analytic rubric were found to be high. Table 12 presents the results for the Total score and the Language Use section, which are of interest in the current study.

Table 12
Inter-rater reliability (ICCs)

Scale          Intra-class correlation   95% CI lower bound   95% CI upper bound
Language Use   .93***                    .87                  .97
Total          .96***                    .91                  .98
Note. *** p < .001

3.1.3 Relationship between proficiency test scores and subjective ratings
The relationship between proficiency level as measured by the C-test and writing quality as judged by human raters was also investigated. A Pearson correlation between the C-test scores and the Total scores given on the essays showed that the effect size of the correlation was large (r(390) = .78, p < .001, 95% CI [.74, .82], R2 = .61). This result indicates a strong positive linear relationship between the proficiency test scores and the subjective ratings of writing quality, which supports the validity of the C-test as a measure of writing proficiency.

3.2 ANOVAs and discriminant function analyses (DFAs): Research question 1
The first research question asks whether syntactic complexity measures can be used to distinguish between proficiency levels. To address this question, I conducted analyses of variance (ANOVAs) that examined differences in each complexity measure across proficiency levels. I also conducted discriminant function analyses (DFAs), in which a group of syntactic complexity measures was used to discriminate between proficiency groups.
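For illustration only, the sketch below shows how such a per-measure analysis (a one-way ANOVA across the three proficiency groups, its partial eta squared, and Tukey HSD post-hoc comparisons) can be computed. The study's analyses were run in SPSS; the Python code, the file name, and the column names here are hypothetical.

import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols
from statsmodels.stats.multicomp import pairwise_tukeyhsd

df = pd.read_csv("complexity_measures.csv")            # hypothetical data file

# One-way ANOVA: one complexity index (here MLC) across the three proficiency levels
model = ols("MLC ~ C(proficiency_level)", data=df).fit()
anova = sm.stats.anova_lm(model, typ=2)

ss_effect = anova.loc["C(proficiency_level)", "sum_sq"]
ss_error = anova.loc["Residual", "sum_sq"]
eta_p2 = ss_effect / (ss_effect + ss_error)             # partial eta squared
print(anova, "\npartial eta^2 =", round(eta_p2, 3))

# Post-hoc pairwise comparisons between proficiency levels
print(pairwise_tukeyhsd(df["MLC"], df["proficiency_level"]))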
3.2.1 ANOVAs
Results from a series of one-way ANOVAs showed that all measures increased linearly across the three proficiency levels, and the mean differences were significant as a function of proficiency level (Table 13). With a Bonferroni-adjusted alpha level of .003, the effect of proficiency was significant on all of these quantitative complexity measures except CT/T. The effect sizes were generally large for the length-based measures (i.e., MLC, MLS, and MLT), the measures of the relationship between particular structures and production units (i.e., CN/C, CN/T, and VP/T), and the diversity measures (i.e., VAC Types and VAC CTTR). The effect sizes for the coordination measures (i.e., CP/C, CP/T, and T/S) and the subordination measures (C/T, DC/C, and DC/T) were small to medium.

Table 13
Proficiency-level effect on syntactic complexity measures (one-way ANOVAs)

Measure     N     Low-int. M (SD)   High-int. M (SD)   Advanced M (SD)   df   F        p        ηp2
MLC         387   6.92 (1.27)       7.78 (1.25)        8.78 (1.33)       2    62.05    < .001   .24
MLS         385   11.58 (3.43)      14.09 (3.55)       17.27 (3.83)      2    73.31    < .001   .28
MLT         387   10.56 (2.64)      12.52 (2.98)       15.15 (3.49)      2    68.17    < .001   .26
C/S         386   1.69 (0.50)       1.81 (0.40)        1.99 (0.41)       2    12.43    < .001   .06
C/T         385   1.54 (0.38)       1.60 (0.30)        1.72 (0.33)       2    8.43     < .001   .04
CT/T        390   0.44 (0.22)       0.46 (0.17)        0.52 (0.15)       2    6.07     .003     .03
DC/C        389   0.36 (0.13)       0.37 (0.11)        0.42 (0.10)       2    8.24     < .001   .04
DC/T        384   0.56 (0.30)       0.61 (0.26)        0.76 (0.32)       2    13.09    < .001   .06
CP/C        385   0.11 (0.11)       0.14 (0.10)        0.19 (0.10)       2    16.04    < .001   .08
CP/T        386   0.17 (0.17)       0.22 (0.15)        0.32 (0.18)       2    23.13    < .001   .11
T/S         385   1.09 (0.14)       1.13 (0.14)        1.15 (0.11)       2    6.37     .002     .03
CN/C        390   0.71 (0.29)       0.81 (0.22)        0.96 (0.26)       2    28.54    < .001   .13
CN/T        388   1.11 (0.55)       1.32 (0.46)        1.64 (0.54)       2    3.90     < .001   .14
VP/T        385   1.90 (0.46)       2.10 (0.45)        2.43 (0.51)       2    37.40    < .001   .16
VAC Types   390   6.54 (2.70)       9.76 (2.39)        11.64 (2.93)      2    117.92   < .001   .38
VAC CTTR    390   1.21 (0.31)       1.41 (0.26)        1.48 (0.27)       2    3.84     < .001   .14

Table 14 summarizes the between-level differences in the post-hoc Tukey HSD tests. The results showed that the differences between the nonadjacent levels (i.e., low-intermediate and advanced) were significant for all measures. Differences between the adjacent levels (low-intermediate and high-intermediate, and high-intermediate and advanced) were significant for most of the measures, except for the sentence complexity ratio (C/S), the four subordination measures, one coordination measure, and one diversity measure. The subordination measures (C/T, CT/T, DC/C, and DC/T) and C/S discriminated between the high-intermediate and advanced levels only. The coordination measure at the sentence level, T/S, and one diversity measure, VAC CTTR, discriminated only between the two lower levels.

Table 14
Post-hoc pairwise comparisons (p values) between each proficiency level

Dimension / Measure           Low-int. vs. High-int.   Low-int. vs. Advanced   High-int. vs. Advanced
Length of production units
  MLC                         < .001                   < .001                  < .001
  MLS                         < .001                   < .001                  < .001
  MLT                         < .001                   < .001                  < .001
Sentence complexity ratio
  C/S                         .06                      < .001                  .01
Subordination
  C/T                         .26                      < .001                  .02
  CT/T                        .65                      .00                     .03
  DC/C                        .66                      < .001                  .01
  DC/T                        .31                      < .001                  .00
Coordination
  CP/C                        .03                      < .001                  .00
  CP/T                        .02                      < .001                  < .001
  T/S                         .03                      .00                     .54
Particular structures
  CN/C                        .00                      < .001                  < .001
  CN/T                        .00                      < .001                  < .001
  VP/T                        .00                      < .001                  < .001
Diversity
  VAC Types                   < .001                   < .001                  < .001
  VAC CTTR                    < .001                   < .001                  .10

3.2.2 DFAs
The purpose of the DFAs was to examine whether a group of syntactic complexity measures that have been widely used in the fields of SLA and L2 writing is able to predict proficiency levels.
In addition, I wanted to investigate if the proposed diversity measures would add to the predictive power of syntactic complexity measures. 3.2.2.1 Variable selection As described earlier, 16 measures consisting of 14 syntactic elaboration measures and two diversity measures were originally computed for participants’ essays. Before conducting DFAs, three important assumptions for discriminant analyses were checked following Norris’ guidelines (2015). First, univariate normality of distribution was checked for each complexity measure at each proficiency level, and outliers were removed. Second, multicollinearity among predictor variables was checked using a correlation analysis. The bivariate correlations between measures are presented in Table 15. When high correlations between measures (r >. 80) were identified (Field, 2009), a single predictor variable was selected. Finally, sample size was checked. In the current study, the smallest group sample size was 94 (advanced group). In order to reduce the likelihood of model overfitting, I followed a criterion of 15 observations to 1 predictor. Such a ratio allowed me to include six to seven variables in the analyses. Multivariate outliers were also removed before running analyses. The final set of predictor variables was selected based on the correlation and ANOVA analyses. In selecting one of the strongly correlated measures, variables with higher effect sizes in ANOVA analyses were primarily selected, following previous studies (e.g., Crossley & McNamara, 2009; Crossley, Salsbury & McNamara, 2011; McNamara, Crossley & McCarthy, 2010). In addition, care was taken to select at least one measure for every dimension of syntactic complexity: length of production units, sentence complexity ratio, subordination, coordination, particular structures and diversity. As a result, seven measures remained for the main analyses 67 (five elaboration measures and two diversity measures): MLC, DC/T, CP/T, T/S, CN/T, VAC Types and VAC CTTR. Table 15 Bivariate correlations between syntactic complexity measures 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 1 MLC -- 2 MLS .59 -- 3 MLT .60 .91 -- 4 C/S -.05 .74 .63 -- 5 C/T -.09 .60 .68 .86 -- 6 CT/T -.08 .52 .57 .74 .82 -- 7 DC/C .02 .56 .62 .71 .75 .79 -- 8 DC/T .00 .59 .68 .80 .89 .82 .94 9 CP/C .57 .28 .29 -.10 -.14 -.09 .01 -.02 10 CP/T .55 .45 .47 .11 .08 .10 .19 .19 .94 -- 11 T/S .11 .42 .06 .44 .03 .10 .06 .04 .06 .08 12 CN/C .67 .53 .63 .14 .18 .20 .27 .28 .31 .35 -.01 -- 13 CN/T .45 .71 .84 .53 .63 .57 .61 .66 .17 .32 -.01 .85 -- 14 VP/T .30 .77 .86 .73 .84 .70 .73 .81 .10 .28 .05 .40 .73 -- 15 Type .26 .29 .28 .16 .11 .08 .12 .16 .16 .17 .15 .09 .13 .27 -- 16 CTTR .20 .22 .25 .14 .13 .10 .11 .14 .04 .07 .05 .13 .17 .26 .77 16 ---- 3.2.2.2 Discriminant function analyses: Elaboration and diversity measures In order to compare the predictive power of syntactic elaboration and diversity measures in classifying learners into proficiency levels, I conducted three discriminant analyses. Researchers can use a discriminant analysis when they have two or more groups (in this study, the groups are the learners at the three proficiency levels) that theoretically differ on several 68 -- interval-level independent variables (in this study the independent variables are syntactic elaboration and diversity measures). 
Researchers use discriminant analysis to analyze the differences among the groups (on the variables under study) and to provide a way to assign (classify) any one learner to the group that he or she, based on his or her measures of syntactic elaboration and diversity, most closely resembles (see Klecka, 1980, p. 8). In other words, the analysis examines whether group assignment can be predicted by the variables under study.

In the first analysis, the five elaboration measures were included in the model. The second model included the two diversity measures as predictors. In the third model, all seven complexity measures were entered as predictors. All three analyses identified two discriminant functions, with the first function accounting for 98.5%, 99.4%, and 98.4% of the discriminant ability of the variables in each model, respectively. Therefore, the first function explains most of the discriminant ability of the predictors in all three models. In the first model, the combined functions (1 and 2) showed a significant discriminating ability (Wilks' lambda = .65, χ2(10, N = 363) = 152.24, p < .001), indicating that the combined predictor variables were able to account for around 35% of the actual variance in proficiency between the three groups. In the second model, an overall statistically significant effect was found for the combined functions (Wilks' lambda = .60, χ2(4, N = 363) = 184.81, p < .001). In the third model, in which all seven measures were entered, the combined functions were also significant (Wilks' lambda = .44, χ2(14, N = 363) = 294.91, p < .001). The combined functions accounted for around 40% and 56% of the actual variance in proficiency between the three groups in Model 2 and Model 3, respectively.

Table 16
Relationship output for individual predictor variables and functions

Model / Variable              Correlation with function        Standardized canonical coefficient
                              Function 1     Function 2        Function 1     Function 2
Five elaboration measures
  MLC                         .82*           -.11              .90            -.30
  CN/T                        .65*            .15              .01            -.28
  CP/T                        .50             .61*             .01             .77
  T/S                         .27            -.53*             .29            -.55
  DC/T                        .36             .48*             .50             .52
Two diversity measures
  VAC Types                   .94*            .36              1.32           -.67
  VAC CTTR                    .45             .89*             -.53            1.39
Seven measures combined
  VAC Types                   .67*           -.57              1.10           -.33
  MLC                         .53*            .35              .49             .37
  CN/T                        .42*            .42              .30            -.30
  CP/T                        .32             .62*             -.02            .41
  VAC CTTR                    .32            -.57*             -.53           -.33
  DC/T                        .23             .47*             .14             .69
  T/S                         .18            -.19*             .13            -.18
Note. * Largest absolute correlation between each variable and any discriminant function. Variables are ordered by absolute size of correlation within function.

In Table 16, the first set of columns, the correlations between the discriminant functions and the predictor variables, presents the relationship between individual indices and each function. In the first model, which included the five measures of syntactic elaboration, MLC, CN/T, and CP/T were highly correlated with Function 1 (> .50). MLC was the strongest marker for the function (r = .82) and contributed the most to separating the proficiency groups. In the second analysis, with the two diversity measures, VAC Types was a stronger marker than VAC CTTR for the first function (r = .94). In the third analysis, in which all seven measures were entered, MLC and VAC Types were the two variables most highly correlated with the first function.
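The structure coefficients in Table 16 and the group centroids reported next can also be obtained directly from a fitted discriminant model. The sketch below only illustrates that computation; the reported values come from SPSS, and the Python code, file name, and column names are hypothetical.

import numpy as np
import pandas as pd
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

df = pd.read_csv("complexity_measures.csv")                       # hypothetical data file
predictors = ["MLC", "DC_T", "CP_T", "T_S", "CN_T", "VAC_types", "VAC_CTTR"]
X, y = df[predictors], df["proficiency_level"]

lda = LinearDiscriminantAnalysis(n_components=2).fit(X, y)
scores = lda.transform(X)                                         # discriminant function scores

# Structure coefficients: correlation of each predictor with each function
structure = pd.DataFrame(
    [[np.corrcoef(X[p], scores[:, i])[0, 1] for i in range(2)] for p in predictors],
    index=predictors, columns=["Function 1", "Function 2"])
print(structure.round(2))

# Group centroids: mean function scores per proficiency level (cf. Table 17)
centroids = pd.DataFrame(scores, columns=["Function 1", "Function 2"]).groupby(y.values).mean()
print(centroids.round(2))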
Table 17
Group centroids

Model                        Level               Function 1   Function 2
Five elaboration measures    Low-intermediate    -0.77         0.06
                             High-intermediate    0.06        -0.11
                             Advanced             1.09         0.08
Two diversity measures       Low-intermediate    -0.95        -0.03
                             High-intermediate    0.24         0.08
                             Advanced             1.09        -0.07
Seven measures combined      Low-intermediate    -1.26         0.09
                             High-intermediate    0.23        -0.18
                             Advanced             1.58         0.15

Table 17 shows the group centroids for each function. Group centroids are the means of the discriminant function scores by proficiency group (low-intermediate, high-intermediate, and advanced). The discriminant function scores are derived using the discriminant equations, which are similar to equations obtainable through multiple regression. Individual standardized scores are multiplied by the standardized canonical function coefficients (see Table 16) to compute function scores (Ramos & Liow, 2013). Figures 1 through 3 display these proficiency group centroids and individual cases for each analysis in two dimensions. The horizontal dimension displays Function 1, and the vertical dimension displays Function 2. All three figures illustrate that the first function distinguished between the three groups more clearly than the second function. However, the horizontal distances between the levels were not equal in the first and second models. In the first model, the distance between the advanced level and the other two levels was larger than the distance between the two lower groups, indicating that the first function distinguished the advanced level from the other two groups rather well (Figure 1). In the second analysis, the first function distinguished much more strongly between the low-intermediate level and the other two levels (Figure 2).

Figure 1. Cases and group centroids for two discriminant functions: 5 elaboration measures
Figure 2. Cases and group centroids for two discriminant functions: 2 diversity measures

Finally, Table 18 shows the classification results for the discriminant analyses. The raw sample counts present the predicted group (proficiency level) frequencies from each analysis, showing the numbers of cases that were correctly and incorrectly classified. For example, 97 cases out of 137 original low-intermediate level cases were correctly predicted, while 34 and 6 cases were predicted to be high-intermediate and advanced level, respectively. Overall, in all three analyses, the combined Functions 1 and 2 were able to classify the cases into the three levels correctly above chance level (i.e., above 33%). The five elaboration measures alone offered a predictive value of 52.9%, and the two diversity measures alone correctly predicted 61.9% of cross-validated grouped cases. The combined use of all seven measures worked best, correctly predicting 68.6% of the cross-validated grouped cases.

Figure 3. Cases and group centroids for two discriminant functions: 7 measures combined

The five elaboration measures were found to be more useful in predicting placement into the low-intermediate level than into the other two groups. Accuracy of predicted placement was much higher for the low-intermediate level, with 67.9% of cross-validated cases predicted correctly. The predictions were substantially less accurate for the high-intermediate and advanced levels, with only 41% and 48% of cases classified accurately. In the second analysis, with the two diversity measures, the accuracy of assigning essays to the low-intermediate level was still superior to that of the other two levels, with a predictive value of 73%.
Compared to the elaboration measures, the diversity measures were extremely useful for the accurate prediction of placement into the high-intermediate level. The prediction was accurate for about 60% of the cases for this level. The prediction accuracy for the advanced group was similar to that of the first model. Lastly, the combined use of all seven measures was found to be the most useful for the accurate prediction of placement. The combined measures exhibited the highest prediction 74 accuracy in all three levels—the low-intermediate (74%), the high-intermediate (68%), and the advanced level (61%). Although the combined measures still predicted the low-intermediate level essays best, placement into the high-intermediate and advanced levels was more accurate than when either elaboration or diversity measures were used alone. All three levels were classified at a more similar rate compared to the first and the second model. Table 18 Prediction of group membership according to three discriminant analyses Predicted group membership Low-intermediate High-intermediate Advanced Total Original grouped cases Model Raw 1a sample count Low-intermediate 97 34 6 137 High-intermediate 51 62 24 137 Advanced 5 40 44 89 Percentage Low-intermediate 70.80 24.82 4.38 100.00 High-intermediate 37.23 45.26 17.52 100.00 Advanced 5.62 44.94 49.44 100.00 Low-intermediate 101 33 4 138 High-intermediate 36 83 20 139 Advanced 10 37 43 90 Percentage Low-intermediate 73.19 23.91 2.90 100.00 High-intermediate 25.90 59.71 14.39 100.00 Advanced 11.11 41.11 47.78 100.00 Low-intermediate 103 32 2 137 High-intermediate 24 94 19 137 Advanced 1 34 54 89 Percentage Low-intermediate 75.18 23.36 1.46 100.00 High-intermediate 17.52 68.61 13.87 100.00 Advanced 1.12 38.20 60.67 100.00 Model Raw 2b sample count Model Raw 3c sample count a b Note. 55.9/52.9 of original/cross-validated grouped cases correctly classified; 61.9/61.9% of original/cross-validated grouped cases correctly classified; c 69.1/68.6% of original/crossvalidated grouped cases correctly classified. 75 Table 18 (cont’d) Predicted group membership Low-intermediate High-intermediate Advanced Total Cross-validated grouped cases Model Raw 1a sample count Low-intermediate 93 38 6 137 High-intermediate 53 56 28 137 Advanced 5 41 43 89 Percentage Low-intermediate 67.88 27.74 4.38 100.00 High-intermediate 38.69 40.88 20.44 100.00 Advanced 5.62 46.07 48.31 100.00 Low-intermediate 101 33 4 138 High-intermediate 36 83 20 139 Advanced 10 37 43 90 Percentage Low-intermediate 73.19 23.91 2.90 100.00 High-intermediate 25.90 59.71 14.39 100.00 Advanced 11.11 41.11 47.78 100.00 Low-intermediate 102 33 2 137 High-intermediate 24 93 20 137 Advanced 1 34 54 89 Percentage Low-intermediate 74.45 24.09 1.46 100.00 High-intermediate 17.52 67.88 14.60 100.00 Advanced 1.12 38.20 60.67 100.00 Model Raw b 2 sample count Model Raw 3c sample count Note. a 55.9/52.9 of original/cross-validated grouped cases correctly classified; b 61.9/61.9% of original/cross-validated grouped cases correctly classified; c 69.1/68.6% of original/crossvalidated grouped cases correctly classified. In summary, the results show that complexity measures, either elaboration or diversity measures, or a combination, provided a better overall predictive power for the lowest-level essays than for essays in the two upper proficiency levels. In addition, diversity measures were found to be more useful than elaboration measures in predicting placement into the highintermediate level. 
While the five elaboration measures alone predicted the advanced level 7% 76 better than the upper-intermediate level, the two diversity measures alone exhibited 12% extra predictive value in favor of the high-intermediate level. The use of all seven measures afforded the best predictive placement power into all three levels. 3.3 Correlation and regression analyses: Research question 2 The second research question asks whether syntactic complexity measures are predictive of writing quality as judged by human raters. In order to investigate this question, two correlation analyses were performed between each complexity measure and 1) Language Use scores and 2) Total scores. Two multiple linear regressions were then conducted with each writing quality rating score as an outcome variable and the various syntactic complexity indices as predictor variables. 3.3.1 Correlations The correlation analyses showed that all syntactic complexity indices were significantly and positively correlated with Total and Language Use scores, indicating that essays with higher scores on these variables tend to be given higher writing scores (Table 19). Differences between the results for Total and Language Use scales were slight. The strongest correlations were found for VAC Types followed by length-based measures—MLT, MLS, and MLC— in both cases. Following Cohen (1992), who defined effect sizes R2 = .01, .09, and .25 as small, medium, and large effects, respectively, VAC Types was understood to have a strong relationship with writing quality scores. The effect sizes associated with measures for length of production units and particular structures were medium (.09 to .23). The effect sizes for subordination (C/T, CT/T, DC/C) and coordination measures (CP/C, CP/T, T/S) were found to be small (.02 to .09). 77 Table 19 Correlations between syntactic complexity measures and subjective ratings Length of production units Sentence complexity ratio Subordination Language Effect size Use (R2) MLC .42*** .18 .43*** .18 MLS .47*** .22 .47*** .22 MLT .48*** .23 .46*** .21 C/S .23*** .05 .22*** .05 C/T .18*** .03 .15** .02 CT/T .17** .03 .13* .02 DC/C .18*** .03 .15** .02 DC/T *** .06 *** .04 *** .07 *** .09 *** .24 *** CP/C Coordination Particular structures Diversity .26 *** CP/T .29 .07 .08 Total Effect size .21 .27 .30 (R2) T/S .15 ** .02 .20 .04 CN/C .30*** .09 .29*** .08 CN/T .33*** .11 .31*** .10 VP/T .37*** .14 .35*** .12 VAC Types .62*** .38 .68*** .46 VAC CTTR .37*** .14 .39*** .15 Note. * p < .05, ** p < .01, *** p < .001. 3.3.2 Regression analyses 3.3.2.1 Variable selection Some assumptions for multivariate analysis were examined before running regression analyses. After controlling for the normality assumption by removing univariate and multivariate outliers and examining the multicollinearity assumption by checking inter-variable correlations, the same set of predictor variables that were used for discriminant function analyses 78 were selected for the regression analyses: MLC, DC/T, CP/T, T/S, CN/T, VAC Types and VAC CTTR. The multicollinearity and independent errors assumptions were checked after running the regression analyses. The VIF values were all under 5 (Table 21). The assumption for independent errors was tested with the Durbin-Watson test. The values (= 1.90 for Total score; = 2.00 for Language Use score) were sufficiently close to 2; thus it was assumed that the residuals were uncorrelated. 
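For illustration, the sketch below shows how these diagnostic checks (VIF values and the Durbin-Watson statistic) and the standard multiple regression itself can be computed. The study's models were fitted in SPSS; the Python code, the file name, and the column names here are hypothetical.

import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.stats.stattools import durbin_watson

df = pd.read_csv("complexity_measures.csv")              # hypothetical data file
predictors = ["MLC", "DC_T", "CP_T", "T_S", "CN_T", "VAC_types", "VAC_CTTR"]
X = sm.add_constant(df[predictors])

# Standard multiple regression of a writing quality score on the seven measures
model = sm.OLS(df["total_score"], X).fit()
print(model.summary())                                    # R^2, coefficients, t tests

# Multicollinearity: VIF for each predictor (the constant in column 0 is skipped)
vifs = {p: variance_inflation_factor(X.values, i + 1) for i, p in enumerate(predictors)}
print(vifs)

# Independence of residuals
print("Durbin-Watson:", durbin_watson(model.resid))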
3.3.2.2 Relationship between syntactic complexity indices and Total score The standard multiple regression model with seven complexity measures as predictors revealed that there was a statistical relationship between the set of predictor variables and Total score (F(7, 355) = 73.51, p < .001). The result showed that about 59% of the variance in total scores was accounted for by the set of variables (Table 20). Table 20 Multiple regression analyses: Model summary R R2 Adjusted R2 Std. Error of the Estimate Durbin-Watson Language Use .71 .51 .50 2.63 2.00 Total .77 .59 .58 10.78 1.90 79 Table 21 Standard regression coefficients Unstandardized Standardized Coefficients Coefficients t B Language (Constant) Std. Error -.19 Statistics Sig. β -0.32 1.69 Collinearity Correlations Zero- Partia Part Toleran order l (sr2) ce VIF .85 Use score MLC 0.65 0.17 .25 3.90 < .001 .47 .20 .15 0.34 2.98 DC/T 1.19 0.80 .09 1.49 .14 .25 .08 .06 0.36 2.78 CP/T 0.03 1.04 .00 .02 .98 .32 .00 .00 0.61 1.63 T/S 1.32 1.15 .04 1.15 .25 .17 .06 .04 0.95 1.05 CN/T 1.02 0.51 .14 2.00 .05* .40 .11 .07 0.28 3.60 VAC Types 0.76 0.07 .66 10.75 < .001 .59 .50 .40 0.37 2.74 VAC CTTR -3.11 0.74 -.25 -4.19 < .001 .34 -.22 -.16 0.40 2.51 Total (Constant) -1.494 6.91 -0.22 .83 MLC 2.37 0.68 .20 3.48 .00** .47 .18 .12 0.34 2.98 DC/T 1.60 3.29 .03 0.49 .63 .22 .03 .02 0.36 2.78 CP/T 2.26 4.28 .02 0.53 .60 .33 .03 .02 0.61 1.63 10.58 4.70 .08 2.25 .03* .22 .12 .08 0.95 1.05 CN/T 5.58 2.10 .17 2.66 .01*** .38 .14 .09 0.28 3.60 VAC Types 4.06 0.29 .78 13.96 < .001 .65 .60 .47 0.37 2.74 VAC CTTR -17.48 3.04 -.31 -5.75 < .001 .36 -.29 -.19 0.40 2.51 T/S Note. *p < .05, ** p < .01 The independent relationship between Total score and the predictor variables was examined through regression coefficients and their significance. As shown in Table 21, t-test results indicated that MLC, T/S, VAC Types and VAC CTTR contributed uniquely to the outcome variable. DC/T and CP/T did not contribute to the regression model. The relative 80 importance of each variable was examined by comparing squared semipartial correlations (sr2) for each term. Among the variables, the strongest predictor of Total score was VAC Types (sr2 = .47, B = 4.061, β = .783) followed by MLC (sr2 = .12, B = 2.369, β = .204), CN/T (sr2 = .09, B = 5.579, β = .171), and T/S (sr2 = .08, B = 10.583, β = .078). 3.3.2.3 Relationship between syntactic complexity indices and Language Use score Similar results were found in the regression model that investigated the relationship between syntactic complexity measures and Language Use score. The model was significant (F(7, 355) = 52.414, p < .001), and the set of variables predicted 51% of the variance in the language use score (Table 20). Four of the predictor variables independently contributed to Language Use score: MLC, CN/T, VAC Types and VAC CTTR. As was the case for the Total score, VAC Types was the strongest predictor of the language use score (sr2 = .40, B = 0.762, β = .762). The second and the third strongest predictors were MLC (sr2 = .15, B = .648, β = .251) and CN/T (sr2 = .07, B = 1.020, β = .141), respectively. 3.4 Rater interview results: Research question 3 The third research question addresses raters’ perceptions of the rating and the rating scale in relation to grammatical complexity. The interviews started with a more general question about their overall rating process and then proceeded with more specific questions regarding rating for the Language Use scale. 
3.4.1 Overall rating process 3.4.1.1 Rating sequence When asked to describe their rating process, most of the raters reported that they started rating by skimming through an essay. Then they rated each subscale of the rubric. Some raters reread (some parts of) the essay to assign the final scores. Which part of the rubric received 81 attention first varied from rater to rater. Although moving directly from left to right on the rubric—in the order of Content, Organization, Vocabulary, Language Use, and Mechanics—was the most common strategy, not all of the raters suggested that they followed this pattern. One rater reported that he worked in the opposite direction: “For my overall rating, I usually worked backwards on the rubric. I would start with Mechanics, because to me that was sort of the category that was kind of arbitrary in a way. It was a harder one to use” (Rater 5). Rater 1 said that the Language Use section was the area that he addressed first when he was looking at an essay. Another rater reported that the order of scoring differed from essay to essay. He started by rating one or two categories that stuck out after skimming a given essay: So for instance, if an essay, regardless of the length, has really good grammatical structure, I might look at the language use category first. And then, other essays that, yeah, the ideas are really strong, if that stands out initially, I’ll look at the content band first. So I would say I don’t rate in the same order in every essay. (Rater 4) 3.4.1.2 Lack of information provided by the rating scale In giving scores for each section of the rubric, a common strategy was to decide in which band to place the essay first and then assign a score within the band: I looked at the rubric and kind of decided first within which band I wanted to go. And then usually thinking carefully about the different components of the band depending how high up, you know, between an eleven or a fifteen or whatever, I wanted to give it. (Rater 3) 82 And that was how I would make the decision in terms of what band I was in. And then from there, I would kind of just start considering the strength of each of the qualities. If they’re all present, but they’re like, “Well, it could be better, I mean, it’s there,” it’d be a lower score, if it’s like, “It’s quite good, but it doesn’t quite move it to the next band up,” then it would still get a score, a higher score, within that band. (Rater 5) Some raters noted the difficulty of allocating a specific score within the selected band: “In terms of the Language section, I guess really the most difficult part sometimes was trying to assign scores within a band” (Rater 6). In the next extract, Rater 3 complained: In other words it’s …with this amount of information I think it is very, very difficult to clearly justify the difference between a seventeen and an eighteen. I think that is very, very arbitrary on the part of the rater. […] it was not difficult to decide most of the time which of the bands it was going to go into, but in going, you know, twelve, thirteen, eleven, it’s kind of arbitrary. (Rater 3) More experienced raters seemed to have internalized the descriptors in the rubric and had less difficulty in deciding what scores to give: “The rubric, often times it doesn’t make any difference as long as I can put them in the proper bands” (Rater 2). Rater 1 is similarly experienced: I’m too accustomed to these things. 
And, really, if you said take out all these descriptors and put these essays in those numbers, I would do that. I don't need that stuff at this point. […] In all language categories, you know, can I sit down and then write a description of why I did that in terms of those kinds of descriptors without looking at that? Yes, I can. (Rater 1)

Several less experienced raters reported that they referred back to the benchmark essays that all the raters had evaluated during the norming session.

So, after the norming session, I realized that I tended to rate the higher essays too low and the lower essays too high. So, I was having trouble going to either extreme. So, that was a consideration throughout. So, for the first while, while I still felt like I was getting comfortable, I tended to rate them, and if I thought that it was a poor essay, I would tend to go with my original rating, and then take it down a mark or two, and the same thing for the higher ones, I typically take my marks up if I thought that they were strong essays. (Rater 5)

3.4.2 Rating process for Language Use
I specifically asked the interviewees to describe their rating process for the Language Use section of the rubric. I selected three essays that each rater had placed into a high, mid, and low score band and asked the rater what affected the rating of those essays. The raters read through the essays and reflected on their rating process.

Table 22
Language Use section of the rubric

Score band 20–16
  No major errors in word order or complex structures
  No errors that interfere with comprehension
  Only occasional errors in morphology
  Frequent use of complex sentences
  Excellent sentence variety
Score band 15–11
  Occasional errors in awkward order or complex structures
  Almost no errors that interfere with comprehension
  Attempts, even if not completely successful, at a variety of complex structures
  Some errors in morphology
  Frequent use of complex sentences
  Good sentence variety
Score band 10–6
  Errors in word order or complex structures
  Some errors that interfere with comprehension
  Frequent errors in morphology
  Minimal use of complex sentences
  Little sentence variety
Score band 5–0
  Serious errors in word order or complex structures
  Frequent errors that interfere with comprehension
  Many errors in morphology
  Almost no attempt at complex sentences
  No sentence variety

3.4.2.1 Balancing between accuracy and complexity
In describing their rating process for the Language Use section, all the raters referred directly to the scale. They seemed to have taken most of the criteria in the rubric into account. One theme that regularly emerged from the interviews was the raters' attempt to balance two constructs of grammatical ability: accuracy and complexity. I could see that the raters considered both aspects of grammatical ability while scoring essays for language use, following the descriptors in the rubric. As shown in Table 22, the Language Use section of the rubric used in the current study simultaneously addresses both errors and the complexity of the language used in writing. For example, an essay is evaluated in terms of the use of complex structures, complex sentences, morphology, and sentence variety, as well as in terms of the errors in them. Rater 4 reflected that he considered both the variety and the accuracy of the language:

So, this essay, to my recollection, has a good variety of grammatical structures. Um, and not only is it a good variety, they're fairly accurate.
[…] If I'm looking at the rubric, the top category…so no major errors in word order or complex structures. Looking at this again, I still don't see any global or local errors. No errors that interfere with comprehension. Yeah, I mean, overall, excellent sentence variety. I think there is a pretty good variety of clause structure. (Rater 4)

Similarly, Rater 6 recalled that, while rating one essay, he was impressed by the use of complex phrases and sentences, and at the same time, he noticed some errors that prevented him from giving a higher score:

This one … it seems like, I think, I was impressed by the phrasal structures. You know, the essay starts off with, like, "Through the evaluation of the groups' success", kind of a dense, pretty complex phrase to start off with. And I see that throughout. There's also a lot of complex sentences, um, "Although, they put much more efforts to investigate, write a paper, and complete a presentation, it is not fair that some free riders ignoring the parts that should been done can get the same score." There were a few errors here and there, which is probably why I didn't go higher. (Rater 6)

3.4.2.2 Criteria not specified in the rubric
Several raters mentioned a few factors that affected their rating process even though these factors are not specified in the rubric. One factor that regularly emerged across the interviews was the length, or fluency, of the essays. This issue was often mentioned while the raters were describing their rating process for essays placed into the lower bands of the rubric. Raters commented that they gave low scores to short essays because there was not much to evaluate: "I think I want to cross this towards the bottom in part because it was weak. Because there was very little to evaluate" (Rater 1); "You can't say there is sentence variety, there's really no sentences here" (Rater 5).

Difficulty of grammatical structures was another criterion for rating mentioned by several raters. While reflecting on the rating process for a sample high-score essay, Rater 3 referred to the good use of a difficult structure in the essay: "Uh, this is very nice, '…which means it could not have been completed…' Nice use of modal, very nice. Modals are hard as I'm sure you know." Rater 4 also thought that his teaching experience and his knowledge of structures with which learners have difficulty affected his rating:

Use of different adverb phrases. Um…yeah I think a lot of it. I am very heavily influenced by the writing class I teach here that has some pretty explicit grammar objectives. Um, so when I start to see those, a lot of the grammar structures I teach, if I start to see those used accurately in these types of evaluations, that to me is an indicator that these students have flexibility in knowing not only what different subordinators use, for instance. Like, the use of 'so that.' 'So that' uses an adverb of purpose, and 'so' uses a conjunction. That's really tricky for a lot of students. And if I can…if students use those, those are just two of many examples, students are able to use those, pretty accurately, I think that is an indication, and this is obviously a proficiency exam, that to me is an indication of proficiency that is higher than um…knowing how to use 'because'. 'Because' is…maybe level two here at the ELC. Yeah, so I think that is mostly what I am looking for: the use, the use of, in the case of adverbs, adverb clauses that are using different subordinating words and phrases with fair amount of accuracy.
(Rater 4)

From his experience, the rater knew that certain subordinators posed more of a challenge than others, and he gave credit for the use of those structures. He also commented that he did not take errors in the use of determiners into consideration because he knew determiners to be one of the most challenging aspects of learning English.

3.4.3 Perceptions of the language use section of the rubric
I asked the interviewees how they had perceived the descriptors of the Language Use section of the rubric. The following themes emerged from their comments.

3.4.3.1 Tension between accuracy and complexity
As described earlier, the raters attempted to encompass both accuracy and grammatical complexity in evaluating the language used in the essays. However, it was often difficult for them to balance accuracy against complexity. The following comment from Rater 5 illustrates this challenge:

The hardest thing for me to balance between was ideas like, so if we look here, attempts, even if not completely successful at a variety of complex structures, right? Uh… but, occasional errors and awkward order complex structures. That one's differentiated here from error in word order or complex structures. So, to me it's kind of difficult in that, ok, they're attempting to use complex structures, which is great, but let's say that the attempt doesn't work because they don't have the word order correct. (Rater 5)

Not all the raters valued accuracy and complexity to the same extent. For example, Rater 1 prioritized accuracy over complexity. He recalled that he first considered the overall accuracy of the language used in an essay: "…so as soon as I am evaluating an essay, I am looking for accuracy, especially in terms of word order, and phrase construction, and phrase order. […] That's the kind of thing that sticks out more to me" (Rater 1). To Rater 1, accuracy was the construct that gave him his first overall impression of the language used in an essay.

3.4.3.2 Overlap with other categories of the rubric
Another theme that emerged regarding the raters' perceptions of the rubric was an overlap between the Language Use section and other categories of the rubric. As described earlier, raters often took the length or fluency of the essays into consideration when scoring the language use category. However, evaluating the "number of words for the amount of time given" is a criterion specified in the Content category of the rubric as well. Another rater noted the overlap between the two categories when evaluating the comprehensibility of the language:

Uh, "Frequent errors that interfere with comprehension." It's completely incomprehensible, right? "(reading a student's essay) Other than he thinks it's fair." What's fair? And that goes to content organization. […] But, I mean, other than that, like, it's just incomprehensible. That starts with the language that you put out there. I mean, you can have perfectly structured English, and not be comprehensible. Right? […] Your language doesn't have to be perfect to be comprehensible, but it at least has to make… combined in some way that one can understand what is happening here. Right? Ah, here it just doesn't combine in any way that makes any logical sense. (Rater 5)

In addition, some raters noted that it was often difficult to separate language use from vocabulary use.
The distinction between the two categories became problematic especially in relation to morphology issues: I think at times it’s, it was difficult to like, I’m looking at language usage, and so I’m looking, really to me, language use was basically grammar, in a lot of ways. But, uh, at times, like, morphology comes up with it, so a vocabulary can be tied to morphology to me very easily. (Rater 5) There had been sometimes essays with, either…actually pretty clean in the sense that …very few errors… the syntax, the composition of sentences but some really awkward word choices; um, and then with the, uh, morphology thing: Is that derivational morphology or, you know, or nominalizations… when it should have been an adjective. Things like that, kind of sometimes created a little bit of difficulty to keep those two categories separate. (Rater 6) 3.4.3.3 Vagueness of descriptors One of the most frequently mentioned problems that raters faced with the Language Use scale was the vagueness of its descriptors: “I think the, the descriptors… I think that’s what’s tricky. Language Use is one of the more challenging parts of evaluating essays with this type of rubric” (Rater 4). Difficulty with referring to the rubric due to the vagueness of its descriptors is also reflected in the following comment by Rater 5: 90 I think there is always the question, occasional errors in awkward order…well yeah, what is considered a complex structure and what is not considered a complex structure? Um, I, I think my other writing colleagues and I…I think that’s…Is there a specific definition of what that means? It almost seems like it’s one of those things…we know it when we see it, but we don’t know how to define it beforehand. (Rater 5) This comment by Rater 5 shows a lack of common and concrete definitions of descriptors that were shared among the raters; thus, the evaluation based on these criteria depended upon the subjective interpretations of raters. Consequently, the raters interpreted the descriptors in the rubric in different ways. In order to detect raters’ interpretations of the descriptors, I specifically asked the raters how they understood major terms in the rubric such as morphology, complex structures, complex sentences and sentence variety. Table 23 summarizes their comments. Morphology was mostly interpreted as word forms or word endings. However, most raters did not comment much on this criterion when describing their rating process. It seemed that they did not pay much attention to the criterion because morphology does not generally raise problems in comprehension, and even the highest score band features essays with (occasional) errors in morphology. The following comment by Rater 1 illustrates this point: Morphology. Word endings, which wasn’t usually a problem with most of these essays. I would say overall morphology was pretty good. Um, people didn’t really have trouble with plurals, which is, you know, one of the easiest parts of, um, English morphology, in my opinion at least. Um, verbs are a little bit harder, but for the most part they weren’t that bad. 
(Rater 1) 91 Table 23 Raters’ interpretations of the descriptors in the rubric Rater 1 Morphology Complex Complex structures sentences word form  Multi-clausal sentences Sentence variety  Using simple sentences  Adverb clauses effectively and then complex  Adjective clauses sentences effectively  A properly subordinated clause either placed in front with comma  Matrix clause followed by a dependent clause or following a main clause without  Dependent clause followed by a matrix clause  Fronting of elements other than a subject  Using simple, compound, and complex sentences, not an overreliance on simple structures or compound structures  Using clauses (when-clauses, Rater 2  Little variety means using adjective clauses, noun clauses, the same kinds of sentences adverb clauses) appropriately all the time.  Not just simple sentences, using compound and complex sentences Rater 3 word endings  Dependent clauses  Complex noun  Main clause  Clauses led by various with dependent subordinators, e.g., if, even if, clauses wh-words  Not repeating the same phrase with pre- or post- sentence pattern modification 92 Table 23 (cont’d) Morphology Complex Complex structures sentences  A variety of adverb clauses Rater 4  Noun clauses with relative clauses  A fairly well-constructed sentence that has two or three clauses Rater 5 word form Sentence variety  Different types of clause structures  Different uses of phrases  Reductions  Multiple clauses  Relative clauses Rater 6 word form  Passive voice  Coordinated  Perfect aspect sentences  Relative clauses Rater 7  Subordinated sentences  Transition words  Classic subject + verb + object, choppy and short pattern illustrates a less complex structure In most cases, complex structures were interpreted in the same way as they interpreted complex sentences. They were understood to refer to sentences with multiple clauses. Most raters (five out of seven) said they did not distinguish between the two at all. They did not actually notice that the rubric describes complex structures and complex sentences as separate criteria. There is some overlap with complex structures and complex sentences and sentence variety. It kind of feels like, if you have one or two of those then you have the rest of them by default. (Rater 6) 93 Those who distinguished the two categories did not have the same understanding of the terminology either. One rater mentioned passive voice and perfect aspect as examples of complex structures, while the other referred to complex noun phrases. Lack of reference to other structures does not necessarily mean that they were not considered as complex structures by raters, however. As was evident in an earlier extract from Rater 5, it is possible that raters did not have specific structures in mind until they read and identified particular structure from the essays. Lastly, to most of the raters, sentence variety meant not using the same sentence pattern repeatedly. What was meant by sentence pattern varied from rater to rater, however. The use of simple, compound, and complex sentences alternately, different clause structures, clauses led by numerous subordinators and varied order of clauses exemplified the notion of sentence variety. The fronting of phrases or clause-to-phrase reduction were also referred to as an indication of diversity in sentence patterns. 
CHAPTER 4: DISCUSSION

This chapter summarizes the results and discusses the findings of the current study in relation to previous research in the fields of SLA, L2 writing, and L2 assessment. The following sections are organized according to the three research questions.

4.1 Research question 1: Syntactic complexity and proficiency
The first research question for this study asked whether the syntactic complexity of Korean EFL learners' writing production, as measured by various quantitative complexity measures, functions as an indicator of proficiency. The results presented in Chapter 3 show that all syntactic complexity measures examined in the present study increased linearly as the proficiency levels increased. These measures were: 1) three length-based measures (mean length of clause [MLC], mean length of sentence [MLS], and mean length of T-unit [MLT]), 2) a sentence complexity ratio (number of clauses per sentence [C/S]), 3) four subordination ratios (T-unit complexity ratio [C/T], complex T-unit ratio [CT/T], dependent clause ratio [DC/C], and dependent clauses per T-unit [DC/T]), 4) three coordination measures (coordinate phrases per clause [CP/C], coordinate phrases per T-unit [CP/T], and sentence coordination ratio [T/S]), 5) three measures of specific structures (complex nominals per clause [CN/C], complex nominals per T-unit [CN/T], and verb phrases per T-unit [VP/T]), and 6) two diversity measures (verb-argument construction types [VAC Types] and VAC corrected type-token ratio [VAC CTTR]). A series of one-way ANOVAs confirmed that the effect of proficiency level on these complexity measures was significant.

The results for the length-based measures were generally consistent with previous studies, in which these measures were confirmed to be indicators of L2 proficiency. Wolfe-Quintero et al. (1998), in their research synthesis, stated that MLC, MLS, and MLT tended to increase at each proficiency level defined by school or program level, and that these measures discriminated among the levels. Lu (2011) also reported that MLC, MLS, and MLT discriminated the two lowest school levels among four, as well as three nonadjacent levels. Gyllstad et al. (2014) found strong correlations between CEFR scores and MLT and MLC. The results of this study also showed that the differences in MLC, MLS, and MLT among all three adjacent proficiency levels were significant. They all increased with at least a medium effect size (ηp2 ≥ .05). Previous studies have provided mixed findings regarding subordination measures. Wolfe-Quintero et al. reported that C/T, DC/C, and DC/T increased along with proficiency levels. Gyllstad et al. also found that C/T was a strong discriminator between CEFR levels A (basic) and B (independent) and that the measure increased across the two levels. Lu (2011), however, found that DC/C and DC/T each discriminated nonadjacent levels in a negative rather than progressive direction, and that C/T and CT/T did not discriminate school levels. Biber et al. (2011) also questioned the validity of clausal subordination for assessing grammatical complexity in writing development, arguing that clausal subordination features are characteristic of everyday conversation rather than of written language.
The results of the current study seem to support Wolfe-Quintero et al.’s findings, in that all subordination measures (i.e., C/T, CT/T, DC/C, and DC/T) linearly increased across the levels, and although the effect sizes were small, the differences due to proficiency were significant for all subordination measures except CT/T. In addition, post-hoc multiple comparison analyses produced results contrary to those of Gyllstad et al. While Gyllstad et al. found C/T to be a discriminator for low proficiency levels, all four subordination measures were found to work as a discriminator for the two higher levels but not for the lower levels in the current study. This result corroborates Norris and Ortega’s (2009) claim that subordination measures should be more suitable for measuring 96 complexity at intermediate and upper-intermediate levels rather than at beginning levels. Norris and Ortega also claimed that these indices are not very useful for measuring advanced levels of L2 development, for which Lu (2011) has provided empirical support. This study has not found evidence for the above claim, however, as the subordination measures continued to increase across the levels. The contradictory results may have been due to differences in the populations investigated in the two studies. Perhaps the advanced level students in this study were not as advanced L2 writers as Lu’s highest proficiency group. However, comparison of the average values of each measure in the two studies did not reveal evidence for this difference in levels. The mean values of DC/T for the lowest level was 0.56 in this study, which was comparable to the values for the two lowest levels in Lu’s study, 0.52 and 0.54. The measure then decreased to 0.50 and 0.44 in Lu’s study, while the mean value continued to increase in the present study (0.61 and 0.76, for the high-intermediate and advanced levels, respectively). The contradictory findings invite more empirical investigations on subordination measures in future research. The current study reveals a general progression in coordination measures as well. Two phrasal coordination measures, CP/C and CP/T, distinguished among all three levels. A sentential coordination measure, T/S, discriminated the two lower levels, suggesting its greater usefulness as an indicator of lower-level proficiency. The results support Bardovi-Harlig’s (1992) proposal that coordination measures can be more sensitive in capturing the proficiency development of beginning learners because coordination can occur at different levels (i.e., phrase, clause, and sentence levels). This study’s results suggest that sentential coordination is a more sensitive measure for earlier L2 development. Three measures of specific structures (CN/C, CN/T, and VP/T) also significantly increased with increases in proficiency level, with a medium effect size, and the measures 97 distinguished among all three levels. The results for nominal phrases are in line with Lu (2011), who found strong discriminative power in the complex nominals. However, the results for verb phrases (VP/T) were contrary to his findings. This measure also significantly increased along proficiency levels, while Lu did not find significant differences between school levels. In addition, the results of this study show that indices for syntactic diversity could function as indicators of proficiency. Two diversity measures proposed in the present study were also shown to statistically distinguish all three proficiency levels. 
VAC Type linearly increased, with a large effect size (≥ 0.25). More advanced L2 learners used more varied types of verbargument constructions, and the corrected VAC type-token ratios also increased significantly across the proficiency levels. The results are broadly in line with Crossley and McNamara (2014), who found that the syntactic similarity score decreased significantly over time, indicating the association between the production of a wider variety of syntactic constructions and language development. Researchers have noted the multi-dimensional nature of syntactic complexity and the importance of using multiple measures in assessing the construct, arguing that no single measure is a perfect indicator for proficiency (Norris & Ortega, 2009). In an attempt to add empirical support for this claim, the present study examined whether a number of measures that represent different dimensions of syntactic complexity could, as a group, efficiently function as indicators of writing proficiency. I first investigated the predictive power of the model consisting of five measures of syntactic elaboration in discriminating proficiency levels. The model included one length-of-unit measure (MLC), one subordination measure (DC/T), coordination measures at a phrasal and sentential level (CP/T and T/S), and one measure of nominal phrasal complexity (CN/T). Next, I compared this model with the second model including two measures of 98 syntactic diversity—VAC Types and VAC CTTR. Finally, I investigated whether the combination of all these measures would increase the predictive power. Overall, the results of discriminant analyses showed that the sets of syntactic complexity measures could predict proficiency level. In all the models, group memberships were found to be reliably predicted from the set of predictor variables (shown by the significant Wilks’s lambda and chi-square results associated with each function). More than half of the cases (essays) were classified correctly by each model (53%, 62%, and 69%, respectively), which was far above chance level (i.e., 33%). These models were most accurate in predicting the low-intermediate level correctly. Accuracy of predicted placement into this level was around 70% in all the models. The diversity measures afforded better predictive power than the elaboration measures overall and were found especially useful in discriminating the low-intermediate and highintermediate levels. While the elaboration measures incorrectly predicted 39% of the highintermediate level essays as low-intermediate and predicted 41% correctly, the diversity measures wrongly placed only 26% of the high-intermediate level essays into the lowintermediate level and placed 60% correctly. The third model including all seven measures accounted for the largest variance in proficiency between the three groups and showed the highest predictive accuracy for all three levels, indicating that the proficiency levels can be best predicted by the combination of both elaboration and diversity indices. Among the seven measures, VAC Types was found to contribute the most to the predictive power of the model, followed by MLC and CN/T. To summarize, the results of the present study suggest that all the investigated syntactic complexity measures can function as indices of proficiency. However, not all the measures functioned in the same manner. Some measures were found to effectively discriminate all levels, 99 while others discriminate only one adjacent level pair. 
Overall, the ANOVAs showed that the diversity measures, length-based measures, and measures of particular structures were good indicators of proficiency. They increased linearly across the three proficiency levels, with medium to large effect sizes. The subordination and coordination measures functioned better for certain levels than for others: the subordination measures were useful in discriminating the two higher proficiency levels, while a sentential coordination measure, T/S, was more useful in discriminating the two lower levels. The discriminant analyses suggest that adding the diversity measures to the existing elaboration measures increases the power of the set of complexity measures in predicting proficiency levels. The diversity measures were also only weakly correlated with the other 14 elaboration measures (between .04 and .29), which confirms that they tap into a different aspect of complexity than the other measures (Lu, 2011).

4.2 Research question 2: Grammatical complexity and writing quality

The second research question investigated the link between the syntactic complexity measures and the writing quality judged by human raters. Some studies have investigated the relationship between holistic ratings of essays and various complexity indices (e.g., Grant & Ginther, 2000; Taguchi, Crawford, & Wetzel, 2013). Many of these studies used subjective ratings as a way to define proficiency levels. In other words, they placed essays into proficiency levels based on holistic ratings and examined whether complexity indices could function as an index of these levels. More recently, researchers have more directly investigated the relationship between writing quality as perceived by human raters and syntactic complexity indices (Crossley & McNamara, 2014; Guo, Crossley, & McNamara, 2013; Kuiken & Vedder, 2014). They investigated linear relationships between complexity indices and human-judged writing quality. In line with these studies, the current study also investigated the relationship between syntactic complexity indices and writing quality using correlation and regression analyses. For the assessment of writing quality, I used two scores derived from an analytic rating scale, following Crossley and McNamara (2014): the sum of the five subscales of the analytic rubric (i.e., Content, Organization, Vocabulary, Language Use, and Mechanics) as a judgment of overall writing proficiency, and the Language Use score as a “more fine-grained measure of syntactic proficiency” (Crossley & McNamara, 2014, p. 66). The results of the correlation analyses showed that all 16 syntactic complexity measures, including the two diversity measures newly proposed in this study, were significantly and positively correlated with the writing scores given by human raters. The differences in the results for Total and for Language Use scores were slight. The strongest correlations between complexity indices and both writing quality scores were found for a diversity measure, VAC Types, indicating that the use of many different verb-argument structures was related to higher writing quality as judged by humans. The second strongest correlations were found for length-based measures (MLC, MLS, and MLT). As was also reported by Bulté and Housen (2014), longer clausal and sentential units seem to be a sign of higher writing quality.
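The correlation analyses described above amount to computing a Pearson coefficient between each complexity index and each writing score; a minimal sketch is shown below. The numbers and column names are invented solely for illustration and do not reproduce the study’s data.

import pandas as pd
from scipy.stats import pearsonr

# Fabricated values for a handful of essays; not the study's data.
essays = pd.DataFrame({
    "VAC Types":   [14, 18, 22, 25, 30, 33],
    "MLC":         [7.1, 7.8, 8.2, 9.0, 9.6, 10.4],
    "DC/T":        [0.51, 0.55, 0.60, 0.58, 0.66, 0.74],
    "Total score": [61, 66, 72, 75, 83, 88],
})

for measure in ["VAC Types", "MLC", "DC/T"]:
    r, p = pearsonr(essays[measure], essays["Total score"])
    print(f"{measure}: r = {r:.2f}, p = {p:.3f}")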
The correlations between phrasal coordination measures (CP/C, CP/T) and writing quality were moderate, and the correlation between the sentential coordination measure (T/S) and writing quality was found to be weak. The result for T/S is consistent with Bulté and Housen (2014), who found weak and non-significant correlations between writing quality ratings and clausal coordination measures. Relatively low correlations were found for the subordination measures (C/T, CT/T, DC/C, and DC/T), although the relationships were statistically significant. The results for the subordination measures differ from those of Bulté and Housen (2014), who found a strong relationship between writing quality ratings and a clausal subordination measure. They claimed that more subordination was a characteristic of more advanced writing. However, the results of the current study did not provide clear support for their finding. Kuiken and Vedder (2014) reported that C/T and DC/T were not correlated with raters’ judgments of linguistic complexity, which is contrary to Bulté and Housen’s findings. As previous studies have provided mixed results, more investigation of these measures is needed. Bulté and Housen (2012) categorized the number of verb phrases per T-unit (VP/T) as a subordination measure as well, and Crossley and McNamara (2014) also interpreted the production of fewer verb phrases as indicative of fewer embedded clauses. An interesting observation from the current study was that the relationship between VP/T and writing quality was stronger, while very weak correlations were found for the other subordination measures, as described above. The somewhat different results found for these measures may have been due to a discrepancy in how the seemingly equivalent measures were defined. Verb phrases identified by the computational tool used in this study, the L2 Syntactic Complexity Analyzer, include non-finite as well as finite verb phrases (Lu, 2010). Therefore, a greater number of verb phrases per T-unit is not necessarily an indication of more subordination, and a closer investigation is needed of the aspect of syntactic complexity addressed by this measure. Nominal phrasal complexity indices were also found to correlate significantly with writing quality. The number of complex nominals per clause (CN/C) and per T-unit (CN/T) were both positively correlated with Total and Language Use scores. Complex nominals include a number of structures: (1) nouns modified by other elements such as adjectives, appositives, or relative clauses; (2) nominal clauses; and (3) gerunds and infinitives in subject position (Cooper, 1976). The result is broadly in line with Crossley and McNamara’s (2014) and Bulté and Housen’s (2014) findings, although the indices examined in this study and in their studies were not identical. Instead of counting the number of complex nominals, they examined a few other comparable indices. Crossley and McNamara examined three indices: the average number of modifiers per noun phrase, the mean number of words before the main verb, and the incidence of subject relative clauses. They found a significant positive correlation only for the average number of modifiers per noun phrase; the use of more modifiers in noun phrases was an indication of more proficient writing. Bulté and Housen found a significant correlation between the mean length of noun phrases and writing quality scores, indicating that longer noun phrases are related to higher writing quality.
The present study found that more use of complex nominals was related to higher writing quality. Regression analyses investigated whether a number of selected complexity indices (MLC, DC/T, CP/T, T/S, CN/T, VAC Types, and VAC CTTR) would be predictive of writing quality as judged by human raters. The results provided evidence that these measures could significantly predict human judgments of writing quality, and together they accounted for a good proportion of the variance in perceived writing quality. Five measures (MLC, CN/T, T/S, VAC Types, and VAC CTTR) were found to be significant contributors to Total scores, accounting for 59% of the variance in the scores. The same measures, with the exception of T/S, were also significant predictors of Language Use scores, accounting for 51% of the variance. The strongest predictor for both Total and Language Use scores was the number of VAC Types: essays containing more diverse VAC types were rated higher. The other index of diversity, VAC CTTR, however, was found to be negatively associated with Total scores once the linear effects of the other variables in the model had been removed. This divergence could have resulted from the relationship between VAC CTTR and another predictor variable, VAC Types, given that they are highly correlated (r = .77). In other words, the effect of VAC CTTR may have been subsumed by VAC Types. When the number of VAC Types was held constant, perceived writing quality decreased as the corrected type-token ratio increased; because a higher ratio with types held constant simply reflects fewer tokens, producing fewer verb-argument construction tokens was a sign of lower writing quality, and the measure no longer functioned as a measure of diversity. This result suggests that only one of the two measures should be entered into multivariate analyses. The next strongest predictor was MLC. Similar to the finding in Bulté and Housen’s (2014) study, the results demonstrated that MLC had a large effect on essay scores; longer clauses were viewed by human raters as an indicator of better writing. The third strongest predictor was CN/T, indicating that more frequent use of complex nominals was a sign of higher writing quality. The sentential coordination measure, T/S, was the weakest significant predictor of Total scores, and it was not a significant predictor of Language Use scores. The phrasal coordination measure (CP/T) and the subordination measure (DC/T) were also not significant predictors of writing quality. Overall, coordination and subordination measures contributed little to human raters’ perceptions of writing quality. This result contradicts Bulté and Housen’s (2014) finding that the proportion of simple sentences and the subclause ratio are significant predictors of writing quality. To combine the findings for Research Questions 1 and 2, the present study demonstrates a similar pattern across the two analyses: the syntactic complexity measures that indicated proficiency largely overlapped with those that predicted human-rated writing quality. First, length-of-unit measures (MLC, MLS, and MLT) were strong predictors in both analyses. These measures discriminated all three proficiency levels, with large effect sizes, and were significantly correlated with both sets of writing scores given by human raters; MLC was a strong and significant predictor of writing quality. Second, subordination measures (C/T, CT/T, DC/C, and DC/T) were strong predictors of neither L2 proficiency nor perceived writing quality.
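As a minimal sketch of the regression analyses reported above, and of why two highly correlated predictors such as VAC Types and VAC CTTR (r = .77) can produce the sign reversal just described, the example below fits an ordinary least squares model to simulated predictors and checks variance inflation factors. The data, the choice of statsmodels, and the variable names are assumptions for illustration only, not the study’s actual analysis.

import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(1)
n = 390
vac_types = rng.normal(size=n)
vac_cttr = 0.77 * vac_types + np.sqrt(1 - 0.77 ** 2) * rng.normal(size=n)   # correlates ~.77 with vac_types
mlc = rng.normal(size=n)
score = 0.6 * vac_types + 0.3 * mlc + rng.normal(scale=0.7, size=n)         # simulated Total score

X = pd.DataFrame({"VAC_Types": vac_types, "VAC_CTTR": vac_cttr, "MLC": mlc})
model = sm.OLS(score, sm.add_constant(X)).fit()
print(round(model.rsquared, 2))      # proportion of variance explained
print(model.params.round(2))         # collinearity can flip the weaker predictor's sign

# Variance inflation factors flag the overlap between VAC_Types and VAC_CTTR.
vifs = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
print(dict(zip(X.columns, np.round(vifs, 2))))

Whether the second diversity index adds anything once the first is in the model is precisely the question that the suppression result above raises.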
The subordination measures discriminated the high-intermediate and advanced levels significantly, but with small effect sizes, and they showed low correlations with writing quality ratings. The results for coordination measures showed slightly different patterns. The sentential coordination measure (T/S) discriminated the two lower proficiency levels only, and the correlation between this measure and writing quality was weak. The regression analysis also demonstrated that the measure was not a significant predictor of perceived syntactic proficiency (assessed by Language Use scores); however, it was a significant predictor of overall writing quality (Total scores). Phrasal coordination measures (CP/C and CP/T) distinguished all the levels, with small to medium effect sizes, and were moderately correlated with both writing quality ratings. However, CP/T was not found to be a significant predictor of writing quality. Next, both the number of verb phrases per T-unit (VP/T) and the number of complex nominals per T-unit (CN/T) were strong discriminators among the three proficiency levels and were moderately correlated with writing quality ratings. The regression analysis confirmed that CN/T was a significant predictor of writing quality. Finally, the two measures of syntactic diversity proposed in this study, the number of verb-argument construction types (VAC Types) and the corrected type-token ratio of constructions (VAC CTTR), were strong discriminators of proficiency and were also strongly correlated with writing quality ratings. The number of VAC Types was the strongest predictor of both proficiency level and perceived writing quality. To summarize, the analyses demonstrate that the syntactic complexity measures that showed significant increases across proficiency levels generally coincide with the measures that are positively correlated with writing quality ratings. The results suggest that syntactic diversity measures, length-of-unit measures, and nominal phrasal indices are good indicators both of language proficiency and of writing quality as perceived by human raters. However, the results call into question the use of subordination or coordination measures as predictors in either case.

4.3 Research question 3: Human raters’ perceptions of the Language Use section of an analytic rubric

The third research question asked how human raters interpreted the syntactic (grammatical) complexity descriptors that appear on the Language Use scale of an analytic rating rubric. To answer this question, I conducted an individual interview with each rater about their rating process and their interpretation of the descriptors in the rubric. The interviews started with a general question about the raters’ overall rating process. In general, raters read essays first, referred to the rubric to assign scores, and revisited the essays to finalize the scores. This process is consistent with the three-stage model of the rating process described by Lumley (2002), which involves first reading (pre-scoring), then rating for each scoring category, and finally contemplating the given scores. Previous research has found variability in rating sequence and in the allocation of attentional focus (Barkaoui, 2010; Cumming, Kantor, & Powers, 2002; Lumley, 2002). For example, Cumming, Kantor, and Powers (2002) found that ESL/EFL raters attended to language-related features more extensively than to rhetoric, while native English-speaking raters allotted relatively balanced attention to all main features during holistic rating.
Lumley (2005) reported that not all raters rated categories in an orderly way, while Winke and Lim (2015) found that all the raters 106 assessed categories as arranged in the rubric. Winke and Lim (2015) linked the rating order and the amount of attention given to particular categories to measure of inter-rater reliability. They found that raters paid more attention to the categories that were located in the left part of the rubric, assessing these first, and the authors interpreted the trend as a primacy effect. They also showed how the least attended to category also had the lowest inter-rater reliability (Mechanics). But the results of the current study showed that the order of rating subscales varied depending on the rater, corroborating Lumley’s (2005) finding. Although about half of the raters reported that they scored through the rubric from left to right, as in the study by Winke and Lim, several raters reported that they scored a particular section first. One reported that he rated backward on the rubric, starting with Mechanics because it was the most problematic category for him. Another rater reported that the Language Use section was the category that he assessed first, and a third rater described that he started scoring a couple of categories that stood out to him while reading a given essay, resulting in a varied order from essay to essay. The order of rating may or may not reflect the raters’ imbalanced attention to different subscales of the rubric. However, the interview data at least did not provide clear evidence that raters weighted any particular category more importantly than others. The raters attended to each subscale of the rubric, and no instance of skipping any subscale was reported. Any indication of a holistic type of rating as was discovered by Knoch (2009) was also not found. All the raters appeared to indicate that they scored each category independently. However, these were self-reported data, and eye-tracking methods, such as those used by Winke and Lim (2015), could verify these self-reported data. In order to understand raters’ perceptions of grammatical complexity manifested in the rubric and its application in scoring, more specific questions were asked on the Language Use 107 category of the rubric. Raters were asked to describe their scoring procedure for the category and to report their interpretations of the descriptors. While describing the rating process, raters noted a number of problems with scoring based on the criteria in the rubric. First, balancing between accuracy and complexity was viewed as a difficult task. The Language Use scale considers both accuracy and complexity components of language structures used in texts, and how much to credit incorrect uses of complex structures was up to the raters’ discretion. Second, difficulties due to overlaps between the Language Use scale and other categories of the rubric were also reported. Fluency of writing, which is commonly rated for the Content category, also affected the rating of Language Use, and separating Language Use from Vocabulary Use was often considered difficult. Finally, the vagueness of descriptors was one of the most commonly mentioned problems, as was also noted by Knoch (2009). Raters felt that clear definitions of terms were lacking and left up to their subjective interpretations. Consequently, operationalization of some major terms such as ‘complex structures’ and ‘sentence variety’ varied from rater to rater. In addition, some descriptors were not interpreted as intended. 
Most of the raters did not note the distinction between ‘complex sentences’ and ‘complex structures’. Complex structures were often understood to be the same as complex sentences, which are defined as sentences with multiple clauses. Some raters interpreted complex structures as difficult structures, although ‘difficulty’ was not a criterion explicitly manifested in the rubric. In addition, the raters’ teaching experience influenced them when determining difficult structures, which also contributes to the variability among raters. The notion of grammatical ability in the Language Use category of the rubric used in this study was captured using four separate criteria: complex sentences, complex structures, 108 morphology, and sentence variety. The raters interpreted these descriptors as follows. First, the raters understood complex sentences as multi-clausal coordinated or subordinated sentences. Linguistic features that were mentioned to exemplify complex structures were complex nominals, passive voice, perfect aspect, transition words, and complex clause patterns. Many of these structures coincided with syntactic complexity indices that have been popularly used in SLA research, such as the number of coordinated phrases or dependent clauses per T-unit, the mean length of noun phrases, or the number of complex nominals. Other examples such as incidences of passive voice and various transition words were reported in Wolfe-Quintero et al.’s (1998) research synthesis, and the distribution of various verb tenses and aspects has been used in some recent studies as well (e.g., Verspoor, Schmid, & Xu; 2012). Next, morphology was understood by most of the raters to indicate word endings. The criterion is intended to be evaluated in terms of accuracy rather than complexity according to the rubric. In other words, the descriptors refer to the number of morphological errors rather than the complexity of morphology used. However, morphological ability seemed to be often disregarded by raters. The criterion was not mentioned as often as other criteria such as complex structures or sentence variety while describing the rating process for the Language Use category. From raters’ perspectives, morphology had less room to vary compared with other aspects of grammatical complexity, especially when accuracy of use was considered jointly. Morphology-related errors do not often interfere with understanding, thus raters tended to pay less attention to these features. Lastly, sentence variety was interpreted as the use of diverse types of sentence and clause patterns. How the sentence and clause patterns were interpreted was different from rater to rater. The use of simple, compound, and complex sentences, and clause patterns beyond a simple ‘subject-verb-object’ pattern were mentioned to exemplify sentence variety. 109 To summarize, overall, human raters’ interpretation of grammatical complexity corresponded to the notion of syntactic complexity common in SLA. Most of the grammatical structures to which raters attended in order to evaluate grammatical complexity coincided with quantitative indices of syntactic complexity used in SLA research. Sentences with multiple clauses via coordination or subordination were the linguistic features most commonly mentioned as exemplifying complex structures. Noun phrases modified by relative clauses or other pre-/ post-modifications were also considered to be an indication of grammatically complex writing. 
However, the raters did not have identical interpretations of the construct of grammatical complexity. Several major terms in descriptors were abstract and simple, and no further explanations were provided. Consequently, many raters felt that they were not given enough information for scoring from the rubric. They said they needed more specific instructions in order to assign varied scores. To combine the results for Research Questions 2 and 3, the interview data seem to provide an explanation for the significant correlations between many syntactic complexity measures and writing quality scores reported in the previous section. It is interesting to note, however, that discrepancies were also found between the linguistic features that raters recognized as exemplifications of complex structures and syntactic complexity measures that significantly predicted writing quality. For example, none of the raters directly mentioned clause length in illustrating complex structures, while MLC was a strong predictor of writing quality scores. There are various elements that can lengthen a clause, such as noun phrase modifiers (adjective or prepositional phrases), nonfinite clauses, and adverbial phrases (Norris & Ortega, 2009). Although some raters mentioned noun phrase modifiers, other phrasal modifications were not reported in this study. In addition, although subordinated sentences were the structures that 110 all raters found complex and regarded as features of advanced writing, subordination measures were not found to be strong predictors of writing quality. Perhaps features that the raters think characterize high-rated essays are not necessarily considered during the rating process. It is also possible that raters could not verbalize all the features they attended to during the rating process. As one rater mentioned, it is possible that raters do not clearly picture what complex structures mean until they find some while reading students’ essays. Further research is needed for more direct investigation into raters’ cognitive rating process for the language used in L2 learners’ essays. 111 CHAPTER 5: CONCLUSION In this concluding chapter, I first summarize my research findings in light of the purposes and research questions stated in the beginning of the dissertation. Next, research and practical implications are presented. The chapter concludes with a discussion of the limitations of the current study and suggestions for future research. 5.1 Summary of findings As outlined in the Chapter 1, this study was designed to accomplish three major purposes. First, the study aimed to determine how syntactic or grammatical complexity has been operationalized in the fields of SLA and L2 assessment and to review the indices of complexity. Second language syntactic complexity in SLA research has often been defined as the degree of sophistication and the range of forms in learner language. Similar to the definition of syntactic complexity in SLA research, in the area of L2 assessment, grammatical performance has been assessed in terms of the number of different structures and their degree of complexity. Through the review of the literature, I found that SLA researchers have noted the multidimensional nature of the construct and developed and used measures that tap into various facets of syntactic complexity. I also found, however, that not all dimensions of the construct have been attributed the same level of importance in research. 
For example, syntactic complexity at the clausal and phrasal level, and the diversity aspect of complexity have been relatively under-researched. In contrast, both the sophistication and the range of linguistic structures used were main criteria in the assessment of L2 performance. Noting the gap in the research, the second purpose of this study was to propose measures that address the diversity dimension of syntactic complexity. I proposed to focus on the diverse use of verb-argument constructions. The choice was motivated by the fact that previous studies 112 in linguistics have found verb-argument constructions to be suitable for evaluating the development of language proficiency. I opted for the number of different verb-argument structure types used and their corrected type-token ratio. Low correlations between the diversity measures and the existing measures of syntactic sophistication demonstrated that the proposed measures tap into an independent trait of complexity. I also investigated whether the proposed measures, together with traditional complexity measures that have been used in SLA, can be indicative of L2 writing proficiency and writing quality as judged by human raters. The results presented in Chapter 3 showed that most complexity measures worked well as an indicator of proficiency, including the two proposed measures. The complexity measures were also found to be highly correlated with human-rated writing quality. In general, the diversity measures, length of production units, and number of complex noun phrases were better predictors than subordination or coordination measures. The results also showed that adding the diversity measures to the existing elaboration measures increased the predictive power for L2 proficiency and that the number of types of VACs was the strongest predictor of human-rated writing quality. The results lend support to the use of the diversity measures in this area of research. I also found that notions of grammatical complexity as interpreted by raters overlap with the notion of syntactic complexity in SLA. The data obtained from the rater interviews showed that the raters’ conceptualizations of complex structures were comparable to some complexity measures used in SLA, which also explains the high correlations between the measures and the human-rated writing scores. However, variability was found in the interpretations between raters. Another interesting finding was that some features commonly perceived by raters as 113 characteristic of complex language were not actually found to be strong predictors of writing quality, and some significant predictors were not explicitly recognized by raters. 5.2 Implications 5.2.1 Research implications The present study has several implications that can inform the fields of SLA, L2 writing, and L2 assessment. First and foremost, the study adds to the literature by proposing measures that capture relatively under-researched aspect of syntactic complexity. The proposed measures are theoretically motivated, and the results of the present study confirm their strong predictive power for L2 proficiency and writing quality as evaluated by human raters. Second, the study provides empirical support for the usefulness of syntactic complexity measures that have been traditionally used in SLA in predicting second language proficiency and second language writing quality. 
Previous literature has reported conflicting findings on the validity of the measures and pointed out the difficulty of comparing their reliability (e.g., Lu, 2011). The current study overcomes these problems by concurrently examining multiple measures using a large data set. Next, the study also fills a research gap with regard to the link between the understanding of syntactic or grammatical complexity in SLA research and L2 writing assessment practices. The investigation into the relationship between complexity indices used in the field of SLA and writing quality as perceived by human raters offers some insights regarding the link. The data obtained from rater interviews also furthers our understanding of the assessment of grammatical complexity in L2 writing. Finally, the present study investigated writing samples collected from English learners in a non-English speaking country, South Korea. Research on the development of L2 writing has been prevalent in second language contexts (Byrnes, Maxim and Norris, 2010), 114 and the present study contributes to the understanding of L2 writing development in the instructed FL context. 5.2.2 Practical implications The results of this study offer several implications that are relevant to practices in L2 assessment. First, the results regarding the relationship between syntactic complexity measures and L2 writing proficiency inform the L2 assessment field by providing insights on what features need to be considered in assessing grammatical ability. Second, the present study reveals some gaps between the factors that affect human judgements of writing quality and human raters’ perception of them. The discrepancy requires further investigation, and the results should be reflected in rating scale development/revision and rater training. Next, as described above, this study exposed raters’ difficulties in interpreting abstract and vague descriptors in the Language Use section of the analytic rating scale and brought attention to the variability that exists among the raters’ operationalizations of major criteria. As Lumley (2002) pointed out, different reactions to a rating scale may result in problems with consistent measurement and interpretation of scores. Efforts to improve raters’ common understanding of descriptors are invited. Providing more concrete wording or further illustrations of descriptors would help solve the problem. Rater training would also enhance shared understanding of the rating criteria suitable for a given rating context. The present study also sheds lights on practices in second language writing pedagogy. The findings regarding the factors that predict L2 proficiency and perceived writing quality can be reflected in syllabus and teaching material development and in teacher education. 115 5.3 Limitations and future research The present study has a number of limitations that need to be considered in further research. First, the proficiency test developed and used in this study requires further examination. The use of an independent measure of proficiency was an improvement compared to the previous studies that used school or program levels in that it enables replication and comparisons among studies. However, the test used in the study did not directly assess L2 writing proficiency. Although a positive relationship between the writing scores and the proficiency test scores was confirmed from the data of the current study, more investigation into the validity of the test is recommended. 
Second, the present study investigated essays collected from learners with the same first language (L1) background. Although the use of a homogeneous group has the advantage of controlling for variation due to the learners’ L1, further research is needed to generalize the findings of the present study to other L1 groups. Third, the way raters’ perceptions of the rating rubric were investigated bares some methodological limitations. As many researchers have stated, verbal reports cannot completely reveal a person’s cognitive processes. Features that they implicitly and intuitively attended to may not have been successfully retrieved. In addition, the features mentioned by the raters may not have actually been attended to while rating as the interviews were not concurrently conducted. More data needs to be cumulated on this issue via various data collection methods, such as eye-tracking methods (as done by Winke & Lim, 2015), think-aloud protocols, stimulated-recalls, and questionnaires, and with a larger sample. Finally, the measures of syntactic diversity proposed in this study entail labor-intensive manual analysis, which may impede the practical use of the measures. The development of an 116 automated analytic tool that can extract incidences of verb-argument constructions would enhance the efficiency and reliability of the analysis. Another possible direction for future research would be to identify early and later developed verb-argument structures and use them as benchmarks for different levels of L2 proficiency. For example, one could investigate the emergence of the construction relative to proficiency level employing the implicational scaling technique. The identified benchmark structures could be used in language pedagogy and assessment. 117 APPENDICES 118 Appendix A English Proficiency Test (C-test) Fill in one word in each blank. You may write directly on the test. Complete the texts in order (TEXT 1 TEXT 2 TEXT 3). 예) The girl was walking (0) d____________ the street when she stepped on some ice and fell. Answer: down TEXT 1 Steven loved almost everything about his grandma. There was only one thing he hated. She always knitted sweaters for (1) h____________. Steven understood that she did it to be (2) n____________. However, all the sweaters were very ugly. Steven (3) v____________ her once a week. She had a new (4) s____________ for him each time. Steven lived in a (5) s____________ apartment. There was no room for him to (6) k____________ all the sweaters. He had to give all of them (7) a____________. “Grandma will never find out,” he thought. One (8) d____________, Steven’s grandma visited him by surprise. She asked to (9) s____________ his sweaters. “Someone stole all of them!” he (10) s____________. “They were too nice.” She (11) m____________ him ten more by the next month. TEXT 2 Depression is a serious but treatable disorder that affects millions of people, from young to old and from rich to poor. It gets in the way of everyday (12) l____________, causing tremendous pain, hurting not just those suffering (13) f____________ it, but also impacting everyone around them. If (14) s____________ you love is depressed, you may be (15) e____________ any number of difficult emotions, including helplessness, frustration, (16) a____________, fear, guilt, and sadness. These feelings are all (17) n____________. It’s not easy dealing with a friend or (18) f____________ member’s depression. And if you don’t take care of (19) y____________, it can become overwhelming. 
That said, there are (20) s____________ you can take to help your loved one. Start by learning about depression and how to talk (21) a____________ it with your friend or family member. But as you reach out, don’t forget to (22) l____________ after your own emotional (23) h____________. Thinking about your own needs is not an (24) a____________ of selfishness—it’s a necessity. Your emotional strength will (25) a____________ you to provide the ongoing support your depressed friend or family member needs. 119 TEXT 3 Nonverbal communication includes facial expressions, gestures, the distance between speakers, eye contact, voice intonations, touch, and many other minor details which can provide speakers with valuable details about each other. For example, (26) s____________ between people can say a lot about the level of intimacy between them: usually, the (27) s____________ the distance between speakers, the more friendly or (28) i____________ they are, and vice versa. Or if a person (29) a____________ eye contact, it might mean that he or she is hiding something, feels (30) u____________ around you, and so on. Body (31) l____________ has several important functions. For instance, a person’s (32) g____________ can repeat the message he or she is (33) m____________ orally; a little child explaining how birds (34) f____________ and waving his or her arms like (35) w____________ is a decent example of this function. Another function, substitution, occurs when (36) v____________ messages can be expressed by nonverbal means (like shrugging). accenting, like when (38) (37) I____________ addition, gestures can be used for r____________ one’s index finger when speaking about (39) s____________ important. At the same time, it is important to remember that sometimes body language may (40) d____________ depending on culture. For example, in some eastern countries, (41) l____________ straight in the eyes of a conversationalist is considered (42) r____________. Men in some Arabic countries may walk around the street (43) h____________ hands, or may kiss each other on the (44) c____________ when greeting, but this is the (45) i____________ of friendship, not romance or intimacy. 120 Appendix B Table 24 C-test: Item Facilities and Item Discriminations Item IF IF (Upper) IF (Lower) ID 1 0.82 0.95 0.63 0.32 2 0.42 0.78 0.07 0.70 3 0.58 0.91 0.23 0.67 4 0.80 0.96 0.52 0.44 5 0.67 0.89 0.37 0.52 6 0.60 0.85 0.24 0.60 7 0.26 0.65 0.01 0.64 8 0.86 1.00 0.57 0.43 9 0.21 0.37 0.01 0.36 10 0.53 0.63 0.27 0.37 11 0.55 0.94 0.09 0.85 12 0.42 0.82 0.02 0.80 13 0.47 0.78 0.06 0.71 14 0.57 0.95 0.02 0.93 15 0.03 0.13 0.00 0.13 16 0.19 0.47 0.03 0.44 17 0.13 0.24 0.02 0.22 18 0.70 0.97 0.13 0.84 19 0.45 0.76 0.06 0.69 20 0.08 0.24 0.00 0.24 21 0.49 0.88 0.08 0.80 22 0.24 0.64 0.00 0.64 23 0.12 0.29 0.01 0.28 24 0.04 0.15 0.00 0.15 25 0.13 0.33 0.00 0.33 26 0.12 0.39 0.00 0.39 121 Table 24 (cont’d) Item IF IF (Upper) IF (Lower) ID 27 0.27 0.65 0.00 0.65 28 0.35 0.85 0.00 0.85 29 0.17 0.48 0.00 0.48 30 0.33 0.71 0.01 0.70 31 0.62 1.00 0.02 0.98 32 0.54 0.92 0.03 0.89 33 0.07 0.23 0.00 0.23 34 0.36 0.83 0.01 0.82 35 0.24 0.56 0.00 0.56 36 0.38 0.77 0.00 0.77 37 0.57 0.97 0.07 0.90 38 0.10 0.34 0.00 0.34 39 0.46 0.94 0.02 0.92 40 0.18 0.56 0.00 0.56 41 0.35 0.87 0.01 0.86 42 0.39 0.91 0.00 0.91 43 0.17 0.50 0.00 0.50 44 0.27 0.72 0.01 0.71 45 0.06 0.19 0.00 0.19 122 Appendix C Language Learning Background Questionnaire (for college students) The following are questions about your English learning experiences. 
Read each item carefully, and place a check (√) mark next to the appropriate answer, or fill out with a brief answer. 1. Gender: □Male □Female 2. Year of Study: □Freshman □Sophomore □Junior □Senior 3. Major:_______________________ 5. I studied/am studying English from ___ to ___ year old. 6. Indicate the number that best represents your English proficiency.      Overall English: Reading: Writing: Speaking: Listening: (1: Minimal, 2: Basic, 3: Good , 4: Very good, 5: Excellent) □1 □2 □3 □4 □5 □1 □2 □3 □4 □5 □1 □2 □3 □4 □5 □1 □2 □3 □4 □5 □1 □2 □3 □4 □5 7. Have you ever lived in an English-speaking country (for example, USA, UK, Canada, Australia, Philippines, Singapore, Hong Kong)? □Yes □No If yes: Age Example: years old Country US Length of Residence 1year and 2 months 8. Have you taken a standardized English test (for example, TOEFL, TOEIC, TEPS, IELTS) □Yes □ No  Test: _____________________  Approximate date: Year_________ Month ________  Score: _________ 123 Appendix D Language Learning Background Questionnaire (for high school students) The following are questions about your English learning experiences. Read each item carefully, and place a check (√) mark next to the appropriate answer, or fill out with a brief answer. 1. Gender: □Male □Female □ First year 2. Year of Study: □ Second year □ Third year 3. I studied/am studying English from ______ to ______ year old. 4. Indicate the number that best represents your English proficiency.      Overall English: Reading: Writing: Speaking: Listening: (1: Minimal, 2: Basic, 3: Good , 4: Very good, 5: Excellent) □1 □2 □3 □4 □5 □1 □2 □3 □4 □5 □1 □2 □3 □4 □5 □1 □2 □3 □4 □5 □1 □2 □3 □4 □5 6. Have you ever lived in an English-speaking country (for example, USA, UK, Canada, Australia, Philippines, Singapore, Hong Kong)? □Yes □No If yes: Age Example: 9-10 years old Country US Length of Residence 1year and 2 months 7. Have you taken a standardized English test (for example, TOEFL, TOEIC, TEPS, IELTS) □Yes □ No  Test: _____________________  Approximate date: Year_________ Month ________  Score: _________ 124 Appendix E Language Learning Background Questionnaire in Korean (for college students) 언어 학습 배경 설문지 (대학생용) 다음은 본인의 영어 학습 경험에 대한 질문입니다. 각 문항을 잘 읽어보신 후, 그 문항에 알맞은 답에 체크(√) 표시를 하거나 답을 간단하게 서술하여 주십시오. 1. 성별: □남 □여 2. 학년: □1 □2 □3 □4 3. 전공:_______________________ 4. 나는 영어 학습을 ______살부터 ______살 까지 했다/하고 있다. 5. 자신의 영어 능력을 가장 잘 설명하는 숫자를 골라 표기하여 주십시오. (1: 최소한 , 2: 기초적, 3: 준수한 , 4: 우수한, 5: 탁월한)  전반적 영어능력 : □1 □2 □3 □4 □5  읽기 능력: □1 □2 □3 □4 □5  쓰기 능력: □1 □2 □3 □4 □5  말하기 능력: □1 □2 □3 □4 □5  듣기 능력: □1 □2 □3 □4 □5 6. 영어권 국가 (예시: 미국, 영국, 호주, 필리핀, 싱가폴, 홍콩) 거주 경험은? □있다 있다면: 도착 나이 국가 거주기간 Example: 13 세 미국 1 년 2 개월 7. 토플/토익/텝스/IELTS 등 영어 시험을 보신 적이 있습니까?  시험 이름: _____________________  대략적인 시험 날짜: ____________년 ________월  점수: _________ 125 □예 □없다 □ 아니오 Appendix F Language Learning Background Questionnaire in Korean (for high school students) 언어 학습 배경 설문지 (고등학생용) 다음은 본인의 영어 학습 경험에 대한 질문입니다. 각 문항을 잘 읽어보신 후, 그 문항에 알맞은 답에 체크(√) 표시를 하거나 간단하게 서술하여 주십시오. 1. 성별: □남 □여 2. 학년: □1 □2 □3 3. 나는 영어 학습을 _____살부터 _____살 까지 했다/하고 있다. 4. 자신의 영어 능력을 가장 잘 설명하는 숫자를 골라 표기하여 주십시오. (1: 최소한, 2: 기초적, 3: 준수한 , 4: 우수한, 5: 탁월한)  전반적 영어능력 : □1 □2 □3 □4 □5  읽기 능력: □1 □2 □3 □4 □5  쓰기 능력: □1 □2 □3 □4 □5  말하기 능력: □1 □2 □3 □4 □5  듣기 능력: □1 □2 □3 □4 □5 5. 영어권 국가 (예시: 미국, 영국, 호주, 필리핀, 싱가폴, 홍콩) 거주 경험은? □있다 있다면: 도착 나이 국가 거주기간 Example: 13 세 미국 1 년 2 개월 6. 토플/토익/텝스/IELTS 등 영어 시험을 보신 적이 있습니까? 
 시험 이름: _____________________  대략적인 시험 날짜: ____________년 ________월  점수: _________ 126 □예 □없다 □ 아니오 Appendix G Rater Background Questionnaire PLEASE FILL OUT THE FOLLOWING BACKGROUND INFORMATION. PLEASE PRINT CLEARLY. 1. Name: 2. Age: _____ 3. Native language: _____ 4. Language you speak at home: ________________ 5. Are you now or have you ever been an English as a Second or Foreign Language (ESL/EFL) teacher? Yes a. First name: b. Last name: _______________________________________ _______________________________________ No a. If yes, for how long (total)? 1 year or less 2-5 years 5-10 years More than 10 years b. If yes, what state(s) (US) or country (countries) did you teach in? a ._____________________ How long did you teach there?_____________________ b. _____________________ How long did you teach there?_____________________ c. _____________________ How long did you teach there?_____________________ 6. 7. Do you have previous experience rating ESL/EFL compositions? Yes No a. If yes, could you briefly describe your experience? ___________________________________________________________________ ___________________________________________________________________ ___________________________________________________________________ How do you describe your abilities to evaluate ESL/EFL compositions?  Novice  Competent  Excellent 127 8. What languages, other than English, do you speak or have you studied or are currently studying? Please report and answer questions for each language other than English that you speak or have studied or are currently studying. HOW DID YOU LEARN LANGUAGE From what age HOW WELL DO YOU THE LANGUAGE? A. to what age did SPEAK THE (Please describe.) you learn the LANGUAGE? (Please language? circle one) ___ to ___ poor / fair / good / advanced/ fluent / native-like Comments: LANGUAGE B. HOW DID YOU LEARN THE LANGUAGE? (Please describe.) From what age to what age did you learn the language? HOW WELL DO YOU SPEAK THE LANGUAGE? (Please circle one) ___ to ___ poor / fair / good / advanced/ fluent / native-like Comments: 9. Have you lived in or traveled to a place where people speak the languages you speak or have studied or are currently studying (the ones listed in #9)? Yes No If yes, please report and answer questions for each place you have lived or visited and where the language(s) (#10) were spoken. Where did you travel or live? a. ____________ Where did you travel or live? a. ____________ For how long were you there? How old were you when you were there? What was the purpose of your visit or stay? For how long were you there? How old were you when you were there? What was the purpose of your visit or stay? 128 Appendix H Table 25 Rating rubric 129 REFERENCES 130 REFERENCES Ai, H., & Lu, X. (2013). A corpus-based comparison of syntactic complexity in NNS and NS university students’ writing. In A. Díaz-Negrillo, N. Ballier, & P. Thompson (Eds.), Automatic treatment and analysis of learner corpus data (pp. 249-264). Amsterdam/ Philadelphia: John Benjamins. Alishahi, A., & Stevenson, S. (2008). A computational model of early argument structure acquisition. Cognitive science, 32(5), 789-834. Anthony, L. (2014). AntConc (Version 3.4.3) [Computer Software]. Tokyo, Japan: Waseda University. Available from http://www.laurenceanthony.net/ Anthony, L. (2014). TagAnt (Version 1.1.2) [Computer Software]. Tokyo, Japan: Waseda University. Available from http://www.laurenceanthony.net/ Asención-Delaney, Y., & Collentine, J. (2011). 
A multidimensional analysis of a written L2 Spanish corpus. Applied Linguistics, 32(3), 299-322. Babaii, E., & Ansary, H. (2001). The C-test: a valid operationalization of reduced redundancy principle?. System, 29(2), 209-219. Bachman, L. F. (1985). Performance on cloze tests with fixed-ratio and rational deletions. TESOL Quarterly, 19(3), 535-556. Bachman, L. F. (1990). Fundamental considerations in language testing. Oxford University Press. Baralt, M. (2012). Coding qualitative data. In A. Mackey & S. Gass (Eds.), Research methods in second language acquisition (pp. 222-244). Malden, MA: Wiley-Blackwell. Bardovi-Harlig, K. (1992). A second look at T-unit analysis: Reconsidering the sentence. TESOL Quarterly, 26, 390-395. Barkaoui, K. (2010). Variability in ESL essay rating processes: The role of the rating scale and rater experience. Language Assessment Quarterly, 7(1), 54-74. Becini, G. M. L. & Goldberg, A. (2000). The contribution of argument structure constructions to sentence meaning. Journal of Memory and Language 43, 640-651. Benevento, C., & Storch, N. (2011). Investigating writing development in secondary school learners of French. Assessing Writing, 16(2), 97-110. 131 Biber, D., Gray, B., & Poonpon, K. (2011). Should we use characteristics of conversation to measure grammatical complexity in L2 writing development? TESOL Quarterly, 45(1), 5-35. Biber, D., Johansson, S., Leech, G., Conrad, S., Finegan, E., & Quirk, R. (1999). Longman grammar of spoken and written English (Vol. 2). MIT Press. Brown, J. D. (1980). Relative merits of four methods for scoring cloze tests.The Modern Language Journal, 64(3), 311-317. Bulté, B., & Housen, A. (2012). Defining and operationalising L2 complexity. In A. Housen, F. Kuiken, & I. Vedder (Eds.), Dimensions of L2 performance and proficiency: Complexity, accuracy and fluency in SLA (pp. 21-46). John Benjamins Publishing. Bulté, B., & Housen, A. (2014). Conceptualizing and measuring short-term changes in L2 writing complexity. Journal of Second Language Writing, 26, 42-65. Byrnes, H., Maxim, H. H., & Norris, J. M. (2010). Introduction. The Modern Language Journal, 94(s1), 1-202. Canale, M. & Swain, M. (1980). Theoretical bases of communicative approaches to second language teaching and testing. Applied Linguistics 1(1), 9-47. Carroll, J. B. (1964). Language and Thought. Englewood Cliffs, NJ: Prentice-Hall . Casanave, C. P. (1994). Language development in students' journals. Journal of second language writing, 3(3), 179-201. Chapelle, C. A., & Duff, P. A. (2003). Some guidelines for conducting quantitative and qualitative research in TESOL. TESOL quarterly, 37(1), 157-178. Cohen, J. (1992). A power primer. Psychological bulletin, 112(1), 155. Connor-Linton, J., & Polio, C. (2014). Comparing perspectives on L2 writing: Multiple analyses of a common corpus. Journal of Second Language Writing, 26, 1-9. Cooper, T. C. (1976). Measuring written syntactic patterns of second language learners of German. The Journal of Educational Research, 69(5), 176-183. Crossley, S. A., & McNamara, D. S. (2009). Computational assessment of lexical differences in L1 and L2 writing. Journal of Second Language Writing,18(2), 119-135.Carr 2011 Crossley, S. A., & McNamara, D. S. (2011). Shared features of L2 writing: Intergroup homogeneity and text classification. Journal of Second Language Writing, 20, 271285. 132 Crossley, S. A., & McNamara, D. S. (2014). Does writing development equal writing quality? A computational investigation of syntactic complexity in L2 learners. 
Journal of Second Language Writing, 26, 66-79. Cumming, A., Kantor, R., & Powers, D. E. (2002). Decision making while rating ESL/EFL writing tasks: A descriptive framework. The Modern Language Journal, 86(1), 67-96. Danzak, R. L. (2011). The integration of lexical, syntactic, and discourse features in bilingual adolescents' writing: An exploratory approach. Language, Speech, and Hearing Services in Schools, 42(4), 491-505. Ellis, N. & Ferreira-Junior, F. (2009a). Constructions and their acquisition. Annual Review of Cognitive Linguistics, 7, 187-220. Ellis, N. & Ferreira-Junior, F. (2009b). Construction learning as a function of frequency, frequency distribution and function, Modern Language Journal, 93(3), 370-385. Ellis, R. (2003). Task-based language learning and teaching. Oxford: Oxford University Press. Ellis, R., & Yuan, F. (2004). The effects of planning on fluency, complexity, and accuracy in second language narrative writing. Studies in second Language acquisition, 26(1), 5984. Foster, P., & Skehan, P. (1996). The influence of planning and task type on second language performance. Studies in Second language acquisition, 18(3), 299-323. Frear, M. W., & Bitchener, J. (2015). The effects of cognitive task complexity on writing complexity. Journal of Second Language Writing, 30, 45-57. Friedman, D. A. (2012). How to collect and analyze qualitative data. In A. Mackey & S. Gass (Eds.), Research methods in second language acquisition (pp. 188-200). Malden, MA: Wiley-Blackwell. Goldberg, A. (1995). Constructions: A construction grammar approach to argument structure. Chicago, IL: University of Chicago Press. Goldberg, A., & Suttle, L. (2010). Construction grammar. Wiley Interdisciplinary Reviews: Cognitive Science, 1(4), 468-477. Grant, L., & Ginther, A. (2000). Using computer-tagged linguistic features to describe L2 writing differences. Journal of second language writing, 9(2), 123-145. Guo, L., Crossley, S. A., & McNamara, D. S. (2013). Predicting human judgments of essay quality in both integrated and independent second language writing samples: A comparison study. Assessing Writing, 18, 218-238. 133 Gyllstad, H., Granfeldt, J., Bernardini, P., & Kallkvist, M. (2014). Linguistic correlates to communicative proficiency levels of the CEFR. EUROSLA Yearbook, 14, 1-30. Hinkel, E. (2003). Simplicity without elegance: Features of sentences in L1 and L2 academic texts. TESOL Quarterly, 37(2), 275-301. Homburg, T. J. (1984). Holistic evaluation of ESL compositions: Can it be validated objectively?. TESOL quarterly, 18(1), 87-107. Housen, A., & Kuiken, F. (2009). Complexity, accuracy, and fluency in second language acquisition. Applied Linguistics, 30(4), 461-473. Housen, A., Kuiken, F., & Vedder, I. (Eds.). (2012). Dimensions of L2 performance and proficiency: Complexity, accuracy and fluency in SLA (Vol. 32). John Benjamins Publishing. Hunt, K. W. (1965). Grammatical structures written at three grade levels. NCTE Research Report No. 3. Hunt. K. W. (1970). Recent measures in syntactic: development. In M. Lester (Ed.), Readings in applied transformational grammar (pp. 187-200). New York: Holt, Rinehert. Ishikawa, S. (1995). Objective measurement of low-proficiency EFL narrative writing. Journal of Second Language Writing, 4(1), 51-69. Ishikawa, T. 2007. The effect of manipulating task complexity along the +/- here-and-now dimension on L2 written narrative discourse. In M. P. Garcia Mayo (Ed.), Investigating tasks in formal language learning. Multilingual Matters. 
Jacobs, H., Zinkgraf, S., Wormuth, D., Hartfiel, V., & Hughey, J. (1981). Testing ESL Composition: A practical approach. Rowley, MA: Newbury House. Jarvis, S., Grant, L., Bikowski, D., & Ferris, D. (2003). Exploring multiple profiles of highly rated learner compositions. Journal of Second Language Writing, 12(4), 377-403. Kim, S. H. (2014). Metacognitive knowledge in second language writing (Unpublished doctoral dissertation). Michigan State University. Klecka, W. R. (1980). Discriminant analysis. Beverly Hills, CA: Sage. Knoch, U. (2009). Diagnostic assessment of writing: A comparison of two rating scales. Language Testing, 26(2), 275-304. Kormos, J., & Trebits, A. (2012). The role of task complexity, modality, and aptitude in narrative task performance. Language Learning, 62(2), 439-472. 134 Kuiken, F., & Vedder, I. (2007). Task complexity and measures of linguistic performance in L2 writing. IRAL-International Review of Applied Linguistics in Language Teaching, 45(3), 261-284. Kuiken, F., & Vedder, I. (2014). Rating written performance: What do raters do and why?. Language Testing, 31(3), 329-348. Landers 2015 Larsen-Freeman, D. (2006). The emergence of complexity, fluency, and accuracy in the oral and written production of five Chinese learners of English. Applied Linguistics, 27(4), 590-519. Lennon, P. (1990). Investigating fluency in EFL: A quantitative approach. Language Learning 40, 387-417. Llanes, A., & Munoz, C. (2013). Age effects in a study abroad context: Children and adults studying abroad and at home. Language Learning, 63(1), 63-90. Long, S. H., Fey, M. E., & Channell, R. W. (2008). Computerized Profiling (CP) (Version 9.2. 7, MS-DOS)[computer program]. Cleveland, OH: Department of Communication Sciences, Case Western Reserve University. Lu, X. (2009). Automatic measurement of syntactic complexity in child language acquisition. International Journal of Corpus Linguistics, 14(1), 3-28. Lu, X. (2010). Automatic analysis of syntactic complexity in second language writing. International Journal of Corpus Linguistics, 15(4), 474-496 Lu, X. (2011). A corpus-based evaluation of syntactic complexity measures as indices of college-level ESL writers' language development. TESOL Quarterly, 45(1), 36-62. Lu, X., & Ai, H. (2015). Syntactic complexity in college-level English writing: Differences among writers with diverse L1 backgrounds. Journal of Second Language Writing, 29, 16-27. Lumley, T. (2002). Assessment criteria in a large-scale writing test: what do they really mean to the raters?. Language Testing, 19(3), 246-276. Lumley, T. (2005). Assessing second language writing: The rater’s perspective. Frankfurt: Peter Lang. Mancilla, R. L., Polat, N., & Akcay, A. O. (2015). An investigation of native and nonnative English speakers’ levels of written syntactic complexity in asynchronous online discussions. Applied Linguistics, http://dx.doi.org/10.1093/applin/amv012 Mazgutova, D., & Kormos, J. (2015). Syntactic and lexical development in an intensive English for Academic Purposes programme. Journal of Second Language Writing, 29, 3-15. 135 McNamara, D.S., Crossley, S.A., & McCarthy, P.M. (2010). Linguistic features of writing quality, Written Communication 27(1), 57-86. Monroe, J. H. (1975). Measuring and enhancing syntactic fluency in French. The French Review, 48(6), 1023-1031. Nelson, L. R. (2000). Item analysis for tests and surveys using Lertap 5. Perth, Western Australia: Curtin University of Technology. Norrby, C. & Hakansson, G. (2007). 
The interaction of complexity and grammatical processability: The case of Swedish as a foreign language. International Review of Applied Linguistics, 45, 45-68. Norris, J. M. (2015). Discriminant analysis. In L. Plonsky (Ed.), Advancing quantitative methods in second language research (pp. 305-328). New York/ London: Routledge. Norris, J. M., & Ortega, L. (2009). Towards an organic approach to investigating CAF in instructed SLA: The case of complexity. Applied Linguistics, 30(4), 555-578. Norris, J., & Ortega, L. (2003). Defining and measuring SLA. In C. Doughty, & M. Long (Eds.), The handbook of second language acquisition (pp. 716-761). John Wiley & Sons. Oller, J. W., Jr . (1972). Scoring methods and difficulty levels for cloze tests of proficiency in English as a second language. Modern Language Journal, 56, 151–157. Ortega, L. (2003). Syntactic complexity measures and their relationship to L2 proficiency: A research synthesis of college-level L2 writing. Applied Linguistics, 24(4), 492-518. Ortega, L. (2012). Interlanguage complexity: A construct in search of theoretical renewal. In B. Kortmann & B. Szmrecsanyi (Eds.), Linguistic complexity: Second language acquisition, indigenization, contact (pp. 127–155). Berlin: De Gruyter. Palloti, G. (2015). A simple view of linguistic complexity. Second Language Research, 31(1), 117-134. Pallotti, G. (2009). CAF: Defining, refining and differentiating constructs. Applied Linguistics, 30(4), 590-601. Polio, C. (2001). Research methodology in second language writing research: The case of textbased studies. In T. Silva & P. K. Matsuda (Eds.), On second language writing (pp. 91-115). Mahwah, NJ: Lawrence Erlbaum. 136 Polio, C. (2013). Revising a writing rubric based on raters’ comments: Does it result in a more reliable and valid assessment?.Midwest Association of Language Testers, Michigan State University. Purpura, J. (2004). Assessing grammar. Cambridge: Cambridge University Press. Quirk, R., Greenbaum, S., Leech, G., Svartvik, J., & Crystal, D. (1985). A comprehensive grammar of the English language (Vol. 397). London: Longman. Raatz, U. & Klein-Braley, C. (1981). The C-test: A modification of the cloze procedure. In T. Culhane, C. Klein-Braley, & D. K. Stevenson (Eds.), Practice and problems in language testing (pp. 113-138). Colchester: Department of Language and Linguistics, University of Essex. Ramos, S. D. S., & Rickard Liow, S. J. (2013). Discriminant function analysis. The Encyclopedia of Applied Linguistics. Révész, A. (2008). Task complexity, focus on form-meaning connections, and individual differences: A classroom-based study. Paper presented at the International Association of Applied Linguistics, Essen, Germany. Rimmer, W. (2006). Measuring grammatical complexity: the Gordian knot. Language Testing, 23(4), 497-519 Rimmer, W. (2008). Putting grammatical complexity in context. Literacy, 42(1), 29-35. Serrano, R., Llanes, A., & Tragant, E. (2011). Analyzing the effect of context of second language learning: Domestic intensive and semi-intensive courses vs. study abroad in Europe. System, 29, 133-143. Serrano, R., Tragant, E., & Llanes, A. (2012). A longitudinal analysis of the effects of one year abroad. The Canadian Modern Language Review, 68(2), 183-163. Shang, H.-F. (2007). An exploratory study of e-mail application on FL writing performance. Computer Assisted Language Learning, 20(1), 79-96. Skehan, P. (1998). A cognitive approach to language learning. Oxford University Press. Spoelman, M., & Verspoor, M. (2010). 
Dynamic patterns in development of accuracy and complexity: A longitudinal case study in the acquisition of Finnish. Applied Linguistics, 31(4), 532-553. Stockwell, G. (2005). Syntactical and lexical development in NNS-NNS asynchronous CMC. The JALT CALL Journal, 1(3), 33-49. 137 Stockwell, G., & Harrington, M. (2003). The incidental development of L2 proficiency in NSNNS email interactions. CALICO journal, 20(2) 337-359. Storch, N. (2009). The impact of studying in s second language (L2) medium university on the devlopment of L2 writing. Journal of Second Language Writing, 18, 103-118. Storch, N., & Tapper, J. (2009). The impact of an EAP course on postgraduate writing. Journal of English for Academic Purposes, 8(3), 207-223. Storch, N., & Wigglesworth, G. (2007). Writing tasks: The effects of collaboration. In M. P. Garcia Mayo (Ed.), Investigating tasks in formal language learning (pp. 157-177). Clevedon: Multilingual Matters. Taguchi, N., Crwoford, W., Wetzel, D. Z. (2013). What linguistic features are indicative of writing quality? A case of argumentative essays in a college composition program. TESOL Quarterly, 47(2), 420-430. Tremblay, A. (2011). Proficiency assessment standards in second language acquisition research. Studies in Second Language Acquisition, 33(3), 339-372. Verspoor, M., Schmid, M. S., & Xu, X. (2012). A dynamic usage based perspective on L2 writing. Journal of Second Language Writing, 21(3), 239-263. Vyatkina, N. (2012). The development of second language writing complexity in groups and individuals: A longitudinal learner corpus study. The Modern Language Journal, 96(4), 576-598. Vyatkina, N. (2013). Specific syntactic complexity: Developmental profiling of individuals based on an annotated learner corpus. The Modern Language Journal, 97(s1), 11-30. Vyatkina, N., Hirschmann, H., & Golcher, F. (2015). Syntactic modification at early stages of L2 German writing development: A longitudinal learner corpus study. Journal of Second Language Writing, 29, 28-50. Winke, P., & Gass, S. (2013). The influence of second language experience and accent familiarity on oral proficiency rating: A qualitative investigation. TESOL Quarterly, 47(4), 762-789. Winke, P., & Lim, H. (2015). ESL essay raters’ cognitive processes in applying the Jacobs et al. rubric: An eye-movement study. Assessing Writing, 25, 38-54. Wolfe-Quintero, K., Inagaki, s., & Kim, H.-Y. (1998). Second language development in writing: measures of fluency, accuracy and complexity. Hawai'i: Second Language Teaching & Curriculum Center: University of Hawai'i. 138 Yoon, H. J., & Polio, C. (2016). The linguistic development of students of English as a second Language in two written genres. TESOL Quarterly. doi:10.1002/tesq.296 Zandi, H. (2014). Investigating the relationship among complexity, range, and strength of grammatical knowledge of EFL students. Applied Research on English Language, 3(2), 85-100. 139